Implement the SCAN Redis command #2

Closed
hengestone opened this issue Nov 7, 2017 · 4 comments
Labels
kind/enhancement This is an enhancement of an existing feature priority/low Low priority
Comments

@hengestone

Implementing SCAN would make searching more useful.

@kmuthukk kmuthukk added the kind/enhancement This is an enhancement of an existing feature label Nov 7, 2017
@hectorgcr hectorgcr self-assigned this Nov 7, 2017
@hectorgcr
Contributor

Thanks for the input @hengestone

I will be writing a design document for this feature and will share it here once it's ready.

@rkarthik007 rkarthik007 added this to To Do in YCQL via automation Jan 26, 2018
rkarthik007 added a commit that referenced this issue Jan 31, 2018
* Created a docs directory (#2)

Docs directory for design docs, discussions, etc. Also added a basic README.

* Removed extra underline in the README doc.
yugabyte-ci pushed a commit that referenced this issue Feb 2, 2018
… memtable

Summary:
There was a crash during one of our performance integration tests that was caused by Frontiers() not being set on a memtable. That could only possibly happen if the memtable is empty, and it is still not clear how an empty memtable could get into the list of immutable memtables. Regardless of that, instead of crashing, we should just flush that memtable and log an error message.

```
#0  operator() (memtable=..., __closure=0x7f2e454b67b0) at ../../../../../src/yb/tablet/tablet_peer.cc:178
#1  std::_Function_handler<bool(const rocksdb::MemTable&), yb::tablet::TabletPeer::InitTabletPeer(const std::shared_ptr<yb::tablet::enterprise::Tablet>&, const std::shared_future<std::shared_ptr<yb::client::YBClient> >&, const scoped_refptr<yb::server::Clock>&, const std::shared_ptr<yb::rpc::Messenger>&, const scoped_refptr<yb::log::Log>&, const scoped_refptr<yb::MetricEntity>&, yb::ThreadPool*)::<lambda()>::<lambda(const rocksdb::MemTable&)> >::_M_invoke(const std::_Any_data &, const rocksdb::MemTable &) (__functor=..., __args#0=...)  at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:1857
#2  0x00007f2f7346a70e in operator() (__args#0=..., this=0x7f2e454b67b0) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#3  rocksdb::MemTableList::PickMemtablesToFlush(rocksdb::autovector<rocksdb::MemTable*, 8ul>*, std::function<bool (rocksdb::MemTable const&)> const&) (this=0x7d02978, ret=ret@entry=0x7f2e454b6370, filter=...)
    at ../../../../../src/yb/rocksdb/db/memtable_list.cc:259
#4  0x00007f2f7345517f in rocksdb::FlushJob::Run (this=this@entry=0x7f2e454b6750, file_meta=file_meta@entry=0x7f2e454b68d0) at ../../../../../src/yb/rocksdb/db/flush_job.cc:143
#5  0x00007f2f7341b7c3 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x89d2400, cfd=cfd@entry=0x7d02300, mutable_cf_options=..., made_progress=made_progress@entry=0x7f2e454b709e,
    job_context=job_context@entry=0x7f2e454b70b0, log_buffer=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:1586
#6  0x00007f2f7341c19f in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x89d2400, made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0,
    log_buffer=log_buffer@entry=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2816
#7  0x00007f2f7342539b in rocksdb::DBImpl::BackgroundCallFlush (this=0x89d2400) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2838
#8  0x00007f2f735154c3 in rocksdb::ThreadPool::BGThread (this=0x3b0bb20, thread_id=0) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:133
#9  0x00007f2f73515558 in rocksdb::BGThreadWrapper (arg=0xd970a20) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:157
#10 0x00007f2f6c964694 in start_thread (arg=0x7f2e454b8700) at pthread_create.c:333
```

Test Plan: Jenkins

Reviewers: hector, sergei

Reviewed By: hector, sergei

Subscribers: sergei, bogdan, bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4044
@rahuldesirazu
Contributor

rahuldesirazu commented Apr 6, 2018

SCAN cursor [MATCH pattern] [COUNT count]

https://redis.io/commands/scan

Overall Notes

In a nutshell, SCAN is a cursor-based command that returns all Redis keys in a cluster. Here are a few important pieces from the spec:

A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCAN returned it to the user.

A full iteration only returns elements that were present at some point during the iteration. So if an element was removed before the start of an iteration, and is never added back to the collection for all the time an iteration lasts, SCAN ensures that this element will never be returned.

Calling SCAN with a broken, negative, out of range, or otherwise invalid cursor, will result into undefined behavior but never into a crash. What will be undefined is that the guarantees about the returned elements can no longer be ensured by the SCAN implementation.

It is important to note that the Match filter is applied after elements are retrieved from the collection, just before returning data to the client.

Count is just a hint for the implementation, however generally speaking this is what you could expect most of the times from the implementation.

Count

The Redis default is 10, and we will default to the same value. The specified count cannot be too large, or else the proxy might end up processing too many keys at once. Therefore, we should have a threshold gflag such that if count is larger than that number, we clamp it to the threshold.
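
As an illustration only, here is a minimal sketch of that clamping logic; the constant below is a hypothetical stand-in for the threshold gflag, which is not named in this note:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the threshold gflag described above.
constexpr int64_t kScanCountThreshold = 1000;
constexpr int64_t kDefaultScanCount = 10;  // Matches the Redis default.

int64_t EffectiveScanCount(int64_t requested_count) {
  if (requested_count <= 0) {
    return kDefaultScanCount;  // COUNT not specified (or invalid): use the default.
  }
  // Clamp overly large values so the proxy never processes too many keys at once.
  return std::min(requested_count, kScanCountThreshold);
}
```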

Match

Most importantly, the match filter is applied after all the elements have been retrieved and right before the set is returned to the client. With this in mind, we will have the tserver apply the filter and return, as part of the response to the proxy, both the total number of keys seen and the list of keys that match the filter. This results in less memory used on the proxy for processing, especially if the filter matches very few keys and the count is high.
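
For illustration, a rough sketch of the response shape this implies (field names are hypothetical, not the actual protobuf):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: the tserver applies the MATCH filter and reports both how
// many keys it examined (so the proxy can advance the cursor) and the keys
// that actually matched.
struct ScanPartialResponse {
  size_t total_keys_seen = 0;             // All keys examined in this call.
  std::vector<std::string> matched_keys;  // Only the keys that pass the MATCH filter.
};
```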

Cursor

The return value of the cursor is a string representing an unsigned 64-bit number. Some Redis clients take a string parameter for the cursor, while others take an unsigned long. For the current implementation, we plan to designate the first 2 bytes for the hash code and the next 6 bytes for the first 6 characters of the next key to start at. When we iterate using the cursor, we use the first 2 bytes to find the appropriate hash partition and then seek to the smallest key starting with those 6 bytes. This could potentially return repeated keys if multiple keys share the same 6-byte prefix.
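
A minimal sketch of how such a 64-bit cursor could be packed and unpacked (purely illustrative; the actual byte layout may differ):

```cpp
#include <cstdint>
#include <string>

// Top 2 bytes: hash code. Low 6 bytes: first 6 characters of the key to resume from.
uint64_t EncodeScanCursor(uint16_t hash_code, const std::string& next_key) {
  uint64_t cursor = static_cast<uint64_t>(hash_code) << 48;
  for (size_t i = 0; i < 6 && i < next_key.size(); ++i) {
    cursor |= static_cast<uint64_t>(static_cast<uint8_t>(next_key[i])) << (8 * (5 - i));
  }
  return cursor;
}

void DecodeScanCursor(uint64_t cursor, uint16_t* hash_code, std::string* key_prefix) {
  *hash_code = static_cast<uint16_t>(cursor >> 48);
  key_prefix->clear();
  for (int i = 5; i >= 0; --i) {
    const char c = static_cast<char>((cursor >> (8 * i)) & 0xFF);
    if (c != '\0') {
      key_prefix->push_back(c);  // Skip padding bytes for keys shorter than 6 characters.
    }
  }
}
```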

In a future implementation, we could have a string cursor that returns the hash code and the last key seen, separated by “:”, so that we can seek directly to the key just larger than the last key and not repeat any elements. Some clients only support long cursors, so depending on the format received by the Redis server, the server will either do the long SCAN as described above or the string SCAN, which would directly seek to the next largest key. The string cursor for the start of the collection would be represented as “0:” instead of “0”.

In the case of a corrupted cursor being used, we should error out to the client.

We need to worry about SCAN not terminating if one hash value has more keys with a certain prefix than the count specified in the query, since it can keep repeating the same keys on each iteration without moving on to the next hash value. Since count is an implementation hint and not a requirement, we can work around this by processing more keys than count until we are at a prefix greater than the one we started at. As long as the cursor changes between calls, we are making progress and SCAN will eventually return to 0.
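
A self-contained sketch of that progress rule, using a sorted vector of keys as a stand-in for the tablet iterator (names are illustrative):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keep consuming keys past COUNT until the 6-character prefix moves beyond the
// one the cursor started at, so the cursor returned to the client always changes.
std::vector<std::string> ScanFromPrefix(const std::vector<std::string>& sorted_keys,
                                        const std::string& start_prefix,
                                        size_t count) {
  std::vector<std::string> results;
  for (const auto& key : sorted_keys) {
    const std::string prefix = key.substr(0, 6);
    if (prefix < start_prefix) {
      continue;  // Already returned on a previous SCAN call.
    }
    if (results.size() >= count && prefix > start_prefix) {
      break;  // COUNT satisfied and we are past the starting prefix: safe to stop.
    }
    results.push_back(key);
  }
  return results;
}
```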

@mbautin
Collaborator

mbautin commented Apr 6, 2018

@rahuldesirazu: I think we could do some optimizations for the pattern case. We could restrict the range of primary keys to scan in case the pattern does not start with a *. For each tablet, we can start scanning, find an existing hash value, and immediately seek to the earliest key with that hash value that could match the pattern. When we scan all the keys with this hash value that match the pattern, we will seek to the incremented hash value.
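
For illustration, a small helper of the kind this optimization would need (hypothetical name, not existing code), extracting the literal prefix of a MATCH pattern so the scan can seek directly to the earliest possibly-matching key within a hash value:

```cpp
#include <string>

// Characters before the first glob metacharacter form a literal prefix; when the
// pattern does not start with '*', this prefix bounds the range of keys to scan.
std::string LiteralPatternPrefix(const std::string& pattern) {
  std::string prefix;
  for (const char c : pattern) {
    if (c == '*' || c == '?' || c == '[' || c == '\\') {
      break;  // A glob metacharacter (or escape) ends the literal prefix.
    }
    prefix.push_back(c);
  }
  return prefix;
}
```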

@rkarthik007 rkarthik007 modified the milestone: v1.0 Apr 11, 2018
yugabyte-ci pushed a commit that referenced this issue Nov 30, 2018
…EANUP

Summary:
We were failing to check the return code of the function `LookupTablePeerOrRespond` when a CLEANUP request was received by the tablet service.
This was causing the following FATAL right after restart during a software upgrade on a cluster with a SecondaryIndex workload.

```
#0  yb::tserver::TabletServiceImpl::CheckMemoryPressure<yb::tserver::UpdateTransactionResponsePB> (this=this@entry=0x24c2e00, tablet=tablet@entry=0x0,
    resp=resp@entry=0x14d3d410, context=context@entry=0x7f55b1eb5600) at ../../src/yb/tserver/tablet_service.cc:222
#1  0x00007f55d4c8a881 in yb::tserver::TabletServiceImpl::UpdateTransaction (this=this@entry=0x24c2e00, req=req@entry=0x1057aa90, resp=resp@entry=0x14d3d410, context=...)
    at ../../src/yb/tserver/tablet_service.cc:431
#2  0x00007f55d273f28a in yb::tserver::TabletServerServiceIf::Handle (this=0x24c2e00, call=...) at src/yb/tserver/tserver_service.service.cc:267
#3  0x00007f55cff0a3ea in yb::rpc::ServicePoolImpl::Handle (this=0x27ca540, incoming=...) at ../../src/yb/rpc/service_pool.cc:214
```

Changed `LookupTablePeerOrRespond` to return the complete result via its return value.

Test Plan: Update xdc-user-identity and check that it does not crash and the workload is stable.

Reviewers: robert, hector, mikhail, kannan

Reviewed By: mikhail, kannan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5772
@rahuldesirazu
Contributor

@hengestone Thanks for your patience. For now, we're no longer adding new Redis functionality to YugaByte. However, we do have support for the KEYS command.

https://docs.yugabyte.com/latest/yedis/api/keys/

@rahuldesirazu rahuldesirazu removed their assignment Jun 5, 2019
@rahuldesirazu rahuldesirazu added the priority/low Low priority label Jun 5, 2019
mbautin pushed a commit that referenced this issue Jun 20, 2019
* Client drivers for YSQL

* Client drivers for YSQL

* Client Drivers for YSQL / Java

* Client driver for YSQL / Go

* Addressed review comments
yugabyte-ci pushed a commit that referenced this issue Jun 24, 2019
Summary:
New commit includes fix for https://github.com/YugaByte/yugabyte-installation/issues/9

Dupe of https://phabricator.dev.yugabyte.com/D6792 due to commit getting pulled back in https://phabricator.dev.yugabyte.com/D6795

Test Plan: Jenkins: skip

Reviewers: mikhail, bogdan

Reviewed By: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D6803
yugabyte-ci pushed a commit that referenced this issue Jun 27, 2019
…data

Summary:
Originally, the issue was discovered as a `RaftConsensusITest.TestAddRemoveVoter` random test failure in TSAN mode due to a data race.
```
WARNING: ThreadSanitizer: data race (pid=11050)
1663	[ts-2]	   Write of size 8 at 0x7b4c000603a8 by thread T51 (mutexes: write M3613):
...
1674	[ts-2]	     #10 yb::tablet::KvStoreInfo::LoadTablesFromPB(google::protobuf::RepeatedPtrField<yb::tablet::TableInfoPB>, string) src/yb/tablet/tablet_metadata.cc:170
1675	[ts-2]	     #11 yb::tablet::KvStoreInfo::LoadFromPB(yb::tablet::KvStoreInfoPB const&, string) src/yb/tablet/tablet_metadata.cc:189:10
1676	[ts-2]	     #12 yb::tablet::RaftGroupMetadata::LoadFromSuperBlock(yb::tablet::RaftGroupReplicaSuperBlockPB const&) src/yb/tablet/tablet_metadata.cc:508:5
1677	[ts-2]	     #13 yb::tablet::RaftGroupMetadata::ReplaceSuperBlock(yb::tablet::RaftGroupReplicaSuperBlockPB const&) src/yb/tablet/tablet_metadata.cc:545:3
1678	[ts-2]	     #14 yb::tserver::RemoteBootstrapClient::Finish() src/yb/tserver/remote_bootstrap_client.cc:486:3
...
   Previous read of size 4 at 0x7b4c000603a8 by thread T16:
1697	[ts-2]	     #0 yb::tablet::RaftGroupMetadata::schema_version() const src/yb/tablet/tablet_metadata.h:251:34
1698	[ts-2]	     #1 yb::tserver::TSTabletManager::CreateReportedTabletPB(std::__1::shared_ptr<yb::tablet::TabletPeer> const&, yb::master::ReportedTabletPB*) src/yb/tserver/ts_tablet_manager.cc:1323:71
1699	[ts-2]	     #2 yb::tserver::TSTabletManager::GenerateIncrementalTabletReport(yb::master::TabletReportPB*) src/yb/tserver/ts_tablet_manager.cc:1359:5
1700	[ts-2]	     #3 yb::tserver::Heartbeater::Thread::TryHeartbeat() src/yb/tserver/heartbeater.cc:371:32
1701	[ts-2]	     #4 yb::tserver::Heartbeater::Thread::DoHeartbeat() src/yb/tserver/heartbeater.cc:531:19
```

The reason is that although `RaftGroupMetadata::schema_version()` gets the `TableInfo` pointer from `primary_table_info()` under the mutex lock, it then accesses the pointer's field without holding the lock.

Added a private method `RaftGroupMetadata::primary_table_info_guarded()` that returns a pair of `TableInfo*` and `std::unique_lock`, and used it in `RaftGroupMetadata::schema_version()` and the other `RaftGroupMetadata` functions that access primary table info fields.
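
A generic sketch of that pattern (illustrative names only, not the actual YugabyteDB classes): return the pointer together with the lock, so the caller reads fields while the mutex is still held.

```cpp
#include <cstdint>
#include <mutex>
#include <utility>

struct TableInfo {
  uint32_t schema_version = 0;
};

class Metadata {
 public:
  // Returns the guarded pointer paired with the lock that protects it.
  std::pair<const TableInfo*, std::unique_lock<std::mutex>> primary_table_info_guarded() const {
    std::unique_lock<std::mutex> lock(mutex_);
    return {&primary_table_info_, std::move(lock)};
  }

  uint32_t schema_version() const {
    auto [info, lock] = primary_table_info_guarded();
    return info->schema_version;  // Field read happens while the lock is held.
  }

 private:
  mutable std::mutex mutex_;
  TableInfo primary_table_info_;
};
```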

Test Plan: `ybd tsan --sj --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.TestAddRemoveVoter -n 1000`

Reviewers: bogdan, sergei

Reviewed By: sergei

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D6813
mbautin added a commit that referenced this issue Jul 11, 2019
…ed to the

earlier commit 864e72b

Original commit message:

ENG-2793 Do not fail when deciding if we can flush an empty immutable memtable

Summary:
There was a crash during one of our performance integration tests that was caused by Frontiers() not being set on a memtable. That could only possibly happen if the memtable is empty, and it is still not clear how an empty memtable could get into the list of immutable memtables. Regardless of that, instead of crashing, we should just flush that memtable and log an error message.

```
#0  operator() (memtable=..., __closure=0x7f2e454b67b0) at ../../../../../src/yb/tablet/tablet_peer.cc:178
#1  std::_Function_handler<bool(const rocksdb::MemTable&), yb::tablet::TabletPeer::InitTabletPeer(const std::shared_ptr<yb::tablet::enterprise::Tablet>&, const std::shared_future<std::shared_ptr<yb::client::YBClient> >&, const scoped_refptr<yb::server::Clock>&, const std::shared_ptr<yb::rpc::Messenger>&, const scoped_refptr<yb::log::Log>&, const scoped_refptr<yb::MetricEntity>&, yb::ThreadPool*)::<lambda()>::<lambda(const rocksdb::MemTable&)> >::_M_invoke(const std::_Any_data &, const rocksdb::MemTable &) (__functor=..., __args#0=...)  at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:1857
#2  0x00007f2f7346a70e in operator() (__args#0=..., this=0x7f2e454b67b0) at /n/jenkins/linuxbrew/linuxbrew_2018-01-09T08_28_02/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#3  rocksdb::MemTableList::PickMemtablesToFlush(rocksdb::autovector<rocksdb::MemTable*, 8ul>*, std::function<bool (rocksdb::MemTable const&)> const&) (this=0x7d02978, ret=ret@entry=0x7f2e454b6370, filter=...)
    at ../../../../../src/yb/rocksdb/db/memtable_list.cc:259
#4  0x00007f2f7345517f in rocksdb::FlushJob::Run (this=this@entry=0x7f2e454b6750, file_meta=file_meta@entry=0x7f2e454b68d0) at ../../../../../src/yb/rocksdb/db/flush_job.cc:143
#5  0x00007f2f7341b7c3 in rocksdb::DBImpl::FlushMemTableToOutputFile (this=this@entry=0x89d2400, cfd=cfd@entry=0x7d02300, mutable_cf_options=..., made_progress=made_progress@entry=0x7f2e454b709e,
    job_context=job_context@entry=0x7f2e454b70b0, log_buffer=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:1586
#6  0x00007f2f7341c19f in rocksdb::DBImpl::BackgroundFlush (this=this@entry=0x89d2400, made_progress=made_progress@entry=0x7f2e454b709e, job_context=job_context@entry=0x7f2e454b70b0,
    log_buffer=log_buffer@entry=0x7f2e454b7280) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2816
#7  0x00007f2f7342539b in rocksdb::DBImpl::BackgroundCallFlush (this=0x89d2400) at ../../../../../src/yb/rocksdb/db/db_impl.cc:2838
#8  0x00007f2f735154c3 in rocksdb::ThreadPool::BGThread (this=0x3b0bb20, thread_id=0) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:133
#9  0x00007f2f73515558 in rocksdb::BGThreadWrapper (arg=0xd970a20) at ../../../../../src/yb/rocksdb/util/thread_posix.cc:157
#10 0x00007f2f6c964694 in start_thread (arg=0x7f2e454b8700) at pthread_create.c:333
```

Test Plan: Jenkins

Reviewers: hector, sergei

Reviewed By: hector, sergei

Subscribers: sergei, bogdan, bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4044
mbautin pushed a commit that referenced this issue Jul 11, 2019
…ed to the

earlier commit 566d6d2

Original commit message:

ENG-4240: #613: Fix checking of tablet presence during transaction CLEANUP

Summary:
We were failing to check the return code of the function `LookupTablePeerOrRespond` when a CLEANUP request was received by the tablet service.
This was causing the following FATAL right after restart during a software upgrade on a cluster with a SecondaryIndex workload.

```
#0  yb::tserver::TabletServiceImpl::CheckMemoryPressure<yb::tserver::UpdateTransactionResponsePB> (this=this@entry=0x24c2e00, tablet=tablet@entry=0x0,
    resp=resp@entry=0x14d3d410, context=context@entry=0x7f55b1eb5600) at ../../src/yb/tserver/tablet_service.cc:222
#1  0x00007f55d4c8a881 in yb::tserver::TabletServiceImpl::UpdateTransaction (this=this@entry=0x24c2e00, req=req@entry=0x1057aa90, resp=resp@entry=0x14d3d410, context=...)
    at ../../src/yb/tserver/tablet_service.cc:431
#2  0x00007f55d273f28a in yb::tserver::TabletServerServiceIf::Handle (this=0x24c2e00, call=...) at src/yb/tserver/tserver_service.service.cc:267
#3  0x00007f55cff0a3ea in yb::rpc::ServicePoolImpl::Handle (this=0x27ca540, incoming=...) at ../../src/yb/rpc/service_pool.cc:214
```

Changed `LookupTablePeerOrRespond` to return the complete result via its return value.

Test Plan: Update xdc-user-identity and check that it does not crash and the workload is stable.

Reviewers: robert, hector, mikhail, kannan

Reviewed By: mikhail, kannan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5772
amannijhawan added a commit that referenced this issue Dec 9, 2023
…rt -1

Summary:
Original commit: 651a2e8 / D29938
One of the first tasks kicked off during an edit-universe operation is disk resizing.
This change makes the createResizeDiskTask function idempotent.
It will only create the disk resize tasks if the specified size is different from the current
volume size on the pod.

Test Plan:
Tested by making the task abortable and retryable, then retrying the edit Kubernetes task after aborting the disk resize in the middle.

```
YW 2023-11-03T06:04:18.533Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from EditKubernetesUniverse in TaskPool-6 - Creating task for disk size change from 100 to 200
YW 2023-11-03T06:04:18.587Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from ShellProcessHandler in TaskPool-6 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json
YW 2023-11-03T06:04:18.587Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from ShellProcessHandler in TaskPool-6 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed' '-o' 'json' - logging stdout=/tmp/shell_process_out5549963527175565003tmp, stderr=/tmp/shell_process_err5819548201875528501tmp
YW 2023-11-03T06:04:19.095Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from ShellProcessHandler in TaskPool-6 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json' status=success [ 508 ms ]
YW 2023-11-03T06:04:19.104Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from PlacementInfoUtil in TaskPool-6 - Incrementing RF for us-west1-a to: 1
YW 2023-11-03T06:04:19.105Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from PlacementInfoUtil in TaskPool-6 - Number of nodes in us-west1-a: 1
YW 2023-11-03T06:04:19.105Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding task #0: KubernetesCheckVolumeExpansion
YW 2023-11-03T06:04:19.105Z [DEBUG] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Details for task #0: KubernetesCheckVolumeExpansion details= {"platformVersion":"2.21.0.0-PRE_RELEASE","config":{"KUBECONFIG_PULL_SECRET":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/anijhawan_quay_pull_secret","KUBECONFIG":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/kubeconfig-202301","STORAGE_CLASS":"yb-standard","KUBECONFIG_PROVIDER":"gke","KUBECONFIG_IMAGE_PULL_SECRET_NAME":"anijhawan-pull-secret","KUBECONFIG_IMAGE_REGISTRY":"quay.io/yugabyte/yugabyte-itest"},"newNamingStyle":true,"namespace":"yb-admin-test1","providerUUID":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","helmReleaseName":"ybtest1-us-west1-a-twed"}
YW 2023-11-03T06:04:19.105Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding SubTaskGroup #0: KubernetesVolumeInfo
YW 2023-11-03T06:04:19.108Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from AbstractTaskBase in TaskPool-6 - Executor name: task
YW 2023-11-03T06:04:19.108Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced)
YW 2023-11-03T06:04:19.110Z [DEBUG] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Details for task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced) details= {"platformVersion":"2.21.0.0-PRE_RELEASE","sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","universeUUID":"347eb7be-88b5-44ed-b519-1052487e5ced","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisServerHttpPort":11000,"redisServerRpcPort":6379,"yqlServerHttpPort":12000,"yqlServerRpcPort":9042,"ysqlServerHttpPort":13000,"ysqlServerRpcPort":5433,"nodeExporterPort":9300},"extraDependencies":{"installNodeExporter":true},"providerUUID":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","universeName":"test1","commandType":"STS_DELETE","helmReleaseName":"ybtest1-us-west1-a-twed","namespace":"yb-admin-test1","isReadOnlyCluster":false,"ybSoftwareVersion":"2.19.3.0-b80","enableNodeToNodeEncrypt":true,"enableClientToNodeEncrypt":true,"serverType":"TSERVER","tserverPartition":0,"masterPartition":0,"newDiskSize":"200Gi","masterAddresses":"ybtest1-us-west1-a-twed-yb-master-0.ybtest1-us-west1-a-twed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-b-uwed-yb-master-0.ybtest1-us-west1-b-uwed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-c-vwed-yb-master-0.ybtest1-us-west1-c-vwed-yb-masters.yb-admin-test1.svc.cluster.local:7100","placementInfo":{"cloudList":[{"uuid":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","code":"kubernetes","regionList":[{"uuid":"80f07c68-f739-45b5-a91a-e8f8f4b0fc6d","code":"us-west1","name":"Oregon","azList":[{"uuid":"42b3fd5a-2c30-48c5-9335-d71dc60a773f","name":"us-west1-a","replicationFactor":1,"numNodesInAZ":1,"isAffinitized":true}]}]}]},"updateStrategy":"RollingUpdate","config":{"KUBECONFIG_PULL_SECRET":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/anijhawan_quay_pull_secret","KUBECONFIG":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/kubeconfig-202301","STORAGE_CLASS":"yb-standard","KUBECONFIG_PROVIDER":"gke","KUBECONFIG_IMAGE_PULL_SECRET_NAME":"anijhawan-pull-secret","KUBECONFIG_IMAGE_REGISTRY":"quay.io/yugabyte/yugabyte-itest"},"azCode":"us-west1-a","targetXClusterConfigs":[],"sourceXClusterConfigs":[]}
YW 2023-11-03T06:04:19.111Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding SubTaskGroup #1: ResizingDisk
YW 2023-11-03T06:04:19.113Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Setting subtask(ResizingDisk) group type to Provisioning
YW 2023-11-03T06:04:19.115Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced)
YW 2023-11-03T06:04:19.117Z [DEBUG] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Details for task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced) details= {"platformVersion":"2.21.0.0-PRE_RELEASE","sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","universeUUID":"347eb7be-88b5-44ed-b519-1052487e5ced","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisServerHttpPort":11000,"redisServerRpcPort":6379,"yqlServerHttpPort":12000,"yqlServerRpcPort":9042,"ysqlServerHttpPort":13000,"ysqlServerRpcPort":5433,"nodeExporterPort":9300},"extraDependencies":{"installNodeExporter":true},"providerUUID":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","universeName":"test1","commandType":"PVC_EXPAND_SIZE","helmReleaseName":"ybtest1-us-west1-a-twed","namespace":"yb-admin-test1","isReadOnlyCluster":false,"ybSoftwareVersion":"2.19.3.0-b80","enableNodeToNodeEncrypt":true,"enableClientToNodeEncrypt":true,"serverType":"TSERVER","tserverPartition":0,"masterPartition":0,"newDiskSize":"200Gi","masterAddresses":"ybtest1-us-west1-a-twed-yb-master-0.ybtest1-us-west1-a-twed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-b-uwed-yb-master-0.ybtest1-us-west1-b-uwed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-c-vwed-yb-master-0.ybtest1-us-west1-c-vwed-yb-masters.yb-admin-test1.svc.cluster.local:7100","placementInfo":{"cloudList":[{"uuid":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","code":"kubernetes","regionList":[{"uuid":"80f07c68-f739-45b5-a91a-e8f8f4b0fc6d","code":"us-west1","name":"Oregon","azList":[{"uuid":"42b3fd5a-2c30-48c5-9335-d71dc60a773f","name":"us-west1-a","replicationFactor":1,"numNodesInAZ":1,"isAffinitized":true}]}]}]},"updateStrategy":"RollingUpdate","config":{"KUBECONFIG_PULL_SECRET":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/anijhawan_quay_pull_secret","KUBECONFIG":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/kubeconfig-202301","STORAGE_CLASS":"yb-standard","KUBECONFIG_PROVIDER":"gke","KUBECONFIG_IMAGE_PULL_SECRET_NAME":"anijhawan-pull-secret","KUBECONFIG_IMAGE_REGISTRY":"quay.io/yugabyte/yugabyte-itest"},"azCode":"us-west1-a","targetXClusterConfigs":[],"sourceXClusterConfigs":[]}
YW 2023-11-03T06:04:19.119Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding SubTaskGroup #2: ResizingDisk
YW 2023-11-03T06:04:19.120Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Setting subtask(ResizingDisk) group type to Provisioning
...
```
Verified that the disk size was increased:

```
[centos@dev-server-anijhawan-4 managed]$ kubectl -n yb-admin-test1  get pvc ybtest1-us-west1-b-uwed-datadir0-ybtest1-us-west1-b-uwed-yb-tserver-0  ybtest1-us-west1-a-twed-datadir0-ybtest1-us-west1-a-twed-yb-tserver-0  ybtest1-us-west1-c-vwed-datadir0-ybtest1-us-west1-c-vwed-yb-tserver-0  -o yaml | grep storage
      volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-resizer: pd.csi.storage.gke.io
        storage: 200Gi
    storageClassName: yb-standard
      storage: 200Gi
      volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-resizer: pd.csi.storage.gke.io
        storage: 200Gi
    storageClassName: yb-standard
      storage: 200Gi
      volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-resizer: pd.csi.storage.gke.io
        storage: 200Gi
    storageClassName: yb-standard
      storage: 200Gi

```

In the retry logs we can see that the function was invoked but task creation was skipped.

```
YW 2023-11-03T06:07:10.173Z [DEBUG] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from TaskExecutor in TaskPool-7 - Invoking run() of task EditKubernetesUniverse(347eb7be-88b5-44ed-b519-1052487e5ced)
YW 2023-11-03T06:07:10.173Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from CustomerTaskController in application-akka.actor.default-dispatcher-2292 - Saved task uuid 66611664-a25f-4ad2-93aa-e40a7db67654 in customer tasks table for target 347eb7be-88b5-44ed-b519-1052487e5ced:test1
YW 2023-11-03T06:07:10.322Z [DEBUG] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from TransactionUtil in TaskPool-7 - Trying(1)...
YW 2023-11-03T06:07:10.333Z [DEBUG] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from UniverseTaskBase in TaskPool-7 - Cancelling any active health-checks for universe 347eb7be-88b5-44ed-b519-1052487e5ced
YW 2023-11-03T06:07:10.379Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from EditKubernetesUniverse in TaskPool-7 - Creating task for disk size change from 100 to 200
YW 2023-11-03T06:07:10.436Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json
YW 2023-11-03T06:07:10.436Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed' '-o' 'json' - logging stdout=/tmp/shell_process_out15761747450556728945tmp, stderr=/tmp/shell_process_err16162390392062292532tmp
YW 2023-11-03T06:07:10.941Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json' status=success [ 505 ms ]
YW 2023-11-03T06:07:10.982Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-b-uwed -o json
YW 2023-11-03T06:07:10.982Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-b-uwed' '-o' 'json' - logging stdout=/tmp/shell_process_out16328458040940971014tmp, stderr=/tmp/shell_process_err9595293916813332432tmp
YW 2023-11-03T06:07:11.487Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-b-uwed -o json' status=success [ 505 ms ]
YW 2023-11-03T06:07:11.526Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-c-vwed -o json
YW 2023-11-03T06:07:11.527Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-c-vwed' '-o' 'json' - logging stdout=/tmp/shell_process_out11035907328384396246tmp, stderr=/tmp/shell_process_err3826067280996541352tmp
YW 2023-11-03T06:07:12.031Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-c-vwed -o json' status=success [ 505 ms ]
```

Reviewers: sanketh, nsingh, sneelakantan, dshubin, cwang, nbhatia

Reviewed By: cwang, nbhatia

Subscribers: cwang, nbhatia, yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30901
amannijhawan added a commit that referenced this issue Dec 14, 2023
…part -1

Summary:
Original commit: 98de5da / D30901
One of the first tasks kicked off during an edit-universe operation is disk resizing.
This change makes the createResizeDiskTask function idempotent.
It will only create the disk resize tasks if the specified size is different from the current
volume size on the pod.

Test Plan:
Tested by making the task abortable and retryable, then retrying the edit Kubernetes task after aborting the disk resize in the middle.

```
YW 2023-11-03T06:04:18.533Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from EditKubernetesUniverse in TaskPool-6 - Creating task for disk size change from 100 to 200
YW 2023-11-03T06:04:18.587Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from ShellProcessHandler in TaskPool-6 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json
YW 2023-11-03T06:04:18.587Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from ShellProcessHandler in TaskPool-6 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed' '-o' 'json' - logging stdout=/tmp/shell_process_out5549963527175565003tmp, stderr=/tmp/shell_process_err5819548201875528501tmp
YW 2023-11-03T06:04:19.095Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from ShellProcessHandler in TaskPool-6 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json' status=success [ 508 ms ]
YW 2023-11-03T06:04:19.104Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from PlacementInfoUtil in TaskPool-6 - Incrementing RF for us-west1-a to: 1
YW 2023-11-03T06:04:19.105Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from PlacementInfoUtil in TaskPool-6 - Number of nodes in us-west1-a: 1
YW 2023-11-03T06:04:19.105Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding task #0: KubernetesCheckVolumeExpansion
YW 2023-11-03T06:04:19.105Z [DEBUG] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Details for task #0: KubernetesCheckVolumeExpansion details= {"platformVersion":"2.21.0.0-PRE_RELEASE","config":{"KUBECONFIG_PULL_SECRET":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/anijhawan_quay_pull_secret","KUBECONFIG":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/kubeconfig-202301","STORAGE_CLASS":"yb-standard","KUBECONFIG_PROVIDER":"gke","KUBECONFIG_IMAGE_PULL_SECRET_NAME":"anijhawan-pull-secret","KUBECONFIG_IMAGE_REGISTRY":"quay.io/yugabyte/yugabyte-itest"},"newNamingStyle":true,"namespace":"yb-admin-test1","providerUUID":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","helmReleaseName":"ybtest1-us-west1-a-twed"}
YW 2023-11-03T06:04:19.105Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding SubTaskGroup #0: KubernetesVolumeInfo
YW 2023-11-03T06:04:19.108Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from AbstractTaskBase in TaskPool-6 - Executor name: task
YW 2023-11-03T06:04:19.108Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced)
YW 2023-11-03T06:04:19.110Z [DEBUG] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Details for task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced) details= {"platformVersion":"2.21.0.0-PRE_RELEASE","sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","universeUUID":"347eb7be-88b5-44ed-b519-1052487e5ced","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisServerHttpPort":11000,"redisServerRpcPort":6379,"yqlServerHttpPort":12000,"yqlServerRpcPort":9042,"ysqlServerHttpPort":13000,"ysqlServerRpcPort":5433,"nodeExporterPort":9300},"extraDependencies":{"installNodeExporter":true},"providerUUID":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","universeName":"test1","commandType":"STS_DELETE","helmReleaseName":"ybtest1-us-west1-a-twed","namespace":"yb-admin-test1","isReadOnlyCluster":false,"ybSoftwareVersion":"2.19.3.0-b80","enableNodeToNodeEncrypt":true,"enableClientToNodeEncrypt":true,"serverType":"TSERVER","tserverPartition":0,"masterPartition":0,"newDiskSize":"200Gi","masterAddresses":"ybtest1-us-west1-a-twed-yb-master-0.ybtest1-us-west1-a-twed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-b-uwed-yb-master-0.ybtest1-us-west1-b-uwed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-c-vwed-yb-master-0.ybtest1-us-west1-c-vwed-yb-masters.yb-admin-test1.svc.cluster.local:7100","placementInfo":{"cloudList":[{"uuid":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","code":"kubernetes","regionList":[{"uuid":"80f07c68-f739-45b5-a91a-e8f8f4b0fc6d","code":"us-west1","name":"Oregon","azList":[{"uuid":"42b3fd5a-2c30-48c5-9335-d71dc60a773f","name":"us-west1-a","replicationFactor":1,"numNodesInAZ":1,"isAffinitized":true}]}]}]},"updateStrategy":"RollingUpdate","config":{"KUBECONFIG_PULL_SECRET":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/anijhawan_quay_pull_secret","KUBECONFIG":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/kubeconfig-202301","STORAGE_CLASS":"yb-standard","KUBECONFIG_PROVIDER":"gke","KUBECONFIG_IMAGE_PULL_SECRET_NAME":"anijhawan-pull-secret","KUBECONFIG_IMAGE_REGISTRY":"quay.io/yugabyte/yugabyte-itest"},"azCode":"us-west1-a","targetXClusterConfigs":[],"sourceXClusterConfigs":[]}
YW 2023-11-03T06:04:19.111Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding SubTaskGroup #1: ResizingDisk
YW 2023-11-03T06:04:19.113Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Setting subtask(ResizingDisk) group type to Provisioning
YW 2023-11-03T06:04:19.115Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced)
YW 2023-11-03T06:04:19.117Z [DEBUG] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Details for task #0: KubernetesCommandExecutor(347eb7be-88b5-44ed-b519-1052487e5ced) details= {"platformVersion":"2.21.0.0-PRE_RELEASE","sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","universeUUID":"347eb7be-88b5-44ed-b519-1052487e5ced","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPort":7100,"tserverHttpPort":9000,"tserverRpcPort":9100,"ybControllerHttpPort":14000,"ybControllerrRpcPort":18018,"redisServerHttpPort":11000,"redisServerRpcPort":6379,"yqlServerHttpPort":12000,"yqlServerRpcPort":9042,"ysqlServerHttpPort":13000,"ysqlServerRpcPort":5433,"nodeExporterPort":9300},"extraDependencies":{"installNodeExporter":true},"providerUUID":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","universeName":"test1","commandType":"PVC_EXPAND_SIZE","helmReleaseName":"ybtest1-us-west1-a-twed","namespace":"yb-admin-test1","isReadOnlyCluster":false,"ybSoftwareVersion":"2.19.3.0-b80","enableNodeToNodeEncrypt":true,"enableClientToNodeEncrypt":true,"serverType":"TSERVER","tserverPartition":0,"masterPartition":0,"newDiskSize":"200Gi","masterAddresses":"ybtest1-us-west1-a-twed-yb-master-0.ybtest1-us-west1-a-twed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-b-uwed-yb-master-0.ybtest1-us-west1-b-uwed-yb-masters.yb-admin-test1.svc.cluster.local:7100,ybtest1-us-west1-c-vwed-yb-master-0.ybtest1-us-west1-c-vwed-yb-masters.yb-admin-test1.svc.cluster.local:7100","placementInfo":{"cloudList":[{"uuid":"7ae205f4-95ee-4aa5-b2f5-edb9ce793554","code":"kubernetes","regionList":[{"uuid":"80f07c68-f739-45b5-a91a-e8f8f4b0fc6d","code":"us-west1","name":"Oregon","azList":[{"uuid":"42b3fd5a-2c30-48c5-9335-d71dc60a773f","name":"us-west1-a","replicationFactor":1,"numNodesInAZ":1,"isAffinitized":true}]}]}]},"updateStrategy":"RollingUpdate","config":{"KUBECONFIG_PULL_SECRET":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/anijhawan_quay_pull_secret","KUBECONFIG":"/opt/yugaware/keys/7ae205f4-95ee-4aa5-b2f5-edb9ce793554/kubeconfig-202301","STORAGE_CLASS":"yb-standard","KUBECONFIG_PROVIDER":"gke","KUBECONFIG_IMAGE_PULL_SECRET_NAME":"anijhawan-pull-secret","KUBECONFIG_IMAGE_REGISTRY":"quay.io/yugabyte/yugabyte-itest"},"azCode":"us-west1-a","targetXClusterConfigs":[],"sourceXClusterConfigs":[]}
YW 2023-11-03T06:04:19.119Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Adding SubTaskGroup #2: ResizingDisk
YW 2023-11-03T06:04:19.120Z [INFO] 9ec7f5dd-bdcd-4917-868e-2d7bf85e4f9e from TaskExecutor in TaskPool-6 - Setting subtask(ResizingDisk) group type to Provisioning
...
```
Verified that the disk size was increased:

```
[centos@dev-server-anijhawan-4 managed]$ kubectl -n yb-admin-test1  get pvc ybtest1-us-west1-b-uwed-datadir0-ybtest1-us-west1-b-uwed-yb-tserver-0  ybtest1-us-west1-a-twed-datadir0-ybtest1-us-west1-a-twed-yb-tserver-0  ybtest1-us-west1-c-vwed-datadir0-ybtest1-us-west1-c-vwed-yb-tserver-0  -o yaml | grep storage
      volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-resizer: pd.csi.storage.gke.io
        storage: 200Gi
    storageClassName: yb-standard
      storage: 200Gi
      volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-resizer: pd.csi.storage.gke.io
        storage: 200Gi
    storageClassName: yb-standard
      storage: 200Gi
      volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
      volume.kubernetes.io/storage-resizer: pd.csi.storage.gke.io
        storage: 200Gi
    storageClassName: yb-standard
      storage: 200Gi

```

In the retry logs we can see that the function was invoked but task creation was skipped.

```
YW 2023-11-03T06:07:10.173Z [DEBUG] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from TaskExecutor in TaskPool-7 - Invoking run() of task EditKubernetesUniverse(347eb7be-88b5-44ed-b519-1052487e5ced)
YW 2023-11-03T06:07:10.173Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from CustomerTaskController in application-akka.actor.default-dispatcher-2292 - Saved task uuid 66611664-a25f-4ad2-93aa-e40a7db67654 in customer tasks table for target 347eb7be-88b5-44ed-b519-1052487e5ced:test1
YW 2023-11-03T06:07:10.322Z [DEBUG] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from TransactionUtil in TaskPool-7 - Trying(1)...
YW 2023-11-03T06:07:10.333Z [DEBUG] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from UniverseTaskBase in TaskPool-7 - Cancelling any active health-checks for universe 347eb7be-88b5-44ed-b519-1052487e5ced
YW 2023-11-03T06:07:10.379Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from EditKubernetesUniverse in TaskPool-7 - Creating task for disk size change from 100 to 200
YW 2023-11-03T06:07:10.436Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json
YW 2023-11-03T06:07:10.436Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed' '-o' 'json' - logging stdout=/tmp/shell_process_out15761747450556728945tmp, stderr=/tmp/shell_process_err16162390392062292532tmp
YW 2023-11-03T06:07:10.941Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-a-twed -o json' status=success [ 505 ms ]
YW 2023-11-03T06:07:10.982Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-b-uwed -o json
YW 2023-11-03T06:07:10.982Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-b-uwed' '-o' 'json' - logging stdout=/tmp/shell_process_out16328458040940971014tmp, stderr=/tmp/shell_process_err9595293916813332432tmp
YW 2023-11-03T06:07:11.487Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-b-uwed -o json' status=success [ 505 ms ]
YW 2023-11-03T06:07:11.526Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (abbrev cmd) - kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-c-vwed -o json
YW 2023-11-03T06:07:11.527Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Starting proc (full cmd) - 'kubectl' '--namespace' 'yb-admin-test1' 'get' 'pvc' '-l' 'app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-c-vwed' '-o' 'json' - logging stdout=/tmp/shell_process_out11035907328384396246tmp, stderr=/tmp/shell_process_err3826067280996541352tmp
YW 2023-11-03T06:07:12.031Z [INFO] ab2e48ec-a204-4449-af99-dd1db5cb15d8 from ShellProcessHandler in TaskPool-7 - Completed proc 'kubectl --namespace yb-admin-test1 get pvc -l app.kubernetes.io/name=yb-tserver,release=ybtest1-us-west1-c-vwed -o json' status=success [ 505 ms ]
```

Reviewers: sanketh, nsingh, sneelakantan, dshubin, cwang, nbhatia

Reviewed By: nsingh, dshubin

Subscribers: yugaware, nbhatia, cwang

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31095
jasonyb pushed a commit that referenced this issue Dec 15, 2023
Summary:
The YB Seq Scan code path is not hit because Foreign Scan is the default and
pg_hint_plan does not work.  An upcoming merge with YB master will bring in
master commit 465ee2c, which changes the
default to YB Seq Scan.

To test YB Seq Scan, a temporary patch is needed (see the test plan).
With that, two bugs are encountered: fix them.

1. FailedAssertion("TTS_IS_VIRTUAL(slot)"

   On simple test case

       create table t (i int primary key, j int);
       select * from t;

   get

       TRAP: FailedAssertion("TTS_IS_VIRTUAL(slot)", File: "../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c", Line: 3473, PID: 2774450)

   Details:

       #0  0x00007fd52616eacf in raise () from /lib64/libc.so.6
       #1  0x00007fd526141ea5 in abort () from /lib64/libc.so.6
       #2  0x0000000000af33ad in ExceptionalCondition (conditionName=conditionName@entry=0xc2938d "TTS_IS_VIRTUAL(slot)", errorType=errorType@entry=0xc01498 "FailedAssertion",
           fileName=fileName@entry=0xc28f18 "../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c", lineNumber=lineNumber@entry=3473)
           at ../../../../../../../src/postgres/src/backend/utils/error/assert.c:69
       #3  0x00000000005c26bd in ybFetchNext (handle=0x2600ffc43680, slot=slot@entry=0x2600ff6c2980, relid=16384)
           at ../../../../../../../src/postgres/src/backend/access/yb_access/yb_scan.c:3473
       #4  0x00000000007de444 in YbSeqNext (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:156
       #5  0x000000000078b3c6 in ExecScanFetch (node=node@entry=0x2600ff6c2778, accessMtd=accessMtd@entry=0x7de2b9 <YbSeqNext>, recheckMtd=recheckMtd@entry=0x7de26e <YbSeqRecheck>)
           at ../../../../../../src/postgres/src/backend/executor/execScan.c:133
       #6  0x000000000078b44e in ExecScan (node=0x2600ff6c2778, accessMtd=accessMtd@entry=0x7de2b9 <YbSeqNext>, recheckMtd=recheckMtd@entry=0x7de26e <YbSeqRecheck>)
           at ../../../../../../src/postgres/src/backend/executor/execScan.c:182
       #7  0x00000000007de298 in ExecYbSeqScan (pstate=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:191
       #8  0x00000000007871ef in ExecProcNodeFirst (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:480
       #9  0x000000000077db0e in ExecProcNode (node=0x2600ff6c2778) at ../../../../../../src/postgres/src/include/executor/executor.h:285
       #10 ExecutePlan (execute_once=<optimized out>, dest=0x2600ff6b1a10, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT,
           use_parallel_mode=<optimized out>, planstate=0x2600ff6c2778, estate=0x2600ff6c2128) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1650
       #11 standard_ExecutorRun (queryDesc=0x2600ff675128, direction=<optimized out>, count=0, execute_once=<optimized out>)
           at ../../../../../../src/postgres/src/backend/executor/execMain.c:367
       #12 0x000000000077dbfe in ExecutorRun (queryDesc=queryDesc@entry=0x2600ff675128, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>)
           at ../../../../../../src/postgres/src/backend/executor/execMain.c:308
       #13 0x0000000000982617 in PortalRunSelect (portal=portal@entry=0x2600ff90e128, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x2600ff6b1a10)
           at ../../../../../../src/postgres/src/backend/tcop/pquery.c:954
       #14 0x000000000098433c in PortalRun (portal=portal@entry=0x2600ff90e128, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
           dest=dest@entry=0x2600ff6b1a10, altdest=altdest@entry=0x2600ff6b1a10, qc=0x7fffc14a13c0) at ../../../../../../src/postgres/src/backend/tcop/pquery.c:786
       #15 0x000000000097e65b in exec_simple_query (query_string=0x2600ffdc6128 "select * from t;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1321
       #16 yb_exec_simple_query_impl (query_string=query_string@entry=0x2600ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5060
       #17 0x000000000097b7a5 in yb_exec_query_wrapper_one_attempt (exec_context=exec_context@entry=0x2600ffdc6000, restart_data=restart_data@entry=0x7fffc14a1640,
           functor=functor@entry=0x97e033 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2600ffdc6128, attempt=attempt@entry=0, retry=retry@entry=0x7fffc14a15ff)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5028
       #18 0x000000000097d077 in yb_exec_query_wrapper (exec_context=exec_context@entry=0x2600ffdc6000, restart_data=restart_data@entry=0x7fffc14a1640,
           functor=functor@entry=0x97e033 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2600ffdc6128)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5052
       #19 0x000000000097d0ca in yb_exec_simple_query (query_string=query_string@entry=0x2600ffdc6128 "select * from t;", exec_context=exec_context@entry=0x2600ffdc6000)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5075
       #20 0x000000000097fe8a in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5794
       #21 0x00000000008c8354 in BackendRun (port=0x2600ff8423c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4791
       #22 BackendStartup (port=0x2600ff8423c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4491
       #23 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1878
       #24 0x00000000008caa55 in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x2600ffdc01a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1533
       #25 0x0000000000804ba8 in PostgresServerProcessMain (argc=25, argv=0x2600ffdc01a0) at ../../../../../../src/postgres/src/backend/main/main.c:208
       #26 0x0000000000804bc8 in main ()

       3469    ybFetchNext(YBCPgStatement handle,
       3470                            TupleTableSlot *slot, Oid relid)
       3471    {
       3472            Assert(slot != NULL);
       3473            Assert(TTS_IS_VIRTUAL(slot));

       (gdb) p *slot
       $2 = {type = T_TupleTableSlot, tts_flags = 18, tts_nvalid = 0, tts_ops = 0xeaf5e0 <TTSOpsHeapTuple>, tts_tupleDescriptor = 0x2600ff6416c0, tts_values = 0x2600ff6c2a00, tts_isnull = 0x2600ff6c2a10, tts_mcxt = 0x2600ff6c2000, tts_tid = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0, yb_item = {ybctid = 0}}, tts_tableOid = 0, tts_yb_insert_oid = 0}

   Fix by making YB Seq Scan always use a virtual slot.  This is similar
   to what is done for YB Foreign Scan.

2. segfault in ending scan

   Same simple test case gives segfault at a later stage.

   Details:

       #0  0x00000000007de762 in table_endscan (scan=0x3debfe3ab88) at ../../../../../../src/postgres/src/include/access/tableam.h:997
       #1  ExecEndYbSeqScan (node=node@entry=0x3debfe3a778) at ../../../../../../src/postgres/src/backend/executor/nodeYbSeqscan.c:298
       #2  0x0000000000787a75 in ExecEndNode (node=0x3debfe3a778) at ../../../../../../src/postgres/src/backend/executor/execProcnode.c:649
       #3  0x000000000077ffaf in ExecEndPlan (estate=0x3debfe3a128, planstate=<optimized out>) at ../../../../../../src/postgres/src/backend/executor/execMain.c:1489
       #4  standard_ExecutorEnd (queryDesc=0x2582fdc88928) at ../../../../../../src/postgres/src/backend/executor/execMain.c:503
       #5  0x00000000007800f8 in ExecutorEnd (queryDesc=queryDesc@entry=0x2582fdc88928) at ../../../../../../src/postgres/src/backend/executor/execMain.c:474
       #6  0x00000000006f140c in PortalCleanup (portal=0x2582ff900128) at ../../../../../../src/postgres/src/backend/commands/portalcmds.c:305
       #7  0x0000000000b3c36a in PortalDrop (portal=portal@entry=0x2582ff900128, isTopCommit=isTopCommit@entry=false)
           at ../../../../../../../src/postgres/src/backend/utils/mmgr/portalmem.c:514
       #8  0x000000000097e667 in exec_simple_query (query_string=0x2582ffdc6128 "select * from t;") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:1331
       #9  yb_exec_simple_query_impl (query_string=query_string@entry=0x2582ffdc6128) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5060
       #10 0x000000000097b79a in yb_exec_query_wrapper_one_attempt (exec_context=exec_context@entry=0x2582ffdc6000, restart_data=restart_data@entry=0x7ffc81c0e7d0,
           functor=functor@entry=0x97e028 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2582ffdc6128, attempt=attempt@entry=0, retry=retry@entry=0x7ffc81c0e78f)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5028
       #11 0x000000000097d06c in yb_exec_query_wrapper (exec_context=exec_context@entry=0x2582ffdc6000, restart_data=restart_data@entry=0x7ffc81c0e7d0,
           functor=functor@entry=0x97e028 <yb_exec_simple_query_impl>, functor_context=functor_context@entry=0x2582ffdc6128)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5052
       #12 0x000000000097d0bf in yb_exec_simple_query (query_string=query_string@entry=0x2582ffdc6128 "select * from t;", exec_context=exec_context@entry=0x2582ffdc6000)
           at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5075
       #13 0x000000000097fe7f in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:5794
       #14 0x00000000008c8349 in BackendRun (port=0x2582ff8403c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4791
       #15 BackendStartup (port=0x2582ff8403c0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4491
       #16 ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1878
       #17 0x00000000008caa4a in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x2582ffdc01a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1533
       #18 0x0000000000804b9d in PostgresServerProcessMain (argc=25, argv=0x2582ffdc01a0) at ../../../../../../src/postgres/src/backend/main/main.c:208
       #19 0x0000000000804bbd in main ()

       294             /*
       295              * close heap scan
       296              */
       297             if (tsdesc != NULL)
       298                     table_endscan(tsdesc);

   The reason is that the initial merge 55782d5 incorrectly merged the end of
   ExecEndYbSeqScan.  Upstream PG commit 9ddef36278a9f676c07d0b4d9f33fa22e48ce3b5
   removed this code, but the initial merge duplicated those lines.  Remove them.
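
   In essence, the fix removes the duplicated block quoted above from
   ExecEndYbSeqScan, roughly:

        -       /*
        -        * close heap scan
        -        */
        -       if (tsdesc != NULL)
        -               table_endscan(tsdesc);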

Test Plan:
Apply the following patch to activate YB Seq Scan:

    diff --git a/src/postgres/src/backend/optimizer/path/allpaths.c b/src/postgres/src/backend/optimizer/path/allpaths.c
    index 8a4c38a965..854d84a648 100644
    --- a/src/postgres/src/backend/optimizer/path/allpaths.c
    +++ b/src/postgres/src/backend/optimizer/path/allpaths.c
    @@ -576,7 +576,7 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
                     else
                     {
                         /* Plain relation */
    -                    if (IsYBRelationById(rte->relid))
    +                    if (false)
                         {
                             /*
                              * Using a foreign scan which will use the YB FDW by

On almalinux 8,

    ./yb_build.sh fastdebug --gcc11
    pg15_tests/run_all_tests.sh fastdebug --gcc11 --sj --sp --scb

fails the following tests:

- test_D29546
- test_pg15_regress: yb_pg15
- test_types_geo: yb_pg_box
- test_hash_in_queries: yb_hash_in_queries

Manually check that these failures are due to YB Seq Scan explain output
differences.

Reviewers: aagrawal, tfoucher

Reviewed By: tfoucher

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D31139
charleswang234 added a commit that referenced this issue Jan 9, 2024
…on releaseInstance and before AnsibleDestroyServer subtask

Summary:
Add a new pre-check subtask to `ReleaseInstanceFromUniverse` and as a subtask right before any ansibleDestroyServer subtask called `CheckNodeSafeToDelete`.

`CheckNodeSafeToDelete` checks that there are no tserver tablets assigned to this node (if applicable) and that no master on this node is part of the universe quorum. We base these checks on the IP of the node.

We will fail the `ReleaseInstanceFromUniverse` task if this pre-check fails. We also fail if we are taking down the server while its tserver still has tablets or its master process is still part of the quorum. This is an extra precaution and a short-term fix until we have a cluster whitelist implemented on the DB side.

Test Plan:
Create a 4 node rf3 universe (2 1 1 for the azs):

Perform a remove node -> release node for one of the nodes in the az that has 2 nodes. Make sure that on the release node, the task succeeds.

Test #2
Perform a remove node. Manually start up the tserver on the node and remove the blacklist using the yb-admin command. Perform a release node from the universe. Check that the release node task fails.

Test #3
Perform a remove node. Manually start up the master process on the node we just removed. Add the node back to the quorum using yb-admin. Perform a release node from the universe. Check that the release node task fails.

(Can also just change the node state of the node to `Removed` and then run the ReleaseNodeFromUniverse task)
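
For reference, the manual steps above can be driven with yb-admin along these lines (a sketch; addresses are placeholders using the default master/tserver RPC ports):

```
# Remove the node from the tserver blacklist
yb-admin -master_addresses <m1:7100,m2:7100,m3:7100> change_blacklist REMOVE <node_ip>:9100

# Add the node's master back to the quorum
yb-admin -master_addresses <m1:7100,m2:7100,m3:7100> change_master_config ADD_SERVER <node_ip> 7100
```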

Other testing:
- Check that we are able to do a full move, delete a universe.

Reviewers: sanketh, nsingh, hzare

Reviewed By: nsingh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D30103
d-uspenskiy added a commit that referenced this issue Jan 12, 2024
…ction

Summary:
There are several unit tests that suffer from a TSAN data race warning with the following stack:

```
WARNING: ThreadSanitizer: data race (pid=38656)
  Read of size 8 at 0x7f6f2a44b038 by thread T21:
    #0 memcpy /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:115:5 (pg_ddl_concurrency-test+0x9e197)
    #1 <null> <null> (libnss_sss.so.2+0x72ef) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b)
    #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9)
    #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7)
    #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7)
    #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe)
    #6 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:647:20 (libpq.so.5+0x2c279)
    #7 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/libpq_utils.cc:278:24 (libpq_utils.so+0x11d6b)
...

  Previous write of size 8 at 0x7f6f2a44b038 by thread T20 (mutexes: write M0):
    #0 mmap64 /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/../../sanitizer_common/sanitizer_common_interceptors.inc:7485:3 (pg_ddl_concurrency-test+0xda204)
    #1 <null> <null> (libnss_sss.so.2+0x7169) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b)
    #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9)
    #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7)
    #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7)
    #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe)
    #6 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:647:20 (libpq.so.5+0x2c279)
    #7 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${BUILD_ROOT}/../../src/yb/yql/pgwrapper/libpq_utils.cc:278:24 (libpq_utils.so+0x11d6b)
...

  Location is global '??' at 0x7f6f2a44b000 (passwd+0x38)

  Mutex M0 (0x7f6f2af29380) created at:
    #0 pthread_mutex_lock /opt/yb-build/llvm/yb-llvm-v17.0.2-yb-1-1696896765-6a83e4b2-almalinux8-x86_64-build/src/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1339:3 (pg_ddl_concurrency-test+0xa464b)
    #1 <null> <null> (libnss_sss.so.2+0x70d6) (BuildId: a17afeaa37369696ec2457ab7a311139707fca9b)
    #2 pqGetpwuid ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/thread.c:99:9 (libpq.so.5+0x4a8c9)
    #3 pqGetHomeDirectory ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:6674:9 (libpq.so.5+0x2d3c7)
    #4 connectOptions2 ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:1150:8 (libpq.so.5+0x2d3c7)
    #5 PQconnectStart ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:791:7 (libpq.so.5+0x2c2fe)
...
```

All failing tests have a common feature: all of them create connections to postgres from multiple threads at the same time.
When creating a new connection, the `libpq` library internally calls the standard `getpwuid_r` function. This function is thread-safe, so a TSAN warning is not expected there.

The solution is to suppress the warning in the `getpwuid_r` function.
**Note:** because the `getpwuid_r` function name does not appear in the TSAN warning stack, the warning is suppressed for the caller function `pqGetpwuid` instead.
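
A suppression entry of roughly this shape (in a TSAN suppressions file consumed via `TSAN_OPTIONS=suppressions=<file>`; the exact wiring in the YB build is not shown here) matches that caller frame:

```
# The data race is reported inside getpwuid_r (via libnss_sss); the visible
# libpq caller frame is pqGetpwuid, so suppress on that symbol.
race:pqGetpwuid
```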
Jira: DB-9523

Test Plan: Jenkins

Reviewers: sergei, bogdan

Reviewed By: sergei

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31646
arpang added a commit that referenced this issue Jan 30, 2024
Summary:
The function `yb_single_row_update_or_delete_path` was reworked for PG15 in D27692. There are a few more issues in the function that this revision fixes:

  # With PG commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35,
  ## the target list returned by `build_path_tlist(root, subpath)` only contains the modified and junk columns. So, there is no need to ignore the unspecified columns.
  ## "tableoid" junk column is added for partitioned tables. Ignore it, along with other junk cols, when iterating over `build_path_tlist(root, subpath)`.
  ## `RelOptInfo` entries in `simple_rel_array` corresponding to the non-leaf relations of a partitioned table are NOT NULL. Ignore these when checking the number of relations being updated.
  ## When updating a partitioned table, the child of the UPDATE node is an APPEND node. This append node is skipped in the final plan if it has only one child. Take this into account when applying conditions to the `ModifyTablePath.subpath` by passing the subpath through `get_singleton_append_subpath` (see the sketch after this list).
  # D27692 added an assertion `Assert(root->update_colnos->length > update_col_index)`, which is incorrect. The pre-existing code comment clearly stated: `.. it is possible that planner adds extra expressions. In particular, we've seen a RowExpr when a view was updated`. As expected, this incorrect assertion fails when updating a view with a trigger (see the added test). Remove the assertion. Instead, move the expression-out-of-range check before reading `root->update_colnos`.
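
For the Append-skipping point above, the shape of the change is roughly as follows (a sketch only; it assumes `get_singleton_append_subpath` returns the lone child of a single-child Append/MergeAppend and the path itself otherwise):

```
/* Sketch: when an UPDATE of a partitioned table has a single-child Append
 * below ModifyTable, look through it before applying the single-row checks. */
Path *subpath = ((ModifyTablePath *) path)->subpath;

if (IsA(subpath, AppendPath) || IsA(subpath, MergeAppendPath))
	subpath = get_singleton_append_subpath(subpath);
```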

Test Plan:
Jenkins: rebase: pg15

Added two tests in yb_pg15:
- one, to test whether single-row optimization is invoked when only one partition is updated (fix #1)
- other, to test UPDATE view with INSTEAD OF UPDATE trigger (fix #2). This test is taken from yb_pg_triggers (yb_pg_triggers still fails with unrelated errors)

./yb_build.sh --java-test org.yb.pgsql.TestPg15Regress#testPg15Regress

Reviewers: jason, tnayak, amartsinchyk

Reviewed By: amartsinchyk

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D31709
charleswang234 added a commit that referenced this issue Mar 18, 2024
… master config on releaseInstance and before AnsibleDestroyServer subtask

Summary:
Original commit:

252ef97 / D30103

Add a new pre-check subtask to `ReleaseInstanceFromUniverse` and as a subtask right before any ansibleDestroyServer subtask called `CheckNodeSafeToDelete`.

`CheckNodeSafeToDelete` checks that there are no tserver tablets assigned to this node (if applicable) and that no master on this node is part of the universe quorum. We base these checks on the IP of the node.

We will fail the `ReleaseInstanceFromUniverse` task if this pre-check fails. We also fail if we are taking down the server while its tserver still has tablets or its master process is still part of the quorum. This is an extra precaution and a short-term fix until we have a cluster whitelist implemented on the DB side.

Test Plan:
Create a 4 node rf3 universe (2 1 1 for the azs):

Perform a remove node -> release node for one of the nodes in the az that has 2 nodes. Make sure that on the release node, the task succeeds.

Test #2
Perform a remove node. Manually start up the tserver on the node and remove the blacklist using the yb-admin command. Perform a release node from the universe. Check that the release node task fails.

Test #3
Perform a remove node. Manually start up the master process on the node we just removed. Add the node back to the quorum using yb-admin. Perform a release node from the universe. Check that the release node task fails.

(Can also just change the node state of the node to `Removed` and then run the ReleaseNodeFromUniverse task)

Other testing:
- Check that we are able to do a full move, delete a universe.

Reviewers: sanketh, nsingh, yshchetinin

Reviewed By: nsingh

Subscribers: yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33265
charleswang234 added a commit that referenced this issue Mar 18, 2024
… master config on releaseInstance and before AnsibleDestroyServer subtask

Summary:
Original commit:

252ef97 / D30103

Add a new pre-check subtask to `ReleaseInstanceFromUniverse` and as a subtask right before any ansibleDestroyServer subtask called `CheckNodeSafeToDelete`.

`CheckNodeSafeToDelete` checks that there are no tserver tablets assigned to this node (if applicable) and that no master on this node is part of the universe quorum. We base these checks on the IP of the node.

We will fail the `ReleaseInstanceFromUniverse` task if this pre-check fails. We also fail if we are taking down the server while its tserver still has tablets or its master process is still part of the quorum. This is an extra precaution and a short-term fix until we have a cluster whitelist implemented on the DB side.

Test Plan:
Create a 4 node rf3 universe (2 1 1 for the azs):

Perform a remove node -> release node for one of the nodes in the az that has 2 nodes. Make sure that on the release node, the task succeeds.

Test #2
Perform a remove node. Manually start up the tserver on the node and remove the blacklist using the yb-admin command. Perform a release node from the universe. Check that the release node task fails.

Test #3
Perform a remove node. Manually start up the master process on the node we just removed. Add the node back to the quorum using yb-admin. Perform a release node from the universe. Check that the release node task fails.

(Can also just change the node state of the node to `Removed` and then run the ReleaseNodeFromUniverse task)

Other testing:
- Check that we are able to do a full move, delete a universe.

Reviewers: sanketh, nsingh, yshchetinin

Reviewed By: nsingh

Subscribers: yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33260
yusong-yan added a commit that referenced this issue Apr 18, 2024
…retained for CDC"

Summary:
D33131 introduced a segmentation fault that was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests.

The proper fix for this problem is WIP
Jira: DB-10780, DB-10466

Test Plan: Jenkins: urgent

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34245
aishwarya24 added a commit that referenced this issue Apr 30, 2024
…-examples application (#21900)

* Upgrade the gorm docs to use latest go version and gorm v2

* Mentioned the use of smart drivers

* Harsh daryani896 patch 1 (#2)

* Update pgx driver version from v4 to v5 in docs.

* review comments and copied to preview

---------

Co-authored-by: Harsh Daryani <82017686+HarshDaryani896@users.noreply.github.com>
Co-authored-by: aishwarya24 <ashchakravarthy@gmail.com>
Sahith02 added a commit that referenced this issue May 17, 2024
…nal when creating telemetry provider

Summary:
This diff fixes the following 4 tickets:
1. [PLAT-13995] Tags should be optional when creating telemetry provider.
2. [PLAT-12270] Rename logRow to logRows + pgaudit.log_row to pgaudit.log_rows in YBA code
3. [PLAT-14002] Should not allow deleting a Telemetry Provider when it's in use
4. [PLAT-14011] Add AuditLog Payload in ModifyAuditLogging task type details

For #2, it is okay to change the request body param `logRow` -> `logRows` since this is an internal API and no one uses this DB audit logs feature yet including UI, YBM, other clients, etc.

Test Plan:
Manually tested all 4 of the above test cases.
Run UTs.
Run itests.

Reviewers: amalyshev

Reviewed By: amalyshev

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D35124
Sahith02 added a commit that referenced this issue May 17, 2024
…s should be optional when creating telemetry provider

Summary:
Original commit: 7f76a34 / D35124
This diff fixes the following 4 tickets:
1. [PLAT-13995] Tags should be optional when creating telemetry provider.
2. [PLAT-12270] Rename logRow to logRows + pgaudit.log_row to pgaudit.log_rows in YBA code
3. [PLAT-14002] Should not allow deleting a Telemetry Provider when it's in use
4. [PLAT-14011] Add AuditLog Payload in ModifyAuditLogging task type details

For #2, it is okay to change the request body param `logRow` -> `logRows` since this is an internal API and no one uses this DB audit logs feature yet including UI, YBM, other clients, etc.

Test Plan:
Manually tested all 4 of the above test cases.
Run UTs.
Run itests.

Reviewers: amalyshev

Reviewed By: amalyshev

Subscribers: yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35151
karthik-ramanathan-3006 added a commit that referenced this issue May 17, 2024
Summary:
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior: the webserver may continue running for a short time while the static variables used to serve requests have already been destroyed (illustrated by the sketch below).
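
The hazard can be reproduced outside the webserver with a small standalone C++ sketch (hypothetical code, not YB code): a detached thread keeps touching a namespace-scoped map while the main thread calls exit(), which destroys the map.

```
#include <chrono>
#include <cstdlib>
#include <string>
#include <thread>
#include <unordered_map>

// Namespace-scoped variable: static storage duration, destroyed during exit().
static std::unordered_map<std::string, int> metrics = {{"requests", 0}};

int main() {
  // Simulates a server thread that keeps serving requests.
  std::thread worker([] {
    for (;;) {
      ++metrics["requests"];  // May touch the already-destroyed map once exit() runs.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });
  worker.detach();

  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  std::exit(0);  // Destroys 'metrics' while 'worker' may still be using it: undefined behavior.
}
```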

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#!/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: ybase, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35116
svarnau pushed a commit that referenced this issue May 25, 2024
…retained for CDC"

Summary:
D33131 introduced a segmentation fault that was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests.

The proper fix for this problem is WIP
Jira: DB-10780, DB-10466

Test Plan: Jenkins: urgent

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34245
svarnau pushed a commit that referenced this issue May 25, 2024
…-examples application (#21900)

* Upgrade the gorm docs to use latest go version and gorm v2

* Mentioned the use of smart drivers

* Harsh daryani896 patch 1 (#2)

* Update pgx driver version from v4 to v5 in docs.

* review comments and copied to preview

---------

Co-authored-by: Harsh Daryani <82017686+HarshDaryani896@users.noreply.github.com>
Co-authored-by: aishwarya24 <ashchakravarthy@gmail.com>
svarnau pushed a commit that referenced this issue May 25, 2024
…nal when creating telemetry provider

Summary:
This diff fixes the following 4 tickets:
1. [PLAT-13995] Tags should be optional when creating telemetry provider.
2. [PLAT-12270] Rename logRow to logRows + pgaudit.log_row to pgaudit.log_rows in YBA code
3. [PLAT-14002] Should not allow deleting a Telemetry Provider when it's in use
4. [PLAT-14011] Add AuditLog Payload in ModifyAuditLogging task type details

For #2, it is okay to change the request body param `logRow` -> `logRows` since this is an internal API and no one uses this DB audit logs feature yet including UI, YBM, other clients, etc.

Test Plan:
Manually tested all 4 of the above test cases.
Run UTs.
Run itests.

Reviewers: amalyshev

Reviewed By: amalyshev

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D35124
svarnau pushed a commit that referenced this issue May 25, 2024
Summary:
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#!/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: ybase, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35116
svarnau pushed a commit that referenced this issue May 29, 2024
…s should be optional when creating telemetry provider

Summary:
Original commit: 7f76a34 / D35124
This diff fixes the following 4 tickets:
1. [PLAT-13995] Tags should be optional when creating telemetry provider.
2. [PLAT-12270] Rename logRow to logRows + pgaudit.log_row to pgaudit.log_rows in YBA code
3. [PLAT-14002] Should not allow deleting a Telemetry Provider when it's in use
4. [PLAT-14011] Add AuditLog Payload in ModifyAuditLogging task type details

For #2, it is okay to change the request body param `logRow` -> `logRows` since this is an internal API and no one uses this DB audit logs feature yet including UI, YBM, other clients, etc.

Test Plan:
Manually tested all 4 of the above test cases.
Run UTs.
Run itests.

Reviewers: amalyshev

Reviewed By: amalyshev

Subscribers: yugaware

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35151
svarnau pushed a commit that referenced this issue May 29, 2024
… SIGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#!/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: yql, smishra, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35169
karthik-ramanathan-3006 added a commit that referenced this issue Jun 6, 2024
…IGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: yql, smishra, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35171
karthik-ramanathan-3006 added a commit that referenced this issue Jun 6, 2024
…IGTERM

Summary:
Original commit: 5862233 / D35116
The YSQL webserver has occasionally produced coredumps of the following form upon receiving a termination signal from postmaster.
```
                #0  0x00007fbac35a9ae3 _ZNKSt3__112__hash_tableINS_17__hash_value_typeINS_12basic_string <snip>
                #1  0x00007fbac005485d _ZNKSt3__113unordered_mapINS_12basic_string <snip> (libyb_util.so)
                #2  0x00007fbac0053180 _ZN2yb16PrometheusWriter16WriteSingleEntryERKNSt3__113unordered_mapINS1_12basic_string <snip>
                #3  0x00007fbab21ff1eb _ZN2yb6pggateL26PgPrometheusMetricsHandlerERKNS_19WebCallbackRegistry10WebRequestEPNS1_11WebResponseE (libyb_pggate_webserver.so)
                ....
                ....
```

The coredump indicates corruption of a namespace-scoped variable of type unordered_map while attempting to serve a request after a termination signal has been received.
The current code causes the webserver (postgres background worker) to call postgres' `proc_exit()` which consequently calls `exit()`.

According to the [[ https://en.cppreference.com/w/cpp/utility/program/exit | C++ standard ]], a limited amount of cleanup is performed on exit():
 - Notably destructors of variables with automatic storage duration are not invoked. This implies that the webserver's destructor is not called, and therefore the server is not stopped.
 - Namespace-scoped variables have [[ https://en.cppreference.com/w/cpp/language/storage_duration | static storage duration ]].
 - Objects with static storage duration are destroyed.
 - This leads to a possibility of undefined behavior where the webserver may continue running for a short duration of time, while the static variables used to serve requests may have been GC'ed.

This revision explicitly stops the webserver upon receiving a termination signal, by calling its destructor.
It also adds logic to the handlers to return a `503 SERVICE_UNAVAILABLE` once termination has been initiated.
Jira: DB-7796

Test Plan:
To test this manually, use a HTTP load generation tool like locust to bombard the YSQL Webserver with requests to an endpoint like `<address>:13000/prometheus-metrics`.
On a standard devserver, I configured locust to use 30 simultaneous users (30 requests per second) to reproduce the issue.

The following bash script can be used to detect the coredumps:
```
#!/bin/bash
ITERATIONS=50
YBDB_PATH=/path/to/code/yugabyte-db

# Count the number of dump files to avoid having to use `sudo coredumpctl`
idumps=$(ls /var/lib/systemd/coredump/ | wc -l)
for ((i = 0 ; i < $ITERATIONS ; i++ ))
do
        echo "Iteration: $(($i + 1))";
        $YBDB_PATH/bin/yb-ctl restart > /dev/null

        nservers=$(netstat -nlpt 2> /dev/null | grep 13000 | wc -l)
        if (( nservers != 1)); then
                echo "Web server has not come up. Exiting"
                exit 1;
        fi

        sleep 5s

        # Kill the webserver
        pkill -TERM -f 'YSQL webserver'

        # Count the number of coredumps
        # Please validate that the coredump produced is that of postgres/webserver
        ndumps=$(ls /var/lib/systemd/coredump/ | wc -l)
        if (( ndumps > idumps  )); then
                echo "Core dumps: $(($ndumps - $idumps))"
        else
                echo "No new core dumps found"
        fi
done
```

Run the script with the load generation tool running against the webserver in the background.
 - Without the fix in this revision, the above script produced 8 postgres/webserver core dumps in 50 iterations.
 - With the fix, no coredumps were observed.

Reviewers: telgersma, fizaa

Reviewed By: telgersma

Subscribers: ybase, smishra, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35170
jasonyb pushed a commit that referenced this issue Jun 11, 2024
Support for database/user/client-based aggregates was added, with three new
views to access these statistics. Some new counters were added, including
min/max/mean time histograms. We are saving the parameters of slow queries,
which can be inspected later. Did some refactoring of the code, including
renaming the whole extension from pg_stat_statement to pg_stat_monitor.
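
A quick way to sanity-check the renamed extension (a sketch; only the `pg_stat_monitor` view and its `query` and `calls` columns are assumed, and the extension must be preloaded via shared_preload_libraries):

```
-- Install the renamed extension and read basic per-query statistics.
CREATE EXTENSION IF NOT EXISTS pg_stat_monitor;

SELECT query, calls
FROM pg_stat_monitor
ORDER BY calls DESC
LIMIT 5;
```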