[CDCSDK] Address performance impact of increasing retention period of intents #21580

Closed
yugabyte-ci opened this issue Mar 19, 2024 · 0 comments
Labels: 2024.1 Backport Required, 2024.1.1_blocker, area/docdb (YugabyteDB core features), jira-originated, kind/enhancement (This is an enhancement of an existing feature), priority/medium (Medium priority issue)

yugabyte-ci commented Mar 19, 2024

Jira Link: DB-10466

Setup: Two RF3 universes, each with three c5.4xlarge nodes and provisioned IOPS set to 10000, hold 4 million rows and take a constant stream of single-row updates in full transactions from 6 threads. One universe has no CDC streams (the non-CDC universe); the other has a CDC stream that is lagging behind, with no changes being sent (the CDC universe). This performance issue was observed after the memory issue #21290 was resolved.

The following graph shows the throughput of the non-CDC universe (blue) and the CDC universe (green), as reported by Prometheus.
[Image: Screenshot 2024-03-19 at 9 39 56 AM]

This issue tracks the fixes needed to address the above performance issue.

yugabyte-ci added the area/docdb, jira-originated, kind/enhancement, and priority/low labels Mar 19, 2024
rthallamko3 added the priority/medium label and removed the priority/low label Mar 19, 2024
es1024 added a commit that referenced this issue Apr 5, 2024
Summary:
When CDC is lagging behind, intentsdb may contain many SST files that consist only of applied transactions but cannot yet be deleted, because CDC has not streamed their changes. These SST files slow down reads from intentsdb, even though in most cases we do not care about their contents (all changes in them have already been applied).

This diff adds a hybrid time filter on intent iterators for the read path, conflict resolution, and intent apply, so that they skip all SST files before the min running hybrid time. The filter is gated behind the newly added `docdb_ht_filter_intents` gflag (on by default in debug builds). `docdb_ht_filter_intents` will be switched to on by default after CDC stress tests have been run with the D31900 / 559b2b0 changes enabled as well.

Jira: DB-10466

Test Plan: Jenkins.

Reviewers: sergei, mbautin

Reviewed By: sergei, mbautin

Subscribers: yql, ybase, bogdan, rthallam

Differential Revision: https://phorge.dev.yugabyte.com/D33131
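
The file-skipping idea in the summary above can be illustrated with a minimal, self-contained sketch. This is not the code from D33131: `SstFileInfo`, `HybridTimeValue`, and `FilesToScan` are hypothetical names, and the sketch assumes that each SST file records the largest hybrid time it contains and that a file is skipped when that value is below the min running hybrid time.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a hybrid time value (assumption for this sketch).
using HybridTimeValue = std::uint64_t;

// Hypothetical per-file metadata: the largest hybrid time of any record in the file.
struct SstFileInfo {
  std::string name;
  HybridTimeValue max_ht;
};

// Returns the files an intent iterator still has to read.
// Assumption: a file whose newest record is older than min_running_ht holds only
// intents of transactions that are no longer running, so it can be skipped.
// When the filter is disabled (mirroring a gflag-style switch), every file is kept.
std::vector<SstFileInfo> FilesToScan(const std::vector<SstFileInfo>& files,
                                     HybridTimeValue min_running_ht,
                                     bool ht_filter_enabled) {
  if (!ht_filter_enabled) {
    return files;
  }
  std::vector<SstFileInfo> result;
  for (const auto& file : files) {
    if (file.max_ht >= min_running_ht) {
      result.push_back(file);
    }
  }
  return result;
}
```

In this sketch the skipped files stay on disk, which matches the situation described above: they are retained for CDC but no longer add cost to the filtered read paths.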
yusong-yan added a commit that referenced this issue Apr 18, 2024
…retained for CDC"

Summary:
D33131 introduced a segmentation fault that was identified in multiple tests.
```
* thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x00007f4d2b6f3a84 libpthread.so.0`__pthread_mutex_lock + 4
    frame #1: 0x000055d6d1e1190b yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(yb::HybridTime, std::__1::chrono::time_point<yb::CoarseMonoClock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:v170002](this=0x00007f4ccb6feaa0, __m=0x0000000000000110) at unique_lock.h:41:11
    frame #2: 0x000055d6d1e118f5 yb-tserver`yb::tablet::MvccManager::SafeTimeForFollower(this=0x00000000000000f0, min_allowed=<unavailable>, deadline=yb::CoarseTimePoint @ 0x00007f4ccb6feb08) const at mvcc.cc:500:32
    frame #3: 0x000055d6d1ef58e3 yb-tserver`yb::tablet::TransactionParticipant::Impl::ProcessRemoveQueueUnlocked(this=0x000037e27d26fb00, min_running_notifier=0x00007f4ccb6fef28) at transaction_participant.cc:1537:45
    frame #4: 0x000055d6d1efc11a yb-tserver`yb::tablet::TransactionParticipant::Impl::EnqueueRemoveUnlocked(this=0x000037e27d26fb00, id=<unavailable>, reason=<unavailable>, min_running_notifier=0x00007f4ccb6fef28, expected_deadlock_status=<unavailable>) at transaction_participant.cc:1516:5
    frame #5: 0x000055d6d1e3afbe yb-tserver`yb::tablet::RunningTransaction::DoStatusReceived(this=0x000037e2679b5218, status_tablet="d5922c26c9704f298d6812aff8f615f6", status=<unavailable>, response=<unavailable>, serial_no=56986, shared_self=std::__1::shared_ptr<yb::tablet::RunningTransaction>::element_type @ 0x000037e2679b5218) at running_transaction.cc:424:16
    frame #6: 0x000055d6d0d7db5f yb-tserver`yb::client::(anonymous namespace)::TransactionRpcBase::Finished(this=0x000037e29c80b420, status=<unavailable>) at transaction_rpc.cc:67:7
```
This diff reverts the change to unblock the tests.

A proper fix for this problem is in progress.
Jira: DB-10780, DB-10466

Test Plan: Jenkins: urgent

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34245
es1024 added a commit that referenced this issue May 6, 2024
…er intent SST files only retained for CDC""

Summary:
This reverts commit D34245 / 89316bd, which reverted D33131 / fb7c86c.
That revert was made because of a segmentation fault caused by
`min_running_ht` being initialized too early; this issue is now fixed by
D34389 / 138b81a.
Jira: DB-10466, DB-10780

Test Plan: Jenkins

Reviewers: yyan, sergei

Reviewed By: yyan

Subscribers: rthallam, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34745
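
In the stack trace from the revert, frame #2 shows `SafeTimeForFollower` called with `this=0x00000000000000f0`, which is consistent with a member being accessed through a pointer that had not been set yet. The sketch below illustrates that general class of initialization-order bug and one defensive pattern for it. It is not the fix made in D34389: `SafeTimeSource`, `Participant`, `Attach`, and `MinRunningHt` are hypothetical names, and the sketch assumes C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a component that provides a safe time.
struct SafeTimeSource {
  std::uint64_t safe_time = 0;
  std::uint64_t SafeTime() const { return safe_time; }
};

// Hypothetical component whose dependency is attached after construction.
class Participant {
 public:
  // The source becomes available later in startup (assumption for this sketch).
  void Attach(const SafeTimeSource* source) { source_ = source; }

  // Returns std::nullopt while the dependency is missing, instead of
  // dereferencing a null pointer when asked for a value too early.
  std::optional<std::uint64_t> MinRunningHt() const {
    if (source_ == nullptr) {
      return std::nullopt;
    }
    return source_->SafeTime();
  }

 private:
  const SafeTimeSource* source_ = nullptr;
};
```

Callers then have to handle the "not yet available" case explicitly, rather than relying on the order in which components happen to be initialized.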
es1024 added a commit that referenced this issue May 15, 2024
…d for CDC

Summary:
Original commit: fb7c86c / D33131
When CDC is lagging behind, intentsdb may contain many SST files that consist only of applied transactions but cannot yet be deleted, because CDC has not streamed their changes. These SST files slow down reads from intentsdb, even though in most cases we do not care about their contents (all changes in them have already been applied).

This diff adds a hybrid time filter on intent iterators for the read path, conflict resolution, and intent apply, so that they skip all SST files before the min running hybrid time. The filter is gated behind the newly added `docdb_ht_filter_intents` gflag (on by default in debug builds). `docdb_ht_filter_intents` will be switched to on by default after CDC stress tests have been run with the D31900 / 559b2b0 changes enabled as well.

Jira: DB-10466

Test Plan: Jenkins.

Reviewers: sergei, mbautin, rthallam

Reviewed By: rthallam

Subscribers: rthallam, bogdan, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34746
svarnau pushed a commit that referenced this issue May 25, 2024
…retained for CDC"

svarnau pushed a commit that referenced this issue May 25, 2024
…er intent SST files only retained for CDC""

svarnau pushed a commit that referenced this issue May 29, 2024
…d for CDC
