Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] flaky test: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection #15006

Closed
bmatican opened this issue Nov 15, 2022 · 2 comments
Closed
Assignees
Labels
2.16 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug kind/failing-test Tests and testing infra priority/high High Priority

Comments

@bmatican
Copy link
Contributor

bmatican commented Nov 15, 2022

Jira Link: DB-4249

Description

https://detective-gcp.dev.yugabyte.com/stability/test?analyze_trends=true&branch=master&build_type=all&class=YbAdminSnapshotScheduleTest&fail_tag=all&name=CacheRefreshOnNewConnection&platform=linux

Bunch of failures of this form, so maybe just overloaded server in the test

ERROR: Already present: Duplicate request 8 from client 6307025e-51a7-4a50-8c09-de3a13eb8609 (min running 8))

@bmatican bmatican added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Nov 15, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Nov 15, 2022
@bmatican bmatican added kind/failing-test Tests and testing infra priority/high High Priority and removed priority/medium Medium priority issue labels Nov 15, 2022
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Nov 15, 2022
@sanketkedia sanketkedia assigned Huqicheng and unassigned sanketkedia Nov 15, 2022
@sanketkedia
Copy link
Contributor

Based on slack conversation seems like caused due to 5a09155

Huqicheng added a commit that referenced this issue Nov 19, 2022
…n retrying a batcher

Summary:
When retrying a batcher for tablet split error, we reuse the same request id as the failed batcher.
So we should make sure all ops having same request id should be retried with the same WriteRpc.

But currently, it's possible that an operation failed with TabletLookup and retry with a seperate WriteRpc,
this will lead to unexpected `Duplicate request` error.
The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.

With this diff, if TabletLookup errors happen for an operation is going to be retried (have a valid request_id),
we also set the error to operations having the same request_id. They will be retried in one shot with the next retry batcher.

Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100

Reviewers: sergei, rthallam, timur

Reviewed By: timur

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D21137
@Huqicheng
Copy link
Contributor

Look at the detective trend: https://detective.dev.yugabyte.com/stability/test?analyze_trends=true&branch=master&class=YbAdminSnapshotScheduleTest&name=CacheRefreshOnNewConnection

The test is passing all builds after the fix is in.

Also need to back port this to 2.16 2.17.0

Huqicheng added a commit that referenced this issue Nov 21, 2022
… request id when retrying a batcher

Summary:
Original commit: 47dec92 / D21137
When retrying a batcher for tablet split error, we reuse the same request id as the failed batcher.
So we should make sure all ops having same request id should be retried with the same WriteRpc.

But currently, it's possible that an operation failed with TabletLookup and retry with a seperate WriteRpc,
this will lead to unexpected `Duplicate request` error.
The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.

With this diff, if TabletLookup errors happen for an operation is going to be retried (have a valid request_id),
we also set the error to operations having the same request_id. They will be retried in one shot with the next retry batcher.

Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100

Reviewers: sergei, rthallam, timur, jmeehan

Reviewed By: jmeehan

Subscribers: jmeehan, bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D21228
Huqicheng added a commit that referenced this issue Nov 21, 2022
…me request id when retrying a batcher

Summary:
Original commit: 47dec92 / D21137
When retrying a batcher for tablet split error, we reuse the same request id as the failed batcher.
So we should make sure all ops having same request id should be retried with the same WriteRpc.

But currently, it's possible that an operation failed with TabletLookup and retry with a seperate WriteRpc,
this will lead to unexpected `Duplicate request` error.
The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.

With this diff, if TabletLookup errors happen for an operation is going to be retried (have a valid request_id),
we also set the error to operations having the same request_id. They will be retried in one shot with the next retry batcher.

Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100

Reviewers: sergei, rthallam, timur, jmeehan

Reviewed By: jmeehan

Subscribers: jmeehan, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D21229
jayant07-yb pushed a commit to jayant07-yb/yugabyte-db that referenced this issue Dec 7, 2022
…t id when retrying a batcher

Summary:
When retrying a batcher for tablet split error, we reuse the same request id as the failed batcher.
So we should make sure all ops having same request id should be retried with the same WriteRpc.

But currently, it's possible that an operation failed with TabletLookup and retry with a seperate WriteRpc,
this will lead to unexpected `Duplicate request` error.
The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.

With this diff, if TabletLookup errors happen for an operation is going to be retried (have a valid request_id),
we also set the error to operations having the same request_id. They will be retried in one shot with the next retry batcher.

Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100

Reviewers: sergei, rthallam, timur

Reviewed By: timur

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D21137
Huqicheng added a commit that referenced this issue Dec 22, 2022
… request id when retrying a batcher

Summary:
Original commit: 6bd085b / D21137
When retrying a batcher for tablet split error, we reuse the same request id as the failed batcher.
So we should make sure all ops having same request id should be retried with the same WriteRpc.

But currently, it's possible that an operation failed with TabletLookup and retry with a seperate WriteRpc,
this will lead to unexpected `Duplicate request` error.
The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.

With this diff, if TabletLookup errors happen for an operation is going to be retried (have a valid request_id),
we also set the error to operations having the same request_id. They will be retried in one shot with the next retry batcher.

Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100

Reviewers: sergei, rthallam, timur, jmeehan

Reviewed By: jmeehan

Subscribers: bogdan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D21815
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.16 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug kind/failing-test Tests and testing infra priority/high High Priority
Projects
None yet
Development

No branches or pull requests

5 participants