[DocDB] flaky test: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection #15006

bmatican · 2022-11-15T15:52:31Z (edited by yugabyte-ci)

Jira Link: DB-4249

Description

https://detective-gcp.dev.yugabyte.com/stability/test?analyze_trends=true&branch=master&build_type=all&class=YbAdminSnapshotScheduleTest&fail_tag=all&name=CacheRefreshOnNewConnection&platform=linux

There are a number of failures of this form, so this may just be an overloaded server in the test:

ERROR: Already present: Duplicate request 8 from client 6307025e-51a7-4a50-8c09-de3a13eb8609 (min running 8)

sanketkedia · 2022-11-15T23:17:47Z

Based on a Slack conversation, this seems to have been caused by 5a09155.

…n retrying a batcher
Summary: When retrying a batcher for a tablet split error, we reuse the same request id as the failed batcher, so all ops that share a request id must be retried with the same WriteRpc. Currently, however, an operation can fail with a TabletLookup error and be retried with a separate WriteRpc, which leads to an unexpected `Duplicate request` error. The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.
With this diff, if a TabletLookup error happens for an operation that is going to be retried (i.e., one that has a valid request_id), we also set the error on the other operations that have the same request_id. They are then retried in one shot by the next retry batcher.
Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100
Reviewers: sergei, rthallam, timur
Reviewed By: timur
Subscribers: ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D21137
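To make the failure mode and the fix concrete, here is a toy Python model. It is not YugabyteDB's actual client code (which is C++), and every name in it is hypothetical: the `Op` and `ToyTablet` classes and the `propagate_lookup_errors`, `send`, and `run` functions are illustrative only. The server-side duplicate check is deliberately simplified to a per-tablet set of already-accepted `(client, request_id)` pairs rather than the real "min running" bookkeeping.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

LOOKUP_ERROR = "TabletLookup"


@dataclass
class Op:
    name: str
    tablet_id: str
    request_id: Optional[int]  # hypothetical: None means "not part of a retried batcher"
    error: Optional[str] = None


class ToyTablet:
    """Hypothetical stand-in for server-side duplicate detection: each tablet
    remembers the (client, request_id) pairs it has already accepted."""

    def __init__(self) -> None:
        self.seen: Set[Tuple[str, Optional[int]]] = set()

    def write(self, client: str, request_id: Optional[int]) -> str:
        key = (client, request_id)
        if key in self.seen:
            return f"Already present: Duplicate request {request_id} from client {client}"
        self.seen.add(key)
        return "OK"


def propagate_lookup_errors(ops: List[Op]) -> None:
    """Model of the fix: a TabletLookup error on an op with a valid request_id
    is copied to every other op that shares that request_id."""
    failed = {op.request_id for op in ops
              if op.error == LOOKUP_ERROR and op.request_id is not None}
    for op in ops:
        if op.error is None and op.request_id in failed:
            op.error = LOOKUP_ERROR


def send(ops: List[Op], client: str, tablets: Dict[str, ToyTablet]) -> List[str]:
    """Send one RPC per (tablet, request_id) group of error-free ops."""
    groups: Dict[Tuple[str, Optional[int]], List[Op]] = defaultdict(list)
    for op in ops:
        if op.error is None:
            groups[(op.tablet_id, op.request_id)].append(op)
    return [tablets[tablet].write(client, rid) for (tablet, rid) in groups]


def run(with_fix: bool) -> List[str]:
    client, tablets = "client-1", {"t1": ToyTablet()}
    # A retry batcher reusing request id 8; op "b" hits a TabletLookup error.
    ops = [Op("a", "t1", 8), Op("b", "t1", 8, error=LOOKUP_ERROR)]
    if with_fix:
        propagate_lookup_errors(ops)
    results = send(ops, client, tablets)
    # The next retry batcher resends every failed op, still with request id 8.
    retry = [Op(op.name, op.tablet_id, op.request_id) for op in ops if op.error]
    return results + send(retry, client, tablets)
```

In this model, `run(with_fix=False)` sends op `a` alone (accepted), then retries `b` alone with the same request id and gets the duplicate-request error. `run(with_fix=True)` holds `a` back as well, so both ops go out together in a single RPC that is accepted.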

Huqicheng · 2022-11-21T15:17:03Z

Looking at the Detective trend: https://detective.dev.yugabyte.com/stability/test?analyze_trends=true&branch=master&class=YbAdminSnapshotScheduleTest&name=CacheRefreshOnNewConnection

The test has passed in all builds since the fix landed.

This also needs to be backported to 2.16 and 2.17.0.

… request id when retrying a batcher
Summary: Original commit: 47dec92 / D21137
When retrying a batcher for a tablet split error, we reuse the same request id as the failed batcher, so all ops that share a request id must be retried with the same WriteRpc. Currently, however, an operation can fail with a TabletLookup error and be retried with a separate WriteRpc, which leads to an unexpected `Duplicate request` error. The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.
With this diff, if a TabletLookup error happens for an operation that is going to be retried (i.e., one that has a valid request_id), we also set the error on the other operations that have the same request_id. They are then retried in one shot by the next retry batcher.
Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100
Reviewers: sergei, rthallam, timur, jmeehan
Reviewed By: jmeehan
Subscribers: jmeehan, bogdan, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D21228

…me request id when retrying a batcher
Summary: Original commit: 47dec92 / D21137
When retrying a batcher for a tablet split error, we reuse the same request id as the failed batcher, so all ops that share a request id must be retried with the same WriteRpc. Currently, however, an operation can fail with a TabletLookup error and be retried with a separate WriteRpc, which leads to an unexpected `Duplicate request` error. The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.
With this diff, if a TabletLookup error happens for an operation that is going to be retried (i.e., one that has a valid request_id), we also set the error on the other operations that have the same request_id. They are then retried in one shot by the next retry batcher.
Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100
Reviewers: sergei, rthallam, timur, jmeehan
Reviewed By: jmeehan
Subscribers: jmeehan, ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D21229

…t id when retrying a batcher
Summary: When retrying a batcher for a tablet split error, we reuse the same request id as the failed batcher, so all ops that share a request id must be retried with the same WriteRpc. Currently, however, an operation can fail with a TabletLookup error and be retried with a separate WriteRpc, which leads to an unexpected `Duplicate request` error. The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.
With this diff, if a TabletLookup error happens for an operation that is going to be retried (i.e., one that has a valid request_id), we also set the error on the other operations that have the same request_id. They are then retried in one shot by the next retry batcher.
Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100
Reviewers: sergei, rthallam, timur
Reviewed By: timur
Subscribers: ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D21137

… request id when retrying a batcher
Summary: Original commit: 6bd085b / D21137
When retrying a batcher for a tablet split error, we reuse the same request id as the failed batcher, so all ops that share a request id must be retried with the same WriteRpc. Currently, however, an operation can fail with a TabletLookup error and be retried with a separate WriteRpc, which leads to an unexpected `Duplicate request` error. The test YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection is flaky due to this error.
With this diff, if a TabletLookup error happens for an operation that is going to be retried (i.e., one that has a valid request_id), we also set the error on the other operations that have the same request_id. They are then retried in one shot by the next retry batcher.
Test Plan: YbAdminSnapshotScheduleTest.CacheRefreshOnNewConnection -n 100
Reviewers: sergei, rthallam, timur, jmeehan
Reviewed By: jmeehan
Subscribers: bogdan, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D21815

bmatican added the area/docdb and status/awaiting-triage labels Nov 15, 2022

bmatican assigned lingamsandeep Nov 15, 2022

yugabyte-ci added the kind/bug and priority/medium labels Nov 15, 2022

bmatican added the kind/failing-test and priority/high labels and removed the priority/medium label Nov 15, 2022

yugabyte-ci assigned sanketkedia and unassigned lingamsandeep Nov 15, 2022

yugabyte-ci removed the status/awaiting-triage label Nov 15, 2022

sanketkedia assigned Huqicheng and unassigned sanketkedia Nov 15, 2022

Huqicheng added the 2.16 Backport Required label Nov 21, 2022

Huqicheng closed this as completed Nov 24, 2022

This was referenced Mar 9, 2023

[YSQL][LST] ERROR: Already present: Duplicate request 14 from client ... (min running 14) #14368

Closed

[YSQL] COPY fails with [ERROR: Already present: Duplicate request] #7251

Open
