Eventually-consistent reads from a tablet replica without any committed records will time out indefinitely #124

mbautin · 2018-03-23T22:36:43Z

Jira Link: DB-1875
Found when investigating #120. The issue is that a read needs to obtain a safe time value, and we currently don't allow that safe time to be the minimum possible hybrid time. Instead, we wait either to receive the propagated safe time from the leader or to learn that an operation with a certain hybrid time has been committed in Raft. A replica that has no committed records and cannot hear from the leader never gets either, so the read waits until it times out.
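To make the failure mode concrete, here is a minimal Python model of the safe-time rule described above. It is a sketch under stated assumptions, not YugabyteDB's implementation: hybrid times are plain integers with 0 standing in for the minimum hybrid time, the replica's safe time is assumed to be the larger of the leader-propagated safe time and the hybrid time of the last committed operation, and a bounded number of polls stands in for the read timeout. `ReplicaState`, `safe_time_for_read` and `wait_for_safe_time` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: hybrid times are modeled as ints; 0 plays the role of the
# minimum possible hybrid time.
MIN_HYBRID_TIME = 0


@dataclass
class ReplicaState:
    # Safe time most recently propagated from the leader (MIN if none yet).
    propagated_safe_time: int = MIN_HYBRID_TIME
    # Hybrid time of the latest operation known to be committed in Raft.
    last_committed_ht: int = MIN_HYBRID_TIME


def safe_time_for_read(state: ReplicaState) -> Optional[int]:
    """Return a usable safe time, or None if the replica must keep waiting.

    The minimum hybrid time is rejected, mirroring the restriction
    described in the issue.
    """
    candidate = max(state.propagated_safe_time, state.last_committed_ht)
    if candidate == MIN_HYBRID_TIME:
        return None
    return candidate


def wait_for_safe_time(
    state: ReplicaState,
    poll: Callable[[ReplicaState], None],
    max_polls: int,
) -> Optional[int]:
    """Poll for a safe time; return None after max_polls (a stand-in timeout).

    `poll` is a hypothetical callback that may update `state` with new
    information from the leader. With the leader gone, it never does.
    """
    for _ in range(max_polls):
        ht = safe_time_for_read(state)
        if ht is not None:
            return ht
        poll(state)
    return None
```

For a replica with no committed records and no reachable leader, `wait_for_safe_time(ReplicaState(), lambda s: None, 100)` returns `None`, which corresponds to the read timing out. If the leader propagates a safe time, for example via `lambda s: setattr(s, "propagated_safe_time", 42)`, the same call returns `42`.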

@lumedar

…wait until servers converge on committed OpId as opposed to just last received

Summary: LinkedListTest.TestLoadWhileOneServerDownAndVerify brings one tablet server down, loads some data, brings the server back up and waits for it to catch up, then shuts down the two other servers and verifies that it can still read the data from the first server. It used to fail once in a while, as pointed out by @lumedar in https://forum.yugabyte.com/t/why-does-the-unit-test-fail-to-pass/130.

The issue turned out to be in how we wait for the server that was originally down to catch up with the rest. We waited for the last received OpIds to be the same in all servers' Raft logs, but that is not sufficient: only the leader can tell a follower that an operation is committed, and if we shut down the two servers that originally loaded the data too early, the third server might end up with a zero committed OpId. Filed #124 to track that remaining issue separately, and fixed the test to also wait for every server's committed OpId to match its last received OpId. With RF=3, all six of these OpIds should be the same for the test to proceed.

Test Plan: ybd linked_list-test --gtest_filter LinkedListTest.TestLoadWhileOneServerDownAndVerify -n 100

Reviewers: bogdan, sergei, bharat, kannan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D4454
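The convergence condition introduced by this fix can be expressed as a small predicate. The following is a hypothetical Python model, not the actual test code: `OpId`, `ServerOpIds`, `converged_on_received_only` and `converged` are illustrative names, and an OpId is assumed to be a (term, index) pair.

```python
from typing import NamedTuple, Sequence


class OpId(NamedTuple):
    term: int
    index: int


class ServerOpIds(NamedTuple):
    # Last OpId received into this server's Raft log.
    last_received: OpId
    # Last OpId this server knows to be committed (learned from the leader).
    committed: OpId


def converged_on_received_only(servers: Sequence[ServerOpIds]) -> bool:
    """The old, insufficient check: only last received OpIds must match."""
    return len({s.last_received for s in servers}) == 1


def converged(servers: Sequence[ServerOpIds]) -> bool:
    """The stricter check: all 2 * RF OpIds (received and committed) match."""
    ids = set()
    for s in servers:
        ids.add(s.last_received)
        ids.add(s.committed)
    return len(ids) == 1
```

In the failing scenario, all three servers have received OpId `(1, 10)`, but the server that was down still has a committed OpId of `(0, 0)`. The old check passes, so the test could shut down the other two servers too early, while `converged` keeps waiting until the leader has told that server about the commit.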


rthallamko3 · 2023-03-23T15:38:53Z

@amitanandaiyer, it seems like bc3bc7d fixes this? Can we close this issue?

mbautin self-assigned this Mar 23, 2018

rkarthik007 added this to To Do in Fault Tolerance via automation Apr 3, 2018

mbautin pushed a commit that referenced this issue Jun 20, 2019

Fix ordering of versions in dropdown menu and add note for earlier versions (#124)

bc3bc7d

rthallamko3 added the area/docdb (YugabyteDB core features) label Mar 12, 2022

yugabyte-ci added kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels Jun 9, 2022

yugabyte-ci assigned rthallamko3 and unassigned mbautin Jul 27, 2022

yugabyte-ci assigned amitanandaiyer and unassigned rthallamko3 Sep 17, 2022

