[DocDB] Fix tablet guardrail mechanism and turn it on for universes with 16 GB or less per-node RAM #20667
Closed
1 task done
Labels
area/docdb
YugabyteDB core features
kind/new-feature
This is a request for a completely new feature
priority/medium
Medium priority issue
Comments
mdbridge added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels on Jan 17, 2024
yugabyte-ci added the kind/new-feature (This is a request for a completely new feature) and priority/medium (Medium priority issue) labels on Jan 17, 2024
mdbridge changed the title from "[DocDB] Fix tablet guardrail mechanism and turn it on for testing or maybe new universes" to "[DocDB] Fix tablet guardrail mechanism and turn it on for universes with 16 GB or less per-node RAM" on Jan 17, 2024
UPDATE: the tentative decision is to turn this on via #20664 for boxes with 16 GB or less of RAM.
mdbridge added a commit that referenced this issue on Feb 8, 2024
Summary: We built a guardrail mechanism to limit the number of tablets users can create – see #16177. However, it is currently broken and cannot be safely used because it requires setting a limit for the per-tablet overhead memory tracker, which would cause issues because:

* the memory tracker hierarchy under per-tablet-overhead is incorrect: it contains MemTable trackers, which are not per tablet, and fails to contain untracked memory, which is almost all per-tablet overhead
* the current code for checking for memory pressure double counts soft memory limit checks at every level of the memory hierarchy that has a limit
* we should not be enforcing a soft memory limit for any but the root memory tracker, as the point of the soft memory limit is to absorb server-wide spikes, not to control individual memory components

Here we work around this issue by temporarily making the gflag:

```
~/code/yugabyte-db/src/yb/tserver/tablet_memory_manager.cc:55:
DEFINE_NON_RUNTIME_int32(
    tablet_overhead_size_percentage, 0,
    "Percentage of total available memory to use for tablet-related overheads. Default is 0, "
    "meaning no limit. Must be between 0 and 100 inclusive.");
```

not set a limit on the per-tablet overhead memory tracker. This will make it possible to turn on this gflag without running into the previous issues. Other GitHub issues will address fixing these issues so that it will be possible to use the gflag in the future to set a limit for that memory tracker.

We still need information on how much memory is available for per-tablet overhead for the guardrails; we get that instead directly from the function that calculates what the limit would have been:

```
~/code/yugabyte-db/src/yb/tserver/tablet_memory_manager.cc:161:
int64 ComputeTabletOverheadLimit() {
  CHECK(0 <= FLAGS_tablet_overhead_size_percentage && FLAGS_tablet_overhead_size_percentage <= 100)
      << Format(
             "Flag tablet_overhead_size_percentage must be between 0 and 100 inclusive. Current "
             "value: $0", FLAGS_tablet_overhead_size_percentage);
  if (0 == FLAGS_tablet_overhead_size_percentage) return -1;
  return MemTracker::GetRootTracker()->limit() * FLAGS_tablet_overhead_size_percentage / 100;
}
```

Also added rate-limited logging for when we reject and VLOGging for when we don't.

There is a GitHub issue, #20668, for additionally: [DocDB] Add metric and alert for tablet guardrail limit reached

Jira: DB-9663

Test Plan: Start a one-node cluster and verify via the memory trackers page (http://127.0.0.1:[79]000/mem-trackers) that there is no limit for tablets_overhead regardless of whether or not we pass the limits flag:

```
bin/yb-ctl start
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10'
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10,tablet_replicas_per_gib_limit=1462,vmodule=*tablet_limits*=1'
```

Started a one-node cluster and verified:

* guardrail limiting is not on:

```
bin/yb-ctl start
```

* guardrail limiting is on but not biting:

```
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10,tablet_replicas_per_gib_limit=1462,vmodule=*tablet_limits*=1'

I0206 12:05:51.925747 14832 tablet_limits.cc:128] vlog1: Approved an additional 8 tablet replicas, which will increase the total running tablet replica count to 8, which is still below the safe system maximum of 2963
```

* guardrail limiting is on and biting:

```
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10,tablet_replicas_per_gib_limit=2,vmodule=*tablet_limits*=1'

E0206 12:07:22.505185 15587 tablet_limits.cc:125] The requested number of tablet replicas (8) would cause the total running tablet replica count (8) to exceed the safe system maximum (4)
W0206 12:07:22.505299 15587 catalog_manager.cc:3929] Invalid argument (yb/master/tablet_limits.cc:126): The requested number of tablet replicas (8) would cause the total running tablet replica count (8) to exceed the safe system maximum (4)
W0206 12:07:22.505342 15587 catalog_manager.cc:1436] T 00000000000000000000000000000000 P 502cb84f04454e1f89e4203f6358e8af: Failed creating transaction status table, waiting: Invalid argument (yb/master/tablet_limits.cc:126): The requested number of tablet replicas (8) would cause the total running tablet replica count (8) to exceed the safe system maximum (4)
...
```

Update: Testing with this turned on using Jenkins found a few tests that needed adjustment. Turn the feature on by hard coding (the flags are not passed down):

```
~/code/yugabyte-db/src/yb/master/tablet_limits.cc:24:
DEFINE_RUNTIME_int32(tablet_replicas_per_gib_limit, 1400,
```

```
~/code/yugabyte-db/src/yb/tserver/tablet_memory_manager.cc:56:
DEFINE_NON_RUNTIME_int32(tablet_overhead_size_percentage, 10,
```

Then test:

```
ybd asan --cxx-test integration-tests_client-stress-test --gtest_filter ClientStressTest_LowMemory.TestMemoryThrottling --test-args '--tablet_replicas_per_gib_limit=1400 --tablet_overhead_size_percentage=10 --vmodule=*tablet_limits*=1' |& tee /tmp/generic.mdl.log
```

```
ybd --cxx-test master_master-test --test-args '--tablet_replicas_per_gib_limit=1400 --tablet_overhead_size_percentage=10 --vmodule=*tablet_limits*=1' |& tee /tmp/generic.mdl.log
```

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D32228
mdbridge added a commit that referenced this issue on Feb 9, 2024
Summary: Original commit: 3a9000d / D32228. Clean backport, no conflicts. The commit message is otherwise identical to the Feb 8 commit above.

Reviewers: hsunder, bogdan, slingam

Reviewed By: hsunder

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32318
mdbridge added a commit that referenced this issue on Mar 11, 2024
[#20667] DocDB: Make tablet guardrail mechanism wait for TServer registrations

Summary: When master leadership changes, the master initially has no information about any TServers. That information comes in over time as TServers heartbeat to the new master for the first time. However, the tablet guardrail mechanism as it currently stands assumes it knows all live TServers' information at all times. This means that if a CREATE TABLE request comes in shortly after the master leadership has changed, we might be missing information about some of the TServers, conclude that we can support fewer tablets than we actually can, and incorrectly reject the create request.

This diff removes these false positives (incorrectly rejecting) at the cost of introducing false negatives (failing to reject when we should) by simply not checking for too many tablets for the first 120 seconds (controlled by a gflag) after master leadership changes.

I made a fairly generic flag:

```
~/code/yugabyte-db/src/yb/master/catalog_manager.cc:638:
DEFINE_RUNTIME_uint32(wait_for_registration_after_leadership_change_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and relying on all live TServers having "
    "registered.");
TAG_FLAG(wait_for_registration_after_leadership_change_secs, advanced);
```

rather than tying it explicitly to the tablet guardrail mechanism because I thought we might want to use the same wait for future features. I did not use the pre-existing load balancer delay:

```
~/code/yugabyte-db/src/yb/master/catalog_manager_bg_tasks.cc:62:
DEFINE_RUNTIME_int32(load_balancer_initial_delay_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and enabling the load "
    "balancer.");
```

because I thought it might need to be longer than just the registration wait. Please advise if this makes sense or if I should rename the flag.

Jira: DB-9663

Test Plan:

```
ybd --cxx-test tablet_limits_test --test-args '' |& tee /tmp/generic.mdl.log
ybd --cxx-test tablet_limits_integration_test --test-args '' |& tee /tmp/generic.mdl.log
```

If I do not set the flag to reduce the wait, some of the integration tests do not pass. I did not create an extra test just to verify that going forward.

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D32683
mdbridge added a commit that referenced this issue on Mar 12, 2024
…or TServer registrations

Summary: Original commit: 58cf314 / D32683. Mostly clean backport; got a bit confused about a macro. Resolved:

```
<<<<<<< HEAD
#define RETURN_FALSE_IF(cond) \
  do { \
    if ((cond)) { \
      VLOG_WITH_FUNC(3) << "Returning false"; \
      return false; \
    } \
  } while (0)
=======
DEFINE_test_flag(bool, disable_set_catalog_version_table_in_perdb_mode, false,
    "Whether to disable setting the catalog version table in perdb mode.");

DEFINE_RUNTIME_uint32(initial_tserver_registration_duration_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and relying on all live TServers having "
    "registered.");
TAG_FLAG(initial_tserver_registration_duration_secs, advanced);
>>>>>>> d6b3f198b0 ([#20667] DocDB: Make tablet guardrail mechanism wait for TServer registrations)
```

to

```
DEFINE_RUNTIME_uint32(initial_tserver_registration_duration_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and relying on all live TServers having "
    "registered.");
TAG_FLAG(initial_tserver_registration_duration_secs, advanced);

#define RETURN_FALSE_IF(cond) \
  do { \
    if ((cond)) { \
      VLOG_WITH_FUNC(3) << "Returning false"; \
      return false; \
    } \
  } while (0)
```

The remainder of the commit message matches the Mar 11 commit above.

Jira: DB-9663

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33039
asrinivasanyb pushed a commit to asrinivasanyb/yugabyte-db that referenced this issue on Mar 18, 2024
…ver registrations

(Commit message identical to the Mar 11 commit above; Differential Revision: https://phorge.dev.yugabyte.com/D32683.)
Jira Link: DB-9663
Description
We built a guardrail mechanism to limit the number of tablets users can create -- see #16177.
However, it is currently broken and cannot be safely used because it requires setting a limit for the per-tablet overhead memory tracker, which would cause issues because:
* the memory tracker hierarchy under per-tablet-overhead is incorrect: it contains MemTable trackers, which are not per tablet, and fails to contain untracked memory, which is almost all per-tablet overhead
* the current code for checking for memory pressure double counts soft memory limit checks at every level of the memory hierarchy that has a limit
* we should not be enforcing a soft memory limit for any but the root memory tracker, as the point of the soft memory limit is to absorb server-wide spikes, not to control individual memory components
This task is to work around this issue by temporarily making the gflag tablet_overhead_size_percentage not set a limit on the per-tablet overhead memory tracker. This will make it possible to turn on the gflag without running into the previous issues. Other GitHub issues will address fixing these issues so that it will be possible to use the gflag in the future to set a limit for that memory tracker.
We still need information on how much memory is available for per-tablet overhead for the guardrails; get that instead directly from the function that calculates what the limit would have been (ComputeTabletOverheadLimit).
Once the issues have been worked around, turn on the guardrail mechanism based on memory available for per-tablet overhead:
Here, set the first flag (tablet_replicas_per_gib_limit) to 1024/0.7 ≈ 1462 for now (this is based on experiments).
Discussion is still pending on whether to enable this feature via an auto flag that turns on for new universes or simply via a preview flag for now. In either case, create a new boolean enable flag, as it is advised to avoid integer auto flags; a boolean also allows the customer to simply turn the feature on without having to worry about what the default limits should be.
Backport to 2.20. Unlike trunk, the backport possibly won't turn the feature on.
Other possibly useful code snippets:
Issue Type
kind/new-feature