[DocDB] Fix tablet guardrail mechanism and turn it on for universes with 16 GB or less per-node RAM #20667
Closed
1 task done
Labels
area/docdb
YugabyteDB core features
kind/new-feature
This is a request for a completely new feature
priority/medium
Medium priority issue
Comments
mdbridge added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels on Jan 17, 2024
yugabyte-ci added the kind/new-feature (This is a request for a completely new feature) and priority/medium (Medium priority issue) labels on Jan 17, 2024
mdbridge changed the title from "[DocDB] Fix tablet guardrail mechanism and turn it on for testing or maybe new universes" to "[DocDB] Fix tablet guardrail mechanism and turn it on for universes with 16 GB or less per-node RAM" on Jan 17, 2024
UPDATE: the tentative decision is to turn this on via #20664 for boxes with 16 GB or less of RAM.
mdbridge added a commit that referenced this issue on Feb 8, 2024
Summary: We built a guardrail mechanism to limit the number of tablets users can create – see #16177. However, it is currently broken and cannot be safely used because it requires setting a limit for the per-tablet overhead memory tracker, which would cause issues because:

* the memory tracker hierarchy under per-tablet-overhead is incorrect: it contains MemTable trackers, which are not per tablet, and fails to contain untracked memory, which is almost all per-tablet overhead
* the current code for checking for memory pressure double counts soft memory limit checks at every level of the memory hierarchy that has a limit
* we should not be enforcing a soft memory limit for any but the root memory tracker, as the point of the soft memory limit is to absorb server-wide spikes, not to control individual memory components

Here we work around this issue by temporarily making the gflag:

```
~/code/yugabyte-db/src/yb/tserver/tablet_memory_manager.cc:55:
DEFINE_NON_RUNTIME_int32(
    tablet_overhead_size_percentage, 0,
    "Percentage of total available memory to use for tablet-related overheads. Default is 0, "
    "meaning no limit. Must be between 0 and 100 inclusive.");
```

not set a limit on the per-tablet overhead memory tracker. This will make it possible to turn on this gflag without running into the previous issues. Other GitHub issues will address fixing these issues so that it will be possible to use the gflag in the future to set a limit for that memory tracker.

We still need information on how much memory is available for per-tablet overhead for the guardrails; we get that instead directly from the function that calculates what the limit would have been:

```
~/code/yugabyte-db/src/yb/tserver/tablet_memory_manager.cc:161:
int64 ComputeTabletOverheadLimit() {
  CHECK(0 <= FLAGS_tablet_overhead_size_percentage && FLAGS_tablet_overhead_size_percentage <= 100)
      << Format(
             "Flag tablet_overhead_size_percentage must be between 0 and 100 inclusive. Current "
             "value: $0", FLAGS_tablet_overhead_size_percentage);
  if (0 == FLAGS_tablet_overhead_size_percentage) return -1;
  return MemTracker::GetRootTracker()->limit() * FLAGS_tablet_overhead_size_percentage / 100;
}
```

Also added rate-limited logging for when we reject and VLOGging for when we don't.

There is a GitHub issue, #20668, for additionally: [DocDB] Add metric and alert for tablet guardrail limit reached

Jira: DB-9663

Test Plan: Start a one-node cluster and verify via the memory trackers page (http://127.0.0.1:[79]000/mem-trackers) that there is no limit for tablets_overhead regardless of whether or not we pass the limits flag:

```
bin/yb-ctl start
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10'
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10,tablet_replicas_per_gib_limit=1462,vmodule=*tablet_limits*=1'
```

Started a one-node cluster and verified:

* guardrail limiting is not on:

```
bin/yb-ctl start
```

* guardrail limiting is on but not biting:

```
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10,tablet_replicas_per_gib_limit=1462,vmodule=*tablet_limits*=1'

I0206 12:05:51.925747 14832 tablet_limits.cc:128] vlog1: Approved an additional 8 tablet replicas, which will increase the total running tablet replica count to 8, which is still below the safe system maximum of 2963
```

* guardrail limiting is on and biting:

```
bin/yb-ctl start --tserver_flags 'tablet_overhead_size_percentage=10' --master_flags 'tablet_overhead_size_percentage=10,tablet_replicas_per_gib_limit=2,vmodule=*tablet_limits*=1'

E0206 12:07:22.505185 15587 tablet_limits.cc:125] The requested number of tablet replicas (8) would cause the total running tablet replica count (8) to exceed the safe system maximum (4)
W0206 12:07:22.505299 15587 catalog_manager.cc:3929] Invalid argument (yb/master/tablet_limits.cc:126): The requested number of tablet replicas (8) would cause the total running tablet replica count (8) to exceed the safe system maximum (4)
W0206 12:07:22.505342 15587 catalog_manager.cc:1436] T 00000000000000000000000000000000 P 502cb84f04454e1f89e4203f6358e8af: Failed creating transaction status table, waiting: Invalid argument (yb/master/tablet_limits.cc:126): The requested number of tablet replicas (8) would cause the total running tablet replica count (8) to exceed the safe system maximum (4)
...
```

Update: Testing with this turned on using Jenkins found a few tests that needed adjustment. Turn the feature on by hard coding (the flags are not passed down):

```
~/code/yugabyte-db/src/yb/master/tablet_limits.cc:24:
DEFINE_RUNTIME_int32(tablet_replicas_per_gib_limit, 1400,
```

```
~/code/yugabyte-db/src/yb/tserver/tablet_memory_manager.cc:56:
DEFINE_NON_RUNTIME_int32(tablet_overhead_size_percentage, 10,
```

Then test:

```
ybd asan --cxx-test integration-tests_client-stress-test --gtest_filter ClientStressTest_LowMemory.TestMemoryThrottling --test-args '--tablet_replicas_per_gib_limit=1400 --tablet_overhead_size_percentage=10 --vmodule=*tablet_limits*=1' |& tee /tmp/generic.mdl.log
```

```
ybd --cxx-test master_master-test --test-args '--tablet_replicas_per_gib_limit=1400 --tablet_overhead_size_percentage=10 --vmodule=*tablet_limits*=1' |& tee /tmp/generic.mdl.log
```

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D32228
mdbridge added a commit that referenced this issue on Feb 9, 2024
Summary: Original commit: 3a9000d / D32228. Clean backport, no conflicts. The commit message is otherwise identical to the Feb 8 commit above.

Reviewers: hsunder, bogdan, slingam

Reviewed By: hsunder

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32318
mdbridge added a commit that referenced this issue on Mar 11, 2024
[#20667] DocDB: Make tablet guardrail mechanism wait for TServer registrations

Summary: When master leadership changes, the master initially has no information about any TServers. That information comes in over time as TServers heartbeat to the new master for the first time. However, the tablet guardrail mechanism as it currently stands assumes it knows all live TServers' information at all times. This means that if a CREATE TABLE request comes in shortly after the master leadership has changed, we might be missing information about some of the TServers, conclude that we can support fewer tablets than we actually can, and incorrectly reject the create request.

This diff removes these false positives (incorrectly rejecting) at the cost of introducing false negatives (failing to reject when we should) by simply not checking for too many tablets for the first 120 seconds (controlled by a gflag) after master leadership changes.

I made a fairly generic flag:

```
~/code/yugabyte-db/src/yb/master/catalog_manager.cc:638:
DEFINE_RUNTIME_uint32(wait_for_registration_after_leadership_change_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and relying on all live TServers having "
    "registered.");
TAG_FLAG(wait_for_registration_after_leadership_change_secs, advanced);
```

rather than tying it explicitly to the tablet guardrail mechanism because I thought we might want to use the same wait for future features. I did not use the pre-existing load balancer delay:

```
~/code/yugabyte-db/src/yb/master/catalog_manager_bg_tasks.cc:62:
DEFINE_RUNTIME_int32(load_balancer_initial_delay_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and enabling the load "
    "balancer.");
```

because I thought it might need to be longer than just the registration wait. Please advise if this makes sense or if I should rename the flag.

Jira: DB-9663

Test Plan:

```
ybd --cxx-test tablet_limits_test --test-args '' |& tee /tmp/generic.mdl.log
ybd --cxx-test tablet_limits_integration_test --test-args '' |& tee /tmp/generic.mdl.log
```

If I do not set the flag to reduce the wait, some of the integration tests do not pass. I did not create an extra test just to verify that going forward.

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D32683
mdbridge added a commit that referenced this issue on Mar 12, 2024
…or TServer registrations

Summary: Original commit: 58cf314 / D32683. Mostly clean backport; got a bit confused about a macro. Resolved:

```
<<<<<<< HEAD
#define RETURN_FALSE_IF(cond) \
  do { \
    if ((cond)) { \
      VLOG_WITH_FUNC(3) << "Returning false"; \
      return false; \
    } \
  } while (0)
=======
DEFINE_test_flag(bool, disable_set_catalog_version_table_in_perdb_mode, false,
    "Whether to disable setting the catalog version table in perdb mode.");

DEFINE_RUNTIME_uint32(initial_tserver_registration_duration_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and relying on all live TServers having "
    "registered.");
TAG_FLAG(initial_tserver_registration_duration_secs, advanced);
>>>>>>> d6b3f198b0 ([#20667] DocDB: Make tablet guardrail mechanism wait for TServer registrations)
```

to

```
DEFINE_RUNTIME_uint32(initial_tserver_registration_duration_secs,
    yb::master::kDelayAfterFailoverSecs,
    "Amount of time to wait between becoming master leader and relying on all live TServers having "
    "registered.");
TAG_FLAG(initial_tserver_registration_duration_secs, advanced);

#define RETURN_FALSE_IF(cond) \
  do { \
    if ((cond)) { \
      VLOG_WITH_FUNC(3) << "Returning false"; \
      return false; \
    } \
  } while (0)
```

The remainder of the commit message matches the Mar 11 commit above.

Jira: DB-9663

Reviewers: hsunder

Reviewed By: hsunder

Subscribers: bogdan, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33039
asrinivasanyb pushed a commit to asrinivasanyb/yugabyte-db that referenced this issue on Mar 18, 2024
…ver registrations

(Commit message identical to the Mar 11 commit above; Differential Revision: https://phorge.dev.yugabyte.com/D32683.)
Jira Link: DB-9663
Description
We built a guardrail mechanism to limit the number of tablets users can create -- see #16177.
However, it is currently broken and cannot be safely used because it requires setting a limit for the per-tablet overhead memory tracker, which would cause issues because:
* the memory tracker hierarchy under per-tablet-overhead is incorrect: it contains MemTable trackers, which are not per tablet, and fails to contain untracked memory, which is almost all per-tablet overhead
* the current code for checking for memory pressure double counts soft memory limit checks at every level of the memory hierarchy that has a limit
* we should not be enforcing a soft memory limit for any but the root memory tracker, as the point of the soft memory limit is to absorb server-wide spikes, not to control individual memory components
This task is to work around this issue by temporarily making the gflag tablet_overhead_size_percentage not set a limit on the per-tablet overhead memory tracker. This will make it possible to turn on the gflag without running into the previous issues. Other GitHub issues will address fixing these issues so that it will be possible to use the gflag in the future to set a limit for that memory tracker.
We still need information on how much memory is available for per-tablet overhead for the guardrails; get that instead directly from the function that calculates what the limit would have been (ComputeTabletOverheadLimit).
Once the issues have been worked around, turn on the guardrail mechanism based on memory available for per-tablet overhead:
Here, set the first flag (tablet_replicas_per_gib_limit) to 1024/0.7 ≈ 1462 for now (this is based on experiments).
Discussion is still pending on whether to enable this feature via an auto flag that turns on for new universes or simply via a preview flag for now. In either case, create a new boolean enable flag, as it is advised to avoid integer auto flags; a boolean also allows the customer to simply turn the feature on without having to worry about what the default limits should be.
Backport to 2.20. Unlike trunk, the backport possibly won't turn the feature on.
Other possibly useful code snippets:
Issue Type
kind/new-feature