When a TServer restarts (for example, after being down for 1 minute), all tablets on that TServer try to catch up from their leaders. The leaders' UpdateConsensus RPCs, with the default 32MB maximum batch size, run concurrently for hundreds of tablets and can cause unnecessary memory spikes, timeouts, and wasteful retries. Setting the limit to a more conservative value such as 2MB should be good enough in practice, and will help in the near term until better auto-throttling is put in place.
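As a rough illustration of why the batch limit matters, the sketch below computes an upper bound on in-flight payload for a restarted TServer. It assumes, purely for illustration, that each catching-up tablet has exactly one full-size batch in flight at a time; the function name and the figure of 100 tablets are hypothetical, and real memory usage also includes parsing overhead.

```python
MB = 1024 * 1024

def worst_case_inflight_bytes(num_tablets: int, max_batch_bytes: int) -> int:
    """Upper bound on payload bytes, assuming one full batch in flight per tablet."""
    return num_tablets * max_batch_bytes

# Compare the 32MB default with the proposed 2MB limit for 100 tablets.
for limit_mb in (32, 2):
    total = worst_case_inflight_bytes(100, limit_mb * MB)
    print(f"{limit_mb}MB limit, 100 tablets: up to {total // MB}MB in flight")
```

Under these assumptions the bound drops from 3200MB to 200MB, which is the kind of spike the smaller limit is meant to avoid.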
Summary: When a TServer restarts (for example, after being down for 1 minute), all tablets on that TServer try to catch up from their leaders. The leaders' UpdateConsensus RPCs, with the default 32MB maximum batch size, run concurrently for hundreds of tablets and can cause unnecessary memory spikes, timeouts, and wasteful retries. Setting the limit to a more conservative value such as 2MB should be good enough in practice, and will help in the near term until better auto-throttling is put in place.
Test Plan:
- Did 4-5 ptest runs each for 2MB and 32MB. The difference in results is mostly within the standard deviation between same-build runs. CassandraBatchTimeseries_w96_r0 is slightly better (almost within the standard deviation) with the 2MB max batch size: the average is 410 kops/sec for 2MB versus 372 kops/sec for 32MB, with standard deviations of 33 and 37 kops/sec.
- With a 2MB max batch size and 48 follower tablet peers, the maximum tracked memory usage of live calls after one minute of downtime is 128MB for raw call data plus 286MB for parsed call parameters, 414MB in total.
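The memory figures in the last item can be checked with a few lines of arithmetic. This is only a sketch over the numbers reported above; the helper name is hypothetical and the per-peer average is a derived figure, not a measurement.

```python
def live_call_memory_mb(raw_mb: float, parsed_mb: float, peers: int) -> tuple[float, float]:
    """Return (total MB, average MB per follower peer) for tracked live calls."""
    total = raw_mb + parsed_mb
    return total, total / peers

# Reported values: 128MB raw call data, 286MB parsed params, 48 follower peers.
total_mb, per_peer_mb = live_call_memory_mb(128, 286, 48)
print(f"total: {total_mb}MB, average per peer: {per_peer_mb}MB")
```

The total matches the reported 414MB, which works out to about 8.6MB of tracked call memory per follower peer on average.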
Reviewers: sergei, mikhail, bogdan, kannan
Reviewed By: kannan
Subscribers: rao, kannan, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D7475