
throttler: don't allocate any resources unless it is actually enabled #8643

Merged
2 commits merged into vitessio:main on Aug 25, 2021

Conversation

@deepthi (Member) commented Aug 18, 2021

Description

Even when -enable_lag_throttler is not set, various caches are created and checked at intervals.
Specifically, nonLowPriorityAppRequestsThrottled is cleaned up every 100ms. This is wasteful.
All of this is now gated by the flag. In addition, I moved various other initializations (such as SelfChecks) behind the flag as well.

Related Issue(s)

Checklist

  • Should this PR be backported? NO
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: deepthi <deepthi@planetscale.com>
@shlomi-noach (Contributor) left a comment

This is fine.

The other thing is that we can increase the garbage-collect interval on most of these caches without affecting logic.

@deepthi (Member, Author) commented Aug 19, 2021

Out-of-band comment from @shlomi-noach:

It should be just fine to increase the cleanup interval on that cache/map. Change:

var mysqlMetricCache = cache.New(cache.NoExpiration, 10*time.Millisecond)

to:

var mysqlMetricCache = cache.New(cache.NoExpiration, 10*time.Second)

The logic will behave in exactly the same way, because items are put into the map with an explicit timeout, which is checked at Get() time. All the change does is relax the map's internal garbage collection; a short interval is useful for maps with many short-lived keys, but this map will only ever hold a handful of metrics.

I will make this change as well before merging the PR.

@deepthi (Member, Author) commented Aug 19, 2021

The other thing is that we can increase the garbage-collect interval on most of these caches without affecting logic.

We have the following:

		throttler.throttledApps = cache.New(cache.NoExpiration, 10*time.Second)
		throttler.mysqlClusterThresholds = cache.New(cache.NoExpiration, 0)
		throttler.aggregatedMetrics = cache.New(5*time.Second, 1*time.Second)
		throttler.recentApps = cache.New(recentAppsExpiration, time.Minute)
		throttler.metricsHealth = cache.New(cache.NoExpiration, 0)
		throttler.nonLowPriorityAppRequestsThrottled = cache.New(time.Second, 100*time.Millisecond)

What are reasonable values for the cleanup intervals that are currently 1 second or less? Should I change those to 10 seconds? There are two of them: nonLowPriorityAppRequestsThrottled and aggregatedMetrics.

Signed-off-by: deepthi <deepthi@planetscale.com>
@vmg (Collaborator) commented Aug 23, 2021

Are we really OK with increasing these timeouts/intervals in production systems? Vitess knows from its environment when it's running in a local deployment, so it might be wise to check that and only increase these values locally, where the goal is reducing CPU usage. In production, the increased accuracy is probably always worth it, and the extra CPU usage there is negligible.

@shlomi-noach shlomi-noach self-assigned this Aug 24, 2021
@shlomi-noach (Contributor) commented

There are two of them: nonLowPriorityAppRequestsThrottled and aggregatedMetrics.

Both are fine to change to 10 * time.Second

Are we really OK with the increase for the timeouts/intervals in production systems?

Yes, allow me to explain. It's in how this particular cache implementation works. This cache is a KV map, where a value is a combination of:

  • arbitrary data (interface{})
  • timestamp of insertion
  • expiry duration

Whenever you Get(...) an item from the cache

  • if it doesn't exist, fine, return empty
  • if it exists, the cache checks whether timestamp of insertion + expiry duration < now. If so, it's as if the item does not exist, return empty
  • otherwise return the item

Thus far the logic is sound and complete. There is no need for garbage-collecting in terms of correctness of data.

Of course, caches can blow up with data, and the garbage collection, which is the topic of this discussion, is how the cache is cleaned up: the garbage collector iterates all items, computes if timestamp of insertion + expiry duration < now and proactively deletes items where the condition holds.

This is useful in caches that have many items, where new keys are being introduced and are short-lived.

However, in the lag-throttler caches, the number of items is very limited: the lag computation cache size == number of servers in a shard (just a handful). Aggregated metrics == 2 (one for the lag metric, one for the primary metric). Cached check results size == number of app names that inquire for checks, also a handful (gh-ost, vreplication, ...). The user may define arbitrary check names, but the number is unlikely to be huge.

Therefore, it's the same keys again and again, and there is no fear of the cache bloating; hence garbage collection can be very relaxed. It could probably be disabled altogether without impact, but 10 * time.Second should go unnoticed (or else we have bigger problems).

@vmg (Collaborator) commented Aug 25, 2021

I'm convinced! 👍

@deepthi deepthi merged commit f0e9966 into vitessio:main Aug 25, 2021
@deepthi deepthi deleted the ds-lag-throttler branch August 25, 2021 21:07