Expire the cache by size #139

Merged
merged 2 commits into main on Feb 2, 2024
Conversation

djmb
Collaborator

@djmb djmb commented Jan 24, 2024

Add a new option to the cache to limit it by size. The size is estimated by sampling the `byte_size` column of the cache entries.

The size estimation first loads the `byte_size` of the N largest records ("the outliers"). We also grab the smallest size in that list to use as a cutoff for estimating the size of the remaining records ("the non-outliers").

The estimate of the size of the non-outliers is calculated by sampling a random portion of the records. To quickly sample the records we use the index on `key_hash` and `byte_size`.
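The outlier step can be sketched in plain Ruby. This is an illustration only: an in-memory array stands in for the `byte_size` column, and the method name and its arguments are assumptions, not the PR's actual API.

```ruby
# Illustrative sketch: byte_sizes stands in for the byte_size column.
# Returns the total size of the N largest records and the cutoff
# (the smallest outlier) used to classify the remaining records.
def outliers_and_cutoff(byte_sizes, n)
  outliers = byte_sizes.max(n)  # byte_size of the N largest records
  cutoff   = outliers.min       # smallest outlier: threshold for the rest
  [outliers.sum, cutoff]
end

sizes = [10, 20, 30, 5000, 40, 8000, 25]
outlier_total, cutoff = outliers_and_cutoff(sizes, 2)
# outlier_total is 13000 (8000 + 5000), cutoff is 5000
```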

We estimate how many records we'll need to sample by dividing the sample size by the estimated record count. This is our sample fraction.

We choose a random range of `key_hash` values whose width matches our sample fraction. We then sum the `byte_size` of the records in that range that do not exceed the cutoff and divide this value by the sample fraction.

This gives us our non-outlier estimate, which we add to the outlier total for our estimated size.
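Putting the steps together, a minimal pure-Ruby sketch of the estimator. The in-memory `records` table, the `HASH_SPACE` constant, and the method name are assumptions for illustration; the real implementation queries the index on `key_hash` and `byte_size` rather than scanning an array.

```ruby
# Illustrative sketch only: records is an in-memory array of
# [key_hash, byte_size] pairs standing in for the cache table.
HASH_SPACE = 2**64 # assumed width of the key_hash space

def estimated_size(records, outlier_count:, sample_size:)
  sizes    = records.map { |_hash, size| size }
  outliers = sizes.max(outlier_count) # byte_size of the N largest records
  cutoff   = outliers.min             # smallest outlier

  # Sample fraction: sample size divided by the record count.
  sample_fraction = sample_size.fdiv(records.size)

  # Random key_hash range covering that fraction of the hash space.
  width = (HASH_SPACE * sample_fraction).round
  from  = rand([HASH_SPACE - width, 1].max)
  to    = from + width

  # Sum byte_size of in-range records strictly below the cutoff
  # (so outliers are not double counted), then scale up by the fraction.
  sampled = records.sum { |hash, size| (from...to).cover?(hash) && size < cutoff ? size : 0 }

  outliers.sum + (sampled / sample_fraction).round
end
```

When `sample_size` equals the record count, the range covers the whole keyspace and the estimate is exact; smaller fractions trade accuracy for speed, which is the trade-off behind the ±5% figure reported below.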

On Hey, so far we see this giving an estimate that is within ±5% of the actual total with 10,000 samples. It takes about 6ms (client side) to calculate the estimate. Different cache distributions and sizes may give different results.

Compared to just random sampling, adding the separate outlier check reduces the standard deviation of the guesses by about 25%, with almost exactly the same mean value. It will generally help in cases where there are a few very large records.

Because we resample every time we are deciding whether to expire records, we should be fairly resilient to the odd poor estimate.

@djmb djmb merged commit c666d29 into main Feb 2, 2024
34 checks passed
@djmb djmb deleted the cache-size-estimation branch February 2, 2024 11:10