Add a new option to the cache to limit it by size. The size is estimated by sampling the `byte_size` column of the cache entries.

The size estimation first loads the `byte_size` of the N largest records ("the outliers"). We also grab the smallest size in that list to use as a cutoff for estimating the size of the remaining records ("the non-outliers").

The estimate of the size of the non-outliers is calculated by sampling a random portion of the records. To quickly sample the records we use the index on `key_hash` and `byte_size`.

We estimate what fraction of the records we'll need to sample by dividing the sample size by the estimated record count. This is our sample fraction.

We choose a random range of `key_hash` values the size of our sample fraction. We then sum the `byte_size` of the records in that range that do not exceed the cutoff and divide this value by the sample fraction.

This gives us our non-outlier estimate, which we add to the outlier total for our estimated size.
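The steps above can be sketched in a few lines. This is an illustrative in-memory model in Python, not the code in this PR. The function name `estimate_size`, the `HASH_SPACE` constant, and the use of `len(records)` in place of the estimated record count are all assumptions made for the example. It also assumes that only sizes strictly above the cutoff count toward the outlier total, so that records equal to the cutoff are not counted twice.

```python
import random

# Assumed size of the key_hash space; illustrative only.
HASH_SPACE = 2**64


def estimate_size(records, samples, rng=random):
    """Estimate the total byte_size of records, a list of (key_hash, byte_size)."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    if not records:
        return 0

    # The N largest records are the outliers; the smallest of them is the cutoff.
    largest = sorted((size for _, size in records), reverse=True)[:samples]
    cutoff = largest[-1]
    # Assumption: sizes equal to the cutoff are left to the sampled part.
    outlier_total = sum(size for size in largest if size > cutoff)

    # Sample fraction = sample size / record count (exact count used here).
    sample_fraction = min(1.0, samples / len(records))

    # Pick a random key_hash range covering that fraction of the hash space.
    width = int(sample_fraction * HASH_SPACE)
    start = rng.randint(0, HASH_SPACE - width)

    # Sum the non-outliers in the range. A real implementation would run this
    # as an indexed query rather than scanning a list.
    sampled = sum(
        size
        for key_hash, size in records
        if start <= key_hash < start + width and size <= cutoff
    )

    # Scale the sampled sum up to the whole table and add the outliers back.
    return outlier_total + round(sampled / sample_fraction)
```

When the sample size is at least the record count, the sample fraction is 1, the range covers the whole hash space, and the sketch returns the exact total. With smaller sample sizes the result varies from call to call because the range is chosen at random.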
On Hey, so far we see this giving an estimate that is within +/-5% of the actual total with 10,000 samples. It takes about 6ms (client side) to calculate the estimate. Different cache distributions and sizes may give different results.
Compared to just random sampling, adding the separate outlier check reduces the standard deviation of the guesses by about 25%, with almost exactly the same mean value. It will generally help in cases where there are a few very large records.
Because we resample every time we decide whether to expire records, we should be fairly resilient to the odd poor estimate.