Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compaction: Fix key estimation per sstable to produce efficient filters #15727

Conversation

raphaelsc
Copy link
Member

The estimation assumes that size of other components are irrelevant, when estimating the number of partitions for each output sstable. The sstables are split according to the data file size, therefore size of other files are irrelevant for the estimation.

With certain data models, like single-row partitions containing small values, the index could be even larger than data.
For example, assume index is as large as data, then the estimation would say that 2x more sstables will be generated, and as a result, each sstable are underestimated to have 2x less keys.

Fix it by only accounting size of data file.

Fixes #15726.

The estimation assumes that size of other components are irrelevant,
when estimating the number of partitions for each output sstable.
The sstables are split according to the data file size, therefore
size of other files are irrelevant for the estimation.

With certain data models, like single-row partitions containing small
values, the index could be even larger than data.
For example, assume index is as large as data, then the estimation
would say that 2x more sstables will be generated, and as a result,
each sstable are underestimated to have 2x less keys.

Fix it by only accounting size of data file.

Fixes scylladb#15726.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
@raphaelsc raphaelsc requested a review from nyh as a code owner October 16, 2023 19:57
@raphaelsc raphaelsc requested a review from denesb October 16, 2023 19:57
@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - Sanity Tests

Build Details:

@denesb denesb requested a review from bhalevy October 17, 2023 08:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inefficient filter due to misestimation of partitions per sstable, by accounting components other than Data
4 participants