Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LogMergePolicy knob del_docs_percentage_before_merge #1238

Merged
merged 10 commits into from
Dec 20, 2021

Conversation

shikhar
Copy link
Member

@shikhar shikhar commented Dec 16, 2021

Addresses #115

If this % of deleted documents is exceeded for a segment, it will always be proposed for a merge, even if it is the only one at that level.

TODOs

@shikhar
Copy link
Member Author

shikhar commented Dec 16, 2021

Need to add additional test cases.

I tried it out at 10% and in initial usage it is seeming effective for us:
image

vs previously:
image

@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2021

Codecov Report

Merging #1238 (8059a55) into main (b2da82f) will decrease coverage by 0.03%.
The diff coverage is 80.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1238      +/-   ##
==========================================
- Coverage   94.19%   94.16%   -0.04%     
==========================================
  Files         206      206              
  Lines       34891    34913      +22     
==========================================
+ Hits        32866    32876      +10     
- Misses       2025     2037      +12     
Impacted Files Coverage Δ
src/indexer/log_merge_policy.rs 96.39% <80.00%> (-1.67%) ⬇️
src/indexer/index_writer.rs 97.58% <100.00%> (ø)
src/fieldnorm/writer.rs 98.57% <0.00%> (-1.43%) ⬇️
src/indexer/segment_updater.rs 94.45% <0.00%> (-0.88%) ⬇️
src/directory/watch_event_router.rs 96.18% <0.00%> (-0.74%) ⬇️
src/store/index/mod.rs 97.83% <0.00%> (-0.55%) ⬇️
src/query/boolean_query/block_wand.rs 96.85% <0.00%> (-0.21%) ⬇️
src/indexer/segment_manager.rs 82.31% <0.00%> (+1.36%) ⬆️
src/directory/directory.rs 95.58% <0.00%> (+1.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2da82f...8059a55. Read the comment docs.

@@ -8,6 +8,7 @@ const DEFAULT_LEVEL_LOG_SIZE: f64 = 0.75;
const DEFAULT_MIN_LAYER_SIZE: u32 = 10_000;
const DEFAULT_MIN_NUM_SEGMENTS_IN_MERGE: usize = 8;
const DEFAULT_MAX_DOCS_BEFORE_MERGE: usize = 10_000_000;
const DEFAULT_MAX_DEL_DOCS_PCT: u8 = 100;
Copy link
Member Author

@shikhar shikhar Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene default for a similar knob is 33% https://github.com/apache/lucene/blob/c64e5fe84c4990968844193e3a62f4ebbba638ea/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L91

100% is effectively a no-op over the current policy. Lowering it to 33% causes some tests to fail, probably worth working through it though, if the approach makes sense!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should lower it, not sure if 33% is not too early though.

@shikhar shikhar changed the title max_del_docs_pct knob on LogMergePolicy LogMergePolicy knob del_docs_percentage_before_merge Dec 20, 2021
@fulmicoton
Copy link
Collaborator

Need to add additional test cases.

I tried it out at 10% and in initial usage it is seeming effective for us: image

vs previously: image

Seeing that kind of graph makes me happy :)

@fulmicoton
Copy link
Collaborator

@shikhar LGTM! Thank you for taking care of that very old issue!
I'll open an issue to eventually change the issue.

@fulmicoton
Copy link
Collaborator

@shikhar @PSeitz

I added a bunch of unit test, and switch back to from (percentage, u8) into (ratio, f32) for different "soft reasons".

I had to make a change anyway, because in the previous implem, the the actual first value where we observed the switch was pretty unexpected.
For 10,000 docs and a pct of 25%, the threshold was expressed as > 26,000 deleted docs and not > 25,000 deleted documents. It is possible to fix it using only integer operations using the cross product, but then the code would not be exactly as simple.

@fulmicoton fulmicoton merged commit e5e252c into quickwit-oss:main Dec 20, 2021
This was referenced Feb 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants