Set bloom_fpr automatically #3969

kostja · 2019-02-08T14:07:00Z

While mainstream designs along the top curve in Figure 1 set the same false positive rate to Bloom
filters across all levels of LSM-tree, Monkey, the current
state of the art, sets exponentially lower false positive rates to
Bloom filters at smaller levels [22]. This is shown to minimize
the sum of false positive rates across all filters and to thereby
minimize I/O for point lookups

See N. Dayan, M. Athanassoulis, and S. Idreos. Monkey: Optimal Navigable KeyValue Store. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 79–94, 2017

See also the dostoevsky paper referencing the paper above.

The optimal bloom_fpr is calculated using transformed optimization from Monkey: https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf article: the calculation based not on a run's level in LSM tree, but on it's size. See also tarantool#3969 NO_DOC=internal NO_TEST=internal

@TarantoolBot

@TarantoolBot document Title: optimized bloom_fpr calculation The optimal bloom_fpr is calculated using transformed optimization from Monkey: https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf article: the calculation based not on a run's level in LSM tree, but on it's size. vy_scheduler.c added functions: vy_task_find_range_bloom_fpr(lsm, range, new_run_stmts_count, W, total_stmts_count, skip_n) Calculate optimal bloom_fpr for given range. Works for runs created during compaction and as a subfunction during dump run creation. Parameters: * lsm — LSM-tree we are working with * range — concrete range we are working with * new_run_stmts_count — statements count in the newly created after dump/compaction run * W — total number of runs in the range after operation (dump/compaction) * total_stmts_count — total number of statements in the range after operation (dump/compaction) * skip_n — number of runs to skip during the range traverse as they are compacted into one new run Returns: optimally calculated bloom_fpr vy_task_find_lsm_bloom_fpr(lsm, new_run_stmts_count) Calculate optimal bloom_fpr for given lsm. Using vy_task_find_range_bloom_fpr inside. Parameters: * lsm — LSM-tree we are working with * new_run_stmts_count — statements count in the newly created after dump run Returns: optimally calculated bloom_fpr for dump See also tarantool#3969 NO_TEST=internal

kyukhin added feature A new functionality vinyl performance labels Feb 13, 2019

kyukhin added this to the 2.2.0 milestone Feb 13, 2019

kyukhin modified the milestones: 2.2.0, 2.3.0 Mar 27, 2019

kyukhin modified the milestones: 2.3.0, 2.4.0 Jul 22, 2019

kyukhin modified the milestones: 2.4.1, 2.5.1 Jan 21, 2020

kyukhin modified the milestones: 2.5.1, 2.6.1 Apr 10, 2020

kyukhin modified the milestones: 2.6.1, wishlist Oct 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set bloom_fpr automatically #3969

Set bloom_fpr automatically #3969

kostja commented Feb 8, 2019 •

edited by tsafin

Set bloom_fpr automatically #3969

Set bloom_fpr automatically #3969

Comments

kostja commented Feb 8, 2019 • edited by tsafin

kostja commented Feb 8, 2019 •

edited by tsafin