Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-1868 properly propagate "clear links" across commit logs. #1871

Merged
merged 5 commits into from Mar 24, 2022

Conversation

etiennedi
Copy link
Member

The following situation caused an issue. Imagine there are two commit
logs in chronological order:

Log 1:

  • AddLink for Node/Level/Link 0/0/1
  • AddLink for Node/Level/Link 0/0/2

Log 2:

  • ClearLinks for Node/Level 0/0
  • AddLink for Node/Level/Link 0/0/3
  • AddLink for Node/Level/Link 0/0/4

It should be quite obvious that after reading all the logs back in the
correct order node 0 will have the connections [3,4] at level zero, as
there was a "clear" event in between.

However, prior to this fix we would have ended up with [1,2,3,4] because
the clear information got lost when condensing each log independently.
This fix makes sure that information that relates to changes from prior
commit logs is kept when condensing a commit log. The actual bug was
when reading (as part of condensing) as we used the clear only to clear
the links (which from the perspective of log 2 were already empty). This
led us to condensing log 2 to just "add 3, add 4", whereas the correctly
condensed log would have been "clear, add 3, add 4". This is the case
now.

To write the tests needed for this fix, it was required to add some
functional options to the commit logger to control the commit log size
thresholds. As of now these are not yet exposed to the user, but this
would also add value.

fixes #1868

Potential follow-up tasks:

  1. Expose Commitlog configuration (e.g. thresholds) to user, so they can
    finetune them to their needs
  2. Clean up the redundant and partly unused code, and the confusion
    between condensor/condensor2 and deserializer/deserializer2

The following situation caused an issue. Imagine there are two commit
logs in chronological order:

Log 1:
- AddLink for Node/Level/Link 0/0/1
- AddLink for Node/Level/Link 0/0/2

Log 2:
- ClearLinks for Node/Level 0/0
- AddLink for Node/Level/Link 0/0/3
- AddLink for Node/Level/Link 0/0/4

It should be quite obvious that after reading all the logs back in the
correct order node 0 will have the connections [3,4] at level zero, as
there was a "clear" event in between.

However, prior to this fix we would have ended up with [1,2,3,4] because
the clear information got lost when condensing each log independently.
This fix makes sure that information that relates to changes from prior
commit logs is kept when condensing a commit log. The actual bug was
when reading (as part of condensing) as we used the clear only to clear
the links (which from the perspective of log 2 were already empty). This
led us to condensing log 2 to just "add 3, add 4", whereas the correctly
condensed log would have been "clear, add 3, add 4". This is the case
now.

To write the tests needed for this fix, it was required to add some
functional options to the commit logger to control the commit log size
thresholds. As of now these are not yet exposed to the user, but this
would also add value.

fixes #1868

Potential follow-up tasks:

1. Expose Commitlog configuration (e.g. thresholds) to user, so they can
   finetune them to their needs
2. Clean up the redundant and partly unused code, and the confusion
   between condensor/condensor2 and deserializer/deserializer2
@codecov
Copy link

codecov bot commented Mar 23, 2022

Codecov Report

Merging #1871 (34562e0) into master (7860f17) will increase coverage by 0.09%.
The diff coverage is 47.84%.

@@            Coverage Diff             @@
##           master    #1871      +/-   ##
==========================================
+ Coverage   66.27%   66.37%   +0.09%     
==========================================
  Files         415      418       +3     
  Lines       31433    31740     +307     
==========================================
+ Hits        20833    21068     +235     
- Misses       8792     8846      +54     
- Partials     1808     1826      +18     
Flag Coverage Δ
integration 67.37% <54.94%> (+0.23%) ⬆️
unittests 66.37% <47.84%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
adapters/handlers/rest/configure_api.go 0.00% <0.00%> (ø)
...dapters/repos/db/lsmkv/segment_group_compaction.go 68.21% <ø> (-0.96%) ⬇️
...db/vector/hnsw/commit_logger_functional_options.go 0.00% <0.00%> (ø)
adapters/repos/db/vector/hnsw/condensor2.go 62.90% <0.00%> (ø)
adapters/repos/db/vector/hnsw/debug.go 29.33% <0.00%> (-6.16%) ⬇️
usecases/config/config_handler.go 0.00% <0.00%> (ø)
usecases/config/environment.go 0.00% <0.00%> (ø)
adapters/repos/db/shard_disk_use.go 25.53% <25.53%> (ø)
adapters/repos/db/vector/hnsw/commit_logger.go 58.33% <56.25%> (-0.58%) ⬇️
adapters/repos/db/vector/hnsw/deserializer2.go 65.32% <58.82%> (+2.61%) ⬆️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cfde2d2...34562e0. Read the comment docs.

@etiennedi
Copy link
Member Author

etiennedi commented Mar 23, 2022

Please don't merge yet - one of the tests seems to be flaky on CI, I will investigate

Update: This has been addressed and this PR should be ready to be reviewed/merged.

@antas-marcin antas-marcin merged commit ea967ef into master Mar 24, 2022
@antas-marcin antas-marcin deleted the bugfix-gh-1868 branch March 24, 2022 11:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: HNSW commit log condensor leads to index growing too large after re-read
2 participants