Clean up deleted doc ids from inverted indices periodically #1285

etiennedi · 2020-11-08T13:37:21Z

Background

This is split out from #1272, as that issue can provide value earlier, even without this feature.

Goals

Periodically, we clean up the deleted entries from the inverted index rows. Although this requires a full-db scan, it should still lead to much better performance because (a) we can do this in batches and read each row just once, even if we delete multiple doc IDs, and (b) the clean up runs asynchronously and does not affect queries, which stay fast.

Tech notes

All just meant as suggestions - if you find a better way, that's of course fine :)

To find out which doc ids in a shard are marked as deleted, you can iterate over all key/value pairs in the doc id bucket. The rows in there can be unmarshalled using this helper, which returns a docid.Lookup. Lookup itself has a Deleted bool field.
Once you have determined the ids to be deleted, you need to iterate over each row of each property bucket. The structure of a row is outlined in those doc comments. Make sure to also update the count (e.g. if you have a delete list of 20 doc ids and 7 of those were found in the old row, the count needs to be reduced by 7). Make sure to generate a new checksum for the row, so it can be cached in between multiple requests.
Most likely you'll be able to reuse this old method, which was meant to delete a single doc id from a row. You could refactor it to take in a list of docIDs as opposed to just one. The old method is quite long and could probably benefit from some refactoring/splitting up.
Finally, after the doc ids have been removed from the rows, you should also remove their entries from the doc id bucket.
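
To make the steps above more concrete, here is a minimal sketch in Go. It is not the actual implementation: the `lookup` and `row` types are hypothetical, simplified stand-ins (the real docid.Lookup and the real row layout, including the checksum, are defined in the code linked above), and the function names `collectDeleted` and `removeDocIDs` are made up for illustration. It only shows the idea of collecting the flagged ids once and then removing a whole list of ids from a row in a single pass while adjusting the count.

```go
package main

// lookup is a hypothetical stand-in for docid.Lookup. Only the Deleted
// bool field is confirmed by the notes above; DocID is an assumption.
type lookup struct {
	DocID   uint32
	Deleted bool
}

// row is a hypothetical, simplified in-memory view of one inverted index
// row. The real on-disk layout also carries a checksum, which would have
// to be regenerated after the row changes; that part is omitted here.
type row struct {
	Count  uint32
	DocIDs []uint32
}

// collectDeleted returns the doc ids of all entries flagged as deleted.
func collectDeleted(entries []lookup) []uint32 {
	var deleted []uint32
	for _, e := range entries {
		if e.Deleted {
			deleted = append(deleted, e.DocID)
		}
	}
	return deleted
}

// removeDocIDs returns a copy of r without any of the ids in deleted,
// together with the number of ids that were actually found and removed.
// The count is reduced by exactly that number, not by len(deleted).
func removeDocIDs(r row, deleted []uint32) (row, int) {
	toDelete := make(map[uint32]struct{}, len(deleted))
	for _, id := range deleted {
		toDelete[id] = struct{}{}
	}

	kept := make([]uint32, 0, len(r.DocIDs))
	for _, id := range r.DocIDs {
		if _, found := toDelete[id]; found {
			continue
		}
		kept = append(kept, id)
	}

	removed := len(r.DocIDs) - len(kept)
	return row{Count: r.Count - uint32(removed), DocIDs: kept}, removed
}
```

With a row holding doc ids 0 to 9 and a delete list of the 20 ids 3 to 22, `removeDocIDs` finds 7 of them in the row, so the count drops from 10 to 3. Building the set once means each row is read a single time regardless of how long the delete list is.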
Open question: what happens if the clean up fails, for example because there is so much to clean up that it can't be done in a reasonable time? If we were to try the exact same thing in the next iteration, we would probably time out again. One way to solve this might be a dynamic sleep in between clean ups, for example:
- no doc ids to clean up, sleep 60s
- found 100 doc ids to clean up, take the first 10, sleep 5s as there is work left to do (we only cleaned up 10 out of 100)
- found 90 doc ids to clean up, take first 10, sleep 5s, ...
- ... etc
- found 10 doc ids to clean up, take all 10, no more work left, sleep 60s
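
The dynamic sleep idea could look roughly like the following sketch. The batch size of 10 and the 60s/5s intervals are taken from the example above; the function name `planCycle` and the constant names are hypothetical, and a real implementation would probably make these values configurable.

```go
package main

import "time"

const (
	batchSize = 10               // max doc ids cleaned up per cycle
	idleSleep = 60 * time.Second // nothing left to do after this cycle
	busySleep = 5 * time.Second  // work is left, come back soon
)

// planCycle decides how many of the found doc ids to clean up in the
// current cycle and how long to sleep before the next one.
func planCycle(found int) (take int, sleep time.Duration) {
	if found > batchSize {
		// More work than fits in one batch: take a bounded chunk and
		// retry quickly, so a single cycle can never run too long.
		return batchSize, busySleep
	}
	// Either nothing was found or everything fits in this batch.
	return found, idleSleep
}
```

Because every cycle handles at most `batchSize` ids, a large backlog is worked off in small steps instead of repeatedly timing out on the same oversized job.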

etiennedi mentioned this issue Nov 8, 2020

Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

Closed

4 tasks

etiennedi added this to the Standalone milestone Nov 8, 2020

etiennedi added the blocked label Nov 8, 2020

etiennedi mentioned this issue Nov 11, 2020

Standalone immutable docids #1288

Merged

etiennedi added prio and removed blocked labels Nov 13, 2020

antas-marcin self-assigned this Nov 17, 2020

etiennedi mentioned this issue Nov 25, 2020

Bugfix 1308 #1309

Merged

antas-marcin added a commit that referenced this issue Nov 26, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

526dc9c

etiennedi mentioned this issue Dec 1, 2020

Standalone: Group inverted index operations in batch put #1259

Closed

antas-marcin added a commit that referenced this issue Dec 1, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

cb1d693

antas-marcin added a commit that referenced this issue Dec 2, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

3d97e1e

antas-marcin added a commit that referenced this issue Dec 2, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

542aeee

antas-marcin added a commit that referenced this issue Dec 2, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

4e12d38

antas-marcin added a commit that referenced this issue Dec 3, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

333ea67

antas-marcin added a commit that referenced this issue Dec 3, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

47b78f8

antas-marcin added a commit that referenced this issue Dec 3, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

6fec8e0

antas-marcin added a commit that referenced this issue Dec 3, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

3b09350

antas-marcin added a commit that referenced this issue Dec 3, 2020

gh-1285 Clean up deleted doc ids from inverted indices periodically

42f5d1c

etiennedi added a commit that referenced this issue Dec 4, 2020

Merge pull request #1315 from semi-technologies/1285-clean-up-deleted…

0549724

…-doc-ids-periodically gh-1285 Clean up deleted doc ids from inverted indices periodically

antas-marcin closed this as completed Dec 4, 2020
