Clean up deleted doc ids from inverted indices periodically #1285

Closed
1 task done
etiennedi opened this issue Nov 8, 2020 · 0 comments

etiennedi commented Nov 8, 2020

Background

This is split out from #1272, as that issue can provide value earlier, even without this feature.

Goals

  • Periodically clean up the deleted entries from the inverted index rows. Although this requires a full-db scan, it should still lead to much better performance because (a) we can do this in batches and read each row just once, even if we delete multiple doc IDs from it, and (b) the clean up is async and does not affect queries, which will stay fast.
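
To illustrate point (a), here is a minimal sketch of removing a whole batch of deleted doc IDs from one row in a single pass. The `row` type and `removeDocIDs` are hypothetical names made up for this example; they assume the row has already been decoded into memory and do not reflect the actual on-disk layout or any existing method in the code base.

```go
package main

// row is a hypothetical, already-decoded inverted index row. The real
// layout (including the checksum) is defined elsewhere and not modelled here.
type row struct {
	count  uint32   // number of doc ids stored in the row
	docIDs []uint64 // doc ids contained in the row
}

// removeDocIDs drops every id contained in deleted from r in one pass.
// It returns the updated row and the number of ids that were actually
// found and removed. It assumes r.count matches len(r.docIDs).
func removeDocIDs(r row, deleted map[uint64]struct{}) (row, int) {
	kept := make([]uint64, 0, len(r.docIDs))
	for _, id := range r.docIDs {
		if _, ok := deleted[id]; ok {
			continue // marked as deleted, do not copy it over
		}
		kept = append(kept, id)
	}

	// Only ids that were present in this row reduce the count, no matter
	// how long the delete list is.
	removed := len(r.docIDs) - len(kept)
	return row{count: r.count - uint32(removed), docIDs: kept}, removed
}
```

Using a set for the delete list keeps the cost at one read per row regardless of how many doc IDs are cleaned up in the batch. Recomputing the checksum would happen when the row is written back and is left out of this sketch.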

Tech notes

All just meant as suggestions - if you find a better way, that's of course fine :)

  • To find out which doc ids in a shard are marked as deleted, you can iterate over all key/value pairs in the doc id bucket. The rows in there can be unmarshalled using this helper, which returns a docid.Lookup. Lookup itself has a Deleted bool field.
  • Once you have determined the ids to be deleted, you need to iterate over each row of each property bucket. The structure of a row is outlined in those doc comments. Make sure to also update the count (e.g. if you have a delete list of 20 doc ids and 7 of those were found in the old row, the count needs to be reduced by 7). Also make sure to generate a new checksum for the row, so it can be cached in between multiple requests.
  • Most likely you'll be able to reuse this old method, which was meant to delete a single doc id from a row. You could refactor it to take in a list of docIDs as opposed to just one. The old method is quite long and could probably benefit from some refactoring/splitting up.
  • Finally, after the doc ids have been removed from the inverted index rows, you should also remove their entries from the doc id bucket.
  • Open question: what happens if the clean up fails, for example because there is so much to clean up that it can't be done in a reasonable time? If we were to try the exact same thing in the next iteration, we would probably time out again. One way to solve this might be a dynamic sleep in between clean ups, for example:
    • no doc ids to clean up, sleep 60s
    • found 100 doc ids to clean up, take the first 10, sleep 5s as there is work left to do (we only cleaned up 10 out of 100)
    • found 90 doc ids to clean up, take first 10, sleep 5s, ...
    • ... etc
    • found 10 doc ids to clean up, take all 10, no more work left, sleep 60s
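
The dynamic sleep idea from the last bullet could be sketched roughly as follows. The batch size and sleep durations are simply the numbers from the example above, not tuned values. `nextBatch`, `runCleanup`, `findDeleted` and `cleanUp` are hypothetical names introduced for this sketch, and nothing here is an existing API.

```go
package main

import "time"

// Values taken from the example in this issue, purely illustrative.
const (
	batchSize = 10
	busySleep = 5 * time.Second  // work is left, come back soon
	idleSleep = 60 * time.Second // nothing left, back off
)

// nextBatch takes at most batchSize ids from pending and returns them
// together with how long to sleep before the next clean up cycle.
func nextBatch(pending []uint64) ([]uint64, time.Duration) {
	if len(pending) > batchSize {
		return pending[:batchSize], busySleep
	}
	return pending, idleSleep
}

// runCleanup loops until stop is closed. findDeleted and cleanUp are
// hypothetical callbacks: the first returns all doc ids currently marked
// as deleted, the second removes the given ids from the indices.
func runCleanup(stop <-chan struct{}, findDeleted func() []uint64, cleanUp func([]uint64)) {
	for {
		batch, sleep := nextBatch(findDeleted())
		if len(batch) > 0 {
			cleanUp(batch)
		}

		select {
		case <-stop:
			return
		case <-time.After(sleep):
		}
	}
}
```

Because each cycle handles a bounded number of doc ids, a single run should not time out just because the backlog is large; the backlog only determines how soon the next cycle starts.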
etiennedi added this to the Standalone milestone Nov 8, 2020
etiennedi added the blocked label Nov 8, 2020
etiennedi added prio and removed blocked labels Nov 13, 2020
antas-marcin self-assigned this Nov 17, 2020
etiennedi mentioned this issue Nov 25, 2020
etiennedi added a commit that referenced this issue Dec 4, 2020
…-doc-ids-periodically

gh-1285 Clean up deleted doc ids from inverted indices periodically