Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

Closed
4 tasks done
etiennedi opened this issue Nov 3, 2020 · 0 comments · Fixed by #1288
Closed
4 tasks done
Assignees
Labels
Technical Debt It's impossible to avoid technical debt, but one of the most important things is to track it!
Milestone

Comments

@etiennedi
Copy link
Member

etiennedi commented Nov 3, 2020

Current State

In the current implementation of the inverted index, documents are mutable. As a result an update might remove some inverted-index properties and add others. However, we have seen so far that this is problematic:

  • The HNSW vector index is immutable by design, so we are in a messy situation where we currently don't assign a new ID when the vector is unchanged, but do assign a new ID if the vector is not changed.
  • The above becomes even more problematic when specific properties have property-specific indices, such as with the Geo-Index, which in turn are immutable, so an update would also require a new doc ID for this
  • Inverted-Index Deletes are very slow at scale (1M+ entries in the db) as our scale tests with the arXiv dataset have shown if an update happens at this scale, deleting from an inverted index row is very slow. It would be much more efficient if we could just stay in "append-only" mode.

Goals

  • DocIDs (and therefore all entries in the inverted and vector indices) are immutable
  • On update, we assign a new one and mark the old one as deleted (very cheap)
    • this will probably have to be held in mem and be written to disk
    • possible shortcut could include to remove the in mem cache part, however, this might produce considerably worse performace as every single read would have to check the "what is currently deleted" store from disk
    • as part of the technical analysis, we have found out that we can simply use the existing docID->UUID lookup table which is now extended to also contain deletion status and deletion timestamp
  • At search time, we check the deletion list / tombstone if an id we encountered has already been deleted, if yes, it is ignored. This means a delete is consistent to the user as a deleted entry can no longer be served, once a tombstone is in.
  • complex decision logic about when to increase the doc id and when to mutate it, is removed. (This should lead to better code as well)

Out of scope

Process

This issue should probably be coordinated with the implementation of #1282, they will probably touch on very similar areas. So if they can happen in sequence as opposed to parallel, we will probably avoid plenty of merge conflicts.

@etiennedi etiennedi added this to the Standalone milestone Nov 3, 2020
@etiennedi etiennedi added the Technical Debt It's impossible to avoid technical debt, but one of the most important things is to track it! label Nov 3, 2020
@etiennedi etiennedi self-assigned this Nov 5, 2020
@etiennedi etiennedi added the prio label Nov 6, 2020
etiennedi added a commit that referenced this issue Nov 10, 2020
There was a case where - in a small graph - a tombstoned entrypoint
would lead to subsequent inserts not being connected to any other nodes,
as the only viable candidate (the ep) had a tombstone and was therefore
excluded from possible connections.
etiennedi added a commit that referenced this issue Nov 10, 2020
The merge implementation relied on the fact that - in the old
implemtation - an explicit vector update was the only thing that could
lead to a new doc id. Now that every update leads to a new doc id, the
old implementation would sometimes write into the vector index with a
nil-vector

Note that we currently still have the same issue on the batch-reference
which also relied on the fact that a ref update would never alter the
vector position. It is marked with a TODO and needs to be fixed before
the completion of this issue.
etiennedi added a commit that referenced this issue Nov 10, 2020
In the reference batcher we previously attempted concurrent reads/writes
which aren't a problem per se. However, we could get into a situation
where we would mark a doc id as deleted before it was ever imported.
That would be problematic as doc ids on additional indices (such as
HNSW) are themselves immutable. So an import to a deleted doc id would
lead to issues, as we couldn't distuingish this from an attempt to reuse
a doc id.

As a side-effect the fix for this also made the ref-batcher slightly
more efficient, as we remove unnecessary updates on the "same" object
which was touched multiple times in one batch and therefore received
multiple increases on the doc id
etiennedi added a commit that referenced this issue Nov 11, 2020
It was part of the inverted package, but I don't think it really belong
there. It was just one of the packages that were extracted early on, so
it was already "independent".

However, this issue introduced the DocID package, grouping all kinds of
docid related things. I think this is a great fit for the scanByDoc ID
function.

While moving it, I also split it up as it had become way too big.
@etiennedi etiennedi changed the title Make DocIDs immutable, set flag for delete and clean up periodically Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time Nov 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Technical Debt It's impossible to avoid technical debt, but one of the most important things is to track it!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant