Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

etiennedi · 2020-11-03T10:04:55Z

Current State

In the current implementation of the inverted index, documents are mutable. As a result an update might remove some inverted-index properties and add others. However, we have seen so far that this is problematic:

The HNSW vector index is immutable by design, so we are in a messy situation where we currently don't assign a new ID when the vector is unchanged, but do assign a new ID if the vector is not changed.
The above becomes even more problematic when specific properties have property-specific indices, such as with the Geo-Index, which in turn are immutable, so an update would also require a new doc ID for this
Inverted-Index Deletes are very slow at scale (1M+ entries in the db) as our scale tests with the arXiv dataset have shown if an update happens at this scale, deleting from an inverted index row is very slow. It would be much more efficient if we could just stay in "append-only" mode.

Goals

DocIDs (and therefore all entries in the inverted and vector indices) are immutable
On update, we assign a new one and mark the old one as deleted (very cheap)
- ~~this will probably have to be held in mem and be written to disk~~
- possible shortcut could include to remove the in mem cache part, however, this might produce considerably worse performace as every single read would have to check the "what is currently deleted" store from disk
- as part of the technical analysis, we have found out that we can simply use the existing docID->UUID lookup table which is now extended to also contain deletion status and deletion timestamp
At search time, we check the deletion list / tombstone if an id we encountered has already been deleted, if yes, it is ignored. This means a delete is consistent to the user as a deleted entry can no longer be served, once a tombstone is in.
complex decision logic about when to increase the doc id and when to mutate it, is removed. (This should lead to better code as well)

Out of scope

the periodic removal has been scoped out to Clean up deleted doc ids from inverted indices periodically #1285

Process

This issue should probably be coordinated with the implementation of #1282, they will probably touch on very similar areas. So if they can happen in sequence as opposed to parallel, we will probably avoid plenty of merge conflicts.

The text was updated successfully, but these errors were encountered:

There was a case where - in a small graph - a tombstoned entrypoint would lead to subsequent inserts not being connected to any other nodes, as the only viable candidate (the ep) had a tombstone and was therefore excluded from possible connections.

The merge implementation relied on the fact that - in the old implemtation - an explicit vector update was the only thing that could lead to a new doc id. Now that every update leads to a new doc id, the old implementation would sometimes write into the vector index with a nil-vector Note that we currently still have the same issue on the batch-reference which also relied on the fact that a ref update would never alter the vector position. It is marked with a TODO and needs to be fixed before the completion of this issue.

In the reference batcher we previously attempted concurrent reads/writes which aren't a problem per se. However, we could get into a situation where we would mark a doc id as deleted before it was ever imported. That would be problematic as doc ids on additional indices (such as HNSW) are themselves immutable. So an import to a deleted doc id would lead to issues, as we couldn't distuingish this from an attempt to reuse a doc id. As a side-effect the fix for this also made the ref-batcher slightly more efficient, as we remove unnecessary updates on the "same" object which was touched multiple times in one batch and therefore received multiple increases on the doc id

It was part of the inverted package, but I don't think it really belong there. It was just one of the packages that were extracted early on, so it was already "independent". However, this issue introduced the DocID package, grouping all kinds of docid related things. I think this is a great fit for the scanByDoc ID function. While moving it, I also split it up as it had become way too big.

etiennedi added this to the Standalone milestone Nov 3, 2020

etiennedi added the Technical Debt It's impossible to avoid technical debt, but one of the most important things is to track it! label Nov 3, 2020

etiennedi mentioned this issue Nov 3, 2020

Standalone Geo Spatial Index #1249

Closed

12 tasks

etiennedi self-assigned this Nov 5, 2020

This was referenced Nov 5, 2020

Standalone support updates on geo props #1281

Closed

Make docID uint64 (currently uint32) - to enable imports larger than 4M per class #1282

Closed

etiennedi added the prio label Nov 6, 2020

etiennedi mentioned this issue Nov 8, 2020

Clean up deleted doc ids from inverted indices periodically #1285

Closed

1 task

etiennedi added a commit that referenced this issue Nov 9, 2020

gh-1272 [ci skip] WIP write doc ids with deletion status

6150084

etiennedi added a commit that referenced this issue Nov 9, 2020

gh-1272 [ci skip] fix points To read issues

df76cc4

etiennedi added a commit that referenced this issue Nov 10, 2020

gh-1272 recognize empty graphs at search time in HNSW index

88bf4a0

etiennedi added a commit that referenced this issue Nov 10, 2020

gh-1272 no longer delete from inverted index

08bdd9c

etiennedi added a commit that referenced this issue Nov 10, 2020

gh-1272 perform additional index updates (vector, geo) in ref batcher

d5f2e78

etiennedi mentioned this issue Nov 11, 2020

Standalone immutable docids #1288

Merged

etiennedi changed the title ~~Make DocIDs immutable, set flag for delete and clean up periodically~~ Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time Nov 11, 2020

etiennedi added a commit that referenced this issue Nov 13, 2020

gh-1281 gh-1272 implement review feedback

7623bb5

antas-marcin closed this as completed in #1288 Nov 13, 2020

etiennedi mentioned this issue Dec 1, 2020

Standalone: Group inverted index operations in batch put #1259

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

etiennedi commented Nov 3, 2020 •

edited

Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272

Comments

etiennedi commented Nov 3, 2020 • edited

Current State

Goals

Out of scope

Process

etiennedi commented Nov 3, 2020 •

edited