Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time #1272
Closed · 4 tasks done
Labels: Technical Debt (It's impossible to avoid technical debt, but one of the most important things is to track it!)
Comments
etiennedi added the Technical Debt label on Nov 3, 2020
This was referenced on Nov 5, 2020
etiennedi added a commit that referenced this issue on Nov 9, 2020
etiennedi added a commit that referenced this issue on Nov 9, 2020
etiennedi added a commit that referenced this issue on Nov 10, 2020
There was a case where, in a small graph, a tombstoned entrypoint would lead to subsequent inserts not being connected to any other nodes, as the only viable candidate (the ep) had a tombstone and was therefore excluded from possible connections.
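The fix described above can be illustrated with a small sketch: if filtering out tombstoned candidates would leave a new vertex with no connections at all, fall back to the unfiltered candidate list. This is a hypothetical, heavily simplified model; the `node` type and `selectNeighbors` function are assumptions for illustration and not Weaviate's actual HNSW implementation.

```go
package main

import "fmt"

// node is a hypothetical, simplified HNSW vertex; the real types differ.
type node struct {
	id        int
	tombstone bool
}

// selectNeighbors filters tombstoned candidates, but falls back to the raw
// candidate list if the filter would leave the new vertex with no
// connections at all (e.g. a tombstoned entrypoint in a tiny graph).
func selectNeighbors(candidates []node) []node {
	alive := make([]node, 0, len(candidates))
	for _, c := range candidates {
		if !c.tombstone {
			alive = append(alive, c)
		}
	}
	if len(alive) == 0 {
		// without this fallback the insert would end up disconnected
		return candidates
	}
	return alive
}

func main() {
	ep := node{id: 0, tombstone: true} // tombstoned entrypoint, only candidate
	got := selectNeighbors([]node{ep})
	fmt.Println(len(got)) // 1: still connected via the tombstoned ep
}
```

The tombstoned neighbor is later cleaned up, but in the meantime the new vertex remains reachable.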
etiennedi added a commit that referenced this issue on Nov 10, 2020
The merge implementation relied on the fact that, in the old implementation, an explicit vector update was the only thing that could lead to a new doc id. Now that every update leads to a new doc id, the old implementation would sometimes write into the vector index with a nil vector. Note that we currently still have the same issue on the batch-reference path, which also relied on the fact that a ref update would never alter the vector position. It is marked with a TODO and needs to be fixed before the completion of this issue.
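A minimal sketch of the guard this commit describes: when a merge does not carry a vector, the previously stored vector must be carried over to the new doc id instead of writing nil into the index. The `mergeDoc` type and `applyMerge` function are illustrative assumptions, not Weaviate's actual merge code.

```go
package main

import "fmt"

// mergeDoc is a hypothetical merge payload; the real types differ.
type mergeDoc struct {
	docID  uint64
	vector []float32 // nil when the merge did not touch the vector
}

// applyMerge writes to the vector index only with a real vector. Since
// every update now produces a new doc id, a merge without a vector must
// re-use the previously stored one instead of writing nil into the index.
func applyMerge(doc mergeDoc, previous []float32, index map[uint64][]float32) {
	vec := doc.vector
	if vec == nil {
		vec = previous // fall back to the last known vector
	}
	index[doc.docID] = vec
}

func main() {
	index := map[uint64][]float32{}
	// merge that only changed a non-vector property: doc.vector is nil
	applyMerge(mergeDoc{docID: 7}, []float32{0.1, 0.2}, index)
	fmt.Println(len(index[7])) // 2: previous vector carried over, not nil
}
```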
etiennedi added a commit that referenced this issue on Nov 10, 2020
etiennedi added a commit that referenced this issue on Nov 10, 2020
etiennedi added a commit that referenced this issue on Nov 10, 2020
etiennedi added a commit that referenced this issue on Nov 10, 2020
In the reference batcher we previously attempted concurrent reads/writes, which aren't a problem per se. However, we could get into a situation where we would mark a doc id as deleted before it was ever imported. That would be problematic, as doc ids on additional indices (such as HNSW) are themselves immutable, so an import to a deleted doc id would lead to issues: we couldn't distinguish it from an attempt to reuse a doc id. As a side effect, the fix for this also made the ref-batcher slightly more efficient, as we remove unnecessary updates on the "same" object which was touched multiple times in one batch and therefore received multiple increases of the doc id.
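The deduplication side effect mentioned above can be sketched as follows: keep only the last doc id per object within one batch, so the batcher performs a single write per object and never flags an intermediate doc id as deleted before it was ever imported. The `refUpdate` type and `dedupe` function are hypothetical names for illustration.

```go
package main

import "fmt"

// refUpdate is a hypothetical batch entry; the real types differ.
type refUpdate struct {
	objectID string
	docID    uint64
}

// dedupe keeps only the last (most recent doc id) update per object, in
// first-seen order, collapsing n updates per object into one write.
func dedupe(updates []refUpdate) []refUpdate {
	latest := map[string]refUpdate{}
	order := []string{}
	for _, u := range updates {
		if _, ok := latest[u.objectID]; !ok {
			order = append(order, u.objectID)
		}
		latest[u.objectID] = u // later entries overwrite earlier ones
	}
	out := make([]refUpdate, 0, len(order))
	for _, id := range order {
		out = append(out, latest[id])
	}
	return out
}

func main() {
	in := []refUpdate{
		{"obj-a", 1}, {"obj-a", 2}, {"obj-a", 3}, // same object touched 3x
		{"obj-b", 4},
	}
	out := dedupe(in)
	fmt.Println(len(out), out[0].docID) // 2 3
}
```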
etiennedi added a commit that referenced this issue on Nov 11, 2020
It was part of the inverted package, but I don't think it really belongs there. It was just one of the packages that were extracted early on, so it was already "independent". However, this issue introduced the docid package, grouping all kinds of doc-id-related things. I think this is a great fit for the scanByDocID function. While moving it, I also split it up, as it had become way too big.
etiennedi changed the title from "Make DocIDs immutable, set flag for delete and clean up periodically" to "Make DocIDs immutable, set flag for delete, dont serve flagged entries at search time" on Nov 11, 2020
etiennedi added a commit that referenced this issue on Nov 13, 2020
Current State
In the current implementation of the inverted index, documents are mutable. As a result, an update might remove some inverted-index properties and add others. However, we have seen that this is problematic.
Goals
- this will probably have to be held in mem and be written to disk
- a possible shortcut could be to remove the in-mem cache part; however, this might produce considerably worse performance, as every single read would have to check the "what is currently deleted" store from disk
- docID->UUID lookup table, which is now extended to also contain deletion status and deletion timestamp

Out of scope
Process
This issue should probably be coordinated with the implementation of #1282, as they will likely touch very similar areas. If they can happen in sequence rather than in parallel, we will probably avoid plenty of merge conflicts.