If indexing occurs in-place, then results during indexing may be incorrect #64

Closed · wardle opened this issue Mar 10, 2024 · 0 comments
Labels: enhancement (New feature or request)

wardle (Owner) commented Mar 10, 2024

The usual and suggested approach is to build immutable data files which are then used in production systems as-is.

There has been some interest, for simpler deployments, in being able to update in place - that is, updating a file-based index while it is being used by another process.

Both LMDB and Apache Lucene would support in-place updates. However, the current import into the key-value store takes a two-phase approach, in which entities such as descriptions and relationships are written first, and then all indexes are deleted and rebuilt. It is therefore possible that a process reading the index while it is being written will read incorrect information.

LMDB serialises writes, and reads can run in parallel with a write transaction.
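
To make that concrete, here is a minimal sketch of that isolation using the Python `lmdb` bindings (illustrative only; database names, keys and paths are made up and do not reflect the project's own code): a reader opened before a write commits continues to see its original snapshot.

```python
# Illustrative sketch with the Python lmdb bindings: a read transaction opened
# before a write commits keeps seeing its original snapshot (MVCC), so reads
# never block behind the serialised writer. Names and paths are made up.
import lmdb
import threading

env = lmdb.open("/tmp/lmdb-demo", max_dbs=2, map_size=64 * 1024 * 1024)
idx = env.open_db(b"parent-index")

with env.begin(write=True, db=idx) as txn:
    txn.put(b"24700007", b"old-parents")

reader = env.begin(db=idx)                     # snapshot taken here

def rewrite():
    # writer runs in another thread while the reader transaction stays open
    with env.begin(write=True, db=idx) as txn:
        txn.delete(b"24700007")
        txn.put(b"24700007", b"new-parents")

t = threading.Thread(target=rewrite)
t.start()
t.join()

print(reader.get(b"24700007"))                 # b'old-parents': the old snapshot
reader.abort()
```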

Options:

  1. Ignore the issue and recommend that users should not update in place
  2. During import, continue the two-step write/index process, but perform the writes and the delete-and-re-index within a single write transaction. Readers will then always see 'correct' information
  3. Re-write import so that each write is indexed.

Option (1) works for me, but limits use for some users who want to update in place.
Option (2) could easily be implemented, either by using a parent transaction with nested read/write transactions, or by reworking the import to use a single transaction for both steps (sketched below).
Option (3) is more complex to get right, because we'd have to incrementally write the index entries based on effectiveTime and active parameters.
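
A minimal sketch of option (2), again using the Python `lmdb` bindings with hypothetical database names, value encodings and key layouts (not the project's own): the entity writes, the index deletion and the index rebuild all happen in one write transaction, so a concurrent reader sees either the old index or the fully rebuilt one, never a half-built state.

```python
# Hedged sketch of option (2): one write transaction covers both the entity
# writes and the delete-and-rebuild of the index. Names and layouts are made up.
import lmdb

env = lmdb.open("/tmp/lmdb-demo", max_dbs=4, map_size=256 * 1024 * 1024)
relationships = env.open_db(b"relationships")
parent_index = env.open_db(b"parent-index")

def import_and_reindex(rows):
    """rows: iterable of (rel_id, source, type_id, group, destination) byte strings."""
    with env.begin(write=True) as txn:
        # phase 1: write the entities
        for rel_id, source, type_id, group, destination in rows:
            value = b"|".join([source, type_id, group, destination])
            txn.put(rel_id, value, db=relationships)
        # phase 2: delete and rebuild the index within the *same* transaction
        txn.drop(parent_index, delete=False)      # empty the index, keep the DBI
        cur = txn.cursor(db=relationships)
        for _, value in cur:
            source, type_id, group, destination = value.split(b"|")
            txn.put(b"--".join([source, type_id, group, destination]), b"",
                    db=parent_index)
```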

The problem with option (3) can be illustrated with an example. First, an import brings in a relationship. For each relationship, we create a parent index source--type--group--destination and a child index destination--type--group--source. Using these indices, we can then rapidly determine the child or parent concepts for a given concept.
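
As an illustration, the two indices might be packed into fixed-width keys along these lines (a hypothetical encoding; the field order follows the description above), so that a prefix scan on a concept identifier finds its parents or children:

```python
# Hypothetical fixed-width key encoding for the two indices. SNOMED CT
# identifiers fit in 64-bit integers, so a prefix scan on the first 8 bytes
# finds all entries for a given concept.
import struct

def parent_index_key(source, type_id, group, destination):
    # source--type--group--destination
    return struct.pack(">QQQQ", source, type_id, group, destination)

def child_index_key(source, type_id, group, destination):
    # destination--type--group--source
    return struct.pack(">QQQQ", destination, type_id, group, source)

# e.g. all parent relationships of a concept are found by scanning keys that
# start with struct.pack(">Q", concept_id) in the parent index.
```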

Imagine in one release there is an active relationship, so we create the appropriate index entries. Now imagine we import two releases one after another; there is no guarantee that newer items will be imported after older items. We may therefore import an older item, which should be ignored if a newer version of that relationship has already been stored. Otherwise, if the relationship is active, we should create the required index entries, and if it is inactive, we need to delete any existing index entries for that relationship. The problem is that a different relationship id may define the same source--type--group--destination tuple, and the relationship index tracks only that tuple, not the relationship id. We may therefore remove an entry from the index because one relationship is inactive, even though the entry should remain because the same tuple is defined by a newer, active relationship. The fix for that is to include the relationship id in the tuple, at the end, but that a) increases index size, and b) potentially causes duplicates.
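
A small, self-contained sketch of the hazard, using an in-memory set in place of the real index and made-up identifiers:

```python
# The index key omits the relationship id, so processing an inactive
# relationship can remove an entry that a different, still-active relationship
# with the same tuple depends on. All identifiers here are made up.
index = set()

def apply_relationship(rel):
    key = (rel["source"], rel["type"], rel["group"], rel["destination"])
    if rel["active"]:
        index.add(key)
    else:
        index.discard(key)   # also removes the tuple written for relationship 1

apply_relationship({"id": 1, "active": True,
                    "source": 100, "type": 300, "group": 0, "destination": 200})
apply_relationship({"id": 2, "active": False,
                    "source": 100, "type": 300, "group": 0, "destination": 200})

print(index)   # empty, yet relationship 1 is still active, so the index is now wrong
```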

The simpler option is to delete all index entries and recreate them in a single transaction. However, this potentially means that an update-in-place database will grow in size, as there is more churn during index creation. A progressive approach to indexing, albeit with slightly more complex logic, would likely allow LMDB to avoid creating additional empty pages, as most index entries will not actually be deleted. This could be used for the refset index, which does include the UUID of the reference set, making it safe to index progressively (sketched below).
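
For the refset index, a progressive upsert keyed on the item's UUID might look like the following sketch (hypothetical key layout, database name and identifiers, using the Python `lmdb` bindings):

```python
# Because the UUID is part of the key, an update simply overwrites the existing
# entry, and deleting an inactive member cannot remove an entry belonging to a
# different, still-active member. Names and identifiers are made up.
import lmdb
import uuid

env = lmdb.open("/tmp/lmdb-demo", max_dbs=4, map_size=256 * 1024 * 1024)
refset_index = env.open_db(b"refset-index")

def index_refset_item(txn, refset_id, referenced_component_id, item_uuid, active):
    key = b"--".join([str(refset_id).encode(),
                      str(referenced_component_id).encode(),
                      item_uuid.bytes])
    if active:
        txn.put(key, b"", db=refset_index)     # upsert: a newer release overwrites
    else:
        txn.delete(key, db=refset_index)       # safe: scoped to this member only

with env.begin(write=True) as txn:
    index_refset_item(txn, 999001, 123456789, uuid.uuid4(), True)
```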

The answer is therefore to benchmark indexing and assess database file size growth during index updates. A simple method would be to build a test suite that imports synthetic data and indexes it, then re-imports the same data and re-indexes. Firstly, we'd expect data integrity for concurrent reads, so that, for example, at no point does a request for the relationships of a concept return no results. Secondly, we'd expect minimal database file growth. Concurrent reads may pin pages and so force some growth, and that will be unavoidable.
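
A rough sketch of such a test harness, using the Python `lmdb` bindings and the single-transaction re-index from option (2); against the current two-transaction approach, the data-integrity assertion would be expected to fail. All names, paths and key layouts are illustrative.

```python
# Re-index in the main thread while a second thread repeatedly reads, asserting
# the index never appears empty; compare the database file size before and after.
import os
import threading
import lmdb

env = lmdb.open("/tmp/lmdb-demo", max_dbs=4, map_size=256 * 1024 * 1024)
parent_index = env.open_db(b"parent-index")

def reindex(keys):
    # delete and rebuild within one write transaction (option 2)
    with env.begin(write=True) as txn:
        txn.drop(parent_index, delete=False)
        for key in keys:
            txn.put(key, b"", db=parent_index)

def reader_loop(stop, failures):
    while not stop.is_set():
        with env.begin(db=parent_index) as txn:
            if txn.stat(parent_index)["entries"] == 0:
                failures.append("index appeared empty during re-indexing")

keys = [b"src--type--group--dst-%d" % i for i in range(10_000)]
reindex(keys)                                  # initial import + index
size_before = os.path.getsize("/tmp/lmdb-demo/data.mdb")

stop, failures = threading.Event(), []
t = threading.Thread(target=reader_loop, args=(stop, failures))
t.start()
reindex(keys)                                  # re-import the same data
stop.set()
t.join()

assert not failures, failures
print("file growth:", os.path.getsize("/tmp/lmdb-demo/data.mdb") - size_before)
```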

wardle added a commit that referenced this issue Mar 11, 2024
As per #64, indexing currently uses two sequential write transactions which delete the
relationship and refset indices, and then rebuild those indices. That means that concurrent
threads or processes may read inconsistent data during an update-in-place operation.
This commit adds a test for this issue, and will result in a failing test if data are
inconsistent during an indexing operation.
wardle closed this as completed in d98e4f1 Mar 11, 2024
wardle added the enhancement label Apr 19, 2024