Implement index directory traits based on S3 API #333

Open
lulf opened this issue Aug 3, 2023 · 1 comment
lulf commented Aug 3, 2023

This is on the border of an RFC, but it is not that much design work.

Today, the bombastic and vexination indexes rely on the local file system. The indexer periodically syncs the index to S3, and the API processes periodically sync it from S3. This comes with a few issues, as described in trustification/trustification.dev#26:

  • The indexer process needs to build a zstd archive of all the files on disk. Unfortunately it cannot know which files are no longer 'active' in the index, so the archive keeps growing with older segments still attached. This process is also not real-time, in the sense that we have to configure an index update interval.
  • API processes need to download the zstd archive and unpack it on local disk. While doing so they briefly hold a writer lock on the index, and searches stall during that window (though this is not a significant problem now). It also means changes to the index take some time to propagate to search.
  • Both of the above cause more S3 and network traffic than is strictly needed.

The tantivy library, however, defines a trait for the index directory and ships with two out-of-the-box implementations:

  • RamDirectory - we use this in tests
  • MmapDirectory - we use this in 'production'

Instead of going via the local file system, the proposal is to implement an S3Directory which implements the same trait. This is probably similar to what quickwit (search server based on tantivy, AGPL) is already doing, but I have not checked. In any case the S3 API supports all the operations needed by the Directory trait.
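To make the idea concrete, here is a rough sketch of how index file operations could map onto object-store calls. Everything in it is an assumption for illustration: `IndexDirectory` is a simplified, hypothetical trait and not tantivy's actual `Directory` trait, and `FakeBucket` is an in-memory stand-in for an S3 bucket, so no real S3 client is involved.

```rust
use std::collections::HashMap;
use std::io;
use std::ops::Range;

/// Hypothetical, simplified set of storage operations an index directory needs.
/// This is NOT tantivy's `Directory` trait, only an illustration of the shape.
trait IndexDirectory {
    fn atomic_write(&mut self, path: &str, data: &[u8]) -> io::Result<()>;
    fn read_range(&self, path: &str, range: Range<usize>) -> io::Result<Vec<u8>>;
    fn exists(&self, path: &str) -> bool;
    fn delete(&mut self, path: &str) -> io::Result<()>;
}

/// Hypothetical stand-in for an S3 bucket: object key -> object bytes.
#[derive(Default)]
struct FakeBucket {
    objects: HashMap<String, Vec<u8>>,
}

/// Directory that stores every index file as one object under a key prefix.
struct ObjectStoreDirectory {
    bucket: FakeBucket,
    prefix: String,
}

impl ObjectStoreDirectory {
    fn new(prefix: &str) -> Self {
        Self { bucket: FakeBucket::default(), prefix: prefix.to_string() }
    }

    /// Map an index-relative path to an object key.
    fn key(&self, path: &str) -> String {
        format!("{}/{}", self.prefix, path)
    }
}

fn not_found(path: &str) -> io::Error {
    io::Error::new(io::ErrorKind::NotFound, format!("no such object: {path}"))
}

impl IndexDirectory for ObjectStoreDirectory {
    // Plays the role of a single-object PUT: the whole object is replaced at once.
    fn atomic_write(&mut self, path: &str, data: &[u8]) -> io::Result<()> {
        let key = self.key(path);
        self.bucket.objects.insert(key, data.to_vec());
        Ok(())
    }

    // Plays the role of a ranged GET: only the requested bytes are returned.
    fn read_range(&self, path: &str, range: Range<usize>) -> io::Result<Vec<u8>> {
        let key = self.key(path);
        let object = self.bucket.objects.get(&key).ok_or_else(|| not_found(path))?;
        object
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "range out of bounds"))
    }

    fn exists(&self, path: &str) -> bool {
        self.bucket.objects.contains_key(&self.key(path))
    }

    fn delete(&mut self, path: &str) -> io::Result<()> {
        let key = self.key(path);
        self.bucket.objects.remove(&key).map(|_| ()).ok_or_else(|| not_found(path))
    }
}
```

A real implementation would replace `FakeBucket` with an S3 client and implement tantivy's own trait; the sketch only shows that each file operation has a natural per-object counterpart.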

The advantages of this approach would be:

  • No need for local persistence
  • Minimal memory usage in indexer and API processes
  • Immediate index updates
  • Better debugging experience: use tooling directly against the S3 bucket to inspect the index and run experimental queries

Tantivy is pretty good at caching, which means that search queries should still be fast enough.

lulf commented Aug 8, 2023

A basic POC is working in #345, but the following TODOs should be done before we try this:

  • Enable async searching/reading of index
  • Avoid loading entire file into memory on read
  • Make switching between old and new index configurable

Making it runtime configurable is important so that we can run it in staging and flip back to the existing index format if needed.
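As a rough illustration of what such a runtime switch might look like, the sketch below parses a setting into a mode and defaults to the existing file-based index. The `IndexMode` enum, its variant names, and the accepted strings are all hypothetical; the actual option name used in the project is not stated here.

```rust
use std::str::FromStr;

/// Hypothetical runtime setting that selects how the index is stored.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum IndexMode {
    /// Existing behaviour: local files plus periodic sync to/from S3.
    #[default]
    File,
    /// New behaviour: index directory backed directly by S3.
    S3,
}

impl FromStr for IndexMode {
    type Err = String;

    // Accepts the (assumed) values "file" and "s3", ignoring case and whitespace.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value.trim().to_ascii_lowercase().as_str() {
            "file" => Ok(IndexMode::File),
            "s3" => Ok(IndexMode::S3),
            other => Err(format!("unknown index mode: {other}")),
        }
    }
}

/// Resolve the mode from an optional raw setting (for example a flag or an
/// environment variable), falling back to the existing file-based index.
fn resolve_index_mode(raw: Option<&str>) -> Result<IndexMode, String> {
    match raw {
        Some(value) => value.parse(),
        None => Ok(IndexMode::default()),
    }
}
```

Defaulting to the file-based mode means a deployment only uses the S3-backed directory when it is explicitly enabled, and flipping back is a configuration change rather than a rebuild.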

lulf self-assigned this Aug 8, 2023
lulf pushed a commit that referenced this issue Aug 14, 2023
* Add a runtime config option to enable s3 directory backed index.
* Implement tantivy directory trait based on s3 storage to bypass the local filesystem and syncing.

Issue #333