Implement index directory traits based on S3 API #333

lulf · 2023-08-03T08:11:25Z

This is on the border of an RFC, but it is not that much design work.

Today, the bombastic and vexination indexes rely on the local file system, with a periodic 'sync' of the index to S3 by the indexer and from S3 by the API processes. This comes with a few issues, as described in trustification/trustification.dev#26:

The indexer process needs to build a zstd archive of all the files on disk. Unfortunately it cannot know which files are not 'active' in the index, so the archive grows bigger and bigger with older segments still attached. This process is also not real-time, in the sense that we have to configure an index update interval.
API processes need to download the zstd archive and unpack it on local disk. While unpacking, there is a short window in which a writer lock is held on the index and searches are stalled (though this is not a significant problem now). It also means changes to the index take some time to propagate to search.
Both of the above cause more S3 and network traffic than is strictly needed.

The tantivy library has a trait for the index Directory that comes with two out-of-the-box implementations:

RamDirectory - we use this in tests
MmapDirectory - we use this in 'production'

Instead of going via the local file system, the proposal is to implement an S3Directory which implements the same trait. This is probably similar to what quickwit (search server based on tantivy, AGPL) is already doing, but I have not checked. In any case the S3 API supports all the operations needed by the Directory trait.
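To make the mapping concrete, here is a minimal sketch of how directory operations could translate to object-store calls. It does not use tantivy's actual `Directory` trait or a real S3 client: the `ObjectStore` trait, the in-memory `MemoryStore`, and the key layout (`prefix/path`) are all assumptions made for illustration, and only a simplified subset of operations (read, write, exists, delete) is shown.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical minimal object-store interface, standing in for an S3 client.
trait ObjectStore {
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()>;
    fn delete(&mut self, key: &str) -> io::Result<()>;
    fn exists(&self, key: &str) -> io::Result<bool>;
}

/// In-memory store so the sketch runs without network access.
#[derive(Default)]
struct MemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for MemoryStore {
    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, key.to_string()))
    }

    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()> {
        // A whole-object put replaces any previous content in one step.
        self.objects.insert(key.to_string(), data.to_vec());
        Ok(())
    }

    fn delete(&mut self, key: &str) -> io::Result<()> {
        // Deleting a missing key is treated as success.
        self.objects.remove(key);
        Ok(())
    }

    fn exists(&self, key: &str) -> io::Result<bool> {
        Ok(self.objects.contains_key(key))
    }
}

/// Simplified directory: every index file becomes one object under `prefix`.
struct S3Directory<S: ObjectStore> {
    store: S,
    prefix: String,
}

impl<S: ObjectStore> S3Directory<S> {
    fn new(store: S, prefix: &str) -> Self {
        S3Directory { store, prefix: prefix.to_string() }
    }

    /// Map a relative index file path to an object key.
    fn key(&self, path: &str) -> String {
        format!("{}/{}", self.prefix, path)
    }

    fn atomic_read(&self, path: &str) -> io::Result<Vec<u8>> {
        self.store.get(&self.key(path))
    }

    fn atomic_write(&mut self, path: &str, data: &[u8]) -> io::Result<()> {
        let key = self.key(path);
        self.store.put(&key, data)
    }

    fn exists(&self, path: &str) -> io::Result<bool> {
        self.store.exists(&self.key(path))
    }

    fn delete(&mut self, path: &str) -> io::Result<()> {
        let key = self.key(path);
        self.store.delete(&key)
    }
}
```

With this shape, writing `meta.json` through the directory stores an object at `index/meta.json` (for prefix `index`), and any process pointed at the same bucket and prefix sees it on its next read, with no archive step in between.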

The advantages of this approach would be:

No need for local persistence
Minimal memory usage in indexer and s3 processes
Immediate index updates
Better debugging experience: use tooling directly against s3 bucket to inspect and make queries for experimentation

Tantivy is pretty good at caching which means that search queries should still be fast enough.


lulf · 2023-08-08T07:53:43Z

A basic POC is working in #345, but the following TODOs should be done before we try this:

Enable async searching/reading of index
Avoid loading entire file into memory on read
Make switching between old and new index configurable
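For the second item, one possible direction is to fetch only the byte range a reader asks for, since S3 GET requests accept a byte range. The sketch below is an assumption about how that could look, not the code in #345: `RangeStore`, `CountingStore` and `LazyFile` are hypothetical names, and the store is in-memory with a counter so the amount of data transferred can be observed.

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::io;
use std::ops::Range;

/// Hypothetical store interface supporting ranged reads.
trait RangeStore {
    /// Object size, without fetching the body.
    fn len(&self, key: &str) -> io::Result<usize>;
    /// Fetch only the bytes in `range`.
    fn get_range(&self, key: &str, range: Range<usize>) -> io::Result<Vec<u8>>;
}

/// In-memory store that counts how many bytes were handed out.
#[derive(Default)]
struct CountingStore {
    objects: HashMap<String, Vec<u8>>,
    bytes_fetched: Cell<usize>,
}

fn not_found(key: &str) -> io::Error {
    io::Error::new(io::ErrorKind::NotFound, key.to_string())
}

impl RangeStore for CountingStore {
    fn len(&self, key: &str) -> io::Result<usize> {
        self.objects.get(key).map(|d| d.len()).ok_or_else(|| not_found(key))
    }

    fn get_range(&self, key: &str, range: Range<usize>) -> io::Result<Vec<u8>> {
        let data = self.objects.get(key).ok_or_else(|| not_found(key))?;
        // `get` returns None for out-of-bounds or inverted ranges.
        let slice = data.get(range).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "range out of bounds")
        })?;
        self.bytes_fetched.set(self.bytes_fetched.get() + slice.len());
        Ok(slice.to_vec())
    }
}

/// File handle that knows the object's length but holds none of its bytes.
struct LazyFile<'a, S: RangeStore> {
    store: &'a S,
    key: String,
    len: usize,
}

impl<'a, S: RangeStore> LazyFile<'a, S> {
    /// Opening only looks up the size; no body bytes are transferred.
    fn open(store: &'a S, key: &str) -> io::Result<Self> {
        let len = store.len(key)?;
        Ok(LazyFile { store, key: key.to_string(), len })
    }

    /// Read just the requested slice of the file.
    fn read_bytes(&self, range: Range<usize>) -> io::Result<Vec<u8>> {
        if range.start > range.end || range.end > self.len {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "range out of bounds"));
        }
        self.store.get_range(&self.key, range)
    }
}
```

Reading 4 bytes from a 1000-byte object through `LazyFile` moves 4 bytes rather than 1000. A real implementation would likely also want to cache or coalesce ranges, which this sketch leaves out.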

Making it runtime configurable is important so that we can run it in staging and flip back to the existing index format if needed.
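A runtime switch of this kind could be as small as an enum parsed from a configuration value. The following is only a sketch: the `IndexMode` name, the accepted values `file` and `s3`, and defaulting to the existing file-based index are assumptions, not the option that was eventually added.

```rust
use std::str::FromStr;

/// Hypothetical selector for the index backing store.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexMode {
    /// Existing behaviour: local files plus periodic sync to/from S3.
    File,
    /// New behaviour: index directory backed directly by S3.
    S3,
}

impl FromStr for IndexMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "file" => Ok(IndexMode::File),
            "s3" => Ok(IndexMode::S3),
            other => Err(format!("unknown index mode: {}", other)),
        }
    }
}

/// Resolve the mode from an optional config value.
/// An unset value keeps the existing file-based index.
fn index_mode_from(value: Option<&str>) -> Result<IndexMode, String> {
    match value {
        None => Ok(IndexMode::File),
        Some(v) => v.parse(),
    }
}
```

Defaulting to the old format means flipping back in staging only requires removing or changing one setting, with no rebuild.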

* Add a runtime config option to enable s3 directory backed index.
* Implement tantivy directory trait based on s3 storage to bypass the local filesystem and syncing.

Issue #333

lulf self-assigned this Aug 8, 2023

lulf mentioned this issue Aug 14, 2023

feat: support using s3 directly as index backing store #392

Merged
