
Consider using DynamoDB or S3 for original document storage #17

Closed
tvanhens opened this issue Dec 1, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@tvanhens
Owner

tvanhens commented Dec 1, 2022

When Tantivy indexes documents, it can optionally store the original text as well. The stored text is used to generate snippets, which are highlights of the matching text. Tantivy keeps these documents on the filesystem in a compressed format. This works fairly well on local storage, but on a networked filesystem such as EFS the whole store for a segment needs to be pulled in order to find a given document.

.store files tend to be considerably larger than the rest of the segment:

-rw-rw-r--  1 1001 1001  48K Nov 28 22:19 d908e24f73f04e83b85e679acf1d361b.7388140.del
-rw-rw-r--  1 1001 1001   99 Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fast
-rw-rw-r--  1 1001 1001 2.7M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fieldnorm
-rw-rw-r--  1 1001 1001  30M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.idx
-rw-rw-r--  1 1001 1001  17M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.pos
-rw-rw-r--  1 1001 1001  91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r--  1 1001 1001 5.8M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.term
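
To put a number on that, here is a quick sketch that works out how much of this one segment is taken up by the .store file. The sizes are copied from the listing above; `parse_size` and `store_share` are helper names made up for this example, and the sizes are assumed to use `ls -h` style powers of 1024.

```python
# Sizes copied from the `ls -lh` listing for segment d908e24f...
# Assumption: the K/M suffixes are powers of 1024, as `ls -h` prints them.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

SEGMENT_FILES = {
    ".del": "48K",
    ".fast": "99",
    ".fieldnorm": "2.7M",
    ".idx": "30M",
    ".pos": "17M",
    ".store": "91M",
    ".term": "5.8M",
}


def parse_size(text: str) -> float:
    """Convert an `ls -h` size such as '48K', '2.7M' or '99' to bytes."""
    suffix = text[-1]
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)  # no suffix means plain bytes


def store_share(files: dict[str, str]) -> float:
    """Fraction of the segment's total size taken up by the .store file."""
    total = sum(parse_size(size) for size in files.values())
    return parse_size(files[".store"]) / total


if __name__ == "__main__":
    print(f".store share of segment: {store_share(SEGMENT_FILES):.0%}")
```

For this segment the .store file comes out at roughly 62% of the total, so it is larger than all the other segment files combined.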

Looking at all the .store files in the test index, it is clear that pulling whole .store files to look up specific documents by id would be incredibly inefficient once network latency is taken into account:

-rw-rw-r--  1 1001 1001 149M Nov 28 22:14 03578055b76b45bd961cf3931a0282d9.store
-rw-rw-r--  1 1001 1001 176M Nov 28 23:29 103552f789714d07a2dff9f7143e001c.store
-rw-rw-r--  1 1001 1001 162M Nov 29 00:06 1bfb3f7ef08e40b4bd166919c0786769.store
-rw-rw-r--  1 1001 1001 9.9M Nov 29 00:38 87c0dd2f0577477ba233ee6a1c57c948.store
-rw-rw-r--  1 1001 1001  66M Nov 29 00:16 92f9621c4f3e41a6938ad65d2c37e969.store
-rw-rw-r--  1 1001 1001  51M Nov 29 00:32 b208026b097445c4a8eef6ea7dc6754e.store
-rw-rw-r--  1 1001 1001  91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r--  1 1001 1001  58M Nov 29 00:24 dd957117dbf3437ca3cdb552b38cc8c4.store
-rw-rw-r--  1 1001 1001  38M Nov 29 00:37 ea6445abef914faeb767686e1c054987.store

Instead of using Tantivy's built-in storage capability, we could use S3 or DynamoDB to store the original documents so that they can be retrieved efficiently by id. A beneficial side effect is that this should also be cheaper, since both DynamoDB and S3 have a lower monthly storage cost than EFS.
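
A minimal sketch of what that lookup path could look like, assuming the index only returns document ids and the original text lives in an external key-value store. Everything here is hypothetical: `DocumentStore`, `InMemoryStore` and `hydrate_hits` are names invented for illustration, and the in-memory class is only a stand-in for S3 or DynamoDB so the example runs without AWS.

```python
from typing import Iterable, Protocol


class DocumentStore(Protocol):
    """Hypothetical interface an S3- or DynamoDB-backed store could satisfy."""

    def put(self, doc_id: str, body: str) -> None: ...

    def get_many(self, doc_ids: Iterable[str]) -> dict[str, str]: ...


class InMemoryStore:
    """Stand-in for S3 or DynamoDB; keeps documents in a dict keyed by id."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def put(self, doc_id: str, body: str) -> None:
        self._docs[doc_id] = body

    def get_many(self, doc_ids: Iterable[str]) -> dict[str, str]:
        # Missing ids are skipped rather than raising.
        return {i: self._docs[i] for i in doc_ids if i in self._docs}


def hydrate_hits(hit_ids: list[str], store: DocumentStore) -> list[tuple[str, str]]:
    """Fetch original text only for the ids the search returned, in rank order."""
    bodies = store.get_many(hit_ids)
    return [(i, bodies[i]) for i in hit_ids if i in bodies]


if __name__ == "__main__":
    store = InMemoryStore()
    store.put("doc-1", "first document body")
    store.put("doc-2", "second document body")
    # Pretend the index returned these ids, best match first.
    for doc_id, body in hydrate_hits(["doc-2", "doc-1"], store):
        print(doc_id, body)
```

The point of the shape is that a query only ever fetches the handful of documents it matched, instead of pulling a whole .store file per segment over the network.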

@tvanhens tvanhens added the enhancement New feature or request label Dec 1, 2022
@tvanhens tvanhens mentioned this issue Dec 2, 2022
@tvanhens
Owner Author

tvanhens commented Dec 9, 2022

Implemented in #18
