Consider using DynamoDB or S3 for original document storage #17

tvanhens · 2022-12-01T16:05:55Z

When Tantivy indexes documents, it can optionally store the original text as well. The stored text is used to generate snippets, which highlight the matching text. Tantivy keeps these documents on the filesystem in a compressed format. This works fairly well on local storage, but on a networked filesystem such as EFS the whole store for a segment needs to be pulled in order to find a given document.

.store files tend to be considerably larger than the rest of the segment:

-rw-rw-r--  1 1001 1001  48K Nov 28 22:19 d908e24f73f04e83b85e679acf1d361b.7388140.del
-rw-rw-r--  1 1001 1001   99 Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fast
-rw-rw-r--  1 1001 1001 2.7M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fieldnorm
-rw-rw-r--  1 1001 1001  30M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.idx
-rw-rw-r--  1 1001 1001  17M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.pos
-rw-rw-r--  1 1001 1001  91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r--  1 1001 1001 5.8M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.term

Looking at all the .store files in the test index, it is clear that pulling them to look up specific documents by id would be incredibly inefficient once network latency is taken into account:

-rw-rw-r--  1 1001 1001 149M Nov 28 22:14 03578055b76b45bd961cf3931a0282d9.store
-rw-rw-r--  1 1001 1001 176M Nov 28 23:29 103552f789714d07a2dff9f7143e001c.store
-rw-rw-r--  1 1001 1001 162M Nov 29 00:06 1bfb3f7ef08e40b4bd166919c0786769.store
-rw-rw-r--  1 1001 1001 9.9M Nov 29 00:38 87c0dd2f0577477ba233ee6a1c57c948.store
-rw-rw-r--  1 1001 1001  66M Nov 29 00:16 92f9621c4f3e41a6938ad65d2c37e969.store
-rw-rw-r--  1 1001 1001  51M Nov 29 00:32 b208026b097445c4a8eef6ea7dc6754e.store
-rw-rw-r--  1 1001 1001  91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r--  1 1001 1001  58M Nov 29 00:24 dd957117dbf3437ca3cdb552b38cc8c4.store
-rw-rw-r--  1 1001 1001  38M Nov 29 00:37 ea6445abef914faeb767686e1c054987.store
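
To put rough numbers on the two listings above, here is a quick back-of-the-envelope calculation. The sizes are copied from the `ls` output, which rounds them, so the results are approximate.

```python
# Sizes in MiB as printed by `ls -lh` for segment d908e24f... above.
segment = {
    "del": 48 / 1024,       # 48K
    "fast": 99 / 1024**2,   # 99 bytes
    "fieldnorm": 2.7,
    "idx": 30,
    "pos": 17,
    "store": 91,
    "term": 5.8,
}
# Fraction of the segment's bytes that belong to the .store file.
store_share = segment["store"] / sum(segment.values())

# Sizes in MiB of every .store file in the test index.
store_files = [149, 176, 162, 9.9, 66, 51, 91, 58, 38]
total_store = sum(store_files)

print(f".store share of segment: {store_share:.0%}")
print(f"total .store size: {total_store:.1f} MiB")
```

For this segment the .store file is roughly 62% of the bytes, and the nine .store files together come to about 800 MiB.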

Instead of using Tantivy's built-in storage capability, we could store the original documents in S3 or DynamoDB so that they can be retrieved efficiently by id. A beneficial side effect is that this should also be cheaper, since both DynamoDB and S3 have a lower monthly storage cost than EFS.
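
As a rough illustration of the idea, here is a minimal sketch of the S3 variant: one object per document, keyed by document id. This is not necessarily how #18 implements it. The class name, the `docs/` prefix, and the JSON encoding are assumptions made for the example; `client` is expected to behave like `boto3.client("s3")`, and only its `put_object` and `get_object` calls are used.

```python
import json
from typing import Any


class S3DocumentStore:
    """Hypothetical store: each original document is one S3 object keyed by id."""

    def __init__(self, client: Any, bucket: str, prefix: str = "docs/") -> None:
        # `client` is assumed to be boto3.client("s3") or a compatible object.
        self._client = client
        self._bucket = bucket
        self._prefix = prefix

    def _key(self, doc_id: str) -> str:
        # Assumed key layout; any scheme that maps id -> key one-to-one works.
        return f"{self._prefix}{doc_id}.json"

    def put(self, doc_id: str, document: dict) -> None:
        # Write the document as a UTF-8 encoded JSON object.
        self._client.put_object(
            Bucket=self._bucket,
            Key=self._key(doc_id),
            Body=json.dumps(document).encode("utf-8"),
        )

    def get(self, doc_id: str) -> dict:
        # Fetch exactly one object; no other document is transferred.
        response = self._client.get_object(
            Bucket=self._bucket, Key=self._key(doc_id)
        )
        return json.loads(response["Body"].read())
```

With something like this in place, the index would only need to keep the document id as a stored field, and the text for snippets would be fetched per search hit. A DynamoDB variant would look much the same, with the caveat that DynamoDB items are limited to 400 KB, so large documents may fit S3 better.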

tvanhens · 2022-12-09T18:42:43Z

Implemented in #18

tvanhens added the enhancement label Dec 1, 2022

tvanhens mentioned this issue Dec 2, 2022

Document store #18

Merged

tvanhens closed this as completed Dec 12, 2022
