
Make tantivy work in the browser with a statically hosted database and on-demand fetching #1067

Draft · wants to merge 9 commits into main
Conversation


@phiresky phiresky commented May 29, 2021

hi everyone!

I've managed to make tantivy work in WASM with partial loading of the index via HTTP Range requests. This means it's possible to host a tantivy search engine on a statically hosted website (or a distributed website on IPFS).

For example, a full-text search on an index of size 14 GByte takes 2 seconds and downloads only ~1.5 MByte of the index.

Here's a demo using the English Wikipedia as well as the OpenLibrary metadata:

https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.html

[screenshot of the demo]

Since tantivy heavily relies on memory mapping, this required some pretty deep changes both here and in tantivy-fst.

Of course, this whole thing is much less efficient than running tantivy on a backend, but needing only a static file host is much cheaper, and it is much more efficient than doing the same thing with SQLite. It's also a unique feature that no other database or search engine has (as far as I know).

Here are some details about how it works:

  • Instead of OwnedBytes being passed around, we pass around FileSlices and extract the actual content as late as possible. I've added a new trait FakeArr that is used in both tantivy-fst and tantivy for slicing operations etc., replacing the native ones.
  • I've replaced most instances of usize with type Ulen = u64, because why would I want to only be able to load databases < 4 GByte on a 32-bit system (like wasm)? :)
  • I've added a new implementation of Directory and FileHandle that hooks into JavaScript instead of the file system. This implementation uses two layers of caching and prefetching (see the sketch after this list):
    1. Files are always read in chunks of (configurable) 32 kByte. If a single byte is read, the whole 32 kByte chunk is fetched and cached indefinitely.
    2. Each file handle has three virtual "read heads", each with a current position. If a chunk is fetched that is sequential to the previous request, an increasing number of chunks is prefetched speculatively to reduce the number of future requests. This was very useful for SQLite, but I'm not sure how much it helps tantivy, since the DB structure is very different and most requests have predictable sizes anyway.
  • I've adjusted the tests to work with the new interfaces
  • The changes to tantivy-fst are in this companion PR: Make fst work without memory mapping with an arbitrary "fake array" quickwit-inc/fst#15
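To give a feel for the shape of this, here is a minimal sketch of a lazy-bytes trait plus a chunk-cached HTTP file. The names (FakeArr, ChunkedHttpFile, fetch_range) follow the description above, but the bodies are illustrative, not the PR's actual code:

```rust
use std::collections::HashMap;

/// Sketch of a "fake array": looks like a byte slice, but only
/// materializes bytes when they are actually read.
pub trait FakeArr {
    fn len(&self) -> u64;
    /// Read `buf.len()` bytes starting at `offset` into `buf`.
    fn read_into(&self, offset: u64, buf: &mut [u8]) -> std::io::Result<()>;
}

const CHUNK_SIZE: u64 = 32 * 1024; // 32 kByte chunks, as described above

/// A file backed by HTTP Range requests, cached chunk by chunk.
pub struct ChunkedHttpFile {
    url: String,
    len: u64,
    /// chunk index -> chunk contents, cached indefinitely
    cache: std::cell::RefCell<HashMap<u64, Vec<u8>>>,
}

impl ChunkedHttpFile {
    /// Return one 32 kByte chunk, fetching it on first access.
    fn chunk(&self, chunk_idx: u64) -> Vec<u8> {
        if let Some(c) = self.cache.borrow().get(&chunk_idx) {
            return c.clone();
        }
        let start = chunk_idx * CHUNK_SIZE;
        let end = (start + CHUNK_SIZE).min(self.len) - 1;
        let bytes = fetch_range(&self.url, start..=end);
        self.cache.borrow_mut().insert(chunk_idx, bytes.clone());
        bytes
    }
}

impl FakeArr for ChunkedHttpFile {
    fn len(&self) -> u64 {
        self.len
    }
    fn read_into(&self, offset: u64, buf: &mut [u8]) -> std::io::Result<()> {
        // Any read, however small, touches whole cached chunks.
        let mut written = 0;
        while written < buf.len() {
            let pos = offset + written as u64;
            let chunk = self.chunk(pos / CHUNK_SIZE);
            let in_chunk = (pos % CHUNK_SIZE) as usize;
            let n = (chunk.len() - in_chunk).min(buf.len() - written);
            buf[written..written + n].copy_from_slice(&chunk[in_chunk..in_chunk + n]);
            written += n;
        }
        Ok(())
    }
}

fn fetch_range(_url: &str, _range: std::ops::RangeInclusive<u64>) -> Vec<u8> {
    // In the real PR this crosses into JavaScript: a synchronous XHR with a
    // `Range: bytes=start-end` header. Left abstract here.
    unimplemented!("transport-specific")
}
```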

And now, here are the reasons this probably can't be merged directly:

  • I've changed instances of usize to Ulen somewhat blindly, because it's tons of changes, so there might be some cases where it should be kept as usize

  • No compile time flag to change Ulen back to 32 bit if someone really needs it. I doubt many people actually use tantivy on 32bit systems though.

  • Dynamic invocation that probably causes unacceptable performance loss when using tantivy normally. There's dynamic invocation between FileSlice and FileHandle, as well as between FakeArrSlice and FakeArr (the FileSlice and FakeArrSlice traits could probably be unified).

    I think this could be solved without losing any flexibility, either by adding more generics and using static invocation, or by adding a compile-time flag (see the sketch after this list).

  • I based it on tantivy 0.14, not master

  • I did not care about the .pos file so far, so that's still fetched as a whole if you need phrase queries
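For reference, the trade-off behind that compile-time-flag idea is the usual one between dynamic and static dispatch; a minimal illustration, reusing the hypothetical FakeArr trait from the sketch above:

```rust
// Dynamic dispatch: one compiled function, but every read goes through a vtable.
fn first_byte_dyn(arr: &dyn FakeArr) -> std::io::Result<u8> {
    let mut b = [0u8; 1];
    arr.read_into(0, &mut b)?;
    Ok(b[0])
}

// Static dispatch: monomorphized per implementation, so calls can be inlined
// and a memory-mapped backend keeps its current performance.
fn first_byte_generic<T: FakeArr>(arr: &T) -> std::io::Result<u8> {
    let mut b = [0u8; 1];
    arr.read_into(0, &mut b)?;
    Ok(b[0])
}
```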

Here's the code of the demo: https://github.com/phiresky/tantivy-wasm/

@phiresky
Author

phiresky commented May 29, 2021

Another note: The same thing could be used with an IndexedDB backend to e.g. fix the Matrix search in the browser: matrix-org/seshat#84

So far I've only looked at read-only, but writing should be possible similarly

@ngbrown

ngbrown commented May 29, 2021

It's not clear that IPFS supports byte range requests apart from the gateway-browser connection. So the actual files stored may have to be pre-chunked and stored as separate files. If the "read heads" dynamically vary the ranges they read, would it be simple to replace the range query with a numbered-file fetch?

@phiresky
Author

It's not clear that IPFS supports byte range requests apart from the gateway-browser connection.
may have to be pre-chunked

In my above-linked article I actually do split the database file into multiple chunks (the DB is chunked into 10 MB files, fetched in chunks of 1 kB), so my code here kinda already supports it. That's useful if e.g. the CDN (≅ IPFS gateway) can only fetch and cache whole files, not byte ranges.
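Replacing a range request with a numbered-file fetch over such a pre-chunked layout is just address arithmetic; a sketch, with hypothetical names and the 10 MB chunk size from the article:

```rust
const DB_CHUNK_BYTES: u64 = 10 * 1024 * 1024; // 10 MB files on the CDN

/// Translate a byte range in the logical file into (chunk URL, range within
/// chunk) fetches. Ranges straddling a boundary become several fetches.
fn plan_fetches(
    base_url: &str,
    mut range: std::ops::Range<u64>,
) -> Vec<(String, std::ops::Range<u64>)> {
    let mut fetches = Vec::new();
    while range.start < range.end {
        let chunk_idx = range.start / DB_CHUNK_BYTES;
        let chunk_start = chunk_idx * DB_CHUNK_BYTES;
        let local_start = range.start - chunk_start;
        let local_end = (range.end - chunk_start).min(DB_CHUNK_BYTES);
        fetches.push((format!("{base_url}/chunk_{chunk_idx:05}"), local_start..local_end));
        range.start = chunk_start + local_end;
    }
    fetches
}
```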

Someone is actually using the same method of fetching file chunks with IPFS and SQLite, and it works pretty well (can't link it sadly). The main limitation seems to be that most IPFS gateways throw 429 errors after ~20 requests. I also don't know whether IPFS supports fetching only parts of files from the network, but if not, pre-chunking the files definitely works.

@fulmicoton
Collaborator

(For obvious reasons, this is unmergeable, but I assume that was not the purpose of this PR?)

Good job finding out how to use the new FileSlice API without any guidance.
We (Quickwit) introduced the API precisely to be able to fetch information from a distant directory.

I am curious how you bridged the gap from FS syscalls to HTTP GET requests in your WASM demo?

@lidel

lidel commented Jun 2, 2021

This is an exciting PoC – demonstrates a potential way of solving ipfs/distributed-wikipedia-mirror#76 ❤️

I can provide an answer for:

It's not clear that IPFS supports byte range requests [..]
I also don't know if IPFS supports fetching only parts of files from the network [..]

IPFS supports range requests, either as HTTP range requests to a gateway, or by passing offset / length to the ipfs cat command. Data stored on IPFS is already chunked and represented as a DAG, and a range request will traverse the graph in a way that fetches the minimal number of chunks needed to fulfill the request.

Due to this, for most use cases, as long as IPFS is used, there is no need to do additional chunking at the filesystem level.
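For example, a range request through a gateway is just an ordinary HTTP request; a sketch using the reqwest crate (with the `blocking` feature) against a hypothetical gateway URL, though any gateway honouring Range semantics would do:

```rust
use reqwest::blocking::Client;

/// Fetch bytes `start..=end` of a file on IPFS through an HTTP gateway.
fn fetch_ipfs_range(cid: &str, start: u64, end: u64) -> reqwest::Result<Vec<u8>> {
    let url = format!("https://ipfs.io/ipfs/{cid}"); // any gateway works
    let resp = Client::new()
        .get(&url)
        .header("Range", format!("bytes={start}-{end}"))
        .send()?;
    // A gateway that supports ranges answers 206 Partial Content, fetching
    // only the blocks of the DAG needed to cover the requested range.
    Ok(resp.bytes()?.to_vec())
}
```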

main limitation seems to be that most IPFS gateways throw 429 errors after ~20 requests

AFAIK this is not IPFS itself, but an artificial limitation introduced by a specific gateway instance via Nginx or a similar reverse proxy (to mitigate ongoing abuse). Solutions: switch gateways, run your own, or use IPFS directly.

@phiresky
Author

phiresky commented Jun 10, 2021

For obvious reasons, this is unmergeable, but I assume that was not the purpose of this PR?

Actually, it would be great to get as much of this merged as possible :) I put this draft PR up to find out whether you're interested and how best to proceed; I could split it up into separate PRs, for example. None of the changes in this PR are wasm-specific, and they could be useful for other use cases as well.

For example, the usize -> Ulen=u64 change accounts for the largest number of changed lines and could fairly easily be integrated without affecting any other uses of tantivy, while allowing large indexes on 32-bit systems.

Good job finding out how to use the new FileSlice API without any guidance

The FileSlice API is good, but not that useful with how it is currently used on the main branch, I think - in many cases it is converted to OwnedBytes very early when it could be converted much later:

  • the fieldnorm file is immediately converted to OwnedBytes, i.e. fully loaded into memory. In my case this would mean fetching 60 MB via HTTP, so my PR makes it read only the needed parts instead
  • the FST is loaded fully into memory. This is a multi-gigabyte file in my case, so I changed it to use FileSlice until the actual bytes are read. This way I only have to fetch <100 kB instead of 10 GB.
  • the term info file is also loaded as a single OwnedBytes, although reads in the term info file are always for an exact, small byte range within the file (see the sketch after this list)
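The pattern being argued for looks roughly like this; a sketch against the FileSlice/OwnedBytes API as described here (exact method names and signatures vary between tantivy versions):

```rust
use std::ops::Range;
use tantivy::directory::{FileSlice, OwnedBytes};

// Eager: pulls the entire file over the network before answering anything.
// let all_bytes: OwnedBytes = file_slice.read_bytes()?;

/// Lazy: keep the FileSlice around and only materialize the exact range a
/// query needs, e.g. one term info entry.
fn read_term_info(file: &FileSlice, range: Range<usize>) -> std::io::Result<OwnedBytes> {
    // Only this sub-range is fetched (one or two 32 kByte chunks in the
    // HTTP-backed directory), not the whole file.
    file.slice(range).read_bytes()
}
```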

Note that these changes aren't just useful for wasm, but also in general when memory mapping is not available, as well as for other use cases. For example, Element (Matrix) uses tantivy to index messages, but the search index is stored encrypted on disk. Their current implementation therefore decrypts and loads the whole index into memory, since you can't dynamically decrypt chunks with memory mapping. With these changes, it could decrypt only the needed parts on demand when a query runs.

@fulmicoton I am curious how you bridged the gap from FS syscalls to HTTP GET requests in your WASM demo?

I'm not sure what exactly you mean - but that part is simple: I just compile tantivy to wasm and implement the Directory and FileHandle traits with functions that hook into TypeScript XMLHttpRequests. The hard part was making sure the read_bytes() method of FileSlice is only called when actually needed, not just on whole files.

Edit: I see there are already some comments by @fulmicoton about reducing reliance on memmaps here
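Concretely, such a bridge can be a single imported JS function; a minimal sketch with wasm-bindgen, where the JS module path and function name are hypothetical (and synchronous XHR is generally only viable inside a Web Worker):

```rust
use js_sys::Uint8Array;
use wasm_bindgen::prelude::*;

#[wasm_bindgen(module = "/js/sync_fetch.js")] // hypothetical helper module
extern "C" {
    /// Performs a synchronous XMLHttpRequest with a `Range: bytes=start-end`
    /// header and returns the response body.
    fn fetch_range_sync(url: &str, start: f64, end: f64) -> Uint8Array;
}

/// The HTTP-backed FileHandle ends up calling something like this.
fn read_range(url: &str, start: u64, end: u64) -> Vec<u8> {
    // JS numbers are doubles, hence f64 at the boundary.
    fetch_range_sync(url, start as f64, end as f64).to_vec()
}
```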

@fulmicoton
Collaborator

Apart from fieldnorms, we solved all of the problems you mentioned above in a more efficient way here:
https://github.com/quickwit-inc/tantivy/tree/quickwit/src/termdict/sstable_termdict
FSTs are by nature (at least with this layout) not suited for this.

That work will be cleaned up and added to tantivy soonish.

@fulmicoton
Collaborator

I'm not sure what exactly you mean - but that part is simple: I just compile tantivy to wasm and implement the Directory and FileHandle traits with functions that hook into TypeScript XMLHttpRequests. The hard part was making sure the read_bytes() method of FileSlice is only called when actually needed, not just on whole files.

Did you find a way to parallelize the requests?

@phiresky
Author

Did you find a way to parallelize the requests?

Most of the requests are sequential and synchronous; I didn't change anything there, except that it optimistically prefetches more data than needed using heuristics. I only parallelized fetching the actual document contents, since that was a pretty easy change - now it's a single HTTP request with multiple byte ranges to get all the matched documents, instead of one read per document. I added the FileSlice method fn read_bytes_slice_multiple(&self, ranges: &[Range<Ulen>]) for that (a sketch of the idea follows). The same could be done to fetch the terms and maybe more, but I'm not sure how that would transfer to the memory-mapped implementation without using threads or async.
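The multi-range fetch boils down to one request with a comma-separated Range header; a sketch of that idea (parsing the multipart/byteranges response is elided):

```rust
use std::ops::Range;

/// Build a single `Range` header covering all document ranges, e.g.
/// "bytes=100-199,4096-4351", so all matched docs arrive in one round trip.
fn multi_range_header(ranges: &[Range<u64>]) -> String {
    let parts: Vec<String> = ranges
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end - 1)) // HTTP ranges are inclusive
        .collect();
    format!("bytes={}", parts.join(","))
}
```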

Apart from fieldnorms, we solved all of the problems you mentioned above in a more efficient way here.

That sounds great! Is there documentation about that somewhere?

@fulmicoton
Collaborator

@phiresky no, but you can have a look at the dictionary we use here:
tantivy = { git = "https://github.com/quickwit-inc/tantivy", rev = "6d3e9087c" }

The idea is rather simple. FSTs inherently suck for that use case, because the locality of the data you need to read to look up one term is very bad.

Since we know precisely the characteristics of the storage we are dealing with, we just divide our dictionary into a tree of blocks.
The current implementation uses an SSTable inside each block because we want faster iteration over the terms, but it could be FSTs.
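The locality argument can be made concrete with a sketch: a small top-level index (fetched and cached once) maps key ranges to blocks, so one term lookup touches exactly one contiguous byte range instead of hopping across a multi-gigabyte FST. The names here are illustrative, not quickwit's actual types:

```rust
/// Top-level index: the first key of each block plus its byte range in the
/// file. Small enough to fetch and cache up front.
struct BlockIndex {
    entries: Vec<(Vec<u8>, std::ops::Range<u64>)>, // (first_key, byte range)
}

impl BlockIndex {
    /// One binary search over the cached index, then exactly one block fetch:
    /// the read pattern is a single contiguous range, ideal for HTTP storage.
    fn block_for(&self, term: &[u8]) -> std::ops::Range<u64> {
        let idx = match self
            .entries
            .binary_search_by(|(first_key, _)| first_key.as_slice().cmp(term))
        {
            Ok(i) => i,
            Err(0) => 0, // term sorts before the first block; probe block 0
            Err(i) => i - 1,
        };
        self.entries[idx].1.clone()
    }
}
```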

@ppodolsky
Contributor

For future researchers: I've embedded Tantivy in WASM and IPFS too, together with its latest optimizations like CachingDirectory, HotCache, etc.

WASM module: https://github.com/izihawa/summa/tree/master/summa-wasm

Web interface: https://github.com/izihawa/summa/tree/master/summa-web

Blog post: https://habr.com/ru/post/690252/

@fulmicoton
Collaborator

@ppodolsky

First of all, that's super cool! Google Translate did wonders translating the Russian. You should take the time to translate it and republish in English; it would interest a lot of people.

It's also impressive that you understood all of our little quickwit tricks to make this possible :).

One point where I am unhappy, however:
You copy-pasted our code into your repository. You were kind enough to retain the license header and be transparent about the authorship. The code is under the AGPL license, which has a copyleft clause. Your repo cannot be under MIT if you use this code.

@ppodolsky
Contributor

Thank you for the note. I'm not happy with this vendoring either. The only reason for doing it is that the quickwit-directory package brings in some dependencies that don't compile on WASM. And you know, while you are experimenting it is hard to wait for patches to be accepted upstream, sorry.

If you are OK with it, I can refactor it and put some parts of quickwit-directory under a feature flag to make it usable in WASM.

@fulmicoton
Collaborator

@ppodolsky Yes, I am not a lawyer, but I think putting that code behind a feature flag and somehow clarifying the situation should be OK.

I am not sure what you are doing with summa, but if you want to use & take part in the dev of quickwit instead, I'm happy to discuss how we can offer you a more lenient license.

@mre

mre commented Nov 4, 2022

What would be the best way to move forward with this?

From what I can see:

  • Make a decision on the usize to Ulen=u64 change. I think this could be moved out and merged as a preliminary step. If needed, we can add the mentioned feature flag for the Ulen change to guarantee backwards compatibility (see the sketch after this list).
  • Decide if we want to keep the FakeArr trait, and in general how the I/O handling is done. With conditional compilation we could build the I/O handling part for wasm targets only, at no cost to the existing implementation.
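The backwards-compatibility part could be as small as a cfg'd type alias; a sketch, with a hypothetical feature name:

```rust
/// Offsets/lengths used throughout the index. 64-bit by default so that
/// indexes > 4 GB work even on 32-bit targets such as wasm32.
#[cfg(not(feature = "small-index"))]
pub type Ulen = u64;

/// Opt-out for users who really want native-word-size offsets.
#[cfg(feature = "small-index")]
pub type Ulen = usize;
```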

@phiresky
Author

phiresky commented Nov 4, 2022

@mre

Since I wrote this PR, tantivy has had lots of code changes. At least some of them (the removal of the FST) should make WASM support easier. So it probably makes more sense either to start from scratch or to look at one of the other approaches above (summa) than to base new work on the code in this PR.

The usize -> u64 change is only needed if you want to load databases > 4 GB on 32-bit. It could be done in a completely standalone PR, but idk if quickwit cares about it.

@ppodolsky
Contributor

ppodolsky commented Nov 4, 2022

@phiresky
I'm finishing the guide and code refactorings for Summa. I hope everything will be production-ready next week. Feel free to contact me if you have any questions. I saw your activity around making Wikipedia (or at least SQLite in the browser, excuse me if I'm confused) work on IPFS, and we may have something in common here.

@ppodolsky
Contributor

I've posted documentation on Summa's WASM part.

The sources and docs may contain some valuable hints for those who want to run a search index inside the browser and integrate it with IPFS or any other system that provides Range requests to files over HTTP.

https://izihawa.github.io/summa/ipfs-wasm-guide

@alzinging

Is it possible to get this demo and code back?

@ppodolsky
Contributor

@alzinging

Is it possible to get this demo and code back?

Both the code and the demo live in the repo, together with documentation and guides
