POC/Research Prototype: Use createLazyFile to mount a read only view of remote sqlite databases #49
Conversation
I also built a fork of a static web file server that exposes the necessary headers:
Wow, I did not think this would be possible without completely replacing the Python sqlite3 module!
The experimental tuned/extracted build explodes on the example databases. Not sure why. Well, that's why this is a draft after all :D.
I bet that's because Datasette tries to show a count of all of the rows in each table when it shows the list on that page, which triggers a full table scan. Would be great to have a setting that turns that feature off, which could then be exposed as a query string option for Datasette Lite.
https://github.com/phiresky/sql.js-httpvfs/blob/master/src/lazyFile.ts I think it should be possible to adapt this theoretically more efficient dynamic-chunk-size version to this approach as well.
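For anyone skimming, the core idea of that lazyFile.ts (as I read it, paraphrased, not the actual code) is to grow the read-ahead chunk size when access looks sequential, so long scans issue far fewer requests:

```javascript
// Paraphrase of sql.js-httpvfs's dynamic chunk-size idea: if a read
// starts exactly where the last one ended, assume a sequential scan
// and double the chunk size, up to a cap. Random reads keep it small.
function nextChunkSize(current, lastReadEnd, newReadStart,
                       maxChunk = 100 * 1024 * 1024) {
  const sequential = newReadStart === lastReadEnd;
  if (sequential) return Math.min(current * 2, maxChunk);
  return current; // random access: keep the chunk size as-is
}
```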
Oh, I should probably explain what I'm trying to query or do. I like looking at CA unclaimed property records with SQLite since it's a much more powerful and faster full text search than what's offered on the government site: https://www.sco.ca.gov/upd_download_property_records.html It's about 35GB when imported into an FTS5 table, optimized, and vacuumed. I was thinking of trying to expose a similar service with something like https://github.com/wilsonzlin/edgesearch, but it was a little too foreign, way too tied to one vendor's offering and architecture, and not SQLite.
The performance still leaves a lot to be desired: 50 seconds for that query. Still, it's pretty neat!
Oh wow, I didn't realize you were already aware of this stuff: https://twitter.com/simonw/status/1421497663732604928 I think we could just use sql.js-httpvfs's implementation of createLazyFile as-is and we should be good to go on using it with datasette-lite.
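Concretely, the mount itself is just Emscripten's `FS.createLazyFile(parent, name, url, canRead, canWrite)`, which Pyodide exposes as `pyodide.FS`. A sketch of how datasette-lite's worker could use it (the `lazyMountArgs` wrapper is my own illustrative helper, not part of either project):

```javascript
// Package up the arguments for a read-only lazy mount of a remote
// database. The wrapper exists only so the mount point is predictable.
function lazyMountArgs(url) {
  const name = url.split("/").pop() || "remote.db";
  return ["/", name, url, /* canRead */ true, /* canWrite */ false];
}

// In the worker, with pyodide loaded (browser-only, so commented out):
// pyodide.FS.createLazyFile(...lazyMountArgs(dbUrl));
// Python-side, sqlite3.connect("/my.db") then reads pages on demand
// via HTTP range requests instead of loading the whole file.
```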
This now takes about 30 seconds, down from 50, using https://github.com/segfall/static-wiki/tree/master/scripts for database optimization ideas and https://github.com/phiresky/sql.js-httpvfs/blob/master/src/lazyFile.ts for the optimized adjacent-pages-read lazyFile implementation. I've also projectified the toolkit/generator with a README and so on: https://github.com/nelsonjchen/ca_unclaimed_property_db_generator_toolkit
So this is all still research and a POC PR that is not meant to be merged in the end, but I have some ideas to keep following up on when time permits that might be interesting for others' future proper PRs/RFCs:
Threw on a page rule in CF to see what happens if the sqlite database is forced to be cached: nelsonjchen/ca_unclaimed_property_db_generator_toolkit#1 (comment) Unfortunately, I ran into some odd bug where a range request tried to return a 200 with all of it, 28GB. At least, it tried to before I killed it. Removed the rule. The post is both agonizing and tantalizing since the OP posted a timeline. If Cloudflare gets their stuff together, we could speed up repeated queries and/or cache a few levels of popular indirection by as much as ~10x, since hits become 30ms vs. 300ms for the current status quo of always missing. It's also agonizing since CF claims the issue was fixed. Thinking of making a periodic, continuous GitHub repo/Actions setup with Playwright to test and validate the bug and running it up to CF engineering.
Found this great post on the caching behavior of range requests in common CDNs: https://kevincox.ca/2021/06/04/http-range-caching/
Made a chunked version as a POC test. It is indeed faster! It's a hack, with a hardcoded overall file size: https://github.com/nelsonjchen/datasette-lite/tree/chunk-hack Something like this definitely needs a manifest-like thing though; the file size is needed up front, and I've already expressed my desire for metadata. 4096KB pages, 10MB chunks, ~30ms hits, ~300-500ms misses. Not bad! Cloudflare does seem to expire the cache rather a lot though; at least a page refresh is fast. Though it seems caching sometimes isn't free, and a miss leans towards 600ms. On that note, I think I'm going to wind down the experimentation a bit. Hopefully someone else can use these learnings. And I have looked you up @simonw . There is something, not a lot and not pocket change, but there is something. Hopefully the CA state controller won't give you too much trouble. You the real MVP!
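The split-file arithmetic behind the chunked version is simple: each byte offset of the logical database maps to one chunk file plus an in-chunk offset. A sketch with the 10MB split size from the experiment (the `.part-N` naming is made up for illustration, not the actual manifest format):

```javascript
// Map a byte offset of the logical database to the chunk file that
// holds it. CHUNK is the 10MB split size used in the experiment.
const CHUNK = 10 * 1024 * 1024;

function locate(offset) {
  const index = Math.floor(offset / CHUNK); // which split file
  return {
    index,
    url: `db.sqlite.part-${index}`,        // hypothetical naming
    inChunkOffset: offset - index * CHUNK, // where to read inside it
  };
}
```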
Note: phiresky/sql.js-httpvfs#40 There's an upstream bug in the lazyFile implementation which I've fixed in my test-lab hack for the chunked, cacheable database.
So, with the split database and all that, the cost per month to host and expose this 28GB database with Datasette (https://developers.cloudflare.com/r2/platform/pricing/): $0.42. The cost can increase if there are a lot of queries, but even then it is extremely negligible.
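The back-of-envelope behind the $0.42 figure, assuming R2's storage rate at the time ($0.015 per GB-month) and R2's lack of egress fees:

```javascript
// Monthly storage cost: size in GB times the per-GB-month rate.
// $0.015/GB-month was R2's listed storage price; egress is free on R2.
function monthlyStorageCostUSD(gb, pricePerGbMonth = 0.015) {
  return gb * pricePerGbMonth;
}

// monthlyStorageCostUSD(28) is roughly $0.42 per month.
```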
This is more POC/WIP/curiosity than anything serious. Just wanted to make a draft PR for posterity.
Edit: the current state of the art is this:
https://datasette-lite-lab.mindflakes.com/index.html?url=https://datasette-lite-lab.mindflakes.com/sdb/2022-10-02_93eff57de3573985_ca_unclaimed_property.sqlite#/2022-10-02_93eff57de3573985_ca_unclaimed_property?sql=SELECT+*+FROM+records+WHERE+records.owner_name+MATCH+%22Elon+Musk%22+ORDER+BY+CAST%28CURRENT_CASH_BALANCE+AS+FLOAT%29++DESC%3B
which is the chunked/CDN'd version described further below.
For context, I'm trying to put a 32GB FTS5 sqlite database on the internet to query. I plan to host it on Cloudflare, so I do not care about bandwidth costs.
The functionality here is a bit like https://github.com/phiresky/sql.js-httpvfs, but far lazier and less efficient. I originally thought I would have to make my own GUI, but I kind of like the GUI I saw in datasette and wondered if I could reuse it. Unfortunately, current datasette-lite seems to pull everything into memory. It does work, actually, but I don't know how viable this is. Emscripten by default reads 1MB chunks, but it seems there is some way to recompile it not to:
https://github.com/emscripten-core/emscripten/blob/18bc868cb5242e6816a4b3bde74b1e1dcd6fd818/src/library_fs.js#L1713
azavea/loam#75 (comment)
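That 1MB granularity is why small logical reads can balloon: every byte touched pulls in its whole 1MB-aligned chunk. A rough model of the amplification (pure arithmetic, not Emscripten's actual code):

```javascript
// Rough model of lazy-file read amplification: a read of `length` bytes
// at `offset` fetches every 1MB-aligned chunk it overlaps.
const EMSCRIPTEN_CHUNK = 1024 * 1024; // default XHR chunk size

function bytesFetched(offset, length, chunk = EMSCRIPTEN_CHUNK) {
  const first = Math.floor(offset / chunk);
  const last = Math.floor((offset + length - 1) / chunk);
  return (last - first + 1) * chunk;
}

// e.g. a 4KB SQLite page that straddles a chunk boundary costs 2MB.
```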
Some FTS5 queries transfer 1GB while others may take only 66MB. As I've said, I don't care about bandwidth, but the great inefficiency harms UX.
Also, this chunked inefficiency means that I have to hack the URL to avoid loading the tables of a database, as it seems to try to load the whole database when I click on one.
I think for my goal, I might have to try to recompile pyodide with a small XHR build of emscripten. 😬
As for any dependencies from remote URLs, they'll need to have the proper CORS headers set, including "expose" headers. If Emscripten can't read the expose headers, or the remote file is gzipped, it just downloads the whole file, so it does degrade gracefully. However, it does pop up a nasty error message in the console saying something denied Emscripten access to some headers.
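The fallback decision can be summarized as a predicate (my reading of the behavior, not Emscripten's literal code):

```javascript
// My reading of the graceful-degradation rule: lazy (range-based)
// access only works when the server advertises byte serving, the
// length is readable cross-origin, and the content is not re-encoded
// by compression (which breaks byte offsets). Otherwise, the whole
// file gets downloaded up front.
function canLazyLoad(headers) {
  const ranges = (headers["accept-ranges"] || "").toLowerCase() === "bytes";
  const hasLength = Boolean(headers["content-length"]);
  const reencoded = Boolean(headers["content-encoding"]); // e.g. gzip
  return ranges && hasLength && !reencoded;
}
```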
Anyway, this is just for fun.