Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could this use the SQLite range header trick? #28

Open
simonw opened this issue May 4, 2022 · 8 comments
Open

Could this use the SQLite range header trick? #28

simonw opened this issue May 4, 2022 · 8 comments
Labels
help wanted Extra attention is needed research

Comments

@simonw
Copy link
Owner

simonw commented May 4, 2022

https://twitter.com/BardakciBaba/status/1521927918603145216

Impressive! Does it download the whole database or only the bits sqlite actually needs - possibly via http range requests a la https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/

Right now Datasette Lite downloads the whole SQLite database. But could it work with the range trick?

This would be difficult to pull off, because it would involve compiling a custom build of Python's sqlite3 module. Could probably use https://github.com/coleifer/pysqlite3 as a starting point for that.

I'm going to leave this open as a "research" and "help wanted" issue, in case anyone else is inspired to try and figure it out!

@simonw simonw added help wanted Extra attention is needed research labels May 4, 2022
@SuperSonicHub1
Copy link
Contributor

SuperSonicHub1 commented May 5, 2022

Pretty sure this would consist of implementing https://github.com/phiresky/sql.js-httpvfs in Python using pyfetch. It seems that APSW, another Python SQlite lib, supports writing virtual file systems out of the box in Python. May be a bit of a bother to move all of Datasette's code to this driver, but I know no one is going to touch C with a ten-foot pole.

@simonw
Copy link
Owner Author

simonw commented May 5, 2022

I wonder if I could instead implement this by teaching Datasette to use sql.js-httpvfs as the database backend?

I've been wanting to add a plugin hook for custom DB backends for a while. This would be a deeply fascinating use case for such a hook.

@canh
Copy link

canh commented May 5, 2022

I haven't really verified this but I assume that Pyodide is just Python compiled to WASM under the hood. Thus, we should be able to leverage the virtual file system interface of emscripten. We just have to mount a '/data/database.db' that will be visible to the Python executable running inside, and emscripten runtime will call our custom XMLHttpRequest code whenever something makes a read() system call on that file. I believe this is also how sql.js-httpvfs is implemented.

https://emscripten.org/docs/porting/files/file_systems_overview.html

If I have time I'll be happy to try this over the weekend.

@vitorio
Copy link

vitorio commented Feb 1, 2024

Here's what it looks like to me. Because Lite relies entirely on the environment provided by Pyodide (instead of translating Datasette to JS or WASM), that's where the patching needs to happen, either:

  • with a shim that replaces the Pyodide sqlite3 module with a WASM version that already supports databases over HTTP,
  • by integrating a native HTTP VFS (because Pyodide treats everything as "native" and just the operating environment is "emscripten") that is patched to use browser-based networking, or
  • by writing a new Emscripten filesystem that supports range requests and mount the database URL there, transparently to SQLite, as @canh suggested.

Providing a shim for Pyodide to replace an existing library is something they already do, e.g. for webbrowser. Their js module has functions to register a JS module with Python, so it may be possible to use e.g. sqlite-wasm-http as-is and it's just a small shim, or maybe it can be called directly with changes to Datasette.

A native HTTP VFS, like sqlite_web_vfs, is going to use a native HTTP library to make its network calls, and that won't work in the Emscripten environment (this particular one uses libcurl). That means when it's compiled under Emscripten, those calls need to be replaced with ones to supported browser-based functions. Here's an example of someone doing it for their own libcurl-based code: emscripten-core/emscripten#3270 (comment). Datasette would still need to know to load that module if it's available, but then I guess that means server-side Datasette would also be able to use remote HTTP resources without downloading them?

A new filesystem really just means adding range request support to an existing Emscripten remote filesystem. There's the original BrowserFS, now unmaintained, and the original issue discussing it. There's its maintained fork, browser-fs, but it doesn't look like it's brought forward a remote HTTP FS module yet, despite the open issue about range requests. Pulling lazyFile out of sql.js-httpvfs might also be reasonable, with the caveat that it's probably been used less in production settings than BrowserFS has.

Adding range request support to BrowserFS is probably going to be the most transparent option, since neither Pyodide nor Datasette need to be touched; it also is a pure JavaScript solution.

Writing a shim for Pyodide to replace native sqlite3 is going to be half-and-half Python and JS, but could be transparent to Datasette.

Making a native HTTP VFS work under Emscripten is probably pretty straightforward for C programmers who understand Emscripten and JS. The Datasette integration is independent, since it's a native VFS module, so it can be written and tested using regular Datasette. Lite will just magically get the remote range ability once the C/Emscripten work is done.

@nelsonjchen
Copy link

Hmm, just wanted to drop this POC I did a long time ago that had some interesting conclusions/observations for preparation/hosting datasette-lite on Cloudflare R2 of a 30GB database:

#49

@vitorio
Copy link

vitorio commented Feb 1, 2024

@nelsonjchen the DB "manifest" you feel like could have been useful sounds a bit like the "helper DBI files" that sqlite_web_vfs can generate and use.

@nelsonjchen
Copy link

That DBI thing sounds very very cool and probably could have made my experiment a 2 file upload to CF R2.

@dr-vortex
Copy link

dr-vortex commented Mar 20, 2024

Hi @vitorio, I'm the current BrowserFS maintainer.

To clarify, browser-fs is not a fork of jvilk/BrowserFS, it is the original project. The change to using an organization was done to keep the repositories organized and so that future maintainers don't have to deal with access issues (like I've been experiencing with the project) since the organization, not an individual, owns the repositories and NPM packages.

With that out of the way, I'd like to point to browser-fs/fetch (npm: @browserfs/fetch). While it does not currently support range requests, I can look into implementing that ahead of some other features.

I think the best way to implement it would be to add a configuration option to the Fetch backend. The actual request can be handled at file read time since the offset and length of the read can determine the range to request. It could be used like so:

import { configure, fs } from '@browserfs/core';
import { Fetch } from '@browserfs/fetch'; 

await configure({
	'/mnt/remote': {
		backend: Fetch,
		// A potential (optional) option that specifies whether to use range requests
		useRanges: true,
	}
});

const veryLargeFile = await fs.promises.open('/mnt/remote/large-file');

// 20kb buffer to store the contents in
let partOfFile = new Uint8Array(20_000);

// read the first 20kb from the file, starting at 10kb, into partOfFile at offset 0
await veryLargeFile.read(partOfFile, 0, 20_000, 10_000);

What would you like to see with range request support?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed research
Projects
None yet
Development

No branches or pull requests

6 participants