
Database sources where each array element is a separate database row #438

Open
ryan-williams opened this issue May 8, 2019 · 6 comments

@ryan-williams
Member

My impression is that the existing DB backends for Zarr use small DB tables of chunks: each row has a string key (that would otherwise be a filesystem path) and a binary blob (that would otherwise be the compressed chunk file at that path).
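
Roughly, a store like this (a sketch, assuming sqlite3; the class name and schema are illustrative, not an existing Zarr API):

```python
import sqlite3
from collections.abc import MutableMapping

class ChunkBlobStore(MutableMapping):
    """Illustrative chunk-level DB store: one row per chunk, keyed by
    the path Zarr would otherwise use on a filesystem."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __getitem__(self, key):
        row = self.db.execute(
            "SELECT value FROM chunks WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        with self.db:
            self.db.execute(
                "REPLACE INTO chunks (key, value) VALUES (?, ?)",
                (key, sqlite3.Binary(value)),
            )

    def __delitem__(self, key):
        with self.db:
            cur = self.db.execute("DELETE FROM chunks WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (key,) in self.db.execute("SELECT key FROM chunks"):
            yield key

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```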

I wanted to flag a need I keep seeing in single-cell, and have discussed with various folks (incl. @alasla, @mckinsel, @tomwhite, @laserson): putting lots of gene-expression matrices in a database (instead of storing each one as an HDF5 file, CSV, or Zarr directory), where each entry in these 2-D sparse matrices is stored as a database row (likely a (cell ID, gene ID, count) triple).

Generalizing, an N-D Zarr dataset can have each entry mapped to a DB row with N integer "key" columns and one "value" column (storing elements of the given Zarr dtype).
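
For the 2-D single-cell case above, the table might look like this (a sketch using sqlite3; the table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect("expression.db")
db.execute(
    """
    CREATE TABLE IF NOT EXISTS elements (
        cell_id INTEGER NOT NULL,   -- "key" column for dimension 0
        gene_id INTEGER NOT NULL,   -- "key" column for dimension 1
        count   INTEGER NOT NULL,   -- "value" column (the Zarr dtype)
        PRIMARY KEY (cell_id, gene_id)
    )
    """
)
```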

You can straightforwardly support existing Zarr access patterns by indexing such a table on the "key" columns and letting Zarr page full chunks into memory to operate on, as usual. Fetching a chunk from such a table is then a simple query against that index (with appropriate chunk-size-multiple bounds on each dimension column), and downstream code need not care that the chunk it is fed is entirely virtual.
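
E.g., reusing the elements table sketched above, a whole virtual chunk can be materialized with one range query (fetch_chunk and its defaults are hypothetical):

```python
import numpy as np

def fetch_chunk(db, chunk_idx, chunks, dtype="i8", fill_value=0):
    """Materialize one virtual chunk from the element-level table.

    chunk_idx: (i, j) index of the chunk in the chunk grid.
    chunks:    (rows, cols) chunk shape.
    Entries absent from the table take `fill_value` (natural for sparse data).
    """
    (i, j), (r, c) = chunk_idx, chunks
    r0, c0 = i * r, j * c
    out = np.full(chunks, fill_value, dtype=dtype)
    rows = db.execute(
        "SELECT cell_id, gene_id, count FROM elements "
        "WHERE cell_id >= ? AND cell_id < ? AND gene_id >= ? AND gene_id < ?",
        (r0, r0 + r, c0, c0 + c),
    )
    for cell_id, gene_id, count in rows:
        out[cell_id - r0, gene_id - c0] = count
    return out
```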

This model can also trivially simulate concatenating, splitting, and re-chunking Zarr trees, potentially obviating a host of related problems (e.g. #297, #323, #392). More generally, it raises questions about when you should ever store things in a filesystem instead of a database (possibly never 😝), how central filesystem assumptions are to the essence of Zarr (not very, IMO, though we haven't really hashed this out), etc.

In any case, I am eager to make a Zarr backend for "entry"-level DBs like this, and will post any progress here. Any thoughts are welcome!

@ryan-williams
Member Author

I dug into this a bit. The quick route I was hoping for is not possible.

The current interface between the core chunked-array implementation and the storage layer is nice and clean, too much so for what I want to do 😀.

In an example like:

```python
store = zarr.DBElementStore(…)
z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
z[3, 3] = 42
```

Ideally I'd like the last line to just perform an update on a single row in the underlying element-wise DB. However, all reads and writes today are at the granularity of whole chunks, and the storage layer only receives and sends those chunks inside a compression context that it is unaware of.

My storage layer could, for each chunk, look up the corresponding .zarray, find the compression codec, decode the chunk, parse the records from the resulting numpy array, and send a batch update query to the DB covering all elements in the chunk. It's not ideal but might be the right way to at least prototype this.
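
A rough sketch of that decode-and-fan-out path, assuming numcodecs and the hypothetical elements table above (it ignores filters, 'F' order, and a null compressor for brevity):

```python
import json

import numpy as np
from numcodecs import get_codec

def write_chunk_as_rows(db, store, key, blob):
    """Decode one compressed chunk blob and upsert a DB row per element.

    `key` looks like "0.1" (chunk coordinates within the chunk grid);
    `store` holds the .zarray metadata naming the compressor and dtype.
    Simplified: assumes a 2-D array, C order, a non-null compressor, no filters.
    """
    meta = json.loads(store[".zarray"])
    codec = get_codec(meta["compressor"])
    chunk = np.frombuffer(codec.decode(blob), dtype=meta["dtype"])
    chunk = chunk.reshape(meta["chunks"])
    i, j = (int(k) for k in key.split("."))
    r, c = meta["chunks"]
    db.executemany(
        "REPLACE INTO elements (cell_id, gene_id, count) VALUES (?, ?, ?)",
        (
            (i * r + di, j * c + dj, int(chunk[di, dj]))
            for di in range(r)
            for dj in range(c)
        ),
    )
    db.commit()
```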

Otherwise, some more radical change to the core ↔ storage paths would be required. Compression and dtype endianness are conceptually irrelevant for this storage backend, and core would need to be made aware of that.

This seems relevant to zarr-developers/zarr-specs#30.

@jakirkham
Member

Have you played with structured arrays at all? Zarr also supports these, and this sounds like a potentially good match for what you are describing, but maybe I'm missing something.
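
For example, something along these lines (a sketch; the field names just mirror your triples):

```python
import numpy as np
import zarr

# One record per non-zero entry, stored with a structured dtype.
records = np.array(
    [(0, 2, 5), (3, 7, 42)],
    dtype=[("cell_id", "<i8"), ("gene_id", "<i8"), ("count", "<i8")],
)
z = zarr.array(records, fill_value=None)
z[:]["count"]  # -> array([ 5, 42])
```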

@alimanfoo
Member

Hi Ryan, I think what you are describing is more like a persistence mechanism for sparse arrays, where there is no need for any concept of chunking or compression (or dtype endianness).

Possibly related: #424 which links to https://github.com/daletovar/zsparse

@alimanfoo
Member

Just to add, I think you're describing using a database to store a sparse matrix in COO (coordinate) format. Cf. scipy's coo_matrix.
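
For concreteness, the same triples in scipy's COO format:

```python
import numpy as np
from scipy.sparse import coo_matrix

# (row, col, value) triples: exactly the columns the proposed DB table holds.
rows = np.array([0, 3])
cols = np.array([2, 7])
vals = np.array([5, 42])
m = coo_matrix((vals, (rows, cols)), shape=(10, 10))
m.toarray()[3, 7]  # -> 42
```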

@ryan-williams
Member Author

Thanks all.

@jakirkham I don't think NumPy structured arrays get at my goal here, but it's possible I'm missing something.

@alimanfoo zsparse seems like a wrapper on top of zarr for storing one logical sparse array as three underlying 1-D Zarr arrays. That's definitely a useful abstraction I've wanted in other contexts, so I'm glad to learn of it.

What I want here diverges a bit deeper in the stack. Imagine: I have an existing database table with records that conceptually form a 2-D array (e.g. (idx1, idx2, value) triples), and want to query that table as if it were a 2-D Zarr array.

The concept of chunks can still be meaningful in this world, but they would be virtual: the storage layer just gives a uniform API for accessing all the elements in the array, though each call site could nevertheless interact with that formless layer in terms of a chunk shape (reflecting, e.g., how one wants to parallelize over the full array).

Compressed-chunk byte-blobs are not meaningful in this context, but the current Zarr storage interface is implemented entirely in terms of them, so that's what I'm wrestling with.

It's interesting to think about how Zarr can encompass arrays where no physical manifestation of [compressed, whole-chunk blobs] exists in the underlying storage medium. I hoped I could just splice into the storage layer and support this, but now I'm understanding that more changes would be required.

@alimanfoo
Member

FWIW I think what you're after is essentially a scipy sparse COO matrix but using database columns to store the rows, columns and values, rather than numpy arrays. In this case I think the zarr abstractions of chunks and the key/value storage interface are just getting in the way. You might as well try to implement the numpy array API (at least __getitem__ with slices) directly, with your own custom logic for how to retrieve array values for a given region. I.e., I'm not sure there's anything in zarr that gives you a leg up here. (Could be wrong though :-)
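
For example, a minimal sketch of that direct approach (reusing the hypothetical elements table from above, and handling only a pair of slices):

```python
import numpy as np

class DBArray:
    """Hypothetical 2-D array view over an element-level DB table:
    __getitem__ with slices, no chunks, no compression, no zarr."""

    def __init__(self, db, shape, dtype="i8", fill_value=0):
        self.db, self.shape = db, shape
        self.dtype, self.fill_value = np.dtype(dtype), fill_value

    def __getitem__(self, key):
        rs, cs = key  # only a (slice, slice) pair, for brevity
        r0, r1, _ = rs.indices(self.shape[0])
        c0, c1, _ = cs.indices(self.shape[1])
        out = np.full((r1 - r0, c1 - c0), self.fill_value, dtype=self.dtype)
        for i, j, v in self.db.execute(
            "SELECT cell_id, gene_id, count FROM elements "
            "WHERE cell_id >= ? AND cell_id < ? AND gene_id >= ? AND gene_id < ?",
            (r0, r1, c0, c1),
        ):
            out[i - r0, j - c0] = v
        return out
```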
