
Database sources where each array element is a separate database row #438

Open
ryan-williams opened this issue May 8, 2019 · 6 comments

@ryan-williams
Member

My impression is that the existing DB backends for Zarr use small DB tables of chunks: each row has a string key (that would otherwise be a filesystem path) and a binary blob (that would otherwise be the compressed chunk file at that path).
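
Roughly, a store like this (a sketch, assuming sqlite3; the class name and schema are illustrative, not an existing Zarr API):

```python
import sqlite3
from collections.abc import MutableMapping

class ChunkBlobStore(MutableMapping):
    """Illustrative chunk-level DB store: one row per chunk, keyed by
    the path Zarr would otherwise use on a filesystem."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunks (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __getitem__(self, key):
        row = self.db.execute(
            "SELECT value FROM chunks WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        with self.db:
            self.db.execute(
                "REPLACE INTO chunks (key, value) VALUES (?, ?)",
                (key, sqlite3.Binary(value)),
            )

    def __delitem__(self, key):
        with self.db:
            cur = self.db.execute("DELETE FROM chunks WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (key,) in self.db.execute("SELECT key FROM chunks"):
            yield key

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
```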

I wanted to flag a need I keep seeing in single-cell, and have discussed with various folks (incl. @alasla, @mckinsel, @tomwhite, @laserson): putting lots of gene-expression matrices in a database (instead of storing each one as an HDF5 file, CSV, or Zarr directory), where each entry in these 2-D sparse matrices is stored as a database row (likely a (cell ID, gene ID, count) triple).

Generalizing, an N-D Zarr dataset can have each entry mapped to a DB row with N integer "key" columns and one "value" column (storing elements of the given Zarr dtype).
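
For the 2-D single-cell case above, the table might look like this (a sketch using sqlite3; the table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect("expression.db")
db.execute(
    """
    CREATE TABLE IF NOT EXISTS elements (
        cell_id INTEGER NOT NULL,   -- "key" column for dimension 0
        gene_id INTEGER NOT NULL,   -- "key" column for dimension 1
        count   INTEGER NOT NULL,   -- "value" column (the Zarr dtype)
        PRIMARY KEY (cell_id, gene_id)
    )
    """
)
```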

You can straightforwardly support existing Zarr access patterns by indexing such a table on the "key" columns and letting Zarr page full chunks into memory to operate on, as usual. Fetching a chunk from such a table is then a simple query against that index (with appropriate chunk-size-multiple bounds on each dimension column), and downstream code need not care that the chunk it is fed is entirely virtual.
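
E.g., reusing the elements table sketched above, a whole virtual chunk can be materialized with one range query (fetch_chunk and its defaults are hypothetical):

```python
import numpy as np

def fetch_chunk(db, chunk_idx, chunks, dtype="i8", fill_value=0):
    """Materialize one virtual chunk from the element-level table.

    chunk_idx: (i, j) index of the chunk in the chunk grid.
    chunks:    (rows, cols) chunk shape.
    Entries absent from the table take `fill_value` (natural for sparse data).
    """
    (i, j), (r, c) = chunk_idx, chunks
    r0, c0 = i * r, j * c
    out = np.full(chunks, fill_value, dtype=dtype)
    rows = db.execute(
        "SELECT cell_id, gene_id, count FROM elements "
        "WHERE cell_id >= ? AND cell_id < ? AND gene_id >= ? AND gene_id < ?",
        (r0, r0 + r, c0, c0 + c),
    )
    for cell_id, gene_id, count in rows:
        out[cell_id - r0, gene_id - c0] = count
    return out
```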

This model can also trivially simulate concatenating, splitting, and re-chunking Zarr trees, potentially obviating a host of related problems (e.g. #297, #323, #392). More generally, it raises questions about when you should ever store things in a filesystem instead of a database (possibly never 😝), how central filesystem assumptions are to the essence of Zarr (not very, IMO, though we haven't really hashed this out), etc.

In any case, I am eager to make a Zarr backend for "entry"-level DBs like this, and will post any progress here. Any thoughts are welcome!

@ryan-williams
Member Author

I dug into this a bit. The quick route I was hoping for is not possible.

The current interface between the core chunked-array implementation and the storage layer is nice and clean, too much so for what I want to do 😀.

In an example like:

```python
store = zarr.DBElementStore(…)
z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
z[3, 3] = 42
```

Ideally I'd like the last line to just perform an update on a single row in the underlying element-wise DB. However, all reads and writes today are at the granularity of whole chunks, and the storage layer only receives and sends those chunks inside a compression context that it is unaware of.

My storage layer could, for each chunk, look up the corresponding .zarray, find the compression codec, decode the chunk, parse the records from the resulting numpy array, and send a batch update query to the DB covering all elements in the chunk. It's not ideal but might be the right way to at least prototype this.
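
A rough sketch of that decode-and-fan-out path, assuming numcodecs and the hypothetical elements table above (it ignores filters, 'F' order, and a null compressor for brevity):

```python
import json

import numpy as np
from numcodecs import get_codec

def write_chunk_as_rows(db, store, key, blob):
    """Decode one compressed chunk blob and upsert a DB row per element.

    `key` looks like "0.1" (chunk coordinates within the chunk grid);
    `store` holds the .zarray metadata naming the compressor and dtype.
    Simplified: assumes a 2-D array, C order, a non-null compressor, no filters.
    """
    meta = json.loads(store[".zarray"])
    codec = get_codec(meta["compressor"])
    chunk = np.frombuffer(codec.decode(blob), dtype=meta["dtype"])
    chunk = chunk.reshape(meta["chunks"])
    i, j = (int(k) for k in key.split("."))
    r, c = meta["chunks"]
    db.executemany(
        "REPLACE INTO elements (cell_id, gene_id, count) VALUES (?, ?, ?)",
        (
            (i * r + di, j * c + dj, int(chunk[di, dj]))
            for di in range(r)
            for dj in range(c)
        ),
    )
    db.commit()
```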

Otherwise, some more radical change to the core ↔ storage paths would be required. Compression and dtype endianness are conceptually irrelevant for this storage backend, and core would need to be made aware of that.

This seems relevant to zarr-developers/zarr-specs#30.

@jakirkham
Member

Have you played with structured arrays at all? Zarr also supports these, and this sounds like a potentially good match for what you are describing, but maybe I'm missing something.
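
For example, something along these lines (a sketch; the field names just mirror your triples):

```python
import numpy as np
import zarr

# One record per non-zero entry, stored with a structured dtype.
records = np.array(
    [(0, 2, 5), (3, 7, 42)],
    dtype=[("cell_id", "<i8"), ("gene_id", "<i8"), ("count", "<i8")],
)
z = zarr.array(records, fill_value=None)
z[:]["count"]  # -> array([ 5, 42])
```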

@alimanfoo
Member

Hi Ryan, I think what you are describing is more like a persistence mechanism for sparse arrays, where there is no need for any concept of chunking or compression (or dtype endianness).

Possibly related: #424 which links to https://github.com/daletovar/zsparse

@alimanfoo
Member

Just to add, I think you're describing using a database to store a sparse matrix in COO (coordinate) format. Cf. scipy's coo_matrix.
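
For concreteness, the same triples in scipy's COO format:

```python
import numpy as np
from scipy.sparse import coo_matrix

# (row, col, value) triples: exactly the columns the proposed DB table holds.
rows = np.array([0, 3])
cols = np.array([2, 7])
vals = np.array([5, 42])
m = coo_matrix((vals, (rows, cols)), shape=(10, 10))
m.toarray()[3, 7]  # -> 42
```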

@ryan-williams
Member Author

Thanks all.

@jakirkham I don't think NumPy structured arrays get at my goal here, but it's possible I'm missing something.

@alimanfoo zsparse seems like a wrapper on top of zarr for storing one logical sparse array as three underlying 1-D Zarr arrays. That's definitely a useful abstraction I've wanted in other contexts, so I'm glad to learn of it.

What I want here diverges a bit deeper in the stack. Imagine: I have an existing database table with records that conceptually form a 2-D array (e.g. (idx1, idx2, value) triples), and want to query that table as if it were a 2-D Zarr array.

The concept of chunks can still be meaningful in this world, but they would be virtual: the storage layer just gives a uniform API for accessing all the elements in the array, though each call site could nevertheless interact with that formless layer in terms of a chunk shape (reflecting, e.g., how one wants to parallelize over the full array).

Compressed-chunk byte-blobs are not meaningful in this context, but the current Zarr storage interface is implemented entirely in terms of them, so that's what I'm wrestling with.

It's interesting to think about how Zarr can encompass arrays where no physical manifestation of [compressed, whole-chunk blobs] exists in the underlying storage medium. I hoped I could just splice into the storage layer and support this, but now I'm understanding that more changes would be required.

@alimanfoo
Member

FWIW I think what you're after is essentially a scipy sparse COO matrix but using database columns to store the rows, columns and values, rather than numpy arrays. In this case I think the zarr abstractions of chunks and the key/value storage interface are just getting in the way. You might as well try to implement the numpy array API (at least __getitem__ with slices) directly, with your own custom logic for how to retrieve array values for a given region. I.e., I'm not sure there's anything in zarr that gives you a leg up here. (Could be wrong though :-)
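
For example, a minimal sketch of that direct approach (reusing the hypothetical elements table from above, and handling only a pair of slices):

```python
import numpy as np

class DBArray:
    """Hypothetical 2-D array view over an element-level DB table:
    __getitem__ with slices, no chunks, no compression, no zarr."""

    def __init__(self, db, shape, dtype="i8", fill_value=0):
        self.db, self.shape = db, shape
        self.dtype, self.fill_value = np.dtype(dtype), fill_value

    def __getitem__(self, key):
        rs, cs = key  # only a (slice, slice) pair, for brevity
        r0, r1, _ = rs.indices(self.shape[0])
        c0, c1, _ = cs.indices(self.shape[1])
        out = np.full((r1 - r0, c1 - c0), self.fill_value, dtype=self.dtype)
        for i, j, v in self.db.execute(
            "SELECT cell_id, gene_id, count FROM elements "
            "WHERE cell_id >= ? AND cell_id < ? AND gene_id >= ? AND gene_id < ?",
            (r0, r1, c0, c1),
        ):
            out[i - r0, j - c0] = v
        return out
```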
