
Convert numpy matrix stored in Zarr directorystore to a CSR matrix #723

Open
sanjaysrikakulam opened this issue Apr 26, 2021 · 5 comments

@sanjaysrikakulam

Hi,

I have a huge 2D NumPy matrix stored chunk-wise in a Zarr DirectoryStore; it is a boolean matrix of shape, for example, 600M rows × 100K columns. My goal is to access random rows. Since the data is stored chunk-wise, accessing a random row directly via Zarr is not ideal, so I am trying to index it in MySQL or some other database for fast querying. Since this matrix will be sparse, I would like to know if there is any way to convert the on-disk matrix into CSR form so that I can index it in a database and run fast queries.

Any suggestion or pointers would be really helpful.

Thank you.

@sanjaysrikakulam sanjaysrikakulam changed the title Clarification-Convert numpy matrix stored in Zarr directorystore to a CSR matrix Convert numpy matrix stored in Zarr directorystore to a CSR matrix Apr 26, 2021
@Mubarraqqq

Hi @joshmoore and @MSanKeys963, I want to ask if this issue is still relevant. If so, I would like to work on it immediately.

@joshmoore
Member

There's still definitely a lot of interest in this (cc: @ivirshup @martindurant et al), @Mubarraqqq, but it's a pretty sizable project.

@Mubarraqqq

No problem. I'm still willing to work on the issue regardless.
I'll update you if I have any suggestions, and I'll also reach out to @ivirshup and @martindurant along the way for help.

@martindurant
Member

Perhaps this one is the other way around from what we've been talking about: there is interest in using zarr as storage for the various sparse layouts, but here you have a dense zarr array and want to insert sparse rows into a DB. I don't see why you couldn't do this block-wise with dask, or even serially.

@ivirshup

ivirshup commented Nov 2, 2022

Here's a slightly simplified version of the code for doing this in anndata:

Example implementation
from scipy import sparse
import zarr


def idx_chunks_along_axis(shape: tuple, axis: int, chunk_size: int):
    """\
    Gives indexer tuples chunked along an axis.

    Params
    ------
    shape
        Shape of array to be chunked
    axis
        Axis to chunk along
    chunk_size
        Size of chunk along axis

    Returns
    -------
    An iterator of tuples for indexing into an array of passed shape.
    """
    total = shape[axis]
    cur = 0
    mutable_idx = [slice(None) for i in range(len(shape))]
    while cur + chunk_size < total:
        mutable_idx[axis] = slice(cur, cur + chunk_size)
        yield tuple(mutable_idx)
        cur += chunk_size
    mutable_idx[axis] = slice(cur, None)
    yield tuple(mutable_idx)


def read_dense_as_csr(array: zarr.Array) -> sparse.csr_matrix:
    """Read a dense 2D zarr array into a CSR matrix, one row-chunk at a time."""
    axis_chunk = array.chunks[0]
    sub_matrices = []
    for idx in idx_chunks_along_axis(array.shape, 0, axis_chunk):
        dense_chunk = array[idx]  # only one chunk of rows is dense in memory
        sub_matrix = sparse.csr_matrix(dense_chunk)
        sub_matrices.append(sub_matrix)
    return sparse.vstack(sub_matrices, format="csr")

Could probably be improved with parallelization or tailored to your memory needs. Original code here
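For the fast random-row-access goal from the original question: once the data is in CSR form, row `i`'s nonzero columns sit in the contiguous slice `indices[indptr[i]:indptr[i+1]]`, which is also a natural per-row payload to store in a database. A tiny illustration (hypothetical data, not from the thread):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 0, 0]], dtype=bool)
csr = sparse.csr_matrix(dense)

# Nonzero column indices of row i, straight from the CSR buffers:
i = 1
row_cols = csr.indices[csr.indptr[i]:csr.indptr[i + 1]]
# row_cols → [0, 2]
```

Because `indptr` gives the start/end offsets per row, each lookup touches only that row's data, independent of the matrix's total size.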
