
Convert numpy matrix stored in Zarr directorystore to a CSR matrix #723

Open
sanjaysrikakulam opened this issue Apr 26, 2021 · 5 comments

@sanjaysrikakulam

Hi,

I have a huge 2D NumPy matrix stored chunk-wise in a Zarr DirectoryStore; it is a boolean matrix of shape, for example, 600M rows × 100K columns. My goal is to access random rows. Since the data is stored chunk-wise, accessing a random row directly via Zarr is not ideal, so I am trying to index it in MySQL or some other database for fast querying. Since this matrix will be sparse, I would like to know if there is any way to convert the on-disk matrix into CSR form so that I can index it in a database and run fast queries.

Any suggestion or pointers would be really helpful.

Thank you.

@sanjaysrikakulam sanjaysrikakulam changed the title Clarification-Convert numpy matrix stored in Zarr directorystore to a CSR matrix Convert numpy matrix stored in Zarr directorystore to a CSR matrix Apr 26, 2021
@Mubarraqqq

Hi @joshmoore and @MSanKeys963, I want to ask if this issue is still relevant. If so, I would like to work on it immediately.

@joshmoore
Member

There's still definitely a lot of interest in this (cc: @ivirshup @martindurant et al), @Mubarraqqq, but it's a pretty sizable project.

@Mubarraqqq

No problem. I'm still willing to work on the issue regardless.
I'll update you if I have any suggestions, and I'll also reach out to @ivirshup and @martindurant along the way for help.

@martindurant
Member

Perhaps this one is the other way around from what we've been talking about: there is interest in using zarr as storage for the various sparse layouts, but here you have a dense zarr array and want to insert sparse rows into a DB. I don't see why you couldn't do this block-wise with dask, or even serially.

@ivirshup

ivirshup commented Nov 2, 2022

Here's a slightly simplified version of the code for doing this in anndata:

Example implementation
from scipy import sparse
import zarr


def idx_chunks_along_axis(shape: tuple, axis: int, chunk_size: int):
    """\
    Gives indexer tuples chunked along an axis.

    Params
    ------
    shape
        Shape of array to be chunked
    axis
        Axis to chunk along
    chunk_size
        Size of chunk along axis

    Returns
    -------
    An iterator of tuples for indexing into an array of passed shape.
    """
    total = shape[axis]
    cur = 0
    mutable_idx = [slice(None) for i in range(len(shape))]
    while cur + chunk_size < total:
        mutable_idx[axis] = slice(cur, cur + chunk_size)
        yield tuple(mutable_idx)
        cur += chunk_size
    mutable_idx[axis] = slice(cur, None)
    yield tuple(mutable_idx)


def read_dense_as_csr(array: zarr.Array) -> sparse.csr_matrix:
    """Read a dense 2D zarr array into a CSR matrix, one row-chunk at a time."""
    axis_chunk = array.chunks[0]
    sub_matrices = []
    for idx in idx_chunks_along_axis(array.shape, 0, axis_chunk):
        dense_chunk = array[idx]  # only one chunk of rows is dense in memory
        sub_matrix = sparse.csr_matrix(dense_chunk)
        sub_matrices.append(sub_matrix)
    return sparse.vstack(sub_matrices, format="csr")

Could probably be improved with parallelization or tailored to your memory needs. Original code here
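For the fast random-row-access goal from the original question: once the data is in CSR form, row `i`'s nonzero columns sit in the contiguous slice `indices[indptr[i]:indptr[i+1]]`, which is also a natural per-row payload to store in a database. A tiny illustration (hypothetical data, not from the thread):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 0, 0]], dtype=bool)
csr = sparse.csr_matrix(dense)

# Nonzero column indices of row i, straight from the CSR buffers:
i = 1
row_cols = csr.indices[csr.indptr[i]:csr.indptr[i + 1]]
# row_cols → [0, 2]
```

Because `indptr` gives the start/end offsets per row, each lookup touches only that row's data, independent of the matrix's total size.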
