
numba njit support #771

Open
aero-108 opened this issue Jun 8, 2021 · 8 comments

@aero-108

aero-108 commented Jun 8, 2021

Hello,
are there any plans to support numba njit?
It could be very useful.

import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0


zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')

test(zarr_arr[0:100])

 zarr_arr[0:100]
Out[4]: 
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan])
@jakirkham
Member

Not sure I follow. Zarr is a storage format for arrays. One can use those arrays in other pipelines with Numba, Dask, etc.

@aero-108
Author

aero-108 commented Jun 8, 2021

@jakirkham
as shown in the example, numba is not really friendly with zarr arrays, and any modifications won't propagate back.

@jakirkham
Member

Yeah, I wouldn't really expect in-place modifications to work. There's a lot of encoding, decompression, chunk aggregation, etc. logic that needs to happen.

zarr_arr[0:100] will return a NumPy array. Modifying that in-place won't affect Zarr. One could pass the whole Zarr array into that function, but then one would need to explicitly write the result back out.

If one loaded this with Dask, one could use map_blocks to apply the function over the array and then write the result back out to Zarr.

@jakirkham
Member

If I were to try with that code, I would do something like this...

import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0


zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')

arr = zarr_arr[0:100]
test(arr)
zarr_arr[0:100] = arr

(I should add that I haven't run this, so please check it as well.)

@ivirshup

To pick up this thread a bit – I think numba support would be really great.

I think the biggest use case here is working with very large amounts of data as quickly as possible. Ideally I could iterate through chunks of my zarr array and compute something from them without having to swap back and forth between numba and python. This process is a bit of a pain and has a fair bit of overhead.

I can see how this would be difficult. Not sure how much of the stack (e.g. possibly all of numcodecs?) would have to be compiled to make this work.

@rabernat
Contributor

rabernat commented Dec 16, 2021

I think the biggest use case here is working with very large amounts of data as quickly as possible.

This is generally what everyone wants to do. But I'm not sure that this proposed integration is needed for it. Zarr and Numba solve orthogonal problems. Zarr accelerates data I/O, which can speed up I/O bound problems. Zarr will help get data from files or object storage into memory quickly. Numba accelerates computation, which can speed up compute bound problems. Numba operates on in-memory data.

If you want to process a lot of data quickly using Zarr and Numba:

  • Use Zarr to read the data into memory as a numpy array
  • Call a Numba function on that numpy array

Can you explain why this workflow does not meet your needs?

A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.

@jakirkham
Member

+1 to everything Ryan said.

A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.

I think an interesting question for users asking about this would be: are you using, or have you tried using, Zarr + Dask + Numba? If so, what pain points have you experienced when doing that? What do your workflows look like?

If we find enough common use cases of such a workflow above, we might be able to dive deeper into how these could be improved.

@ivirshup

I wanted to write up a longer response to this with an example for indexing into an on disk sparse matrix, but that requires me digging up some old branches. I think I've got a nice small example though.

I would like to be able to efficiently search a sorted set of genomic intervals. E.g. bedtools intersect, but with my genomic ranges stored in a chunked columnar format. My current use case is ATAC-seq data, and would involve a file similar to how ranges are stored by ArchR.

Basically I would need to do a pair of binary searches over the start and end columns. I would like to have just one implementation of the search.

I may also want to find all sets of overlaps, by iterating through a pair of interval sets.

This code gets quite messy if there has to be a function barrier between the numba code and the code that retrieves the chunks from files. In both of these cases I would be dynamically choosing which chunks are read and when, so these are not good fits for dask.
