
numba njit support #771

Open
aero-108 opened this issue Jun 8, 2021 · 8 comments

@aero-108

aero-108 commented Jun 8, 2021

Hello,
are there any plans to support numba njit?
It could be very useful.

import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0


zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')

test(zarr_arr[0:100])

 zarr_arr[0:100]
Out[4]: 
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan])
@jakirkham
Member

Not sure I follow. Zarr is a storage format for arrays. One can use those arrays in other pipelines with Numba, Dask, etc.

@aero-108
Author

aero-108 commented Jun 8, 2021

@jakirkham
as shown in the example, numba is not really friendly with zarr arrays, and any modifications won't propagate back.

@jakirkham
Member

Yeah, I wouldn't really expect in-place modifications to work. There's a lot of encoding, decompression, chunk aggregation, etc. logic that needs to happen.

zarr_arr[0:100] will return a NumPy array. Modifying that in-place won't affect Zarr. One could pass the whole Zarr array into that function, but then one would need to explicitly write the result back out.

If one loaded this with Dask, one could use map_blocks to apply the function over the array and then write the result back out to Zarr.

@jakirkham
Member

If I were to try with that code, I would do something like this...

import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0


zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')

arr = zarr_arr[0:100]
test(arr)
zarr_arr[0:100] = arr

(I should add that I haven't run this, so please check it as well.)

@ivirshup

To pick up this thread a bit – I think numba support would be really great.

I think the biggest use case here is working with very large amounts of data as quickly as possible. Ideally I could iterate through chunks of my zarr array and compute something from them without having to swap back and forth between numba and python. This process is a bit of a pain and has a fair bit of overhead.

I can see how this would be difficult. Not sure how much of the stack (e.g. possibly all of numcodecs?) would have to be compiled to make this work.

@rabernat
Contributor

rabernat commented Dec 16, 2021

I think the biggest use case here is working with very large amounts of data as quickly as possible.

This is generally what everyone wants to do. But I'm not sure that this proposed integration is needed for it. Zarr and Numba solve orthogonal problems. Zarr accelerates data I/O, which can speed up I/O bound problems. Zarr will help get data from files or object storage into memory quickly. Numba accelerates computation, which can speed up compute bound problems. Numba operates on in-memory data.

If you want to process a lot of data quickly using Zarr and Numba:

  • Use Zarr to read the data into memory as a numpy array
  • Call a Numba function on that numpy array

Can you explain why this workflow does not meet your needs?

A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.

@jakirkham
Member

+1 to everything Ryan said.

A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.

I think an interesting question for users asking about this would be: are you using, or have you tried using, Zarr + Dask + Numba? If so, what pain points have you experienced when doing that? What do your workflows look like?

If we find enough common use cases of such a workflow above, we might be able to dive deeper into how these could be improved.

@ivirshup

I wanted to write up a longer response to this with an example for indexing into an on disk sparse matrix, but that requires me digging up some old branches. I think I've got a nice small example though.

I would like to be able to efficiently search a sorted set of genomic intervals. E.g. bedtools intersect, but with my genomic ranges stored in a chunked columnar format. My current use case is ATAC-seq data, and would involve a file similar to how ranges are stored by ArchR.

Basically I would need to do a pair of binary searches over the start and end columns. I would like to have just one implementation of the search.

I may also want to find all sets of overlaps, by iterating through a pair of interval sets.

This code gets quite messy if there has to be a function barrier between the numba code and the code that retrieves the chunks from files. In both of these cases I would be dynamically choosing which chunks are read and when, so these are not good fits for dask.
