Replies: 1 comment 1 reply
-
I might not be understanding properly, but suggestion (4) might be related to "ragged" arrays, see #7988
-
Hi,
Intro
I have been using xarray for neuroscience data analysis over the past few months and it has been great. Viewing data as just a big set of arrays with dimension names and coordinates is awesome.
Problem
However, things get more complicated when the data is somewhat sparse.
Let us consider a simple (non-sparse) example: time-series recordings from 10 subjects of each of 3 species, with 50 trials per subject and 10**4 time points per trial. I would naturally structure this as a DataArray with 4 dimensions:
(species: 3, subject_num: 10, trial_num: 50, t: 10**4)
and everything would be fairly easy using a mixture of apply_ufunc-based functions, sel and where. However, if the array suddenly becomes unbalanced, for example some of the subjects have 200 trials, or the trial recording time is not always the same, then this approach causes issues, the main one being memory: the array has to be padded with NaN up to the largest extent of each ragged dimension. The code then needs to be written differently, usually in a much less elegant way...
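For concreteness, here is a minimal sketch of the balanced case (the species names and random data are of course made up):

```python
import numpy as np
import xarray as xr

# Balanced toy dataset: 3 species x 10 subjects x 50 trials x 10**4 time points.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(3, 10, 50, 10**4)),
    dims=("species", "subject_num", "trial_num", "t"),
    coords={"species": ["mouse", "rat", "ferret"]},
)

mouse = da.sel(species="mouse")   # label-based selection
positive = da.where(da > 0)       # masking

# Reduce over the core dimension t with a plain numpy function.
trial_means = xr.apply_ufunc(
    np.mean, da,
    input_core_dims=[["t"]],
    kwargs={"axis": -1},
)
```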
Vocab: a dimension is "dense" if every 1-D slice along it (obtained by fixing an index on each of the other dimensions) is either entirely NaN or entirely populated. Otherwise, it is said to be sparse.
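One way to make this definition concrete (is_dense is a hypothetical helper, not an xarray function):

```python
import xarray as xr

def is_dense(da: xr.DataArray, dim: str) -> bool:
    """True iff every 1-D slice along `dim` is entirely NaN or entirely populated."""
    n_nan = da.isnull().sum(dim)  # NaN count of each slice along `dim`
    return bool(((n_nan == 0) | (n_nan == da.sizes[dim])).all())
```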
Current solutions
It seems that there are several supported approaches (each one is sketched in code after the list):
1. Use the sparse array backend. The main drawback of this approach is that if you have a mix of sparse and dense dimensions, you do not take advantage of the speedup for the dense dimensions (e.g. apply_ufunc when the input_core_dims are dense). Furthermore, in my small test of it, sparse arrays support only a subset of the operations numpy arrays support.
2. Stack the sparse dimensions using xarray. This works fine but the code gets more complicated: one needs to use groupby and then apply_ufunc-based code, ... Simply put, it is not transparent at all. On a side note, is there any use case for the xarray stack/unstack functions other than this one (i.e. when would one prefer a MultiIndex over a dimension if the resulting array were not sparse)?
3. Use the dask backend to avoid memory issues. This works fairly well and requires very small changes to the code, but it can have a huge performance impact depending on how sparse the array is.
4. Use a hierarchical structure where each level of the hierarchy is "regular". For example, if in our example only dimension t is sparse, we can use a DataArray with dimensions (species, subject_num, trial_num) whose values are DataArrays with dimension t. The drawback of this approach is that manipulations whose input_core_dims mix child and parent dimensions would be very inefficient. However, the rest would be quite nice.
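To illustrate, here is roughly what approach 1 looks like, scaled down, assuming the sparse package is installed (the sizes and fill pattern are invented):

```python
import numpy as np
import sparse
import xarray as xr

# Pad to the largest trial count and recording length with NaN...
dense = np.full((3, 10, 200, 1000), np.nan)
rng = np.random.default_rng(0)
dense[0, 0, :50, :800] = rng.normal(size=(50, 800))      # a subject with 50 shortish trials
dense[1, 3, :200, :1000] = rng.normal(size=(200, 1000))  # a subject with 200 full trials

# ...then store as COO so the padding costs (almost) no memory.
coo = sparse.COO.from_numpy(dense, fill_value=np.nan)
da = xr.DataArray(coo, dims=("species", "subject_num", "trial_num", "t"))

print(f"padded: {dense.nbytes / 1e6:.0f} MB, COO: {coo.nbytes / 1e6:.1f} MB")
# Every operation now goes through the sparse code path, even along the dense
# dimension t, and sparse implements only a subset of the numpy API.
```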
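Approach 2 in code, on a scaled-down padded array (the shapes are again invented):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = np.full((2, 3, 4, 100), np.nan)       # NaN marks cells that do not exist
data[0, 0, :2] = rng.normal(size=(2, 100))   # this subject only ran 2 trials
data[1, 2, :4] = rng.normal(size=(4, 100))
da = xr.DataArray(data, dims=("species", "subject_num", "trial_num", "t"))

# Stack the ragged dimensions into one long "obs" dimension (a MultiIndex)
# and drop the trials that never happened.
stacked = da.stack(obs=("species", "subject_num", "trial_num"))
stacked = stacked.dropna("obs", how="all")

# Simple per-subject statistics now need groupby machinery:
per_subject = stacked.groupby("subject_num").mean("obs")
```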
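Approach 3, assuming dask is installed, with the same kind of padded toy array:

```python
import numpy as np
import xarray as xr

data = np.full((2, 3, 4, 100), np.nan)
data[0, 0, :2] = 1.0   # the only recorded trials in this toy example
da = xr.DataArray(data, dims=("species", "subject_num", "trial_num", "t"))

# One chunk per subject: computation becomes lazy and out-of-core, but the
# all-NaN padding chunks still flow through the task graph, which is where
# the performance hit comes from on very sparse data.
da_lazy = da.chunk({"species": 1, "subject_num": 1})
per_trial = da_lazy.mean("t").compute()
```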
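Finally, a sketch of approach 4 using an object-dtype outer array (this is one possible encoding of the hierarchy, not an established xarray pattern):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
outer = np.empty((2, 3, 4), dtype=object)
for idx in np.ndindex(outer.shape):
    n_t = int(rng.integers(50, 150))  # ragged recording lengths
    outer[idx] = xr.DataArray(rng.normal(size=n_t), dims="t")
da = xr.DataArray(outer, dims=("species", "subject_num", "trial_num"))

# Operations purely inside one trial are easy; operations mixing the outer
# dimensions with the inner t dimension (input_core_dims spanning both
# levels) are the inefficient part.
```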
A Better Approach?
I feel like the best solution would use the idea of 2), but on the array-backend side rather than the xarray side. That is, have a numpy-like array where one of the dimensions is actually a COO sparse dimension. This would of course require lots of work (and I am not suggesting xarray developers should work in that direction), and I may be completely wrong about the drawbacks of this solution, but would that approach make sense? A rough sketch of the idea follows.
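Purely hypothetical sketch of such a "semi-sparse" container: COO coordinates along the sparse dimensions, contiguous numpy blocks along the dense ones (none of this exists in xarray or sparse):

```python
import numpy as np

class SemiSparseArray:
    """Hypothetical: sparse (COO) along the leading dims, dense along the rest.

    coords[:, i] holds the (species, subject_num, trial_num) indices of the
    i-th stored trial; data[i] is that trial's full dense block over t.
    """

    def __init__(self, coords: np.ndarray, data: np.ndarray, sparse_shape: tuple):
        self.coords = coords            # (n_sparse_dims, nnz) integer indices
        self.data = data                # (nnz, *dense_shape) dense blocks
        self.sparse_shape = sparse_shape

    def reduce_dense(self, func) -> np.ndarray:
        # Reductions over the dense dimensions stay plain, vectorized numpy
        # calls over a contiguous buffer, e.g. reduce_dense(np.mean) for
        # per-trial means over t, computed for all existing trials at once.
        return func(self.data, axis=tuple(range(1, self.data.ndim)))
```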
Side Note: masks
On a side note, sparse arrays also have another drawback: if their dtype is not object or float, things become quite annoying, because returning np.nan casts the array to float (see the snippet below)... Furthermore, I would argue that a NaN produced by a computation (for example a division by zero) is semantically different from "this cell of the array does not exist". Does xarray work with the numpy masked array (numpy.ma) backend? Is that considered the best solution?
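A minimal illustration of the casting problem, and of how numpy.ma keeps the dtype by carrying the mask separately:

```python
import numpy as np

arr = np.array([1, 2, 3])  # integer dtype

# Representing "missing" with np.nan forces a cast to float:
masked_with_nan = np.where([True, False, True], arr, np.nan)
print(masked_with_nan.dtype)  # float64

# A numpy masked array keeps the integer dtype and stores the mask separately:
marr = np.ma.masked_array(arr, mask=[False, True, False])
print(marr.dtype)  # int64 (platform-dependent width)
```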
Has xarray considered solutions to this problem? (I believe pandas now has one.)