Replies: 1 comment 1 reply
-
I might not be understanding properly, but suggestion (4) might be related to "ragged" arrays, see #7988
-
Hi,
Intro
I have been using xarray for neuroscience data analysis over the past few months and it has been great. Viewing data as just a big set of arrays with dimension names and coordinates is awesome.
Problem
However, things get more complicated when the data is somewhat sparse.
Let us consider a simple (non-sparse) example: time-series recordings from 10 subjects of each of 3 species, with 50 trials per subject and 10**4 time points per trial. I would naturally structure this as a DataArray with 4 dimensions:
(species: 3, subject_num: 10, trial_num: 50, t: 10**4)
and everything would be fairly easy using a mixture of apply_ufunc-based functions, sel and where. However, if the array suddenly becomes unbalanced, for example some of the subjects have 200 trials, or the trial recording time is not always the same, then this approach causes issues, the main one being memory: the array has to be padded with NaN up to the largest extent of each ragged dimension. The code then needs to be written differently, usually in a much less elegant way...
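For concreteness, here is a minimal sketch of the balanced case (the species names and random data are of course made up):

```python
import numpy as np
import xarray as xr

# Balanced toy dataset: 3 species x 10 subjects x 50 trials x 10**4 time points.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(3, 10, 50, 10**4)),
    dims=("species", "subject_num", "trial_num", "t"),
    coords={"species": ["mouse", "rat", "ferret"]},
)

mouse = da.sel(species="mouse")   # label-based selection
positive = da.where(da > 0)       # masking

# Reduce over the core dimension t with a plain numpy function.
trial_means = xr.apply_ufunc(
    np.mean, da,
    input_core_dims=[["t"]],
    kwargs={"axis": -1},
)
```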
Vocab: a dimension is "dense" if every 1-D slice along it (obtained by fixing an index on each of the other dimensions) is either entirely NaN or entirely populated. Otherwise, it is said to be sparse.
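One way to make this definition concrete (is_dense is a hypothetical helper, not an xarray function):

```python
import xarray as xr

def is_dense(da: xr.DataArray, dim: str) -> bool:
    """True iff every 1-D slice along `dim` is entirely NaN or entirely populated."""
    n_nan = da.isnull().sum(dim)  # NaN count of each slice along `dim`
    return bool(((n_nan == 0) | (n_nan == da.sizes[dim])).all())
```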
Current solutions
It seems that there are several supported approaches (each one is sketched in code after the list):
1. Use the sparse array backend. The main drawback of this approach is that if you have a mix of sparse and dense dimensions, you do not take advantage of the speedup for the dense dimensions (e.g. apply_ufunc when the input_core_dims are dense). Furthermore, in my small test of it, sparse arrays support only a subset of the operations numpy arrays support.
2. Stack the sparse dimensions using xarray. This works fine but the code gets more complicated: one needs to use groupby and then apply_ufunc-based code, ... Simply put, it is not transparent at all. On a side note, is there any use case for the xarray stack/unstack functions other than this one (i.e. when would one prefer a MultiIndex over a dimension if the resulting array were not sparse)?
3. Use the dask backend to avoid memory issues. This works fairly well and requires very small changes to the code, but it can have a huge performance impact depending on how sparse the array is.
4. Use a hierarchical structure where each level of the hierarchy is "regular". For example, if in our example only dimension t is sparse, we can use a DataArray with dimensions (species, subject_num, trial_num) whose values are DataArrays with dimension t. The drawback of this approach is that manipulations whose input_core_dims mix child and parent dimensions would be very inefficient. However, the rest would be quite nice.
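To illustrate, here is roughly what approach 1 looks like, scaled down, assuming the sparse package is installed (the sizes and fill pattern are invented):

```python
import numpy as np
import sparse
import xarray as xr

# Pad to the largest trial count and recording length with NaN...
dense = np.full((3, 10, 200, 1000), np.nan)
rng = np.random.default_rng(0)
dense[0, 0, :50, :800] = rng.normal(size=(50, 800))      # a subject with 50 shortish trials
dense[1, 3, :200, :1000] = rng.normal(size=(200, 1000))  # a subject with 200 full trials

# ...then store as COO so the padding costs (almost) no memory.
coo = sparse.COO.from_numpy(dense, fill_value=np.nan)
da = xr.DataArray(coo, dims=("species", "subject_num", "trial_num", "t"))

print(f"padded: {dense.nbytes / 1e6:.0f} MB, COO: {coo.nbytes / 1e6:.1f} MB")
# Every operation now goes through the sparse code path, even along the dense
# dimension t, and sparse implements only a subset of the numpy API.
```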
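Approach 2 in code, on a scaled-down padded array (the shapes are again invented):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = np.full((2, 3, 4, 100), np.nan)       # NaN marks cells that do not exist
data[0, 0, :2] = rng.normal(size=(2, 100))   # this subject only ran 2 trials
data[1, 2, :4] = rng.normal(size=(4, 100))
da = xr.DataArray(data, dims=("species", "subject_num", "trial_num", "t"))

# Stack the ragged dimensions into one long "obs" dimension (a MultiIndex)
# and drop the trials that never happened.
stacked = da.stack(obs=("species", "subject_num", "trial_num"))
stacked = stacked.dropna("obs", how="all")

# Simple per-subject statistics now need groupby machinery:
per_subject = stacked.groupby("subject_num").mean("obs")
```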
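Approach 3, assuming dask is installed, with the same kind of padded toy array:

```python
import numpy as np
import xarray as xr

data = np.full((2, 3, 4, 100), np.nan)
data[0, 0, :2] = 1.0   # the only recorded trials in this toy example
da = xr.DataArray(data, dims=("species", "subject_num", "trial_num", "t"))

# One chunk per subject: computation becomes lazy and out-of-core, but the
# all-NaN padding chunks still flow through the task graph, which is where
# the performance hit comes from on very sparse data.
da_lazy = da.chunk({"species": 1, "subject_num": 1})
per_trial = da_lazy.mean("t").compute()
```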
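Finally, a sketch of approach 4 using an object-dtype outer array (this is one possible encoding of the hierarchy, not an established xarray pattern):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
outer = np.empty((2, 3, 4), dtype=object)
for idx in np.ndindex(outer.shape):
    n_t = int(rng.integers(50, 150))  # ragged recording lengths
    outer[idx] = xr.DataArray(rng.normal(size=n_t), dims="t")
da = xr.DataArray(outer, dims=("species", "subject_num", "trial_num"))

# Operations purely inside one trial are easy; operations mixing the outer
# dimensions with the inner t dimension (input_core_dims spanning both
# levels) are the inefficient part.
```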
A Better Approach?
I feel like the best solution would use the idea of 2), but on the array-backend side rather than the xarray side. That is, have a numpy-like array where one of the dimensions is actually a COO sparse dimension. This would of course require lots of work (and I am not suggesting xarray developers should work in that direction), and I may be completely wrong about the drawbacks of this solution, but would that approach make sense? A rough sketch of the idea follows.
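Purely hypothetical sketch of such a "semi-sparse" container: COO coordinates along the sparse dimensions, contiguous numpy blocks along the dense ones (none of this exists in xarray or sparse):

```python
import numpy as np

class SemiSparseArray:
    """Hypothetical: sparse (COO) along the leading dims, dense along the rest.

    coords[:, i] holds the (species, subject_num, trial_num) indices of the
    i-th stored trial; data[i] is that trial's full dense block over t.
    """

    def __init__(self, coords: np.ndarray, data: np.ndarray, sparse_shape: tuple):
        self.coords = coords            # (n_sparse_dims, nnz) integer indices
        self.data = data                # (nnz, *dense_shape) dense blocks
        self.sparse_shape = sparse_shape

    def reduce_dense(self, func) -> np.ndarray:
        # Reductions over the dense dimensions stay plain, vectorized numpy
        # calls over a contiguous buffer, e.g. reduce_dense(np.mean) for
        # per-trial means over t, computed for all existing trials at once.
        return func(self.data, axis=tuple(range(1, self.data.ndim)))
```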
Side Note: masks
On a side note, sparse arrays also have another drawback: if their dtype is not object or float, things become quite annoying, because returning np.nan casts the array to float (see the snippet below)... Furthermore, I would argue that a NaN produced by a computation (for example a division by zero) is semantically different from "this cell of the array does not exist". Does xarray work with the numpy masked array (numpy.ma) backend? Is that considered the best solution?
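A minimal illustration of the casting problem, and of how numpy.ma keeps the dtype by carrying the mask separately:

```python
import numpy as np

arr = np.array([1, 2, 3])  # integer dtype

# Representing "missing" with np.nan forces a cast to float:
masked_with_nan = np.where([True, False, True], arr, np.nan)
print(masked_with_nan.dtype)  # float64

# A numpy masked array keeps the integer dtype and stores the mask separately:
marr = np.ma.masked_array(arr, mask=[False, True, False])
print(marr.dtype)  # int64 (platform-dependent width)
```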
Has xarray considered solutions to this problem? (I believe pandas now has one.)