Explicit indexes in xarray's data-model (Future of MultiIndex) #1603

Open
fujiisoup opened this issue Oct 4, 2017 · 56 comments

Comments

@fujiisoup
Member

@fujiisoup fujiisoup commented Oct 4, 2017

I think we can continue the discussion we have in #1426 about MultiIndex here.

In a comment there, @shoyer recommended removing MultiIndex from the public API.

I agree with this, as long as my code keeps working after the change.

I think that if we had a list of possible MultiIndex use cases here,
it would be easier to discuss them in depth and reach a consensus on the future API.

The current limitations of MultiIndex are:

  • It drops the scalar coordinate after selection #1408, #1491
  • It cannot be serialized to NetCDF #1077
  • Stack/unstack behaviors are inconsistent #1431
@fujiisoup
Member Author

@fujiisoup fujiisoup commented Oct 4, 2017

I'm using MultiIndex a lot,
but I noticed that it is just a workaround for indexing along multiple kinds of coordinates.

Consider the following example,

In [1]: import numpy as np
   ...: import xarray as xr
   ...: da = xr.DataArray(np.arange(5), dims=['x'],
   ...:                   coords={'experiment': ('x', [0, 0, 0, 1, 1]),
   ...:                           'time': ('x', [0.0, 0.1, 0.2, 0.0, 0.15])})
   ...: 

In [2]: da
Out[2]: 
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
    experiment  (x) int64 0 0 0 1 1 
    time        (x) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x

I want to do something like this

da.sel(experiment=0).sel(time=0.1)

but this does not work.
A MultiIndex enables it:

In [2]: da = da.set_index(exp_time=['experiment', 'time'])
   ...: da
   ...: 
Out[2]: 
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
  * exp_time    (exp_time) MultiIndex
  - experiment  (exp_time) int64 0 0 0 1 1 
  - time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x

If we could select on a non-index coordinate,
a MultiIndex would not be necessary in this case.

I think there are other important use cases for MultiIndex.
I would be happy if anyone could list them in this issue.
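
To make the comparison concrete, here is a minimal sketch using plain numpy and pandas rather than xarray itself (the variable names are illustrative only). It contrasts selecting on non-index coordinates with a boolean mask against a lookup through a MultiIndex built from the same two coordinates:

```python
import numpy as np
import pandas as pd

data = np.arange(5)
experiment = np.array([0, 0, 0, 1, 1])
time = np.array([0.0, 0.1, 0.2, 0.0, 0.15])

# Selection on non-index coordinates: compare every element (a linear scan).
mask = (experiment == 0) & (time == 0.1)
selected_by_mask = data[mask]

# The same selection through a MultiIndex built from both coordinates.
index = pd.MultiIndex.from_arrays([experiment, time], names=["experiment", "time"])
series = pd.Series(data, index=index)
selected_by_index = series.loc[(0, 0.1)]
```

Both paths pick out the same element; the difference is only in whether an index object has to be built first.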

@shoyer
Member

@shoyer shoyer commented Oct 4, 2017

One API design challenge here is that I think we still want an explicit notion of "indexed" variables. We could possibly allow operations like .sel() on non-indexed variables, but they would be slower, because we would not want to create expensive hash-tables (i.e., pandas.Index) in a non-transparent fashion.
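
A rough sketch of that trade-off, assuming plain numpy and pandas (the helper `position_by_scan` is hypothetical and not part of xarray): a lookup without an index has to compare every value, while an explicit pandas.Index is built once and then reused.

```python
import numpy as np
import pandas as pd

time = np.array([0.0, 0.1, 0.2, 0.3, 0.4])


def position_by_scan(values, label):
    """Find a label without an index: compare every element (O(n) per lookup)."""
    matches = np.flatnonzero(values == label)
    if matches.size == 0:
        raise KeyError(label)
    return int(matches[0])


# An explicit index pays a one-time cost to build its lookup structure,
# after which lookups no longer need to scan all the values.
time_index = pd.Index(time)

pos_scan = position_by_scan(time, 0.2)
pos_indexed = time_index.get_loc(0.2)
```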

@shoyer
Member

@shoyer shoyer commented Oct 4, 2017

I sometimes find it helpful to think about what the right repr() looks like, and then work backwards from there to the right data model.

For example, we might imagine that "Indexes" are no longer coordinates, but instead their own entry in the repr:

<xarray.Dataset (exp_time: 5)>
Coordinates:
  * experiment  (exp_time) int64 0 0 0 1 1 
  * time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Indexes:
    exp_time: pandas.MultiIndex[experiment, time]

"Indexes" might not even need to be part of the main Dataset.__repr__, but it would certainly be the repr for Dataset.indexes. Other entries could include:

    time: pandas.Datetime64Index[time]
    space: scipy.spatial.KDTree[latitude, longitude]

In this model:

  1. We would promote "Indexes" to a first-class concept in the xarray data model:
    (a) The levels of a MultiIndex would have corresponding Variable objects and be found in coords.
    (b) In contrast, the MultiIndex would not have a corresponding Variable object or be part of coords, though it could still be returned upon __getitem__ access (computed on demand from .indexes).
    (c) Dataset and DataArray would gain an indexes argument in their constructors, which could be used for passing indexes on to new xarray objects.
  2. Coordinates marked with * are part of an index. They can't be modified, unless all corresponding indexes are removed.
  3. Indexes would still be propagated, like coordinates.
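
The model above could be sketched roughly as follows. This is a toy illustration only: `ToyDataset` and its methods are hypothetical names, not xarray API, and the repr is a simplified imitation of the one proposed above.

```python
import numpy as np
import pandas as pd


class ToyDataset:
    """Hypothetical container: coordinates and indexes are stored separately."""

    def __init__(self, coords, indexes=None):
        # coords: name -> (dimension name, 1D values)
        self.coords = {
            name: (dim, np.asarray(values)) for name, (dim, values) in coords.items()
        }
        # indexes: index name -> pandas.Index (possibly a MultiIndex)
        self.indexes = dict(indexes or {})

    def indexed_coords(self):
        """Names of coordinates that belong to at least one index."""
        names = set()
        for index in self.indexes.values():
            names.update(n for n in index.names if n is not None)
        return names

    def __repr__(self):
        indexed = self.indexed_coords()
        lines = ["Coordinates:"]
        for name, (dim, values) in self.coords.items():
            marker = "*" if name in indexed else " "
            lines.append(f"  {marker} {name}  ({dim}) {values.dtype}")
        lines.append("Indexes:")
        for name, index in self.indexes.items():
            levels = ", ".join(str(n) for n in index.names)
            lines.append(f"    {name}: {type(index).__name__}[{levels}]")
        return "\n".join(lines)


experiment = [0, 0, 0, 1, 1]
time = [0.0, 0.1, 0.2, 0.0, 0.15]
ds = ToyDataset(
    coords={"experiment": ("exp_time", experiment), "time": ("exp_time", time)},
    indexes={
        "exp_time": pd.MultiIndex.from_arrays(
            [experiment, time], names=["experiment", "time"]
        )
    },
)
print(ds)
```

The level variables live in `coords`, while the MultiIndex itself lives only in `indexes`, which mirrors points (a) and (b).
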
@fujiisoup
Member Author

@fujiisoup fujiisoup commented Oct 4, 2017

I think we currently assume variables[dim] is an Index.
Does your proposal mean that Dataset will keep an additional attribute indexes, and indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

It sounds like a much cleaner data model.

@shoyer
Member

@shoyer shoyer commented Oct 4, 2017

Does your proposal mean that Dataset will keep an additional attribute indexes, and indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

Yes, exactly. We actually already have an attribute that works like this, but it's currently computed lazily, from either Dataset._variables or DataArray._coords.

@shoyer
Member

@shoyer shoyer commented Oct 4, 2017

@benbovy
Member

@benbovy benbovy commented Oct 4, 2017

I think that promoting "Indexes" to a first-class concept is indeed a very good idea, at both internal and public levels, even if at the latter level it would be another concept for users (it should be already familiar for pandas users, though). IMHO the "coordinate" and "index" concepts are different enough to consider them separately.

I like the proposed repr for Dataset.indexes. I wouldn't mind if it is not included in Dataset.__repr__, considering that multi-indexes, kdtree, etc. only represent a few use cases. In too many cases it could result in a long, uninformative list of simple pandas.Index.

I have to think a bit more about the details but I like the idea.

@fujiisoup
Member Author

@fujiisoup fujiisoup commented Oct 4, 2017

@shoyer, could you add more details about this idea?
I think I do not yet fully understand the practical difference between dim and index.

  1. Use cases of the independent Index and dims
    Would it be the general case that dimension and index are independent?
    (Or is that the case only for MultiIndex and KDTree?)

  2. MultiIndex implementation
    In the MultiIndex case, will an xarray object store a MultiIndex object and also the level variables as Variable objects (there will be some duplicates)?
    If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim,
    because a single dimension can have multiple indexes.

@shoyer
Member

@shoyer shoyer commented Oct 4, 2017

  1. Use cases of the independent Index and dims
    Would it be the general case that dimension and index are independent?
    (Or is that the case only for MultiIndex and KDTree?)

We would still assign default indexes (using a normal pandas.Index) when you assign a 1D coordinate with matching name and dimension. But in general, yes, it seems like you should be able to make an index even for variables that aren't dimensions, including for a 1D variable whose name does not match a dimension. The rule would be that any coordinates can be part of an index.

Another aspect to consider is how to handle alignment when you have indexes along non-dimension coordinates. Probably the most elegant rule would again be to check all indexed variables for exact matches.
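
One way to read the "exact match" rule, sketched with plain pandas (the function name `indexes_match_exactly` is hypothetical, and restricting the check to indexes present on both sides is an assumption of this sketch):

```python
import pandas as pd


def indexes_match_exactly(indexes_a, indexes_b):
    """Return True if every index present in both mappings has identical values."""
    shared = set(indexes_a) & set(indexes_b)
    return all(indexes_a[name].equals(indexes_b[name]) for name in shared)


a = {"time": pd.Index([0.0, 0.1, 0.2]), "station": pd.Index(["x", "y", "z"])}
b = {"time": pd.Index([0.0, 0.1, 0.2])}
c = {"time": pd.Index([0.0, 0.1, 0.3])}

print(indexes_match_exactly(a, b))
print(indexes_match_exactly(a, c))
```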

Directly assigning indexes rather than using this default or set_index() would be an advanced feature, not recommended for everyday use. The main use case is routines which create a new xarray object based on an existing one, and want to re-use old indexes.

For performance reasons, we probably do not want to actually check the values of manually assigned indexes, although we should verify that the shape matches. (We would have a clear disclaimer that if you manually assign an index with mismatched values the behavior is not well defined.)
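
A minimal sketch of such a shape-only check, assuming indexes are kept in a plain dict (`assign_index` is a hypothetical helper, not an xarray function):

```python
import numpy as np
import pandas as pd


def assign_index(indexes, name, index, coord_values):
    """Attach a user-supplied index, checking only that its length matches."""
    coord_values = np.asarray(coord_values)
    if len(index) != coord_values.shape[0]:
        raise ValueError(
            f"index {name!r} has length {len(index)}, "
            f"but the coordinate has length {coord_values.shape[0]}"
        )
    # Deliberately no comparison of index values against coord_values:
    # that would cost O(n) on every assignment.
    new_indexes = dict(indexes)
    new_indexes[name] = index
    return new_indexes


indexes = assign_index({}, "time", pd.Index([0.0, 0.1, 0.2]), [0.0, 0.1, 0.2])
```

An index whose values disagree with the coordinate passes this check, which is exactly the case the disclaimer would cover.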

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. We would need to figure out how to propagate and compare indexes like this. (I suppose if the coordinate values match, the result could have the union of all indexes from input arguments.)

  2. MultiIndex implementation
    In the MultiIndex case, will an xarray object store a MultiIndex object and also the level variables as Variable objects (there will be some duplicates)?

Yes, this is a little unfortunate. We could potentially make a custom wrapper for use in IndexVariable._data on the level variables that lazily computes values from the MultiIndex (similar to our LazilyIndexedArray class), but I'm not certain yet that this is necessary.
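
Such a wrapper might look roughly like the sketch below. `LazyLevelValues` is a hypothetical class written for illustration; it is not xarray's LazilyIndexedArray and only shows the idea of deriving level values from the MultiIndex on demand instead of storing a second copy.

```python
import numpy as np
import pandas as pd


class LazyLevelValues:
    """Hypothetical array-like that derives a level's values from a MultiIndex on demand."""

    def __init__(self, multi_index, level):
        self._multi_index = multi_index
        self._level = level

    @property
    def shape(self):
        return (len(self._multi_index),)

    def __array__(self, dtype=None, copy=None):
        # Values are materialized only when something asks for an array.
        values = self._multi_index.get_level_values(self._level).to_numpy()
        return values if dtype is None else values.astype(dtype)


mi = pd.MultiIndex.from_arrays(
    [[0, 0, 1], [0.0, 0.1, 0.0]], names=["experiment", "time"]
)
lazy_time = LazyLevelValues(mi, "time")
time_values = np.asarray(lazy_time)
```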

If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim,
because a single dimension can have multiple indexes.

Every entry in indexes should be a single pandas.Index or subclass, including MultiIndex (possibly eventually allowing for index-like objects such as something based on a KDTree).

@shoyer shoyer mentioned this issue Oct 9, 2017
4 of 4 tasks complete
@fujiisoup
Member Author

@fujiisoup fujiisoup commented Oct 13, 2017

Thanks for the details.
(Sorry for my late response. It took me a while to understand what this would look like.)

I am wondering which use cases this Index concept would enable.
As far as I understand,

  1. It will enable more flexible indexing, e.g. more than one Index can be associated with one dimension, and we can select from these coordinate values very flexibly.
  2. It will naturally integrate more advanced indexes such as KDTree.

Are these correct?

Probably the most elegant rule would again be to check all indexed variables for exact matches.

That sounds reasonable.

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space].

I like the latter one, as it is easier to understand even for non-pandas users.

What would the actual implementation look like?
Will xr.Dataset.indexes be an OrderedDict that maps a variable's name to its associated dimension?
Will the actual Index instance be one of xr.Dataset.variables?

@shoyer
Member

@shoyer shoyer commented Oct 13, 2017

I am wondering which use cases this Index concept would enable.

The other advantage is that it solves many of the issues with the current MultiIndex implementation. Making MultiIndex levels their own variables considerably simplifies the data model, and means that many features (including serialization) should "just work".

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space].
I like the latter one, as it is easier to understand even for non-pandas users.

I agree, but there are probably some advantages to using a MultiIndex internally. For example, it allows for looking up on multiple levels at the same time.
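
The difference can be sketched with plain pandas, reusing the coordinate values from the example earlier in this thread. With one joint MultiIndex a single lookup resolves both levels; with two separate indexes, each label is looked up on its own and the positions are intersected afterwards.

```python
import numpy as np
import pandas as pd

experiment = np.array([0, 0, 0, 1, 1])
time = np.array([0.0, 0.1, 0.2, 0.0, 0.15])

# One joint index: a single lookup resolves both levels together.
joint = pd.MultiIndex.from_arrays([experiment, time], names=["experiment", "time"])
pos_joint = joint.get_loc((1, 0.15))

# Two separate indexes: look up each label, then intersect the positions.
experiment_index = pd.Index(experiment, name="experiment")
time_index = pd.Index(time, name="time")
pos_experiment = experiment_index.get_indexer_for([1])
pos_time = time_index.get_indexer_for([0.15])
pos_separate = np.intersect1d(pos_experiment, pos_time)
```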

What would the actual implementation look like?
Will xr.Dataset.indexes be an OrderedDict that maps a variable's name to its associated dimension?
Will the actual Index instance be one of xr.Dataset.variables?

I think we could get away with making xr.Dataset.indexes simply a dict, with keys given by index names and values given by a pandas.Index instance. We should enforce that Index.name or MultiIndex.names corresponds to coordinate variables.

For KDTree, this means we'll have to write our own wrapper KDTreeIndex that adds a names property, but we would probably need to add special methods like get_indexer anyways.
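
As a rough illustration of the naming rule, the sketch below checks that every index only refers to known coordinate names. Both `check_index_names` and `NamedIndexWrapper` are hypothetical; the wrapper stands in for something like the KDTreeIndex mentioned above and wraps a placeholder object rather than a real tree.

```python
import pandas as pd


def check_index_names(indexes, coord_names):
    """Raise if any index refers to a name that is not a coordinate variable."""
    for key, index in indexes.items():
        missing = [name for name in index.names if name not in coord_names]
        if missing:
            raise ValueError(f"index {key!r} refers to unknown coordinates: {missing}")


class NamedIndexWrapper:
    """Hypothetical wrapper giving a non-pandas index object a `names` property."""

    def __init__(self, wrapped, names):
        self._wrapped = wrapped
        self._names = tuple(names)

    @property
    def names(self):
        return self._names


coord_names = {"experiment", "time", "latitude", "longitude"}
indexes = {
    "exp_time": pd.MultiIndex.from_arrays(
        [[0, 0, 1], [0.0, 0.1, 0.0]], names=["experiment", "time"]
    ),
    # A placeholder object stands in for a spatial tree here.
    "space": NamedIndexWrapper(object(), ["latitude", "longitude"]),
}
check_index_names(indexes, coord_names)
```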

@alimanfoo
Contributor

@alimanfoo alimanfoo commented Oct 23, 2017

Just to say I'm interested in how MultiIndexes are handled also. In our use case, we have two variables conventionally named CHROM (chromosome) and POS (position) which together describe a location in a genome. I want to combine both variables into a multi-index so I can, e.g., select all data from some data variable for chromosome X between positions 100,000-200,000. For all our data variables, this genome location multi-index would be used to index the first dimension.
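
In plain pandas, this kind of query might be sketched as follows. The chromosome names, positions, and the `calls` variable are made-up illustrative data, and this shows pandas behaviour only, not a proposed xarray API.

```python
import numpy as np
import pandas as pd

chrom = ["1", "1", "X", "X", "X", "X"]
pos = [50_000, 150_000, 90_000, 120_000, 180_000, 250_000]
genome_index = pd.MultiIndex.from_arrays([chrom, pos], names=["CHROM", "POS"])
calls = pd.Series(np.arange(6), index=genome_index, name="calls")

# All data on chromosome X, then a label-based (inclusive) range on POS.
on_x = calls.loc["X"]
region = on_x.loc[100_000:200_000]
```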

@jjpr-mit

@jjpr-mit jjpr-mit commented Oct 27, 2017

Will the new API preserve the order of the levels? One of the features that's necessary for MultiIndex to be truly hierarchical is that there is a defined order to the levels.

@shoyer
Member

@shoyer shoyer commented Oct 27, 2017

@jjpr-mit can you explain your use case a little more? What sort of order-dependent queries do you want to do? The ones that come to mind for me are range-based queries, e.g., [('bar', 1) : ('foo', 9)].

I think it is still relatively easy to ensure a unique ordering between levels, based on the order of coordinate variables in the xarray dataset.

A bigger challenge is that for efficiency, these sorts of queries depend critically on having an actual MultiIndex. This means that if indexes for each of the levels arise from different arguments that were merged together, we might need to "merge" the separate indexes into a joint MultiIndex. This could potentially be slightly expensive.
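
A small pandas sketch of both points, with made-up labels: two per-level indexes are combined into one joint MultiIndex, and a range query is then run across the levels. The level order passed to `from_arrays` determines the sort order that the range relies on.

```python
import numpy as np
import pandas as pd

# Two per-level indexes, e.g. arriving from different merged arguments.
name_index = pd.Index(
    ["bar", "bar", "bar", "baz", "baz", "baz", "foo", "foo", "foo"], name="name"
)
number_index = pd.Index([1, 5, 9, 1, 5, 9, 1, 5, 9], name="number")

# "Merging" them into one joint MultiIndex; the level order defines the sort order.
joint = pd.MultiIndex.from_arrays(
    [name_index, number_index], names=["name", "number"]
)
values = pd.Series(np.arange(9), index=joint)

# Range query across levels (both endpoints inclusive).
selected = values.loc[("bar", 5):("foo", 1)]
```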

@shoyer shoyer changed the title from "Future of MultiIndex" to "Indexes as an explicit part of xarray's data-model (Future of MultiIndex)" Jan 5, 2018
@shoyer shoyer changed the title from "Indexes as an explicit part of xarray's data-model (Future of MultiIndex)" to "Explicit indexes in xarray's data-model (Future of MultiIndex)" Jan 5, 2018
@shoyer shoyer modified the milestones: 0.10.1, 1.0 Jan 31, 2018
@shoyer
Member

@shoyer shoyer commented Aug 21, 2019

Explicitly propagating indexes requires going through most of xarray's source code and auditing each time we create a Dataset or DataArray object with low-level operations. We have some pretty decent testing functions for this in the form of xarray.testing._assert_internal_invariants, so this is now a pretty mechanical process -- you know it's working if you're now setting indexes explicitly and xarray's test suite passes.

Here's our current progress:

  • most of dataset.py
  • alignment.py
  • merge.py (#3234)
  • concat.py
  • dataarray.py (#3519, #3481)
  • computation.py
  • groupby.py
  • resample.py
  • rolling.py
  • everything else!
@keewis keewis mentioned this issue Sep 15, 2019
16 of 16 tasks complete
@max-sixty max-sixty mentioned this issue Oct 31, 2019
0 of 3 tasks complete
@dcherian
Contributor

@dcherian dcherian commented Nov 3, 2019

@shoyer I was thinking of starting on one of the listed files. Do you have any tips? Are you working on any of those at present? Which might be the easiest one to begin with?

@shoyer
Member

@shoyer shoyer commented Nov 3, 2019

I'm not working on any of these right now. You might start with a few of the dataarray.py methods (no need to do them all at once) to get a sense of what piping these arguments around looks like. I suspect you could get quite a few of these working just by handling indexes in _to_temp_dataset/_from_temp_dataset.

@NowanIlfideme

@NowanIlfideme NowanIlfideme commented Nov 22, 2019

I've noticed that basically all my current troubles with xarray lead to this issue (lack of MultiIndex support). I use xarray for machine learning/data science/econometrics. My current problem requires a semi-hierarchical indexing on one of the dimensions, and slicing/aggregation along some levels of those dimensions.

My first attempt was to just assume each dimension was orthogonal, which resulted in out-of-memory errors. I ended up using a MultiIndex for the hierarchy dimension to have a "dense" representation of a sparse subspace. Unfortunately, currently .sel() and such will cut out MultiIndex dimensions, and I've had to do boolean masking to keep all the dimensions I need.

Multidimensional groupby, especially within the MultiIndex, is a headache as it currently stands. I had to resort to making auxiliary dimensions with one-hot encoded levels (dummy variables) and doing multiply-aggregate operations by hand.
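
For readers unfamiliar with this workaround, a minimal numpy sketch of the multiply-aggregate pattern is shown here. The labels and values are made up, and this is a guess at the general technique rather than the code actually used.

```python
import numpy as np

level_labels = np.array(["a", "a", "b", "c", "c", "c"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# One-hot (dummy) encoding: one column per distinct label.
groups = np.unique(level_labels)
one_hot = (level_labels[:, None] == groups[None, :]).astype(float)  # shape (6, 3)

# Multiply-aggregate: a matrix product sums the values within each group.
group_sums = one_hot.T @ values
group_means = group_sums / one_hot.sum(axis=0)
```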

xarray is really beautiful and should be used more by data scientists, but it's really difficult to recommend it to colleagues when not all the familiar pandas-style operations are supported.

@rabernat
Contributor

@rabernat rabernat commented Nov 22, 2019

Thanks @NowanIlfideme for your feedback.

Could you perhaps share a gist of code related to your use case?

@dcherian
Contributor

@dcherian dcherian commented Nov 22, 2019

My first attempt was to just assume each dimension was orthogonal, which resulted in out-of-memory errors

We have experimental support for https://sparse.pydata.org/en/latest/index.html that may help, but unfortunately no documentation yet. There are some details here: #3213 and #3484

@NowanIlfideme

@NowanIlfideme NowanIlfideme commented Nov 22, 2019

Thanks @NowanIlfideme for your feedback.

Could you perhaps share a gist of code related to your use case?

The first example in this comment is similar to my use case: #3213 (comment) . There are several "core" dimensions, but some part of the coordinates may be hierarchical or cross-defined (e.g. country > province > city > building, but also country > province > voting district > building). We might have a full or nearly-full panel in the MultiIndex representation, but have a huge cross product (even if we keep strictly hierarchical dimensions out).
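
A toy calculation of that size gap, using made-up hierarchy labels and only the standard library: the stacked (MultiIndex-like) representation needs one row per observed combination, while treating every level as an orthogonal dimension needs the full cross product.

```python
from math import prod

# Observed (country, province, city, building) combinations; made-up labels.
rows = [
    ("A", "p1", "c1", "b1"),
    ("A", "p1", "c1", "b2"),
    ("A", "p2", "c2", "b3"),
    ("B", "p3", "c3", "b4"),
    ("B", "p3", "c4", "b5"),
]

# Stacked representation: one entry per observed combination.
stacked_size = len(set(rows))

# Orthogonal dimensions: the cross product of the distinct labels per level.
level_sizes = [len(set(column)) for column in zip(*rows)]
dense_size = prod(level_sizes)

print(stacked_size, level_sizes, dense_size)
```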

Meanwhile using a true COO sparse representation (as I understand it) will likely end up with slower operations overall, since nearly all machine learning models (think: linear regression) require a dense array input anyways.

I'll make an example of this when I find some free time, along with a contrasting one in Pandas. :)

@max-sixty
Collaborator

@max-sixty max-sixty commented Nov 22, 2019

I'll make an example of this when I find some free time, along with a contrasting one in Pandas. :)

👍

@keewis keewis mentioned this issue Dec 4, 2019
13 of 18 tasks complete
@keewis keewis mentioned this issue Dec 21, 2019
3 of 3 tasks complete
@TomNicholas TomNicholas mentioned this issue Apr 2, 2020
2 of 4 tasks complete
@TomNicholas TomNicholas added this to To do in Explicit Indexes Apr 2, 2020
@Hoeze

@Hoeze Hoeze commented Apr 19, 2021

Many array types have implicit indices.
For example, sparse arrays have their coordinates / CSR representation as the primary index (.sel()), while a dense array's primary index is the position (.isel()).
Every labeled dimension is therefore just a separate mapping from a label to the index position in the array.
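
That view can be sketched in a few lines of plain Python. The `sel` and `isel` functions here are toy stand-ins named after the xarray methods, not the real implementations.

```python
labels = ["a", "b", "c", "d"]
data = [10, 20, 30, 40]

# A labeled dimension as an explicit mapping from label to position.
label_to_position = {label: position for position, label in enumerate(labels)}


def isel(data, position):
    """Positional lookup: the dense array's implicit index."""
    return data[position]


def sel(data, label):
    """Label lookup: translate the label to a position, then index by position."""
    return isel(data, label_to_position[label])


print(sel(data, "c"), isel(data, 2))
```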

Going one step further, one could have continuous dimensions where positional indexing (.isel()) does not really make sense.
Looking at TileDB's dimensions provides an example for this.

=> Having explicit and implicit indices on arrays would be awesome, even if they don't support all xarray features!
