Explicit indexes in xarray's data-model (Future of MultiIndex) #1603

fujiisoup opened this Issue Oct 4, 2017 · 45 comments

fujiisoup commented Oct 4, 2017

I think we can continue the discussion we have in #1426 about MultiIndex here.

In one comment there, @shoyer recommended removing MultiIndex from the public API.

I agree with this, as long as my code keeps working after this improvement.

I think that if we collect a list of possible MultiIndex use cases here,
it will be easier to discuss them in depth and arrive at a consensus on the future API.

Current limitations of MultiIndex are:

  • It drops scalar coordinates after selection #1408, #1491
  • It does not support serialization to NetCDF #1077
  • Stack/unstack behaviors are inconsistent #1431

fujiisoup commented Oct 4, 2017

I'm using MultiIndex a lot,
but I have noticed that it is really just a workaround for indexing along multiple kinds of coordinates.

Consider the following example:

In [1]: import numpy as np
   ...: import xarray as xr
   ...: da = xr.DataArray(np.arange(5), dims=['x'],
   ...:                   coords={'experiment': ('x', [0, 0, 0, 1, 1]),
   ...:                           'time': ('x', [0.0, 0.1, 0.2, 0.0, 0.15])})
   ...: 

In [2]: da
Out[2]: 
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
    experiment  (x) int64 0 0 0 1 1 
    time        (x) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x

I want to do something like this

da.sel(experiment=0).sel(time=0.1)

but that is not possible.
A MultiIndex enables this:

In [2]: da = da.set_index(exp_time=['experiment', 'time'])
   ...: da
   ...: 
Out[2]: 
<xarray.DataArray (x: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
  * exp_time    (exp_time) MultiIndex
  - experiment  (exp_time) int64 0 0 0 1 1 
  - time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Dimensions without coordinates: x

If we could make selections from a non-index coordinate,
a MultiIndex would not be necessary for this case.

I think there must be other important use cases of MultiIndex.
I would be happy if anyone could list them in this issue.


shoyer commented Oct 4, 2017

One API design challenge here is that I think we still want an explicit notion of "indexed" variables. We could possibly allow operations like .sel() on non-indexed variables, but they would be slower, because we would not want to create expensive hash tables (i.e., pandas.Index) in a non-transparent fashion.
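For comparison, the slower path alluded to here is essentially a linear scan; with the da from the previous comment it can already be emulated with boolean masking (a minimal sketch, no new API involved):

# select experiment == 0 by scanning the coordinate values (no hash table built)
subset = da.where(da.experiment == 0, drop=True)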


shoyer commented Oct 4, 2017

I sometimes find it helpful to think about what the right repr() looks like, and then work backwards from there to the right data model.

For example, we might imagine that "Indexes" are no longer coordinates, but instead their own entry in the repr:

<xarray.Dataset (exp_time: 5)>
Coordinates:
  * experiment  (exp_time) int64 0 0 0 1 1 
  * time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
Indexes:
    exp_time: pandas.MultiIndex[experiment, time]

"Indexes" might not even need to be part of the main Dataset.__repr__, but it would certainly be the repr for Dataset.indexes. Other entries could include:

    time: pandas.Datetime64Index[time]
    space: scipy.spatial.KDTree[latitude, longitude]

In this model:

  1. We would promote "Indexes" to a first-class concept in the xarray data model:
    (a) The levels of a MultiIndex would have corresponding Variable objects and be found in coords.
    (b) In contrast, the MultiIndex would not have a corresponding Variable object or be part of coords, though it could still be returned upon __getitem__ access (computed on demand from .indexes).
    (c) Dataset and DataArray would gain an indexes argument in their constructors, which could be used for passing indexes on to new xarray objects (see the sketch after this list).
  2. Coordinates marked with * are part of an index. They can't be modified, unless all corresponding indexes are removed.
  3. Indexes would still be propagated, like coordinates.
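As a purely hypothetical sketch of point 1(c) (the indexes constructor argument does not exist today, and pandas.MultiIndex stands in for whatever index objects would eventually be accepted):

import numpy as np
import pandas as pd
import xarray as xr

exp_time = pd.MultiIndex.from_arrays(
    [[0, 0, 0, 1, 1], [0.0, 0.1, 0.2, 0.0, 0.15]],
    names=['experiment', 'time'])

ds = xr.Dataset(
    {'data': ('exp_time', np.arange(5))},
    coords={'experiment': ('exp_time', exp_time.get_level_values('experiment')),
            'time': ('exp_time', exp_time.get_level_values('time'))},
    indexes={'exp_time': exp_time},  # hypothetical argument, not part of xarray today
)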

fujiisoup commented Oct 4, 2017

I think we currently assume variables[dim] is an Index.
Does your proposal mean that Dataset will keep an additional attribute indexes, and that indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

It sounds like a much cleaner data model.


shoyer commented Oct 4, 2017

Does your proposal mean that Dataset will keep an additional attribute indexes, and that indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

Yes, exactly. We actually already have an attribute that works like this, but it's currently computed lazily, from either Dataset._variables or DataArray._coords.


benbovy commented Oct 4, 2017

I think that promoting "Indexes" to a first-class concept is indeed a very good idea, at both the internal and public levels, even if at the latter level it would be another concept for users (it should already be familiar to pandas users, though). IMHO the "coordinate" and "index" concepts are different enough to consider them separately.

I like the proposed repr for Dataset.indexes. I wouldn't mind if it is not included in Dataset.__repr__, considering that multi-indexes, kdtree, etc. only represent a few use cases. In too many cases it could result in a long, uninformative list of simple pandas.Index objects.

I have to think a bit more about the details but I like the idea.


fujiisoup commented Oct 4, 2017

@shoyer, could you add more details of this idea?
I think I do not yet fully understand the practical difference between dim and index.

  1. Use cases of independent Index and dims
    Would there be general cases where dimension and index are independent?
    (Or is that the case only for MultiIndex and KDTree?)

  2. MultiIndex implementation
    In the MultiIndex case, will an xarray object store a MultiIndex object and also the level variables as Variable objects (so there will be some duplication)?
    If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim,
    because a single dimension could have multiple indexes.


shoyer commented Oct 4, 2017

  1. Use cases of independent Index and dims
    Would there be general cases where dimension and index are independent?
    (Or is that the case only for MultiIndex and KDTree?)

We would still assign default indexes (using a normal pandas.Index) when you assign a 1D coordinate with matching name and dimension. But in general, yes, it seems like you should be able to make an index even for variables that aren't dimensions, including for a 1D variable whose name does not match a dimension. The rule would be that any coordinate can be part of an index.

Another aspect to consider is how to handle alignment when you have indexes along non-dimension coordinates. Probably the most elegant rule would again be to check all indexed variables for exact matches.

Directly assigning indexes rather than using this default or set_index() would be an advanced feature, not recommended for everyday use. The main use case is routines which create a new xarray object based on an existing one, and want to re-use old indexes.

For performance reasons, we probably do not want to actually check the values of manually assigned indexes, although we should verify that the shape matches. (We would have a clear disclaimer that if you manually assign an index with mismatched values the behavior is not well defined.)

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. We would need to figure out how to propagate and compare indexes like this. (I suppose if the coordinate values match, the result could have the union of all indexes from input arguments.)

  2. MultiIndex implementation
    In the MultiIndex case, will an xarray object store a MultiIndex object and also the level variables as Variable objects (so there will be some duplication)?

Yes, this is a little unfortunate. We could potentially make a custom wrapper for use in IndexVariable._data on the level variables that lazily computes values from the MultiIndex (similar to our LazilyIndexedArray class), but I'm not certain yet that this is necessary.

If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim,
because a single dimension could have multiple indexes.

Every entry in indexes should be a single pandas.Index or subclass, including MultiIndex (possibly eventually allowing for index-like objects such as something based on a KDTree).

shoyer referenced this issue in #1473 (WIP: indexing with broadcasting), Oct 9, 2017

fujiisoup commented Oct 13, 2017

Thanks for the details.
(Sorry for my late response. It took me a while to understand what it would look like.)

I am wondering what advantageous use cases this Index concept would enable.
If my understanding is correct,

  1. It will enable more flexible indexing, e.g. more than one index can be associated with one dimension and we can select by these coordinate values very flexibly.
  2. It will naturally integrate more advanced indexes such as KDTree.

Are they correct?

Probably the most elegant rule would again be to check all indexed variables for exact matches.

That sounds reasonable.

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space].

I like the latter one, as it is easier to understand even for non-pandas users.

What does the actual implementation look like?
Will xr.Dataset.indexes be an OrderedDict that maps a variable's name to its associated dimension?
Will the actual Index instances be stored in xr.Dataset.variables?


shoyer commented Oct 13, 2017

I am wondering what advantageous use cases this Index concept would enable.

The other advantage is that it solves many of the issues with the current MultiIndex implementation. Making MultiIndex levels their own variables considerably simplifies the data model, and means that many features (including serialization) should "just work".

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space].
I like the latter one, as it is easier to understand even for non-pandas users.

I agree, but there are probably some advantages to using a MultiIndex internally. For example, it allows for looking up on multiple levels at the same time.

What does the actual implementation look like?
Will xr.Dataset.indexes be an OrderedDict that maps a variable's name to its associated dimension?
Will the actual Index instances be stored in xr.Dataset.variables?

I think we could get away with making xr.Dataset.indexes simply a dict, with keys given by index names and values given by a pandas.Index instance. We should enforce that Index.name or MultiIndex.names corresponds to coordinate variables.

For KDTree, this means we'll have to write our own wrapper KDTreeIndex that adds a names property, but we would probably need to add special methods like get_indexer anyway.
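A rough sketch of what such a dict could hold (KDTreeIndex is the hypothetical wrapper mentioned above; nothing here is an existing xarray attribute):

import pandas as pd

indexes = {
    # a plain index named after its (single) coordinate
    'station': pd.Index(['a', 'b', 'c'], name='station'),
    # a multi-index whose .names point at the level coordinates
    'exp_time': pd.MultiIndex.from_arrays(
        [[0, 0, 1], [0.0, 0.1, 0.0]], names=['experiment', 'time']),
    # 'space': KDTreeIndex(points, names=['latitude', 'longitude']),  # hypothetical
}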


alimanfoo commented Oct 23, 2017

Just to say I'm also interested in how MultiIndexes are handled. In our use case, we have two variables conventionally named CHROM (chromosome) and POS (position) which together describe a location in a genome. I want to combine both variables into a multi-index so I can, e.g., select all data from some data variable for chromosome X between positions 100,000 and 200,000. For all our data variables, this genome-location multi-index would be used to index the first dimension.
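A minimal sketch of that use case with today's API (variable names and values are invented for illustration; slicing on a level requires a lexsorted MultiIndex and touches some of the rough edges discussed in this thread):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'depth': ('variants', np.arange(6))},
    coords={'CHROM': ('variants', ['2L', '2L', 'X', 'X', 'X', 'X']),
            'POS': ('variants', [50000, 150000, 50000, 120000, 180000, 250000])},
).set_index(variants=['CHROM', 'POS'])

# all data on chromosome X between positions 100,000 and 200,000
subset = ds.sel(variants=('X', slice(100000, 200000)))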


jjpr-mit commented Oct 27, 2017

Will the new API preserve the order of the levels? One of the features that's necessary for MultiIndex to be truly hierarchical is that there is a defined order to the levels.


shoyer commented Oct 27, 2017

@jjpr-mit can you explain your use case a little more? What sort of order-dependent queries do you want to do? The ones that come to mind for me are range-based queries, e.g., [('bar', 1):('foo', 9)].

I think it is still relatively easy to ensure a unique ordering between levels, based on the order of coordinate variables in the xarray dataset.

A bigger challenge is that for efficiency, these sorts of queries depend critically on having an actual MultiIndex. This means that if indexes for each of the levels arise from different arguments that were merged together, we might need to "merge" the separate indexes into a joint MultiIndex. This could potentially be slightly expensive.
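For reference, the range-based query mentioned above looks like this in plain pandas (it requires a lexsorted MultiIndex):

import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo'], [1, 2]])
series = pd.Series(range(6), index=index)

# range query that respects the defined order of the levels
print(series.loc[('bar', 2):('foo', 1)])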

shoyer changed the title from "Future of MultiIndex" to "Indexes as an explicit part of xarray's data-model (Future of MultiIndex)" on Jan 5, 2018

shoyer changed the title from "Indexes as an explicit part of xarray's data-model (Future of MultiIndex)" to "Explicit indexes in xarray's data-model (Future of MultiIndex)" on Jan 5, 2018

shoyer modified the milestones: 0.10.1, 1.0 on Jan 31, 2018

shoyer commented Nov 28, 2018

I've been thinking about this a little more in the context of starting on the implementation (in #2195).

In particular, I no longer agree with the "Separate indexers without a MultiIndex should be prohibited" part of my original proposal. The problem is that the semantics of a MultiIndex are not quite the same as separate indexes, and I don't think all use cases are well solved by always using a MultiIndex. For example, I don't think it's possible to do point-wise indexing along anything other than the first level of a MultiIndex. (note: this is not true, see #1603 (comment))

Instead, I think we should make the model transparent by retaining an xarray variable for the MultiIndex, and provide APIs for explicitly converting index types.

e.g., for the repr with a MultiIndex:

Coordinates:
  * x        (x) MultiIndex[level_1, level_2]
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2

and without a MultiIndex:

Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2

The main way in which this could get confusing is if you explicitly mutate the Dataset to remove some but not all of the variables corresponding to the MultiIndex (e.g., x but not level_1 or vice versa). We have a few potential options here:

  1. Don't worry about it: if you mutate objects, you can potentially end up in slightly confusing internal states. If you care about whether level_1 uses a pandas.Index or pandas.MultiIndex, you can find out for sure by checking ds.indexes['level_1'].
  2. Prohibit it in our data model: either (a) raise an error if you try to manually delete a single variable or (b) automatically delete all associated variables, too. Encourage using various explicit APIs that return new objects with a new index.
  3. Use a different indicator than * for marking "indirect" indexes, so it's more obvious if some coordinates get removed, e.g.,
Coordinates:
  * x        (x) MultiIndex[level_1, level_2]
  + level_1  (x) object 'a' 'a' 'b' 'b'
  + level_2  (x) int64 1 2 1 2

The different indicator might make sense regardless but I am also partial to "Prohibit it in our data model." The main downside is that this adds a little more complexity to the logic for determining indexes resulting from an operation (namely, verifying that all MultiIndex levels still correspond to coordinates).


max-sixty commented Nov 28, 2018

Potentially this is too much 'stepping back' now that we're at the implementation stage - my perception is that @shoyer is leading this without much support, so, on the assumption that some additional viewpoints are worth having, some questions:

Is a MultiIndex a feature of the schema or the implementation?

I had thought of an MI as an implementation detail in code, rather than part of the data schema. We use it as a container for all the indexes along a dimension, rather than as something representing any properties of the data it contains.

One exception to that would be if we wanted multiple groups of indexes along the same dimension, for example:

Coordinates:
  * xa         (x) MultiIndex[level_a_1, level_a_2]
  * level_a_1  (x) object 'a' 'a' 'b' 'b'
  * level_a_2  (x) int64 1 2 1 2

  * xb         (x) MultiIndex[level_b_1, level_b_2]
  * level_b_1  (x) object 'a' 'a' 'b' 'b'
  * level_b_2  (x) int64 1 2 1 2

But is that common / required?

MultiIndex as an implementation detail

If it's an implementation detail, is there a benefit to investing in allowing both separate indexes and MIs?
While it may not be possible to do pointwise indexing with the current implementation of MI, am I mistaken in thinking that it's not an API issue, assuming we pass in index names? e.g.:

In [22]: da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
    coords=dict(x=list('abc'), y=pd.MultiIndex.from_product([list('ab'), [1, 2]])))

In [23]: da
Out[23]:
<xarray.DataArray (x: 3, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * x          (x) <U1 'a' 'b' 'c'
  * y          (y) MultiIndex
  - y_level_0  (y) object 'a' 'a' 'b' 'b'
  - y_level_1  (y) int64 1 2 1 2


In [26]: da.sel(x=xr.DataArray(['a', 'c'], dims=['z']),
    y_level_0=xr.DataArray(['a', 'b'], dims=['z']),
    y_level_1=xr.DataArray([1, 1], dims=['z']))

Out[26]: # hypothetical
<xarray.DataArray (z: 2)>
array([ 0, 10])
Dimensions without coordinates: z

If that's the case, could we instead force all indexes along a dimension to be in a MI, tolerate the short-term constraints of the current MI implementation, and where needed build out additional features?

That would (ideally) leave us uncoupled from MIs - if we built a better in-memory data structure, we could transition. The contract would be around the cases above.

--

...and as mentioned above, these are intended as questions rather than high-confidence views.


shoyer commented Nov 29, 2018

@max-sixty I like your schema vs. implementation breakdown. In general, I agree with you that it would be nice to have MultiIndex as an implementation detail rather than part of xarray's schema. But I'm not entirely sure that's feasible.

Let's try to list out the pros/cons. Consider a MultiIndex 'multi' with levels 'x' and 'y':

  • Advantages of MultiIndex as part of the data schema:
    • There is an explicit coordinate (of tuples) corresponding to MultiIndex values, which can be returned from ds.coords['multi']. This is inherently not that useful compared to the separable variables, but is a cleaner solution than creating ds.coords['multi'] as a "virtual" variable on the fly (which we would need for backwards compatibility).
    • We don't need to do full "normalization" when multiple indexes along the same dimension are encountered, e.g., in an operation that combines two different indexes, we would simply put both on the result instead of building a MultiIndex (which would require allocating a whole new array of integer codes).
    • The nature of the MultiIndex is more transparent as part of the data model. For example, if x and y are numeric, it could make sense to use either a MultiIndex or KDTree for indexing. Explicit APIs (e.g., set_multiindex and set_kdtree) would allow users a high level of control.
    • For advanced use-cases, it is potentially easier to work around the limitations of a MultiIndex, e.g., the way that some operations require lex-sorted-ness.
  • Advantages of MultiIndex as an implementation detail:
    • Simpler data model (for users). There are few good use cases for multiple indexes that aren't a MultiIndex.
    • Easier to do automatic alignment: we know that indexes will always have the same normalized form (in a MultiIndex). Otherwise, we would have to do this on the fly, or request that users explicitly set up compatible indexes.
    • More flexibility for xarray: we can potentially swap out indexing without changing the user-facing API. We might have something like a "hybrid" MultiIndex/KDTree that chooses the appropriate index based on the requested operation.
    • We don't need to create an explicit array of tuples for the MultiIndex variable (but we could still have a variable corresponding to a MultiIndex and only construct the .data array in a "lazy" fashion).
    • There's no need to name extraneous variables that only exist for the sake of a MultiIndex.
    • There's no need to support indexing like ds.sel(multi=list_of_pairs). Indexing like ds.sel(x=..., y=...) solves the same use case and looks nicer. That said, this would be a minor backwards-compatibility break (the list-of-pairs form currently works in xarray; see the example below).
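For context, here is the list-of-pairs form from the last bullet next to its level-based equivalent (a small sketch; the names are arbitrary):

import xarray as xr

da = xr.DataArray(
    range(4), dims='multi',
    coords={'x': ('multi', ['a', 'a', 'b', 'b']),
            'y': ('multi', [1, 2, 1, 2])},
).set_index(multi=['x', 'y'])

da.sel(multi=[('a', 1), ('b', 2)])  # list of pairs (works today)
da.sel(x='a', y=2)                  # level-based form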

P.S. I haven't made much progress on this yet so there's definitely still time to figure out the right decision -- thanks for your engagement on this!


shoyer commented Nov 29, 2018

For example, I don't think it's possible to do point-wise indexing along anything other than the first level of a MultiIndex.

This is clearly not true, since it works in pandas:

import pandas as pd
index = pd.MultiIndex.from_product([list('ab'),[1,2]])
series = pd.Series(range(4), index)
print(series.loc[:, [1, 2]])

That said, I still don't know how to use public MultiIndex methods for this. Neither index.get_loc_level([1, 2], level=1) nor index.get_loc((slice(None), [1, 2])) works.


shoyer commented Nov 29, 2018

That said, I still don't know how to use public MultiIndex methods for this. Neither index.get_loc_level([1, 2], level=1) nor index.get_loc((slice(None), [1, 2])) works.

The answer is the index.get_locs() method: index.get_locs([slice(None), [1, 2]]) works.

It's painfully slow for large numbers of points due to a Python loop over each point, but presumably that could be optimized:

x = np.arange(10000)
index = pd.MultiIndex.from_arrays([x])
%timeit index.get_locs((x,))  # 1.31 s per loop
%timeit index.levels[0].get_indexer(x)  # 93 µs per loop

shoyer commented Nov 29, 2018

There's no need to support indexing like ds.sel(multi=list_of_pairs). Indexing like ds.sel(x=..., y=...) solves the same use case and looks nicer.

This needs an important caveat: it's only true that you can use ds.sel(x=..., y=...) to emulate ds.sel(multi=list_of_pairs) if you do explicit vectorized indexing like in @max-sixty's example above (#1603 (comment)). It would be nice to preserve a way to select a list of particular points that doesn't require constructing explicit DataArray objects as the indexers. (But maybe this is a somewhat niche use case and it isn't worth the trouble.)

Let me make a tentative proposal: we should model a MultiIndex in xarray as exactly equivalent to a sparse multi-dimensional array, except with missing elements modeled implicitly (by omission) instead of explicitly (with NaN). If we do this, I think MultiIndex semantics could be defined to be identical to those of separable Index objects.
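To make that correspondence concrete, here is the same data viewed both ways in plain pandas; the MultiIndex simply omits the ('b', 2) element that the dense form has to fill with NaN:

import pandas as pd

index = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)], names=['x', 'y'])
series = pd.Series([10, 20, 30], index=index)

print(series.unstack('y'))
# y       1     2
# x
# a    10.0  20.0
# b    30.0   NaN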

One challenge is that we will definitely have to make some intentional deviations from the behavior of pandas, at least when dealing with array indexing of a MultiIndex level. Pandas has some strange behaviors with array indexing of a MultiIndex level, and I'm honestly not sure if they are bugs or features:

Fortunately, the MultiIndex data model is not that complicated, and it is quite straightforward to remap indexing results from sub-Index levels onto integer codes. I suspect we will find it easier to rewrite some of these routines than to change pandas, both because pandas may not agree with different semantics and because the pandas indexing code is an unholy mess.

For example, we can reproduce the above issues:

import pandas as pd
index = pd.MultiIndex.from_arrays([['a', 'b', 'c']])
print(index.get_locs((['a', 'a'],)))  # [0]
print(index.get_locs((['a', 'd'],)))  # [0]

We actually want something more like:

def get_locs(index, key):
    return index.get_indexer(pd.MultiIndex.from_product(key))

print(get_locs(index, (['a', 'a'],)))  # [0, 0]
print(get_locs(index, (['a', 'd'],)))  # [0, -1]

max-sixty commented Nov 29, 2018

Let me make a tentative proposal: we should model a MultiIndex in xarray as exactly equivalent to a sparse multi-dimensional array, except with missing elements modeled implicitly (by omission) instead of explicitly (with NaN).

💯- that very much resonates!
And it leaves the implementation flexible if we want to iterate.

I'll try to think of some dissenting cases to the proposal / helpful responses to the above.


benbovy commented Nov 29, 2018

we will definitely have to make some intentional deviations from the behavior of pandas

Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing pandas.MultiIndex in xarray, where slightly different semantics are generally expected, has proven to be painful. It seems easier to have our own baked solution and deal with differences during xarray <-> pandas conversion if needed.

If we re-design indexes so that we allow 3rd-party indexes, maybe we could support both and let users choose the one (xarray or pandas baked) that best suits their needs?

Regarding MultiIndex as part of the data schema vs an implementation detail, if we support extending indexes (and already given the different kinds of multi-coordinate indexes: MultiIndex, KDTree, etc.), then I think that it should be transparent to the user.

However, I don't really see why a multi-coordinate index should have its own variable (with tuples of values). I don't want to speak for others, but IMHO ds.sel(multi=list_of_pairs) is rather an edge case and I'm not sure if we really need to support it. Using ds.sel(x=..., y=...) with DataArray objects is certainly more code to write, but this form of indexing is very powerful and it might not be a bad idea to encourage it.

If a variable for each multi-coordinate index is "just" for data schema consistency, then why not show all those indexes in a separate section of the repr? For example:

Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2
Multi-indexes:
    pandas.MultiIndex [level_1, level_2]

It is equally transparent, not more verbose, and it is clear that multi-indexes are not part of the coordinates (in fact there is no need for "virtual" coordinates either, nor any need to name the index). I don't think single indexes should be shown here as it would result in duplicated, uninformative lines.

More generally, here is how I would see indexes handled in xarray (I might be missing important aspects, though):

  • Default behavior: all 1-dimensional coordinates each have their own, single index (pandas.Index), unless explicitly stated.
  • Explicit API is used for setting new, possibly multi-coordinate indexes. Note the absence of a keyword argument below to specify the variables: this is actually more consistent with the pandas API, but it would be a breaking change and I don't know what a smooth transition would look like.
    • set_index(['x', 'y'], kind='multiindex') # xarray built-in index
    • set_index(['x', 'y'], kind='kdtree') # xarray built-in index
    • set_index('x', kind=ASingleIndexWrapperClass) # 3rd-party index
  • If a coordinate is removed from the Dataset or if its index is reset or changed:
    • If the coordinate had a single index, no problem
    • If the coordinate was part of a multi-coordinate index: a new index is built from all remaining coordinates that were also part of the original index, if it is supported. Otherwise, the original index is removed and the default behavior (single pandas.Index) is reset for all those remaining coordinates.

fujiisoup commented Nov 29, 2018

I am late to the party (and still only have time to write a short comment).
I am a big fan of MultiIndex and like @shoyer 's idea.

ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but what about reindexing along a MultiIndex?
I have encountered use cases for it several times.
I also think it would be nice to have MultiIndex as a variable.


max-sixty commented Nov 29, 2018

I broadly agree with @benbovy 's proposal.

One question that I think is worth being clear on is what additional contracts multiple indexes on a dimension have over individual indexes.

e.g. re:

Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2
Multi-indexes:
    pandas.MultiIndex [level_1, level_2]

Am I right in thinking the Multi-indexes section is only a helpful note to users, rather than conveying anything about how data is accessed?

@fujiisoup's comment poses a good case of this question:

ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but what about reindexing along a MultiIndex?

(and separately, I think we can do much of this before adding the ability to set custom indexes, which would be cool but further from where we are, I think)


max-sixty commented Nov 29, 2018

And broadening out further:

Default behavior: all 1-dimensional coordinates each have their own, single index (pandas.Index), unless explicitly stated.

This is basically how I think of indexes - as a performant lookup data structure, rather than a feature of the schema. An RDBMS is a good analogy there.

Now, maybe there's enough overlap between the data access and the data schema that we should let them couple - e.g. would you want to be able to run .sel on any coord, even 2D? While it's possible in concept, it could guide users to inefficient operations.

We probably don't need to answer this question to proceed, but I'd be interested in whether others see indexes as a property of the schema, or whether I'm missing something.


benbovy commented Nov 29, 2018

ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but what about reindexing along a MultiIndex?

Indeed I haven't really thought about reindex and alignment in my suggestion above.

How do you currently reindex along a multi-index dimension?

Contrary to .sel, ds.reindex(multi=list_of_pairs) doesn't seem to work (the list of tuples is interpreted as a 2-d array). The only way I've found to make it work is to pass another pandas.MultiIndex. Wouldn't it be rather confusing if we chose to go with our own implementation of MultiIndex for xarray instead of pandas.MultiIndex?
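For illustration, here is a self-contained sketch of that workaround (the names are invented; the commented-out line shows the form that currently gets misinterpreted):

import pandas as pd
import xarray as xr

da = xr.DataArray(
    range(4), dims='obs',
    coords={'experiment': ('obs', [0, 0, 1, 1]),
            'time': ('obs', [0.0, 0.1, 0.0, 0.1])},
).set_index(obs=['experiment', 'time'])

new_index = pd.MultiIndex.from_tuples(
    [(0, 0.0), (1, 0.1), (1, 0.2)], names=['experiment', 'time'])

da.reindex(obs=new_index)              # works: another pandas.MultiIndex
# da.reindex(obs=[(0, 0.0), (1, 0.1)]) # the list of tuples is read as a 2-d array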

Wouldn't it be possible to easily support ds.reindex(x=..., y=...) within the new data model proposed here?

Am I right in thinking the Multi-indexes section is only a helpful note to users, rather than conveying anything about how data is accessed?

This is a good question.

A related question: apart from the ds.sel(multi=list_of_pairs) and ds.reindex(multi=list_of_pairs) use cases discussed so far, are there other reasons for having a variable for a multi-index?

I think we can do much of this before adding the ability to set custom indexes, which would be cool but further from where we are, I think.

I agree, although whether or not we will eventually support custom indexes might influence the design choices that we have to make now, IMO.


shoyer commented Nov 29, 2018

Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing pandas.MultiIndex in xarray where slightly different semantics are generally expected has shown to be painful. It seems easier to have our own baked solution and deal with differences during xarray<-> pandas conversion if needed.

I think pandas.MultiIndex is a pretty solid data structure on a fundamental level; it just has some weird semantics for some indexing edge cases. Whether or not we write an xarray.MultiIndex structure, we can achieve most of what we want with a thin layer over pandas.MultiIndex.

If a variable for each multi-coordinate index is "just" for data schema consistency, then why not show all those indexes in a separate section of the repr?

Yes, I like this! Generally I like @benbovy's entire proposal :).

@fujiisoup can you clarify the use cases you have for a MultiIndex as a variable?

Am I right in thinking the Multi-indexes section is only a helpful note to users, rather than conveying anything about how data is accessed?

From a data perspective, the only thing having an Index and/or MultiIndex should change is that the data is immutable.

But by necessity the nature of the index will determine which indexing operations are possible/efficient. For example, if you want to do nearest-neighbor indexing with multiple coordinates you'll need a KDTree. We should not be afraid to raise errors if an indexing operation can't be done efficiently.


With regard to reindexing: I don't think this needs any special handling versus normal indexing (sel()). The rules basically fall out of those for normal indexing, except we handle missing values differently (by filling with NaN).

Another issue: how do we do automatic alignment with multiple indexes? Let me suggest a straw-man proposal: we always align indexed coordinates. If a coordinate is used in different types of indexes (e.g., a base Index in one argument and a MultiIndex level in another), we can either:

  1. create a MultiIndex with the variable on the fly (this could be slightly expensive), or
  2. fall back to only supporting "exact" indexing

shoyer commented Nov 29, 2018

It occurs to me that for the case of "multiple single indexes" along the same dimension there is no good way to use them simultaneously for indexing/reindexing. We should explicitly raise if you try to do this.

I guess we have a few options for automatic alignment with multiple single indexes, too:

  1. We could only support "exact" indexing
  2. We could require that aligning each index separately gives the same result

(2) seems least restrictive and is probably the right choice.


One advantage of not having MultiIndex objects as variables is that the serialization story gets simpler. The rule becomes "multi-indexes don't get saved".


What should the default behavior of set_index(['x', 'y']) without an explicit kind argument be?

  • Should this mean individual indexes or a combined MultiIndex? The latter might be more surprising but is arguably more useful. It would make sense if the model is that set_index() always creates a single index object.
  • We could potentially pick an index type automatically using simple heuristics. For example, if the arguments are 1D, you get a MultiIndex by default. If the arguments have two or more dimensions, you get a KDTree.

shoyer commented Nov 30, 2018

I wonder if we should also change the default value of the append argument in set_index() to append=None, which means something like "append if creating a MultiIndex". For most users, keeping a single MultiIndex is the most usable way to use multiple indexes along a dimension, and our default behavior should reflect that.


benbovy commented Nov 30, 2018

A couple of thoughts:

If nothing useful can be done in the case of "multiple single indexes", would it make sense to discourage users from explicitly creating multiple single indexes along a dimension? "Multiple single indexes" would then just be a default situation when nothing specific has been defined yet, or the result of a fallback.

For example, why not require that set_index(['x', 'y']) (with a list as argument) always results in a multi-index regardless of the kind argument, i.e., raise if a single-index kind is given? This is close to the current behavior, I think. It would require calling set_index for each single index that we want to (re)define, but I don't think setting a lot of single indexes at the same time is something that happens often.

Hence, would it be possible to avoid append=None and instead change the default to append=True?


max-sixty commented Nov 30, 2018

How should dimension names interact with index names - i.e. the "Mapping indexes into pandas" section in @shoyer 's comment?

I'd suggest that option (3) should be invalid, and that da[dim_name] should return all the indexes on that dimension


benbovy commented Dec 4, 2018

It occurs to me that for the case of "multiple single indexes" along the same dimension there is no good way to use them simultaneously for indexing/reindexing.

Sorry for maybe asking this again, but I'm a bit confused now: is there any good reason for supporting "multiple single indexes" along the same dimension?

After all, perhaps a better default would be to set indexes (pandas.Index) only for 1-d coordinates matching dimension names, as is the case now.

If you want different behavior, then you need to use .set_index(), which would raise if it results in multiple single indexes along a dimension. We could also add a new indexes argument to the Dataset / DataArray constructors to save some typing (and avoid the creation of an in-memory pandas.Index for very long coordinates if an out-of-core alternative is later supported).

da[dim_name] should return all the indexes on that dimension

I think that one big source of confusion so far has been mixing coordinates/variables and indexes. These are really two separate concepts, and the indexes refactoring should address that IMHO.

For example, I think that da[some_name] should never return indexes but only coordinates (and/or data variables for Dataset). That would be much simpler.

Take for example

>>> da = xr.DataArray(np.random.rand(2, 2),
...                   dims=('one', 'two'),
...                   coords={'one_labels': ('one', ['a', 'b'])})
>>> da
<xarray.DataArray (one: 2, two: 2)>
array([[ 0.536028,  0.291895],
       [ 0.682108,  0.926003]])
Coordinates:
    one_labels  (one) <U1 'a' 'b'
Dimensions without coordinates: one, two

I find it so weird being able to do this:

>>> da['one']
<xarray.DataArray 'one' (one: 2)>
array([0, 1])
Coordinates:
    one_labels  (one) <U1 'a' 'b'
Dimensions without coordinates: one

Where does array([0, 1]) come from? I wouldn't have been surprised if a KeyError was raised instead. Perhaps this specific case was initially for backward compatibility when the "dimensions without indexes" feature was introduced, but that was a long time ago and I'm not sure it is still necessary.

It might be a good thing to explicitly require da.set_index('one_labels') to enable indexing/alignment (edit: label indexing/alignment) along dimension one in the example above.


shoyer commented Dec 4, 2018

Sorry for maybe asking this again, but I'm a bit confused now: is there any good reason for supporting "multiple single indexes" along the same dimension?

After all, perhaps a better default would be to set indexes (pandas.Index) only for 1-d coordinates matching dimension names, as is the case now.

If you want different behavior, then you need to use .set_index(), which would raise if it results in multiple single indexes along a dimension. We could also add a new indexes argument to the Dataset / DataArray constructors to save some typing (and avoid the creation of an in-memory pandas.Index for very long coordinates if an out-of-core alternative is later supported).

I discussed this a little bit above in #1603 (comment), under "MultiIndex as part of the data schema".

I agree that the default behavior should still be to create automatic indexes only for 1d coordinates matching dimension names. But we will still have (rare?) cases where "multiple single indexes" could arise from combining arguments with different indexes.

For example, suppose the station dimension has an index for station_name in one dataset and city in another. Should the result be:

  • A MultiIndex with levels station_name and city? This would be most useful for future operations.
  • Two individual indexes for station_name and city? This would be the cheapest result to construct.
  • An error? This is arguably too strict, because there are no conflicts in either of the indexes.

I guess the error is probably the best idea.
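A minimal sketch of the situation being described (dataset contents are invented for illustration):

import xarray as xr

ds1 = xr.Dataset(
    {'temp': ('station', [10.0, 12.0])},
    coords={'station_name': ('station', ['s1', 's2'])},
).set_index(station='station_name')

ds2 = xr.Dataset(
    {'rain': ('station', [0.3, 0.1])},
    coords={'city': ('station', ['NYC', 'LA'])},
).set_index(station='city')

# what should combining these produce for the station dimension:
# a MultiIndex with levels station_name and city, two separate indexes, or an error?
combined = xr.merge([ds1, ds2])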

Where does array([0, 1]) come from? I wouldn't have been surprised if a KeyError was raised instead. Perhaps this specific case was initially for backward compatibility when the "dimensions without indexes" feature was introduced, but that was a long time ago and I'm not sure it is still necessary.

This is indeed the historical genesis, but I agree that this is confusing and we should deprecate/remove it.


benbovy commented Dec 5, 2018

I guess the error is probably the best idea.

Agreed. It seems very strict indeed, but it will be easier to relax this later than the other way around. There is also a (very rare?) case where the two indexed coordinates have the same labels but are named differently in the two datasets (e.g., station_name and sname). In that case an error is probably better too. It would be a sort of indication that the most useful thing to do for future operations is to rename one of those coordinates first.


shoyer commented Jan 1, 2019

I'm starting to make these changes incrementally -- the first step is in #2639.
