Indexing a RangeIndexed DataArray with a RangeIndex returns a deprecated Int64Index #6256

Open
hrzn opened this issue Feb 9, 2022 · 2 comments

Comments

@hrzn
Contributor

hrzn commented Feb 9, 2022

What happened?

First, apologies if this is not actually a bug. I'm not sure what the intended behaviour should be, but I find the current behaviour counter-intuitive.

When indexing a DataArray that is indexed using a RangeIndex, the resulting index is an Int64Index:

my_da.get_index('time')
# RangeIndex(start=0, stop=100, step=1, name='time')

a = my_da.sel({'time': pd.RangeIndex(0, 2)})
a.get_index('time')
# Int64Index([0, 1], dtype='int64', name='time')

Setting the index back to the desired RangeIndex with assign_coords() works. Still, I find it problematic that sel() returns an Int64Index even when it is given a RangeIndex, particularly since Int64Index was deprecated in pandas 1.4.
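For reference, a minimal sketch of the assign_coords() workaround mentioned above. The variable name `wanted` is only illustrative, and the behaviour was observed with the versions listed under Environment; it may differ in other xarray or pandas releases.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same setup as in the report: a 1-D array indexed by a RangeIndex.
my_da = xr.DataArray(
    np.random.rand(100),
    dims=("time",),
    coords={"time": pd.RangeIndex(0, 100)},
)

# `wanted` is an illustrative name for the label indexer.
wanted = pd.RangeIndex(0, 2)
a = my_da.sel(time=wanted)

# Workaround: re-assign the coordinate so the index is built
# directly from the RangeIndex instead of from an integer array.
a = a.assign_coords(time=wanted)
print(type(a.get_index("time")).__name__)
```

This only restores the index type after the fact; it does not change what sel() itself returns.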

What did you expect to happen?

I would have expected the resulting DataArray to be indexed with the same RangeIndex used in sel().

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import pandas as pd

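# 1-D array with a 'time' dimension indexed by a RangeIndex
my_da = xr.DataArray(np.random.rand(100),
                     dims=('time',),
                     coords={'time': pd.RangeIndex(0, 100)})

print(my_da.get_index('time'))
# Select with a RangeIndex as the label indexer
a = my_da.sel({'time': pd.RangeIndex(0, 2)})
print(a.get_index('time'))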

Relevant log output

RangeIndex(start=0, stop=100, step=1, name='time')
Int64Index([0, 1], dtype='int64', name='time')

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.5 (default, Sep 4 2020, 02:22:02)
[Clang 10.0.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 0.20.2
pandas: 1.4.0
numpy: 1.22.1
scipy: 1.7.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: None
sparse: None
setuptools: 59.5.0
pip: 21.3.1
conda: None
pytest: 6.2.5
IPython: 8.0.1
sphinx: 4.3.2

hrzn added the "bug" and "needs triage" labels on Feb 9, 2022
@mathause
Collaborator

mathause commented Feb 21, 2022

Thanks for the report. I guess RangeIndex was never very thoroughly tested. This may or may not change with #5692 (which will hopefully be merged in the near future), so I suggest waiting for that.

mathause added the "topic-indexing" and "topic-internals" labels and removed the "needs triage" label on Feb 21, 2022
@benbovy
Member

benbovy commented Feb 21, 2022

This is still the same behavior with #5692.

We would need to handle pd.RangeIndex (and perhaps range?) label indexers similarly to slice label indexers, i.e., use pd.Index.slice_indexer internally to return integer indexers as slices (*).

b = my_da.sel(time=slice(0, 2))
b.get_index('time')
# RangeIndex(start=0, stop=3, step=1, name='time')
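To illustrate the slice_indexer route mentioned above, here is a minimal pandas-only sketch. It is not Xarray's actual implementation, only a demonstration that a label slice resolves to a positional slice, and that positional slicing keeps the RangeIndex.

```python
import pandas as pd

idx = pd.RangeIndex(0, 100, name="time")

# Label-based bounds (both inclusive) are translated into a
# positional slice, whose stop is exclusive as usual.
positional = idx.slice_indexer(0, 2)
print(positional)

# Indexing with a slice, unlike indexing with an integer array,
# returns another RangeIndex.
subset = idx[positional]
print(type(subset).__name__)
```

Labels 0 to 2 inclusive map to positions 0, 1 and 2, which is why the resulting index stops at 3.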

Otherwise, label indexers get internally converted to arrays. Note that the conversion to an Int64Index is done in pandas (nothing specific is done on the Xarray side), so I expect that this will eventually be addressed in pandas. This conversion may not be too problematic if we consider it an implementation detail (although I might be missing some important aspect).

idx = pd.RangeIndex(0, 100)

idx[slice(0, 3)]
# RangeIndex(start=0, stop=3, step=1)

idx[[0, 1, 2]]
# Int64Index([0, 1, 2], dtype='int64')

(*) One major difference is that in Xarray slice label indexers are upper-bound inclusive, while pd.RangeIndex and range are not!
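To make the bound mismatch concrete, here is a rough sketch of how a range could be mapped to an equivalent upper-bound-inclusive label slice. The helper `range_to_label_slice` is hypothetical (it is not part of Xarray or pandas) and only handles a step of 1 to keep the example small.

```python
def range_to_label_slice(r: range) -> slice:
    """Hypothetical helper: return an upper-bound-inclusive label
    slice that selects the same labels as the half-open range `r`."""
    if r.step != 1:
        raise ValueError("this sketch only supports step == 1")
    if len(r) == 0:
        raise ValueError("an empty range has no inclusive-slice equivalent")
    # r.stop is exclusive, so the last label actually selected is r[-1].
    return slice(r[0], r[-1])


label_slice = range_to_label_slice(range(0, 2))
print(label_slice)
```

Here range(0, 2) covers labels 0 and 1, so the inclusive slice must end at 1 rather than 2. Passing slice(0, 2) to sel() would select labels 0, 1 and 2 instead.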
