PeriodIndex causes severe slow down #1255

@max-sixty

I need some guidance on how to handle this.

Background

PeriodIndex has a 'non-numpy' dtype now:

In [2]: i = pd.PeriodIndex(start=2000, freq='A', periods=10)

In [3]: i
Out[3]:
PeriodIndex(['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
             '2008', '2009'],
            dtype='period[A-DEC]', freq='A-DEC')

In [6]: i.dtype
Out[6]: period[A-DEC]

When .values or .__array__() is called, each element is boxed into a Period object and the result is an object-dtype array, which is really slow. The underlying ints are stored in ._values:

In [25]: i.values
Out[25]:
array([Period('2000', 'A-DEC'), Period('2001', 'A-DEC'),
       Period('2002', 'A-DEC'), Period('2003', 'A-DEC'),
       Period('2004', 'A-DEC'), Period('2005', 'A-DEC'),
       Period('2006', 'A-DEC'), Period('2007', 'A-DEC'),
       Period('2008', 'A-DEC'), Period('2009', 'A-DEC')], dtype=object)

In [27]: all(i.__array__()==i.values)
Out[27]: True

# underlying:
In [28]: i._values
Out[28]: array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39])
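For readers on a newer pandas, here is a minimal sketch of the same contrast. It assumes pd.period_range with the 'Y' annual alias in place of the PeriodIndex(start=...) constructor used above, and it reads the integers through .asi8 rather than ._values, since what ._values returns may differ between pandas versions. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: period_range with the 'Y' alias stands in for
# PeriodIndex(start=2000, freq='A', periods=10) from the session above.
i = pd.period_range(start="2000", periods=10, freq="Y")

# Passing the index to numpy goes through __array__ and builds one
# Period object per element, giving an object-dtype array.
boxed = np.asarray(i)

# .asi8 exposes the integer ordinals without creating Period objects.
ordinals = i.asi8

print(boxed.dtype)
print(ordinals)
```

Each boxed element still carries its integer as .ordinal, so the two views describe the same data; only the cost of producing them differs.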

Problem

In pandas, we avoid calling .values directly from outside Index; other code goes through a smaller set of Index methods instead.

But in xarray, I think there are a fair few functions that call .values or implicitly call .__array__() by passing the index into numpy.

As a result, there is a severe slowdown when using PeriodIndex. As an example:

In [51]: indexes = [pd.PeriodIndex(start=str((1776 + i)), freq='A', periods=300) for i in range(50)]
In [53]: das = [xr.DataArray(range(300), coords=[index]) for index in indexes]

In [54]: %timeit xr.concat(das)

# 1 loop, best of 3: 1.38 s per loop

vs DatetimeIndex (DTI):

In [55]: indexes_dt = [pd.DatetimeIndex(start=str((1776 + i)), freq='A', periods=300) for i in range(50)]
In [56]: das_dt = [xr.DataArray(range(300), coords=[index]) for index in indexes_dt]
In [57]: %timeit xr.concat(das_dt)
# 10 loops, best of 3: 69.2 ms per loop

...roughly a 20x slowdown (1.38 s vs 69.2 ms), on fairly short indexes.
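To separate the boxing cost from everything else xr.concat does, the conversion can be timed on its own. This is a rough sketch rather than a benchmark of xarray: it assumes pd.period_range and pd.date_range with the 'Y' and 'YS' aliases as stand-ins for the constructors above, and the repeat count of 200 is arbitrary. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: modern constructors and aliases replace the
# PeriodIndex(start=...) / DatetimeIndex(start=...) calls used above.
periods = pd.period_range(start="1776", periods=300, freq="Y")
dates = pd.date_range(start="1776", periods=300, freq="YS")

n = 200  # arbitrary repeat count

# Boxed path: numpy asks the index for an array of Period objects.
t_boxed = timeit.timeit(lambda: np.asarray(periods), number=n)

# Integer path: read the ordinals without creating Period objects.
t_ints = timeit.timeit(lambda: periods.asi8, number=n)

# DatetimeIndex converts to a native datetime64 array, no boxing.
t_dates = timeit.timeit(lambda: np.asarray(dates), number=n)

print(f"PeriodIndex -> object array: {t_boxed:.4f} s")
print(f"PeriodIndex -> int ordinals: {t_ints:.4f} s")
print(f"DatetimeIndex -> datetime64: {t_dates:.4f} s")
```

If the first timing dominates, that points at the implicit __array__ conversion rather than at the concat logic itself.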

@shoyer do you have any ideas on how to resolve this? Is it feasible to avoid passing Indexes directly into numpy? I haven't gone through the code in enough depth to have a view, so I was hoping you could cut through the options. Thank you.

ref pandas-dev/pandas#14822
CC @sinhrks @jreback
