Slicing DataArray can take longer than not slicing #2004

@WeatherGod

Description

@WeatherGod

Code Sample, a copy-pastable example if possible

In [1]: import xarray as xr

In [2]: radmax_ds = xr.open_dataset('tests/radmax_baseline.nc')

In [3]: radmax_ds
Out[3]: 
<xarray.Dataset>
Dimensions:    (latitude: 5650, longitude: 12050, time: 3)
Coordinates:
  * latitude   (latitude) float32 13.505002 13.515002 13.525002 13.535002 ...
  * longitude  (longitude) float32 -170.495 -170.485 -170.475 -170.465 ...
  * time       (time) datetime64[ns] 2017-03-07T01:00:00 2017-03-07T02:00:00 ...
Data variables:
    RadarMax   (time, latitude, longitude) float32 ...
Attributes:
    start_date:   03/07/2017 01:00
    end_date:     03/07/2017 01:55
    elapsed:      60
    data_rights:  Respond (TM) Confidential Data. (c) Insurance Services Offi...

In [4]: %timeit foo = radmax_ds.RadarMax.load()
The slowest run took 35509.20 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 216 µs per loop

In [5]: 216 * 35509.2
Out[5]: 7669987.199999999

So, without any slicing, it takes approximately 7.7 seconds for me to load this complete file into memory. Now, let's see what happens when I slice the DataArray and load it:

In [1]: import xarray as xr

In [2]: radmax_ds = xr.open_dataset('tests/radmax_baseline.nc')

In [3]: %timeit foo = radmax_ds.RadarMax[::1, ::1, ::1].load()
1 loop, best of 3: 7.56 s per loop

In [4]: radmax_ds.close()

In [5]: radmax_ds = xr.open_dataset('tests/radmax_baseline.nc')

In [6]: %timeit foo = radmax_ds.RadarMax[::1, ::10, ::10].load()

I killed this session after 17 minutes. `top` did not report any unusual I/O wait, and memory usage was not out of control. I am using xarray v0.10.2. My suspicion is that something in the indexing system is causing xarray to read the data in a bad order. Notice that if I slice all the data (`::1` on every dimension), the timing works out the same as reading it all in straight-up. Not shown here is a run where, slicing every 100 latitudes and 100 longitudes, the timing is shorter again, but still not as short as reading everything in at once.
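For comparison, here is a minimal sketch of the workaround I would expect to behave well: load the whole variable eagerly first, then apply the strided slice to the in-memory array rather than to the lazy backend array. The synthetic dataset below is a stand-in for `radmax_baseline.nc` (its shape is an assumption for illustration), and this reflects my suspicion about the cause, not a confirmed fix.

```python
import numpy as np
import xarray as xr

# Small synthetic dataset standing in for radmax_baseline.nc
# (shape chosen for illustration only).
data = np.random.rand(3, 100, 200).astype(np.float32)
ds = xr.Dataset({"RadarMax": (("time", "latitude", "longitude"), data)})

# Workaround sketch: load the full variable into memory first,
# then slice the numpy-backed DataArray instead of slicing lazily.
full = ds.RadarMax.load()
subset = full[:, ::10, ::10]

print(subset.shape)
```

With a file that is only 1.7 MB compressed, loading everything and slicing in memory should stay cheap even if the lazy strided read path is slow.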

Let me know if you want a copy of the file. It is a compressed netCDF4 file, taking up only 1.7 MB.

I wonder if this is related to #1985?
