
netCDF4 indexing: reindex_like is very slow if dataset not loaded into memory #8945

Closed
brendan-m-murphy opened this issue Apr 15, 2024 · 4 comments

@brendan-m-murphy

What is your issue?

Reindexing a dataset without loading it into memory seems to be very slow (about 1000x slower than reindexing after loading into memory).

Here is a minimal working example:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = 100
nlat = 200
nlon = 300

fp = xr.Dataset(
    {"fp": (["time", "lat", "lon"], np.arange(times * nlat * nlon).reshape(times, nlat, nlon))},
    coords={
        "time": pd.date_range(start="2019-01-01T02:00:00", periods=times, freq="1H"),
        "lat": np.arange(nlat),
        "lon": np.arange(nlon),
    },
)

flux = xr.Dataset(
    {"flux": (["time", "lat", "lon"], np.arange(nlat * nlon).reshape(1, nlat, nlon))},
    coords={
        "time": [pd.to_datetime("2019-01-01")],
        "lat": np.arange(nlat) + np.random.normal(0.0, 0.01, nlat),
        "lon": np.arange(nlon) + np.random.normal(0.0, 0.01, nlon),
    },
)

fp.to_netcdf("combine_datasets_tests/fp.nc")
flux.to_netcdf("combine_datasets_tests/flux.nc")

fp1 = xr.open_dataset("combine_datasets_tests/fp.nc")
flux1 = xr.open_dataset("combine_datasets_tests/flux.nc")
```

Then

```python
flux1 = flux1.reindex_like(fp1, method="ffill", tolerance=None)
```

takes over a minute, while

```python
flux1 = flux1.load().reindex_like(fp1, method="ffill", tolerance=None)
```

is almost instantaneous (timeit reports 91 ms, including opening the dataset; I'm not sure whether caching influences this).

Profiling the "reindex without load" cell:

```
        804936 function calls (804622 primitive calls) in 93.285 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   92.211   92.211   93.191   93.191 {built-in method _operator.getitem}
        1    0.289    0.289    0.980    0.980 utils.py:81(_StartCountStride)
        6    0.239    0.040    0.613    0.102 shape_base.py:267(apply_along_axis)
    72656    0.109    0.000    0.109    0.000 utils.py:429(<lambda>)
    72656    0.085    0.000    0.136    0.000 utils.py:430(<lambda>)
    72661    0.051    0.000    0.051    0.000 {built-in method numpy.arange}
   145318    0.048    0.000    0.115    0.000 shape_base.py:370(<genexpr>)
        2    0.045    0.023    0.046    0.023 indexing.py:1334(__getitem__)
        6    0.044    0.007    0.044    0.007 numeric.py:136(ones)
   145318    0.044    0.000    0.067    0.000 index_tricks.py:690(__next__)
       14    0.033    0.002    0.033    0.002 {built-in method numpy.empty}
145333/145325    0.023    0.000    0.023    0.000 {built-in method builtins.next}
        1    0.020    0.020   93.275   93.275 duck_array_ops.py:317(where)
       21    0.018    0.001    0.018    0.001 {method 'astype' of 'numpy.ndarray' objects}
   145330    0.013    0.000    0.013    0.000 {built-in method numpy.asanyarray}
        1    0.002    0.002    0.002    0.002 {built-in method _functools.reduce}
        1    0.002    0.002   93.279   93.279 variable.py:821(_getitem_with_mask)
       18    0.001    0.000    0.001    0.000 {built-in method numpy.zeros}
        1    0.000    0.000    0.000    0.000 file_manager.py:226(close)
```

The getitem call at the top comes from `xarray/backends/netCDF4_.py`, line 114. Because of the jittered coordinates in flux, I assume the index passed to netCDF4 is not a run of consecutive, strictly increasing integers (0, 1, 2, 3, ...). This has caused problems in the past: Unidata/netcdf4-python#680.
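To make the indexer shape concrete, here is a small numpy sketch of roughly what ffill reindexing computes for one jittered axis (the names and the deterministic alternating jitter are illustrative, not xarray's implementation):

```python
import numpy as np

nlat = 6
# Deterministic "jitter": alternately shift each coordinate up/down by 0.01,
# mimicking the randomly perturbed lat/lon coordinates in the flux dataset.
jitter = np.where(np.arange(nlat) % 2 == 0, 0.01, -0.01)
source_lat = np.arange(nlat) + jitter          # coords of flux
target_lat = np.arange(nlat, dtype=float)      # coords of fp

# Forward-fill indexer: for each target, the position of the last source
# coordinate <= target (roughly what reindex(..., method="ffill") computes).
indexer = np.searchsorted(source_lat, target_lat, side="right") - 1
indexer = np.clip(indexer, 0, nlat - 1)

print(indexer)  # [0 1 1 3 3 5]
```

The result is non-decreasing but has repeated entries and skips, i.e. exactly the kind of non-consecutive integer index that ends up in the backend getitem call.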

In my venv, netCDF4 was installed from a wheel with the following versions:

netcdf4-python version: 1.6.5
HDF5 lib version:       1.12.2
netcdf lib version:     4.9.3-development

This is with xarray version 2023.12.0, numpy 1.26, and pandas 1.5.3.

I will try to investigate more and hopefully simplify the example. (Can't quite justify spending more time on it at work because this is just to tag a version that was used in some experiments before we switch to zarr as a backend, so hopefully it won't be relevant at that point.)

@brendan-m-murphy brendan-m-murphy added the needs triage Issue that has not been reviewed by xarray team member label Apr 15, 2024
@max-sixty max-sixty added topic-performance and removed needs triage Issue that has not been reviewed by xarray team member labels Apr 15, 2024
@dcherian (Contributor)

Can you try h5netcdf?

@brendan-m-murphy (Author)

> Can you try h5netcdf?

Using engine="h5netcdf" in the call to open_dataset for flux1 does fix this (negligible difference between loading and not loading into memory first).

So the issue seems to be with netCDF4, or with how xarray uses netCDF4. In the netcdf4-python issue I linked above, I believe the problem was likewise absent with h5netcdf. That issue is slightly different, though: there the slowdown came from reading with a strided index like 0, 2, 4, ..., whereas reindexing with ffill usually produces repeated values in the index.

I'll try some variations/simplifying the example later.

@dcherian dcherian changed the title reindex_like is very slow if dataset not loaded into memory netCDF4 indexing: reindex_like is very slow if dataset not loaded into memory Apr 23, 2024
@dcherian (Contributor)

Interestingly, we actually treat h5netcdf and netCDF4 arrays differently in terms of indexing:

```python
# netCDF4_.py
def __getitem__(self, key):
    return indexing.explicit_indexing_adapter(
        key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
    )
```

```python
# h5netcdf_.py
def __getitem__(self, key):
    return indexing.explicit_indexing_adapter(
        key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR, self._getitem
    )
```

But in this example, the underlying arrays are being indexed with the same tuple, so yes, I'm inclined to blame netCDF4.

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([  0,   0,   1,   3,   4,   4,   5,   7,   8,   8,   9,  10,  12,
         12,  13,  15,  15,  16,  17,  19,  20,  20,  21,  22,  24,  25,
         26,  26,  28,  28,  30,  30,  31,  33,  34,  35,  36,  37,  38,
         38,  39,  40,  42,  42,  43,  45,  45,  47,  48,  48,  50,  50,
         52,  53,  53,  54,  56,  57,  58,  58,  59,  60,  62,  62,  64,
         65,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  76,  77,
         78,  78,  80,  81,  82,  82,  83,  85,  85,  87,  88,  88,  90,
         91,  91,  92,  94,  94,  95,  96,  98,  98, 100, 101, 101, 102,
        104, 104, 106, 107, 107, 108, 109, 110, 111, 113, 113, 114, 115,
        117, 117, 118, 119, 120, 121, 123, 123, 124, 126, 126, 128, 129,
        129, 131, 131, 132, 133, 134, 135, 136, 137, 138, 139, 141, 142,
        142, 144, 144, 145, 146, 148, 149, 149, 150, 152, 152, 154, 155,
        156, 157, 158, 159, 160, 161, 161, 163, 164, 165, 165, 166, 168,
        168, 169, 171, 172, 172, 174, 175, 175, 176, 177, 178, 180, 180,
        182, 183, 183, 185, 185, 186, 188, 189, 189, 191, 192, 192, 193,
        194, 195, 197, 198, 198]),
 array([  0,   1,   2,   3,   3,   4,   6,   6,   7,   8,  10,  11,  12,
         12,  14,  15,  16,  17,  17,  19,  19,  21,  21,  22,  23,  24,
         25,  26,  27,  29,  30,  30,  31,  33,  33,  34,  35,  37,  37,
         39,  39,  40,  42,  43,  43,  44,  46,  47,  47,  48,  50,  51,
         52,  52,  53,  55,  55,  56,  57,  59,  59,  61,  61,  62,  63,
         64,  66,  66,  67,  69,  70,  71,  71,  73,  73,  75,  75,  76,
         77,  78,  79,  80,  82,  82,  83,  84,  86,  87,  88,  88,  89,
         91,  91,  92,  94,  95,  95,  97,  98,  99, 100, 101, 101, 103,
        103, 104, 106, 106, 107, 108, 109, 111, 112, 112, 113, 114, 115,
        116, 118, 119, 119, 120, 121, 123, 124, 124, 126, 126, 127, 129,
        129, 130, 132, 133, 133, 134, 136, 137, 138, 139, 140, 141, 142,
        142, 144, 144, 145, 147, 147, 148, 149, 150, 152, 153, 153, 155,
        156, 157, 158, 158, 160, 161, 162, 162, 163, 164, 166, 167, 167,
        169, 169, 170, 171, 173, 174, 175, 176, 176, 178, 178, 180, 180,
        182, 182, 184, 184, 185, 187, 188, 189, 190, 191, 192, 193, 194,
        194, 196, 197, 198, 199, 200, 201, 202, 203, 203, 204, 205, 207,
        207, 209, 210, 210, 211, 213, 213, 215, 216, 216, 218, 219, 219,
        221, 222, 223, 223, 224, 225, 227, 227, 229, 229, 231, 232, 232,
        234, 235, 235, 236, 238, 239, 240, 240, 242, 242, 243, 245, 245,
        246, 248, 248, 249, 250, 252, 252, 253, 255, 255, 256, 257, 259,
        260, 260, 261, 263, 264, 264, 266, 267, 267, 269, 269, 270, 271,
        273, 274, 275, 276, 277, 278, 278, 279, 280, 281, 282, 283, 285,
        285, 287, 288, 288, 290, 290, 292, 293, 293, 295, 295, 296, 297,
        299]))
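For reference, a tuple like the one above is applied with outer ("orthogonal") indexing semantics: each axis gets its own 1-D integer index, and the result is the cross product of the index vectors. A minimal numpy sketch (shapes chosen for illustration only):

```python
import numpy as np

arr = np.arange(2 * 4 * 5).reshape(2, 4, 5)

# One 1-D integer index per axis; repeated entries, as produced by
# ffill reindexing, are allowed.
i = np.array([0, 0])
j = np.array([0, 1, 1, 3])
k = np.array([0, 2, 2])

# np.ix_ turns the three vectors into an outer (orthogonal) index.
out = arr[np.ix_(i, j, k)]
print(out.shape)  # (2, 4, 3)
```

In memory this is cheap; the `_StartCountStride` call in the profile suggests netCDF4 instead decomposes such an index into start/count/stride reads, which duplicated entries presumably defeat.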

I'm going to close this since there doesn't seem to be much to do here, except switch to h5netcdf.
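The "load first" workaround is fast because the duplicated, non-monotonic indexer is applied to an in-memory array rather than handed to the backend. The same idea in pure numpy terms (a conceptual sketch, not xarray's implementation): read one contiguous slice covering the needed range, then fancy-index in memory.

```python
import numpy as np

# Stand-in for an on-disk variable; in the real case this would be a
# lazily indexed netCDF4 variable.
data = np.arange(100 * 200).reshape(100, 200)

# A ffill-style indexer with repeats (non-strictly monotonic).
idx = np.array([0, 0, 1, 3, 4, 4, 5])

# Slow path (conceptually): hand the duplicated indexer to the backend.
direct = data[idx, :]

# Fast path: read one contiguous slice covering the needed range,
# then apply the duplicated indexer to the in-memory array.
lo, hi = idx.min(), idx.max() + 1
block = data[lo:hi, :]           # single contiguous read
via_block = block[idx - lo, :]   # cheap in-memory fancy indexing

print(np.array_equal(direct, via_block))  # True
```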

@brendan-m-murphy (Author)

Thanks for looking into it!
