
netCDF4 indexing: reindex_like is very slow if dataset not loaded into memory #8945

Closed
brendan-m-murphy opened this issue Apr 15, 2024 · 4 comments

@brendan-m-murphy

What is your issue?

Reindexing a dataset without loading it into memory seems to be very slow (about 1000x slower than reindexing after loading into memory).

Here is a minimal working example:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = 100
nlat = 200
nlon = 300

fp = xr.Dataset(
    {"fp": (["time", "lat", "lon"], np.arange(times * nlat * nlon).reshape(times, nlat, nlon))},
    coords={
        "time": pd.date_range(start="2019-01-01T02:00:00", periods=times, freq="1H"),
        "lat": np.arange(nlat),
        "lon": np.arange(nlon),
    },
)

flux = xr.Dataset(
    {"flux": (["time", "lat", "lon"], np.arange(nlat * nlon).reshape(1, nlat, nlon))},
    coords={
        "time": [pd.to_datetime("2019-01-01")],
        "lat": np.arange(nlat) + np.random.normal(0.0, 0.01, nlat),
        "lon": np.arange(nlon) + np.random.normal(0.0, 0.01, nlon),
    },
)

fp.to_netcdf("combine_datasets_tests/fp.nc")
flux.to_netcdf("combine_datasets_tests/flux.nc")

fp1 = xr.open_dataset("combine_datasets_tests/fp.nc")
flux1 = xr.open_dataset("combine_datasets_tests/flux.nc")
```

Then

```python
flux1 = flux1.reindex_like(fp1, method="ffill", tolerance=None)
```

takes over a minute, while

```python
flux1 = flux1.load().reindex_like(fp1, method="ffill", tolerance=None)
```

is almost instantaneous (timeit reports 91 ms, including opening the dataset; I'm not sure whether caching influences this).

Profiling the "reindex without load" cell:

```
        804936 function calls (804622 primitive calls) in 93.285 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   92.211   92.211   93.191   93.191 {built-in method _operator.getitem}
        1    0.289    0.289    0.980    0.980 utils.py:81(_StartCountStride)
        6    0.239    0.040    0.613    0.102 shape_base.py:267(apply_along_axis)
    72656    0.109    0.000    0.109    0.000 utils.py:429(<lambda>)
    72656    0.085    0.000    0.136    0.000 utils.py:430(<lambda>)
    72661    0.051    0.000    0.051    0.000 {built-in method numpy.arange}
   145318    0.048    0.000    0.115    0.000 shape_base.py:370(<genexpr>)
        2    0.045    0.023    0.046    0.023 indexing.py:1334(__getitem__)
        6    0.044    0.007    0.044    0.007 numeric.py:136(ones)
   145318    0.044    0.000    0.067    0.000 index_tricks.py:690(__next__)
       14    0.033    0.002    0.033    0.002 {built-in method numpy.empty}
145333/145325    0.023    0.000    0.023    0.000 {built-in method builtins.next}
        1    0.020    0.020   93.275   93.275 duck_array_ops.py:317(where)
       21    0.018    0.001    0.018    0.001 {method 'astype' of 'numpy.ndarray' objects}
   145330    0.013    0.000    0.013    0.000 {built-in method numpy.asanyarray}
        1    0.002    0.002    0.002    0.002 {built-in method _functools.reduce}
        1    0.002    0.002   93.279   93.279 variable.py:821(_getitem_with_mask)
       18    0.001    0.000    0.001    0.000 {built-in method numpy.zeros}
        1    0.000    0.000    0.000    0.000 file_manager.py:226(close)
```

The getitem call at the top comes from `xarray/backends/netCDF4_.py`, line 114. Because of the jittered coordinates in flux, I assume the index passed to netCDF4 is not a run of consecutive, strictly increasing integers (0, 1, 2, 3, ...). This has caused problems in the past: Unidata/netcdf4-python#680.
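To make the indexer shape concrete, here is a small numpy sketch of roughly what ffill reindexing computes for one jittered axis (the names and the deterministic alternating jitter are illustrative, not xarray's implementation):

```python
import numpy as np

nlat = 6
# Deterministic "jitter": alternately shift each coordinate up/down by 0.01,
# mimicking the randomly perturbed lat/lon coordinates in the flux dataset.
jitter = np.where(np.arange(nlat) % 2 == 0, 0.01, -0.01)
source_lat = np.arange(nlat) + jitter          # coords of flux
target_lat = np.arange(nlat, dtype=float)      # coords of fp

# Forward-fill indexer: for each target, the position of the last source
# coordinate <= target (roughly what reindex(..., method="ffill") computes).
indexer = np.searchsorted(source_lat, target_lat, side="right") - 1
indexer = np.clip(indexer, 0, nlat - 1)

print(indexer)  # [0 1 1 3 3 5]
```

The result is non-decreasing but has repeated entries and skips, i.e. exactly the kind of non-consecutive integer index that ends up in the backend getitem call.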

In my venv, netCDF4 was installed from a wheel with the following versions:

netcdf4-python version: 1.6.5
HDF5 lib version:       1.12.2
netcdf lib version:     4.9.3-development

This is with xarray version 2023.12.0, numpy 1.26, and pandas 1.5.3.

I will try to investigate more and hopefully simplify the example. (Can't quite justify spending more time on it at work because this is just to tag a version that was used in some experiments before we switch to zarr as a backend, so hopefully it won't be relevant at that point.)

@brendan-m-murphy brendan-m-murphy added the needs triage Issue that has not been reviewed by xarray team member label Apr 15, 2024
@max-sixty max-sixty added topic-performance and removed needs triage Issue that has not been reviewed by xarray team member labels Apr 15, 2024
@dcherian (Contributor)

Can you try h5netcdf?

@brendan-m-murphy (Author)

> Can you try h5netcdf?

Using engine="h5netcdf" in the call to open_dataset for flux1 does fix this (negligible difference between loading and not loading into memory first).

So the issue seems to be with netCDF4, or with how xarray uses netCDF4. In the netcdf4-python issue I linked above, I believe the problem was likewise absent with h5netcdf. That issue is slightly different, though: there the slowdown came from reading with a strided index like 0, 2, 4, ..., whereas reindexing with ffill usually produces repeated values in the index.

I'll try some variations/simplifying the example later.

@dcherian dcherian changed the title reindex_like is very slow if dataset not loaded into memory netCDF4 indexing: reindex_like is very slow if dataset not loaded into memory Apr 23, 2024
@dcherian (Contributor)

Interestingly, we actually treat h5netcdf and netCDF4 arrays differently in terms of indexing:

```python
# netCDF4_.py
def __getitem__(self, key):
    return indexing.explicit_indexing_adapter(
        key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
    )
```

```python
# h5netcdf_.py
def __getitem__(self, key):
    return indexing.explicit_indexing_adapter(
        key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR, self._getitem
    )
```

But in this example, the underlying arrays are being indexed with the same tuple, so yes, I'm inclined to blame netCDF4.

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([  0,   0,   1,   3,   4,   4,   5,   7,   8,   8,   9,  10,  12,
         12,  13,  15,  15,  16,  17,  19,  20,  20,  21,  22,  24,  25,
         26,  26,  28,  28,  30,  30,  31,  33,  34,  35,  36,  37,  38,
         38,  39,  40,  42,  42,  43,  45,  45,  47,  48,  48,  50,  50,
         52,  53,  53,  54,  56,  57,  58,  58,  59,  60,  62,  62,  64,
         65,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  76,  77,
         78,  78,  80,  81,  82,  82,  83,  85,  85,  87,  88,  88,  90,
         91,  91,  92,  94,  94,  95,  96,  98,  98, 100, 101, 101, 102,
        104, 104, 106, 107, 107, 108, 109, 110, 111, 113, 113, 114, 115,
        117, 117, 118, 119, 120, 121, 123, 123, 124, 126, 126, 128, 129,
        129, 131, 131, 132, 133, 134, 135, 136, 137, 138, 139, 141, 142,
        142, 144, 144, 145, 146, 148, 149, 149, 150, 152, 152, 154, 155,
        156, 157, 158, 159, 160, 161, 161, 163, 164, 165, 165, 166, 168,
        168, 169, 171, 172, 172, 174, 175, 175, 176, 177, 178, 180, 180,
        182, 183, 183, 185, 185, 186, 188, 189, 189, 191, 192, 192, 193,
        194, 195, 197, 198, 198]),
 array([  0,   1,   2,   3,   3,   4,   6,   6,   7,   8,  10,  11,  12,
         12,  14,  15,  16,  17,  17,  19,  19,  21,  21,  22,  23,  24,
         25,  26,  27,  29,  30,  30,  31,  33,  33,  34,  35,  37,  37,
         39,  39,  40,  42,  43,  43,  44,  46,  47,  47,  48,  50,  51,
         52,  52,  53,  55,  55,  56,  57,  59,  59,  61,  61,  62,  63,
         64,  66,  66,  67,  69,  70,  71,  71,  73,  73,  75,  75,  76,
         77,  78,  79,  80,  82,  82,  83,  84,  86,  87,  88,  88,  89,
         91,  91,  92,  94,  95,  95,  97,  98,  99, 100, 101, 101, 103,
        103, 104, 106, 106, 107, 108, 109, 111, 112, 112, 113, 114, 115,
        116, 118, 119, 119, 120, 121, 123, 124, 124, 126, 126, 127, 129,
        129, 130, 132, 133, 133, 134, 136, 137, 138, 139, 140, 141, 142,
        142, 144, 144, 145, 147, 147, 148, 149, 150, 152, 153, 153, 155,
        156, 157, 158, 158, 160, 161, 162, 162, 163, 164, 166, 167, 167,
        169, 169, 170, 171, 173, 174, 175, 176, 176, 178, 178, 180, 180,
        182, 182, 184, 184, 185, 187, 188, 189, 190, 191, 192, 193, 194,
        194, 196, 197, 198, 199, 200, 201, 202, 203, 203, 204, 205, 207,
        207, 209, 210, 210, 211, 213, 213, 215, 216, 216, 218, 219, 219,
        221, 222, 223, 223, 224, 225, 227, 227, 229, 229, 231, 232, 232,
        234, 235, 235, 236, 238, 239, 240, 240, 242, 242, 243, 245, 245,
        246, 248, 248, 249, 250, 252, 252, 253, 255, 255, 256, 257, 259,
        260, 260, 261, 263, 264, 264, 266, 267, 267, 269, 269, 270, 271,
        273, 274, 275, 276, 277, 278, 278, 279, 280, 281, 282, 283, 285,
        285, 287, 288, 288, 290, 290, 292, 293, 293, 295, 295, 296, 297,
        299]))
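For reference, a tuple like the one above is applied with outer ("orthogonal") indexing semantics: each axis gets its own 1-D integer index, and the result is the cross product of the index vectors. A minimal numpy sketch (shapes chosen for illustration only):

```python
import numpy as np

arr = np.arange(2 * 4 * 5).reshape(2, 4, 5)

# One 1-D integer index per axis; repeated entries, as produced by
# ffill reindexing, are allowed.
i = np.array([0, 0])
j = np.array([0, 1, 1, 3])
k = np.array([0, 2, 2])

# np.ix_ turns the three vectors into an outer (orthogonal) index.
out = arr[np.ix_(i, j, k)]
print(out.shape)  # (2, 4, 3)
```

In memory this is cheap; the `_StartCountStride` call in the profile suggests netCDF4 instead decomposes such an index into start/count/stride reads, which duplicated entries presumably defeat.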

I'm going to close this since there doesn't seem to be much to do here, except switch to h5netcdf.
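The "load first" workaround is fast because the duplicated, non-monotonic indexer is applied to an in-memory array rather than handed to the backend. The same idea in pure numpy terms (a conceptual sketch, not xarray's implementation): read one contiguous slice covering the needed range, then fancy-index in memory.

```python
import numpy as np

# Stand-in for an on-disk variable; in the real case this would be a
# lazily indexed netCDF4 variable.
data = np.arange(100 * 200).reshape(100, 200)

# A ffill-style indexer with repeats (non-strictly monotonic).
idx = np.array([0, 0, 1, 3, 4, 4, 5])

# Slow path (conceptually): hand the duplicated indexer to the backend.
direct = data[idx, :]

# Fast path: read one contiguous slice covering the needed range,
# then apply the duplicated indexer to the in-memory array.
lo, hi = idx.min(), idx.max() + 1
block = data[lo:hi, :]           # single contiguous read
via_block = block[idx - lo, :]   # cheap in-memory fancy indexing

print(np.array_equal(direct, via_block))  # True
```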

@brendan-m-murphy (Author)

Thanks for looking into it!
