
Avoid loading any data for reprs #6722

Closed
4 tasks done
dcherian opened this issue Jun 24, 2022 · 5 comments · Fixed by #7203
Comments

@dcherian (Contributor)

dcherian commented Jun 24, 2022

What happened?

For "small" arrays, we load the data into memory when displaying the repr. For cloud-backed datasets with a large number of "small" variables, this can spend a lot of time sequentially loading O(100) variables just to build a repr.

elif array._in_memory or array.size < 1e5:
return short_numpy_repr(array)

What did you expect to happen?

Fast reprs!

Minimal Complete Verifiable Example

This dataset has 48 "small" variables

import xarray as xr

dc1 = xr.open_dataset(
    "s3://its-live-data/datacubes/v02/N40E080/ITS_LIVE_vel_EPSG32645_G0120_X250000_Y4750000.zarr",
    engine="zarr",
    storage_options={"anon": True},
)
dc1._repr_html_()

On 2022.03.0 this repr takes 36.4 s.
If I comment out the array.size condition, it takes 6 μs.
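A back-of-envelope sketch of why the threshold bites here (the per-variable sizes below are illustrative, not taken from the dataset): each of the 48 "small" variables falls under the `size < 1e5` cutoff, so the repr materializes every one of them sequentially over the network.

```python
# Illustrative only: every "small" variable passes the size check,
# so building the repr triggers 48 sequential cloud loads.
THRESHOLD = 1e5
variable_sizes = [8_000] * 48  # hypothetical element counts per variable

loaded = [s for s in variable_sizes if s < THRESHOLD]
print(len(loaded))  # 48 loads just to render a repr
```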

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:43:32) [Clang 12.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.11.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: None
matplotlib: 3.5.2
cartopy: 0.20.2
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.2
pip: 22.1.2
conda: None
pytest: None
IPython: 8.4.0
sphinx: 4.5.0

@dcherian
Contributor Author

cc @e-marshall @scottyhq

@TomNicholas
Contributor

So what's the solution here? Add another condition checking for more than a certain number of variables? Somehow check whether a dataset is cloud-backed?

@dcherian
Contributor Author

I think the best thing to do is to not load anything unless asked to. So delete the array.size < 1e5 condition.
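A minimal standalone sketch of that proposal, with hypothetical names (`choose_array_repr`, `short_numpy_repr`, `lazy_placeholder_repr`); the real dispatch lives in xarray's formatting internals, shown in the snippet quoted in the issue body:

```python
# Hypothetical sketch of the repr dispatch after the proposed change.
# Names are illustrative, not xarray's actual API.
def choose_array_repr(in_memory: bool, size: int) -> str:
    # Old behavior also loaded lazy arrays when size < 1e5:
    #   if in_memory or size < 1e5: ...
    # Proposed: show values only if the data is already in memory.
    if in_memory:
        return "short_numpy_repr"
    return "lazy_placeholder_repr"

print(choose_array_repr(False, 100))  # small but lazy -> not loaded
```

The design point is that size alone cannot distinguish a cheap in-process load from an expensive network round trip, so the only safe trigger is data that is already resident.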

@scottyhq
Contributor

This would be a pretty small change, and it only applies when loading data into numpy arrays. For the example dataset above, here is the current repr for a variable, followed by the modified repr (which is what large arrays already show):

[Screenshots: current variable repr showing loaded values, followed by the modified repr showing a lazy placeholder]

Seeing a few values at the edges can be nice, so this makes me realize how useful data summaries in the metadata (Zarr or STAC) are for large datasets on cloud storage.

@Illviljan
Contributor

Would the repr still be slow if, just before loading, the array were indexed down to a few start and end elements, e.g. array[[0, 1, -2, -1]]?
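A toy illustration of that idea (the `ToyLazyArray` class is invented for this sketch; a real lazy backend would fetch only the chunks the index touches): edge-indexing materializes 4 elements instead of the whole array.

```python
import numpy as np

# Toy stand-in for a lazy, cloud-backed array that counts how many
# elements actually get materialized. Invented for illustration only.
class ToyLazyArray:
    def __init__(self, n: int):
        self.n = n
        self.loaded = 0  # elements materialized so far

    def __getitem__(self, idx):
        vals = np.arange(self.n)[idx]
        self.loaded += vals.size
        return vals

arr = ToyLazyArray(1_000_000)
edges = arr[[0, 1, -2, -1]]  # only 4 elements materialized
print(edges, arr.loaded)
```

Whether this helps in practice depends on the backend: if each variable still pays one network round trip per index operation, the per-variable latency, not the byte count, may dominate.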
