
Avoid loading any data for reprs #6722

Closed
4 tasks done
dcherian opened this issue Jun 24, 2022 · 5 comments · Fixed by #7203
Comments

@dcherian (Contributor)

dcherian commented Jun 24, 2022

What happened?

For "small" arrays, we load the data into memory when displaying the repr. For cloud-backed datasets with a large number of "small" variables, this can spend a lot of time sequentially loading O(100) variables just to build a repr.

elif array._in_memory or array.size < 1e5:
return short_numpy_repr(array)

What did you expect to happen?

Fast reprs!

Minimal Complete Verifiable Example

This dataset has 48 "small" variables

import xarray as xr

dc1 = xr.open_dataset(
    "s3://its-live-data/datacubes/v02/N40E080/ITS_LIVE_vel_EPSG32645_G0120_X250000_Y4750000.zarr",
    engine="zarr",
    storage_options={"anon": True},
)
dc1._repr_html_()

On 2022.03.0 this repr takes 36.4 s.
If I comment out the array.size condition, it takes 6 μs.
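A back-of-envelope sketch of why the threshold bites here (the per-variable sizes below are illustrative, not taken from the dataset): each of the 48 "small" variables falls under the `size < 1e5` cutoff, so the repr materializes every one of them sequentially over the network.

```python
# Illustrative only: every "small" variable passes the size check,
# so building the repr triggers 48 sequential cloud loads.
THRESHOLD = 1e5
variable_sizes = [8_000] * 48  # hypothetical element counts per variable

loaded = [s for s in variable_sizes if s < THRESHOLD]
print(len(loaded))  # 48 loads just to render a repr
```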

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:43:32) [Clang 12.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.11.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: None
matplotlib: 3.5.2
cartopy: 0.20.2
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.2
pip: 22.1.2
conda: None
pytest: None
IPython: 8.4.0
sphinx: 4.5.0

@dcherian
Contributor Author

cc @e-marshall @scottyhq

@TomNicholas
Contributor

So what's the solution here? Add another condition checking for more than a certain number of variables? Somehow check whether a dataset is cloud-backed?

@dcherian
Contributor Author

I think the best thing to do is to not load anything unless asked to. So delete the array.size < 1e5 condition.
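A minimal standalone sketch of that proposal, with hypothetical names (`choose_array_repr`, `short_numpy_repr`, `lazy_placeholder_repr`); the real dispatch lives in xarray's formatting internals, shown in the snippet quoted in the issue body:

```python
# Hypothetical sketch of the repr dispatch after the proposed change.
# Names are illustrative, not xarray's actual API.
def choose_array_repr(in_memory: bool, size: int) -> str:
    # Old behavior also loaded lazy arrays when size < 1e5:
    #   if in_memory or size < 1e5: ...
    # Proposed: show values only if the data is already in memory.
    if in_memory:
        return "short_numpy_repr"
    return "lazy_placeholder_repr"

print(choose_array_repr(False, 100))  # small but lazy -> not loaded
```

The design point is that size alone cannot distinguish a cheap in-process load from an expensive network round trip, so the only safe trigger is data that is already resident.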

@scottyhq
Contributor

This would be a pretty small change, and it only applies when loading data into numpy arrays. For the example dataset above, here is the current repr for a variable, followed by the modified repr (which is what large arrays already show):

[Screenshots: current variable repr showing loaded values, followed by the modified repr showing a lazy placeholder]

Seeing a few values at the edges can be nice, so this makes me realize how useful data summaries in the metadata (Zarr or STAC) are for large datasets on cloud storage.

@Illviljan
Contributor

Would the repr still be slow if, just before loading, the array were indexed down to a few start and end elements, e.g. array[[0, 1, -2, -1]]?
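A toy illustration of that idea (the `ToyLazyArray` class is invented for this sketch; a real lazy backend would fetch only the chunks the index touches): edge-indexing materializes 4 elements instead of the whole array.

```python
import numpy as np

# Toy stand-in for a lazy, cloud-backed array that counts how many
# elements actually get materialized. Invented for illustration only.
class ToyLazyArray:
    def __init__(self, n: int):
        self.n = n
        self.loaded = 0  # elements materialized so far

    def __getitem__(self, idx):
        vals = np.arange(self.n)[idx]
        self.loaded += vals.size
        return vals

arr = ToyLazyArray(1_000_000)
edges = arr[[0, 1, -2, -1]]  # only 4 elements materialized
print(edges, arr.loaded)
```

Whether this helps in practice depends on the backend: if each variable still pays one network round trip per index operation, the per-variable latency, not the byte count, may dominate.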
