
Excessive memory consumption by to_dataframe() #6561

Closed
sgdecker opened this issue May 2, 2022 · 4 comments
Labels
plan to close (May be closeable, needs more eyeballs)

Comments

sgdecker commented May 2, 2022

What happened?

This is a reincarnation of #2534 with a reproducible example.

A 51 MB netCDF file leads to to_dataframe() requesting 23.3 GiB.

What did you expect to happen?

I expect to_dataframe() to require much less than 23 GB of memory for this operation.

Minimal Complete Verifiable Example

import urllib.request
import xarray as xr

# Download the 51 MB surface observation file and open it with xarray
url = 'http://people.envsci.rutgers.edu/decker/Surface_METAR_20220501_0000.nc'
fname = 'metar.nc'
urllib.request.urlretrieve(url, filename=fname)
ncdata = xr.open_dataset(fname)
df = ncdata.to_dataframe()  # attempts to allocate 23.3 GiB

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Traceback (most recent call last):
  File "/chariton/decker/test/bug/xarraymem.py", line 8, in <module>
    df = ncdata.to_dataframe()
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5399, in to_dataframe
    return self._to_dataframe(ordered_dims=ordered_dims)
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5363, in _to_dataframe
    data = [
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5364, in <listcomp>
    self._variables[k].set_dims(ordered_dims).values.reshape(-1)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 23.3 GiB for an array with shape (5021, 127626) and data type |S39

Anything else we need to know?

No response

Environment

/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/_distutils_hack/init.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None
python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.62.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: None
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None

sgdecker added the bug and needs triage labels on May 2, 2022
max-sixty (Collaborator)

Great, thanks for the example, @sgdecker.

I think this is happening because there are variables of different dimensions that are getting broadcast together:

In [5]: ncdata[['lastChild']].to_dataframe()
Out[5]:
         lastChild
station
0         127265.0
1              NaN
2         127492.0
3         124019.0
4              NaN
...            ...
5016      124375.0
5017      126780.0
5018      126781.0
5019      124902.0
5020       93468.0

[5021 rows x 1 columns]

In [6]: ncdata[['lastChild','snowfall_amount']].to_dataframe()
Out[6]:
                lastChild  snowfall_amount
station recNum
0       0        127265.0              NaN
        1        127265.0              NaN
        2        127265.0              NaN
        3        127265.0              NaN
        4        127265.0              NaN
...                   ...              ...
5020    127621    93468.0              NaN
        127622    93468.0              NaN
        127623    93468.0              NaN
        127624    93468.0              NaN
        127625    93468.0              NaN

[640810146 rows x 2 columns]

640810146 rows is the giveaway.
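
For reference, that row count times the 39-byte |S39 itemsize reproduces the failed allocation:

In [7]: round(640810146 * 39 / 2**30, 1)  # rows × itemsize, in GiB
Out[7]: 23.3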

I'm not sure what we could do here — I don't think there's a way of producing a 2D dataframe without blowing this out?

We could offer a warning on this behavior beyond a certain size — we'd take a PR for that...
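
A minimal sketch of what such a check could look like (warn_if_large is a hypothetical helper, not part of xarray's API, and the 1 GiB threshold is arbitrary):

import math
import warnings

import xarray as xr

def warn_if_large(ds: xr.Dataset, threshold_bytes: int = 1 << 30) -> None:
    # to_dataframe() broadcasts every variable to the full product of the
    # dataset's dimension sizes, so estimate that footprint up front.
    n_rows = math.prod(ds.sizes.values())
    estimate = sum(n_rows * var.dtype.itemsize for var in ds.data_vars.values())
    if estimate > threshold_bytes:
        warnings.warn(
            f"to_dataframe() would allocate roughly {estimate / 2**30:.1f} GiB "
            f"({n_rows} rows); consider converting variables with common dims separately."
        )

warn_if_large(ncdata)  # for the file above this warns, since the estimate is tens of GiB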

sgdecker (Author) commented May 3, 2022

Thanks for the feedback and explanation. It seems the poorly constructed netCDF file is fundamentally to blame for triggering this behavior. A warning is a good idea, though.

max-sixty (Collaborator)

Thanks for the feedback and explanation. It seems the poorly constructed netCDF file is fundamentally to blame for triggering this behavior. A warning is a good idea, though.

I'm not sure it's necessarily poorly constructed; it can be quite useful to structure data like this, since having aligned data of different dimensions in a single dataset is great. But the property that makes datasets a good format also makes them a poor fit for a single table.

Probably what we'd want is to_dataframes(), which would create a dataframe for each combination of dimensions...
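
A rough sketch of that idea (to_dataframes here is a hypothetical free function, not an existing xarray method): group the data variables by their exact dims tuple, then convert each group separately so nothing is broadcast across unrelated dimensions.

from collections import defaultdict

import pandas as pd
import xarray as xr

def to_dataframes(ds: xr.Dataset) -> dict[tuple, pd.DataFrame]:
    # One dataframe per distinct combination of dimensions; variables in a
    # group share all their dims, so to_dataframe() has nothing to broadcast.
    groups: dict[tuple, list[str]] = defaultdict(list)
    for name, var in ds.data_vars.items():
        groups[var.dims].append(name)
    return {dims: ds[names].to_dataframe() for dims, names in groups.items()}

For the file above, this would presumably yield a small ('station',) table and a separate ('recNum',) table rather than one 640-million-row cross product.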

max-sixty removed the bug and needs triage labels on May 21, 2022
max-sixty added the plan to close label on Dec 9, 2023
max-sixty (Collaborator)

I think there's a potential to_dataframes feature (maybe with a different name), but without something more specific, we can probably close this until we have a real proposal.
