
Excessive memory consumption by to_dataframe() #6561

Closed
sgdecker opened this issue May 2, 2022 · 4 comments
Labels
plan to close (May be closeable, needs more eyeballs)

Comments

sgdecker commented May 2, 2022

What happened?

This is a reincarnation of #2534 with a reproducible example.

A 51 MB netCDF file leads to to_dataframe() requesting 23.3 GiB.

What did you expect to happen?

I expect to_dataframe() to require much less than 23 GB of memory for this operation.

Minimal Complete Verifiable Example

import urllib.request
import xarray as xr

# Download the 51 MB surface observation file and open it with xarray
url = 'http://people.envsci.rutgers.edu/decker/Surface_METAR_20220501_0000.nc'
fname = 'metar.nc'
urllib.request.urlretrieve(url, filename=fname)
ncdata = xr.open_dataset(fname)
df = ncdata.to_dataframe()  # attempts to allocate 23.3 GiB

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Traceback (most recent call last):
  File "/chariton/decker/test/bug/xarraymem.py", line 8, in <module>
    df = ncdata.to_dataframe()
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5399, in to_dataframe
    return self._to_dataframe(ordered_dims=ordered_dims)
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5363, in _to_dataframe
    data = [
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5364, in <listcomp>
    self._variables[k].set_dims(ordered_dims).values.reshape(-1)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 23.3 GiB for an array with shape (5021, 127626) and data type |S39

Anything else we need to know?

No response

Environment

/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/_distutils_hack/init.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None
python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.62.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: None
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None

sgdecker added the bug and needs triage labels on May 2, 2022
max-sixty (Collaborator)

Great, thanks for the example, @sgdecker.

I think this is happening because there are variables of different dimensions that are getting broadcast together:

In [5]: ncdata[['lastChild']].to_dataframe()
Out[5]:
         lastChild
station
0         127265.0
1              NaN
2         127492.0
3         124019.0
4              NaN
...            ...
5016      124375.0
5017      126780.0
5018      126781.0
5019      124902.0
5020       93468.0

[5021 rows x 1 columns]

In [6]: ncdata[['lastChild','snowfall_amount']].to_dataframe()
Out[6]:
                lastChild  snowfall_amount
station recNum
0       0        127265.0              NaN
        1        127265.0              NaN
        2        127265.0              NaN
        3        127265.0              NaN
        4        127265.0              NaN
...                   ...              ...
5020    127621    93468.0              NaN
        127622    93468.0              NaN
        127623    93468.0              NaN
        127624    93468.0              NaN
        127625    93468.0              NaN

[640810146 rows x 2 columns]

640810146 rows is the giveaway.
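
For reference, that row count times the 39-byte |S39 itemsize reproduces the failed allocation:

In [7]: round(640810146 * 39 / 2**30, 1)  # rows × itemsize, in GiB
Out[7]: 23.3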

I'm not sure what we could do here — I don't think there's a way of producing a 2D dataframe without blowing this out?

We could offer a warning on this behavior beyond a certain size — we'd take a PR for that...
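
A minimal sketch of what such a check could look like (warn_if_large is a hypothetical helper, not part of xarray's API, and the 1 GiB threshold is arbitrary):

import math
import warnings

import xarray as xr

def warn_if_large(ds: xr.Dataset, threshold_bytes: int = 1 << 30) -> None:
    # to_dataframe() broadcasts every variable to the full product of the
    # dataset's dimension sizes, so estimate that footprint up front.
    n_rows = math.prod(ds.sizes.values())
    estimate = sum(n_rows * var.dtype.itemsize for var in ds.data_vars.values())
    if estimate > threshold_bytes:
        warnings.warn(
            f"to_dataframe() would allocate roughly {estimate / 2**30:.1f} GiB "
            f"({n_rows} rows); consider converting variables with common dims separately."
        )

warn_if_large(ncdata)  # for the file above this warns, since the estimate is tens of GiB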

sgdecker (Author) commented May 3, 2022

Thanks for the feedback and explanation. It seems the poorly constructed netCDF file is fundamentally to blame for triggering this behavior. A warning is a good idea, though.

max-sixty (Collaborator)

Thanks for the feedback and explanation. It seems the poorly constructed netCDF file is fundamentally to blame for triggering this behavior. A warning is a good idea, though.

I'm not sure it's necessarily poorly constructed; it can be quite useful to structure data like this, since having aligned data of different dimensions in a single dataset is great. But the property that makes datasets a good format also makes them a poor fit for a single table.

Probably what we'd want is to_dataframes(), which would create a dataframe for each combination of dimensions...
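
A rough sketch of that idea (to_dataframes here is a hypothetical free function, not an existing xarray method): group the data variables by their exact dims tuple, then convert each group separately so nothing is broadcast across unrelated dimensions.

from collections import defaultdict

import pandas as pd
import xarray as xr

def to_dataframes(ds: xr.Dataset) -> dict[tuple, pd.DataFrame]:
    # One dataframe per distinct combination of dimensions; variables in a
    # group share all their dims, so to_dataframe() has nothing to broadcast.
    groups: dict[tuple, list[str]] = defaultdict(list)
    for name, var in ds.data_vars.items():
        groups[var.dims].append(name)
    return {dims: ds[names].to_dataframe() for dims, names in groups.items()}

For the file above, this would presumably yield a small ('station',) table and a separate ('recNum',) table rather than one 640-million-row cross product.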

max-sixty removed the bug and needs triage labels on May 21, 2022
max-sixty added the plan to close label on Dec 9, 2023
max-sixty (Collaborator)

I think there's a potential to_dataframes feature (maybe with a different name), but without something more specific, we can probably close this until we have a real proposal.
