Excessive memory consumption by to_dataframe() #6561
Comments
Great, thanks for the example @sgdecker. I think this is happening because there are variables of different dimensions that are getting broadcast together:

In [5]: ncdata[['lastChild']].to_dataframe()
Out[5]:
lastChild
station
0 127265.0
1 NaN
2 127492.0
3 124019.0
4 NaN
... ...
5016 124375.0
5017 126780.0
5018 126781.0
5019 124902.0
5020 93468.0
[5021 rows x 1 columns]
In [6]: ncdata[['lastChild','snowfall_amount']].to_dataframe()
Out[6]:
lastChild snowfall_amount
station recNum
0 0 127265.0 NaN
1 127265.0 NaN
2 127265.0 NaN
3 127265.0 NaN
4 127265.0 NaN
... ... ...
5020 127621 93468.0 NaN
127622 93468.0 NaN
127623 93468.0 NaN
127624 93468.0 NaN
127625 93468.0 NaN
[640810146 rows x 2 columns]
I'm not sure what we could do here; I don't think there's a way of producing a 2D dataframe without blowing this out. We could offer a warning on this behavior beyond a certain size, and we'd take a PR for that.
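To illustrate the warning idea, here is a minimal sketch of such a size check. The helper name `warn_if_large`, the 1 GB threshold, and the recNum length (inferred from the 640,810,146-row output, since 5021 × 127626 = 640,810,146) are all assumptions for illustration, not xarray API:

```python
import warnings
import numpy as np

# Hypothetical helper: estimate the broadcast dataframe's footprint
# before converting, and warn past a threshold.
def warn_if_large(dim_sizes, n_vars, threshold_bytes=1_000_000_000):
    rows = int(np.prod(dim_sizes, dtype=np.int64))
    est = rows * n_vars * 8  # assume float64 columns; index overhead ignored
    if est > threshold_bytes:
        warnings.warn(
            f"to_dataframe() would allocate ~{est / 1e9:.1f} GB "
            f"for {rows:,} rows"
        )
    return est

# Dimension sizes matching the example above.
est = warn_if_large([5021, 127626], n_vars=2)
print(f"{est / 1e9:.1f} GB")  # ~10.3 GB for the two data columns alone
```

Even counting only the two float64 data columns, the estimate is already in the tens of gigabytes, so a pre-conversion check like this could catch the problem cheaply.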
Thanks for the feedback and explanation. It seems the poorly constructed netCDF file is fundamentally to blame for triggering this behavior. A warning is a good idea, though.
I'm not sure it's necessarily poorly constructed; it can be quite useful to structure data like this, and having aligned data of different dimensions in a single dataset is great. But the attribute of the data that makes datasets a good format also makes them a bad fit for a single table. Probably what we'd want is …
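One way to sidestep the blow-up is to convert variables that share dimensions separately instead of in one call. A minimal pandas sketch of the size difference (the frame names, lengths, and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy analog of the issue: two variables living on different dimensions,
# like lastChild (station) and snowfall_amount (recNum) above.
station_df = pd.DataFrame(
    {"lastChild": np.arange(5.0)},
    index=pd.Index(range(5), name="station"),
)
rec_df = pd.DataFrame(
    {"snowfall_amount": np.full(7, np.nan)},
    index=pd.Index(range(7), name="recNum"),
)

# Broadcasting both onto (station, recNum), as to_dataframe() does,
# yields the cross product of the two dimension lengths:
broadcast_rows = len(station_df) * len(rec_df)  # 35

# One dataframe per dimension keeps the total linear instead:
separate_rows = len(station_df) + len(rec_df)   # 12
print(broadcast_rows, separate_rows)
```

With the real dimension sizes the gap is the difference between roughly 130 thousand rows and 640 million.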
I think there's a potential …
What happened?
This is a reincarnation of #2534 with a reproducible example.
A 51 MB netCDF file leads to to_dataframe() requesting 23 GB.
What did you expect to happen?
I expect to_dataframe() to require much less than 23 GB of memory for this operation.
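The reported figures hang together arithmetically: the broadcast frame shown in the maintainer's Out[6] has 640,810,146 rows, and dividing the ~23 GB request by that row count gives the per-row cost (the breakdown in the final comment is an assumption, not a measurement):

```python
# Back-of-the-envelope check of the reported numbers.
rows = 640_810_146        # row count from the maintainer's Out[6]
reported_bytes = 23e9     # the ~23 GB request from this report
per_row = reported_bytes / rows
print(f"{per_row:.1f} bytes/row")  # ~35.9
# Two float64 data columns account for only 16 of those bytes; the
# remainder is plausibly index storage and intermediate copies.
```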
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/_distutils_hack/init.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
commit: None
python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.62.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: None
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 62.1.0
pip: 22.0.4
conda: None
pytest: None
IPython: None
sphinx: None