Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf #7397
By the way, using …
Thanks for this bug report. FWIW, I have also seen this bug recently when helping out a student. The question here is whether this is an xarray, numpy, or netCDF bug (or some combination). Can you reproduce the problem using …?
IIUC the amount of memory is about what the dimensions suggest (assuming a 4-byte dtype): (280 × 200 × 277 × 754 × 4 bytes) / 1024³ ≈ 43.57 GiB.

I'm not that familiar with the data flow in … Some questions, @benoitespinola:

- Can you show the reprs of the single-file Datasets and the repr of the combined Dataset?

Further suggestions:

- If you have multiple data variables, drop all but one prior to saving. Is the behaviour consistent for each of your variables?
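The back-of-envelope calculation above can be checked directly (the dimension sizes are taken from the comment; the 4-byte dtype is an assumption):

```python
# Estimate the in-memory size of the merged array (float32 assumed).
time, depth, lat, lon = 280, 200, 277, 754  # sizes quoted above
itemsize = 4  # bytes per float32 element (assumption)

nbytes = time * depth * lat * lon * itemsize
print(f"{nbytes / 1024**3:.2f} GiB")  # ≈ 43.57 GiB
```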
A single file (from ncdump -h):
And after the merge, the only difference is the time dimension, which goes from 28 to 280 (or so).
Just tested with to_zarr and it goes through:
I did an extra run using a memory profiler, as follows:
The profiled code also completed successfully:
Here is the outcome for the memory profiling:
PS: in this test I just realized I loaded 8 files instead of 5.
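The profiler invocation above was not captured in this extract. For reference, a stdlib-only way to measure peak memory of a workload uses `tracemalloc` (the workload function below is a placeholder for the open/merge/save steps, not the original code):

```python
# Measure current and peak Python memory with the stdlib tracemalloc module.
import tracemalloc

def workload():
    # Stand-in allocation; in the issue this would be the
    # open_mfdataset/to_netcdf workflow.
    return [bytes(1024) for _ in range(1000)]

tracemalloc.start()
data = workload()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current={current / 1024:.1f} KiB, peak={peak / 1024:.1f} KiB")
```

Note that `tracemalloc` only sees allocations made through Python's allocator; memory allocated inside C libraries (HDF5, netCDF-C) will not show up, which is one reason an external profiler can report very different numbers.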
Answering the question 'Did you do some processing with the data, changing attributes/encoding, etc.?': I will now try to put together an MCVE with dummy data.
By the way, prior to writing this ticket, I also did the following (which did not help):
Because I want to have worry-free holidays, I wrote a bit of code that creates a new NetCDF file from scratch: I load the data with xarray, convert it to NumPy arrays, and use the netCDF4 library to write the files, which does what I want. In the process I also slice the data and drop unwanted variables to keep just the bits I need (unlike in my original post).

If I call .load() or .compute() on my xarray variable, the memory usage goes crazy, even if I am dropping unwanted variables (which I would expect to release memory). The same happens for slicing followed by .compute().

Unfortunately, the MCVE will have to wait until I am back from my holidays. Happy holidays to all!
@benoitespinola did you make any progress with this? If yes, the solution would be useful to other users. If not, an MCVE is always appreciated.
What happened?
I have 5 NetCDF files (1 GiB each). They have 4 dimensions: time, depth, lat, lon. All files have exactly the same depth, lat, and lon values. The time axes share the same interval, there are no gaps along the time axis in any of the 5 files, and the axis is continuous from one file to the next.
All I am doing is merging the files along the time-axis and saving it to a new NetCDF file.
Running the script, I allocated 185 GiB of memory (the maximum on my cluster). The program runs until the to_netcdf() call, at which point I get an error stating there is not enough memory.
What did you expect to happen?
As the 5 files are 1 GiB each and I allocated 185 GiB (far more than 5² GiB), I expected the program to run without exceeding the allocated memory; after all, I gave it 37 times the combined size of the files.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
I allocated 185 GiB for this job; from my understanding, this means that merging 5 datasets of 1 GiB each requires more than 185 GiB of memory. It sounds like a memory leak to me.
I am not the only one with this issue, cf: #4890
Environment
/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.26.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.6.0
pandas: 1.4.4
numpy: 1.23.2
scipy: 1.9.1
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.9.0
distributed: 2022.9.0
matplotlib: 3.5.3
cartopy: None
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.3.0
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 7.33.0
sphinx: 5.1.1