
Memory issue merging NetCDF files using xarray.open_mfdataset and to_netcdf #7397

Open
2 of 4 tasks
benoitespinola opened this issue Dec 21, 2022 · 9 comments

@benoitespinola

benoitespinola commented Dec 21, 2022

What happened?

I have 5 NetCDF files (1 GiB each). They have 4 dimensions: time, depth, lat, lon. All the files have exactly the same depth, lat, lon. The time axes have the same interval, with no gaps within any of the 5 files and continuity between consecutive files.

All I am doing is merging the files along the time-axis and saving it to a new NetCDF file.

To run the script, I allocated 185 GiB of memory (the maximum on my cluster).

The program runs until to_netcdf() is called, then fails with an error stating there is not enough memory.

What did you expect to happen?

As the 5 files are 1 GiB each and I allocated 185 GiB (far more than 5² = 25 GiB), I expected the program to run within the allocated memory (after all, I gave it 37 times the combined size of the files).

Minimal Complete Verifiable Example

import xarray as xr

path = './data/data_*.nc' # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
data = xr.open_mfdataset(path)

data = data.load() # uses 5 GiB - tested with a memory profiler

data.to_netcdf('./output/combined.nc') # fails here

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Traceback (most recent call last):
  File "/users/me/code/par2.py", line 78, in <module>
    preprocess_data(year, month)
  File "/users/me/code/par2.py", line 69, in preprocess_data
    data.to_netcdf(path=outpath)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/dataset.py", line 1882, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/api.py", line 1210, in to_netcdf
    dump_to_store(
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/api.py", line 1257, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/common.py", line 263, in store
    variables, attributes = self.encode(variables, attributes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/backends/common.py", line 352, in encode
    variables, attributes = cf_encoder(variables, attributes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 864, in cf_encoder
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 864, in <dictcomp>
    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/conventions.py", line 273, in encode_cf_variable
    var = coder.encode(var, name=name)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/coding/variables.py", line 170, in encode
    data = duck_array_ops.fillna(data, fill_value)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 283, in fillna
    return where(notnull(data), data, other)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/xarray/core/duck_array_ops.py", line 270, in where
    return _where(condition, *as_shared_dtype([x, y]))
  File "<__array_function__ internals>", line 180, in where
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 43.6 GiB for an array with shape (280, 200, 277, 754) and data type float32

Anything else we need to know?

I allocated 185 GiB for this job; from my understanding, this means that merging 5 datasets of 1 GiB each requires more than 185 GiB of memory. It sounds like a memory leak to me.

I am not the only one with this issue, cf. #4890

Environment

/CSC_CONTAINER/miniconda/envs/env1/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.26.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.6.0
pandas: 1.4.4
numpy: 1.23.2
scipy: 1.9.1
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.9.0
distributed: 2022.9.0
matplotlib: 3.5.3
cartopy: None
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.3.0
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 7.33.0
sphinx: 5.1.1

@benoitespinola benoitespinola added bug needs triage Issue that has not been reviewed by xarray team member labels Dec 21, 2022
@benoitespinola (Author)

By the way, inspecting .encoding on my data shows 'complevel': 1.

@TomNicholas (Contributor)

Thanks for this bug report. FWIW I have also seen this bug recently when helping out a student.

The question here is whether this is an xarray, numpy, or netCDF bug (or some combination). Can you reproduce the problem using to_zarr()? If so, that would rule out netCDF as the culprit.

@TomNicholas TomNicholas added topic-backends topic-combine combine/concat/merge io and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 22, 2022
@kmuehlbauer (Contributor)

IIUC the amount of memory is exactly what the dimensions suggest (assuming a 4-byte dtype):

(280 * 200 * 277 * 754 * 4 bytes) / 1024³ ≈ 43.57 GiB
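As a sanity check, the allocation reported in the traceback matches this arithmetic:

```python
# float32 array with the shape reported in the MemoryError: (280, 200, 277, 754)
nbytes = 280 * 200 * 277 * 754 * 4   # element count times 4 bytes per float32
print(nbytes / 1024**3)              # ≈ 43.57, i.e. the 43.6 GiB in the traceback
```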

I'm not that familiar with the data flow in to_netcdf, but it's clear that the whole data is read into memory for some reason. The error happens at the backend level, so I assume engine="netcdf4". You might try engine="h5netcdf", or consider @TomNicholas's suggestion of using to_zarr, to take the backends out of the equation.

Some questions, @benoitespinola:

  • Can you show the reprs of the single-file Datasets and of the combined one?
  • Are your final data variables of that size (time: 280, depth: 200, lat: 277, lon: 754)?
  • Did you do any processing of the data, changing attributes/encoding etc.?
  • Is it possible to create your source data files from scratch with random data? An MCVE showing that would help.

Further suggestions:

  • If you have multiple data variables, drop all but one prior to saving. Is the behaviour consistent for each of your variables?
  • Try to be explicit in the call to open_mfdataset (e.g. adding the chunks keyword etc.).
  • Try opening the individual files and using xr.merge/xr.concat.
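The last two suggestions can be sketched as follows, with tiny synthetic stand-ins for the data files (file names, variable name, and sizes are all invented; chunking requires dask):

```python
import glob
import numpy as np
import xarray as xr

# Hypothetical small stand-ins for data_1.nc ... data_5.nc
for i in range(2):
    xr.Dataset(
        {"toce": (("time_counter", "y", "x"),
                  np.full((3, 4, 5), i, dtype="float32"))},
        coords={"time_counter": np.arange(i * 3, (i + 1) * 3)},
    ).to_netcdf(f"data_{i}.nc")

# Explicit chunks keep the data lazy (dask-backed) with a bounded chunk size.
combined = xr.open_mfdataset("data_*.nc", chunks={"time_counter": 3})

# The equivalent manual route: open files individually and concatenate.
parts = [xr.open_dataset(p, chunks={"time_counter": 3})
         for p in sorted(glob.glob("data_*.nc"))]
manual = xr.concat(parts, dim="time_counter")
```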

@benoitespinola (Author)

A single file (from ncdump -h):

dimensions:
	axis_nbounds = 2 ;
	x = 754 ;
	y = 277 ;
	deptht = 200 ;
	time_counter = UNLIMITED ; // (28 currently)
variables:
	float nav_lat(y, x) ;
		nav_lat:standard_name = "latitude" ;
		nav_lat:long_name = "Latitude" ;
		nav_lat:units = "degrees_north" ;
	float nav_lon(y, x) ;
		nav_lon:standard_name = "longitude" ;
		nav_lon:long_name = "Longitude" ;
		nav_lon:units = "degrees_east" ;
	float deptht(deptht) ;
		deptht:name = "deptht" ;
		deptht:long_name = "Vertical T levels" ;
		deptht:units = "m" ;
		deptht:positive = "down" ;
		deptht:bounds = "deptht_bounds" ;
	float deptht_bounds(deptht, axis_nbounds) ;
		deptht_bounds:units = "m" ;
	double time_centered(time_counter) ;
		time_centered:standard_name = "time" ;
		time_centered:long_name = "Time axis" ;
		time_centered:calendar = "gregorian" ;
		time_centered:units = "seconds since 1900-01-01 00:00:00" ;
		time_centered:time_origin = "1900-01-01 00:00:00" ;
		time_centered:bounds = "time_centered_bounds" ;
	double time_centered_bounds(time_counter, axis_nbounds) ;
	double time_counter(time_counter) ;
		time_counter:axis = "T" ;
		time_counter:standard_name = "time" ;
		time_counter:long_name = "Time axis" ;
		time_counter:calendar = "gregorian" ;
		time_counter:units = "seconds since 1900-01-01 00:00:00" ;
		time_counter:time_origin = "1900-01-01 00:00:00" ;
		time_counter:bounds = "time_counter_bounds" ;
	double time_counter_bounds(time_counter, axis_nbounds) ;
	float toce(time_counter, deptht, y, x) ;
		toce:standard_name = "sea_water_potential_temperature" ;
		toce:long_name = "temperature" ;
		toce:units = "degC" ;
		toce:online_operation = "average" ;
		toce:interval_operation = "60 s" ;
		toce:interval_write = "6 h" ;
		toce:cell_methods = "time: mean (interval: 60 s)" ;
		toce:_FillValue = 1.e+20f ;
		toce:missing_value = 1.e+20f ;
		toce:coordinates = "time_centered nav_lat nav_lon" ;
	float soce(time_counter, deptht, y, x) ;
		soce:standard_name = "sea_water_practical_salinity" ;
		soce:long_name = "salinity" ;
		soce:units = "1e-3" ;
		soce:online_operation = "average" ;
		soce:interval_operation = "60 s" ;
		soce:interval_write = "6 h" ;
		soce:cell_methods = "time: mean (interval: 60 s)" ;
		soce:_FillValue = 1.e+20f ;
		soce:missing_value = 1.e+20f ;
		soce:coordinates = "time_centered nav_lat nav_lon" ;
	float taum(time_counter, y, x) ;
		taum:standard_name = "magnitude_of_surface_downward_stress" ;
		taum:long_name = "wind stress module" ;
		taum:units = "N/m2" ;
		taum:online_operation = "average" ;
		taum:interval_operation = "120 s" ;
		taum:interval_write = "6 h" ;
		taum:cell_methods = "time: mean (interval: 120 s)" ;
		taum:_FillValue = 1.e+20f ;
		taum:missing_value = 1.e+20f ;
		taum:coordinates = "time_centered nav_lat nav_lon" ;
	float wspd(time_counter, y, x) ;
		wspd:standard_name = "wind_speed" ;
		wspd:long_name = "wind speed module" ;
		wspd:units = "m/s" ;
		wspd:online_operation = "average" ;
		wspd:interval_operation = "120 s" ;
		wspd:interval_write = "6 h" ;
		wspd:cell_methods = "time: mean (interval: 120 s)" ;
		wspd:_FillValue = 1.e+20f ;
		wspd:missing_value = 1.e+20f ;
		wspd:coordinates = "time_centered nav_lat nav_lon" ;

And after the merge, the only difference is the time dimension, which goes from 28 to 280 (or so).
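For a header like this, dropping all but the variables of interest before writing (one of the suggestions above) can be sketched as follows, with synthetic stand-in data (the real toce/soce also carry a deptht dimension, omitted here for brevity):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in with the four data variables from the ncdump header.
data_vars = {name: (("time_counter", "y", "x"),
                    np.zeros((2, 3, 4), dtype="float32"))
             for name in ("toce", "soce", "taum", "wspd")}
ds = xr.Dataset(data_vars)

subset = ds[["toce", "soce"]]           # keep only the variables of interest
subset.to_netcdf("combined_subset.nc")  # writes just toce/soce (plus coordinates)
```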

@benoitespinola (Author)

Just tested with to_zarr and it goes through:

State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:55
CPU Efficiency: 63.00% of 00:12:34 core-walltime
Job Wall-clock time: 00:06:17
Memory Utilized: 164.89 GB
Memory Efficiency: 44.56% of 370.00 GB

I did an extra run using a memory profiler, as follows:

import xarray as xr
import zarr
from memory_profiler import profile

@profile
def main():
    path = './data/data_*.nc' # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc
    data = xr.open_mfdataset(path)

    data = data.load()
    data = data.compute()

    data.to_zarr()

if __name__=='__main__':
    main()

The profiled code also completed successfully:

State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:07:52
CPU Efficiency: 63.61% of 00:12:22 core-walltime
Job Wall-clock time: 00:06:11
Memory Utilized: 165.53 GB
Memory Efficiency: 44.74% of 370.00 GB

Here is the outcome for the memory profiling:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    5    156.9 MiB    156.9 MiB           1   @profile
    6                                         def main():
    7    156.9 MiB      0.0 MiB           1       path = './data/data_*.nc' # files are: data_1.nc data_2.nc data_3.nc data_4.nc data_5.nc

    8    209.3 MiB     52.4 MiB           1       data = xr.open_mfdataset(path)
    9                                         
    10  82150.1 MiB  81940.8 MiB           1       data = data.load()
    11  82101.2 MiB    -49.0 MiB           1       data = data.compute()
    12                                         
    13  90091.2 MiB   7990.0 MiB           1       data.to_zarr()

PS: in this test I just realized I loaded 8 files instead of 5.

@benoitespinola (Author)

Answering the question 'Did you do some processing with the data, changing attributes/encoding etc?':
No processing. I do ask xarray to load the data (I also tried loading + computing) and the final outcome is the same.

I will now try to build an MCVE with dummy data.

@benoitespinola (Author)

By the way, prior to writing this ticket, I also tried the following (which did not help):
dropping the variables I do not care about, keeping only the dimensions plus toce and soce; I would expect to need less memory after that.

@benoitespinola (Author)

benoitespinola commented Dec 23, 2022

Because I want to have worry-free holidays, I wrote a bit of code that creates a new NetCDF file from scratch. I load the data with xarray, convert it to NumPy arrays, and use the netCDF4 library to write the files (which does what I want).

In the process, I also slice the data and drop unwanted variables to keep just the bits I want (unlike in my original post).

If I call .load() or .compute() on my xarray variable, the memory usage goes crazy (even though I am dropping unwanted variables, which I would expect to release memory). The same happens for slicing followed by .compute().

Unfortunately, the MCVE will have to wait until I am back from my holidays.

Happy holidays to all!

@kmuehlbauer (Contributor)

@benoitespinola Did you get anywhere with this? If yes, the solution would be useful to other users. If not, an MCVE is always appreciated.
