
Extremely Large Memory usage for a very small variable #5604

Closed
tommy307507 opened this issue Jul 15, 2021 · 15 comments

Comments

@tommy307507

What happened:
A variable that takes up very little memory in terms of actual data size uses over 1000x that memory.
What you expected to happen:
It should take memory on the order of the data size.
Minimal Complete Verifiable Example:

>>> import xarray as xr
>>>a  = ['./CLM/u_2021-06-01T00_regridded.nc', './CLM/u_2021-06-01T03_regridded.nc', './CLM/u_2021-06-01T06_regridded.nc', './CLM/u_2021-06-01T09_regridded.nc', './CLM/u_2021-06-01T12_regridded.nc', './CLM/u_2021-06-01T15_regridded.nc', './CLM/u_2021-06-01T18_regridded.nc', './CLM/u_2021-06-01T21_regridded.nc', './CLM/u_2021-06-02T00_regridded.nc', './CLM/u_2021-06-02T03_regridded.nc', './CLM/u_2021-06-02T06_regridded.nc', './CLM/u_2021-06-02T09_regridded.nc', './CLM/u_2021-06-02T12_regridded.nc', './CLM/u_2021-06-02T15_regridded.nc', './CLM/u_2021-06-02T18_regridded.nc', './CLM/u_2021-06-02T21_regridded.nc', './CLM/u_2021-06-03T00_regridded.nc', './CLM/u_2021-06-03T03_regridded.nc', './CLM/u_2021-06-03T06_regridded.nc', './CLM/u_2021-06-03T09_regridded.nc', './CLM/u_2021-06-03T12_regridded.nc', './CLM/u_2021-06-03T15_regridded.nc', './CLM/u_2021-06-03T18_regridded.nc', './CLM/u_2021-06-03T21_regridded.nc', './CLM/u_2021-06-04T00_regridded.nc', './CLM/u_2021-06-04T03_regridded.nc', './CLM/u_2021-06-04T06_regridded.nc', './CLM/u_2021-06-04T09_regridded.nc', './CLM/u_2021-06-04T12_regridded.nc', './CLM/u_2021-06-04T15_regridded.nc', './CLM/u_2021-06-04T18_regridded.nc', './CLM/u_2021-06-04T21_regridded.nc', './CLM/u_2021-06-05T00_regridded.nc', './CLM/u_2021-06-05T03_regridded.nc', './CLM/u_2021-06-05T06_regridded.nc', './CLM/u_2021-06-05T09_regridded.nc', './CLM/u_2021-06-05T12_regridded.nc', './CLM/u_2021-06-05T15_regridded.nc', './CLM/u_2021-06-05T18_regridded.nc', './CLM/u_2021-06-05T21_regridded.nc', './CLM/u_2021-06-06T00_regridded.nc', './CLM/u_2021-06-06T03_regridded.nc', './CLM/u_2021-06-06T06_regridded.nc', './CLM/u_2021-06-06T09_regridded.nc', './CLM/u_2021-06-06T12_regridded.nc', './CLM/u_2021-06-06T15_regridded.nc', './CLM/u_2021-06-06T18_regridded.nc', './CLM/u_2021-06-06T21_regridded.nc', './CLM/u_2021-06-07T00_regridded.nc', './CLM/u_2021-06-07T03_regridded.nc', './CLM/u_2021-06-07T06_regridded.nc', './CLM/u_2021-06-07T09_regridded.nc', './CLM/u_2021-06-07T12_regridded.nc', './CLM/u_2021-06-07T15_regridded.nc', './CLM/u_2021-06-07T18_regridded.nc', './CLM/u_2021-06-07T21_regridded.nc', './CLM/u_2021-06-08T00_regridded.nc', './CLM/u_2021-06-08T03_regridded.nc', './CLM/u_2021-06-08T06_regridded.nc']
>>> u_file =  xr.open_mfdataset(a,data_vars='minimal',combine="by_coords",parallel=True,chunks={'eu': 1100, 'xu': 1249,'v2d_time' : 1 , 'v3d_time' : 1 , "s_rho" : 1},autoclose=True)
>>> u_file
<xarray.Dataset>
Dimensions:   (eu: 1100, s_rho: 35, v2d_time: 59, v3d_time: 59, xu: 1249)
Coordinates:
  * v2d_time  (v2d_time) timedelta64[ns] 59366 days 00:00:00 ... 59373 days 0...
  * v3d_time  (v3d_time) timedelta64[ns] 59366 days 00:00:00 ... 59373 days 0...
Dimensions without coordinates: eu, s_rho, xu
Data variables:
    u         (v3d_time, s_rho, eu, xu) float64 dask.array<chunksize=(1, 1, 1100, 1249), meta=np.ndarray>
    ubar      (v2d_time, eu, xu) float64 dask.array<chunksize=(59, 1100, 1249), meta=np.ndarray>
>>> u_file.ubar.shape
(59, 1100, 1249)
>>> u_file.ubar.data.shape
(59, 1100, 1249)
>>> u_file.ubar.data.nbytes
648480800
>>>
KeyboardInterrupt
>>> u_file.ubar.data
dask.array<where, shape=(59, 1100, 1249), dtype=float64, chunksize=(59, 1100, 1249), chunktype=numpy.ndarray>
>>> u_file.ubar.data.compute()

[image: output of the Linux top command after executing the line with compute()]

Anything else we need to know?:
The variable u can be written to disk with about 22 GB of memory usage, which is the expected behaviour, as the variable has about 22 GB of data stored across those files combined. In fact, since u is 35x the size of ubar (the extra s_rho dimension has length 35), the full u_file dataset should only be about 22 to 23 GB.
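For reference, the in-memory sizes implied by the shapes and float64 dtype shown in the repr above (plain arithmetic, nothing assumed beyond those numbers):

>>> 59 * 35 * 1100 * 1249 * 8 / 1e9   # u (v3d_time, s_rho, eu, xu), float64 -> GB
22.696828
>>> 59 * 1100 * 1249 * 8 / 1e9        # ubar (v2d_time, eu, xu), float64 -> GB
0.6484808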
Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.9.0
iris: None
bottleneck: None
dask: 2021.04.0
distributed: 2021.04.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.3
conda: None
pytest: None
IPython: 7.24.1
sphinx: None

@tommy307507
Author

The variables can be combined with xr.concat if I open the individual files using xr.open_dataset, and that takes only about 1.1 GB of memory, so I think the issue is somewhere inside open_mfdataset.
I also don't understand how the chunksize of v2d_time is 59 instead of 1

@max-sixty
Collaborator

This will likely need much more detail. Though to start: what's the source of the 1000x number?
What happens if you pass compat="identical", coords="minimal" to open_mfdataset? If that fails, the opening operation may be doing some expensive alignment.
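For concreteness, a minimal sketch of what that suggestion looks like, reusing the file list a and the chunks from the report above:

>>> u_file = xr.open_mfdataset(
...     a,
...     combine="by_coords",
...     data_vars="minimal",
...     coords="minimal",
...     compat="identical",
...     parallel=True,
...     chunks={'eu': 1100, 'xu': 1249, 'v2d_time': 1, 'v3d_time': 1, 's_rho': 1},
... )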

@TomNicholas
Contributor

I also don't understand how the chunksize of v2d_time is 59 instead of 1

Is v2d_time one of the dimensions being concatenated along by open_mfdataset?

@tommy307507
Author

I also don't understand how the chunksize of v2d_time is 59 instead of 1

Is v2d_time one of the dimensions being concatenated along by open_mfdataset?

Yes, I will try the above tomorrow and post back here.
I did try passing concat_dim=["v2d_time", "v3d_time"], but that still causes the problem.

@TomNicholas
Contributor

TomNicholas commented Jul 15, 2021 via email:

An example which we can reproduce locally would be the most helpful, if possible!

@tommy307507
Author

My temporary workaround is to call open_dataset on each of the files, store u and ubar in two separate lists, run xr.concat on each list, and save the result to file.
They concatenate just fine, the output file is about the expected size of 23 GB, and the operation uses a similar amount of memory.
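A rough sketch of that workaround, reusing the file list a from above (the chunk sizes, intermediate names, and output filename here are illustrative, not from the original comment):

>>> datasets = [xr.open_dataset(f, chunks={'eu': 1100, 'xu': 1249, 's_rho': 1}) for f in a]
>>> u = xr.concat([ds.u for ds in datasets], dim="v3d_time")
>>> ubar = xr.concat([ds.ubar for ds in datasets], dim="v2d_time")
>>> xr.Dataset({"u": u, "ubar": ubar}).to_netcdf("u_combined.nc")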

@tommy307507
Author

An example which we can reproduce locally would be the most helpful, if possible!


Thanks for your quick reply, but I am not at work right now as it's 1am over here.
I might test where the limit of this kicks in tomorrow. I am trying to merge 59 files right now, so I might try fewer files to find the lower limit, since passing 20 GB of files around would be quite hard.

@tommy307507
Author

tommy307507 commented Jul 16, 2021

This will likely need much more detail. Though to start: what's the source of the 1000x number?
What happens if you pass compat="identical", coords="minimal" to open_mfdataset? If that fails, the opening operation may be doing some expensive alignment.

Trying this gives me "conflicting values for variable 'ubar' on objects to be combined." Actually that makes sense, as identical requires the values to be the same, right?

@tommy307507
Author

For ubar it says
dask.array<where, shape=(59, 1100, 1249), dtype=float64, chunksize=(59, 1100, 1249), chunktype=numpy.ndarray>
but for u it says
dask.array<concatenate, shape=(59, 35, 1100, 1249), dtype=float64, chunksize=(1, 1, 1100, 1249), chunktype=numpy.ndarray>
Those are very different operations. Is that the reason for the 1000 GB consumption?

@max-sixty
Collaborator

Again — where are you seeing this 1000GB or 1000x number?

(also have a look at GitHub docs on how to format the code)

@tommy307507
Author

tommy307507 commented Jul 16, 2021

Again — where are you seeing this 1000GB or 1000x number?

(also have a look at GitHub docs on how to format the code)

Sorry, I think the 1000x figure was a confusion on my part from misreading the numbers or misunderstanding how memory units work, but I will explain it again.
In the top output, the process draws all 100 GiB of memory and then starts using swap, to the point that the system automatically kills the code. The ubar variable should only need 59 * 1100 * 1249 * 8 = 648,480,800 bytes of memory, which is only about 0.65 GB. However, top shows it using 92.5 GiB of memory plus all 16 GiB of swap, so the memory actually drawn by the program is about 109 GiB (because that is all that is available before it gets automatically killed), which is in fact only about 168x what is really needed.
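For the unit conversion (plain arithmetic on the nbytes value printed earlier; top's figures are binary units):

>>> round(648_480_800 / 1e9, 3)      # decimal gigabytes (GB)
0.648
>>> round(648_480_800 / 2**30, 3)    # binary gibibytes (GiB)
0.604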

@max-sixty
Collaborator

The memory usage does seem high. Not having the indexes aligned makes this an expensive operation, and I would vote to have that fail by default (ref: #5499 (reply in thread)).

Can the input files be aligned before attempting to combine the data? Or are you not in control of the input files?

To debug the memory, you probably need to use something like memory_profiler and try varying numbers of files. Unfortunately it's a complex problem, and just looking at htop gives very coarse information.
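A minimal sketch of that kind of measurement, assuming memory_profiler is installed and reusing the file list a from the report; the helper function and the subset sizes are illustrative:

from memory_profiler import memory_usage
import xarray as xr

def load_and_compute(files):
    # Open a subset of the files and force ubar into memory.
    ds = xr.open_mfdataset(files, combine="by_coords", data_vars="minimal", parallel=True)
    return ds.ubar.compute()

# Peak memory (in MiB) for increasing numbers of input files.
for n in (2, 4, 8, 16):
    peak = memory_usage((load_and_compute, (a[:n],)), max_usage=True)
    print(n, "files ->", peak, "MiB")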

@anonymousForPeer

anonymousForPeer commented Jul 21, 2021

Hi there, I have a very similar problem, and rather than open another issue I would like to share my example here:

Minimal Complete Verifiable Example:

This little computation uses >500 MB of memory even though the file itself is only 154 MB:

with xr.open_dataset(climdata + 'tavg_subset.nc', chunks={"latitude": 300, "longitude": 300}) as ds:
    print(ds)
    annualMean = ds.tavg.resample(time="1Y").mean('time', keep_attrs=True)
    annualMean.to_netcdf("outputMean.nc", format="NETCDF4_CLASSIC", engine="netcdf4")

The print(ds) output:

<xarray.Dataset>
Dimensions:    (latitude: 168, longitude: 664, time: 731)
Coordinates:
  * time       (time) datetime64[ns] 1971-01-01 1971-01-02 ... 1972-12-31
  * longitude  (longitude) float64 20.27 20.3 20.33 20.36 ... 40.92 40.95 40.98
  * latitude   (latitude) float64 40.23 40.2 40.17 40.14 ... 35.08 35.05 35.02
Data variables:
    tavg       (time, latitude, longitude) float32 dask.array<chunksize=(731, 168, 300), meta=np.ndarray>

My problem is that the original files are each >120 GB in size, and I run into out-of-memory errors on our HPC (requesting 10 CPUs with 16 GB each).

I thought xarray processed everything in chunks precisely so as not to overuse memory, but something seems really wrong here!?
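For scale, the in-memory size of tavg implied by the repr above is roughly twice the 154 MB on-disk size (netCDF files are often compressed on disk), which may account for part of the >500 MB peak. This is plain arithmetic from the shape and float32 dtype shown:

>>> 731 * 168 * 664 * 4 / 1e6   # tavg (time, latitude, longitude), float32 -> MB
326.178048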

@hansukyang

I don't know if this is related, but recent versions of Dask (after 2021.03) have very high memory usage, which I'm not sure is being addressed yet (dask/dask#7583).

@jhamman
Member

jhamman commented Sep 12, 2023

Closing this issue as stale. If others are running into similar issues, please open a new issue.

jhamman closed this as completed Sep 12, 2023