
Extremely Large Memory usage for a very small variable #5604

Closed
tommy307507 opened this issue Jul 15, 2021 · 15 comments

Comments

@tommy307507

What happened:
A variable that takes up very little memory in terms of actual data size uses over 1000x that memory.
What you expected to happen:
It should take memory on the order of the data size.
Minimal Complete Verifiable Example:

>>> import xarray as xr
>>>a  = ['./CLM/u_2021-06-01T00_regridded.nc', './CLM/u_2021-06-01T03_regridded.nc', './CLM/u_2021-06-01T06_regridded.nc', './CLM/u_2021-06-01T09_regridded.nc', './CLM/u_2021-06-01T12_regridded.nc', './CLM/u_2021-06-01T15_regridded.nc', './CLM/u_2021-06-01T18_regridded.nc', './CLM/u_2021-06-01T21_regridded.nc', './CLM/u_2021-06-02T00_regridded.nc', './CLM/u_2021-06-02T03_regridded.nc', './CLM/u_2021-06-02T06_regridded.nc', './CLM/u_2021-06-02T09_regridded.nc', './CLM/u_2021-06-02T12_regridded.nc', './CLM/u_2021-06-02T15_regridded.nc', './CLM/u_2021-06-02T18_regridded.nc', './CLM/u_2021-06-02T21_regridded.nc', './CLM/u_2021-06-03T00_regridded.nc', './CLM/u_2021-06-03T03_regridded.nc', './CLM/u_2021-06-03T06_regridded.nc', './CLM/u_2021-06-03T09_regridded.nc', './CLM/u_2021-06-03T12_regridded.nc', './CLM/u_2021-06-03T15_regridded.nc', './CLM/u_2021-06-03T18_regridded.nc', './CLM/u_2021-06-03T21_regridded.nc', './CLM/u_2021-06-04T00_regridded.nc', './CLM/u_2021-06-04T03_regridded.nc', './CLM/u_2021-06-04T06_regridded.nc', './CLM/u_2021-06-04T09_regridded.nc', './CLM/u_2021-06-04T12_regridded.nc', './CLM/u_2021-06-04T15_regridded.nc', './CLM/u_2021-06-04T18_regridded.nc', './CLM/u_2021-06-04T21_regridded.nc', './CLM/u_2021-06-05T00_regridded.nc', './CLM/u_2021-06-05T03_regridded.nc', './CLM/u_2021-06-05T06_regridded.nc', './CLM/u_2021-06-05T09_regridded.nc', './CLM/u_2021-06-05T12_regridded.nc', './CLM/u_2021-06-05T15_regridded.nc', './CLM/u_2021-06-05T18_regridded.nc', './CLM/u_2021-06-05T21_regridded.nc', './CLM/u_2021-06-06T00_regridded.nc', './CLM/u_2021-06-06T03_regridded.nc', './CLM/u_2021-06-06T06_regridded.nc', './CLM/u_2021-06-06T09_regridded.nc', './CLM/u_2021-06-06T12_regridded.nc', './CLM/u_2021-06-06T15_regridded.nc', './CLM/u_2021-06-06T18_regridded.nc', './CLM/u_2021-06-06T21_regridded.nc', './CLM/u_2021-06-07T00_regridded.nc', './CLM/u_2021-06-07T03_regridded.nc', './CLM/u_2021-06-07T06_regridded.nc', './CLM/u_2021-06-07T09_regridded.nc', './CLM/u_2021-06-07T12_regridded.nc', './CLM/u_2021-06-07T15_regridded.nc', './CLM/u_2021-06-07T18_regridded.nc', './CLM/u_2021-06-07T21_regridded.nc', './CLM/u_2021-06-08T00_regridded.nc', './CLM/u_2021-06-08T03_regridded.nc', './CLM/u_2021-06-08T06_regridded.nc']
>>> u_file =  xr.open_mfdataset(a,data_vars='minimal',combine="by_coords",parallel=True,chunks={'eu': 1100, 'xu': 1249,'v2d_time' : 1 , 'v3d_time' : 1 , "s_rho" : 1},autoclose=True)
>>> u_file
<xarray.Dataset>
Dimensions:   (eu: 1100, s_rho: 35, v2d_time: 59, v3d_time: 59, xu: 1249)
Coordinates:
  * v2d_time  (v2d_time) timedelta64[ns] 59366 days 00:00:00 ... 59373 days 0...
  * v3d_time  (v3d_time) timedelta64[ns] 59366 days 00:00:00 ... 59373 days 0...
Dimensions without coordinates: eu, s_rho, xu
Data variables:
    u         (v3d_time, s_rho, eu, xu) float64 dask.array<chunksize=(1, 1, 1100, 1249), meta=np.ndarray>
    ubar      (v2d_time, eu, xu) float64 dask.array<chunksize=(59, 1100, 1249), meta=np.ndarray>
>>> u_file.ubar.shape
(59, 1100, 1249)
>>> u_file.ubar.data.shape
(59, 1100, 1249)
>>> u_file.ubar.data.nbytes
648480800
>>>
KeyboardInterrupt
>>> u_file.ubar.data
dask.array<where, shape=(59, 1100, 1249), dtype=float64, chunksize=(59, 1100, 1249), chunktype=numpy.ndarray>
>>> u_file.ubar.data.compute()

[image: output of the Linux top command after executing the line with compute()]

Anything else we need to know?:
The variable u can be written to disk with about 22 GB of memory usage, which is the expected behaviour, as the variable has about 22 GB of data stored across those files combined. In fact, since u is 35x the size of ubar (the extra s_rho dimension has length 35), the full u_file dataset should only be about 22 to 23 GB.
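For reference, the in-memory sizes implied by the shapes and float64 dtype shown in the repr above (plain arithmetic, nothing assumed beyond those numbers):

>>> 59 * 35 * 1100 * 1249 * 8 / 1e9   # u (v3d_time, s_rho, eu, xu), float64 -> GB
22.696828
>>> 59 * 1100 * 1249 * 8 / 1e9        # ubar (v2d_time, eu, xu), float64 -> GB
0.6484808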
Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.9.0
iris: None
bottleneck: None
dask: 2021.04.0
distributed: 2021.04.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.3
conda: None
pytest: None
IPython: 7.24.1
sphinx: None

@tommy307507
Author

The variables can be combined with xr.concat if I open the individual files using xr.open_dataset, and that takes only about 1.1 GB of memory, so I think the issue is somewhere inside open_mfdataset.
I also don't understand how the chunksize of v2d_time is 59 instead of 1

@max-sixty
Collaborator

This will likely need much more detail. Though to start: what's the source of the 1000x number?
What happens if you pass compat="identical", coords="minimal" to open_mfdataset? If that fails, the opening operation may be doing some expensive alignment.
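For concreteness, a minimal sketch of what that suggestion looks like, reusing the file list a and the chunks from the report above:

>>> u_file = xr.open_mfdataset(
...     a,
...     combine="by_coords",
...     data_vars="minimal",
...     coords="minimal",
...     compat="identical",
...     parallel=True,
...     chunks={'eu': 1100, 'xu': 1249, 'v2d_time': 1, 'v3d_time': 1, 's_rho': 1},
... )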

@TomNicholas
Contributor

I also don't understand how the chunksize of v2d_time is 59 instead of 1

Is v2d_time one of the dimensions being concatenated along by open_mfdataset?

@tommy307507
Author

I also don't understand how the chunksize of v2d_time is 59 instead of 1

Is v2d_time one of the dimensions being concatenated along by open_mfdataset?

Yes, I will try the above tomorrow and post back here.
I did try passing concat_dim=["v2d_time", "v3d_time"], but that still causes the problem.

@TomNicholas
Contributor

TomNicholas commented Jul 15, 2021 via email:

An example which we can reproduce locally would be the most helpful, if possible!

@tommy307507
Author

My temporary workaround is to call open_dataset on each of the files, store u and ubar in two separate lists, run xr.concat on each list, and save the result to file.
They concatenate just fine, the output file is about the expected size of 23 GB, and the operation uses a similar amount of memory.
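A rough sketch of that workaround, reusing the file list a from above (the chunk sizes, intermediate names, and output filename here are illustrative, not from the original comment):

>>> datasets = [xr.open_dataset(f, chunks={'eu': 1100, 'xu': 1249, 's_rho': 1}) for f in a]
>>> u = xr.concat([ds.u for ds in datasets], dim="v3d_time")
>>> ubar = xr.concat([ds.ubar for ds in datasets], dim="v2d_time")
>>> xr.Dataset({"u": u, "ubar": ubar}).to_netcdf("u_combined.nc")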

@tommy307507
Author

An example which we can reproduce locally would be the most helpful, if possible!


Thanks for your quick reply, but I am not at work right now as it's 1am over here.
I might test where the limit of this kicks in tomorrow. I am trying to merge 59 files right now, so I might try fewer files to find the lower limit, since passing 20 GB of files around would be quite hard.

@tommy307507
Author

tommy307507 commented Jul 16, 2021

This will likely need much more detail. Though to start: what's the source of the 1000x number?
What happens if you pass compat="identical", coords="minimal" to open_mfdataset? If that fails, the opening operation may be doing some expensive alignment.

Trying this gives me "conflicting values for variable 'ubar' on objects to be combined." Actually that makes sense, as identical requires the values to be the same, right?

@tommy307507
Author

For ubar it says
dask.array<where, shape=(59, 1100, 1249), dtype=float64, chunksize=(59, 1100, 1249), chunktype=numpy.ndarray>
but for u it says
dask.array<concatenate, shape=(59, 35, 1100, 1249), dtype=float64, chunksize=(1, 1, 1100, 1249), chunktype=numpy.ndarray>
Those are very different operations. Is that the reason for the 1000 GB consumption?

@max-sixty
Collaborator

Again — where are you seeing this 1000GB or 1000x number?

(also have a look at GitHub docs on how to format the code)

@tommy307507
Author

tommy307507 commented Jul 16, 2021

Again — where are you seeing this 1000GB or 1000x number?

(also have a look at GitHub docs on how to format the code)

Sorry, I think the 1000x figure was a confusion on my part from misreading the numbers or misunderstanding how memory units work, but I will explain it again.
In the top output, the process draws all 100 GiB of memory and then starts using swap, to the point that the system automatically kills the code. The ubar variable should only need 59 * 1100 * 1249 * 8 = 648,480,800 bytes of memory, which is only about 0.65 GB. However, top shows it using 92.5 GiB of memory plus all 16 GiB of swap, so the memory actually drawn by the program is about 109 GiB (because that is all that is available before it gets automatically killed), which is in fact only about 168x what is really needed.
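For the unit conversion (plain arithmetic on the nbytes value printed earlier; top's figures are binary units):

>>> round(648_480_800 / 1e9, 3)      # decimal gigabytes (GB)
0.648
>>> round(648_480_800 / 2**30, 3)    # binary gibibytes (GiB)
0.604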

@max-sixty
Collaborator

The memory usage does seem high. Not having the indexes aligned makes this an expensive operation, and I would vote to have that fail by default (ref: #5499 (reply in thread)).

Can the input files be aligned before attempting to combine the data? Or are you not in control of the input files?

To debug the memory, you probably need to use something like memory_profiler and try varying numbers of files. Unfortunately it's a complex problem, and just looking at htop gives very coarse information.
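A minimal sketch of that kind of measurement, assuming memory_profiler is installed and reusing the file list a from the report; the helper function and the subset sizes are illustrative:

from memory_profiler import memory_usage
import xarray as xr

def load_and_compute(files):
    # Open a subset of the files and force ubar into memory.
    ds = xr.open_mfdataset(files, combine="by_coords", data_vars="minimal", parallel=True)
    return ds.ubar.compute()

# Peak memory (in MiB) for increasing numbers of input files.
for n in (2, 4, 8, 16):
    peak = memory_usage((load_and_compute, (a[:n],)), max_usage=True)
    print(n, "files ->", peak, "MiB")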

@anonymousForPeer

anonymousForPeer commented Jul 21, 2021

Hi there, I have a very similar problem, and rather than open another issue I would like to share my example here:

Minimal Complete Verifiable Example:

This little computation uses >500 MB of memory even though the file itself is only 154 MB:

with xr.open_dataset(climdata + 'tavg_subset.nc', chunks={"latitude": 300, "longitude": 300}) as ds:
    print(ds)
    annualMean = ds.tavg.resample(time="1Y").mean('time', keep_attrs=True)
    annualMean.to_netcdf("outputMean.nc", format="NETCDF4_CLASSIC", engine="netcdf4")

The print(ds) output:

<xarray.Dataset>
Dimensions:    (latitude: 168, longitude: 664, time: 731)
Coordinates:
  * time       (time) datetime64[ns] 1971-01-01 1971-01-02 ... 1972-12-31
  * longitude  (longitude) float64 20.27 20.3 20.33 20.36 ... 40.92 40.95 40.98
  * latitude   (latitude) float64 40.23 40.2 40.17 40.14 ... 35.08 35.05 35.02
Data variables:
    tavg       (time, latitude, longitude) float32 dask.array<chunksize=(731, 168, 300), meta=np.ndarray>

My problem is that the original files are each >120 GB in size, and I run into out-of-memory errors on our HPC (requesting 10 CPUs with 16 GB each).

I thought xarray processed everything in chunks precisely so as not to overuse memory, but something seems really wrong here!?
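For scale, the in-memory size of tavg implied by the repr above is roughly twice the 154 MB on-disk size (netCDF files are often compressed on disk), which may account for part of the >500 MB peak. This is plain arithmetic from the shape and float32 dtype shown:

>>> 731 * 168 * 664 * 4 / 1e6   # tavg (time, latitude, longitude), float32 -> MB
326.178048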

@hansukyang

I don't know if this is related, but recent versions of Dask (after 2021.03) have very high memory usage, which I'm not sure is being addressed yet (dask/dask#7583).

@jhamman
Member

jhamman commented Sep 12, 2023

Closing this issue as stale. If others are running into similar issues, please open a new issue.

jhamman closed this as completed Sep 12, 2023