Extremely Large Memory usage for a very small variable #5604
Comments
The variable can be combined using xr.concat if I open the individual files with xr.open_dataset, and that takes only 1.1 GB of memory. I think the issue is somewhere inside open_mfdataset. |
This will likely need much more detail. Though to start: what's the source of the 1000x number? |
Is v2d_time one of the dimensions being concatenated along by open_mfdataset? |
Yes, I will try the above tomorrow, and post it back here. |
An example which we can reproduce locally would be the most helpful, if possible!
…On Thu, 15 Jul 2021, 12:42 tommy307507 wrote:
> I also don't understand how the chunksize of v2d_time is 59 instead of 1
> I did try to pass concat_dim = ["v2d_time", "v3d_time"] but that still causes the problem |
My temporary workaround is to call open_dataset on all of the files, store u and ubar in two separate lists, run xr.concat on each list, and then save the results to file (see the sketch below). |
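A minimal sketch of that workaround, assuming a hypothetical ocean_his_*.nc file pattern and that u and ubar are concatenated along the v3d_time and v2d_time dimensions mentioned elsewhere in the thread:

```python
import glob
import xarray as xr

u_parts, ubar_parts = [], []
for path in sorted(glob.glob("ocean_his_*.nc")):  # hypothetical file pattern
    ds = xr.open_dataset(path)
    u_parts.append(ds["u"])
    ubar_parts.append(ds["ubar"])

# Concatenate each variable along its own time dimension and write it out
# separately, avoiding open_mfdataset entirely.
xr.concat(u_parts, dim="v3d_time").to_netcdf("u_combined.nc")
xr.concat(ubar_parts, dim="v2d_time").to_netcdf("ubar_combined.nc")
```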
Thanks for your quick reply but I am not at work right now as it's 1am over here |
Trying this gives me "conflicting values for variable 'ubar' on objects to be combined." Actually that makes sense, as identical requires the values to be the same, right? |
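For reference, a hedged sketch of how the compat option changes this behaviour (the file pattern and dimension name are assumptions, not the original call): compat="identical" requires variables shared across files to match exactly, while compat="override" skips the comparison and keeps the copy from the first file.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "ocean_his_*.nc",        # hypothetical file pattern
    combine="nested",
    concat_dim="v3d_time",   # assumed concatenation dimension
    data_vars="minimal",     # only concatenate variables that actually use this dimension
    coords="minimal",
    compat="override",       # take shared variables from the first file instead of comparing them
)
```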
For Ubar it says |
Again — where are you seeing this 1000GB or 1000x number? (also have a look at GitHub docs on how to format the code) |
Sorry, I think the 1000x figure is a confusion on my part, from not reading the numbers correctly or from a poor understanding of how memory units work, but I will explain it again. |
The memory usage does seem high. Not having the indexes aligned makes this an expensive operation, and I would vote to have that fail by default ref (#5499 (reply in thread)). Can the input files be aligned before attempting to combine the data? Or are you not in control of the input files? To debug the memory, you probably need to do something like use a memory profiler. |
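One way to do that, sketched here assuming the default local dask scheduler and placeholder file names, is dask's built-in ResourceProfiler, which samples memory and CPU while the graph executes:

```python
import xarray as xr
from dask.diagnostics import ResourceProfiler, visualize

with ResourceProfiler(dt=0.5) as rprof:          # sample resource usage every 0.5 s
    ds = xr.open_mfdataset("ocean_his_*.nc")     # placeholder file pattern
    ds["ubar"].to_netcdf("ubar_combined.nc")

# rprof.results holds (time, mem, cpu) samples; visualize() renders a plot (requires bokeh)
visualize(rprof)
```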
Hi there, I have a very similar problem, and before I open another issue I would rather share my example here:

Minimal Complete Verifiable Example:

This little computation uses >500 MB of memory even though the file itself is only 154 MB:

```python
with xr.open_dataset(climdata+'tavg_subset.nc', chunks={"latitude": 300, "longitude": 300}) as ds:
    print(ds)
    annualMean = ds.tavg.resample(time="1Y").mean('time', keep_attrs=True)
    annualMean.to_netcdf("outputMean.nc", format="NETCDF4_CLASSIC", engine="netcdf4")
```

Output of print(ds):

```
<xarray.Dataset>
Dimensions:    (latitude: 168, longitude: 664, time: 731)
Coordinates:
  * time       (time) datetime64[ns] 1971-01-01 1971-01-02 ... 1972-12-31
  * longitude  (longitude) float64 20.27 20.3 20.33 20.36 ... 40.92 40.95 40.98
  * latitude   (latitude) float64 40.23 40.2 40.17 40.14 ... 35.08 35.05 35.02
Data variables:
    tavg       (time, latitude, longitude) float32 dask.array<chunksize=(731, 168, 300), meta=np.ndarray>
```

My problem is that the original files are each >120 GB in size and I run into out-of-memory errors on our HPC (asking for 10 CPUs with 16 GB each). I thought xarray processes everything in chunks so as not to overuse memory, but something seems really wrong here!? |
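Not an answer to the underlying question, but a sketch of two knobs that usually bound peak memory for a reduction like this, under the assumption that the chunk layout above is part of the problem: each (731, 168, 300) float32 chunk is roughly 147 MB, so only a few chunks in flight already exceed 500 MB. Smaller spatial chunks and a deferred write keep less in memory at once.

```python
import xarray as xr

# Smaller spatial chunks: a (731, 84, 83) float32 chunk is ~20 MB instead of ~147 MB.
ds = xr.open_dataset("tavg_subset.nc", chunks={"latitude": 84, "longitude": 83})
annualMean = ds.tavg.resample(time="1Y").mean("time", keep_attrs=True)

# compute=False returns a dask Delayed object, so the write can be triggered
# explicitly (for example under a profiler or with a chosen scheduler).
delayed = annualMean.to_netcdf("outputMean.nc", format="NETCDF4_CLASSIC", compute=False)
delayed.compute()
```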
I don't know if this is related, but recent releases of Dask (after version 2021.03) have very large memory usage, which I'm not sure is being addressed yet (dask/dask#7583). |
Closing this issue as stale. If others are running into similar issues, please open a new issue. |
What happened:
A variable that takes up very little memory in actual data size uses over 1000x that memory.
What you expected to happen:
It should take memory on the order of the data size.
Minimal Complete Verifiable Example:
[Image: output of the Linux top command after executing the line with .compute(), showing the memory usage]
Anything else we need to know?:
The variable u can be written to disk with 22 GB of memory usage, which is the expected behaviour, as the variable has about 22 GB of data stored in those files combined. In fact, given that the dimensions of u are 35x those of ubar, the file size of u_file should be only about 22 to 23 GB.
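The original code block is not preserved in this report; the description above corresponds to a pattern roughly like the following hypothetical reconstruction (file pattern, variable names, and output path are assumptions, not the reporter's actual code):

```python
import xarray as xr

ds = xr.open_mfdataset("ocean_his_*.nc")   # hypothetical set of model output files
ds["ubar"].compute()                       # small 2-D variable, unexpectedly huge memory use
ds["u"].to_netcdf("u_file.nc")             # much larger 3-D variable, writes with ~22 GB as expected
```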
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.9.0
iris: None
bottleneck: None
dask: 2021.04.0
distributed: 2021.04.1
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.3
conda: None
pytest: None
IPython: 7.24.1
sphinx: None