Memory leak while looping through a Dataset #2186
The memory management here is handled by Python, not xarray. Python decides when to perform garbage collection. I know that doesn't help solve your problem...
Yes, I understand the garbage collection. The problem I'm struggling with is that normally, when working with arrays, keeping only one reference to an array and rebinding that reference to new data each loop iteration does not cause memory to accumulate: the prior, now-dereferenced array from the previous iteration is freed as soon as the reference is reassigned. Here, it seems that under the hood, references to the arrays have been created other than my "data" variable, and they are not released when I reassign "data", so stuff accumulates in memory.
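To illustrate the behavior I'd expect with plain numpy (no xarray involved) — this is just a sketch, not from the original report — rebinding the only strong reference frees the old array immediately via CPython reference counting, before any GC pass runs:

```python
import weakref
import numpy as np

data = np.ones((5424, 5424), dtype="float32")   # ~118 MB
ref = weakref.ref(data)

data = np.ones((5424, 5424), dtype="float32")   # rebind the only strong reference
print(ref() is None)  # True: the first array was deallocated immediately
```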
I'm now wondering if this issue is in dask land, based on dask/dask#3247. It has been suggested in other places to work around the memory accumulation by running each loop iteration in a forked process:

```python
import multiprocessing

def worker(ds, k):
    print('accessing data')
    data = ds.datavar[k, :, :].values
    print('data acquired')

for k in range(ds.dims['t']):
    p = multiprocessing.Process(target=worker, args=(ds, k))
    p.start()
    p.join()
```

But apparently one can't access dask-wrapped xarray datasets in subprocesses without a deadlock. I don't know enough about the internals to understand why.
I've discovered that setting the environment variable MALLOC_MMAP_MAX_ to a reasonably small value can partially mitigate this memory fragmentation. Performing 4 iterations over dataset slices of shape ~(5424, 5424) without this tweak yielded >800 MB of memory usage (an increase of ~400 MB over the first iteration). Setting MALLOC_MMAP_MAX_=40960 yielded ~410 MB of memory usage (an increase of only ~130 MB over the first iteration). This level of fragmentation is still offensive, but it does suggest the problem may lie deeper in the Unix/glibc/Python/xarray/dask ecosystem.
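For anyone wanting to try this, a minimal sketch of applying the tweak: glibc reads MALLOC_MMAP_MAX_ at allocator initialization, so it must be in the environment before the interpreter starts (`process_data.py` is a hypothetical stand-in for your processing script):

```python
import os
import subprocess
import sys

# Launch the processing script in a child process whose glibc allocator
# caps the number of mmap-backed allocations, reducing fragmentation.
env = dict(os.environ, MALLOC_MMAP_MAX_="40960")
subprocess.run([sys.executable, "process_data.py"], env=env, check=True)
```

Equivalently, from a shell: MALLOC_MMAP_MAX_=40960 python process_data.py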
I might try experimenting with setting […]. Memory growth with […]
Using […] Thanks for the explanation of […]
@meridionaljet I might've run into the same issue, but I'm not 100% sure. In my case I'm looping over a Dataset containing variables from 3 different files, all of them with a […]

Can you see what happens when using the distributed client (see the sketch after this comment)? Put […]

Also, for me the memory behaviour looks very different between the threaded and multi-process scheduler, although they both leak (I'm not sure if leaking is the right term here). Maybe you can try […]

I've tried without success:
[…]
For my messy and very much work-in-progress code, look here: https://github.com/Karel-van-de-Plassche/QLKNN-develop/blob/master/qlknn/dataset/hypercube_to_pandas.py
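A minimal sketch of what "using the distributed client" could look like here (file pattern and variable name are placeholders, not from the original report; the dashboard link the Client prints shows per-worker memory, which helps tell real leaks from allocator fragmentation):

```python
import xarray as xr
from dask.distributed import Client

# Starting a Client makes dask.distributed the default scheduler.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GB")

ds = xr.open_mfdataset("data.*.nc", chunks={"t": 1})  # hypothetical files
for k in range(ds.dims["t"]):
    data = ds["data"][k, :, :].values  # computed on the cluster workers
    # ... plot / process `data` ...
client.close()
```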
This might be the same issue as dask/dask#3530.
In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree.
Hey folks, I ran into a similar memory leak issue. In my case I had the following:
[…]

For some reason (maybe having to do with the […]) …
For what it's worth, the recommended way to do this is to explicitly close the Dataset with `ds.close()`, or with a context manager, e.g.,

```python
for num in range(100):
    with xr.open_dataset('data.{}.nc'.format(num)) as ds:
        # do some stuff, but do NOT assign any data in ds to new variables
        ...
```
I just stumbled across the same issue and created a minimal example similar to @lkilcher's. I am using […]

What seems to work: do not use the […]

If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. Not sure if there is anything to be fixed on the xarray side, or what the best workaround would be; I will try to use the processes scheduler (see the sketch below). I can create a new (xarray) ticket with all details about the minimal example, if anyone thinks that might be helpful (to collect workarounds or discuss fixes on the xarray side).
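A minimal sketch of switching dask to the processes scheduler, one of the workarounds mentioned above (file pattern and variable name are placeholders; `dask.config.set` is the current API, older dask versions used `dask.set_options`):

```python
import dask
import xarray as xr

ds = xr.open_mfdataset("data.*.nc", chunks={"t": 1})  # hypothetical files

# Run dask graphs in worker processes instead of threads, so allocator
# growth happens in short-lived children rather than the main interpreter.
with dask.config.set(scheduler="processes"):
    for k in range(ds.dims["t"]):
        data = ds["data"][k, :, :].values
        # ... process `data` ...
```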
Hello, I am facing the same memory leak issue. I am using […]

I have tried the following:
[…]

None of the solutions worked for me. I have also tried increasing RAM, but that didn't help either. I was wondering if anyone has found a workaround for this problem. I am using […]
For what it is worth, since this issue is quite old: I also noticed high memory usage for netCDF files containing many (2410) variables. The memory usage is way lower with […]
I'm encountering a detrimental memory leak when simply accessing data from a Dataset repeatedly within a loop. I'm opening netCDF files concatenated in time and looping through time to create plots. In this case the x-y slices are about 5000 x 5000 in size.
I tried explicitly dereferencing the array by calling `del data` at the end of each iteration, which reduces the memory growth a little bit, but not much.

Strangely, in this simplified example I can greatly reduce the memory growth by using much smaller chunk sizes, but in my real-world example, opening all the data with smaller chunk sizes does not mitigate the problem. Either way, it's not clear to me why memory usage should grow at all, for any chunk size.
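A minimal reconstruction of the loop in question, with the explicit cleanup described above (file pattern and variable name are placeholders; the chunk sizes match the Dataset repr further down):

```python
import gc
import xarray as xr

ds = xr.open_mfdataset("data.*.nc", chunks={"t": 1, "x": 4000, "y": 4000})

for k in range(ds.dims["t"]):
    data = ds["data"][k, :, :].values  # load one ~5424x5424 slice
    # ... make a plot from `data` ...
    del data      # drop the only reference to the slice
    gc.collect()  # force a collection pass; in practice this only
                  # slightly reduced the observed growth
```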
I can also generate memory growth when cutting dask out entirely with `open_dataset(chunks=None)` and simply looping through different variables in the Dataset:

[…]

You can see that, strangely, the growth stops after several iterations. This isn't always true, though; sometimes it asymptotes for a few iterations and then begins growing again.
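A sketch of the dask-free version of the loop (hypothetical file name; the variable names come from whatever is in the file):

```python
import xarray as xr

ds = xr.open_dataset("data.nc")  # chunks=None: plain numpy arrays, no dask

# Loading each variable in turn still shows resident memory growing across
# iterations, even though only one variable is referenced at a time.
for name in ds.data_vars:
    values = ds[name].values
    del values
```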
I feel like I'm missing something fundamental about xarray memory management. It seems like a great impediment that arrays (or something) read from a Dataset are not garbage collected while looping through that Dataset, which kind of defeats the purpose of only accessing and working with the data you need in the first place. I have to access rather large chunks of data at a time, so being able to discard that slice of data and move onto the next one without filling up the RAM is a big deal.
Any ideas what's going on? Or what I'm missing?
<xarray.Dataset>
Dimensions: (band: 1, number_of_image_bounds: 2, number_of_time_bounds: 2, t: 4, x: 5424, y: 5424)
Coordinates:
Data variables:
data (t, y, x) float32 dask.array<shape=(4, 5424, 5424), chunksize=(1, 4000, 4000)>
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.16.8-300.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.4
pandas: 0.22.0
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.4
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: None
setuptools: 39.1.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.4.0
sphinx: None