Memory leak while looping through a Dataset #2186

Closed
meridionaljet opened this issue May 25, 2018 · 14 comments

@meridionaljet

meridionaljet commented May 25, 2018

I'm encountering a severe memory leak when simply accessing data from a Dataset repeatedly within a loop. I'm opening netCDF files concatenated in time and looping through time to create plots. In this case the x-y slices are about 5000 x 5000 in size.

import xarray as xr
import os, psutil

process = psutil.Process(os.getpid())  # track this process's resident memory

ds = xr.open_mfdataset('*.nc', chunks={'x': 4000, 'y': 4000}, concat_dim='t')
for k in range(ds.dims['t']):
    data = ds.datavar[k, :, :].values  # pull one time slice into memory as NumPy
    print('memory=', process.memory_info().rss)

>>> memory= 566484992
>>> memory= 823836672
>>> memory= 951439360
>>> memory= 1039261696

I tried explicitly dereferencing the array by calling del data at the end of each iteration, which reduces the memory growth a little bit, but not much.
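Concretely, that variant looks like this (same ds and process as in the snippet above):

for k in range(ds.dims['t']):
    data = ds.datavar[k, :, :].values
    # ... make the plot ...
    del data  # explicitly drop the only reference I hold before the next iteration
    print('memory=', process.memory_info().rss)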

Strangely, in this simplified example I can greatly reduce the memory growth by using much smaller chunk sizes, but in my real-world example, opening all data with smaller chunk sizes does not mitigate the problem. Either way, it's not clear to me why the memory usage should grow for any chunk size at all.

ds = xr.open_mfdataset('*.nc', chunks={'x': 1000, 'y': 1000}, concat_dim='t')
>>> memory= 514043904
>>> memory= 499363840
>>> memory= 502509568
>>> memory= 522133504

I can also generate memory growth when cutting dask out entirely with open_dataset(chunks=None) and simply looping through different variables in the Dataset:

ds = xr.open_dataset('data.nc', chunks=None) # x-y dataset 5424 x 5424
for var in ['var1', 'var2', ... , 'var15']:
    data = ds[var].values
    print('memory =', process.memory_info().rss)

>>> memory = 246087680
>>> memory = 280604672
>>> memory = 285810688
>>> memory = 315834368
>>> memory = 344510464
>>> memory = 374530048
>>> memory = 403742720
>>> memory = 403804160
>>> memory = 404140032
>>> memory = 403660800
>>> memory = 404262912
>>> memory = 403513344
>>> memory = 404115456
>>> memory = 403636224

Though you can see that, strangely, the growth stops after several iterations. This isn't always true. Sometimes it asymptotes for a few iterations and then begins growing again.

I feel like I'm missing something fundamental about xarray memory management. It seems like a great impediment that arrays (or something) read from a Dataset are not garbage collected while looping through that Dataset, which kind of defeats the purpose of only accessing and working with the data you need in the first place. I have to access rather large chunks of data at a time, so being able to discard one slice of data and move on to the next one without filling up the RAM is a big deal.

Any ideas what's going on? Or what I'm missing?

print(ds) # from open_mfdataset()

<xarray.Dataset>
Dimensions:  (band: 1, number_of_image_bounds: 2, number_of_time_bounds: 2, t: 4, x: 5424, y: 5424)
Coordinates:
  * y        (y) float32 0.151844 0.151788 ...
  * x        (x) float32 -0.151844 -0.151788 ...
  * t        (t) datetime64[ns] 2018-05-25T00:36:02.796268032 ...
Data variables:
    data     (t, y, x) float32 dask.array<shape=(4, 5424, 5424), chunksize=(1, 4000, 4000)>

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.16.8-300.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.4
pandas: 0.22.0
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.5.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.4
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: None
setuptools: 39.1.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.4.0
sphinx: None

@rabernat
Contributor

The memory management here is handled by Python, not xarray. Python decides when to perform garbage collection. I know that doesn't help solve your problem...

@meridionaljet
Author

meridionaljet commented May 25, 2018

Yes, I understand the garbage collection. The problem I'm struggling with is that normally, when working with arrays, keeping only one reference to an array and reassigning it inside a loop does not cause memory to accumulate, because the previous, now-dereferenced array is collected before the next iteration.

Here, it seems that under the hood, references to the arrays exist other than my "data" variable, and they are not released when I reassign "data", so memory keeps accumulating.

@meridionaljet
Author

I'm now wondering if this issue is in dask land, based on this issue: dask/dask#3247

It has been suggested in other places to get around the memory accumulation by running each loop iteration in a forked process:

import multiprocessing

def worker(ds, k):
    print('accessing data')
    data = ds.datavar[k, :, :].values
    print('data acquired')

for k in range(ds.dims['t']):
    p = multiprocessing.Process(target=worker, args=(ds, k))
    p.start()
    p.join()

But apparently one can't access dask-wrapped xarray datasets in subprocesses without a deadlock. I don't know enough about the internals to understand why.
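One variant I've seen suggested (I haven't verified that it avoids the deadlock) is to open the dataset inside the worker process instead of passing the parent's dask-backed object, so no open netCDF/HDF5 handles are shared across the fork:

import multiprocessing
import xarray as xr

def worker(k):
    # open inside the child so no file handles are shared with the parent
    with xr.open_mfdataset('*.nc', chunks={'x': 4000, 'y': 4000}, concat_dim='t') as ds:
        data = ds.datavar[k, :, :].values
        print('data acquired for step', k)

if __name__ == '__main__':
    n_steps = 4  # length of the 't' dimension, known ahead of time here
    for k in range(n_steps):
        p = multiprocessing.Process(target=worker, args=(k,))
        p.start()
        p.join()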

@meridionaljet
Author

I've discovered that setting the environment variable MALLOC_MMAP_MAX_ to a reasonably small value can partially mitigate this memory fragmentation.

Performing 4 iterations over dataset slices of shape ~(5424, 5424) without this tweak was yielding >800MB of memory usage (an increase of ~400MB over the first iteration).

Setting MALLOC_MMAP_MAX_=40960 yielded ~410 MB of memory usage (an increase of only ~130MB over the first iteration).

This level of fragmentation is still offensive, but it does suggest the problem may lie deeper in the Unix / glibc / Python / xarray / dask ecosystem.
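For anyone who wants to try this: as far as I can tell, glibc reads the tunable when the process starts, so it has to be in the environment before Python launches, e.g. MALLOC_MMAP_MAX_=40960 python process_loop.py from the shell, or via a small launcher like the sketch below ('process_loop.py' is a placeholder for whatever script does the real work):

import os
import subprocess
import sys

# Set the glibc malloc tunable, then spawn the actual processing script so its
# allocator picks the setting up at startup.
env = dict(os.environ, MALLOC_MMAP_MAX_='40960')
subprocess.run([sys.executable, 'process_loop.py'], env=env, check=True)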

@shoyer
Member

shoyer commented May 26, 2018

I might try experimenting with setting autoclose=True in open_mfdataset(). It's a bit of a shot in the dark, but it might help here.

Memory growth with xr.open_dataset('data.nc', chunks=None) is expected, because by default we set cache=True when not using dask. This means that variables get cached in memory as NumPy arrays.
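If you want to keep chunks=None but avoid that caching, passing cache=False should help (a minimal sketch):

import xarray as xr

# Keep chunks=None but disable the in-memory NumPy cache:
ds = xr.open_dataset('data.nc', cache=False)
data = ds['var1'].values  # loaded for this access, but not retained by the Dataset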

@meridionaljet
Author

Using autoclose=True doesn't seem to make a difference. My test only uses 4 files anyway.

Thanks for the explanation of open_dataset() - that makes sense.

@Karel-van-de-Plassche
Contributor

@meridionaljet I might've run into the same issue, but I'm not 100% sure. In my case I'm looping over a Dataset containing variables from 3 different files, all of them with a .sel and some of them with a more complicated (dask) calculation (still, mostly sums and divisions). The leak seems to happen mostly for those with the calculation.

Can you see what happens when using the distributed client? Put client = dask.distributed.Client() in front of your code. This leads to many distributed.utils_perf - WARNING - full garbage collections took 40% CPU time recently (threshold: 10%) messages being shown for me, indeed pointing to something garbage-collecty.

Also, for me the memory behaviour looks very different between the threaded and multi-process scheduler, although they both leak. (I'm not sure if leaking is the right term here). Maybe you can try memory_profiler?
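For the memory_profiler suggestion, something along these lines (a rough sketch; the file pattern, chunking and 'datavar' name just follow the examples above):

import xarray as xr
from memory_profiler import profile  # pip install memory_profiler

@profile  # prints per-line memory usage each time the function is called
def loop_over_time(pattern):
    ds = xr.open_mfdataset(pattern, chunks={'x': 4000, 'y': 4000}, concat_dim='t')
    for k in range(ds.dims['t']):
        data = ds.datavar[k, :, :].values
        del data
    ds.close()

if __name__ == '__main__':
    loop_over_time('*.nc')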

I've tried the following without success:

  • explicitly deleting ds[varname] and running gc.collect()
  • explicitly clearing dask cache with client.cancel and client.restart
  • Moving the leaky code into its own function (this should not matter, but I seem to remember it sometimes helps garbage collection in edge cases)
  • Explicitly triggering computation with either dask persist or xarray load and then explicitly deleting the result

For my messy and very much work-in-progress code, look here: https://github.com/Karel-van-de-Plassche/QLKNN-develop/blob/master/qlknn/dataset/hypercube_to_pandas.py

@shoyer
Member

shoyer commented Jun 1, 2018

This might be the same issue as dask/dask#3530

@max-sixty
Collaborator

In an effort to reduce the issue backlog, I'll close this, but please reopen if you disagree

@lkilcher

Hey folks, I ran into a similar memory leak issue. In my case I had the following:

for num in range(100):
    ds = xr.open_dataset('data.{}.nc'.format(num)) # This data was compressed with zlib, not sure if that matters

    # do some stuff, but NOT assigning any data in ds to new variables

    del ds

For some reason (maybe having to do with the # do some stuff), ds wasn't actually getting cleared. I was able to fix the problem by manually triggering garbage collection (import gc, and gc.collect() after the del ds statement). Perhaps this will help others who end up here...
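In other words, the loop that worked for me looks roughly like this:

import gc
import xarray as xr

for num in range(100):
    ds = xr.open_dataset('data.{}.nc'.format(num))
    # do some stuff, but NOT assigning any data in ds to new variables
    del ds
    gc.collect()  # manually trigger collection; this is what stopped the growth for me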

@shoyer
Member

shoyer commented Feb 10, 2022

For what it's worth, the recommended way to do this is to explicitly close the Dataset with ds.close() rather than using del ds.

Or with a context manager, e.g.,

for num in range(100):
    with xr.open_dataset('data.{}.nc'.format(num)) as ds:
        # do some stuff, but NOT assigning any data in ds to new variables
        ...

@lumbric
Contributor

lumbric commented Feb 21, 2022

I just stumbled across the same issue and created a minimal example similar to @lkilcher's. I am using xr.open_dataarray() with chunks and doing some simple computation. After that, 800 MB of RAM is used, no matter whether I close the file explicitly, delete the xarray objects, or invoke the Python garbage collector.

What seems to work: do not use the threaded Dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Also, setting MALLOC_MMAP_MAX_=40960 seems to solve the issue, as suggested above (disclaimer: I don't fully understand the details here).
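For reference, this is roughly how I switch schedulers (the file name and chunking here are placeholders):

import dask
import xarray as xr

da = xr.open_dataarray('data.nc', chunks={'x': 1000, 'y': 1000})

# Run the computation without the threaded scheduler:
with dask.config.set(scheduler='synchronous'):  # or scheduler='processes'
    result = da.sum().compute()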

If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. I'm not sure if there is anything to be fixed on the xarray side or what the best workaround would be. I will try the processes scheduler.

I can create a new (xarray) ticket with all details about the minimal example if anyone thinks that might be helpful (to collect workarounds or discuss fixes on the xarray side).

@hmkhatri

hmkhatri commented Mar 8, 2022

Hello,

I am facing the same memory leak issue. I am using mpirun and dask-mpi on a Slurm batch submission (see below). I am running through a time loop to perform some computations. After a few iterations, the code blows up because of an out-of-memory issue. This does not happen if I execute the same code as a serial job.

from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()

# main code goes here
import xarray as xr

ds = xr.open_mfdataset("*nc")

for i in range(0, len(ds.time)):
    ds1 = ds.isel(time=i)
    # perform some computations here

    ds1.close()
ds.close()

I have tried the following

  • explicit ds.close() calls on datasets
  • gc.collect()
  • client.cancel(vars)

None of these worked for me. I have also tried increasing RAM, but that didn't help either. I was wondering if anyone has found a workaround for this problem.
@lumbric @shoyer @lkilcher

I am using dask 2022.2.0, dask-mpi 2021.11.0, and xarray 0.21.1.

@veenstrajelmer
Contributor

veenstrajelmer commented Aug 21, 2024

For what it is worth, since this issue is quite old: I also noticed high memory usage for netCDF files containing many (2410) variables. The memory usage is much lower with engine="h5netcdf", but that engine is also much slower for these particular datasets with many variables. It seems that the number of variables really matters: when I open netCDF files that are 10 times larger (35 GB vs 3 GB) but contain far fewer variables, the memory usage is much lower with the default engine="netcdf4". I am not sure whether your netCDF files also contain many variables, but if so, you might want to test this.
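For anyone who wants to compare, the engine is just a keyword argument to open_dataset (a sketch; 'many_vars.nc' is a placeholder):

import xarray as xr

# Default engine (netCDF4-python): fast to open here, but used far more memory
# for files with thousands of variables in my case.
ds_nc4 = xr.open_dataset('many_vars.nc', engine='netcdf4')

# h5netcdf engine: much lower memory usage for those files, but noticeably slower.
ds_h5 = xr.open_dataset('many_vars.nc', engine='h5netcdf')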
