Means of zarr arrays cause a memory overload in dask workers #6709

Closed
robin-cls opened this issue Jun 20, 2022 · 17 comments
Labels: topic-dask, topic-zarr (Related to zarr storage library), upstream issue

Comments

@robin-cls commented Jun 20, 2022

What is your issue?

Hello everyone!

I am submitting this issue here, but it is not entirely clear whether my problem comes from xarray, dask, or zarr.

The goal here is to compute means from the GCM anomalies of SSH. The following simple code creates an artificial dataset with the anomaly fields (each variable is about 90 GB) and computes the means of the cross-products.

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()

I was expecting low memory usage, because once a single chunk of anom_u and anom_v has been used for a mean iteration, those two chunks can be forgotten. The following figure confirms that memory usage stays very low, so all is well.

[figure: dashboard showing worker memory usage staying low]
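
(For reference, the same cross-product means can also be written with xarray's own reductions instead of np.mean on the underlying dask arrays; a minimal, equivalent sketch that stays just as lazy:)

# Equivalent formulation using xarray reductions over the "time" dimension;
# coordinate handling stays inside xarray and nothing is computed until .compute().
ds["anom_uu_mean"] = (ds.anom_u ** 2).mean("time")
ds["anom_vv_mean"] = (ds.anom_v ** 2).mean("time")
ds["anom_uv_mean"] = (ds.anom_u * ds.anom_v).mean("time")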

The matter becomes more complicated when the dataset is opened from a zarr store. We simply dumped the previously generated artificial data to a temporary store and reloaded it:

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.ones((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

store = "/work/scratch/test_zarr_graph"
ds.to_zarr(store, compute=False, mode="a")
ds = xr.open_zarr(store)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()

[figure: dashboard showing worker memory overload when computing from the zarr store]

I was expecting similar behavior between a dataset created from scratch and one created from a zarr store, but that does not seem to be the case. I tried using inline_array=True with xr.open_dataset, but to no avail. I also tried computing 2 variables instead of 3, and that works properly, so the behavior seems strange to me.

Do you see any reason why I am seeing such a memory load on my workers?

Here are the software versions I use:
xarray version: 2022.6.0rc0
dask version: 2022.04.1
zarr version: 2.11.1
numpy version: 1.21.6
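
(As an aside, the same information can be printed in one go with xarray's built-in helper; a small sketch:)

import xarray as xr

xr.show_versions()  # prints the installed xarray, dask, zarr, numpy, and related library versions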

robin-cls added the "needs triage" (Issue that has not been reviewed by xarray team member) label Jun 20, 2022
TomNicholas added the "topic-dask" and "topic-zarr" (Related to zarr storage library) labels and removed "needs triage" Jun 21, 2022
@TomNicholas (Contributor)

Hi @robin-cls - can we see the dask graphs (or a representative subset of them) for these two cases?

@robin-cls (Author) commented Jun 22, 2022

Hi @TomNicholas

I've reduced the original dataset to 11 chunks over the time dimension so that we can see the graph properly. I also replaced the .compute operation with a to_zarr(compute=False) call, because I don't know how to visualize xarray operations without generating a Delayed object (comments are welcome on this point!).

Anyway, here are the files. The first one is the graph where the means are built from da.ones arrays:

[attachment: graph_no_zarr_source — task graph for the da.ones case]

The second one is the graph where the means are built from the same arrays, but opened from a zarr store:

[attachment: graph_zarr_source — task graph for the zarr-backed case]

I am quite a newbie at debugging dask graphs, but everything seems OK in the second graph, apart from the open_dataset tasks that are linked to a parent task. Also, I noticed that dask has fused the 'ones' operations in the first graph. Would it help if I generated other arrays with zeros instead?

@dcherian (Contributor)

You can use dask.visualize(xarray_object)

You could try passing inline_array=True to open_zarr
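
(For concreteness, a minimal sketch of both suggestions applied to the reproducer above; the store path is the one from the original post, and graphviz is needed for the rendering:)

import dask
import xarray as xr

# inline_array=True embeds the zarr array into each task instead of referencing
# it through a single shared key, which can change how the graph is optimized.
ds = xr.open_zarr("/work/scratch/test_zarr_graph", inline_array=True)

mean_uu = (ds.anom_u ** 2).mean("time")

# dask.visualize accepts xarray objects directly (no Delayed needed) and writes
# the task graph to a file without computing anything.
dask.visualize(mean_uu, filename="graph_uu.svg")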

@robin-cls (Author) commented Jun 23, 2022

Thanks for the tips. I was investigating inline_array=True and still had no luck. The graph seems OK, though. I can attach it if you want, but I think zarr is not the culprit.

Here is why:

In the first case, where we build the array from scratch, the ones array is simple, and dask seems to understand that it does not have to make many copies of it. So when replacing ones with random data, we observe the same behavior as when opening the dataset from a zarr store (high memory usage on a worker):

import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset(
    dict(
        anom_u=(["time", "face", "j", "i"], da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
        anom_v=(["time", "face", "j", "i"], da.random.random((10311, 1, 987, 1920), chunks=(10, 1, 987, 1920))),
    )
)

ds["anom_uu_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data**2, axis=0))
ds["anom_vv_mean"] = (["face", "j", "i"], np.mean(ds.anom_v.data**2, axis=0))
ds["anom_uv_mean"] = (["face", "j", "i"], np.mean(ds.anom_u.data * ds.anom_v.data, axis=0))

ds[["anom_uu_mean", "anom_vv_mean", "anom_uv_mean"]].compute()

I think the question now is why dask must load so much data when doing my operation:

[figure: non-optimized computation graph for the three means]

If we take the computation graph (I've put the non-optimized version), my understanding is that we could do the following:

  • Load the first chunk of anom_u
  • Load the corresponding chunk of anom_v
  • Compute anom_u * anom_v, anom_u ** 2, and anom_v ** 2 for that chunk
  • Do the mean-chunk task
  • Unload all the previous tasks
  • Repeat for the next chunks and combine the mean-chunk tasks

For information, one chunk is about 1.4 GB, so I expect to see peaks of 5 * 1.4 = 7 GB in memory (plus what's needed to store the mean-chunk), but I instead get 15 GB+ on my worker, most of it taken by the random-sample tasks.

[figure: worker memory dominated by random-sample tasks]

Is my understanding of the distributed mean wrong? Why are the random-sample tasks not flushed?

@dcherian (Contributor) commented Jun 23, 2022

This looks like the classic distributed scheduling issue that might get addressed soon: dask/distributed#6560

@gjoseph92 the MVCE in #6709 (comment) might be a useful benchmark / regression test (ds.u.mean(), ds.v.mean(), (ds.u * ds.v).mean() all together). This kind of dependency -- variables and combinations of multiple variables being computed together -- turns up a lot in climate model workflows.

@gjoseph92

Thanks @dcherian, yeah this is definitely root task overproduction. I think your case is somewhat similar to @TomNicholas's dask/distributed#6571 (that one might even be a little simpler actually).

There's some prototyping going on to address this, but I'd say "soon" is probably on the couple-month timescale right now, FYI.

dask/distributed#6598 or dask/distributed#6614 will probably make this work. I'm hopefully going to benchmark these against some real workloads in the next couple days, so I'll probably add yours in. Thanks for the MVCE!

Is my understanding of the distributed mean wrong? Why are the random-sample tasks not flushed?

See dask/distributed#6360 (comment) and the linked issues for why this happens.

@dcherian (Contributor)

Thanks @gjoseph92. I think there is a small but important increase in complexity here, because we do ds.u.mean(), ds.v.mean(), and (ds.u * ds.v).mean() all together, so each chunk of ds.u and ds.v is used for two different outputs.

IIUC the example in dask/distributed#6571 is basically just (ds.u * ds.v).mean().
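
(Stripped of the xarray layer, the dependency pattern being described is roughly the following; a sketch with arbitrary small shapes, not the original data:)

import dask
import dask.array as da

u = da.random.random((1000, 100, 100), chunks=(10, 100, 100))
v = da.random.random((1000, 100, 100), chunks=(10, 100, 100))

# Each input chunk of u and v feeds two of the three reductions below,
# so it cannot be released after a single downstream use.
uu_mean = (u ** 2).mean(axis=0)
vv_mean = (v ** 2).mean(axis=0)
uv_mean = (u * v).mean(axis=0)

dask.compute(uu_mean, vv_mean, uv_mean)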

@robin-cls (Author)

Thanks @gjoseph92 and @dcherian. I'll try the different approaches in the links you have provided to see if I can improve my current solution (I compute the fields separately, which means more I/O and more operations).

@gjoseph92

FYI @robin-cls, I would be a bit surprised if there is anything you can do on your end to fix things here with off-the-shelf dask. What @dcherian mentioned in dask/distributed#6360 (comment) is probably the only thing that might work. Otherwise you'll need to run one of my experimental branches.

@gjoseph92

I took a little bit more of a look at this and I don't think root task overproduction is the (only) problem here.

Intuitively, I also feel like this operation shouldn't require holding so many root tasks around at once. But the graph dask is making, or how it's ordering it, doesn't seem to work that way. We can see the ordering is pretty bad:

[figure: multi-mean task graph illustrating the poor ordering]

When we actually run it (on dask/distributed#6614 with overproduction fixed), you can see that dask requires keeping tons of the input chunks in memory, because they're going to be needed by a future task that isn't able to run yet (because not all of its inputs have been computed):

[screenshot: dask dashboard showing many input chunks held in memory while downstream tasks wait]

I feel like it's possible that the order in which dask is executing the input tasks is bad? But I think it's more that I haven't thought about the problem enough, and there's an obvious reason why the graph is structured like this.
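
(For anyone who wants to reproduce this kind of plot: dask can color a rendered graph by its assigned execution order. A sketch on a scaled-down version of the reproducer; graphviz and matplotlib are needed for the rendering and colormap:)

import dask
import dask.array as da

u = da.random.random((110, 1, 98, 192), chunks=(10, 1, 98, 192))
v = da.random.random((110, 1, 98, 192), chunks=(10, 1, 98, 192))

uu = (u ** 2).mean(axis=0)
vv = (v ** 2).mean(axis=0)
uv = (u * v).mean(axis=0)

# color="order" colors each node by the priority dask.order assigns to it,
# which makes orderings like the one above visible at a glance.
dask.visualize(uu, vv, uv, color="order", cmap="viridis", filename="multi-mean-order.svg")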

norlandrhagen added a commit to carbonplan/cmip6-downscaling that referenced this issue Jun 24, 2022
norlandrhagen added a commit to carbonplan/cmip6-downscaling that referenced this issue Jun 30, 2022
* as committed here ran a test case for western US using three features to
predict precip - a couple remaining TODOs but functional at this point

* allow access to wind and standardize names/coordinates

* tweaks for multivariate run

* sample config for multivariate run

* update config files for multivariate and boolean in detrend

* variable switching

* param configs with wind

* split get_gcm by variable due to this issue with dask: pydata/xarray#6709

* switched order of postprocess and added method = nearest to .sel

@fjetter commented Jun 29, 2023

xref dask/dask#10384: I was able to confirm @gjoseph92's suspicion that this is a dask.order issue. It is somehow introduced because xarray runs multiple reductions on the same source data (dask should be robust to this, just pointing it out).

This example should work trivially when either of the following is true (see the sketch after this list):

  1. Only one of the arrays is computed at a time
  2. The xarray dataset is transformed to a dask DataFrame using mean.to_dask_dataframe() (the DataFrame graph looks slightly different and is handled well by dask.order)
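
(A rough sketch of those two workarounds applied to the reproducer from this issue; the means dataset and variable names below are assumptions based on the original post, written with xarray reductions for brevity, and Dataset.to_dask_dataframe is xarray's standard method:)

import xarray as xr

ds = xr.open_zarr("/work/scratch/test_zarr_graph")
means = xr.Dataset(
    dict(
        anom_uu_mean=(ds.anom_u ** 2).mean("time"),
        anom_vv_mean=(ds.anom_v ** 2).mean("time"),
        anom_uv_mean=(ds.anom_u * ds.anom_v).mean("time"),
    )
)

# Workaround 1: compute the reductions one at a time instead of all together.
results = {name: means[name].compute() for name in means.data_vars}

# Workaround 2: convert to a dask DataFrame first; its graph is handled well by dask.order.
df = means.to_dask_dataframe()
df.compute()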

@fjetter commented Sep 28, 2023

Just a heads up: I'm working on a fix for this in dask/dask, see dask/dask#10535.

Preliminary results look very promising.

[figure: memory usage over time for runs with increasing size along the time dimension]

This graph shows the memory usage for a couple of runs with increasing size in the time dimension, using the example as posted in pangeo-data/distributed-array-examples#2 (comment).

[figure: worker memory staying well below the spilling threshold]
This was far away from the spilling threshold (yellow line), so the constant memory usage was indeed due to better scheduling, not spilling or anything like that.

I'm also looking at other workloads. If you are aware of other stuff that should be constant or near-constant in memory usage but isn't, please let me know!

@dcherian (Contributor)

I'm also looking at other workloads. If you are aware of other stuff that should be constant or near-constant in memory usage but isn't, please let me know!

There is this one: dask/distributed#7274 (comment) / dask/distributed#8005, which I didn't manage to make a reproducible example for.

@fjetter commented Sep 29, 2023

There is this one:

I remember, thank you. I'll have a look at it next.

@dcherian (Contributor) commented Oct 4, 2023

(responding to your email here)

Does something like this look bad to you?

import dask.array

big_array = dask.array.random.random((1e5, 3e3, 3e3), chunks=(20, 1000, 1000))
small_array_1 = dask.array.random.random((1, 3e3, 3e3), chunks=1000) # more than 5 chunks, so not root-ish
small_array_2 = dask.array.random.random((1, 3e3, 3e3), chunks=1000) # more than 5 chunks, so not root-ish
result = (big_array * small_array_1 * small_array_2).mean(axis=(-1, -2))

@fjetter commented Oct 4, 2023

Does something like this look bad to you?

Doesn't look bad at all. Nothing here is detected as root-ish, but it is running very smoothly on my machine (both using main and my new ordering PR).

@dcherian (Contributor) commented Oct 9, 2023

Closing since dask/dask#10535 will be merged soon.

dcherian closed this as completed Oct 9, 2023