
Minimize duplication in map_blocks task graph #8412

Merged · 23 commits into pydata:main · Jan 3, 2024

Conversation

@dcherian (Contributor) commented Nov 3, 2023

Builds on #8560

cc @max-sixty

print(len(cloudpickle.dumps(da.chunk(lat=1, lon=1).map_blocks(lambda x: x))))
# graph size in bytes: 779354739 on main -> 47699827 with this PR
print(len(cloudpickle.dumps(da.chunk(lat=1, lon=1).drop_vars(da.indexes).map_blocks(lambda x: x))))
# 15981508 with the indexes dropped entirely

This is a quick attempt. I think we can generalize this to minimize duplication.

The downside is that the graphs are not totally embarrassingly parallel any more.
This PR:
(task graph visualization)

vs main:
(task graph visualization)
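
(For illustration only, a minimal sketch in raw dask-graph terms, not xarray's actual code, of the topology difference the two graphs show: on main every block task embeds its own copy of the sliced index data, while with this PR that data lives in its own tasks and is referenced by key.)

import numpy as np
from dask.threaded import get

index_shard = np.arange(3)  # stand-in for a sliced/filtered index

# main-style graph: each block task carries its own copy of the shard,
# so the graph is embarrassingly parallel but the data is duplicated
duplicated = {
    ("block", i): (lambda block, idx: block + idx.sum(), i, index_shard.copy())
    for i in range(4)
}

# PR-style graph: the shard is a task of its own and block tasks reference it
# by key, so it appears only once but the blocks now share a dependency
shared = {"index-shard": index_shard}
shared.update(
    {("block", i): (lambda block, idx: block + idx.sum(), i, "index-shard")
     for i in range(4)}
)

print(get(shared, [("block", i) for i in range(4)]))  # dask substitutes "index-shard" by key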

@max-sixty (Collaborator) commented:

Thanks a lot @dcherian !

(I don't have enough context to know how severe the change to the parallelism is. I really do appreciate that .map_blocks is simple in concept, and gets around Dask tripping over itself by just making an opaque function and running it lots of times. Possibly we could do the index filtering locally, which would need much more setup time, but retain .map_blocks' simplicity...)

@dcherian (Contributor, Author) commented Nov 4, 2023

> Possibly we could do the index filtering locally, which would need much more setup time,

We do filter the indexes. The problem is that the filtered index values are duplicated a very large number of times for the calculation. The duplication allows the graph to be embarrassingly parallel.

And then we include them a second time to enable nice error messages.
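
(A tiny demonstration of that duplication cost, independent of xarray: when every task carries its own copy of the same values, the serialized graph grows with the number of tasks.)

import cloudpickle
import numpy as np

shard = np.arange(1_000, dtype="float64")  # ~8 kB of index values

one_task = {("block", 0): (np.sum, shard.copy())}
many_tasks = {("block", i): (np.sum, shard.copy()) for i in range(100)}

# the second graph is roughly 100x larger, because each task embeds its own
# copy of the same values instead of referencing a single shared task
print(len(cloudpickle.dumps(one_task)), len(cloudpickle.dumps(many_tasks)))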

@max-sixty (Collaborator) commented:

> We do filter the indexes. The problem is that the filtered index values are duplicated a very large number of times for the calculation. The duplication allows the graph to be embarrassingly parallel.

Ah right, yes. I confirmed that — the size difference scales by n_blocks, not n_blocks*index_size, so it must be filtering:


da = xr.tutorial.load_dataset('air_temperature').isel(lat=slice(25 // 2), lon=slice(53 //2))

[ins] In [9]: len(cloudpickle.dumps(da.chunk(lat=1, lon=1).map_blocks(lambda x: x)))
Out[9]: 18688240

[ins] In [10]: len(cloudpickle.dumps(da.chunk(lat=1, lon=1).drop_vars(da.indexes).map_blocks(lambda x: x)))
Out[10]: 3766137

Defer to you on how this affects dask stability...

@dcherian (Contributor, Author) commented:

@fjetter do you think dask/distributed will handle the change in graph topology in the OP gracefully? map_blocks seems to have decent use in the wild to work around dask scheduling issues, so it would be nice to not break that. Alternatively, is there a better way to scatter out the duplicated data?

@dcherian (Contributor, Author) commented:

FWIW this graph seems to be what blockwise constructs for broadcasting:

dask.array.blockwise(
    lambda x,y: x+y,
    'ij',
    dask.array.ones((3,), chunks=(1,)),
    'i',
    dask.array.ones((5,), chunks=(1,)),
    'j',
).visualize()

(task graph visualization)
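
(For anyone following along, a quick way to confirm the sharing in that blockwise graph; the dsk / dependents names below are just local to this snippet.)

from collections import Counter

import dask.array
from dask.core import get_dependencies

out = dask.array.blockwise(
    lambda x, y: x + y, "ij",
    dask.array.ones((3,), chunks=(1,)), "i",
    dask.array.ones((5,), chunks=(1,)), "j",
)
dsk = dict(out.__dask_graph__())

# count how many tasks depend on each key in the materialized graph
dependents = Counter(
    dep for key in dsk for dep in get_dependencies(dsk, key)
)

# each block of the "i" input feeds 5 output tasks and each block of the "j"
# input feeds 3: the inputs are shared between tasks rather than copied into them
print(dependents)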

@fjetter commented Dec 18, 2023

> map_blocks seems to have decent use in the wild to work around dask scheduling issues, so it would be nice to not break that.

If the point of "working around scheduling issues" is to forcefully flatten the graph into a purely embarrassingly parallel workload, that property is now gone, but I believe you are still fine.

I am not super familiar with xarray datasets, so I am doing a bit of guesswork here. IIUC this example dataset has three coordinates / indices (lat, lon, time), which are numpy arrays (always?) that are known to the client, i.e. known at graph construction time. IIUC the issue being fixed here is that these arrays are being duplicated?

Then there is also the air data variable, which is the actual payload. In this example it is also a numpy array, but in a realistic case it would come from remote storage, e.g. a zarr store. We want to release these tasks ASAP.

If this is all correct, then yes, this is handled gracefully by dask (at least with the latest release, haven't checked older ones)

import xarray as xr
from dask.utils import key_split
from dask.order import diagnostics
from dask.base import collections_to_dsk
da = xr.tutorial.load_dataset('air_temperature')

dsk = collections_to_dsk([da.chunk(lat=1, lon=1).map_blocks(lambda x: x)])
diag, _ = diagnostics(dsk)
ages_data_tasks = [
    v.age == 1
    for k, v in diag.items()
    if key_split(k).startswith('xarray-air')
]
assert ages_data_tasks
assert all(ages_data_tasks)

Age refers to the number of "ticks" / time steps a task survives. age=1 means that once data is "produced", i.e. the task is scheduled, its consumer is scheduled right afterwards, so the data is released after one time step.
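
(Continuing the snippet above, a one-liner to eyeball the full age distribution rather than just the data tasks.)

from collections import Counter

# histogram of ages over every task in the graph; small ages mean
# intermediate results are released almost immediately
print(Counter(v.age for v in diag.values()))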

> Alternatively, is there a better way to scatter out the duplicated data?

If those indices are truly always numpy arrays, I would probably suggest just slicing them to whatever size each task needs and embedding them, keeping the embarrassingly parallel workload. I think I do not understand this problem sufficiently; it feels like I'm missing something.
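
(A sketch of that "slice and embed" idea in plain dask-graph terms, not xarray's actual machinery: each task gets only the piece of the index it needs, so the graph stays flat, at the cost of building and embedding the slices at graph-construction time.)

import numpy as np
from dask.threaded import get

lat = np.linspace(-90, 90, 8)                             # an index known on the client
block_slices = [slice(i, i + 2) for i in range(0, 8, 2)]  # one slice per block

# every task embeds just its own slice of the index; there are no shared
# dependencies, so the workload stays embarrassingly parallel
dsk = {
    ("block", i): (lambda idx: idx.mean(), lat[sl].copy())
    for i, sl in enumerate(block_slices)
}

print(get(dsk, [("block", i) for i in range(len(block_slices))]))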

@dcherian (Contributor, Author) commented:

> I think I do not understand this problem sufficiently; it feels like I'm missing something.

Broadcasting means that the tiny shards get duplicated a very large number of times in the graph. The OP was prompted by a 1GB task graph.
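
(A rough back-of-the-envelope with entirely hypothetical sizes, only to show how this kind of duplication reaches the GB range.)

# hypothetical numbers: a coordinate of 50_000 float64 values that is not
# chunked along, copied into every one of 2_500 block tasks
coordinate_bytes = 50_000 * 8
n_blocks = 2_500
print(coordinate_bytes * n_blocks / 1e9, "GB of duplicated coordinate data")  # 1.0 GB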

@dcherian added the "plan to merge (Final call for comments)" label on Dec 20, 2023
dcherian and others added 4 commits December 20, 2023 10:28
* main:
  Adapt map_blocks to use new Coordinates API (pydata#8560)
  add xeofs to ecosystem.rst (pydata#8561)
  Offer a fixture for unifying DataArray & Dataset tests (pydata#8533)
  Generalize cumulative reduction (scan) to non-dask types (pydata#8019)
@dcherian merged commit d87ba61 into pydata:main on Jan 3, 2024 (25 of 27 checks passed)
@dcherian deleted the map-blocks-indexes-fix branch on January 3, 2024
dcherian added a commit to dcherian/xarray that referenced this pull request Jan 4, 2024
* upstream/main:
  Faster encoding functions. (pydata#8565)
  ENH: vendor SerializableLock from dask and use as default backend lock, adapt tests (pydata#8571)
  Silence a bunch of CachingFileManager warnings (pydata#8584)
  Bump actions/download-artifact from 3 to 4 (pydata#8556)
  Minimize duplication in `map_blocks` task graph (pydata#8412)
  [pre-commit.ci] pre-commit autoupdate (pydata#8578)
  ignore a `DeprecationWarning` emitted by `seaborn` (pydata#8576)
  Fix mypy type ignore (pydata#8564)
  Support for the new compression arguments. (pydata#7551)
  FIX: reverse index output of bottleneck move_argmax/move_argmin functions (pydata#8552)
Labels: plan to merge (Final call for comments), topic-dask

Merging this pull request may close: Task graphs on .map_blocks with many chunks can be huge

3 participants