<img src="http://xarray.pydata.org/en/stable/_static/dataset-diagram-logo.png" align="right" width="30%">

# Dask

This notebook demonstrates one of xarray's most powerful features: the ability
to wrap [dask arrays](https://docs.dask.org/en/stable/array.html) and allow users to seamlessly execute analysis code in
parallel.

By the end of this notebook, you will:

1. Know that xarray DataArrays and Datasets are "dask collections", i.e. you can
   execute top-level dask functions such as `dask.visualize(xarray_object)`
2. Learn that all xarray built-in operations can transparently use dask



In [None]:
import numpy as np
import xarray as xr

First, let's set up a `LocalCluster` using [dask.distributed](https://distributed.dask.org/).

You can use any kind of dask cluster. This step is completely independent of
xarray. Although a cluster is not strictly necessary, its dashboard is a nice
learning tool.


In [None]:
from dask.distributed import Client

client = Client()
client

<p>&#128070;</p> Click the Dashboard link above. Or click the "Search" button in the dashboard.

Let's test that the dashboard is working.


In [None]:
import dask.array

dask.array.ones((1000, 4), chunks=(2, 1)).compute()  # should see activity in dashboard

<a id='readwrite'></a>

## Reading data with Dask and Xarray

The `chunks` argument to both `open_dataset` and `open_mfdataset` allows you to
read datasets as dask arrays. See
https://xarray.pydata.org/en/stable/dask.html#reading-and-writing-data for more
details.


In [None]:
ds = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={  # this tells xarray to open the dataset as a dask array
        "lat": 25,
        "lon": 25,
        "time": -1,
    },
)
ds

## Examining a DataArray with dask

The repr for the `air` DataArray shows the very nice dask repr.


In [None]:
ds.air

Access the underlying chunk sizes using `.chunks`.

In [None]:
ds.air.chunks

**Tip**: All variables in a `Dataset` need _not_ have the same chunk size along
common dimensions.
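To make the tip concrete, here is a minimal sketch that uses a small synthetic dataset instead of the tutorial data. The names `demo`, `a`, and `b`, the shapes, and the chunk sizes are all made up for illustration.

```python
import dask.array as da
import xarray as xr

# Two hypothetical variables that share the dimensions ("x", "y")
# but are chunked differently along both of them.
demo = xr.Dataset(
    {
        "a": (("x", "y"), da.ones((4, 6), chunks=(2, 3))),
        "b": (("x", "y"), da.ones((4, 6), chunks=(4, 2))),
    }
)

# Each variable keeps its own chunking inside the Dataset.
print(demo["a"].chunks)
print(demo["b"].chunks)
```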


<a id='compute'></a>

## Parallel/streaming/lazy computation using dask.array with Xarray

Xarray seamlessly wraps dask, so all computation is deferred until explicitly
requested.


In [None]:
mean = ds.air.mean("time")  # no activity on dashboard
mean  # contains a dask array

This is true for all xarray operations, including slicing


In [None]:
ds.air.isel(lon=1, lat=20)

and more complicated operations...


In [None]:
timeseries = ds.air.rolling(time=5).mean().isel(lon=1, lat=20)  # no activity on dashboard
timeseries  # contains dask array

In [None]:
timeseries = ds.air.rolling(time=5).mean()  # no activity on dashboard
timeseries  # contains dask array

### Getting concrete values from dask arrays

At some point, you will want to actually get concrete values (_usually_ a numpy array) from dask.

There are two ways to compute values on dask arrays.

1. `.compute()` returns a new xarray object backed by numpy arrays and leaves the
   original object unchanged
2. `.load()` replaces the dask array in the xarray object with a numpy array.
   This is equivalent to `ds = ds.compute()`


In [None]:
computed = mean.compute()  # activity on dashboard
computed  # has real numpy values

Note that `mean` still contains a dask array.


In [None]:
mean

But if we call `.load()`, `mean` will now contain a numpy array.


In [None]:
mean.load()

Let's check that again...


In [None]:
mean

**Tip:** `.persist()` loads the values into distributed RAM. This is useful if
you will be repeatedly using a dataset for computation but it is too large to
load into local memory. You will see a persistent task on the dashboard.

See https://docs.dask.org/en/latest/api.html#dask.persist for more details.
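As a rough illustration of `.persist()`, the sketch below uses a tiny synthetic array; `toy`, `doubled`, and `persisted` are hypothetical names. The block does not create a distributed client, so the chunks simply stay in local memory; with a cluster attached they would live on the workers instead.

```python
import dask
import dask.array as da
import xarray as xr

# Hypothetical stand-in for a large dask-backed DataArray.
toy = xr.DataArray(da.ones((6, 4), chunks=(3, 2)), dims=("x", "y"))

doubled = toy * 2              # lazy: nothing has been computed yet
persisted = doubled.persist()  # computes the chunks and keeps them in memory

# The result is still a dask collection, so later operations stay lazy,
# but they start from the already-computed chunks.
still_dask = dask.is_dask_collection(persisted)
total = float(persisted.sum().compute())
print(still_dask, total)
```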


### Extracting underlying data: `.values` vs `.data`

There are two ways to pull out the underlying data in an xarray object.

1. `.values` will always return a NumPy array. For dask-backed xarray objects,
   this means that compute will always be called
2. `.data` will return the underlying array as-is. For dask-backed xarray
   objects, this is a Dask array and no compute is triggered

#### Exercise

Try extracting a dask array from `ds.air`


Now extract a NumPy array from `ds.air`. Do you see compute activity on your
dashboard?


## Xarray data structures are first-class dask collections

This means you can do things like `dask.compute(xarray_object)`,
`dask.visualize(xarray_object)`, `dask.persist(xarray_object)`. This works for
both DataArrays and Datasets.
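A minimal sketch of passing an xarray object straight to a top-level dask function, assuming a small synthetic Dataset (`toy_ds` and its variable `t` are hypothetical names, not part of the tutorial data):

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical dask-backed Dataset with a single variable "t".
toy_ds = xr.Dataset({"t": (("x", "y"), da.ones((4, 4), chunks=(2, 2)))})

# dask recognizes the Dataset itself as a collection.
is_collection = dask.is_dask_collection(toy_ds)

# dask.compute accepts one or more collections and returns a tuple of results.
(computed_ds,) = dask.compute(toy_ds + 1)

# The computed Dataset is now backed by a numpy array.
print(is_collection, type(computed_ds["t"].data))
```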

### Exercise

Visualize the task graph for `mean`


Visualize the task graph for `mean.data`. Is that the same as the above graph?


Gracefully shut down our client.

In [None]:
client.close()