diff --git a/advanced/xarray_and_dask.ipynb b/advanced/xarray_and_dask.ipynb deleted file mode 100644 index de5720e1..00000000 --- a/advanced/xarray_and_dask.ipynb +++ /dev/null @@ -1,395 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "# Dask\n", - "\n", - "This notebook demonstrates one of xarray's most powerful features: the ability\n", - "to wrap [dask arrays](https://docs.dask.org/en/stable/array.html) and allow users to seamlessly execute analysis code in\n", - "parallel.\n", - "\n", - "By the end of this notebook, you will:\n", - "\n", - "1. Xarray DataArrays and Datasets are \"dask collections\" i.e. you can execute\n", - " top-level dask functions such as `dask.visualize(xarray_object)`\n", - "2. Learn that all xarray built-in operations can transparently use dask\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import xarray as xr" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First lets set up a `LocalCluster` using [dask.distributed](https://distributed.dask.org/).\n", - "\n", - "You can use any kind of dask cluster. This step is completely independent of\n", - "xarray. While not strictly necessary, the dashboard provides a nice learning\n", - "tool.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from dask.distributed import Client\n", - "\n", - "client = Client()\n", - "client" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "

👆

Click the Dashboard link above. Or click the \"Search\" button in the dashboard.\n", - "\n", - "Let's test that the dashboard is working..\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import dask.array\n", - "\n", - "dask.array.ones((1000, 4), chunks=(2, 1)).compute() # should see activity in dashboard" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Reading data\n", - "\n", - "The `chunks` argument to both `open_dataset` and `open_mfdataset` allow you to\n", - "read datasets as dask arrays. See\n", - "https://xarray.pydata.org/en/stable/dask.html#reading-and-writing-data for more\n", - "details\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds = xr.tutorial.open_dataset(\n", - " \"air_temperature\",\n", - " chunks={ # this tells xarray to open the dataset as a dask array\n", - " \"lat\": 25,\n", - " \"lon\": 25,\n", - " \"time\": -1,\n", - " },\n", - ")\n", - "ds" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Examining a DataArray" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The repr for the `air` DataArray shows the very nice dask repr.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds.air" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Access the underlying chunk sizes using `.chunks`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds.air.chunks" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Tip**: All variables in a `Dataset` need _not_ have the same chunk size along\n", - "common dimensions.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Lazy computation\n", - "\n", - "Xarray seamlessly wraps dask so all computation is deferred until explicitly\n", - "requested\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mean = ds.air.mean(\"time\") # no activity on dashboard\n", - "mean # contains a dask array" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is true for all xarray operations including slicing\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "ds.air.isel(lon=1, lat=20)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "and more complicated operations...\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "timeseries = ds.air.rolling(time=5).mean().isel(lon=1, lat=20) # no activity on dashboard\n", - "timeseries # contains dask array" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "timeseries = ds.air.rolling(time=5).mean() # no activity on dashboard\n", - "timeseries # contains dask array" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Getting concrete values\n", - "\n", - "At some point, you will want to actually get concrete values (_usually_ a numpy array) from dask.\n", - "\n", - "There are two ways to compute values on dask arrays.\n", - "\n", - "1. `.compute()` returns an xarray object\n", - "2. `.load()` replaces the dask array in the xarray object with a numpy array.\n", - " This is equivalent to `ds = ds.compute()`\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "computed = mean.compute() # activity on dashboard\n", - "computed # has real numpy values" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that `mean` still contains a dask array\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mean" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "But if we call `.load()`, `mean` will now contain a numpy array\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mean.load()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's check that again...\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "mean" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Tip:** `.persist()` loads the values into distributed RAM. This is useful if\n", - "you will be repeatedly using a dataset for computation but it is too large to\n", - "load into local memory. You will see a persistent task on the dashboard.\n", - "\n", - "See https://docs.dask.org/en/latest/api.html#dask.persist for more\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Extracting underlying data:\n", - "\n", - "There are two ways to pull out the underlying data in an xarray object.\n", - "\n", - "1. `.values` will always return a NumPy array. For dask-backed xarray objects,\n", - " this means that compute will always be called\n", - "2. `.data` will return a Dask array\n", - "\n", - "#### Exercise\n", - "\n", - "Try extracting a dask array from `ds.air`\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now extract a NumPy array from `ds.air`. Do you see compute activity on your\n", - "dashboard?\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Xarray objects as dask collections.\n", - "\n", - "This means you can do things like `dask.compute(xarray_object)`,\n", - "`dask.visualize(xarray_object)`, `dask.persist(xarray_object)`. This works for\n", - "both DataArrays and Datasets\n", - "\n", - "### Exercise\n", - "\n", - "Visualize the task graph for `mean`\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Visualize the task graph for `mean.data`. Is that the same as the above graph?\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Gracefully shutdown our client." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "client.close()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "state": {}, - "version_major": 2, - "version_minor": 0 - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}