# My Dask and XArray Journey


With some encouragement I am writing out my impressions of learning dask and xarray. 
Together with Intake-STAC these are essential elements of the Pangeo stack.


## Get Going With XArray


I cover XArray before Dask. XArray is built upon pandas and NumPy. The following steps 
take a few hours to go through, plus additional time spent internalizing
the details, ideally by working your own examples. This is the quickest means 
I am aware of for building XArray skills. Dask is covered later.


* Clone [this repository](https://github.com/coecms-training/introduction_to_xarray).
* Watch and work through the accompanying [8-video YouTube tutorial](https://youtu.be/zoB54IpofYA).
* For supporting skills with pandas, work through chapter 3 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).


## Dask


Dask is a task scheduler that coordinates and speeds up larger computations. Some of what Dask
is good at happens "behind the scenes" in XArray, so in principle there may be nothing to learn
per se. That is a bit vague, however, so let's treat it as a more open-ended inquiry: What
is going on with Dask? 


Approach: From a (possibly Pangeo) Jupyter Lab environment clone 
[the dask tutorial repo](https://github.com/dask/dask-tutorial).


The [YouTube workshop video from 2018](https://youtu.be/mqdglv9GnM8) runs through this tutorial. 


Unfortunately some questions are inaudible so it can be difficult to follow in places. 
Also there is very little motivation in the exposition. One key idea that goes by 
rather quickly is "in memory / not in memory". This refers to whether the data for a given 
calculation fits in the computer's RAM. If not, it may be a good candidate for Dask, which has a 
formalism for breaking tasks into components and executing them in an ordered fashion
on whatever parallel resources are available. 
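
To make "breaking tasks into components" concrete, here is a minimal sketch using `dask.array`. The array shape, the chunk size and the alias `dsa` are my own arbitrary choices for illustration, not something taken from the tutorial. The point is only that the array is described as a grid of chunks and nothing is computed until `compute()` is called.

```python
import dask.array as dsa

# Describe a 2000 x 2000 array of ones as a 4 x 4 grid of 500 x 500 chunks.
# The sizes are arbitrary; a real "not in memory" case would be far larger.
x = dsa.ones((2000, 2000), chunks=(500, 500))

# Number of chunks along each axis.
print(x.numblocks)

# This line only records the work to be done; no chunk is summed yet.
total = x.sum()

# Now the per-chunk sums are executed and combined into one number.
result = total.compute()
print(result)
```

Each chunk is small enough to handle on its own, so in principle the full array never has to sit in RAM at once.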

## XArray spadework

I have two *actual research* objectives that should ideally result in papers. I'll present these as 
short abstracts. 

* *Temperate glaciers are thinning and receding. They also surge episodically, essentially 
decoupling from the glacial bed and moving quickly. We have global remote sensing observations
of glaciers back to 1991 (and earlier) available. This work characterizes quiescent glacier
behavior and captures surge events over a thirty-year interval.*


* *The ocean water column is observed at high resolution at three locations in the northeast
Pacific by the Regional Cabled Array, an observatory that is a major component of the Ocean Observatories
Initiative. This work characterizes means and variances of the ocean as observed by RCA sensors
in both time and depth. It also generates a separate dataset flagging anomalies, often attributed 
to various mixing processes, relative to an idealized smoothly varying sequence of observations with depth.*


I refer to these respectively as the **Ice Problem** and the **Ocean Problem**. 


## How XArray works

Begin with a data model that closely associates coordinates with data. 
To motivate this: Here are  
[examples of Xarray in action](http://xarray.pydata.org/en/stable/examples.html). 


### Model


- XArray is built on two data container forms or *types*: The `Dataset` and the `DataArray`.
  - A Dataset is composed of one or more DataArrays
  - I abbreviate Datasets as `ds` and DataArrays as `da`
    - Useful: Start a variable name with `<source_>`, as in `glodap_`
    - Useful: End a variable name with `<_sensor>`, as in `glodap_ds_temp` 
  - Create a Dataset out of thin air... or by compounding a DataArray
  - Create a DataArray out of thin air... or by extraction from a Dataset 
  - The Xarray formalism extends `pandas` dataframes
    - As noted this is taught in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
  - An XArray `Dataset` is composed of four subsets with standard names
    - Dimensions, Coordinates, Data Variables and Attributes
    - A precise understanding of all four of these is quite helpful
    
    
The parts of an Xarray Dataset:

1. `Dimensions`
2. `Coordinates`
3. `Data variables` 
4. `Attributes`: for my purposes a *dictionary* of metadata. Entries can be created and deleted. 
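
The four parts are easier to see on a tiny example. This is a minimal sketch with made-up numbers: the variable name `temperature`, the coordinate values and the attribute text are all invented for illustration. It builds a `Dataset` "out of thin air", extracts a `DataArray` from it, and goes back the other way.

```python
import numpy as np
import xarray as xr

# Synthetic values: 2 times x 3 depths. None of this is real data.
times = np.array(["2021-01-01", "2021-01-02"], dtype="datetime64[ns]")
depths = np.array([0.0, 10.0, 20.0])
temps = np.array([[10.0, 9.5, 9.0],
                  [10.2, 9.6, 9.1]])

# A Dataset out of thin air: data variables, coordinates, attributes.
ds = xr.Dataset(
    data_vars={"temperature": (("time", "depth"), temps)},
    coords={"time": times, "depth": depths},
    attrs={"source": "synthetic example"},
)

print(ds.sizes)            # dimension names and lengths
print(list(ds.coords))     # coordinate names
print(list(ds.data_vars))  # data variable names
print(ds.attrs)            # attributes: a dictionary of metadata

# Attributes can be created and deleted like dictionary entries.
ds.attrs["note"] = "temporary"
del ds.attrs["note"]

# A DataArray by extraction from a Dataset...
da = ds["temperature"]

# ...and a Dataset by compounding a DataArray.
ds_again = da.to_dataset(name="temperature")

# Coordinates let us select by value rather than by integer index.
mean_at_10m = float(da.sel(depth=10.0).mean())
```

Printing `ds` in a notebook shows the same four headings (Dimensions, Coordinates, Data variables, Attributes) in that order.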
    

## Motivation


Above I mentioned "running out of RAM" as a problem for large calculations that suggests using Dask. 
There is another aspect to this work that also touches on scale. It is common at meetings
to see very focused investigations. These might work from a *small* collection of data of say 
*two* parameters at a *precise* location on the earth for a *narrow* range of time. 
One might for example analyze and publish an excellent paper on Pacific Chorus Frog calls observed 
in spring of 2007 at Lake Tapps, Washington. It may well have required hours of painstaking work
preparing the data; but there it is at last. And then... it might be interesting to generalize the 
results in a larger study, across a larger geographical area and time extent. 


What we would like to avoid is -- say three months later -- doing the same 
painstaking preparation of a slightly different dataset 
beset with its own idiosyncrasies. One imagines writing code once 
and for all, to be used across all the special cases that build out the larger story. 


That's why we're here: Generalization and abstraction. Abstraction of data, of code, 
of compute infrastructure: All the components of data science where we really do not have the time
for bespoke effort endlessly repeated. Our bet is that by going through this learning 
process we will find a net gain in time and effort in data analysis. 


Returning for a moment to one of our two practical examples: The Ice Problem was 
developed in a few-degrees-square region of Southeast Alaska (a single UTM zone)
with a lot of moving ice. The method however applies to the Himalayas, to the 
Patagonian Icefield, to British Columbia and to many other glacier-covered regions. 
The Ice Problem computation could and should be run on a global scale over the 
full time extent of the available data, in excess of a decade, off of a single 
keystroke.


## XArray in more detail

## The Dask narrative


The first thing they try to teach us about Dask is that it has a function -- really a *decorator* -- that operates on a computational task
in two phases. In the first phase dask draws a graph of the problem. In the second phase dask grabs execution threads 
made available by the host computer and uses each of them to resolve the nodes of this graph, which are of course smaller compute tasks
that must be run in some implicit order. This implies there must be something very clever about dask that allows it to construct this
directed acyclic *task solver* graph... but I suspect that the cleverness resides with us as coders. 
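
To convince myself of the two-phase idea, here is a toy version in plain Python. This is *not* how dask is implemented; it is a sketch using only the standard library (`graphlib` needs Python 3.9 or later), and the dictionary layout and the `run` helper are my own invention. Phase one writes the graph down; phase two resolves its nodes in dependency order on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Phase 1: describe the work. Each key maps to (function, *arguments).
# A string argument that names another key is a dependency on that task.
graph = {
    "x": (inc, 1),
    "y": (inc, 2),
    "z": (add, "x", "y"),
}


def run(graph):
    """Phase 2: execute every task once its dependencies are finished."""
    def is_key(arg):
        return isinstance(arg, str) and arg in graph

    deps = {key: {a for a in task[1:] if is_key(a)} for key, task in graph.items()}
    sorter = TopologicalSorter(deps)
    sorter.prepare()

    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            # Everything returned here has all of its inputs available,
            # so these tasks can run side by side.
            futures = {}
            for key in sorter.get_ready():
                func, *args = graph[key]
                resolved = [results[a] if is_key(a) else a for a in args]
                futures[key] = pool.submit(func, *resolved)
            for key, future in futures.items():
                results[key] = future.result()
                sorter.done(key)
    return results


results = run(graph)
print(results["z"])
```

Notice where the structure of the graph came from: I wrote it. The scheduler only follows the dependencies that the code spells out.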


### Dask `delayed`


We begin by using functions that have built-in one-second delays that simulate some computing time. They do trivial things. 
The functions are themselves not touched by the dask formalism; but the composition of these functions into a compute task
brings in the dask function `delayed`.
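
Here is a minimal sketch of that composition, in the spirit of the tutorial's first example. The function names `inc` and `add` and the shortened 0.1-second sleeps are my choices to keep the run quick. The functions themselves are ordinary Python; only the calls are wrapped.

```python
import time

from dask import delayed


def inc(x):
    time.sleep(0.1)  # stand-in for real work
    return x + 1


def add(x, y):
    time.sleep(0.1)  # stand-in for real work
    return x + y


# Nothing runs yet: each call returns a Delayed object describing a task.
x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

# Only now are inc, inc and add actually executed.
result = z.compute()
print(result)
```

The two `inc` calls do not depend on each other, so dask is free to run them at the same time; `add` has to wait for both.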


I learn that `dask.delayed` is a Python *decorator* so here is what that means:


> A decorator is a design pattern in Python that allows a user to add new functionality to an 
existing object without modifying its structure. Decorators are usually called before the 
definition of a function you want to decorate. [...] **Functions in Python [...] support operations 
such as being passed as an argument, returned from a function, modified, and assigned 
to a variable.**
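
A small illustration of the quote, using a hypothetical `count_calls` decorator of my own (nothing to do with dask). It adds call counting to existing functions without changing their bodies, and shows that the `@` syntax is only shorthand for passing a function to another function.

```python
import functools


def count_calls(func):
    """Return a wrapped version of func that counts how often it is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


# Decorator syntax: applied at the point of definition.
@count_calls
def add(x, y):
    return x + y


# The same thing spelled out: pass the function in, get a new function back.
def inc(x):
    return x + 1


inc = count_calls(inc)

total = add(inc(1), inc(2))
print(total, add.calls, inc.calls)
```

`dask.delayed` can be used in both of these ways: as `@delayed` above a function definition, or as `delayed(inc)(1)` around an existing function at the call site.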

Need graphviz to see the graphs...

```
conda install graphviz
```


and then as it still seemed to be non-working...


```
pip install graphviz
```

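It *seemed* like both were necessary, and that is probably right: the conda `graphviz` package provides the Graphviz programs themselves, while the pip `graphviz` package provides the Python bindings that dask's `visualize` uses to call them. Anyway now I have graphs that illustrate dask's thinking. 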


## Impressions of `dask.delayed`

To understand the second and third examples I'm mentally matching `delayed` to any compute-heavy task.
Here that means anything with a built-in `sleep(1)` to mimic a lot of work. So write out sequential code
and wrap `delayed(xxx)` around any slow `xxx`, calling it as `delayed(xxx)(args)`. That's the recipe but it 
misses the implicit finesse from the narrative. I think the finesse is this: the graph builds *instantaneously* 
and then executes *later* ("when needed") via parallel resources. 
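
A sketch of the recipe applied to a loop, with timing around each phase. The function `slow_square`, the input list and the 0.2-second sleep are made up for illustration. How much faster the compute phase runs than the sequential version depends on how many threads the machine offers, so I make no claim about the speedup.

```python
import time

from dask import delayed


def slow_square(x):
    time.sleep(0.2)  # stand-in for a lot of work
    return x * x


data = [1, 2, 3, 4]

# Phase 1: build the graph. No slow_square call happens here.
start = time.perf_counter()
squares = [delayed(slow_square)(x) for x in data]
total = delayed(sum)(squares)
build_seconds = time.perf_counter() - start

# Phase 2: execute the graph "when needed".
start = time.perf_counter()
result = total.compute()
compute_seconds = time.perf_counter() - start

print(f"build:   {build_seconds:.4f} s")
print(f"compute: {compute_seconds:.4f} s")
print(result)
```

With graphviz installed, `total.visualize()` draws the graph that was built in phase 1.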