# My Dask and XArray Journey

I thought it might be of value (after some encouragement) to write down my impressions of learning dask and xarray,
which together with Intake-STAC are essential elements of the Pangeo stack.


## Preamble 


First I logged in to my Pangeo Jupyter Lab environment and grabbed the dask tutorial:


```
git clone https://github.com/dask/dask-tutorial
```


Then I watched about an hour of the YouTube workshop video from 2018. Unfortunately the audience questions
are inaudible, so the answers do not close out the ideas being presented. There is also very little motivation
in the exposition, so it is not clear "why are we here?" This is not to disparage the presentation, which 
made tons of sense at the time in its context. 

## XArray spadework

So I have two *actual research* objectives. The first is to chart all of the surging glaciers captured
by remote sensing since 1991. On Earth. The second is to characterize the statistical behavior of
the ocean water column at three locations in the Pacific for further use in oceanographic biogeochemistry. 
I will refer to these respectively as the **Ice Problem** and the **Ocean Problem**. 


Earlier efforts on the Ice Problem taught me a few things about how XArray works. Let me explain that and 
also what I *suspect* will be the end-result of this program I'm on. 


### How XArray works

First we need a data model: a collection of ideas and terms that abstractly describe 
many datasets, capturing their commonality. These 
[examples of Xarray in action](http://xarray.pydata.org/en/stable/examples.html)
help build context. Here are some key starting points from my experience:


- XArray is built on two data container forms or *types*: The `Dataset` and the `DataArray`.
  - I abbreviate these `ds` and `da`
    - Also useful: Start a container name with a source, as in `glodap_`
    - Also useful: Append sensor, as in `glodap_ds_temp` 
  - Create a Dataset out of thin air... or from a DataArray
  - Create a DataArray out of thin air... or from a Dataset 
  - The Xarray formalism expands from `pandas` dataframes
    - Consequently one should learn dataframes *first*
      - By carefully following Jake VanderPlas's chapter on pandas
  - An XArray `Dataset` is composed of four parts with standard names
    - These four components are interrelated
    - A precise understanding of all four is extremely helpful
    
The parts of an Xarray Dataset:

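1. `dims`: the named dimensions (axes) and their lengths, for example `time` and `depth`
2. `coords`: the coordinates; labeled values along the dimensions, such as timestamps or depths
3. `data_vars`: the data variables; each one is a `DataArray` defined on some of the dimensions
4. `attrs`: a dictionary of data attributes; metadata; descriptors. These can be created/deleted. 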
    
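Here is a minimal sketch of the ideas above: a `DataArray` made out of thin air, a `Dataset` made from it, and a look at the four parts (`dims`, `coords`, `data_vars`, `attrs`). The temperature values, the dimension names `time` and `depth`, and the `glodap_` variable names are all invented for illustration; this is not real GLODAP data.

```python
import numpy as np
import xarray as xr

# Hypothetical data: temperature at 3 depths for 4 time steps (values invented).
temp = np.arange(12, dtype=float).reshape(4, 3)

# A DataArray "out of thin air": values plus named dimensions and coordinates.
glodap_da_temp = xr.DataArray(
    temp,
    dims=("time", "depth"),
    coords={"time": np.arange(4), "depth": [0.0, 10.0, 20.0]},
    name="temp",
    attrs={"units": "degC"},
)

# A Dataset from a DataArray...
glodap_ds = glodap_da_temp.to_dataset()
# ...and a DataArray back out of the Dataset.
glodap_da_again = glodap_ds["temp"]

# attrs is an ordinary dictionary: entries can be created and deleted.
glodap_ds.attrs["source"] = "made-up example"
del glodap_ds.attrs["source"]

# The four parts of the Dataset.
print(dict(glodap_ds.sizes))       # dimension names and lengths
print(list(glodap_ds.coords))      # coordinate names
print(list(glodap_ds.data_vars))   # data variable names
print(glodap_ds.attrs)             # Dataset-level metadata (empty here)
```

Note that the `units` attribute stays attached to the `temp` variable rather than moving up to the `Dataset`; each level carries its own `attrs`.
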

### Why do this? 


It is easy enough to do a very very focused investigation
into a *small* collection of data of say *two* parameters at a *precise* location on the earth 
for a *narrow* range of time. You might analyze and publish an excellent paper on Pacific Chorus Frog 
chirps as observed in spring of 2007 at Lake Tapps, Washington. But you could well find yourself
staying up into the small hours writing specialized code that cleans your data and renders
the analysis and the key illustrative charts... then three months later
find yourself doing the exact same painstaking preparation for a slightly different dataset 
beset with its own idiosyncrasies. You think 'I wish I could write this code just once and for all and use 
it for all these special cases.' 


That's why we're here. The key is abstraction. Abstraction of data, of code, of compute
infrastructure: All the components of data science where we really do not have the time
for bespoke effort endlessly repeated. Our bet is that by going through this learning 
process we will find a net gain in time and effort in data analysis. 


Returning for a moment to one of our two practical examples: The Ice Problem was 
developed in a few-degrees-square region of Southeast Alaska (a single UTM zone)
with a lot of moving ice. The method however applies to the Himalayas, to the 
Patagonian Icefield, to British Columbia and to many other glacier-covered regions. 
The Ice Problem computation could and should be run on a global scale over the 
full time extent of the available data, in excess of a decade, off of a single 
keystroke.


### How XArray works in more detail

## The Dask narrative


The first thing they try to teach us about Dask is that it has a method -- really a *decorator* -- that operates on a computational task
in two phases. In the first phase dask draws a graph of the problem. In the second phase dask grabs execution threads 
made available by the host computer and uses each of them to resolve the nodes of this graph, which are of course smaller compute tasks
that must be run in some implicit order. This implies there must be something very clever about dask that allows it to construct this
directed acyclic *task solver* graph... but I suspect that the cleverness resides with us as coders. 


### Dask `delayed`


We begin by using functions that have built-in one-second delays that simulate some computing time. They do trivial things. 
The functions are themselves not touched by the dask formalism; but the composition of these functions into a compute task
brings in the dask function `delayed`.


I learn that `dask.delayed` is a Python *decorator* so here is what that means:


> A decorator is a design pattern in Python that allows a user to add new functionality to an 
existing object without modifying its structure. Decorators are usually called before the 
definition of a function you want to decorate. [...] **Functions in Python [...] support operations 
such as being passed as an argument, returned from a function, modified, and assigned 
to a variable.**

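To make the quoted definition concrete for myself, here is a small decorator in plain Python with no dask involved. The names `timed` and `inc` are my own, made up for this sketch; the point is only that a decorator takes a function, wraps it, and hands back the wrapped version under the same name.

```python
import functools
import time

def timed(func):
    """Decorator: report how long each call to func takes."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.2f} s")
        return result
    return wrapper

@timed  # same as writing: inc = timed(inc)
def inc(x):
    time.sleep(0.1)  # stand-in for real work
    return x + 1

value = inc(1)
```
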
Need graphviz to see the graphs...

```
conda install graphviz
```


and then as it still seemed to be non-working...


```
pip install graphviz
```

It *seemed* like both were necessary but that seems odd... maybe just the `conda install` is all that was needed. Anyway now I have graphs that illustrate dask's thinking. 


### Impressions of `dask.delayed`

To understand the second and third examples I'm mentally matching `delayed` to any compute-heavy task.
Here that means anything with a built-in `sleep(1)` to mimic a lot of work. So write out sequential code
and stick `delayed(xxx)` around any slow `xxx()`. That's the recipe, but it misses the implicit finesse 
from the narrative. I think the finesse is this: the graph builds *instantaneously* and then executes *later* ("when needed")
via parallel resources. 
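
Here is a sketch of that recipe as I understand it, assuming dask is installed. The functions `inc` and `add` are stand-ins with a one-second sleep, in the spirit of the tutorial's examples; the timing variables are my own additions. Note the call shape: `delayed(inc)(1)` wraps the function first and then "calls" it, which only records a node in the graph.

```python
import time
from dask import delayed

def inc(x):
    time.sleep(1)  # pretend this is a lot of work
    return x + 1

def add(x, y):
    time.sleep(1)  # pretend this is a lot of work
    return x + y

# Phase 1: build the graph. Nothing slow runs here.
start = time.perf_counter()
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)
build_seconds = time.perf_counter() - start

# Phase 2: execute the graph. The two inc calls do not depend on each
# other, so the scheduler is free to run them at the same time.
start = time.perf_counter()
result = total.compute()
compute_seconds = time.perf_counter() - start

print(f"result = {result}")
print(f"graph build: {build_seconds:.3f} s, compute: {compute_seconds:.3f} s")
```

Run sequentially, the three calls would take about three seconds. If the two `inc` calls really do run in parallel, the compute phase should come in closer to two, while the build phase takes almost no time at all. With graphviz installed, `total.visualize()` should draw the graph.
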