# My Dask and XArray Journey


With some encouragement I am writing out my impressions of learning Dask and XArray which, 
together with Intake-STAC, are essential elements of the Pangeo stack.


## Get Going With XArray


I cover XArray before Dask. XArray is built upon pandas and NumPy. The following steps 
take a few hours to go through, plus additional time spent internalizing
the details, ideally by working your own examples. This is the quickest means 
I am aware of for building XArray skills. Dask is covered later.


* Clone [this repository](https://github.com/coecms-training/introduction_to_xarray).
* Watch and work through the accompanying [8-video YouTube tutorial](https://youtu.be/zoB54IpofYA)
* For backing skills with pandas: Work through chapter 3 of the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)


## Dask


Dask is a task scheduler that coordinates and speeds up larger computations. Some of what Dask
is good at happens "behind the scenes" in XArray, so in principle there is nothing to learn
per se. However this is a bit vague, so let's approach it as a more open-ended inquiry: What
is going on with Dask? 


Approach: From a (possibly Pangeo) Jupyter Lab environment clone 
[the dask tutorial repo](https://github.com/dask/dask-tutorial).


The [YouTube workshop video from 2018](https://youtu.be/mqdglv9GnM8) runs through this tutorial. 


Unfortunately some questions are inaudible, so it can be difficult to follow in places. 
Also there is very little motivation in the exposition. One key idea that goes by 
rather quickly is "in memory / not in memory". This refers to whether the data for a given 
calculation fits in the computer's RAM. If not, it may be a good candidate for Dask, which has a 
formalism for breaking tasks into components and executing them in an ordered fashion
on whatever parallel resources are available. 
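
As a minimal sketch of the "not in memory" idea, assuming `dask` (with its array module) is installed: the array below is described in chunks rather than allocated in one piece. The sizes are arbitrary choices for illustration, and the alias `dsa` is mine, chosen so it does not collide with `da` for DataArrays.

```python
import dask.array as dsa

# Describe a 10000 x 10000 array of float64 ones (about 800 MB if fully allocated)
# as a 10 x 10 grid of 1000 x 1000 chunks. Nothing is computed at this point.
x = dsa.ones((10000, 10000), chunks=(1000, 1000))

# The reduction is lazy too: this only builds a task graph over the 100 chunks.
total = x.sum()

# compute() runs the graph chunk by chunk, so the full array never has to sit in RAM.
result = total.compute()
print(x.numblocks, result)
```

The same pattern applies when the chunks come from files on disk instead of `ones()`.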

## XArray spadework

I have two *actual research* objectives that should ideally result in papers. I'll present these as 
short abstracts. 

* *Temperate glaciers are thinning and receding. They also surge episodically, essentially 
decoupling from the glacial bed and moving quickly. We have global remote sensing observations
of glaciers back to 1991 (and earlier) available. This work characterizes quiescent glacier
behavior and captures surge events over a thirty-year interval.*


* *The ocean water column is observed at high resolution at three locations in the northeast
Pacific by the Regional Cabled Array (RCA), an observatory that is a major component of the Ocean 
Observatories Initiative. This work characterizes means and variances of the ocean as observed by RCA sensors
in both time and depth. It also generates a separate dataset flagging anomalies in an idealized 
smoothly varying sequence of observations with depth; these are often attributed to various mixing processes.*


I refer to these respectively as the **Ice Problem** and the **Ocean Problem**. 


## How XArray works

Begin with a data model that closely associates coordinates with data. 
To motivate this: Here are  
[examples of Xarray in action](http://xarray.pydata.org/en/stable/examples.html). 


### Model


- XArray is built on two data container forms or *types*: The `Dataset` and the `DataArray`.
  - A Dataset is comprised of one or more DataArrays
  - I abbreviate Datasets as `ds` and DataArrays as `da`
    - Useful: Start a variable name with `<source>_`, as in `glodap_`
    - Useful: End a variable name with `_<sensor>`, as in `glodap_ds_temp` 
  - Create a Dataset out of thin air... or by compounding a DataArray
  - Create a DataArray out of thin air... or by extraction from a Dataset 
  - The Xarray formalism expands from `pandas` dataframes
    - As noted this is taught in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
  - An XArray `Dataset` is comprised of four subsets with standard names
    - Dimensions, Coordinates, Data Variables and Attributes
    - A precise understanding of all four of these is quite helpful
    
    
The parts of an Xarray Dataset:

1. `Dimensions`
2. `Coordinates`
3. `Data variables` 
4. `Attributes`: for my purposes, a *dictionary* of metadata. Entries can be created and deleted. 
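
To make the four parts concrete, here is a small sketch that builds a Dataset from scratch and then inspects each part. All names and values (`temperature`, `salinity`, the dates and depths) are invented for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2019-06-23", periods=4, freq="D")
depth = np.array([10.0, 20.0, 30.0])

# A DataArray "out of thin air": values plus named dimensions and coordinates
da_temp = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "depth"),
    coords={"time": time, "depth": depth},
    name="temperature",
)

# A Dataset by compounding DataArrays that share the same coordinates
ds = xr.Dataset({"temperature": da_temp, "salinity": da_temp * 0.0 + 35.0})

# Attributes behave like a dictionary of metadata: create and delete entries
ds.attrs["source"] = "synthetic example"
ds.attrs["note"] = "temporary"
del ds.attrs["note"]

print(dict(ds.sizes))      # Dimensions and their lengths
print(list(ds.coords))     # Coordinates
print(list(ds.data_vars))  # Data variables
print(ds.attrs)            # Attributes

# A DataArray by extraction from a Dataset
da_salt = ds["salinity"]
```

Printing `ds` itself shows all four parts at once, which is usually the quickest way to get oriented in an unfamiliar Dataset.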
    

## Motivation


Above I mentioned "running out of RAM" as a problem: Large calculations suggest using Dask. 
Another scale aspect, the 'positive flip side of the coin', stems from focused investigation.
A *small* data collection, perhaps 2 or 3 parameters at a specific site over a limited
time range can take a great deal of time and effort. However these are increasingly expanding
out to much larger datasets via satellite proxies, deployable sensors with high sample rates
and other such innovations. One hopes to generalize a
specific result to a larger study. 


While the data acquisition might scale upwards, the task of painstakingly cleaning up data
for analysis might 'come along for the ride', creating a time bottleneck. This motivates
the kind of work we find in geospatial data handling projects such as Xarray and Dask. 


Returning for a moment to our two practical examples... the Ice and Ocean problems.


The Ice Problem was developed in a few-degrees-square region of Southeast Alaska 
(a single UTM zone) with a lot of moving ice. The method however applies to the Himalayas, 
to the Patagonian Icefield, to British Columbia and to many other glacier-covered regions. 
The Ice Problem computation ought to run on a global scale over the 
full time extent of the available data, in excess of a decade, from 'a single keystroke'.
Albeit after a few preliminary keystrokes. 


The Ocean Problem grows in scale according to this progression:


- There is one Regional Cabled Array under consideration
  - Note there are others, e.g. Neptune Canada, but let's stay 'small' for the moment
- This Array has three profiling sites: Axial Base, Oregon Slope Base, Oregon Offshore
- Each profiling site has three instrument platforms: Deep profiler, shallow profiler, shallow platform
- Each platform carries multiple sensor packages, each generating one or more data streams
  - CTD
    - Time, Pressure, Temperature, Salinity, Dissolved Oxygen
  - Fluorometer
  - PAR
  


## XArray subsets


XArray subsetting can be confusing. The first step is to use the `sel()` convenience method 
with a slice for parameter range. This operates on **dimensions** (as does `isel()`) leading
to a potential source of confusion. So the following remark is offset for emphasis: 


> Use `sel()` and slices to subset data by a dimension (with its corresponding like-named
coordinate). However `sel()` is *not* usable to filter on non-dimensional `Coordinates` or
on `Data variables`. These are filterable using the `where()` method. 


For the Ocean Problem we would like -- for a given time range and site and profiler 
platform --  a reconciliation of as many as *nine* instruments, each with one or more 
sensor streams. And by reconciliation we mean that it handles various sampling rates 
gracefully. Features of this reconciliation include:


* An XArray Dataset
  * Dimensions that accommodate typical "one sample per second" instruments
  * Dimensions that accommodate slower sampling rates (pH, nitrate)
  * Dimensions of lat and lon to make location accessible
  * Dimension of pressure in dbar corresponding roughly to depth in meters
* An aggregated XArray Dataset
  * In n-day blocks (n = 1, 2, 3, ..., 7, 8) over a year x m vertical blocks (say 10 meter depth intervals)
  * For each sensor data stream: Mean, standard deviation, number of samples, depth, center time
  * Handle missing data gracefully via np.nan values
* Second aggregated XArray Dataset
  * As above, with time of day also factored in relative to daylight
* Annotation dataset
  * Coordinate: By date and profile
  * Presence / Absence
    * Inversion signals
    * "thin layer"
    * "out of bounds" signal


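```
from numpy import datetime64 as dt64, timedelta64 as td64

# rca_ds_chlor is a chlorophyll Dataset loaded earlier (not shown here)
time0 = dt64('2019-06-23T00')            # a known good start time
time1 = time0 + td64(20, 'h')            # 20 hours later; a good time range

# sel() with a slice subsets along a dimension, here time
rca_subds_chlor = rca_ds_chlor.sel(time = slice(time0, time1))
# this second sel() only works if int_ctd_pressure is a dimension (indexed) coordinate
rca_subds_chlor_pressure = rca_subds_chlor.sel(int_ctd_pressure = slice(0., 40.))
rca_subds_chlor_pressure
```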


### A digression on the journey to `where()`


I had some severe confusion on a really basic aspect of XArray, alluded to above. I had 
an xarray Dataset from a marine profiler changing its depth with time while generating sensor values. 


* Let the `Data variable` be `PAR` that varies with `Dimension time`
  * I want to subset the `PAR` data based on a time range and a depth range
* There is a `pressure` (i.e. depth) `Coordinate` that varies with `time`
  * This is a `Coordinate` that is not a `Dimension`: a *non-dimension coordinate* in XArray terms
* `time` is a `Dimension` and also a `Coordinate` per NetCDF-CF convention. 
  * In an XArray Dataset printout this fact is indicated by an asterisk next to the `time Coordinate`
* Subsetting the data by time range works fine:
  * `small = ds.sel(time=slice(t0,t1))`
* Subsetting this further to a depth range ***does not work using `sel()`***


Eventually I realized that `.where()` does the right sort of filtering by `Coordinate`. This distinction
is not immediately apparent in the native documentation. Solution: 

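```
# here the non-dimension coordinate is named depth
# where() masks non-matching samples with NaN (pass drop=True to discard them instead)
smaller = small.where(small.depth < 60.)
smaller2 = smaller.where(smaller.depth > 40.)
```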
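
Here is a self-contained version of the same two-step subsetting on synthetic data, so the difference between `sel()` and `where()` can be tried without the real profiler files. The Dataset, its variable `PAR` and its coordinate `depth` are stand-ins I made up to match the description above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic profiler record: PAR and depth both vary along the single dimension time
time = pd.date_range("2019-06-23", periods=200, freq="min")
depth = np.linspace(0.0, 199.0, 200)           # the profiler descends 1 m per minute
ds = xr.Dataset(
    {"PAR": ("time", np.exp(-depth / 50.0))},
    coords={"time": time, "depth": ("time", depth)},
)

# Dimension coordinate: sel() with a slice works (both end points are included)
small = ds.sel(time=slice("2019-06-23T00:30", "2019-06-23T01:30"))

# Non-dimension coordinate: where() with a combined condition;
# drop=True discards the non-matching samples instead of masking them with NaN
smaller = small.where((small.depth > 40.0) & (small.depth < 60.0), drop=True)
print(smaller.sizes["time"], float(smaller.depth.min()), float(smaller.depth.max()))
```

Without `drop=True` the result keeps the full length of `small`, with `PAR` set to NaN outside the depth range.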

## The Dask narrative


The first thing they try to teach us about Dask is that it has a method -- really a *decorator* -- that operates on a computational task
in two phases. The first phase is where dask draws a graph of the problem; and the second phase is where dask grabs execution threads 
made available by the host computer and uses each of them to resolve the nodes of this graph which are of course smaller compute tasks
that must be run in some implicit order. This implies there must be something very clever about dask that allows it to construct this
directed acyclic *task solver* graph... but I suspect that the cleverness resides with us as coders. 
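
To make the two phases tangible, here is a toy in plain Python. It is emphatically not how dask is implemented; it is a sketch of the idea, with a hand-written graph (a dictionary I invented for this example) and a naive executor built on the standard library's thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Phase 1: describe the work as a graph. Each entry maps a name to
# (function, arguments); an argument that names another entry is a dependency.
graph = {
    "a": (inc, 1),
    "b": (inc, 2),
    "c": (add, "a", "b"),
}

def is_dep(arg, graph):
    return isinstance(arg, str) and arg in graph

# Phase 2: repeatedly run every task whose inputs are already available.
def execute(graph, max_workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(results) < len(graph):
            # tasks not yet run whose dependencies have all been computed
            ready = [
                key for key, (func, *args) in graph.items()
                if key not in results
                and all(a in results for a in args if is_dep(a, graph))
            ]
            if not ready:
                raise ValueError("graph has a cycle")
            # independent tasks are submitted together so they can run in parallel
            futures = {}
            for key in ready:
                func, *args = graph[key]
                inputs = [results[a] if is_dep(a, graph) else a for a in args]
                futures[key] = pool.submit(func, *inputs)
            for key, future in futures.items():
                results[key] = future.result()
    return results

print(execute(graph)["c"])
```

Note that the ordering is implied by the dependencies we wrote into the graph, which is the sense in which the cleverness resides with the coder.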


### Dask `delayed`


We begin by using functions that have built-in one-second delays that simulate some computing time. They do trivial things. 
The functions are themselves not touched by the dask formalism; but the composition of these functions into a compute task
brings in the dask function `delayed`.


I learn that `dask.delayed` is a Python *decorator* so here is what that means:


> A decorator is a design pattern in Python that allows a user to add new functionality to an 
existing object without modifying its structure. Decorators are usually called before the 
definition of a function you want to decorate. [...] **Functions in Python [...] support operations 
such as being passed as an argument, returned from a function, modified, and assigned 
to a variable.**
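
A minimal illustration of the decorator pattern in plain Python, independent of dask. The decorator `timed` and the function `slow_inc` are hypothetical names made up for this sketch.

```python
import functools
import time

def timed(func):
    """Decorator: wrap func so each call also records how long it took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None
    return wrapper

@timed                      # same as writing: slow_inc = timed(slow_inc)
def slow_inc(x):
    time.sleep(0.1)         # stand-in for real work
    return x + 1

value = slow_inc(1)
print(value, slow_inc.last_elapsed)
```

The body of `slow_inc` is untouched; the new behavior lives entirely in the wrapper, which is the point of the quoted definition.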

Need graphviz to see the graphs...

```
conda install graphviz
```


and then as it still seemed to be non-working...


```
pip install graphviz
```

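It *seemed* like both were necessary, and that is plausible: as far as I can tell the conda `graphviz` package supplies the Graphviz executables while the pip `graphviz` package supplies the Python bindings (packaged in conda as `python-graphviz`). Anyway now I have graphs that illustrate dask's thinking. 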


## Impressions of `dask.delayed`

To understand the second and third examples I'm matching `delayed` mentally to any compute-heavy task.
Here that means anything with a built-in `sleep(1)` to mimic a lot of work. So write out sequential code
and stick `delayed(xxx)` around any slow `xxx()`. That's the recipe but it misses the implicit finesse 
from the narrative. I think this is 'the graph builds *instantaneously* and then executes *later* ("when needed")
via parallel resources'. 
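
Here is that recipe as a small sketch, assuming `dask` is installed. The functions `inc` and `add` are stand-ins in the spirit of the tutorial, with half-second sleeps in place of real work.

```python
from time import sleep, perf_counter
from dask import delayed

def inc(x):
    sleep(0.5)            # stand-in for real work
    return x + 1

def add(x, y):
    sleep(0.5)
    return x + y

start = perf_counter()
a = delayed(inc)(1)       # returns immediately: a node in the graph, not a number
b = delayed(inc)(2)
c = delayed(add)(a, b)
build_time = perf_counter() - start

start = perf_counter()
result = c.compute()      # the graph executes here; inc(1) and inc(2) can run in parallel
run_time = perf_counter() - start

print(result, round(build_time, 3), round(run_time, 3))
```

Building the graph should take almost no time, while `compute()` takes roughly as long as the slowest chain of dependent tasks. With graphviz installed, `c.visualize()` draws the graph.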