<div style="text-align: center">
    <img src="../_static/xcdat-logo.png" alt="xCDAT logo" style="display: inline-block; width:450px;">
</div>


# A Gentle Introduction to xCDAT (Xarray Climate Data Analysis Tools)

<h3 style="text-align: left;">
    "A Python package for simple and robust climate data analysis."</h3>
<h3 style="text-align: left; font-style:italic">Core Developers: Tom Vo, Stephen Po-Chedley, Jason Boutte, Jill Zhang, Jiwoo Lee</h3>

---
<p style="text-align: left;">With thanks to Peter Gleckler, Paul Durack, Karl Taylor, and Chris Golaz</p>


_This work is performed under the auspices of the U.S. DOE by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344._

## Presentation Overview

Intended audience: Some or no familiarity with `xarray` and `xcdat`

* What is xCDAT?
* An overview of Xarray
* How does xCDAT fit in the Xarray ecosystem?
* The API design of xCDAT
* Demo of xCDAT capabilities and Dask parallelism
* Wrap up and resources


## What is xCDAT?

* xCDAT is an **extension of xarray** for **climate data analysis on structured grids**
* The goal is to provide **generalizable features and utilities for simple and robust analysis of climate data**
* Jointly developed by scientists and developers from **E3SM** and **PCMDI** at **Lawrence Livermore National Lab**
    * In collaboration with external users and organizations through GitHub
* Performed for the E3SM and **SEATS** (Simplifying ESM Analysis Through Standards) projects

<div style="text-align: center">
<img src="../_static/e3sm-logo.jpg" alt="E3SM logo" style="display: inline-block; width:300px;">
<img src="../_static/PCMDI-logo.png" alt="PCMDI logo" style="display: inline-block; width:300px;">
<img src="../_static/SEATS-logo.png" alt="SEATS logo" style="display: inline-block; width:300px;">

</div>

* Some key xCDAT features are inspired by or ported from the core **CDAT** library
  * Examples: spatial/temporal averaging, regrid2 for horizontal regridding
* Other features **leverage powerful libraries** in the **xarray ecosystem** 
  * Examples: `xESMF` and `CF-xarray`
* xCDAT strives to support **CF compliant datasets** and datasets with **common non-CF compliant metadata**
    * Example common non-CF metadata: time units in “months since …” or “years since …”
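
Units such as "months since …" are awkward because a calendar month has no fixed length, so the offsets cannot be converted with a constant number of seconds. The following is a minimal sketch of calendar-aware decoding with pandas; it is not xCDAT's implementation, and the reference date and offsets are made up for illustration.

```python
import pandas as pd

# Hypothetical non-CF time axis: offsets in "months since 2000-01-01".
units = "months since 2000-01-01"
offsets = [0, 1, 2, 12]

# Everything after "since" is the reference date.
ref_date = pd.Timestamp(units.split("since")[1].strip())

# Calendar-aware arithmetic: each offset advances whole calendar months.
decoded = [ref_date + pd.DateOffset(months=n) for n in offsets]
print([d.strftime("%Y-%m-%d") for d in decoded])
```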

<div style="text-align: center">
    <img src="../_static/CMIP6-logo.png" alt="CMIP6 logo" style="display: inline-block; width:450px">
    <img src="../_static/CF-xarray.png" alt="cf-xarray logo" style="display: inline-block; width:450px;">

</div>


## First, Let's Dive into Xarray

* Xarray is an evolution of an internal tool developed at The Climate Corporation
* Released as open source in May 2014
* __NumFOCUS__ fiscally sponsored project since August 2018


<div style="text-align: center">
    <img src="../_static/xarray-logo.png" alt="xarray logo" style="display: inline-block; width:300px;">
    <img src="../_static/NumFocus-logo.png" alt="NumFOCUS logo" style="display: inline-block; width:300px">
</div>



### Key Features and Capabilities in Xarray

* __“N-D labeled arrays and datasets in Python”__
    * Built upon and extends NumPy and pandas
* __Interoperable with scientific Python ecosystem__ including NumPy, Dask, Pandas, and Matplotlib
* Supports file I/O, indexing and selecting, interpolating, grouping, aggregating, parallelism (Dask), plotting (matplotlib wrapper)
    * Supports I/O for netCDF, Iris, OPeNDAP, Zarr, and GRIB

<div style="text-align: center">
<img src="../_static/numpy-logo.svg" alt="NumPy logo" style="display: inline-block; width:300px;">
<img src="../_static/pandas-logo.svg" alt="Pandas logo" style="display: inline-block; width:300px;">
    <img src="../_static/dask-logo.svg" alt="Dask logo" style="display: inline-block; width:300px">
</div>

Source: <cite>https://xarray.dev/#features</cite>


### Why Xarray?

> "Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like 
> multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer 
> experience."
>
> &mdash; <cite> https://xarray.pydata.org/en/v2022.10.0/getting-started-guide/why-xarray.html</cite>



Xarray uses labels on arrays to provide a powerful and concise interface:

  * __Apply operations over dimensions by name__
    * `x.sum('time')`
  * __Select values by label__ (or logical location) instead of integer location
    * `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`
  * __Mathematical operations vectorize across multiple dimensions__ (array broadcasting) based on __dimension names__, not shape
    * `x - y`
  * Easily use the __split-apply-combine paradigm__ with groupby
    * `x.groupby('time.dayofyear').mean()`
  * __Database-like alignment__ based on coordinate labels that __smoothly handles missing values__
    * `x, y = xr.align(x, y, join='outer')`
  * Keep track of __arbitrary metadata in__ the form of a __Python dictionary__
    * `x.attrs`

Source: <cite>https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html#what-labels-enable</cite>
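
The one-liners above can be tried on a small synthetic array. This is only a sketch: the grid, dates, and values are arbitrary and chosen so the results are easy to inspect.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily data on a tiny 2 x 3 grid (values are arbitrary).
time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(24.0).reshape(4, 2, 3),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": [-10.0, 10.0], "lon": [0.0, 120.0, 240.0]},
    attrs={"units": "K"},
)

total = x.sum("time")                        # reduce over a dimension by name
first_day = x.sel(time="2014-01-01")         # select by label, not position
by_day = x.groupby("time.dayofyear").mean()  # split-apply-combine

# Broadcasting by dimension name: (time, lat, lon) minus (lat,).
anomaly = x - x.mean(("time", "lon"))

# Outer alignment pads labels missing from one operand with NaN.
a, b = xr.align(x.isel(time=slice(0, 3)), x.isel(time=slice(1, 4)), join="outer")

print(total.dims, anomaly.dims, a.sizes["time"], x.attrs)
```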

## The Xarray Core Data Structures

> "Xarray data models are borrowed from netCDF file format, which provides xarray with a natural and portable
> serialization format."
>
> &mdash; <cite>https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html</cite>


Xarray has two core data structures:

1. `xarray.DataArray`
   * A class that attaches __dimension names__, __coordinates__, and __attributes__ to __multi-dimensional arrays__ (aka "labeled arrays")
   * An N-D generalization of a `pandas.Series`
2. `xarray.Dataset`
   * A __dictionary-like container__ of DataArray objects with __aligned dimensions__ 
      * DataArray objects are classified as "coordinate variables" or "data variables"
      * All data variables have a shared __union__ of coordinates
   * Serves a similar purpose to a `pandas.DataFrame`

<div style="text-align: center">
    <img src="../_static/dataset-diagram.webp" alt="xarray Dataset diagram" style="display: inline-block; width:450px">
</div>

### Dissecting Xarray Data Structures in a Real-World Dataset

This example netCDF4 dataset is opened directly from ESGF using xarray's OPeNDAP support.

It contains the `tas` variable, which represents near-surface air temperature.
`tas` is recorded on a monthly frequency.

In [1]:
# This style import is necessary to properly render Xarray's HTML output with
# the Jupyter RISE extension.
# GitHub Issue: https://github.com/damianavila/RISE/issues/594
# Source: https://github.com/smartass101/xarray-pydata-prague-2020/blob/main/rise.css

from IPython.display import HTML

style = """
<style>
.reveal pre.xr-text-repr-fallback {
    display: none;
}

.reveal ul.xr-sections {
    display: grid
}

.reveal ul ul.xr-var-list {
    display: contents
}
</style>
"""


HTML(style)

In [None]:
import xarray as xr

filepath = "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

ds = xr.open_dataset(filepath)

### The `Dataset` Model

A dictionary-like container of labeled arrays (DataArray objects) with aligned dimensions. 

Key properties:

* `dims`: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
* `coords`: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
* `data_vars`: a dict-like container of DataArrays corresponding to variables
* `attrs`: dict to hold arbitrary metadata

Source: <cite>https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataset</cite>
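
To see these properties without downloading anything, here is a minimal sketch that builds a toy `Dataset` by hand. The variable name, coordinates, and attributes are illustrative only, and `sizes` is used to obtain the name-to-length mapping.

```python
import numpy as np
import xarray as xr

# A toy Dataset with one data variable; names and values are illustrative.
toy = xr.Dataset(
    data_vars={"tas": (("time", "lat"), np.zeros((3, 2)))},
    coords={"time": [0, 1, 2], "lat": [-45.0, 45.0]},
    attrs={"title": "toy example"},
)

print(dict(toy.sizes))      # dimension name -> length
print(list(toy.coords))     # coordinate names
print(list(toy.data_vars))  # data variable names
print(toy.attrs)            # arbitrary metadata
```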

In [None]:
ds

### The `DataArray` Model

A class that attaches __dimension names__, __coordinates__, and __attributes__ to __multi-dimensional arrays__ (aka "labeled arrays")

Key properties:

* `values`: a numpy.ndarray holding the array’s values
* `dims`: dimension names for each axis (e.g., ('x', 'y', 'z'))
* `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
* `attrs`: dict to hold arbitrary metadata (attributes)


Source: <cite>https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataarray</cite>
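
The same properties can be inspected on a hand-built `DataArray`. This is a small sketch with made-up values and attribute names, not data from the ESGF file.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.array([[280.0, 285.0], [290.0, 295.0]]),
    dims=("lat", "lon"),
    coords={"lat": [-45.0, 45.0], "lon": [0.0, 180.0]},
    attrs={"units": "K", "long_name": "toy temperature"},
)

print(type(arr.values))          # the underlying numpy.ndarray
print(arr.dims)                  # dimension names, one per axis
print(arr.coords["lat"].values)  # labels along the "lat" dimension
print(arr.attrs["units"])        # metadata lookup
```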

In [None]:
ds.tas

## Resources for Learning Xarray

* Now that you have a general sense of xarray data models, give xarray a shot if you haven't already!
* Here are some highly recommended resources:
  * [Xarray Tutorial](https://tutorial.xarray.dev/intro.html)
  * ["Xarray in 45 minutes"](https://tutorial.xarray.dev/overview/xarray-in-45-min.html#) 
  * [Xarray Documentation](https://docs.xarray.dev/en/stable/index.html)
  * [Xarray API Reference](https://docs.xarray.dev/en/stable/api.html)

## Jumping Forward to xCDAT, an Extension of Xarray

> "Xarray is designed as a general purpose library, and hence tries to avoid including overly domain specific
> functionality. But inevitably, the need for more domain specific logic arises."
>
> &mdash; https://docs.xarray.dev/en/v2022.10.0/internals/extending-xarray.html#extending-xarray

* xCDAT aims to provide **generalizable features and utilities for simple and robust analysis of climate data**.
* xCDAT's design philosophy is focused on **reducing the overhead required to accomplish certain tasks in xarray**.

### Available xCDAT Features

* Extension of xarray's `open_dataset()` and `open_mfdataset()` with post-processing options
  * Generate bounds that don't exist
  * Keep a single data variable in the Dataset
  * Optional decoding of time coordinates, centering of time coordinates, swapping longitudinal axis orientation between [0, 360) and [-180, 180)
* Temporal averaging
  * Time series averages (single snapshot and grouped), climatologies, and departures
  * Weighted or unweighted
  * Optional seasonal configuration (e.g., DJF vs. JFD, custom seasons)
* Geospatial weighted averaging (rectilinear grid)
  * Optional specification of regional domain
* Horizontal structured regridding (rectilinear and curvilinear grids)
  * Python implementation of `regrid2` for handling Cartesian latitude-longitude grids
  * API that wraps `xESMF`

## xCDAT's API Design

xCDAT provides public APIs in two ways:

1. Top-level API functions
   * Example: `xcdat.open_dataset()`, `xcdat.center_times()`
2. Accessor classes
   * xcdat provides `Dataset` accessors, which are __implicit namespaces for custom functionality__.
   * Accessor __namespaces__ clearly identify __separation from built-in xarray methods__.
   * Example: `ds.spatial`, `ds.temporal`, `ds.regridder`



<div style="text-align: center">
    <figure>
    <img src="../_static/accessor_api.svg" alt="xcdat accessor" style="display: inline-block; width:50%">
        <figcaption>xcdat spatial functionality is exposed by chaining the <span style="background-color: #e4e6e8">.spatial</span> accessor attribute to the <span style="background-color: #e4e6e8">xr.Dataset</span> object.</figcaption>
    </figure>
</div>
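
For readers new to accessors, here is a minimal sketch of how one is registered using xarray's `register_dataset_accessor` decorator. The accessor name `demo` and its `global_mean` method are hypothetical and unrelated to xCDAT's real accessors.

```python
import numpy as np
import xarray as xr


# "demo" is a hypothetical accessor name used only for this illustration.
@xr.register_dataset_accessor("demo")
class DemoAccessor:
    def __init__(self, dataset: xr.Dataset):
        self._dataset = dataset

    def global_mean(self, data_var: str) -> xr.Dataset:
        """Return a Dataset holding the unweighted mean of `data_var`."""
        return self._dataset[[data_var]].mean()


ds_toy = xr.Dataset({"tas": (("lat", "lon"), np.array([[1.0, 2.0], [3.0, 4.0]]))})

# The custom method lives under its own namespace on every Dataset.
print(float(ds_toy.demo.global_mean("tas")["tas"]))
```

Keeping custom methods under a namespace such as `ds.demo` avoids name clashes with built-in `Dataset` methods.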



## A Demo of xCDAT Capabilities

* Prerequisites
    * Installing `xcdat`
    * Importing `xcdat`
    * Opening a dataset and applying postprocessing operations
* Scenario 1 - Calculate the spatial averages over the tropical region
* Scenario 2 - Calculate the temporal average
* Scenario 3 - Horizontal regridding (bilinear, gaussian grid)

### Installing `xcdat`

xCDAT is available on Anaconda under the `conda-forge` channel (https://anaconda.org/conda-forge/xcdat)

Two ways to install `xcdat` with recommended dependencies (`xesmf`):
1. Create a conda environment from scratch (`conda create`)
    ```bash
    conda create -n <ENV_NAME> -c conda-forge xcdat xesmf
    conda activate <ENV_NAME>
    ```
2. Install `xcdat` in an existing conda environment (`conda install`)
    ```bash
    conda activate <ENV_NAME>
    conda install -c conda-forge xcdat xesmf
    ```

_Source_: <cite>https://xcdat.readthedocs.io/en/latest/getting-started.html</cite>

### Opening a dataset

This example netCDF4 dataset is opened directly from ESGF using xarray's OPeNDAP support.

It contains the __`tas` variable__, which represents near-surface air temperature.
`tas` is recorded on a monthly frequency.

In [None]:
# This gives access to all xcdat public top-level APIs and accessor classes.
import xcdat as xc

# We import these packages specifically for plotting. It is not required to use xcdat.
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
filepath = "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

ds = xc.open_dataset(
    filepath,
    add_bounds=True,
    decode_times=True,
    center_times=True
)

# Unit adjustment from Kelvin to Celsius
ds["tas"] = ds.tas - 273.15

In [None]:
ds

### Scenario 1: Spatial Averaging

Related accessor: `ds.spatial`

In this example, we calculate the spatial average of `tas` over the tropical region.
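
Conceptually, a spatial average on a rectilinear grid weights each cell by its area. The sketch below shows the idea in plain xarray on a synthetic field, assuming cos(latitude) weights and selecting cells by their centers. xCDAT generates its weights from the coordinate bounds, so its results may differ slightly from this approximation.

```python
import numpy as np
import xarray as xr

# Synthetic field: temperature depends only on latitude (values are made up).
lat = np.arange(-87.5, 90.0, 5.0)
lon = np.arange(2.5, 360.0, 5.0)
field = xr.DataArray(
    30.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones(lon.size),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
    name="tas",
)

# Restrict to the tropics, then weight each cell by cos(latitude).
tropics = field.sel(lat=slice(-25, 25))
weights = np.cos(np.deg2rad(tropics.lat))
tropical_mean = tropics.weighted(weights).mean(("lat", "lon"))

print(float(tropical_mean), float(tropics.mean()))
```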

In [None]:
ds_trop_avg = ds.spatial.average("tas", axis=["X","Y"], lat_bounds=(-25,25))

In [None]:
ds_trop_avg.tas

#### Plot the first 100 time steps

In [None]:
ds_trop_avg.tas.isel(time=slice(0, 100)).plot()

### Scenario 2: Temporal Averaging

Related accessor: `ds.temporal`

In this example, we calculate the temporal average of `tas` as a single snapshot. The time dimension is removed after averaging.
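
To show why weighting matters for monthly data, here is a minimal plain-xarray sketch that weights each month by its number of days. The data are synthetic, and deriving the weights from `days_in_month` is an assumption made for illustration rather than a description of how xCDAT computes its weights.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One non-leap year of synthetic monthly means (values are made up).
time = pd.date_range("2001-01-01", periods=12, freq="MS")
monthly = xr.DataArray(
    np.arange(12.0), dims=("time",), coords={"time": time}, name="tas"
)

# Weight each month by its length in days.
days = monthly.time.dt.days_in_month
weighted_avg = monthly.weighted(days).mean("time")
unweighted_avg = monthly.mean("time")

print(int(days.sum()), float(unweighted_avg), float(weighted_avg))
```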

In [None]:
ds_avg = ds.temporal.average("tas", weighted=True)
ds_avg.tas

#### Plot the temporal average

In [None]:
ds_avg.tas.plot(label="weighted")

### Scenario 3: Horizontal Regridding

Related accessor: `ds.regridder`

In this example, we will generate a gaussian grid with 32 latitudes to regrid our input data to.
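
For background, the latitudes of a Gaussian grid are the arcsines of the Gauss-Legendre quadrature nodes. The NumPy sketch below computes them for 32 latitudes. Pairing them with 2 × 32 equally spaced longitudes is an assumption about the grid layout, and the output of `xc.create_gaussian_grid` may be organized differently.

```python
import numpy as np

nlat = 32

# Gauss-Legendre nodes are sin(latitude) of the Gaussian latitudes.
nodes, quad_weights = np.polynomial.legendre.leggauss(nlat)
gaussian_lats = np.rad2deg(np.arcsin(nodes))

# Assumption: a regular Gaussian grid with 2 * nlat equally spaced longitudes.
lons = np.linspace(0.0, 360.0, 2 * nlat, endpoint=False)

print(gaussian_lats[:3].round(2), lons[:3])
```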

#### Create the output grid

In [None]:
output_grid = xc.create_gaussian_grid(32)

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 4))

ds.regridder.grid.plot.scatter('lon', 'lat', s=0.01, ax=axes[0])
axes[0].set_title('Input Grid')

output_grid.plot.scatter('lon', 'lat', s=0.1, ax=axes[1])
axes[1].set_title('Output Grid')

plt.tight_layout()

#### Regrid the data

xCDAT offers horizontal regridding with `xESMF` (default) and a Python port of `regrid2`.
We will be using `xESMF` to regrid.


In [None]:
# xesmf supports "bilinear", "conservative", "nearest_s2d", "nearest_d2s", and "patch"
output = ds.regridder.horizontal('tas', output_grid, tool='xesmf', method='bilinear')

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 4))

ds.tas.isel(time=0).plot(ax=axes[0])
axes[0].set_title('Input data')

output.tas.isel(time=0).plot(ax=axes[1])
axes[1].set_title('Output data')

plt.tight_layout()

### Parallelism with Dask

> "Nearly all existing xarray methods have been extended to work automatically with Dask arrays for parallelism"
>
> &mdash; <cite>https://docs.xarray.dev/en/stable/user-guide/dask.html#using-dask-with-xarray</cite>


* Parallelized xarray methods include __indexing, computation, concatenating and grouped operations__
* __xCDAT inherently supports Dask parallelism__ for most APIs by building upon xarray methods
  * Dask arrays are loaded into memory only when absolutely required (e.g., decoding time, handling bounds)

<div style="text-align:center">
  <img src="../_static/dask-logo.svg" alt="Dask logo" style="display: inline-block; width:300px;">
</div>

#### High-level Overview of Dask Mechanics

* __Dask divides arrays__ into many small pieces, called __"chunks"__ (each presumed to be small enough to fit into memory)
* Dask array __operations are lazy__
  * Operations __queue__ up a series of tasks mapped over blocks
  * No computation is performed until values need to be computed (lazy)
  * Data is loaded into memory and __computation__ is performed in __streaming fashion__, __block-by-block__
* Computation is controlled by a multi-processing or thread pool


Source: <cite>https://docs.xarray.dev/en/stable/user-guide/dask.html</cite>
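
The mechanics above can be observed directly with `dask.array`. This is a small sketch with an arbitrary array shape and chunk size, and it requires `dask` to be installed.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into 250 x 250 chunks (16 blocks).
arr = da.ones((1000, 1000), chunks=(250, 250))

# Building the expression only records tasks; nothing is computed yet.
lazy_mean = (arr * 2).mean()
print(type(lazy_mean).__name__, arr.numblocks)

# .compute() triggers the block-by-block evaluation.
print(float(lazy_mean.compute()))
```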

#### How do I activate Dask with Xarray/xCDAT?

* The usual way to create a Dataset filled with Dask arrays is to load the data from a netCDF file or files
* You can do this by supplying a `chunks` argument to `open_dataset()` or by using the `open_mfdataset()` function
  * By default, `open_mfdataset()` will chunk each netCDF file into a single Dask array
  * Supply the `chunks` argument to control the size of the resulting Dask arrays
  * Xarray maintains a Dask array until it is not possible (raises an exception instead of loading into memory)

Source: <cite>https://docs.xarray.dev/en/stable/user-guide/dask.html#reading-and-writing-data</cite>
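
The effect of chunking can also be tried without any file by calling `.chunk()` on an in-memory `Dataset`. This is a sketch with a made-up variable and chunk size, and it requires `dask` to be installed.

```python
import numpy as np
import xarray as xr

# In-memory stand-in for a file on disk (values are arbitrary).
ds_mem = xr.Dataset({"tas": (("time", "lat"), np.zeros((6, 4)))})

# .chunk() converts the NumPy-backed variable into a Dask array,
# analogous to passing chunks={"time": 2} when opening a file.
ds_chunked = ds_mem.chunk({"time": 2})

print(ds_chunked["tas"].chunks)   # chunk sizes per dimension
result = ds_chunked["tas"].sum()  # still lazy
print(float(result.compute()))    # computed on request
```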

In [None]:
filepath = "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

# Pass the `chunks` argument to activate Dask arrays
# NOTE: `open_mfdataset()` automatically chunks by the number of files, which
# might not be optimal.
ds = xc.open_dataset(
    filepath,
    chunks={"time": 1}
)

In [None]:
ds

#### Example of Parallelism in xCDAT 

In [None]:
tas_global = ds.spatial.average("tas", axis=["X", "Y"], weights="generate")["tas"]
tas_global

#### Dask Guidance

Visit these pages for more guidance (e.g., when to parallelize):

* Parallel computing with Dask: https://docs.xarray.dev/en/stable/user-guide/dask.html
* Xarray with Dask Arrays: https://examples.dask.org/xarray.html

[//]: # "TODO: Add link to xCDAT Dask guidance"

## Wrapping Things Up
* "Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience"
* xCDAT is an extension of xarray for climate data analysis on structured grids. 
* xCDAT aims to make analyzing climate data simple and robust with xarray

### We'd love your support!

* The xCDAT core team's mission is to provide a __maintainable and extensible package that serves the needs of the climate community in the long-term__. 
* Please check out the repository and give it a star to increase xCDAT's visibility!
* We're always open to contributions, whether through GitHub issues, pull requests, discussions, etc.


Repository: https://github.com/xCDAT/xcdat

### Resources

__If you have comments or questions, reach out to us over email or the GitHub discussions forum!__

* GitHub repository: https://github.com/xCDAT/xcdat
* GitHub Discussions forum: https://github.com/xCDAT/xcdat/discussions
* Documentation: https://xcdat.readthedocs.io/en/latest/
* Anaconda page: https://anaconda.org/conda-forge/xcdat


