# CSS 120: Environmental Data Science

## Lecture 08 - DataArrays for Global Climate Data

(Based on Project Pythia)

## Today's Lecture

So far, we have learned a lot about single-variable statistics.

Now we are going to study a data type called `DataArray`, which provides a simple way to use `numpy` arrays that carry named dimensions, coordinates, and attributes.

Think of it as the `pandas` for climate research.

## Climate Systems

![](./img/l08img01.png)

## Variations on Climate Systems

![](./img/l08img02.png)

## DataArrays and Datasets

![](./img/l08img03.png)

## Packages

The first thing we need to do is to install the `xarray` package.

Uncomment the line below, run it, comment it out again, and restart the kernel.

In [None]:
# !pip install xarray # Then restart the kernel

## Packages

Now that we have it installed and the kernel restarted, let us load the packages:

In [None]:
# Packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# XArray
import xarray as xr

## `Xarray` Package

[Xarray](https://xarray.pydata.org/en/v2023.05.0/getting-started-guide/why-xarray.html) expands on the capabilities of [NumPy](https://numpy.org/doc/stable/user/index.html#user) arrays, providing streamlined data manipulation.

It is similar to [Pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), but focused on N-dimensional arrays of data (i.e. grids).

Its interface is based largely on the netCDF data model (variables, attributes, and dimensions).

It provides functionality similar to netCDF-java's [Common Data Model (CDM)](https://docs.unidata.ucar.edu/netcdf-java/current/userguide/common_data_model_overview.html).

## `DataArray` Object

The `DataArray` is one of the basic building blocks of Xarray (see docs [here](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataarray)). 

It is a `numpy.ndarray`-like object that adds two critical pieces of functionality:

1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful
2. It has a built-in container for attributes
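
A minimal sketch of both points, using a small made-up array. The names `da`, `station`, and `day`, and all the values, are illustrative assumptions and not part of the lecture data:

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 array: two stations, three days of readings
da = xr.DataArray(
    np.array([[280.0, 281.5, 283.0], [290.0, 291.5, 293.0]]),
    coords={"station": ["A", "B"], "day": [1, 2, 3]},
    dims=["station", "day"],
    attrs={"units": "kelvin"},
)

# 1. Select by coordinate label instead of integer position
value_b_day2 = da.sel(station="B", day=2).item()

# 2. Metadata lives in the built-in attrs dictionary
units = da.attrs["units"]
print(value_b_day2, units)
```

With a plain `numpy` array, the same lookup would be `arr[1, 1]`, and we would have to remember what each axis and index means.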

## Some (cooked) Data

For this example, let us create a random array of *temperature* data in units of Kelvin:

In [None]:
rand_data = 283 + 5 * np.random.randn(5, 3, 4)
rand_data

## `DataArray`

Now we create a basic `DataArray` just by passing our plain `rand_data` array as an input:

In [None]:
temperature = xr.DataArray(rand_data)

## `DataArray`: Manipulation

Note a few things:

1. It generates some basic dimension names for us (`dim_0`, `dim_1`, `dim_2`).

2. Wrapping the numpy array in a `DataArray` gives us a rich display in the notebook.

We will now improve on that.

In [None]:
temperature

## `DataArray`: Assign Dimension Names

Much of the power of Xarray comes from making use of named dimensions. So let us add some more useful names.

We can do that by passing an ordered list of names using the keyword argument `dims`:

In [None]:
temperature = xr.DataArray(rand_data, dims=["time", "lat", "lon"])
temperature
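
As a hedged illustration of why named dimensions help, the sketch below averages over time by name instead of by axis number. The names `demo_data`, `demo`, and `time_mean` are hypothetical and kept separate from the lecture variables:

```python
import numpy as np
import xarray as xr

# Illustrative data with the same shape as the lecture example
rng = np.random.default_rng(0)
demo_data = 283 + 5 * rng.standard_normal((5, 3, 4))
demo = xr.DataArray(demo_data, dims=["time", "lat", "lon"])

# Average over time by name; with plain NumPy we would need to remember axis=0
time_mean = demo.mean(dim="time")
print(time_mean.dims)
```

The result keeps only the `lat` and `lon` dimensions, and the code states which dimension was reduced.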

## `DataArray`: Assign Dimension Names

This is already an improvement over `numpy` arrays. 

Even better, when we create the `DataArray` we can attach arrays holding the coordinate values for each of these dimensions.

Now let us create a full `DataArray`, with coordinates and attributes.

## `DataArray`: Make Time Coordinates

Here we will use [Pandas](https://foundations.projectpythia.org/core/pandas.html) to create an array of [datetime data](https://foundations.projectpythia.org/core/datetime.html), which we will then use to create a `DataArray` with a named coordinate `time`.


In [None]:
times_index = pd.date_range("2024-04-01", periods=5)
times_index

## `DataArray`: Make Space Coordinates

We'll also create arrays to represent sample longitude and latitude:


In [None]:
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)

In [None]:
lons

In [None]:
lats

## `DataArray`: Add all this info

When we create the `DataArray` instance, we pass in the arrays we just created:

In [None]:
temperature = xr.DataArray(
    rand_data, 
    coords=[times_index, lats, lons], 
    dims=["time", "lat", "lon"]
)
temperature
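
Once coordinates are attached, we can select data by label with `.sel`. The sketch below rebuilds an equivalent array under hypothetical `demo_*` names so that it runs on its own. The query point (41 N, 95 W) is an arbitrary assumption:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild an array like the lecture's, under illustrative names
demo_times = pd.date_range("2024-04-01", periods=5)
demo_lats = np.linspace(25, 55, 3)
demo_lons = np.linspace(-120, -60, 4)
demo = xr.DataArray(
    283 + 5 * np.random.randn(5, 3, 4),
    coords=[demo_times, demo_lats, demo_lons],
    dims=["time", "lat", "lon"],
)

# Select a single day by its date label
one_day = demo.sel(time="2024-04-03")

# Nearest-neighbor match: the grid point closest to (41 N, 95 W)
nearest = demo.sel(lat=41.0, lon=-95.0, method="nearest")
print(float(nearest.lat), float(nearest.lon))
```

Here `method="nearest"` snaps the request to the closest grid values, which is handy when the point of interest does not sit exactly on the grid.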

## `DataArray`: Set Useful Attributes

We can also set some attribute metadata, which will help provide clear descriptions of the data. 

In this case, we can specify that we're looking at 'air_temperature' data and the units are 'kelvin'.

In [None]:
temperature.attrs["units"] = "kelvin"
temperature.attrs["standard_name"] = "air_temperature"

In [None]:
temperature

## `DataArray`: Attributes Are Not Preserved by Default

If we perform a mathematical operation with the `DataArray`, the coordinate values persist, but the attributes are lost. 

This is because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.

To illustrate this, let's do a simple unit conversion from Kelvin to Celsius:

In [None]:
temperature_in_celsius = temperature - 273.15
temperature_in_celsius

## `DataArray`: Attributes Are Not Preserved by Default

We usually wish to keep metadata with our dataset, even after manipulating the data. 

For example, it can tell us the units of a variable of interest. 

So when you perform operations on your data, make sure to check that all the information you want is carried over. 

For an in-depth discussion of how Xarray handles metadata, you can find more information in the Xarray documents [here](http://xarray.pydata.org/en/stable/getting-started-guide/faq.html#approach-to-metadata).
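
One way to carry metadata through an operation explicitly is the `keep_attrs` option of `xr.set_options`. The sketch below is a minimal example on a tiny made-up array (`demo` and `demo_celsius` are hypothetical names). Note that the copied `units` value is no longer correct after the conversion, which is exactly the risk described above, so we fix it by hand:

```python
import numpy as np
import xarray as xr

# Tiny illustrative array with metadata attached
demo = xr.DataArray(
    np.array([283.15, 293.15]),
    dims=["time"],
    attrs={"units": "kelvin", "standard_name": "air_temperature"},
)

# Ask xarray to carry attributes through the operation
with xr.set_options(keep_attrs=True):
    demo_celsius = demo - 273.15

# The copied "units" value is now wrong, so we correct it by hand
demo_celsius.attrs["units"] = "degC"
print(demo_celsius.attrs)
```

Setting the attributes again after the operation, as in the cells above, is an equally valid approach. The string `"degC"` is just one possible spelling of the unit.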

## `Dataset` and `DataArray`

Along with `DataArray`, the other key object type in Xarray is the `Dataset`.

It is a dictionary-like container that holds one or more `DataArray`s, which can also optionally share coordinates (see docs [here](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset)).

The most common way to create a `Dataset` object is to load data from a file (which we will practice in a later tutorial).

Here, instead, we will create another `DataArray` and combine it with our `temperature` data.

This will illustrate how the information about common coordinate axes is used.

## Pressure `DataArray` Using the Same Coordinates

For our next `DataArray` example, we'll create a random array of `pressure` data in units of hectopascal (hPa).

This code mirrors how we created the `temperature` object above.

In [None]:
pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)
pressure = xr.DataArray(
    pressure_data, coords=[times_index, lats, lons], dims=["time", "lat", "lon"]
)
pressure.attrs["units"] = "hPa"
pressure.attrs["standard_name"] = "air_pressure"

## Pressure `DataArray`

In [None]:
pressure

## Creating a `Dataset` Object

Each `DataArray` in our `Dataset` needs a name.

The most straightforward way to create a `Dataset` with our `temperature` and `pressure` arrays is to pass a dictionary using the keyword argument `data_vars`:

In [None]:
ds = xr.Dataset(
    data_vars = {
        "Temperature": temperature, 
        "Pressure": pressure
    }
)

## Creating a `Dataset` Object

Notice that the `Dataset` object `ds` is aware that both data arrays sit on the same coordinate axes.

In [None]:
ds
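
Because the coordinates are shared, a single selection applies to every variable in the `Dataset`. The sketch below builds a stand-in dataset under the hypothetical names `demo_coords` and `demo_ds` so that it is self-contained:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Shared coordinates, mirroring the lecture's grid
demo_coords = {
    "time": pd.date_range("2024-04-01", periods=5),
    "lat": np.linspace(25, 55, 3),
    "lon": np.linspace(-120, -60, 4),
}
demo_dims = ["time", "lat", "lon"]
demo_ds = xr.Dataset(
    data_vars={
        "Temperature": xr.DataArray(
            283 + 5 * np.random.randn(5, 3, 4), coords=demo_coords, dims=demo_dims
        ),
        "Pressure": xr.DataArray(
            1000 + 5 * np.random.randn(5, 3, 4), coords=demo_coords, dims=demo_dims
        ),
    }
)

# One selection on the shared "lat" axis subsets both variables at once
southern_row = demo_ds.sel(lat=25.0)
print(southern_row["Temperature"].dims, southern_row["Pressure"].dims)
```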

## Access Data Variables and Coordinates in a `Dataset`

We can pull out any of the individual `DataArray` objects in a few different ways.

Using the "dot" notation:

In [None]:
ds.Pressure

## Access Data Variables and Coordinates in a `Dataset`

Or using dictionary access like this:

In [None]:
ds["Pressure"]
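
Coordinates can be pulled out in the same way. The sketch below uses a small stand-in dataset (`demo_ds` is a hypothetical name) built with the `(dims, data)` tuple form of `data_vars`, which is a shortcut not used elsewhere in this lecture:

```python
import numpy as np
import xarray as xr

# Small illustrative dataset: one variable on a lat/lon grid
demo_ds = xr.Dataset(
    data_vars={"Pressure": (["lat", "lon"], 1000 + 5 * np.random.randn(3, 4))},
    coords={"lat": np.linspace(25, 55, 3), "lon": np.linspace(-120, -60, 4)},
)

# Coordinates use the same dictionary-style access as data variables
lat_values = demo_ds["lat"].values

# .data_vars and .coords list the names stored in each group
var_names = list(demo_ds.data_vars)
coord_names = list(demo_ds.coords)
print(lat_values, var_names, coord_names)
```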

## Summary

We just learned how to create Xarray `DataArray`s and `Dataset`s.

Next class we will see how this is useful for manipulating environmental science data.