# CSS 120: Environmental Data Science

## Xarray Operations (slicing)

(based on Pythia Project)

In [None]:
## Install Packages
!pip install pythia_datasets
!pip install netCDF4 h5netcdf

# Today's Lecture

Global climate datasets can be very large with multiple variables, and DataArrays and Datasets are very useful tools for organizing, comparing and interpreting such data. 

However, sometimes we are not interested in examining a *global* dataset but wish to examine a specific time or location. 

We will explore multiple computational tools in `Xarray` that allow you to select data from a specific spatial and temporal range. In particular, you will practice using:

-   **`.sel()`:** select data based on coordinate values or date
-   **`.interp()`:** interpolate to any latitude/longitude location to extract data
-   **`slice()`:** select a range (or slice) along one or more coordinates by passing a Python `slice` object to `.sel()`

##  Some Environmental Science

### Solar Radiation and Earth's Energy Budget

![](l09img01.png)

### Solar Radiation and Earth's Energy Budget

![](l09img02.png)

### Solar Radiation and Earth's Energy Budget

![](l09img03.png)

### Solar Radiation and Earth's Energy Budget

![](l09img04.png)

### Solar Radiation and Earth's Energy Budget

![](l09img05.png)

# Packages

In [None]:
# imports
from datetime import timedelta
import numpy as np
import pandas as pd
import xarray as xr
from matplotlib import pyplot as plt
from pythia_datasets import DATASETS

## Subsetting and Selection by Coordinate Values

Since Xarray allows us to label coordinates, you can select data based on coordinate names and values, rather than array indices. 

We'll explore this briefly here.

In [None]:
# temperature data
rand_data = 283 + 5 * np.random.randn(5, 3, 4)
times_index = pd.date_range("2018-01-01", periods=5)
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)
temperature = xr.DataArray(
    rand_data, coords=[times_index, lats, lons], dims=["time", "lat", "lon"]
)
temperature.attrs["units"] = "kelvin"
temperature.attrs["standard_name"] = "air_temperature"

# pressure data
pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)
pressure = xr.DataArray(
    pressure_data, coords=[times_index, lats, lons], dims=["time", "lat", "lon"]
)
pressure.attrs["units"] = "hPa"
pressure.attrs["standard_name"] = "air_pressure"

# combine temperature and pressure DataArrays into a Dataset called 'ds'
ds = xr.Dataset(data_vars={"Temperature": temperature, "Pressure": pressure})
ds

## NumPy-like Selection

Suppose you want to extract all the spatial data for one single date: January 2, 2018. 

It's possible to achieve that with NumPy-like index selection.

In [None]:
# index 1 along axis 0 (time) is the time slice we want
indexed_selection = temperature[1, :, :]
indexed_selection

## Selection with `.sel()`

In Xarray, we can instead select data based on coordinate values using the `.sel()` method.

In [None]:
named_selection = temperature.sel(time = "2018-01-02")
named_selection

## Selection with `.sel()`

We got the same result as when we used the NumPy-like index selection, but

- we didn't have to know anything about how the array was created or stored
- our code is agnostic about how many dimensions we are dealing with
- the intended meaning of our code is much clearer!
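
A minimal sketch of the second point, assuming a small made-up array (the names `demo` and `reordered` are illustrative and not part of the lecture data). Once the dimensions are stored in a different order, the positional index `[1, :, :]` no longer refers to a date, while `.sel(time=...)` still returns the same data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array with the same dimension names as `temperature`
demo = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    coords={
        "time": pd.date_range("2018-01-01", periods=2),
        "lat": [25.0, 40.0, 55.0],
        "lon": [-120.0, -100.0, -80.0, -60.0],
    },
    dims=["time", "lat", "lon"],
)

# Same data, but with the dimensions stored in a different order
reordered = demo.transpose("lat", "lon", "time")

# Label-based selection does not depend on the storage order
by_label = demo.sel(time="2018-01-02")
by_label_reordered = reordered.sel(time="2018-01-02")
print(by_label.equals(by_label_reordered))

# Positional selection does: index 1 on axis 0 is now the second latitude
by_position = reordered[1, :, :]
print(by_position.dims)
```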

By using the `.sel()` method in Xarray, we can easily isolate data from a specific time. You can also isolate data from a specific coordinate.

**Your turn:** Write a line of code to select the temperature data at the coordinates `lat = 25`, `lon = -120`.

In [None]:
coordinate_selection = ...
coordinate_selection

## Approximate Selection and Interpolation

The spatial and temporal resolution of climate data often differs between datasets, and a dataset may be incomplete. 

Therefore, with time and space data, we frequently want to sample "near" the coordinate points in our dataset.

For example, we may want to analyze data from a specific coordinate or a specific time, but may not have a value from that specific location or date. 

In that case, we would want to use the data from the closest coordinate or time-step. Here are a few simple ways to achieve that.

### Nearest-neighbor Sampling

Suppose we want to know the temperature from `2018-01-07`. However, the last day on our `time` axis is `2018-01-05`. We can therefore sample within two days of our desired date of `2018-01-07`. 

We can do this using the `.sel` method we used earlier, but with the added flexibility of performing [nearest neighbor sampling](https://docs.xarray.dev/en/stable/user-guide/indexing.html#nearest-neighbor-lookups) and specifying an optional tolerance. 

This is called an **inexact lookup** because we are not searching for a perfect match, although there may be one. 

Here the **tolerance** is the maximum distance away from our desired point Xarray will search for a nearest neighbor.

In [None]:
temperature.sel(
    time = "2018-01-07", 
    method = "nearest", 
    tolerance = timedelta(days=2)
)
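
To see what the tolerance does, here is a small sketch on a made-up one-dimensional series (the name `series` and its values are assumptions for illustration). A tolerance of two days reaches `2018-01-05`, while a tolerance of one day finds no neighbor, and Xarray raises a `KeyError`.

```python
from datetime import timedelta

import numpy as np
import pandas as pd
import xarray as xr

# Illustrative series covering 2018-01-01 through 2018-01-05
series = xr.DataArray(
    np.arange(5.0),
    coords={"time": pd.date_range("2018-01-01", periods=5)},
    dims=["time"],
)

# Within tolerance: 2018-01-05 is exactly two days from the requested date
nearest = series.sel(
    time="2018-01-07", method="nearest", tolerance=timedelta(days=2)
)
print(nearest.time.values)

# Outside tolerance: no time step lies within one day, so the lookup fails
try:
    series.sel(time="2018-01-07", method="nearest", tolerance=timedelta(days=1))
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print("KeyError:", err)
```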

### Interpolation

The latitude values of our dataset are 25°N, 40°N, 55°N, and the longitude values are 120°W, 100°W, 80°W, 60°W.

But suppose we want to extract a timeseries for Boulder, Colorado, USA (40°N, 105°W). 

Since `lon=-105` is _not_ a point on our longitude axis, this requires interpolation between data points.

We can do this using the `.interp()` method (see the docs [here](http://xarray.pydata.org/en/stable/interpolation.html)), which works similarly to `.sel()`. 

Using `.interp()`, we can interpolate to any latitude/longitude location using an interpolation method of our choice. 

In the example below, you will linearly interpolate between known points (other methods: nearest, cubic, quadratic).

In [None]:
temperature.interp(
    lon = -105, 
    lat = 40, 
    method = "linear"
)
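
As a check on what linear interpolation means here, the sketch below compares `.interp()` with the same calculation done by hand. It uses a made-up field on the lecture's grid (the name `field` and its values are assumptions), and `.interp()` relies on SciPy being installed.

```python
import numpy as np
import xarray as xr

# Illustrative 2-D field on the lecture's grid (values are made up)
lats = np.linspace(25, 55, 3)     # 25, 40, 55
lons = np.linspace(-120, -60, 4)  # -120, -100, -80, -60
field = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    coords={"lat": lats, "lon": lons},
    dims=["lat", "lon"],
)

# Xarray's linear interpolation to Boulder's location
interpolated = field.interp(lat=40, lon=-105, method="linear").item()

# The same value by hand: -105 lies 15/20 = 0.75 of the way from -120 to -100
west = field.sel(lat=40, lon=-120).item()
east = field.sel(lat=40, lon=-100).item()
weight = (-105 - (-120)) / (-100 - (-120))
by_hand = (1 - weight) * west + weight * east

print(interpolated, by_hand)
```

Because 40°N is already a grid latitude, only the longitude direction contributes to the result.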

## Slicing Along Coordinates

Frequently we want to select a range (or _slice_) along one or more coordinate(s). 

For example, you may wish to only assess average annual temperatures in equatorial regions. 

We can achieve this by passing a Python [slice](https://docs.python.org/3/library/functions.html#slice) object to `.sel()`. 

The calling sequence for `slice` always looks like `slice(start, stop[, step])`, where `step` is optional. 

In this case, let's only look at values from January 1 to January 3, 2018, between 110°W and 70°W and between 25°N and 45°N:

In [None]:
temperature.sel(
    time = slice("2018-01-01", "2018-01-03"), 
    lon = slice(-110, -70), 
    lat = slice(25, 45)
)
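
Unlike positional slices, label-based slices include both endpoints when they match a coordinate value. A small sketch on a made-up array with the lecture's grid (the name `grid` and its zero values are assumptions) shows which labels survive the selection and how the optional `step` behaves.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array on the lecture's grid (values are made up)
grid = xr.DataArray(
    np.zeros((5, 3, 4)),
    coords={
        "time": pd.date_range("2018-01-01", periods=5),
        "lat": np.linspace(25, 55, 3),
        "lon": np.linspace(-120, -60, 4),
    },
    dims=["time", "lat", "lon"],
)

subset = grid.sel(
    time=slice("2018-01-01", "2018-01-03"),
    lon=slice(-110, -70),
    lat=slice(25, 45),
)

# Only labels that fall inside each range are kept
print(subset.time.dt.day.values)  # days 1, 2 and 3
print(subset.lat.values)          # 25 and 40
print(subset.lon.values)          # -100 and -80

# The optional step keeps every second label within the range
every_other_day = grid.sel(time=slice("2018-01-01", "2018-01-05", 2))
print(every_other_day.time.dt.day.values)  # days 1, 3 and 5
```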

## One More Selection Method: `.loc`

All of these operations can also be done within square brackets on the `.loc` attribute of the `DataArray`.

This is sort of in between the NumPy-style selection

```
temperature[1, :, :]
```

and the fully label-based selection using `.sel()`:

In [None]:
temperature.loc['2018-01-02']

## One More Selection Method: `.loc`

With `.loc`, we make use of the coordinate _values_, but lose the ability to specify the _names_ of the various dimensions. Instead, the slicing must be done in the correct order:

In [None]:
temperature.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

## One More Selection Method: `.loc`

What _doesn't_ work is passing the slices in an order that differs from the dimension order of the `DataArray` (here `time`, `lat`, `lon`):

```python
# This will generate an error
# temperature.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']
```

In [None]:
temperature.loc["2018-01-01":"2018-01-03", slice(25, 45), -110:-70]
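
To confirm that `.loc` and `.sel()` describe the same selection, the sketch below compares them on a made-up array with the lecture's grid (the name `grid` and its values are assumptions). It also uses the dictionary form of `.loc`, which names the dimensions and so does not depend on their order.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array on the lecture's grid (values are made up)
grid = xr.DataArray(
    np.arange(60.0).reshape(5, 3, 4),
    coords={
        "time": pd.date_range("2018-01-01", periods=5),
        "lat": np.linspace(25, 55, 3),
        "lon": np.linspace(-120, -60, 4),
    },
    dims=["time", "lat", "lon"],
)

# Positional order matters for .loc: time, then lat, then lon
via_loc = grid.loc["2018-01-01":"2018-01-03", 25:45, -110:-70]

# .sel names each dimension, so the keyword order is free
via_sel = grid.sel(
    lon=slice(-110, -70),
    lat=slice(25, 45),
    time=slice("2018-01-01", "2018-01-03"),
)

# .loc also accepts a dictionary keyed by dimension name
via_loc_dict = grid.loc[dict(lon=slice(-110, -70), lat=slice(25, 45))]

print(via_loc.identical(via_sel))
print(dict(via_loc_dict.sizes))
```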

## Opening and Plotting netCDF Data

##  Some Environmental Sciences

### Atmospheric Climate Systems

![](l09img06.png)

##  Some Environmental Sciences

### Atmospheric Climate Systems

![](l09img07.png)

##  Some Environmental Sciences

### Atmospheric Climate Systems

![](l09img08.png)

## Opening netCDF Data

Xarray is closely linked with the netCDF data model, and it even treats netCDF as a 'first-class' file format. 

This means that Xarray can easily open netCDF datasets. 

However, these datasets need to follow some of Xarray's rules. One such rule is that dimension coordinates (coordinates that share their name with a dimension) must be 1-dimensional.

In [None]:
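# fetch the sample file through pythia_datasets (downloaded and cached on first use)
filepath = DATASETS.fetch("NARR_19930313_0000.nc")
ds = xr.open_dataset(filepath)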
ds
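
If the NARR file is not at hand, the same open step can be tried on a tiny file written locally. This is only a sketch: the file name `tiny_example.nc`, the variable name `t2m`, and the values are made up, and it assumes a netCDF backend such as `netCDF4` or `h5netcdf` (installed at the top of this lecture) is available.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import xarray as xr

# Illustrative dataset with one variable on a (time, lat) grid
small = xr.Dataset(
    data_vars={"t2m": (("time", "lat"), 280.0 + np.arange(6.0).reshape(2, 3))},
    coords={
        "time": pd.date_range("2018-01-01", periods=2),
        "lat": [25.0, 40.0, 55.0],
    },
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tiny_example.nc")
    small.to_netcdf(path)
    # load() reads the values into memory before the file is closed
    with xr.open_dataset(path) as opened:
        reloaded = opened.load()

# Dimension sizes and variable names can be read off programmatically
print(dict(reloaded.sizes))
print(list(reloaded.data_vars))
```

The same two attributes, `sizes` and `data_vars`, are available on any `Dataset` returned by `xr.open_dataset()`.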

### Questions 1


1. What are the dimensions of this dataset?
1. How many climate variables are in this dataset?


In [None]:
# Your answers here