# Xarray intro

> *DS Python for GIS and Geoscience*  
> *September, 2023*
>
> *© 2023, Joris Van den Bossche and Stijn Van Hoey. Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

> *Adapted in 2024 for the course 'Computational Physics' by Wout Dewettinck*
---

In this notebook we will introduce the `xarray` package and demonstrate some important functionalities. We will do so by using some simulated temperature data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import cartopy
import cartopy.crs as ccrs

%matplotlib inline

In [None]:
# Change to directory where data is stored
data_dir = "../data"  # "/kyukon/data/gent/courses/2024/computphys_C004504/input"

## Introduction

First we load the data into the variable `temperature_data`:

In [None]:
ds = xr.open_dataset(
    f"{data_dir}/baseline_2021071100_144_tas_2021-07-11T01_2021-07-17T00_3600_regridded.nc",
    engine="netcdf4",
)
temperature_data = ds.tas
temperature_data

We can print information about this variable by simply running a cell with the variable name. Explore the data a bit by clicking on the buttons. The next cells will help you along.

In [None]:
temperature_data

The variable `temperature_data` is an `xarray.DataArray`. It is similar to a NumPy array, but additionally it contains coordinate information, labeled dimensions and attributes. We can see the shape of this DataArray by accessing `.shape`, just as in NumPy:

In [None]:
temperature_data.shape
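To make the labeled structure more concrete, the sketch below builds a small DataArray from scratch. The values, grid and attribute are made up for illustration and are not taken from the course data; only the dimension names mirror `temperature_data`, and the name `toy` is a hypothetical stand-in.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the course data: 4 hourly steps on a 3 x 5 grid.
times = pd.date_range("2021-07-11T01", periods=4, freq="h")
lats = np.array([50.0, 51.0, 52.0])
lons = np.array([3.0, 3.5, 4.0, 4.5, 5.0])

# Made-up temperatures around 288 K (not real data).
rng = np.random.default_rng(seed=0)
values = 288.0 + rng.normal(scale=2.0, size=(4, 3, 5))

toy = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),                        # labeled dimensions
    coords={"time": times, "lat": lats, "lon": lons},   # coordinate values
    name="tas",
    attrs={"units": "K"},                               # free-form metadata
)

print(toy.shape)  # (4, 3, 5)
print(toy.dims)   # ('time', 'lat', 'lon')
print(toy.attrs)  # {'units': 'K'}
```

The underlying NumPy array is still available as `toy.values`; the labels are a layer on top of it.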

We can find the coordinates of this DataArray by accessing `.coords`. The coordinates with an asterisk in front of them are also dimensions. Dimensions prescribe the shape of the array.

In [None]:
temperature_data.coords

We can see that this DataArray has three dimensions: time, lat and lon. We can find more information about, for example, the time dimension by running `temperature_data.time`. Do this for the other dimensions as well.

In [None]:
temperature_data.time

An `xarray.DataArray` contains attributes with information about the DataArray. These attributes can be accessed with the graphical interface when outputting the DataArray, or by accessing `.attrs`:

In [None]:
temperature_data.attrs

Not only the DataArray as a whole, but also the coordinates have attributes:

In [None]:
temperature_data.lon.attrs

The data type of this `xarray.DataArray` `temperature_data` is 'float32'. Xarray uses the data types provided by Numpy. More information on the data types NumPy supports is available in the [documentation](https://numpy.org/devdocs/user/basics.types.html#array-types-and-conversions-between-types).

Converting to another data type is supported by the `astype()` method:

In [None]:
temperature_data.astype("float64")  # .nbytes
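The effect of a type conversion on memory use can be checked with `.nbytes`. The sketch below uses a tiny made-up array (the name `small` is hypothetical) rather than the course data:

```python
import numpy as np
import xarray as xr

# Six float32 values take 4 bytes each; float64 values take 8 bytes each.
small = xr.DataArray(np.ones((2, 3), dtype="float32"), dims=("lat", "lon"))
converted = small.astype("float64")

print(small.dtype, small.nbytes)          # float32 24
print(converted.dtype, converted.nbytes)  # float64 48
```

Note that `astype()` returns a new DataArray; the original `small` keeps its data type.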

Using xarray:

- Data is stored as NumPy arrays.
- Dimensions have a name.
- The coordinates of each of the dimensions can represent geographical coordinates, categories, dates, ... instead of just an index.

**REMEMBER**:

The [`xarray` package](http://xarray.pydata.org/en/stable/) introduces __labels__ in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays. Xarray is inspired by and borrows heavily from Pandas.

## Selecting data

Xarray’s labels make working with multidimensional data much easier.

We could use the Numpy style of data slicing. We select here the first element of the first dimension, which corresponds with the time dimension in this case:

In [None]:
temperature_data[0]

However, it is often much more powerful to use xarray’s `.sel()` method to use label-based indexing, making the selection more declarative:

In [None]:
temperature_data.sel(time="2021-07-11T01")

We can select a specific set of coordinate values as a __list__ and take the values nearest to the given ones by setting `method="nearest"`:

In [None]:
temperature_data.sel(lat=[50.0, 51.0, 52.0], method="nearest").sel(
    time="2021-07-11T12"
).plot.line(hue="lat");

Sometimes, a specific range is required. The `.sel()` method also supports __slicing__, so we can select a single time and slice a subset of the data along the longitude direction:

In [None]:
temperature_data.sel(lon=slice(4, 5), time="2021-07-12T16").plot.imshow()
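The different selection styles can be compared on a small made-up grid where the outcome is easy to verify by hand. The array `grid` and its coordinate values are assumptions for illustration only:

```python
import numpy as np
import xarray as xr

# Values 0..11 on a 3 x 4 grid, so each cell is easy to identify.
grid = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [50.0, 51.0, 52.0], "lon": [3.0, 4.0, 5.0, 6.0]},
)

exact = grid.sel(lat=51.0, lon=4.0)                      # exact label match
nearest = grid.sel(lat=51.2, lon=3.9, method="nearest")  # closest labels
by_position = grid[1, 1]                                 # positional indexing
print(float(exact), float(nearest), float(by_position))  # 5.0 5.0 5.0

# Label-based slices include both end points.
subset = grid.sel(lon=slice(4.0, 5.0))
print(subset.lon.values)  # [4. 5.]
```

All three single-cell selections return the same value; the label-based ones do not require knowing where the cell sits in the array.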

The __positional indexing__ as you would do with the underlying Numpy array is still possible as well:

In [None]:
temperature_data[0, 30:60:3, 30:60:4]

Use a __condition__ to select data, also called fancy indexing or boolean indexing:

In [None]:
temperature_data > 288

However, with xarray we cannot use a mask like this to directly filter the array or assign new values. 

One typical use case for raster data is where you want to apply a mask to the data and set those values to some "NODATA" value. For plotting, this can for example be `np.nan`, and for this we can use the `where()` method:

In [None]:
temperature_data.where(temperature_data > 288).sel(time="2021-07-12T06").plot.imshow()
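What `where()` does to the values is easier to see on a tiny array. The following sketch uses four made-up temperatures; the `-9999.0` fill value is an arbitrary choice for illustration:

```python
import numpy as np
import xarray as xr

field = xr.DataArray(
    np.array([[286.0, 289.0], [290.0, 287.5]]),
    dims=("lat", "lon"),
)

mask = field > 288                          # boolean DataArray, same shape
masked = field.where(mask)                  # False positions become NaN
flagged = field.where(mask, other=-9999.0)  # or choose another NODATA value

print(mask.values.tolist())     # [[False, True], [True, False]]
print(int(masked.count()))      # 2 valid (non-NaN) values remain
print(flagged.values.tolist())  # [[-9999.0, 289.0], [290.0, -9999.0]]
```

The shape of the array is preserved in both cases, which is why the masked result can still be plotted as an image.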

## Plotting

We already used `.plot.imshow` and `.plot.line` in the previous section. `xarray` has a `plot` method, which can be used for different plot types.

It supports plots of both 1-dimensional data (e.g. line) and 2-dimensional data (e.g. imshow, pcolormesh). When just using `plot`, xarray makes a _best guess_ on how to plot the data. However, being explicit with `plot.line`, `plot.imshow`, `plot.pcolormesh`, `plot.scatter`, ... gives you more control.

In [None]:
temperature_data.sel(
    time="2021-07-11T16"
).plot();  # add .line() -> ValueError: For 2D inputs, please specify either hue, x or y.

In [None]:
temperature_data.sel(lat=51.0, lon=[3.5, 4.0, 4.5], method="nearest").plot.line(
    hue="lon"
);

Facetting splits the data into subplots according to a dimension, e.g. `time`:

In [None]:
temperature_data.sel(time=slice("2021-07-14T06", "2021-07-14T09")).plot.imshow(
    col="time"
);

Facetting also works for line plots:

In [None]:
temperature_data.sel(lon=4.2, method="nearest").sel(
    time=slice("2021-07-14T06", "2021-07-14T09")
).plot.line(col="time");

Use the `robust` option when there is a lack of visual difference. This will use the 2nd and 98th percentiles of the data to compute the color limits. The arrows on the color bar indicate that the colors include data points outside the bounds.

In [None]:
ax = temperature_data.sel(time="2021-07-11T12").plot(
    cmap="Reds", robust=True, figsize=(12, 5)
)  # use False as well
ax.axes.set_aspect("equal")

In case you want to show the output with a __discrete colormap__, one can define a [set of levels to split the colormap](http://xarray.pydata.org/en/stable/user-guide/plotting.html#discrete-colormaps) on:

In [None]:
ax = temperature_data.sel(time="2021-07-11T12").plot(
    cmap="Reds", levels=[288, 289, 290, 291, 292, 293, 294, 295], figsize=(12, 5)
)  # plot without these levels
ax.axes.set_aspect("equal")

For more control, define the Matplotlib `Figure` and `Axes` objects first, which makes further adjustments easier. One can pass an `Axes` object to an xarray `.plot` method through the `ax` parameter in order to link the output:

In [None]:
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(20, 4))

# first subplot a histogram of temperature at 2021-07-13T08
temperature_data.sel(time="2021-07-13T08").plot.hist(bins=30, ax=ax0)
ax0.set_title("Histogram of temperature")
# second subplot a line plot at given longitude
temperature_data.sel(lon=4.8, lat=[51.2, 51.3, 51.4], method="nearest").plot.line(
    hue="lat", ax=ax1
);

## Reductions, element-wise calculations and broadcasting

In [None]:
temperature_moment = temperature_data.sel(time="2021-07-14T08")
temperature_moment

### Reductions

The __reductions__ (aggregations) are provided as methods and can be applied along one or more of the data dimensions.

By default, the array is reduced over all dimensions, returning a single value as a DataArray:

In [None]:
temperature_moment.mean()

In NumPy, the dimensions are called __axes__:

In [None]:
temperature_moment.mean(axis=1)

But we have __dimensions with labels__, so rather than performing reductions on axes (as in NumPy), we can perform them on __dimensions__. This turns out to be convenient and declarative:

In [None]:
temperature_moment.mean(dim="lon")
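That the axis-based and the dimension-based reduction give the same result can be checked on a small made-up array (the name `moment` and its values are assumptions for illustration):

```python
import numpy as np
import xarray as xr

# Rows are latitudes, columns are longitudes: [[0, 1, 2], [3, 4, 5]].
moment = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [50.0, 51.0], "lon": [3.0, 4.0, 5.0]},
)

by_axis = moment.mean(axis=1)     # NumPy style: requires knowing the axis order
by_dim = moment.mean(dim="lon")   # xarray style: refers to the dimension by name

print(by_dim.dims)    # ('lat',)
print(by_dim.values)  # [1. 4.]
print(np.allclose(by_axis.values, by_dim.values))  # True
```

The named version keeps working if the array is transposed, whereas `axis=1` would then silently reduce over a different dimension.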

Calculate the mean values for each time separately:

In [None]:
temperature_data.mean(
    dim=["lon", "lat"]
)  # read as: 'take the mean over the dimensions lat and lon combined'

Or some quantiles:

In [None]:
temperature_data.quantile([0.1, 0.5, 0.9], dim=["lat", "lon"])

Other useful methods that operate along a dimension are `sum()`, `cumsum()` and `diff()`.

### Element-wise computations

Applying an operation __for each element__ is central to NumPy and Xarray. The typical answer in programming would be a `for`-loop, but NumPy is optimized to do these calculations __element-wise__ (i.e. for all elements together):

In [None]:
dummy = np.arange(1, 10)
dummy

In [None]:
dummy * 10

Instead of:

In [None]:
[el * 10 for el in dummy]

Numpy provides most of the familiar arithmetic operators to apply on an element-by-element basis:

In [None]:
np.exp(dummy), np.sin(dummy), dummy**2, np.log(dummy)

Xarray works seamlessly with those arithmetic operators and numpy array functions.

In [None]:
temperature_moment * 10.0

In [None]:
np.log(temperature_moment)

We can combine multiple xarray arrays in arithmetic operations:

In [None]:
temperature_data.sel(time="2021-07-16T15") - temperature_data.sel(time="2021-07-16T14")

### Broadcasting

When we combine arrays with different shapes during arithmetic operations, NumPy and Xarray apply a set of __broadcasting__ rules and the smaller array is _broadcast_ across the larger array so that they have compatible shapes.

Performing an operation on arrays with different coordinates will result in automatic broadcasting:

In [None]:
temperature_data.lon.shape, temperature_moment.shape

In [None]:
(
    temperature_moment + temperature_data.lon
)  # Note, this calculation does not make much sense, but illustrates broadcasting
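A difference with plain NumPy is that xarray broadcasts by dimension *name* rather than by position. The sketch below uses two made-up 1D profiles (hypothetical names `lat_profile` and `lon_profile`) to show this:

```python
import numpy as np
import xarray as xr

lat_profile = xr.DataArray([10.0, 20.0], dims="lat", coords={"lat": [50.0, 51.0]})
lon_profile = xr.DataArray(
    [1.0, 2.0, 3.0], dims="lon", coords={"lon": [3.0, 4.0, 5.0]}
)

# Different dimension names: the result spans both dimensions.
total = lat_profile + lon_profile
print(total.dims)             # ('lat', 'lon')
print(total.shape)            # (2, 3)
print(total.values.tolist())  # [[11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]

# The same shapes without labels cannot be broadcast by NumPy.
try:
    np.array([10.0, 20.0]) + np.array([1.0, 2.0, 3.0])
except ValueError as err:
    print("NumPy:", err)
```

With NumPy alone, one would first have to reshape one of the arrays (for example with `[:, np.newaxis]`) to get the same result.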

**REMEMBER**:

The combination of element-wise calculations, efficient reductions and broadcasting gives NumPy and Xarray a lot of power. In general, it is good advice to __avoid for loops__ when working with arrays.

### Let's practice!


#### EXERCISE 1:

We will practice the previous concepts with a small exercise. Using the `temperature_data` DataArray perform the following tasks:

* Select data for the period from midnight 14 July 2021 to midnight 15 July 2021, but every 6 hours. Make a facetted plot of the temperature where each plot represents a different time.
* Calculate the mean temperature over the entire period of the data and plot.
* Select the data from the point closest to campus Sterre. The coordinates are: 51.023 N, 3.710 E. Also select all points which are on the same meridian with a spacing of one degree. Plot the time course of the data for the different latitudes on the same plot.

# Xarray advanced

Acknowledgments to the data service: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land-monthly-means?tab=overview

## `xarray.Dataset` for multiple variables

We already know `xarray.DataArray`: it is a single multi-dimensional array in which each dimension can have a name and coordinate values. Next to the `DataArray`, `xarray` has a second main data structure to store arrays, i.e. `xarray.Dataset`.

Let's read an xarray data set (global rain/temperature coverage stored in the file `2016-2017_global_rain-temperature.nc`), using the function `open_dataset`:

In [None]:
ds = xr.open_dataset(
    f"{data_dir}/2016-2017_global_rain-temperature.nc", engine="netcdf4"
)
ds

Let's take a closer look at this `xarray.Dataset`:

- A `xarray.Dataset` is the second main data type provided by xarray
- This example has 3 __dimensions__:
    - `y`: the y coordinates of the data set
    - `x`: the x coordinates of the data set
    - `year`: the year coordinate of the data set
- Each of these dimensions is defined by a __coordinate__ (1D) array
- It has 2 __Data variables__: `precipitation` and `temperature` that both share the same coordinates

Hence, a `Dataset` object stores *multiple* arrays that have shared dimensions (__Note:__ not all dimensions need to be shared).  It is designed as an in-memory representation of the data model from the netCDF file format.

![](http://xarray.pydata.org/en/stable/_images/dataset-diagram.png)

In [None]:
ds["temperature"].shape, ds["temperature"].dims

The data and coordinate variables are also contained separately in the `data_vars` and `coords` dictionary-like attributes of an `xarray.Dataset`, so you can access them directly:

- The data variables:

In [None]:
ds.data_vars

- The data coordinates:

In [None]:
ds.coords

If you would rather use an alternative name for a given variable, use the `rename` method:

In [None]:
ds.rename({"precipitation": "rain"})

__Note:__ _the renaming is not entirely correct as rain is just a part of all precipitation (see https://en.wikipedia.org/wiki/Precipitation)._

Adding new variables to the data set is very similar to Pandas/GeoPandas:

In [None]:
ds["precipitation_m"] = ds["precipitation"] / 1000.0

In [None]:
ds
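For reference, a Dataset with the same layout can also be built by hand. The sketch below uses constant, made-up values and a hypothetical name `toy_ds`; it is not a reconstruction of the file contents:

```python
import numpy as np
import xarray as xr

dims = ("year", "y", "x")
toy_ds = xr.Dataset(
    data_vars={
        # Each data variable is given as (dimension names, values).
        "temperature": (dims, np.full((2, 2, 3), 15.0)),
        "precipitation": (dims, np.full((2, 2, 3), 800.0)),
    },
    coords={"year": [2016, 2017], "y": [0.0, 1.0], "x": [0.0, 1.0, 2.0]},
)

# Adding a derived variable works like adding a key to a dictionary.
toy_ds["precipitation_m"] = toy_ds["precipitation"] / 1000.0

print(list(toy_ds.data_vars))  # ['temperature', 'precipitation', 'precipitation_m']
print(toy_ds["precipitation_m"].shape)  # (2, 2, 3)
```

The new variable automatically shares the dimensions and coordinates of the variable it was derived from.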

### Selecting `Dataset` data

Each of the data variables can be accessed as a single `xarray.DataArray` similar to selecting from dictionaries or columns from DataFrames:

In [None]:
type(ds["precipitation"]), type(ds["temperature"])

One can select multiple variables at the same time as well by passing a list of variable names:

In [None]:
ds[["temperature", "precipitation"]]

Or the other way around, use `drop_vars` to drop variables from the data set:

In [None]:
ds.drop_vars("temperature")

**NOTE**:

Selecting a single variable using `[]` results in an `xarray.DataArray`, selecting multiple variables using a list `[[..., ...]]` results in an `xarray.Dataset`. Using `drop_vars` always returns an `xarray.Dataset`.

The selection with `sel` works as well with `xarray.Dataset`, selecting the data _for all variables in the Dataset_ and returning a Dataset:

In [None]:
ds.sel(year=2016)

The inverse of the `sel` method is `drop_sel`, which returns a Dataset with the listed labels removed:

In [None]:
ds.drop_sel(year=[2016])
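The return types of the different selection styles can be verified on a small made-up Dataset (the name `toy_ds` and all values are assumptions for illustration):

```python
import numpy as np
import xarray as xr

toy_ds = xr.Dataset(
    {
        "temperature": (("year", "x"), np.array([[10.0, 11.0], [12.0, 13.0]])),
        "precipitation": (("year", "x"), np.array([[700.0, 750.0], [800.0, 850.0]])),
    },
    coords={"year": [2016, 2017], "x": [0.0, 1.0]},
)

single = toy_ds["temperature"]                      # one name -> DataArray
several = toy_ds[["temperature", "precipitation"]]  # list of names -> Dataset
dropped = toy_ds.drop_vars("temperature")           # always a Dataset
print(type(single).__name__, type(several).__name__, type(dropped).__name__)
# DataArray Dataset Dataset

only_2016 = toy_ds.sel(year=2016)            # keeps 2016, drops the year dimension
without_2016 = toy_ds.drop_sel(year=[2016])  # keeps everything except 2016
print(dict(only_2016.sizes))     # {'x': 2}
print(without_2016.year.values)  # [2017]
```

Selecting a single label with `sel` removes the `year` dimension, whereas `drop_sel` keeps it with the remaining labels.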

### DataSet plotting

Plotting at the data set level is rather limited. A typical supported use case is a scatter plot to compare two data variables:

In [None]:
ds.plot.scatter(x="temperature", y="precipitation", s=1, alpha=0.1, edgecolor="none")

Facetting is also supported here, by linking the `col` or `row` parameter to a dimension such as `year`:

In [None]:
ds.plot.scatter(
    x="temperature", y="precipitation", s=1, alpha=0.1, col="year", edgecolor="none"
)  # try also hue="year" instead of col; requires hue_style="discrete"

### DataSet reductions

Datasets support arithmetic operations by automatically looping over all data variables and support most of the same methods found on `xarray.DataArray`:

In [None]:
ds.mean()

In [None]:
ds.max(dim=["x", "y"])

__Note:__ Using the names of the data variables (these are actually element-wise operations on DataArrays) makes a calculation very self-describing, e.g.

In [None]:
(ds["temperature"] * ds["precipitation"]).sel(year=2016).plot.imshow()

### Let's practice

For the next set of exercises, we use the [ERA5-Land monthly averaged data](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land-monthly-means?tab=overview), provided by the ECMWF (European Centre for Medium-Range Weather Forecasts).

> ERA5-Land is a reanalysis dataset providing a consistent view of the evolution of land variables over several decades. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. 

For these exercises, a subset of the data set focusing on Belgium has been prepared, containing the following variables:

- `sf`: Snowfall (_m of water equivalent_)
- `sp`: Surface pressure (_Pa_)
- `t2m`: 2 metre temperature (_K_)
- `tp`: Total precipitation (_m_)
- `u10`: 10 metre U wind component (_m/s_)

The dimensions are the `longitude`, `latitude` and `time`, which are each represented by a corresponding coordinate.

In [None]:
era5 = xr.open_dataset(
    f"{data_dir}/era5-land-monthly-means_example.nc"
)  # engine="netcdf4" is optional here
era5

To get a feeling for the spatial extent of the data set for these exercises, the following code creates a cartopy map of the average temperature with country borders added:

In [None]:
fig = plt.figure(figsize=(9, 6))
ax = plt.axes(projection=ccrs.PlateCarree())

ax.add_feature(cartopy.feature.BORDERS, linestyle=":")
era5_mean_temp = era5["t2m"].mean(dim="time") - 273.15
era5_mean_temp.plot.imshow(
    ax=ax, cmap="coolwarm", transform=ccrs.PlateCarree(), cbar_kwargs={"shrink": 0.65}
)
ax.coastlines();

(See notebook on Cartopy for more information)

#### EXERCISE 2:

The [short names used by ECMWF](https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation) are not very descriptive. Rename the variables of the data set according to the following mapping:
    
- `sf`: snowfall_m
- `sp`: pressure_pa
- `t2m`: temperature_k
- `tp`: precipitation_m
- `u10`: wind_ms   
    
Save the result of the mapping as the variable `era5_renamed`.

<details><summary>Hints</summary>

* Both `rename` and `rename_vars` can be used to rename the DataSet variables
* The `rename` function requires a `dict-like` input with the current names as the keys and the new names as the values. 

</details>

In [None]:
mapping = {
    "sf": "snowfall_m",
    "sp": "pressure_pa",
    "t2m": "temperature_k",
    "tp": "precipitation_m",
    "u10": "wind_ms",
}

__Note:__ _Make sure you have the variable `era5_renamed` correctly loaded for the following exercises. If not, load the solution of the previous exercise._

#### EXERCISE 3:

Start from the `era5_renamed` variable. You are probably used to working with temperatures in degrees Celsius instead of Kelvin. Add a new data variable to `era5_renamed`, named `temperature_c`, by converting `temperature_k` into degrees Celsius:
    
$T_{^{\circ}C} = T_{K} - 273.15$
    
Create a histogram of `temperature_c` to check the distribution of all the temperature values in the data set. Use an appropriate number of bins to draw the histogram.

<details><summary>Hints</summary>

* Xarray - similar to Numpy - applies the mathematical operation element-wise, so no need for loops.
* Most plot functions work on `DataArray`, so make sure to select the variable to apply the `.plot.hist()`.
* One can define the number of bins using the `bins` parameter in the `hist` method.

</details>    

#### EXERCISE 4:

Calculate the total snowfall of the entire region of the dataset as a function of time and create a line plot showing total snowfall on the y-axis and time on the x-axis.

<details><summary>Hints</summary>

* You need to calculate the total (`sum`) snowfall (`snowfall_m`) as a function of time, i.e. aggregate over both the `longitude` and `latitude` dimensions (`dim=["latitude", "longitude"]`).
* As the result is a DataArray with a single dimension, the default `plot` will show a line, but you can be more explicit by saying `.plot.line()`.

</details>

#### EXERCISE 5:

The speed of sound is linearly dependent on temperature:
    
$v = 331.5 + (0.6 \cdot T)$
    
with $v$ the speed of sound and $T$ the temperature in degrees celsius.    
    
Add a new variable to the `era5_renamed` data set, called `speed_of_sound_m_s`, that calculates for each location and each time stamp in the data set the temperature corrected speed of sound.
    
Create a scatter plot to check the (linear) relationship you just calculated by comparing all the `speed_of_sound_m_s` and `temperature_c` data points in the data set.

<details><summary>Hints</summary>

* Creating a new variable is similar to GeoPandas/Pandas/dictionaries, `ds["MY_NEW_VAR"] = ...`.
* Remember, calculations are __element-wise__ just as in NumPy/Pandas, so there is no need for loops. The numbers (331.5, 0.6) are broadcast to all elements in the data set to do the calculation.
* To compare two variables in a data set visually, `plot.scatter()` it is.

</details>

## Working with time series

Let's start again from the ERA5 data set we worked with in the previous exercises, and rename the variables for convenience:

In [None]:
era5 = xr.open_dataset(f"{data_dir}/era5-land-monthly-means_example.nc")
mapping = {
    "sf": "snowfall_m",
    "sp": "pressure_pa",
    "t2m": "temperature_k",
    "tp": "precipitation_m",
    "u10": "wind_ms",
}
era5_renamed = era5.rename(mapping)
era5_renamed

Apart from the spatial coordinates, the data set also contains a `time` dimension. Xarray borrows the indexing machinery from Pandas, also for datetime coordinates:

In [None]:
era5_renamed.time

With a DateTime-aware index (see `datetime64[ns]` as dtype), selecting dates can be done using the string representation, e.g.

In [None]:
era5_renamed.sel(time="2002").time

In [None]:
era5_renamed.sel(time=slice("2001-05", "2002-08")).time

And you can access the datetime components, e.g. "year", "month",..., "dayofyear", "week", "dayofweek",... but also "season". Do not forget to use the `.dt`-accessor to access these components:

In [None]:
era5_renamed["time"].dt.season
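The string-based selection and the `.dt` accessor can be tried on a made-up monthly series. The sketch below assumes two years of monthly time stamps starting in January 2001 (not the ERA5 period) and uses the hypothetical name `series`:

```python
import numpy as np
import pandas as pd
import xarray as xr

# 24 month-start time stamps: January 2001 up to December 2002.
time = pd.date_range("2001-01-01", periods=24, freq="MS")
series = xr.DataArray(np.arange(24), dims="time", coords={"time": time})

# A year as a string selects all time stamps within that year.
print(series.sel(time="2002").sizes["time"])  # 12

# Slices of date strings include both end points (May 2001 .. August 2002).
print(series.sel(time=slice("2001-05", "2002-08")).sizes["time"])  # 16

# Datetime components are available through the .dt accessor.
print(series["time"].dt.month.values[:3])   # [1 2 3]
print(series["time"].dt.season.values[:3])  # ['DJF' 'DJF' 'MAM']
```

The season labels are built from the first letters of the months they contain, e.g. `DJF` for December, January and February.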

Xarray contains some more powerful functionalities to work with time series, e.g. `groupby`, `resample` and `rolling`. All have a similar syntax as Pandas/GeoPandas, but applied on an N-dimensional array instead of a DataFrame.

### split-apply-combine, aka `groupby`

If we are interested in the _average over time_ for each of the pixels, we can use a reduction function to get the averages of each of the variables at the same time:

In [None]:
era5_renamed.mean(dim=["time"])

But if we wanted the _average for each month of the year_ per pixel, we would first have to __split__ the data set into a group for each month of the year, __apply__ the average function to each of the months and __combine__ the data again.

More information about this approach can be found here: [split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). The syntax of Xarray’s groupby is almost identical to Pandas!

First, extract the month of the year (1 -> 12) from the time coordinate:

In [None]:
era5_renamed["time"].dt.month  # The time coordinate is a Pandas datetime index

We can use this array in a `groupby` operation:

In [None]:
era5_renamed.groupby(era5_renamed["time"].dt.month)

_split the data in (12) groups where each element is grouped according to the month it belongs to._

__Note:__ Xarray also offers a more concise syntax when the variable you're grouping on is already present in the dataset. The following statement is identical to the previous line:

In [None]:
era5_renamed.groupby("time.month")

Next, we apply an aggregation function _for each of the months_ over the `time` dimension in order to end up with: _for each month of the year, the average (over time) for each of the pixels_:

In [None]:
era5_renamed.groupby("time.month").mean()

The resulting dimension `month` contains the 12 groups into which the data set was split.
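The split-apply-combine steps can be followed by hand on a made-up series in which every value encodes its month. This is only a sketch: the name `series` and the values are assumptions chosen so that the expected means are easy to verify.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of monthly values: the month number (1..12) in the first year,
# and the month number plus 100 in the second year.
time = pd.date_range("2001-01-01", periods=24, freq="MS")
values = np.tile(np.arange(1.0, 13.0), 2) + np.repeat([0.0, 100.0], 12)
series = xr.DataArray(values, dims="time", coords={"time": time}, name="toy")

# Split by month, apply the mean to each group, combine into one array.
monthly_mean = series.groupby("time.month").mean()
print(monthly_mean.dims)        # ('month',)
print(monthly_mean.values[:3])  # [51. 52. 53.]

# The same result for a single group, done manually for January.
january = series.where(series["time"].dt.month == 1, drop=True)
print(january.values)           # [  1. 101.]
print(float(january.mean()))    # 51.0
```

The `groupby` call performs the manual January step for all 12 months at once and stacks the results along the new `month` dimension.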

**REMEMBER**:

The `groupby` method splits the data set in groups, applies some function _on each of the groups_ and combines again the results of each of the groups. It is not limited to time series data, but can be used in any situation where the data can be split up by a categorical variable. 

### resample/rolling

Another, similar operation - specifically for time series data - is to `resample` the data to another time-aggregation. For example, resample to four-monthly (`4ME`) or yearly (`1YE`) median values:

In [None]:
era5_renamed.resample(time="1YE").median()  # or "4ME"

In [None]:
era5_renamed["temperature_k"].sel(
    latitude=51.0, longitude=4.0, method="nearest"
).plot.line(x="time")
era5_renamed["temperature_k"].sel(
    latitude=51.0, longitude=4.0, method="nearest"
).resample(time="10YE").median().plot.line(x="time");

A related but different functionality is `rolling`, which calculates rolling window aggregates:

In [None]:
era5_renamed.rolling(time=12, center=True).median()

In [None]:
era5_renamed["temperature_k"].sel(
    latitude=51.0, longitude=4.0, method="nearest"
).plot.line()
era5_renamed["temperature_k"].sel(
    latitude=51.0, longitude=4.0, method="nearest"
).rolling(time=12, center=True).min().plot.line()
era5_renamed["temperature_k"].sel(
    latitude=51.0, longitude=4.0, method="nearest"
).rolling(time=12, center=True).max().plot.line()
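How a centered rolling window treats the edges of a series is easiest to see on a handful of values. The sketch below uses five made-up numbers and a window of 3 instead of 12; the name `signal` is hypothetical:

```python
import numpy as np
import xarray as xr

signal = xr.DataArray(np.array([1.0, 2.0, 6.0, 4.0, 5.0]), dims="time")

# Centered window of 3: each value is averaged with its two neighbours.
# At the edges the window is incomplete, which gives NaN by default.
smoothed = signal.rolling(time=3, center=True).mean()
print(smoothed.values)  # [nan  3.  4.  5. nan]

# min_periods=1 accepts incomplete windows at the edges.
filled = signal.rolling(time=3, center=True, min_periods=1).mean()
print(filled.values)    # [1.5 3.  4.  5.  4.5]
```

Unlike `resample`, a rolling aggregation keeps the original time stamps; only the values are replaced by window statistics.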

**REMEMBER**:

The [xarray `groupby`](http://xarray.pydata.org/en/stable/groupby.html) with the same syntax as Pandas implements the __split-apply-combine__ strategy. Also [`resample`](http://xarray.pydata.org/en/stable/time-series.html#resampling-and-grouped-operations) and [`rolling`](http://xarray.pydata.org/en/stable/computation.html?highlight=rolling#rolling-window-operations) are available in xarray.
    
__Note:__ Xarray adds a [`groupby_bins`](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.groupby_bins.html#xarray.Dataset.groupby_bins) convenience function for binned groups (instead of each value).

### Let's practice

Run this cell before doing the exercises, so the starting point is again the `era5_renamed` data set:

In [None]:
era5 = xr.open_dataset(f"{data_dir}/era5-land-monthly-means_example.nc")
mapping = {
    "sf": "snowfall_m",
    "sp": "pressure_pa",
    "t2m": "temperature_k",
    "tp": "precipitation_m",
    "u10": "wind_ms",
}
era5_renamed = era5.rename(mapping)

#### EXERCISE 6:

Select the pressure data for the pixel closest to the center of Ghent (lat: 51.05, lon: 3.71) and assign the outcome to a new variable `ghent_pressure`.

Define a Matplotlib `Figure` and `Axes` (respectively named `fig, ax`) and use it to create a plot that combines the yearly average of the pressure data in Ghent with the monthly pressure data as a function of time for that same pixel, both as line plots. Change the y-label to `'Pressure (Pa)'` and the title of the plot to `'Pressure (Pa) in Ghent (at 51.05, 3.71)'` (see the notebook on Matplotlib for more information).

<details><summary>Hints</summary>
    
* For the yearly average, both `resample(time="YE")` and `groupby("time.year")` can be used. The main difference is the returned dimension after the grouping: `resample` returns a DatetimeIndex, whereas `groupby` returns a dimension called `year` and no longer contains the DatetimeIndex. To combine the data with other time series data, using the DatetimeIndex is preferred (i.e. `resample`).
* `fig, ax = plt.subplots()` is a useful shortcut to prepare a Matplotlib Figure. You can add a y-label with `set_ylabel()` and a title with `set_title()` on the `Axes` object.

</details>

#### EXERCISE 7:
    
Select the precipitation data for the pixel closest to the center of Ghent (lat: 51.05, lon: 3.71) and assign the outcome to a new variable `ghent_precipitation`.
    
For the Ghent pixel, calculate the maximal precipitation _for each month of the year_ (1 -> 12) and convert it to mm precipitation.
    
To make a bar chart (not supported in xarray) of the maximal precipitation for each month of the year, convert the output to a Pandas DataFrame using the `to_dataframe()` method and create a horizontal bar chart with the month on the y-axis and the maximal precipitation on the x-axis. Feel free to improve the axis labels.

<details><summary>Hints</summary>
    
* You need to group based on the month of the data point, hence `groupby("...")`. With `time` as an existing dimension, the `time.month` shortcut to get the corresponding month will work.
* Plotting in Pandas is very similar to xarray, use `.plot.barh()` to create a horizontal bar chart.

</details>    

#### EXERCISE 8:
    
Calculate the pixel-based average temperature _for each season_. Make a plot (`imshow`) with each of the seasons in a separate subplot next to each other. 

<details><summary>Hints</summary>
    
* You need to group based on the season of the data point, hence `groupby("...")`. With `time` as an existing dimension, the `time.season` shortcut to get the corresponding season will work.
* The labels of the season groups are sorted alphabetically on their string labels instead of in the order of the seasons. The workaround is to use `sortby` and provide a correctly sorted version to sort the `season` with, see the open issue https://github.com/pydata/xarray/issues/757.
* Use facetting to plot each of the seasons next to each other, with `col="season"`.


</details>

#### EXERCISE 9:

Calculate the average temperature of the entire region of the dataset as a function of time. From this time series, use only those years for which all 12 months of the year are included in the data set and calculate the yearly average temperature.

Create a line plot showing the yearly average temperature on the y-axis and time on the x-axis.

<details><summary>Hints</summary>

* You need to calculate the average (`mean`) temperature as a function of time, i.e. aggregate over both the `longitude` and `latitude` dimensions.
* Year 2021 is only available till June, so exclude the year (e.g. slice till 2020).
* Resample to yearly values.
* As the result is a DataArray with a single dimension, the default `plot` will show a line, but you can be more explicit by saying `.plot.line()`.

</details>

#### EXERCISE 10:
    
Create the yearly total snowfall from 1991 up to 2005 and convert the snowfall into cm.  Make a plot (`imshow`) with each of the individual years in a separate subplot divided into 3 rows and 5 columns.
    
Make sure to update the name of the snowfall variable and/or colorbar label to make sure it defines the unit in cm.
 

<details><summary>Hints</summary>

* When selecting time series data from a coordinate with datetime-aware data, one can use strings to define a date. In combination with `slice`, the selection of the required years becomes `slice("1991", "2005")`.
* Going from monthly to yearly data is a `resample` of the data.
* Use `.rename(NEW_NAME)` to update the name of a `DataArray`.
* xarray supports _facetting_ directly, check out the `col` and `col_wrap` parameters in the plot functions of xarray or check http://xarray.pydata.org/en/stable/user-guide/plotting.html#faceting.
* To update the colorbar unit, use the `cbar_kwargs` option.

</details>

## xarray lazy data loading

Values are only read from disk when needed. For example, the following statement only reads the coordinate information and the metadata. The data itself is not yet loaded:

In [None]:
data_file = f"{data_dir}/2016-2017_global_rain-temperature.nc"

In [None]:
ds = xr.open_dataset(data_file)
ds

`load()` will explicitly load the data into memory:

In [None]:
xr.open_dataset(data_file).load()

## Combining datasets

Lastly, `xarray` contains some functions which allow you to combine data from different DataArrays/Datasets in several ways. More information can be found [here](https://docs.xarray.dev/en/stable/user-guide/combining.html).

If you want to combine datasets or data arrays along a single dimension, you can use the `concat`-function. This will create a new dataset/data array where the data is combined along this dimension. We select the temperature data at two times:


In [None]:
temperature_data1 = temperature_data.sel(time="2021-07-11T12")
temperature_data2 = temperature_data.sel(time="2021-07-12T12")

We can combine these two data arrays along the `time` dimension:

In [None]:
temperature_data_combined = xr.concat(
    [temperature_data1, temperature_data2], dim="time"
)
temperature_data_combined

Another possibility would have been to first assign a new coordinate to the two data arrays and then combine them along this new coordinate. For this, we use the `assign_coords`-function:

In [None]:
temperature_data1_new_coord = temperature_data1.assign_coords(new_coord=10)
temperature_data2_new_coord = temperature_data2.assign_coords(new_coord=20)

In [None]:
temperature_data_combined = xr.concat(
    [temperature_data1_new_coord, temperature_data2_new_coord], dim="new_coord"
)
temperature_data_combined

Notice how `xarray` also automatically combines the `time`-coordinates along the new `new_coord`-dimension!
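Both ways of concatenating can be reproduced without the course data. The sketch below assumes a tiny made-up cube (hypothetical names `cube` and `run`) and only inspects the resulting dimensions:

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-07-11T12", periods=2, freq="D")
cube = xr.DataArray(
    np.arange(8.0).reshape(2, 2, 2),
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": [50.0, 51.0], "lon": [3.0, 4.0]},
)
first = cube.isel(time=0)   # 2D slice that keeps `time` as a scalar coordinate
second = cube.isel(time=1)

# Concatenating along the existing scalar coordinate restores the dimension.
along_time = xr.concat([first, second], dim="time")
print(along_time.dims)           # ('time', 'lat', 'lon')
print(along_time.sizes["time"])  # 2

# Alternatively, tag each slice with a new coordinate and concatenate on that.
first_tagged = first.assign_coords(run=10)
second_tagged = second.assign_coords(run=20)
along_run = xr.concat([first_tagged, second_tagged], dim="run")
print(along_run.dims)                 # ('run', 'lat', 'lon')
print(along_run["run"].values)        # [10 20]
```

In both cases the new dimension is placed first, in front of the dimensions the slices already had.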

If you want to combine several data arrays or datasets into one dataset, you can use the `merge`-function:

In [None]:
t2m = era5.t2m  # t2m is a data array
u10 = era5.u10  # u10 is a data array

We combine the data arrays `t2m` and `u10` into a new dataset:

In [None]:
era5_combined = xr.merge([t2m, u10])
era5_combined
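Merging relies on the *names* of the data arrays, which become the variable names of the resulting dataset. The following sketch uses two made-up arrays on a shared grid (the names `t2m_toy` and `u10_toy` and all values are assumptions for illustration):

```python
import numpy as np
import xarray as xr

coords = {"latitude": [50.0, 51.0], "longitude": [3.0, 4.0, 5.0]}
dims = ("latitude", "longitude")

# Each DataArray needs a name; merge uses it as the data variable name.
t2m_toy = xr.DataArray(np.full((2, 3), 283.0), dims=dims, coords=coords, name="t2m")
u10_toy = xr.DataArray(np.full((2, 3), 2.5), dims=dims, coords=coords, name="u10")

merged = xr.merge([t2m_toy, u10_toy])
print(type(merged).__name__)     # Dataset
print(sorted(merged.data_vars))  # ['t2m', 'u10']
```

Arrays extracted from an existing dataset, as in the cell above, already carry their variable name, so no extra naming step is needed there.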