# Working with Gridded netCDF data and xarray

**This lesson is based on the [_Lesson: working with netCDF data_](https://fabienmaussion.info/climate_system/week_02/01_Lesson_NetCDF_Data.html) in [Fabien Maussion](https://fabienmaussion.info/)'s Physics of the Climate System Course.** 

_These lecture notes and exercises are licensed under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license._ 

You have already learned how to use the basic features of the Python language together with the NumPy and Matplotlib libraries. The purpose of this lesson is to introduce the main tool that you will use this semester: [xarray](http://xarray.pydata.org).

This is a dense lesson. Please work through all of it and try to remember its structure and content. The code here will serve as a template for your own code, and you can always come back to these examples when you need them. I don't expect you to understand every detail, but I hope that you will get acquainted with the "xarray way" of manipulating multi-dimensional data. You will have to copy and adapt parts of the code below to complete the exercises.


## NetCDF Files

To open and plot NetCDF files, you need the `xarray`, `cartopy`, and `netcdf4` packages. If you haven't done so already, follow the [installation instructions](https://isat-drg.github.io/ISAT_420/EnvironmentalData/4_Environments/condaEnvs.html) for our ISAT420 Python environment, which contains these packages.

As a quick fix, you can also install them directly using the code below (this will take some time). 


In [None]:
# To install these packages, remove the hash (#) characters in the lines below and run the cell. The ! tells Jupyter to run a system command.
#! conda install xarray
#! conda install netcdf4
#! conda install cartopy 

### Imports and options

First, let's import the tools we need. Do you remember why we need to import our tools? If not, ask Dr. Gerken.

In [None]:
# Import the tools we are going to need today:
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # numerical library
import xarray as xr  # labelled multi-dimensional arrays, reads netCDF files
import cartopy  # Map projections library
import cartopy.crs as ccrs  # Projections list
# Some defaults:
plt.rcParams['figure.figsize'] = (12, 5)  # Default plot size

### The Data 

The data we are going to use today is from the [CERES](https://climatedataguide.ucar.edu/climate-data/ceres-ebaf-clouds-and-earths-radiant-energy-systems-ceres-energy-balanced-and-filled) (Clouds and the Earth's Radiant Energy System) mission. We are going to use the EBAF-TOA and the EBAF-Surface data products (both freely available [on this webpage](https://ceres.larc.nasa.gov/data/)) as climatologies (i.e. monthly averages 2005-2015). 

The data quality summary of these data (PDF) can be found [here](https://ceres.larc.nasa.gov/documents/DQ_summaries/CERES_EBAF_Ed4.1_DQS.pdf), and more accessible publications can be found [here for TOA](https://journals.ametsoc.org/doi/pdf/10.1175/JCLI-D-17-0208.1) and [here for Surface](https://journals.ametsoc.org/doi/pdf/10.1175/JCLI-D-17-0523.1).

We will also be using an example of [ERA5 Reanalysis](https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5) data.

ERA5 (or European ReAnalysis v5) provides global, hourly estimates of atmospheric, ocean wave, and land-surface variables at a horizontal resolution of 31 km. Data are available from 1940 onwards, both hourly and as monthly averages.

In general, a reanalysis fuses observations with a global weather model to derive a homogeneous best estimate on a regular grid from irregularly distributed (e.g. station-based) observations.

ERA5 is produced by the [European Centre for Medium-Range Weather Forecasts (ECMWF)](https://www.ecmwf.int/) and can be downloaded freely (account registration required).

I have placed the data files into the `W6_Gridded_Reanalysis/data` directory. 

### Read the data

Most of today's meteorological data is stored in the NetCDF format (``*.nc``). NetCDF files are binary files, which means that you can't just open them in a text editor: you need a special reader for them. Nearly all programming languages offer an interface to NetCDF. For this course we are going to use the [xarray](http://xarray.pydata.org/en/stable/) library to read the data.

Let's start by having a look at the ERA5 file that I am providing:



In [None]:
# The file is in the "data" folder, which sits next to this notebook's folder (hence the relative path "../data")
# The variable name "ds" stands for "dataset"
ds = xr.open_dataset(r'../data/reanalysis-era5-single-level-monthly-means_2000_T_Td_u_v_SST_P.nc', engine='netcdf4')

In [None]:
# Let's see what we have:
ds

Each NetCDF file follows a data model, which xarray represents like this:

![xarray data model](http://xarray.pydata.org/en/stable/_images/dataset-diagram.png)

A NetCDF dataset is made up of four kinds of elements: dimensions, coordinates, variables, and attributes:

- the dimensions specify the number of elements along each data coordinate; their names should be understandable and specific
- the attributes provide information about the file (metadata)
- the variables contain the actual data. In our file there are five variables. All have the dimensions [time, latitude, longitude], so we can expect arrays of size [12, 721, 1440]
- the coordinates locate the data in space and time
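
To make these four elements concrete, here is a minimal sketch that builds a tiny dataset by hand. The name `toy`, the coarse 3 x 4 grid, the random values, and the attribute texts are all assumptions made for illustration; none of them come from the ERA5 file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates: 2 months, 3 latitudes, 4 longitudes
time = pd.date_range("2000-01-01", periods=2, freq="MS")
latitude = np.array([10.0, 0.0, -10.0])
longitude = np.array([0.0, 90.0, 180.0, 270.0])

# Made-up data values (in kelvin), shaped like the dimensions above
rng = np.random.default_rng(0)
values = 290.0 + rng.random((2, 3, 4))

toy = xr.Dataset(
    data_vars={
        # (dimension names, data, variable attributes)
        "sst": (
            ("time", "latitude", "longitude"),
            values,
            {"units": "K", "long_name": "Sea surface temperature"},
        ),
    },
    coords={"time": time, "latitude": latitude, "longitude": longitude},
    attrs={"title": "Toy dataset for illustration"},
)

print(dict(toy.sizes))         # dimensions and their lengths
print(list(toy.coords))        # coordinate names
print(list(toy.data_vars))     # variable names
print(toy.sst.attrs["units"])  # attribute of a variable
print(toy.attrs["title"])      # attribute of the whole dataset
```

Displaying `toy` in a notebook cell should give the same kind of overview as `ds`, only much smaller.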



### Coordinates 

Let's have a look at the **time** coordinate first:

(You can see that this is a similar notation to _pandas_)

In [None]:
ds['time']

The array contains 12 datetimes, one for each month of the year. According to the ERA5 documentation, each of them represents the average for that month; in this file, the year is 2000.

The **location coordinates** are also self-explanatory:

In [None]:
ds.longitude  # Alternate notation for ds['longitude']; works in pandas and xarray as long as the name contains no spaces

In [None]:
ds.latitude

**Q: what is the spatial resolution of the ERA5 data?**

### Variables 


Variables can also be accessed directly from the dataset:

In [None]:
ds.sst

The **attributes** of a variable are extremely important: they carry the metadata and must be specified by the data provider. Here we can read in which units the variable is defined, as well as a description of the variable (the "long_name" attribute).


**Q: what other information can we read from this printout? Explore the other data variables and see if you understand all of them.** *Note: you can expand each variable's attributes in the html display, or use the method `ds.info()` to list all vars and attributes.*

In [None]:
# your answer here

## First Plots

Let's create a first set of plots. For example, we can quickly produce a map of SST in June.

To do so, we have to select data based on the time coordinate using the `.sel()` method. The resulting slice is a 2-D grid with latitude and longitude. 

In [None]:
sst_jun = ds.sst.sel(time="2000-06-01")
sst_jun
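
The difference between label-based selection with `.sel()` and position-based selection with `.isel()` can be sketched on a small synthetic array. The array `toy_sst`, its grid, and its values are made up for illustration; the same calls should work on `ds.sst`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly data for the year 2000 on a coarse 3 x 4 grid
time = pd.date_range("2000-01-01", periods=12, freq="MS")
toy_sst = xr.DataArray(
    np.arange(12 * 3 * 4, dtype=float).reshape(12, 3, 4),
    dims=("time", "latitude", "longitude"),
    coords={
        "time": time,
        "latitude": [10.0, 0.0, -10.0],
        "longitude": [0.0, 90.0, 180.0, 270.0],
    },
    name="sst",
)

# Select by label: the time step whose coordinate value is 1 June 2000
june_by_label = toy_sst.sel(time="2000-06-01")

# Select by position: the sixth time step (index 5)
june_by_index = toy_sst.isel(time=5)

# Select the grid point closest to a location that is not exactly on the grid
point = toy_sst.sel(latitude=2.0, longitude=100.0, method="nearest")

print(june_by_label.dims)  # time is no longer a dimension
print(june_by_label.equals(june_by_index))
print(float(point.latitude), float(point.longitude))
```

Selecting by label is usually the safer choice, because it keeps working if the file starts in a different month or uses a different grid.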

We can now make a plot:

In [None]:
# Define the map projection (what the map looks like)
ax = plt.axes(projection=ccrs.EqualEarth())
# ax is an empty plot. We now plot the variable sst_jun onto ax
sst_jun.plot(ax=ax, transform=ccrs.PlateCarree()) 
# the keyword "transform" tells the function in which projection the data is stored 
ax.coastlines(); ax.gridlines(); # Add coastlines and gridlines to the plot

## Simple Analysis 

Analysing climate data is straightforward in Python thanks to the [xarray](http://xarray.pydata.org/en/stable/) and [cartopy](https://scitools.org.uk/cartopy/docs/latest/) libraries. First we are going to compute the time average of the SST over the year.

Similar to _pandas_, we can perform computations over our data, like taking the average with `.mean()`. We just need to specify the dimension (`'time'`, `'latitude'`, `'longitude'`).

In [None]:
sst_avg = ds.sst.mean(dim='time')

What is `sst_avg` by the way?

In [None]:
sst_avg

So `sst_avg` is a 2-dimensional array of dimensions [latitude, longitude] (note that the time dimension has disappeared).

When we applied the `.mean()` function, we added an argument (called a **keyword argument**): ``dim='time'``. With this argument, we told the function to compute the average *over the time dimension*.
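
As a small sketch of what `dim='time'` does, the example below averages a synthetic array over time and compares the result with the equivalent NumPy call, where the dimension has to be given by position (`axis=0`) instead of by name. The array `toy` and its values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 12 time steps on a 3 x 4 grid
data = np.arange(12 * 3 * 4, dtype=float).reshape(12, 3, 4)
toy = xr.DataArray(data, dims=("time", "latitude", "longitude"))

# xarray: name the dimension to average over
avg_xr = toy.mean(dim="time")

# NumPy: the same operation needs the axis position (time is axis 0 here)
avg_np = data.mean(axis=0)

print(toy.dims, "->", avg_xr.dims)
print(np.allclose(avg_xr.values, avg_np))
```

Naming the dimension means the code does not depend on the order in which the dimensions happen to be stored.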

Let's make another plot of this. Give it a try.

In [None]:
# Plot the mean sst 


In [None]:
# Try what happens if you don't specify the dimension for your averaging. 


**Q: What do you think happened?**



*Note: scalar output is quite verbose in xarray... You can print just the data to the screen with the `.values` attribute:*

In [None]:
ds.sst.mean().values
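
A short sketch of the options for getting a plain number out of a scalar result, using a made-up four-element array rather than the ERA5 data: `.values` returns a NumPy array (zero-dimensional in this case), while `.item()` returns a regular Python number.

```python
import numpy as np
import xarray as xr

# Hypothetical data with a known mean of 2.5
toy = xr.DataArray(np.array([1.0, 2.0, 3.0, 4.0]), dims=("x",))

result = toy.mean()       # still a DataArray, with zero dimensions
as_numpy = result.values  # the underlying NumPy data
as_float = result.item()  # plain Python float

print(type(result).__name__, result.ndim)
print(type(as_numpy).__name__, as_numpy)
print(type(as_float).__name__, as_float)
```

The plain Python number is convenient when the value goes into a formatted string or a plot title.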

**Q: what should we expect from the following commands:**

    ds.sst.mean(dim='longitude')
    ds.sst.mean(dim='time').mean(dim='longitude')
    ds.sst.mean(dim=['time', 'longitude'])
    
**Try them out!**

In [None]:
# Try the commands above. Do they work as expected? 


**E: what is the maximum SST value? And the minimum?** ([hint](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.min.html))

In [None]:
# your answer here

## Slicing Data 

Sometimes, we want to focus on a certain region or period within our dataset, which means that we want to look at a _slice_ of the total array.  _xarray_ allows us to select data on such a slice. See what is happening below.  

In [None]:
sst_Australia = ds.sst.sel(time='2000-06-01', longitude=slice(105, 160), latitude=slice(-8, -45))
sst_Australia
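
One detail worth noting in the cell above is that the latitude slice runs from -8 to -45, i.e. from north to south. Label-based slices follow the order of the coordinate, and ERA5 latitudes are typically stored from 90 down to -90. The sketch below uses a made-up descending latitude axis in 5-degree steps (an assumption standing in for the real grid) to show the effect of the slice order.

```python
import numpy as np
import xarray as xr

# Hypothetical latitude axis stored north to south, in 5-degree steps
latitude = np.arange(90.0, -91.0, -5.0)
toy = xr.DataArray(
    np.zeros(latitude.size),
    dims=("latitude",),
    coords={"latitude": latitude},
)

# Slice bounds in the same order as the coordinate (north to south)
north_to_south = toy.sel(latitude=slice(-8, -45))

# Reversed bounds select nothing on a descending coordinate
south_to_north = toy.sel(latitude=slice(-45, -8))

print(north_to_south.latitude.values)
print(south_to_north.latitude.size)
```

If a selection unexpectedly comes back empty, checking the order of the coordinate (for example with `ds.latitude.values[:3]`) is a good first step.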

We can then make another plot. I am selecting a different map projection `LambertCylindrical()`, which looks better in this case. 
If you want to learn more about mapping and map projections you can read about them [here](https://earth-env-data-science.github.io/lectures/mapping_cartopy.html). 

In [None]:
# Define the map projection (what the map looks like)
ax = plt.axes(projection=ccrs.LambertCylindrical())
# ax is an empty plot. We now plot the variable sst_Australia onto ax
sst_Australia.plot(ax=ax, transform=ccrs.PlateCarree()) 
# the keyword "transform" tells the function in which projection the data is stored 
ax.coastlines(); ax.gridlines(); # Add coastlines and gridlines to the plot