# Introduction to Python and Jupyter Notebook

**Material Cycles in Floodplain Systems · Seminar 1 (3 Nov 2021) · Jian Peng, Miguel Mahecha · WS 2021/2022**\
Feel free to contact [Marco Hannemann](mailto:marco.hannemann@ufz.de) or [Toni Schmidt](mailto:toni.schmidt@ufz.de) if you have questions or need help.

## 1—Motivation

**[Jupyter Notebook](https://jupyter.org)** is a web-based interactive editor that supports several programming languages, including Python. Every Jupyter Notebook is made up of cells, which can contain code or text, among other things. The code can be executed directly in the notebook, which makes it an effective way to work with Python. There are [helpful tutorials](https://realpython.com/jupyter-notebook-introduction/) that explain how to use Jupyter Notebooks.

There are [many reasons](https://www.datacamp.com/community/blog/python-scientific-computing-case) to choose **[Python](https://www.python.org/doc/essays/blurb/)** for scientific programming. In a nutshell, its advantages are easy code writing and performance. A growing community is constantly working on improving external libraries.

**[NetCDF](https://www.unidata.ucar.edu/software/netcdf/)** is a data file format commonly used for [array](https://www.w3schools.com/python/python_arrays.asp)-oriented scientific data. As science is all about working with data, it is important to know how to work with this very handy data format. During this seminar we will be working with netCDF files of different data products (e.g. soil moisture, evaporation, or precipitation).

## 2—Modules

### 2.1—Module installation

**Modules** are Python files that define **functions**, which are needed to work with data efficiently. **Package installers** are required to install and manage modules. For Python, **``pip``** is an established package installer.

First, we use ``pip`` to check which libraries are already installed by running ``pip list``.

In [None]:
pip list

To work with netCDF files, we need to install a module that provides functions to read them. Different modules can be used (e.g. ``h5py``, ``netCDF4``, or ``xarray``). During the seminar we use **``xarray``** because it makes working with netCDF files convenient. To install ``xarray`` on its own we would use the command ``pip install xarray``. As ``xarray`` needs a backend such as ``netCDF4`` to read netCDF files and ``matplotlib`` for plotting, we install all three with ``pip install xarray netcdf4 matplotlib``. Afterwards we need to restart the kernel.

In [None]:
pip install xarray netcdf4 matplotlib

### 2.2—Module importing

In Python, modules need to be **imported** before their functions can be used. To do so we use the command ``import <module>``. Additionally, we can assign short **[aliases](https://www.digitalocean.com/community/tutorials/how-to-import-modules-in-python-3#aliasing-modules)** to modules using ``import <module> as <alias>`` in order to make coding easier and the code more readable. For widely used libraries there are common aliases, e.g. ``xr`` for ``xarray``.

In [None]:
import xarray as xr
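
The effect of an alias can be checked with any module. The sketch below uses ``math`` from the standard library and the alias ``m``, which is chosen only for illustration and is not a common convention:

```python
import math        # plain import: functions are reached as math.<name>
import math as m   # aliased import: the same module under a shorter name

print(math.sqrt(16.0))
print(m.sqrt(16.0))

# Both names refer to the very same module object
print(m is math)
```

An alias only changes the name used in the code; it does not create a second copy of the module.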

## 3—Functions

### 3.1—Built-in functions

Python comes with many **[built-in functions](https://www.w3schools.com/python/python_ref_functions.asp)** that can be used without importing a module. Two very important built-in functions are **``help()``** and **``print()``**. Use ``help(<function>)`` to retrieve the documentation of the respective function. If you want to print a message to the screen, use ``print()``. As Jupyter Notebook is based on [IPython](https://ipython.org), ``print()`` is not always needed: the value of the last expression in a cell, e.g. a function call or a variable name, is displayed automatically. When running a Python script outside a Jupyter Notebook, ``print()`` is necessary to print a message.
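
As a small illustration, the two functions can be tried out as follows. The variable name ``message`` and its text are made up for this example:

```python
message = "Hello, floodplain!"

# print() writes the value to the screen, in a notebook and in a script alike
print(message)

# help() shows the documentation of a function, here the built-in len()
help(len)

# In a notebook, the value of the last expression in a cell is displayed
# automatically; in a script this line would produce no output
len(message)
```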

### 3.2—External functions

To apply **functions** that come with modules, we use the **[dot notation](https://www.askpython.com/python/built-in-methods/dot-notation)**: the module name or its alias, followed by a dot and the function name, e.g. ``xr.open_dataset(<file_path>)`` to open a netCDF file. To keep working with the data, we assign it to a **[variable](https://www.w3schools.com/python/python_variables.asp)**.

In [None]:
gleam = xr.open_dataset("data/gleam_europe_monmean_reduced.nc")
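
If the data file is not at hand, a small in-memory data set can stand in for it when trying out the following steps. This is only a sketch: the grid, the random values, and the ``StartYear`` value are made up and do not reproduce the GLEAM file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in: 24 monthly time steps on a coarse 4 x 5 grid
time = pd.date_range("2018-01-01", periods=24, freq="MS")
lat = np.array([55.0, 52.5, 50.0, 47.5])
lon = np.array([7.5, 10.0, 12.5, 15.0, 17.5])

# Random values, only there to fill the array
rng = np.random.default_rng(seed=0)
values = rng.uniform(0.1, 0.5, size=(time.size, lat.size, lon.size))

demo = xr.Dataset(
    data_vars={"surface_soil_moisture": (("time", "lat", "lon"), values)},
    coords={"time": time, "lat": lat, "lon": lon},
    attrs={"StartYear": 2018},  # made-up metadata entry
)

print(demo)
print(list(demo.data_vars))
```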

## 4—NetCDF files

### 4.1—Structure

Now that we have assigned the output of ``xr.open_dataset()`` to the variable ``gleam``, we can have a look at its **variable content** using ``print(gleam)``.

In [None]:
print(gleam)

Coming back to the benefits of IPython and Jupyter, there is a more convenient way to look at the content: simply call the variable name without ``print()``.

In [None]:
gleam

The [**dimensions**](https://www.unidata.ucar.edu/software/netcdf/docs/group__dimensions.html) represent the **shape** of the data array and can be accessed directly using ``gleam.dims``.

In [None]:
gleam.dims

The [**coordinates**](https://www.unidata.ucar.edu/software/netcdf/workshops/2010/datamodels/NcCVars.html) can be accessed directly using ``gleam.coords``.

In [None]:
gleam.coords

The **data variables** of the netCDF file can be accessed directly using ``gleam.data_vars``.

In [None]:
gleam.data_vars

The [**attributes**](https://www.unidata.ucar.edu/software/netcdf/docs/group__attributes.html) of the netCDF file hold the **metadata** and can be accessed directly using ``gleam.attrs``.

In [None]:
gleam.attrs

There are many **[data types](https://www.w3schools.com/python/python_datatypes.asp)** used in Python. The output above is a [dictionary](https://www.w3schools.com/python/python_dictionaries.asp), which stores data as key–value pairs. To get the **value** of a specific **key** we use ``dictionary["key"]``. For instance, to get the initial year of the GLEAM data set, we run ``gleam.attrs["StartYear"]``.

In [None]:
gleam.attrs["StartYear"]
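
The same key lookup works on any plain Python dictionary. The keys and values below are invented for illustration and are not taken from the GLEAM file:

```python
# A made-up metadata dictionary
metadata = {"title": "demo data set", "start_year": 2003}

# Look up a value by its key
print(metadata["start_year"])

# List all available keys
print(list(metadata.keys()))

# .get() returns a fallback instead of raising an error for a missing key
print(metadata.get("end_year", "not available"))
```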

The retrieval of values of the **data variables** that are stored within the netCDF file is similar to retrieving values of keys in a dictionary.

In [None]:
gleam["surface_soil_moisture"]

To continue working with the content of ``surface_soil_moisture``, we assign it to a variable.

In [None]:
surface_soil_moisture = gleam["surface_soil_moisture"]

### 4.2—Statistics

We can use statistical methods that come with ``xarray`` to derive values like **minimum** (``.min()``), **maximum** (``.max()``), **mean** (``.mean()``), or **standard deviation** (``.std()``).

In [None]:
surface_soil_moisture.min()

In [None]:
surface_soil_moisture.max()

In [None]:
surface_soil_moisture.mean()

In [None]:
surface_soil_moisture.std()
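
Without a ``dim`` argument these methods reduce over all dimensions and return a single value. The sketch below uses a tiny made-up array (the values 0 to 11, not soil moisture data) to show this, and also how to reduce along one dimension only:

```python
import numpy as np
import xarray as xr

# Made-up array with 3 time steps on a 2 x 2 grid, holding the values 0..11
data = np.arange(12, dtype=float).reshape(3, 2, 2)
demo = xr.DataArray(data, dims=("time", "lat", "lon"))

# Reduce over all dimensions; .item() extracts the plain Python number
print(demo.min().item())
print(demo.max().item())
print(demo.mean().item())

# Reduce over time only: one mean value per grid cell remains
print(demo.mean(dim="time").values)
```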

## 5—Indexing and data selection

### 5.1—Zero-based indexing

For most tasks it is necessary to select parts of a data set. In Python, indexing is conventionally **zero-based**, i.e. we have to choose the zeroth element (``[0]``) in order to get the first element of a data set.

In [None]:
surface_soil_moisture[0]
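
Zero-based indexing applies to plain Python lists as well. A minimal example with a made-up list:

```python
months = ["Jan", "Feb", "Mar", "Apr"]

print(months[0])    # first element
print(months[-1])   # negative indices count from the end
print(months[1:3])  # slices include the start index but exclude the stop index
```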

### 5.2—Data selection using ``xarray``

As seen above, we have now selected the first time step of the GLEAM data set, because time is the first dimension of the array. **[Indexing and data selection](http://xarray.pydata.org/en/stable/user-guide/indexing.html)** with ``xarray`` is much more convenient, as it provides methods to select and slice data more easily. For instance, we can **select data by coordinates** using [``.sel()``](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.sel.html). We can select surface soil moisture data for specific time points, latitudes, or longitudes, or select slices, i.e. time ranges, latitude ranges, or longitude ranges.

In [None]:
surface_soil_moisture.sel(time="2018-08-16")

In [None]:
surface_soil_moisture.sel(time=slice("2018-01-01","2019-12-31"))

In [None]:
surface_soil_moisture.sel(lon=13.40, lat=52.52, method="nearest")
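
To see what these selections return, the sketch below applies them to a small made-up array. The grid and the values are assumptions for illustration, not GLEAM data:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up monthly data on a coarse grid
time = pd.date_range("2018-01-01", periods=24, freq="MS")
lat = np.array([47.5, 50.0, 52.5, 55.0])
lon = np.array([7.5, 10.0, 12.5, 15.0])
values = np.arange(time.size * lat.size * lon.size, dtype=float)
values = values.reshape(time.size, lat.size, lon.size)
demo = xr.DataArray(
    values,
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
)

# Label-based slices include both end points
first_half = demo.sel(time=slice("2018-01-01", "2018-06-30"))
print(first_half.sizes["time"])

# method="nearest" picks the grid cell closest to the requested coordinates
point = demo.sel(lon=13.40, lat=52.52, method="nearest")
print(float(point.lat), float(point.lon))
print(point.dims)
```

The point selection drops the ``lat`` and ``lon`` dimensions, so only a time series remains.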

We can also **group data** using [``.groupby()``](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.groupby.html). So-called **[date/time components](https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html#datetimeindex)** make it easy to group the data set along the time dimension. For instance, we can use ``surface_soil_moisture.groupby("time.month")`` to group the surface soil moisture data by month.

In [None]:
surface_soil_moisture_months = surface_soil_moisture.groupby("time.month")
surface_soil_moisture_months

This enables us to select the data month by month. If we want to select all data for January only (but still for all years), we can do ``surface_soil_moisture_months[1]``. Note that the index here is not a zero-based position but a group label, i.e. the month number from 1 to 12.

In [None]:
surface_soil_moisture_months[1]
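
Grouping is often followed by a reduction, for example to get one mean value per calendar month. A minimal sketch on a made-up time series (the values 0 to 23 for 24 consecutive months):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up monthly series covering two years
time = pd.date_range("2018-01-01", periods=24, freq="MS")
series = xr.DataArray(
    np.arange(24, dtype=float),
    coords={"time": time},
    dims="time",
    name="demo",
)

# Mean over all Januaries, all Februaries, and so on
monthly_mean = series.groupby("time.month").mean()
print(monthly_mean.sizes["month"])

# The result is indexed by the group label "month", starting at 1
print(monthly_mean.sel(month=1).item())
```

January occurs at positions 0 and 12 of the series, so its mean is 6.0.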

## 6—Plotting

``xarray`` holds different methods to visualize data as can be seen in [this visualization gallery](http://xarray.pydata.org/en/stable/examples/visualization_gallery.html). We can use ``.plot()`` to plot a **histogram** of the surface soil moisture distribution over all dimensions.

In [None]:
surface_soil_moisture.plot()

We can also use ``.plot()`` in order to plot the spatial distribution of mean surface soil moisture as a **map**.

In [None]:
surface_soil_moisture_mean = surface_soil_moisture.mean(dim="time")

In [None]:
surface_soil_moisture_mean.plot()

Alternatively, we can plot a **time series** of surface soil moisture of a specific location using ``.plot()`` too.

In [None]:
surface_soil_moisture_berlin = surface_soil_moisture.sel(lon=13.40, lat=52.52, method="nearest")

In [None]:
surface_soil_moisture_berlin.plot()
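
The three plots above differ because ``.plot()`` chooses the plot type from the number of dimensions: a line for one-dimensional data, a map-like mesh for two-dimensional data, and a histogram otherwise. The sketch below shows this side by side for a made-up random array. The array and the figure layout are assumptions for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Made-up random data with three dimensions
rng = np.random.default_rng(seed=1)
cube = xr.DataArray(rng.random((6, 4, 5)), dims=("time", "lat", "lon"), name="demo")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

cube.plot(ax=axes[0])                     # three dimensions: histogram
cube.mean(dim="time").plot(ax=axes[1])    # two dimensions: map
cube.isel(lat=0, lon=0).plot(ax=axes[2])  # one dimension: line

fig.tight_layout()
```

In a notebook the figure is shown below the cell; in a script it would have to be displayed or saved explicitly.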

## 7—More examples

So far we have used [**GLEAM**](https://www.gleam.eu) (Global Land Evaporation Amsterdam Model) for demonstrations. In the following we have the chance to play around with other data sets, such as [**GPCP**](http://gpcp.umd.edu) (Global Precipitation Climatology Project) or [**MODIS Landcover**](https://modis.gsfc.nasa.gov/data/dataprod/mod12.php), where [MODIS](https://modis.gsfc.nasa.gov) is the satellite sensor *Moderate Resolution Imaging Spectroradiometer*.

### 7.1—GPCP

In [None]:
# open data set
gpcp = xr.open_dataset("data/GPCP__V2_3__PRECIP__2.5x2.5.nc")

In [None]:
# look at data structure
gpcp

In [None]:
# assign precipitation to a variable
precip = gpcp["precip"]

In [None]:
# group precipitation by seasons
precip_seasons = precip.groupby("time.season")

In [None]:
# look at groups
precip_seasons

In [None]:
# plot mean precipitation per latitude over time for summer (JJA)
precip_seasons["JJA"].mean(dim="lon").plot()

In [None]:
# plot mean precipitation per latitude over time for winter (DJF)
precip_seasons["DJF"].mean(dim="lon").plot()

### 7.2—MODIS Landcover

In [None]:
# open data set
modis = xr.open_dataset("data/modis_landcover_europe_2019.nc")

In [None]:
# look at data structure
modis

In [None]:
# assign landcover to a variable
landcover = modis["landcover_igbp"]

In [None]:
# group landcover by year (the date/time component is "year", not "years")
landcover.groupby("time.year")

In [None]:
# plot landcover as a map
landcover.plot(levels=20, cmap="terrain")

In [None]:
# select Spain
landcover_spain = landcover.sel(lon=slice(-10, 4), lat=slice(45, 35))
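
The slice ``lat=slice(45, 35)`` runs from north to south. This presumably matches a latitude coordinate that is stored in descending order in the file, which is an assumption, as the file content is not shown here. The sketch below uses a made-up latitude axis to show why the order of the slice bounds matters:

```python
import numpy as np
import xarray as xr

# Made-up latitude axis stored from north to south (descending)
lat = np.array([50.0, 45.0, 40.0, 35.0, 30.0])
demo = xr.DataArray(np.arange(5), coords={"lat": lat}, dims="lat")

# Slice bounds in the same order as the coordinate: works as expected
north_to_south = demo.sel(lat=slice(45, 35))
print(north_to_south.lat.values)

# Slice bounds in the opposite order: the selection is empty
south_to_north = demo.sel(lat=slice(35, 45))
print(south_to_north.size)

# Sorting the coordinate first makes the ascending slice work
ascending = demo.sortby("lat").sel(lat=slice(35, 45))
print(ascending.lat.values)
```

If a selection unexpectedly comes back empty, checking the order of the coordinate values is a good first step.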

In [None]:
# plot landcover of Spain as a map
landcover_spain.plot(levels=20, cmap="terrain")

## Literature

- [DigitalOcean eBook: How To Code in Python](https://www.digitalocean.com/community/books/digitalocean-ebook-how-to-code-in-python?refcode=4d2af78748bd&utm_campaign=pythonebook&utm_medium=ebook&utm_source=ebook) → very good eBook introducing the Python basics
- [W3Schools Python Tutorial](https://www.w3schools.com/python/default.asp) → short and comprehensible overview of Python basics