# Tutorial A: Download Data from Copernicus and ISIMIP

**onehealth-data Python package - data download and visualization of the raw data**

---

**Authors:** Scientific Software Center  
**Date:** October 2025  
**Version:** 1.0

---

## Overview

This tutorial demonstrates how to download data files through the Copernicus and ISIMIP APIs. You will learn how to:

- Download ERA5 climate data
- Download population data from ISIMIP
- Visualize the data to verify its integrity and correctness

Let's get started!

# Preparing data files

The data files are prepared according to the [data flowchart](../../datalake.md#data-flowchart).

In [None]:
from onehealth_data_backend import inout
from pathlib import Path
from matplotlib import pyplot as plt
import xarray as xr
from isimip_client.client import ISIMIPClient

In [None]:
# change to your own data folder, if needed
data_root = Path("../../../data/")
data_folder = data_root / "in"

## Download ERA5-Land data

To download ERA5-Land data using the CDS API:
* Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present.
* Go to the `Download` tab of the dataset and select the data variables, time range, geographical area, etc. that you want to download.
* At the end of the page, click on `Show API request code` and note the following information:
    * `dataset`: name of the dataset
    * `request`: a dictionary that summarizes your download request
* Replace the values of `dataset` and `request` in the cell below accordingly.

In [None]:
# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2016", "2017"],
    "month": [
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09",
        "10",
        "11",
        "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
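
The `month` list above can also be generated instead of typed out. The sketch below builds an equivalent request dictionary; `build_monthly_request` and `request_example` are hypothetical names used for illustration and are not part of `onehealth_data_backend`. The `area` ordering in the comment should be checked against the request code that CDS shows for your dataset.

```python
def build_monthly_request(variables, years, area=None):
    """Build a CDS-style request dict for monthly means (hypothetical helper)."""
    request = {
        "product_type": ["monthly_averaged_reanalysis"],
        "variable": list(variables),
        "year": [str(y) for y in years],
        "month": [f"{m:02d}" for m in range(1, 13)],  # "01" ... "12"
        "time": ["00:00"],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }
    if area is not None:
        # assumed order: [north, west, south, east] in degrees
        request["area"] = list(area)
    return request


request_example = build_monthly_request(
    ["2m_temperature", "total_precipitation"], [2016, 2017]
)
```

Leaving `area` out keeps the request global, which matches the request shown in the cell above.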

In [None]:
data_format = request.get("data_format")

# file name of downloaded data
era5_fname = inout.get_filename(
    ds_name=dataset,
    data_format=data_format,
    years=request["year"],
    months=request["month"],
    has_area="area" in request,
    base_name="era5_data",
    variables=request["variable"],
)
era5_fpath = data_folder / era5_fname

In [None]:
# download data
if not era5_fpath.exists():
    print("Downloading data...")
    inout.download_data(era5_fpath, dataset, request)
else:
    print("Data already exists at {}".format(era5_fpath))
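
The "download only if missing" pattern above can be wrapped in a small reusable helper. This is a minimal sketch using only the standard library; `ensure_file` is a hypothetical name, and the download step is passed in as a callable so the helper does not depend on any particular API. Whether `inout.download_data` creates missing folders itself is not stated here, so the sketch creates the parent folder as a precaution.

```python
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, download: Callable[[Path], None]) -> bool:
    """Call ``download(path)`` only if ``path`` does not exist yet.

    Returns True if a download was triggered, False if the file was reused.
    """
    if path.exists():
        return False
    # make sure the target folder exists before downloading
    path.parent.mkdir(parents=True, exist_ok=True)
    download(path)
    if not path.exists():
        raise FileNotFoundError(f"Download did not produce {path}")
    return True


# Possible use in this notebook (requires the package and CDS credentials):
# ensure_file(era5_fpath, lambda p: inout.download_data(p, dataset, request))
```

The final existence check turns a silently failed download into an immediate error instead of a confusing failure when the file is opened later.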

### Special download for total precipitation data from ERA5-Land Hourly dataset

`P-model` requires total precipitation data from the `ERA5-Land hourly data from 1950 to present` dataset.

In this dataset, the value at `00:00` is the total precipitation of the previous day (see [here](https://confluence.ecmwf.int/pages/viewpage.action?pageId=197702790)).
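
To make the one-day offset concrete, the sketch below relabels a few `00:00` values with the day they actually describe. The precipitation numbers are invented for illustration, and `relabel_midnight_totals` is a hypothetical helper, not a function of the package.

```python
from datetime import datetime, timedelta


def relabel_midnight_totals(values_at_midnight):
    """Map each 00:00 timestamp to the previous calendar day."""
    relabeled = {}
    for timestamp, total in values_at_midnight.items():
        if (timestamp.hour, timestamp.minute) != (0, 0):
            raise ValueError(f"expected a 00:00 timestamp, got {timestamp}")
        relabeled[timestamp.date() - timedelta(days=1)] = total
    return relabeled


# made-up totals, keyed by the timestamp they are stored under
raw = {
    datetime(2016, 1, 2): 0.004,
    datetime(2016, 1, 3): 0.000,
}
daily = relabel_midnight_totals(raw)
# daily is now keyed by 2016-01-01 and 2016-01-02
```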

To get correct precipitation values from `01.01.2016` to `31.12.2017`, we need to download data from `02.01.2016` to `01.01.2018`. The current CDS request API does not allow downloading data in a single request for ranges that are not full calendar years.

We implemented a special function for this case.

```python
def download_total_precipitation_from_hourly_era5_land(
    start_date: str,
    end_date: str,
    area: List[float] | None = None,
    out_dir: Path = Path("."),
    base_name: str = "era5_data",
    data_format: str = "netcdf",
    ds_name: str = "reanalysis-era5-land",
    coord_name: str = "valid_time",
    var_name: str = "total_precipitation",
    clean_tmp_files: bool = False,
) -> str:
```

The function takes the following inputs:

* `start_date` and `end_date`: first and last day of the target range, in the format "YYYY-MM-DD"
* `area`: the area to download; `None` means the entire globe
* `out_dir`: output directory in which to store the downloaded file
* `base_name`: base string used to name the output file. The file name is described in [Naming convention - Special case](../../data.md#special-case)
* `data_format`: either `netcdf` or `grib`
* `ds_name`, `coord_name`, and `var_name`: the dataset name, coordinate name, and data variable name in the dataset. Only change these values if CDS changes the corresponding names.
* `clean_tmp_files`: set to `False` to retain the downloaded temporary files, which store data for smaller sub-ranges derived from the overall date range. For example, because the timestamps are shifted one day forward, the range `2016-01-01` to `2017-12-31` is split into the sub-ranges `2016-01-02` to `2016-12-31`, `2017-01-01` to `2017-12-31`, and `2018-01-01` to `2018-01-01`.

The function handles time shifting, downloads the data, adjusts the time coordinate back to the target range, and returns the output file path.
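
The shifting and splitting described above can be sketched with the standard library alone. This is a minimal illustration of the date arithmetic, assuming sub-ranges are cut at calendar-year boundaries as in the example; `shifted_subranges` is a hypothetical name and not the package's implementation.

```python
from datetime import date, timedelta


def shifted_subranges(start_date: str, end_date: str) -> list[tuple[str, str]]:
    """Shift a date range one day forward and split it by calendar year."""
    start = date.fromisoformat(start_date) + timedelta(days=1)
    end = date.fromisoformat(end_date) + timedelta(days=1)
    if start > end:
        raise ValueError("start_date must not be after end_date")

    ranges = []
    for year in range(start.year, end.year + 1):
        # clip each calendar year to the shifted overall range
        sub_start = max(start, date(year, 1, 1))
        sub_end = min(end, date(year, 12, 31))
        ranges.append((sub_start.isoformat(), sub_end.isoformat()))
    return ranges


subranges = shifted_subranges("2016-01-01", "2017-12-31")
# [('2016-01-02', '2016-12-31'), ('2017-01-01', '2017-12-31'), ('2018-01-01', '2018-01-01')]
```

Each tuple corresponds to one request that either covers a full calendar year or is clipped to the ends of the shifted range.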

In [None]:
# download total precipitation data from ERA5-Land Hourly dataset
# from 2016-01-01 to 2017-12-31
start_time = "2016-01-01"
end_time = "2017-12-31"
tp_era5_hourly_file = inout.download_total_precipitation_from_hourly_era5_land(
    start_date=start_time,
    end_date=end_time,
    area=None,
    out_dir=data_folder,
    base_name="era5_data",
    data_format="netcdf",
    ds_name="reanalysis-era5-land",
    coord_name="valid_time",
    var_name="total_precipitation",
    clean_tmp_files=False,  # keep temporary files for checking
)
tp_era5_hourly_file

In [None]:
tp_era5_hourly_ds = xr.open_dataset(tp_era5_hourly_file)
tp_era5_hourly_ds["valid_time"]

## Download ISIMIP data (population data)

To download ISIMIP data manually, please follow the instructions in [Data](../../data.md).

To download the data using ISIMIP's APIs, please perform these steps:

In [None]:
# initialize ISIMIP client
client = ISIMIPClient()

In [None]:
# search for population data
response = client.datasets(
    path="ISIMIP3a/InputData/socioeconomic/pop/histsoc/population"
)  # this path is similar to the one in ISIMIP's website

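# use a separate loop variable so the ERA5 `dataset` name defined earlier is not overwritten
for result in response["results"]:
    print("Dataset found: {}".format(result["path"]))

# download population data file, 1901_2021
isimip_fpath = None
for result in response["results"]:
    for file in result["files"]:
        if "1901_2021" in file["name"]:
            isimip_fpath = data_folder / file["name"]
            if isimip_fpath.exists():
                print(f"Population data file already exists: {file['name']}")
            else:
                print(f"Downloading population data file: {file['name']}")
                client.download(file["file_url"], path=data_folder)
            break  # stop scanning files in this dataset
    if isimip_fpath is not None:
        break  # exit after first match across all datasets

if isimip_fpath is None:
    raise FileNotFoundError("No population data file matching '1901_2021' was found")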
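
The file selection can also be separated from the network call, which makes it easy to check without contacting ISIMIP. The sketch below is an illustration only: `find_first_file` is a hypothetical helper, and `sample_response` is a fabricated stand-in that merely reuses the keys read in the cell above (`results`, `files`, `name`, `file_url`).

```python
def find_first_file(response, name_fragment):
    """Return the first file entry whose name contains ``name_fragment``, or None."""
    for result in response.get("results", []):
        for file_info in result.get("files", []):
            if name_fragment in file_info["name"]:
                return file_info
    return None


# fabricated response; file names and URLs are placeholders
sample_response = {
    "results": [
        {
            "path": "demo/dataset",
            "files": [
                {"name": "population_1850_1900.nc", "file_url": "https://example.org/a.nc"},
                {"name": "population_1901_2021.nc", "file_url": "https://example.org/b.nc"},
            ],
        },
    ]
}
match = find_first_file(sample_response, "1901_2021")
```

Returning `None` when nothing matches lets the caller decide whether a missing file is an error before any path is built from it.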

## Open the files and read contents into xarray datasets

In [None]:
# load netCDF files
ds_era5 = xr.open_dataset(era5_fpath)
ds_isimip = xr.open_dataset(isimip_fpath)

## Plot the datasets

In [None]:
# plot the gridded t2m data for 2016-2017, one panel per month
ds_era5.t2m.plot.pcolormesh(
    col="valid_time", col_wrap=4, cmap="coolwarm", robust=True, figsize=(15, 10)
)
plt.savefig("era5_2016_2017_plots.png", dpi=300)
plt.show()