# Preparing data files

This notebook prepares the data files according to the [data flowchart](../../../datalake_database/#data-flowchart).

In [None]:
from onehealth_data_backend import inout
from onehealth_data_backend import preprocess
from pathlib import Path
import time
import xarray as xr
from isimip_client.client import ISIMIPClient

In [None]:
# change to your own data folder, if needed
data_folder = Path("../../../data/in/")

## Download ERA5-Land data

To download ERA5-Land data using the CDS API:
* Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present
* Go to the `Download` tab of the dataset and select the data variables, time range, geographical area, etc. that you want to download
* At the end of the page, click on `Show API request code` and note the following information
    * `dataset`: name of the dataset
    * `request`: a dictionary that summarizes your download request
* Replace the values of `dataset` and `request` in the cell below accordingly

In [None]:
# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2016", "2017"],
    "month": [
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09",
        "10",
        "11",
        "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
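The `month` list does not have to be typed out by hand. As a small sketch (not part of the package), the zero-padded month strings can be generated and combined with the other fields of the request. `build_request` is a hypothetical helper name introduced only for this illustration; the fields are the ones used in the cell above.

```python
def build_request(years, variables, months=range(1, 13)):
    """Assemble a CDS-style request dict (hypothetical helper, for illustration)."""
    return {
        "product_type": ["monthly_averaged_reanalysis"],
        "variable": list(variables),
        "year": [str(y) for y in years],
        # zero-pad month numbers to two digits, e.g. 1 -> "01"
        "month": [f"{m:02d}" for m in months],
        "time": ["00:00"],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }


example_request = build_request(
    [2016, 2017], ["2m_temperature", "total_precipitation"]
)
print(example_request["month"])
```

The result is stored in `example_request` so that it does not overwrite the `request` variable used by the following cells.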

In [None]:
data_format = request.get("data_format")

# file name of downloaded data
era5_fname = inout.get_filename(
    ds_name=dataset,
    data_format=data_format,
    years=request["year"],
    months=request["month"],
    has_area=bool("area" in request),
    base_name="era5_data",
    variable=request["variable"],
)
era5_fpath = data_folder / era5_fname

In [None]:
# download data
if not era5_fpath.exists():
    print("Downloading data...")
    inout.download_data(era5_fpath, dataset, request)
else:
    print("Data already exists at {}".format(era5_fpath))

## Download ISIMIP data (population data)

To download ISIMIP data manually, please follow the instructions in [Data](../../data.md).

To download the data using ISIMIP's APIs, please perform these steps:

In [None]:
# initialize ISIMIP client
client = ISIMIPClient()

In [None]:
# search for population data
response = client.datasets(
    path="ISIMIP3a/InputData/socioeconomic/pop/histsoc/population"
)  # this path is similar to the one in ISIMIP's website

# `isimip_ds` avoids overwriting the ERA5-Land `dataset` name defined above
for isimip_ds in response["results"]:
    print("Dataset found: {}".format(isimip_ds["path"]))

# download population data file, 1901_2021
for isimip_ds in response["results"]:
    for file in isimip_ds["files"]:
        if "1901_2021" in file["name"]:
            isimip_fpath = data_folder / file["name"]
            if isimip_fpath.exists():
                print(f"Population data file already exists: {file['name']}")
            else:
                print(f"Downloading population data file: {file['name']}")
                client.download(file["file_url"], path=data_folder)
            break  # stop at the first matching file of this dataset

## Preprocess data

We use the `preprocess` module for the preprocessing steps, through its function `preprocess_data_file()`:

```python
def preprocess_data_file(
    netcdf_file: Path,
    source: Literal["era5", "isimip"] = "era5",
    settings: Path | str = "default",
    new_settings: Dict[str, Any] | None = None,
    unique_tag: str | None = None,
) -> Tuple[xr.Dataset, str]:
```

Here, `netcdf_file` holds the path to the `.nc` file, while `source` indicates whether the file was downloaded from ERA5-Land or ISIMIP, since the two sources require different preprocessing steps.

The preprocessing steps are determined by a JSON settings file, provided through the `settings` parameter. This parameter can be set either to a file path or to the string `"default"`. If a file path is given, the settings are loaded from that file; if loading fails, the default settings for the corresponding source are used instead. If `"default"` is specified, the default settings of the relevant source are loaded directly.
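The fallback behaviour described above can be sketched as follows. This is only an illustration of the logic, not the package's implementation: `load_settings`, `DEFAULT_SETTINGS`, and the `example_option` key are hypothetical names, and the real default settings are shipped with the package.

```python
import json
from pathlib import Path

# Placeholder defaults; the real default settings live in the package.
DEFAULT_SETTINGS = {
    "era5": {"example_option": True},
    "isimip": {"example_option": False},
}


def load_settings(settings="default", source="era5"):
    """Return settings from a JSON file, falling back to the defaults of `source`."""
    if settings == "default":
        return dict(DEFAULT_SETTINGS[source])
    try:
        with open(Path(settings), encoding="utf-8") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        # loading failed: use the defaults of the corresponding source
        return dict(DEFAULT_SETTINGS[source])


print(load_settings("does_not_exist.json", source="isimip"))
```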

If only certain fields of the default settings need to be updated, these fields and their values can be supplied as a dictionary via the `new_settings` parameter.
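Conceptually, this amounts to overlaying a small dictionary on top of the defaults. The sketch below assumes a shallow merge with made-up keys (`option_a`, `option_b`); `apply_overrides` is a hypothetical helper, and whether the package merges nested fields differently is not covered here.

```python
def apply_overrides(defaults, new_settings=None):
    """Return a copy of `defaults` with the fields in `new_settings` replaced."""
    merged = dict(defaults)  # copy, so the defaults stay untouched
    if new_settings:
        merged.update(new_settings)
    return merged


# hypothetical setting names, for illustration only
example_defaults = {"option_a": 1, "option_b": "x"}
print(apply_overrides(example_defaults, {"option_a": 2}))
```

Fields that are not listed in `new_settings` keep their default values.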

The final settings used for preprocessing are saved to a file in the same directory as the preprocessed `.nc` file. This output directory is defined in the provided settings file. The `unique_tag` is appended to the names of both the settings file and the resulting `.nc` file to link them together.

The following subsections illustrate how preprocessing is applied to ERA5-Land data and ISIMIP data.

### Preprocess ERA5-Land data

Default settings for ERA5-Land... TBU.

In [None]:
print("Preprocessing ERA5-Land data...")
t0 = time.time()
preprocessed_dataset = preprocess.preprocess_data_file(
    netcdf_file=era5_fpath,
)
t_preprocess = time.time()
print("Preprocessing completed in {:.2f} seconds.".format(t_preprocess - t0))

The preprocessed dataset is also saved as a `.nc` file in the same folder, named `era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc`.

Details on the file naming convention can be found in [Data](../../data.md).

### Preprocess population data

Instructions for downloading population data (i.e. ISIMIP data) are presented in [Data](../../data.md) and [Data Lake](../../datalake.md).

In [None]:
popu_file = data_folder / "population_histsoc_30arcmin_annual_1901_2021.nc"

In [None]:
print("Preprocessing population data...")
t0 = time.time()
preprocessed_popu = preprocess.preprocess_data_file(
    netcdf_file=popu_file,
    source="isimip",  # population data comes from ISIMIP, not the default "era5"
)
t_popu = time.time()
print("Preprocessing population data completed in {:.2f} seconds.".format(t_popu - t0))

The preprocessed dataset is also saved in a `.nc` file under the same folder.

## Aggregate data by NUTS regions

For analyzing data across European regions, it is more convenient to work with data aggregated by NUTS regions rather than by grid points (latitude and longitude). In this section, we demonstrate how ERA5-Land and population data can be aggregated into NUTS regions.

This feature can also be applied to other datasets in NetCDF format that include the coordinates `latitude`, `longitude`, and `time` (e.g. prediction model outputs).
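Before aggregating another dataset, it can help to check that the required coordinates are present. The following is a minimal sketch under the assumption that the coordinate names are available as a plain iterable of strings; `missing_coords` is a hypothetical helper and not part of the package.

```python
REQUIRED_COORDS = {"latitude", "longitude", "time"}


def missing_coords(coord_names):
    """Return the required coordinate names absent from `coord_names`, sorted."""
    return sorted(REQUIRED_COORDS - set(coord_names))


# with xarray, the names could come from `ds.coords` of an opened dataset
print(missing_coords(["lat", "lon", "time"]))
```

An empty list means that all three coordinates are available under the expected names.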

*Note: the file names of the preprocessed ERA5-Land and population data should be obtained directly after calling `preprocess_data_file()` ([issue #15](https://github.com/ssciwr/onehealth-data-backend/issues/15)).*

In [None]:
# NUTS shapefile
nuts_file = data_folder / "NUTS_RG_20M_2024_4326.shp.zip"

# preprocessed ERA5-Land file
preprocessed_era5_file = (
    data_folder
    / "era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc"
)

# preprocessed population file
preprocessed_popu_file = (
    data_folder / "population_histsoc_30arcmin_annual_1901_2021_unicoords_2020_2021.nc"
)

We can aggregate a single NetCDF file or multiple NetCDF files with one NUTS shapefile. These NetCDF files should be structured into a dictionary, as follows:

In [None]:
# forming a dictionary for non-NUTS data
non_nuts_data = {
    "era5": (preprocessed_era5_file, None),
    "popu": (preprocessed_popu_file, None),
}

Here, the keys represent dataset names (used to form the resulting file name), and the values are tuples containing the file path and the aggregation mapping.

By default, the aggregation mapping is set to `None`, which means the `mean` function will be applied to all data variables during aggregation.

An example of aggregation mapping is:

```python
{
    "t2m": "mean",
    "tp": "sum"
}
```
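To make the role of the mapping concrete, here is a toy sketch in which plain lists stand in for the grid-point values that fall inside one region. It is not how the package aggregates data internally; `aggregate_region` and `AGG_FUNCS` are hypothetical names, and falling back to the mean for variables missing from a partial mapping is an assumption.

```python
from statistics import mean

AGG_FUNCS = {"mean": mean, "sum": sum}


def aggregate_region(values_by_var, agg_map=None):
    """Aggregate each variable's grid-point values within one region."""
    agg_map = agg_map or {}
    return {
        # variables without an entry in the mapping fall back to the mean
        var: AGG_FUNCS[agg_map.get(var, "mean")](values)
        for var, values in values_by_var.items()
    }


cells = {"t2m": [10.0, 12.0], "tp": [1.0, 3.0]}
print(aggregate_region(cells))  # mapping of None: mean for every variable
print(aggregate_region(cells, {"t2m": "mean", "tp": "sum"}))  # tp is summed
```

Summing suits accumulated quantities such as precipitation, while averaging suits quantities such as temperature.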

The resulting file name would be:

`<NUTS_shapefile_name>_agg_<nc_dataset_names>_<min_yyyy-mm>-<max_yyyy-mm>.nc`

In [None]:
# aggregate data by NUTS regions
t0 = time.time()
aggregated_file = preprocess.aggregate_data_by_nuts(
    non_nuts_data, nuts_file, normalize_time=True, output_dir=None
)
t1 = time.time()
print(f"Aggregation completed in {t1 - t0:.2f} seconds")

In this example, we use the default values for `normalize_time` and `output_dir`, which are `True` and `None`, respectively.

The `normalize_time` option ensures that time values in the NetCDF file are reset to the start of the day. For example, `2025-10-01T12:00:00` becomes `2025-10-01T00:00:00`. This is particularly useful for population data, where time values are recorded at midday.
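The effect of this normalization on a single timestamp can be reproduced with the standard library. This is only a sketch of the idea, with `normalize_to_midnight` as a hypothetical helper; the package applies the operation to the time coordinate of the dataset.

```python
from datetime import datetime


def normalize_to_midnight(ts):
    """Reset the time-of-day part of a timestamp to 00:00:00."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


noon = datetime.fromisoformat("2025-10-01T12:00:00")
print(normalize_to_midnight(noon).isoformat())
```

For pandas-based time indexes, `DatetimeIndex.normalize()` performs the same reset on all values at once.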

When `output_dir` is set to `None`, the aggregated file is saved in the same directory as the NUTS shapefile.

*Note: Since the ERA5-Land data is for 2016-2017, while the population data is for 2020-2021, the aggregated data would range from 2016 to 2021.*

*We should truncate the population data to the same time frame for demonstration purposes. Update this after addressing [issue #6](https://github.com/ssciwr/onehealth-data-backend/issues/6).*

In [None]:
# inspect the aggregated data
agg_ds = xr.open_dataset(aggregated_file)
agg_ds

In [None]:
agg_ds[["t2m", "total-population"]].sel(NUTS_ID="DE").to_dataframe().tail(5)