# Preparing data files

This notebook prepares data files according to the [data flowchart](../../../datalake_database/#data-flowchart).

In [None]:
from onehealth_data_backend import inout
from onehealth_data_backend import preprocess, utils
from pathlib import Path
import time
import xarray as xr

In [None]:
# change to your own data folder, if needed
data_folder = Path("../../../data/in/")

## Download ERA5-Land data

To download ERA5-Land data using CDS's API:
* Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present
* Go to tab `Download` of the dataset and select the data variables, time range, geographical area, etc. that you want to download
* At the end of the page, click on `Show API request code` and note the following information
    * `dataset`: name of the dataset
    * `request`: a dictionary that summarizes your download request
* Replace the values of `dataset` and `request` in the cell below accordingly

In [None]:
# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2016", "2017"],
    "month": [
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09",
        "10",
        "11",
        "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
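The twelve-entry `month` list can also be generated instead of typed out. The sketch below rebuilds the same request dictionary; `build_monthly_request` is a hypothetical helper introduced here for illustration and is not part of `onehealth_data_backend`.

```python
def build_monthly_request(variables, years):
    """Build a CDS request dict for monthly averaged data (illustrative helper)."""
    return {
        "product_type": ["monthly_averaged_reanalysis"],
        "variable": list(variables),
        "year": [str(y) for y in years],
        # zero-padded month strings "01" .. "12"
        "month": [f"{m:02d}" for m in range(1, 13)],
        "time": ["00:00"],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }


request_sketch = build_monthly_request(
    ["2m_temperature", "total_precipitation"], [2016, 2017]
)
```

The result should match the hand-written `request` above, so either can be passed on to the following cells.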

In [None]:
data_format = request.get("data_format")

# file name of downloaded data
file_name = inout.get_filename(
    ds_name=dataset,
    data_format=data_format,
    years=request["year"],
    months=request["month"],
    has_area="area" in request,
    base_name="era5_data",
    variable=request["variable"],
)
output_file = data_folder / file_name

In [None]:
# download data
if not output_file.exists():
    print("Downloading data...")
    inout.download_data(output_file, dataset, request)
else:
    print("Data already exists at {}".format(output_file))
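Whether `inout.download_data` creates missing parent folders is not stated here, so it may be safer to create the target folder before downloading. A minimal sketch using only `pathlib`; `ensure_parent_dir` is a hypothetical helper, and the temporary directory stands in for the real data folder.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if it does not exist yet."""
    file_path = Path(file_path)
    file_path.parent.mkdir(parents=True, exist_ok=True)
    return file_path


# demo in a throwaway directory instead of the real data folder
with tempfile.TemporaryDirectory() as tmp:
    demo_file = ensure_parent_dir(Path(tmp) / "data" / "in" / "era5_demo.nc")
    parent_created = demo_file.parent.is_dir()  # folder now exists
    file_created = demo_file.exists()  # the file itself is not created
```

In this notebook, the helper could be called as `ensure_parent_dir(output_file)` just before the download cell.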

## Load settings

First, we load the default settings, which configure the preprocessing steps.

In [None]:
settings = utils.get_settings(
    setting_path="default",
    new_settings={},
    updated_setting_dir=None,
    save_updated_settings=False,
)

To be updated: more details about the default settings will be provided.

*Note: in future PRs, the preprocessing settings may be integrated into the preprocess function so that they are saved in the same place as the preprocessed files ([issue #4](https://github.com/ssciwr/onehealth-data-backend/issues/4)).*

## Preprocess data

### Preprocess ERA5-Land data

In [None]:
# disable truncation of dates
settings["truncate_date"] = False

print("Preprocessing ERA5-Land data...")
t0 = time.time()
preprocessed_dataset = preprocess.preprocess_data_file(
    netcdf_file=output_file,
    settings=settings,
)
t_preprocess = time.time()
print("Preprocessing completed in {:.2f} seconds.".format(t_preprocess - t0))

The preprocessed dataset is also saved as a `.nc` file in the same folder, named `era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc`.

Details on the file naming convention can be found in [Data](../../data.md).

### Preprocess population data

Instructions for downloading population data (i.e. ISIMIP data) are presented in [Data](../../data.md) and [Data Lake](../../datalake.md).

In [None]:
popu_file = data_folder / "population_histsoc_30arcmin_annual_1901_2021.nc"

In [None]:
settings["truncate_date"] = True
# disable unnecessary preprocessing steps
settings["adjust_longitude"] = False
settings["convert_kelvin_to_celsius"] = False
settings["convert_m_to_mm_precipitation"] = False
settings["resample_grid"] = False

print("Preprocessing population data...")
t0 = time.time()
preprocessed_popu = preprocess.preprocess_data_file(
    netcdf_file=popu_file,
    settings=settings,
)
t_popu = time.time()
print("Preprocessing population data completed in {:.2f} seconds.".format(t_popu - t0))

The preprocessed dataset is also saved as a `.nc` file in the same folder.
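Both preprocessing cells modify the same `settings` dictionary in place, so flags changed for one dataset carry over to the next unless they are reset. One way to avoid this, sketched below, is to derive a separate copy per dataset. This assumes `settings` behaves like a plain dictionary; `with_overrides` is a hypothetical helper, and `default_settings` is a stand-in holding only the keys used in this notebook with assumed values, not the actual defaults.

```python
import copy


def with_overrides(settings, **overrides):
    """Return a deep copy of settings with the given keys replaced."""
    updated = copy.deepcopy(settings)
    updated.update(overrides)
    return updated


# stand-in values for illustration only, not the library defaults
default_settings = {
    "truncate_date": True,
    "adjust_longitude": True,
    "convert_kelvin_to_celsius": True,
    "convert_m_to_mm_precipitation": True,
    "resample_grid": True,
}

era5_settings = with_overrides(default_settings, truncate_date=False)
popu_settings = with_overrides(
    default_settings,
    adjust_longitude=False,
    convert_kelvin_to_celsius=False,
    convert_m_to_mm_precipitation=False,
    resample_grid=False,
)
```

Each copy can then be passed as `settings=` to `preprocess.preprocess_data_file` without affecting the other run.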

## Aggregate data by NUTS regions

When analyzing data across European regions, it is more convenient to work with data aggregated by NUTS regions than by grid points (latitude and longitude). In this section, we demonstrate how ERA5‑Land and population data can be aggregated into NUTS regions.

This feature can also be applied to other datasets in NetCDF format that include the coordinates `latitude`, `longitude`, and `time` (e.g. prediction model outputs).

*Note: the file names of the preprocessed ERA5-Land and population data should be obtained directly from the output of the `preprocess_data_file()` function ([issue #15](https://github.com/ssciwr/onehealth-data-backend/issues/15)).*

In [None]:
# NUTS shapefile
nuts_file = data_folder / "NUTS_RG_20M_2024_4326.shp.zip"

# preprocessed ERA5-Land file
preprocessed_era5_file = (
    data_folder
    / "era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc"
)

# preprocessed population file
preprocessed_popu_file = (
    data_folder / "population_histsoc_30arcmin_annual_1901_2021_unicoords_2020_2021.nc"
)
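Because these paths are typed by hand, a typo would only surface once aggregation starts. A quick existence check can catch it earlier. `find_missing` is a hypothetical helper, and the temporary files below stand in for the real inputs.

```python
import tempfile
from pathlib import Path


def find_missing(paths):
    """Return the paths that do not exist on disk, in input order."""
    return [Path(p) for p in paths if not Path(p).exists()]


# demo with one existing and one non-existing file
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.nc"
    present.touch()
    absent = Path(tmp) / "absent.nc"
    missing_names = [p.name for p in find_missing([present, absent])]
```

In this notebook, the check would read `find_missing([nuts_file, preprocessed_era5_file, preprocessed_popu_file])`, and an empty list means all inputs were found.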

We can aggregate a single NetCDF file or multiple NetCDF files with one NUTS shapefile. These NetCDF files should be structured into a dictionary, as follows:

In [None]:
# forming a dictionary for non-NUTS data
non_nuts_data = {
    "era5": (preprocessed_era5_file, None),
    "popu": (preprocessed_popu_file, None),
}

Here, the keys represent dataset names (used to form the resulting file name), and the values are tuples containing the file path and the aggregation mapping.

By default, the aggregation mapping is set to `None`, which means the `mean` function will be applied to all data variables during aggregation.

An example of aggregation mapping is:

```
{
    "t2m": "mean", 
    "tp": "sum"
}
```
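Following the tuple structure described above, a mapping takes the place of `None` for the dataset it applies to. A sketch, assuming the ERA5-Land variables are named `t2m` and `tp` as in the example mapping; the file paths repeat those defined earlier, and `non_nuts_data_custom` is a name chosen here for illustration.

```python
from pathlib import Path

data_folder = Path("../../../data/in/")

# per-variable aggregation for ERA5-Land, taken from the example mapping
era5_mapping = {"t2m": "mean", "tp": "sum"}

non_nuts_data_custom = {
    "era5": (
        data_folder
        / "era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc",
        era5_mapping,
    ),
    # None keeps the default behaviour (mean for all variables)
    "popu": (
        data_folder
        / "population_histsoc_30arcmin_annual_1901_2021_unicoords_2020_2021.nc",
        None,
    ),
}
```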

The resulting file name would be:

`<NUTS_shapefile_name>_agg_<nc_dataset_names>_<min_yyyy-mm>-<max_yyyy-mm>.nc`
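To make the pattern concrete, the sketch below fills it in by hand. It only illustrates the template: `sketch_agg_name` is a hypothetical helper, the `_` separator between dataset names and the removal of `.shp.zip` from the shapefile name are assumptions, and the two dates are placeholders. The name actually produced by `aggregate_data_by_nuts` may differ.

```python
def sketch_agg_name(nuts_name, dataset_names, min_ym, max_ym):
    """Fill in the file name template (illustrative, not the library's logic)."""
    stem = nuts_name.split(".")[0]  # assumption: drop everything after the first dot
    datasets = "_".join(dataset_names)  # assumption: names joined with "_"
    return f"{stem}_agg_{datasets}_{min_ym}-{max_ym}.nc"


example_name = sketch_agg_name(
    "NUTS_RG_20M_2024_4326.shp.zip",
    ["era5", "popu"],
    "2016-01",  # placeholder start month
    "2021-12",  # placeholder end month
)
```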

In [None]:
# aggregate data by NUTS regions
t0 = time.time()
aggregated_file = preprocess.aggregate_data_by_nuts(
    non_nuts_data, nuts_file, normalize_time=True, output_dir=None
)
t1 = time.time()
print(f"Aggregation completed in {t1 - t0:.2f} seconds")

In this example, we use the default values for `normalize_time` and `output_dir`, which are `True` and `None`, respectively.

The `normalize_time` option ensures that time values in the NetCDF file are reset to the start of the day. For example, `2025-10-01T12:00:00` becomes `2025-10-01T00:00:00`. This is particularly useful for population data, where time values are recorded at midday.
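The effect of this normalization can be reproduced on a single timestamp with the standard library. This is only an illustration of the behaviour described above, not the code used inside `aggregate_data_by_nuts`; `normalize_to_midnight` is a hypothetical helper.

```python
from datetime import datetime


def normalize_to_midnight(ts):
    """Reset the time-of-day part of a timestamp to 00:00:00."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


midday = datetime(2025, 10, 1, 12, 0, 0)
normalized = normalize_to_midnight(midday)
print(normalized.isoformat())  # 2025-10-01T00:00:00
```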

When `output_dir` is set to `None`, the aggregated file is saved in the same directory as the NUTS shapefile.

*Note: Since the ERA5-Land data covers 2016-2017 while the population data covers 2020-2021, the aggregated data ranges from 2016 to 2021.*

*The population data should be truncated to the same time frame for demonstration purposes. Update this after addressing [issue #6](https://github.com/ssciwr/onehealth-data-backend/issues/6).*
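The mismatch between the two time frames can be checked with a few lines of arithmetic on the year ranges given in the note. Both functions are hypothetical helpers for illustration and do not touch the NetCDF files.

```python
def combined_span(*year_ranges):
    """Return (earliest start, latest end) over all (start, end) year ranges."""
    starts, ends = zip(*year_ranges)
    return min(starts), max(ends)


def overlap(a, b):
    """Return the common (start, end) of two year ranges, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


era5_years = (2016, 2017)
popu_years = (2020, 2021)

span = combined_span(era5_years, popu_years)  # full range of the aggregated file
common = overlap(era5_years, popu_years)  # years covered by both datasets
```

Here `span` is `(2016, 2021)` and `common` is `None`, which confirms that the two example datasets share no years.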

In [None]:
# inspect the aggregated data
agg_ds = xr.open_dataset(aggregated_file)
agg_ds

In [None]:
agg_ds[["t2m", "total-population"]].sel(NUTS_ID="DE").to_dataframe().tail(5)