# Preparing data files

The steps below prepare data files according to the [data flowchart](../../../datalake_database/#data-flowchart).

In [1]:
from onehealth_data_backend import inout
from onehealth_data_backend import preprocess, utils
from pathlib import Path
import time

In [2]:
# change to your own data folder, if needed
data_folder = Path("../../../data/in/")

## Download ERA5-Land data

To download ERA5-Land data using the CDS API:
* Select the target dataset, e.g. ERA5-Land monthly averaged data from 1950 to present.
* Go to the `Download` tab of the dataset and select the data variables, time range, geographical area, etc. that you want to download.
* At the end of the page, click on `Show API request code` and take note of the following information:
    * `dataset`: name of the dataset
    * `request`: a dictionary that summarizes your download request
* Replace the values of `dataset` and `request` in the cell below accordingly.
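
The request code that CDS displays is built on the `cdsapi` package and typically has the shape sketched below. This is an illustration only and not necessarily what `inout.download_data` does internally. The wrapper name `retrieve_from_cds` and its `client` parameter are hypothetical, added so the call can be exercised without network access. A real run needs `cdsapi` installed and CDS credentials configured (usually in `~/.cdsapirc`).

```python
from pathlib import Path


def retrieve_from_cds(dataset, request, target, client=None):
    """Download `request` from `dataset` into `target` via the CDS API.

    Hypothetical wrapper for illustration; `client` can be any object
    with a cdsapi-style `retrieve(name, request, target)` method.
    """
    if client is None:
        # Imported lazily so the sketch can be read and tested without cdsapi.
        import cdsapi

        client = cdsapi.Client()  # reads credentials, e.g. from ~/.cdsapirc
    target = Path(target)
    # Make sure the destination folder exists before downloading.
    target.parent.mkdir(parents=True, exist_ok=True)
    client.retrieve(dataset, request, str(target))
    return target
```

Passing a stub object as `client` makes it possible to check the request that would be sent before spending time in the CDS queue.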

In [3]:
# replace dataset and request with your own values
dataset = "reanalysis-era5-land-monthly-means"
request = {
    "product_type": ["monthly_averaged_reanalysis"],
    "variable": ["2m_temperature", "total_precipitation"],
    "year": ["2016", "2017"],
    "month": [
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09",
        "10",
        "11",
        "12",
    ],
    "time": ["00:00"],
    "data_format": "netcdf",
    "download_format": "unarchived",
}
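
The twelve month strings do not have to be typed out by hand. As a small sketch, the same `request` can be assembled with a comprehension. The helper name `build_request` is hypothetical, and the keys simply mirror those in the cell above.

```python
def build_request(years, variables):
    """Assemble a CDS request dict for monthly means (illustrative helper)."""
    return {
        "product_type": ["monthly_averaged_reanalysis"],
        "variable": list(variables),
        "year": [str(y) for y in years],
        # Zero-padded month strings "01" .. "12".
        "month": [f"{m:02d}" for m in range(1, 13)],
        "time": ["00:00"],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }


request = build_request(range(2016, 2018), ["2m_temperature", "total_precipitation"])
```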

In [4]:
data_format = request.get("data_format")

# file name of downloaded data
file_name = inout.get_filename(
    ds_name=dataset,
    data_format=data_format,
    years=request["year"],
    months=request["month"],
    has_area="area" in request,
    base_name="era5_data",
    variable=request["variable"],
)
output_file = data_folder / file_name

In [5]:
# download data
if not output_file.exists():
    print("Downloading data...")
    inout.download_data(output_file, dataset, request)
else:
    print("Data already exists at {}".format(output_file))

Downloading data...


2025-07-11 10:21:34,611 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-07-11 10:21:34,810 INFO Request ID is 53f4b1a5-3f48-46e1-83f0-dea8cdbe61c2
2025-07-11 10:21:34,900 INFO status has been updated to accepted
2025-07-11 10:21:48,584 INFO status has been updated to running
2025-07-11 10:22:24,863 INFO status has been updated to successful


Data downloaded successfully to ../../../data/in/era5_data_2016_2017_all_2t_tp_monthly_raw.nc
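
The check-then-download pattern above can be wrapped in a small helper that also creates the target folder when it is missing. This is a minimal sketch: `ensure_file` and its `fetch` callback are hypothetical names, and in this notebook `fetch` could be something like `lambda p: inout.download_data(p, dataset, request)`.

```python
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, fetch: Callable[[Path], None]) -> bool:
    """Call `fetch(path)` only if `path` does not exist yet.

    Returns True if `fetch` was called, False if the file was already present.
    """
    path = Path(path)
    if path.exists():
        return False
    # Create the destination folder so the download cannot fail on a missing directory.
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)
    if not path.exists():
        raise RuntimeError(f"fetch did not create {path}")
    return True
```

Returning a flag instead of printing keeps the helper usable both in notebooks and in scripts.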


## Load settings

First, load the default settings, which configure the preprocessing steps.

In [6]:
settings = utils.get_settings(
    setting_path="default",
    new_settings={},
    updated_setting_dir=None,
    save_updated_settings=False,
)

TBU: more details about the default settings will be provided...
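
Later cells modify `settings` in place, so changes made for one dataset carry over to the next. One way to avoid this, sketched below, is to derive a separate copy per dataset. The sketch assumes the settings behave like a plain dictionary. The `defaults` dict is a stand-in for whatever `utils.get_settings` returns (only the key names come from this notebook, the values are assumed), and `with_overrides` is a hypothetical helper.

```python
import copy

# Stand-in for the loaded settings; the values are assumptions for illustration.
defaults = {
    "truncate_date": True,
    "adjust_longitude": True,
    "convert_kelvin_to_celsius": True,
    "convert_m_to_mm_precipitation": True,
    "resample_grid": True,
}


def with_overrides(settings, **overrides):
    """Return a deep copy of `settings` with `overrides` applied."""
    unknown = set(overrides) - set(settings)
    if unknown:
        # Catch misspelled keys instead of silently adding new ones.
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    updated = copy.deepcopy(settings)
    updated.update(overrides)
    return updated


era5_settings = with_overrides(defaults, truncate_date=False)
popu_settings = with_overrides(defaults, adjust_longitude=False, resample_grid=False)
```

Because each call returns a fresh copy, the order in which cells are run no longer affects which steps are enabled.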

## Preprocess data

### Preprocess ERA5-Land data

In [7]:
# disable truncation of dates
settings["truncate_date"] = False

print("Preprocessing ERA5-Land data...")
t0 = time.time()
preprocessed_dataset = preprocess.preprocess_data_file(
    netcdf_file=output_file,
    settings=settings,
)
t_preprocess = time.time()
print("Preprocessing completed in {:.2f} seconds.".format(t_preprocess - t0))

Preprocessing ERA5-Land data...
Renaming coordinates to unify them across datasets...
Adjusting longitude from 0-360 to -180-180...
Converting temperature from Kelvin to Celsius...
Converting precipitation from meters to millimeters...
Resampling grid to a new resolution...
Processed dataset saved to: ../../../data/in/era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc
Preprocessing completed in 22.91 seconds.
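
The log messages above name three coordinate and unit conversions. The arithmetic behind them can be sketched in plain Python as follows. This illustrates the formulas only, not the package's implementation (which presumably operates on whole arrays), and the function names are hypothetical.

```python
def wrap_longitude(lon_deg):
    """Map a longitude from [0, 360) to [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0


def kelvin_to_celsius(t_kelvin):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return t_kelvin - 273.15


def meters_to_millimeters(depth_m):
    """Convert a precipitation depth from meters to millimeters."""
    return depth_m * 1000.0


print(wrap_longitude(190.0))       # -170.0, i.e. 190 degrees east is 170 degrees west
print(kelvin_to_celsius(273.15))   # 0.0
print(meters_to_millimeters(0.5))  # 500.0
```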


The preprocessed dataset is also saved as a `.nc` file in the same folder, named `era5_data_2016_2017_all_2t_tp_monthly_unicoords_adjlon_celsius_mm_05deg_trim.nc`.

Details on the file naming convention can be found in [Data](../../data.md).

### Preprocess population data

Instructions for downloading population data (i.e. ISIMIP data) are presented in [Data](../../data.md) and [Data Lake](../../datalake.md).

In [12]:
popu_file = data_folder / "population_histsoc_30arcmin_annual_1901_2021.nc"

In [11]:
settings["truncate_date"] = True
# disable unnecessary preprocessing steps
settings["adjust_longitude"] = False
settings["convert_kelvin_to_celsius"] = False
settings["convert_m_to_mm_precipitation"] = False
settings["resample_grid"] = False

print("Preprocessing population data...")
t0 = time.time()
preprocessed_popu = preprocess.preprocess_data_file(
    netcdf_file=popu_file,
    settings=settings,
)
t_popu = time.time()
print("Preprocessing population data completed in {:.2f} seconds.".format(t_popu - t0))

Preprocessing population data...


ValueError: netcdf_file must be a valid file path.

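When preprocessing succeeds, the preprocessed dataset is also saved as a `.nc` file in the same folder. The `ValueError` in the output above indicates that `popu_file` was not accepted as a valid file path in this run, so make sure the population file has been downloaded to `data_folder` first.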
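
A quick existence check before calling `preprocess.preprocess_data_file` makes a wrong or missing input path easier to diagnose than the `ValueError` in the output above. This is a minimal sketch using only `pathlib`; `require_file` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise if it is not an existing file."""
    path = Path(path)
    if not path.is_file():
        # Show the resolved location, since relative paths depend on the working directory.
        raise FileNotFoundError(
            f"{path} not found (resolved to {path.resolve()}); "
            "download the file or adjust data_folder"
        )
    return path
```

For example, `popu_file = require_file(data_folder / "population_histsoc_30arcmin_annual_1901_2021.nc")` would fail early with the resolved path in the message.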