A small utility to retrieve ECMWF forecast data for lists of geographic points and time ranges, and to store the results as NetCDF (.nc) files together with metadata.
Key features
- Build and execute ECMWF requests for a set of coordinates and a time range
- Compute the smallest bounding box for requested points and choose an appropriate grid resolution
- Save retrieved NetCDF files and store retrieval metadata and queries in a simple index (index.csv)
- Provide a small CLI and configuration via YAML (and optional environment overrides)
- Python 3.12
- Mamba (recommended) to create and manage the environment
Refer to requirements.txt for the exact dependency list and versions.
Create the environment with Mamba:
```bash
mamba create -n ecmwf-utils python=3.12
```

Install the dependencies into the environment:

```bash
mamba run -n ecmwf-utils pip install -r requirements.txt
# or
mamba env create -n ecmwf-utils -f environment.yml
```

To run any commands in this project, you need ECMWF API credentials. These can be set as environment variables or in an .ecmwfapirc file. The lookup logic is as follows:
- Step 1: the environment is checked for the variables `ECMWF_API_KEY`, `ECMWF_API_URL` and `ECMWF_API_EMAIL`. To use this, add them to a `.env` file in the project root.
  - If all are found and non-empty, return their values as a Python tuple.
  - If only some are found and non-empty, assume an incomplete API key and raise `APIKeyFetchError`.
  - If none are found, or they are found but empty, assume no API key is available in the environment and continue to the next step.
- Step 2: the environment is checked for the variable `ECMWF_API_RC_FILE`, meant to point to a user-defined API key file. To use this, add it to a `.env` file in the project root.
  - If found but pointing to a file that does not exist, raise `APIKeyNotFoundError`.
  - If found and the file it points to exists but cannot be read, or contains an invalid API key, raise `APIKeyFetchError`.
  - If found and the file it points to exists, can be read, and contains a valid API key, return the API key as a Python tuple.
  - If not found, or empty, assume no user-provided API key file and continue to the next step.
- Step 3: try the default `~/.ecmwfapirc` file in the home folder. Same behaviour as step 2, except that when `~/.ecmwfapirc` is not found we continue to the next step.
- Step 4: no API key was found, so fall back to anonymous access. This will fail, since we request MARS, which is access controlled.
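For reference, here is a minimal Python sketch of the lookup order described above. The helper name, exception classes and tuple ordering are assumptions for illustration; the project's actual implementation may differ.

```python
# Illustrative sketch only: helper name, exception classes and tuple ordering
# are assumptions, not the project's actual API.
import json
import os
from pathlib import Path

class APIKeyFetchError(Exception): ...
class APIKeyNotFoundError(Exception): ...

def fetch_api_credentials() -> tuple[str, str, str]:
    # Step 1: full set of environment variables
    values = [os.environ.get(k, "") for k in ("ECMWF_API_KEY", "ECMWF_API_URL", "ECMWF_API_EMAIL")]
    if all(values):
        return tuple(values)  # (key, url, email)
    if any(values):
        raise APIKeyFetchError("Incomplete API key in environment")

    # Step 2: user-provided key file; Step 3: default ~/.ecmwfapirc
    candidates = []
    if os.environ.get("ECMWF_API_RC_FILE"):
        candidates.append((Path(os.environ["ECMWF_API_RC_FILE"]), True))   # missing file is an error
    candidates.append((Path.home() / ".ecmwfapirc", False))                # missing file falls through

    for rc_path, required in candidates:
        if not rc_path.exists():
            if required:
                raise APIKeyNotFoundError(f"{rc_path} does not exist")
            continue
        try:
            rc = json.loads(rc_path.read_text())
            return rc["key"], rc["url"], rc["email"]
        except (OSError, KeyError, json.JSONDecodeError) as exc:
            raise APIKeyFetchError(f"Could not read a valid API key from {rc_path}") from exc

    # Step 4: nothing found; anonymous access will fail for MARS
    raise APIKeyFetchError("No ECMWF API credentials found")
```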
Configuration is read from ./config/config.yml. Important settings are:
- `name`: string — optional identifier for the configuration, stored in the retrieval index for traceability. Defaults to `default` if not provided.
- `model`: string — ECMWF model to query (either `hres` or `ens`)
- `level`: string — the level to query (currently only `surface` is supported)
- `retrieval_mode`: string — whether to retrieve a grid or a single point (either `grid` or `point`)
- `batch_issue`: bool or int — controls batching of issue datetimes during retrieval (a sketch of this grouping follows the example config below).
  - If `False`, each issue datetime is queried independently (one request per issue).
  - If an integer `N > 0`, issue datetimes are grouped into batches spanning `N` consecutive days, and each batch is retrieved in a single request. This reduces the number of API calls at the cost of larger individual requests.
- `format`: string — the format of the output files (either `netcdf` for `.nc` files or `grib2` for `.grib` files)
- `variables`: list of string — ECMWF parameter codes to request (e.g. `['2t', '10u', '10v']`)
- `issue_hours`: list of string — hours of the day to retrieve the issued forecasts (e.g. `["00", "12"]` for model `hres` or `["00", "06", "12", "18"]` for model `ens`)
- `lookback`: integer — forecast window (hours)
- `step_granularity`: integer — step interval in hours (e.g. `1` for hourly output)
Here is an example for config/config.yml:
```yaml
name: default
model: hres           # str, either 'hres' or 'ens'
level: surface        # str, only 'surface' is supported for now
retrieval_mode: point # str, either 'point' or 'grid'
batch_issue: 10       # bool or int; if False, process each issue hour separately; if int, process that many issue days at once
format: grib2         # str, either 'grib2' or 'netcdf'
variables: ["2t", "10u", "10v", "msl", "2d", "tp", "sf", "cp", "lsp", "sd"]
issue_hours: ["00", "06", "12", "18"]
lookback: 48          # int, in hours
step_granularity: 1   # int, in hours
```

This configuration requests the HRES model at surface level with ten variables, a 48-hour forecast window, and a 1-hour step interval.
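To illustrate the `batch_issue: 10` setting above, here is a minimal sketch (with assumed helper names) of how issue dates could be grouped into batches of N consecutive days:

```python
# Illustrative only: the helper name and return shape are assumptions.
from datetime import date, timedelta

def batch_issue_dates(start: date, end: date, batch_issue) -> list[list[date]]:
    """Group the issue dates in [start, end] into batches of `batch_issue`
    consecutive days, or one date per batch when batch_issue is False."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    size = 1 if batch_issue is False else int(batch_issue)
    return [days[i:i + size] for i in range(0, len(days), size)]

# With the example query further below (2016-01-01 to 2016-01-15) and batch_issue: 10,
# this yields two batches: 2016-01-01..2016-01-10 and 2016-01-11..2016-01-15.
```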
More on the variables can be found at the end of this README.
Three environment variables can be defined:
- `LOG_FILE_PATH`: overrides the default path for the log file (DEBUG level)
- `LANDING_PATH`: path to the landing directory
- `STAGING_PATH`: path to the staging file (CSV file; Parquet will be accepted in a future update)
To define these variables, either use a .env file or run the following commands directly in your terminal:

```bash
export LOG_FILE_PATH="./logs/DEBUG.log"
export LANDING_PATH="./data/landing/"
export STAGING_PATH="./data/staging/main.csv"
```

If you're using a .env file, keep it in the root of this repository as best practice.
The package exposes a module-based CLI that now uses subcommands. There are two primary subcommands:
- `retrieval` — run the data retrieval pipeline
- `preprocess` — run the data preprocessing pipeline (WIP)
Examples:
```bash
# Retrieval: run the default query ./queries/default.json
mamba run -n ecmwf-utils python -m src retrieval

# Retrieval: run a specific query
mamba run -n ecmwf-utils python -m src retrieval --query-path ./queries/example.json

# Retrieval: run a specific query & a specific config file
mamba run -n ecmwf-utils python -m src retrieval --query-path ./queries/example.json --config-path ./config/config_example.yml

# Retrieval: run a specific query with a specific model, using parallel processing
mamba run -n ecmwf-utils python -m src retrieval --query-path ./queries/example.json --model ens --concurrent-jobs 5

# Retrieval: dry run (performs queries but does not save any files)
mamba run -n ecmwf-utils python -m src retrieval --dry-run

# Retrieval: cost-only mode (runs only the cost query, no data query or save)
mamba run -n ecmwf-utils python -m src retrieval --skip-query

# Retrieval: skip cost estimation (directly runs data queries)
mamba run -n ecmwf-utils python -m src retrieval --skip-cost

# Preprocess (using env variables)
mamba run -n ecmwf-utils python -m src preprocess

# Preprocess (overriding env variables)
mamba run -n ecmwf-utils python -m src preprocess --landing-path ./data/landing/ --staging-path ./data/staging/main.csv
```

Retrieval options (summary):
- `--model`: model type (`hres` or `ens`)
- `--level`: level type (only `surface` is implemented)
- `--query-path`: path to the query JSON
- `--landing-path`: path to the folder where retrieved data files are saved (overrides the `LANDING_PATH` env variable)
- `--config-path`: path to the configuration file to use; overrides the default config path (`./config/config.yml`)
- `--dry-run`: simulate retrievals without finalizing saved entries
- `--skip-cost`: skip the cost query step entirely
- `--skip-query`: skip the actual data retrieval (no save occurs, even if `--dry-run` is not set)
- `--concurrent-jobs`: maximum number of simultaneous API requests to execute; use >1 for parallel execution (e.g., 5). Default is 1 (sequential)
- `--verbose`: enable more verbose logging (not implemented yet)
Preprocess options (WIP):
- `--landing-path`: folder with raw retrieved files (overrides the `LANDING_PATH` env variable)
- `--staging-path`: output file path for preprocessed data (overrides the `STAGING_PATH` env variable)
CLI parsing lives in src/setup/cli.py.
CLI parameters override environment variables, and environment variables override YAML configuration values (a minimal sketch of this precedence logic follows the table). The table below summarizes all configuration variables the user has access to:
| Parameter | YAML config file | Environment variable | CLI | Default | Type |
|---|---|---|---|---|---|
| Model | Y | - | Y | hres | str |
| Level | Y | - | Y | surface (only one implemented) | str |
| Retrieval Mode | Y | - | - | point | str |
| Batch Issue | Y | - | - | False | bool or int |
| Format | Y | - | - | netcdf | str |
| Variables | Y | - | - | [] (empty list) | list of str |
| Issue Hours | Y | - | - | [] (empty list) | list of str |
| Lookback (window) | Y | - | - | 48 | int |
| Step granularity | Y | - | - | 1 | int |
| Logging file path | - | Y | - | ./logs/DEBUG.log | Path |
| Concurrent Jobs | - | - | Y | 1 | int |
| Logging verbosity | - | - | Y | INFO | str |
| Query path | - | - | Y | ./queries/default.json | Path |
| Landing path | - | Y | Y | ./data/landing/ | Path |
| Staging path | - | Y | Y | ./data/staging/ | Path |
| Dry run | - | - | Y | False | bool |
| … | … | … | … | … | … |
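For illustration, here is a minimal sketch of the precedence rule (CLI over environment over YAML). The function and variable names are assumptions, not the project's actual API:

```python
# Illustrative only: resolve() is a hypothetical helper, not part of this project.
import os

def resolve(cli_value, env_var, yaml_value, default):
    """Return the first value that is set: CLI, then environment, then YAML, then default."""
    if cli_value is not None:
        return cli_value
    if env_var and os.environ.get(env_var):
        return os.environ[env_var]
    if yaml_value is not None:
        return yaml_value
    return default

# e.g. model = resolve(args.model, None, config.get("model"), "hres")
# e.g. landing_path = resolve(args.landing_path, "LANDING_PATH", None, "./data/landing/")
```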
A query is a JSON file with a time_range (ISO 8601 strings) and a points array of [lat, lon] pairs (and an optional name, with default empty str). Example:
```json
{
  "name": "default",
  "time_range": {
    "start": "2016-01-01T00:00:00Z",
    "end": "2016-01-15T00:00:00Z"
  },
  "points": [
    [55.902502, -2.306389],
    [55.900008, -2.301268]
  ]
}
```

The query is parsed by src/query.py into Query, PointCloud and TimeRange dataclasses.
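The shapes below are a rough sketch of what such dataclasses could look like; field names beyond those in the JSON example are assumptions, not the actual definitions in src/query.py:

```python
# Illustrative sketch only; see src/query.py for the real definitions.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimeRange:
    start: datetime
    end: datetime

@dataclass
class PointCloud:
    points: list[tuple[float, float]]  # (lat, lon) pairs

@dataclass
class Query:
    name: str
    time_range: TimeRange
    points: PointCloud

def parse_query(raw: dict) -> Query:
    """Build a Query from a parsed JSON document like the example above."""
    return Query(
        name=raw.get("name", ""),
        time_range=TimeRange(
            start=datetime.fromisoformat(raw["time_range"]["start"]),
            end=datetime.fromisoformat(raw["time_range"]["end"]),
        ),
        points=PointCloud(points=[(lat, lon) for lat, lon in raw["points"]]),
    )
```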
Core steps performed by the code:
- Load the configuration and parse the query JSON
- Build a base MARS request using, for instance, `variables`, `lookback` and `step_granularity`
- Create requests for each issued time for the ECMWF API:
  - If the retrieval mode is `grid`, compute the smallest bounding box for the given points and generate the appropriate `area` and `grid` request parameters (see the sketch after this list)
  - If the retrieval mode is `point`, create one request per point in the query
- Iterate over the requested dates and issued hours (`issue_hours`) and request forecasts
- Allocate storage paths, write the NetCDF file returned by ECMWF, save the query JSON alongside it, and add an entry to `index.csv`
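As a rough illustration of the grid-mode step, here is a minimal sketch of a bounding-box computation. The helper name and the fixed 0.1-degree grid are assumptions; the project picks the resolution based on the requested points:

```python
# Illustrative only: bounding_box() is a hypothetical helper, and the fixed
# 0.1-degree grid stands in for the project's resolution selection logic.
def bounding_box(points: list[tuple[float, float]]) -> dict[str, str]:
    """Return MARS-style 'area' (N/W/S/E) and 'grid' parameters for a list of (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    north, south = max(lats), min(lats)
    west, east = min(lons), max(lons)
    return {
        "area": f"{north}/{west}/{south}/{east}",
        "grid": "0.1/0.1",
    }

# With the two points from the example query:
# bounding_box([(55.902502, -2.306389), (55.900008, -2.301268)])
# -> {'area': '55.902502/-2.306389/55.900008/-2.301268', 'grid': '0.1/0.1'}
```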
Key modules:
- `src/ecmwf_client` — manages the builder and executor of MARS requests
- `src/storage.py` — manages allocation, finalization and the `index.csv`
- `src/query.py` — query dataclasses and parsing
By default the output folder is at ./data/landing/ and can be defined using the environment variable LANDING_PATH or the CLI flag --landing-path. The layout created by StorageManager is:
```
landing/
├── index.csv
├── queries/
│   ├── query_A.json
│   ├── query_B.json
│   └── ...
└── data/
    ├── ecmwf_hres_sfc_YYYY-mm-DD HH:MM_timestamp1.nc
    ├── ecmwf_hres_sfc_YYYY-mm-DD HH:MM_timestamp2.nc
    ├── ecmwf_ens_sfc_YYYY-mm-DD HH:MM_timestamp3.nc
    ├── ecmwf_ens_sfc_YYYY-mm-DD HH:MM_timestamp4.nc
    └── ...
```
Each retrieval is described by a RetrievalMeta and RetrievalTicket and includes deterministic IDs (SHA-256 truncated) used in the index.
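For intuition, a deterministic truncated SHA-256 ID could be derived roughly like this; the fields hashed and the truncation length are assumptions, not the project's exact scheme:

```python
# Illustrative only: the hashed fields and truncation length are assumptions.
import hashlib
import json

def retrieval_id(request: dict, length: int = 12) -> str:
    """Stable ID: SHA-256 of the canonicalised request, truncated."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]

# retrieval_id({"model": "hres", "issue": "2016-01-01T00:00:00Z", "point": [55.9025, -2.3064]})
# always returns the same 12-character hex string for the same request.
```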
Logging is configured in config/logging.yml and is set up at program start (see src/__main__.py). Important points:
- The console handler prints INFO+ messages by default
- A debug file handler writes DEBUG logs to `logs/DEBUG.log` (can be overridden with the `LOG_FILE_PATH` env variable)
- Module loggers (e.g. `src`, `src.setup`, `ecmwfapi`) are configured to use both handlers
Adjust config/logging.yml to change handler levels or formats. The --verbose flag is reserved for enabling more verbose console output but currently only exists as a placeholder flag in the CLI.
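The setup described above could be wired roughly as follows; this is a sketch that assumes PyYAML and a file handler named `debug_file`, not the project's exact code in src/__main__.py:

```python
# Illustrative only: the handler name and config layout are assumptions.
import logging.config
import os
import yaml

def setup_logging(config_path: str = "./config/logging.yml") -> None:
    with open(config_path) as fh:
        config = yaml.safe_load(fh)
    # Optionally redirect the DEBUG file handler via LOG_FILE_PATH.
    log_path = os.environ.get("LOG_FILE_PATH")
    if log_path and "debug_file" in config.get("handlers", {}):
        config["handlers"]["debug_file"]["filename"] = log_path
    logging.config.dictConfig(config)
```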
- If a retrieval fails, an error is logged and partial files (if any) are removed by the storage manager
- `--dry-run` exercises allocation and request construction, but finalization into `index.csv` is skipped
Here is a non-exhaustive mapping of variables that can be retrieved from one model or the other.
| Short Name | Long Name | Field Code | Available in HRES | Available in ENS |
|---|---|---|---|---|
| `mx2t3` | Maximum 2 m temperature (last 3 h) | 26.228 | Not tested | Not tested |
| `mn2t3` | Minimum 2 m temperature (last 3 h) | 27.228 | Not tested | Not tested |
| `10fg3` | Maximum 10 m wind gust (last 3 h) | 28.228 | Not tested | Not tested |
| `10fg` | Maximum 10 m wind gust | 49.128 | Not tested | Not tested |
| `mx2t6` | Maximum 2 m temperature (last 6 h) | 121.128 | Not tested | Not tested |
| `mn2t6` | Minimum 2 m temperature (last 6 h) | 122.128 | Not tested | Not tested |
| `10fg6` | Maximum 10 m wind gust (last 6 h) | 123.128 | Not tested | Not tested |
| `sd` | Snow depth | 141.228 | Yes | Not tested |
| `sf` | Snow fall | 144.128 | Yes | Not tested |
| `cp` | Convective precipitation | 143.128 | Yes | Not tested |
| `lsp` | Large scale precipitation | 142.128 | Yes | Not tested |
| `msl` | Mean sea level pressure | 151.128 | Yes | Not tested |
| `10u` | 10-metre eastward wind component | 165.128 | Yes | Yes |
| `10v` | 10-metre northward wind component | 166.128 | Yes | Yes |
| `2t` | 2 metre temperature | 167.128 | Yes | Yes |
| `2d` | 2 metre dewpoint temperature | 168.128 | Yes | Yes |
| `mx2t` | Maximum 2 m temperature (since previous post-processing) | 201.128 | Yes | Not tested |
| `mn2t` | Minimum 2 m temperature (since previous post-processing) | 202.128 | Yes | Not tested |
| `tp` | Total precipitation | 228.128 | Yes | Not tested |
| `100u` | 100 m eastward wind component | 246.228 | Yes | Yes |
| `100v` | 100 m northward wind component | 247.228 | Yes | Yes |
Note:
The number following the dot (e.g., .128, .228) refers to the GRIB parameter table from which the variable originates. In most practical cases, the table number does not affect data retrieval, as the short name (e.g., 10u, tp) uniquely identifies the field within ECMWF’s datasets. However, it can matter when working directly with raw GRIB files or older data streams, where identical parameter numbers may exist in different tables.
Disclaimer:
Availability of variables in the ENS dataset has not been fully verified. Some entries are marked as “Not tested” because ENS data requests can take several hours to complete. If you test any of these variables and confirm whether they work (or don’t), feel free to update this README so future users can benefit from your results.
- Consider adding a guard to prevent extremely large queries (too many points) that could overload the API or hit request limits (a possible shape is sketched after this list)
- The model-level `levelist` used in `src/ecmwf_client.py` is a placeholder — confirm the correct levels for your use case
- Issued times currently use `00` and `12`. If `06`/`18` are required, confirm availability and support in the `ecmwfapi` client
- Add grid resolution to the config if you want it configurable per run
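Such a guard could look roughly like this; the function name and the limit value are purely illustrative:

```python
# Illustrative only: MAX_POINTS and the function name are assumptions.
MAX_POINTS = 500

def check_query_size(points: list[tuple[float, float]]) -> None:
    """Reject queries with too many points before any API request is built."""
    if len(points) > MAX_POINTS:
        raise ValueError(f"Query has {len(points)} points; the limit is {MAX_POINTS}")
```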
WARNING: The preprocessing part might have been broken by the recent additions to the retrieval pipeline.