# Performance Analysis: Earthkit-Climate Indicators

This notebook provides a comparative performance analysis of the `earthkit-climate` indicators on the SSP5-8.5 dataset. We compare two execution modes:
1. **Lazy Execution**: The baseline approach where dask graphs are built and executed without specific optimizations.
2. **Optimized Execution**: An enhanced approach using pre-computation of heavy statistics (like percentiles) and strategic re-chunking.

We profile 5 key indicators:
- **WSDI**: Warm Spell Duration Index
- **CWD**: Maximum Consecutive Wet Days
- **DTR**: Daily Temperature Range
- **HDD**: Heating Degree Days
- **SDII**: Simple Daily Precipitation Intensity Index
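
For orientation, the two precipitation indicators can be written in a few lines of plain Python. This is a minimal sketch following the ETCCDI definitions and assuming a wet-day threshold of 1 mm/day; the function names and sample values are hypothetical, and this is not the `earthkit-climate` implementation.

```python
from itertools import groupby

WET_DAY_MM = 1.0  # assumed wet-day threshold in mm/day


def max_consecutive_wet_days(pr):
    """Length of the longest run of days with precipitation >= WET_DAY_MM."""
    wet_flags = (p >= WET_DAY_MM for p in pr)
    runs = (sum(1 for _ in grp) for is_wet, grp in groupby(wet_flags) if is_wet)
    return max(runs, default=0)


def simple_daily_intensity(pr):
    """Mean precipitation over wet days only (0.0 if there are none)."""
    wet = [p for p in pr if p >= WET_DAY_MM]
    return sum(wet) / len(wet) if wet else 0.0


# Illustrative daily precipitation values in mm/day
pr = [0.0, 2.0, 5.0, 0.2, 1.0, 3.0, 4.0, 0.0]
print(max_consecutive_wet_days(pr))  # 3
print(simple_daily_intensity(pr))  # 3.0
```

Both functions scan along the time axis, which is why the layout of the data along time matters for indicators of this kind.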

In [1]:
import os
import warnings

from earthkit.data import config

warnings.filterwarnings("ignore")

# Configure robust caching to avoid re-downloading
cache_dir = os.path.expanduser("~/.cache/earthkit/data")
os.makedirs(cache_dir, exist_ok=True)
settings_earthkit = {
    "cache-policy": "user",
    "temporary-directory-root": cache_dir,
}
config.set(settings_earthkit)

In [2]:
import os

from IPython.display import Markdown, display


def get_cpu_info():
    try:
        with open("/proc/cpuinfo", "r") as f:
            for line in f:
                if "model name" in line:
                    return line.split(":")[1].strip()
    except Exception:
        return "Unknown CPU"
    return "Unknown CPU"


def get_ram_info():
    try:
        with open("/proc/meminfo", "r") as f:
            for line in f:
                if "MemTotal" in line:
                    total_kb = int(line.split()[1])
                    return f"{total_kb / 1024 / 1024:.1f} GB"
    except Exception:
        return "Unknown RAM"
    return "Unknown RAM"


cpu_model = get_cpu_info()
ram_size = get_ram_info()

display(
    Markdown(f"""
## Resources Used

### Hardware Configuration
The performance analysis was conducted on the following hardware (dynamically detected):
- **CPU**: {cpu_model}
- **RAM**: {ram_size}
""")
)


## Resources Used

### Hardware Configuration
The performance analysis was conducted on the following hardware (dynamically detected):
- **CPU**: AMD Ryzen 5 5500H with Radeon Graphics
- **RAM**: 15.0 GB


In [3]:
import os
from typing import Any, Dict, List

import xarray as xr
from IPython.display import Markdown, display

import earthkit.data

# Dataset URLs
DATASET_URLS = [
    "https://sites.ecmwf.int/repository/earthkit-climate/tasmax_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_historical.nc",
    "https://sites.ecmwf.int/repository/earthkit-climate/tasmax_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc",
    "https://sites.ecmwf.int/repository/earthkit-climate/tasmin_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc",
    "https://sites.ecmwf.int/repository/earthkit-climate/pr_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc",
]


def format_size(size_bytes: float) -> str:
    """
    Convert a byte size into a human-readable string.

    Parameters
    ----------
    size_bytes : float
        File size in bytes.

    Returns
    -------
    str
        Size formatted as B, KB, MB, GB, or TB.
    """
    for unit in ["B", "KB", "MB", "GB"]:
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.1f} TB"


def extract_dataset_info(url: str) -> Dict[str, Any]:
    """
    Extract key metadata information from a NetCDF dataset available via URL.

    Parameters
    ----------
    url : str
        URL to the NetCDF dataset.

    Returns
    -------
    dict
        Dictionary containing:
        - Variable
        - Scenario
        - Dimensions
        - Size
        - Status ("Cached" or "Remote")
        - URL
    """
    ds = earthkit.data.from_source("url", url)

    # Detect file size & cache status
    if getattr(ds, "path", None) and os.path.exists(ds.path):
        size = format_size(os.path.getsize(ds.path))
        status = "Cached"
    else:
        size = "Unknown"
        status = "Remote"

    # Convert to xarray
    xr_ds = ds.to_xarray()

    # Extract primary variable
    variables = list(xr_ds.data_vars)
    variable = f"`{variables[0]}`" if variables else "Unknown"

    # Scenario metadata
    scenario = xr_ds.attrs.get("scenario", "Unknown")

    # Dimension string, e.g. "(time: 365, lat: 180, lon: 360)"
    dims_str = "(" + ", ".join(f"{k}: {v}" for k, v in xr_ds.sizes.items()) + ")"

    return {
        "Variable": variable,
        "Scenario": scenario,
        "Dimensions": dims_str,
        "Size": size,
        "Status": status,
        "URL": url,
    }


def generate_dataset_table(urls: List[str]) -> str:
    """
    Create a Markdown table summarizing metadata for a list of dataset URLs.

    Parameters
    ----------
    urls : list of str
        List of dataset URLs.

    Returns
    -------
    str
        A Markdown-formatted table of dataset metadata.
    """
    header = (
        "| Variable | Scenario | Dimensions | Size | Status | URL |\n"
        "|----------|----------|------------|------|--------|-----|\n"
    )

    rows = []
    for url in urls:
        info = extract_dataset_info(url)
        rows.append(
            f"| {info['Variable']} | {info['Scenario']} | "
            f"{info['Dimensions']} | {info['Size']} | {info['Status']} | "
            f"[Download]({info['URL']}) |"
        )

    return header + "\n".join(rows)


# Display Markdown report
display(
    Markdown(
        f"""
### Dataset Information

The analysis uses the following climate datasets derived from CMIP6 projections
(ACCESS-CM2 model, DeepESD downscaling). These datasets are hosted in the ECMWF
repository and are automatically downloaded or cached by **earthkit-data**.

{generate_dataset_table(DATASET_URLS)}

> **Note**: Dimensions and sizes are extracted dynamically. The first run may
download the files.
"""
    )
)


### Dataset Information

The analysis uses the following climate datasets derived from CMIP6 projections
(ACCESS-CM2 model, DeepESD downscaling). These datasets are hosted in the ECMWF
repository and are automatically downloaded or cached by **earthkit-data**.

| Variable | Scenario | Dimensions | Size | Status | URL |
|----------|----------|------------|------|--------|-----|
| `tasmax` | historical | (time: 7305, lat: 48, lon: 84) | 67.3 MB | Cached | [Download](https://sites.ecmwf.int/repository/earthkit-climate/tasmax_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_historical.nc) |
| `tasmax` | ssp585 | (time: 14610, lat: 48, lon: 84) | 126.9 MB | Cached | [Download](https://sites.ecmwf.int/repository/earthkit-climate/tasmax_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc) |
| `tasmin` | ssp585 | (time: 14610, lat: 48, lon: 84) | 132.1 MB | Cached | [Download](https://sites.ecmwf.int/repository/earthkit-climate/tasmin_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc) |
| `pr` | ssp585 | (time: 14610, lat: 48, lon: 84) | 111.5 MB | Cached | [Download](https://sites.ecmwf.int/repository/earthkit-climate/pr_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc) |

> **Note**: Dimensions and sizes are extracted dynamically. The first run may
download the files.


In [4]:
import time

import pandas as pd

from earthkit.climate.indicators.precipitation import (
    daily_precipitation_intensity,
    maximum_consecutive_wet_days,
)
from earthkit.climate.indicators.temperature import (
    daily_temperature_range,
    heating_degree_days,
    warm_spell_duration_index,
)
from earthkit.climate.utils.percentile import percentile_doy

In [5]:
# Data URLs (ACCESS-CM2)
URLS = {
    "tasmax_hist": "https://sites.ecmwf.int/repository/earthkit-climate/tasmax_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_historical.nc",
    "pr_ssp": "https://sites.ecmwf.int/repository/earthkit-climate/pr_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc",
    "tasmin_ssp": "https://sites.ecmwf.int/repository/earthkit-climate/tasmin_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc",
    "tasmax_ssp": "https://sites.ecmwf.int/repository/earthkit-climate/tasmax_gridded_day_CMIP6_ACCESS-CM2_r1i1p1f1_deepESD_day_ssp585.nc",
}

In [6]:
def load_data():
    print("Loading datasets...")
    tasmax_hist = earthkit.data.from_source("url", URLS["tasmax_hist"]).to_xarray()
    tasmax_ssp = earthkit.data.from_source("url", URLS["tasmax_ssp"]).to_xarray()
    tasmin_ssp = earthkit.data.from_source("url", URLS["tasmin_ssp"]).to_xarray()
    pr_ssp = earthkit.data.from_source("url", URLS["pr_ssp"]).to_xarray()
    return tasmax_hist, tasmax_ssp, tasmin_ssp, pr_ssp


tasmax_hist, tasmax_ssp, tasmin_ssp, pr_ssp = load_data()

Loading datasets...


In [7]:
# Create 'tas' for Heating Degree Days (Mean of Max and Min)
print("Computing proxy 'tas' for HDD...")
tas_ssp = (tasmax_ssp["tasmax"] + tasmin_ssp["tasmin"]) / 2
tas_ssp.name = "tas"
tas_ssp_ds = tas_ssp.to_dataset()

Computing proxy 'tas' for HDD...


In [8]:
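def profile_run(name, func, kwargs):
    """Time one indicator call, including the computation it triggers."""
    print(f"  Running {name}...")
    start = time.perf_counter()
    res = func(**kwargs)
    out = res.to_xarray()
    # Force evaluation so the timing covers the actual work, not only graph building
    if hasattr(out, "compute"):
        out.compute()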
    elapsed = time.perf_counter() - start
    print(f"  > Done in {elapsed:.4f}s")
    return elapsed
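
A single timed run is sensitive to file-system caching and background load. One possible refinement, sketched here with a hypothetical helper `time_repeated` that is not part of this notebook, is to repeat the call and report the best and median wall-clock times:

```python
import statistics
import time


def time_repeated(func, repeats=3):
    """Call func() several times and return (best, median) wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


# Stand-in workload; in this notebook the callable would wrap an
# indicator call followed by .compute()
best, median = time_repeated(lambda: sum(i * i for i in range(100_000)), repeats=5)
print(f"best: {best:.4f}s, median: {median:.4f}s")
```

The timings reported below come from single runs, so small differences between modes should be read with that in mind.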

## 1. Lazy Execution (Baseline)
In this mode, we simply merge datasets and pass them to the indicators without any specific handling of chunks or pre-computation.

In [9]:
results = []

# WSDI (Lazy)
tasmax_per_lazy = percentile_doy(tasmax_hist["tasmax"], per=90)
tasmax_per_lazy.name = "tasmax_per"
wsdi_ds_lazy = xr.merge([tasmax_ssp, tasmax_per_lazy])

# CWD (Lazy)
cwd_ds_lazy = pr_ssp

# DTR (Lazy)
dtr_ds_lazy = xr.merge([tasmax_ssp, tasmin_ssp])

# HDD (Lazy)
hdd_ds_lazy = tas_ssp_ds

# SDII (Lazy)
sdii_ds_lazy = pr_ssp

In [10]:
t_wsdi_lazy = profile_run("WSDI (Lazy)", warm_spell_duration_index, {"ds": wsdi_ds_lazy})
t_cwd_lazy = profile_run("CWD (Lazy)", maximum_consecutive_wet_days, {"ds": cwd_ds_lazy})
t_dtr_lazy = profile_run("DTR (Lazy)", daily_temperature_range, {"ds": dtr_ds_lazy})
t_hdd_lazy = profile_run("HDD (Lazy)", heating_degree_days, {"ds": hdd_ds_lazy})
t_sdii_lazy = profile_run("SDII (Lazy)", daily_precipitation_intensity, {"ds": sdii_ds_lazy})

results.append({"Indicator": "WSDI", "Mode": "Lazy", "Time": t_wsdi_lazy})
results.append({"Indicator": "CWD", "Mode": "Lazy", "Time": t_cwd_lazy})
results.append({"Indicator": "DTR", "Mode": "Lazy", "Time": t_dtr_lazy})
results.append({"Indicator": "HDD", "Mode": "Lazy", "Time": t_hdd_lazy})
results.append({"Indicator": "SDII", "Mode": "Lazy", "Time": t_sdii_lazy})

  Running WSDI (Lazy)...
  > Done in 29.2237s
  Running CWD (Lazy)...
  > Done in 6.6821s
  Running DTR (Lazy)...
  > Done in 2.8461s
  Running HDD (Lazy)...
  > Done in 2.2843s
  Running SDII (Lazy)...
  > Done in 1.8180s


## 2. Optimized Execution
In this mode, we apply two key optimizations:
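1. **Pre-computing Percentiles**: We compute the percentile threshold eagerly (`.compute()`) before passing it to the indicator, so the indicator's dask graph no longer contains the percentile calculation. The pre-computation is timed separately and is not included in the indicator timings below.
2. **Re-chunking**: We re-chunk the data into a single chunk along the time dimension (`time=-1`), so that each grid cell's full time series is contiguous for indicators that scan along time.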
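
The effect of the two steps can be seen on a small synthetic array. This is a sketch under assumptions: the data, the sizes, and the plain `quantile` call are stand-ins for the notebook's datasets and `percentile_doy`, and it requires `dask` to be installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic two-year daily series on a tiny grid (illustrative values only)
time_index = pd.date_range("2000-01-01", periods=730, freq="D")
da = xr.DataArray(
    np.random.default_rng(0).normal(size=(730, 4, 5)),
    dims=("time", "lat", "lon"),
    coords={"time": time_index},
    name="tasmax",
)

# Start from data split into several chunks along time
chunked = da.chunk({"time": 100})
print(len(chunked.chunks[0]))  # number of chunks along time

# Re-chunk so each grid cell's full time series sits in one chunk
rechunked = chunked.chunk({"time": -1})
print(rechunked.chunks[0])  # a single chunk spanning all 730 days

# The threshold is a deferred computation here; .compute() evaluates it
# and returns an in-memory (NumPy-backed) array
threshold = rechunked.quantile(0.9, dim="time").compute()
print(threshold.shape, threshold.chunks is None)
```

Once the threshold is held in memory, merging it into a dataset adds a plain array rather than another deferred computation.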

In [11]:
# WSDI (Optimized)
print("  Pre-computing percentile for WSDI...")
start_per = time.perf_counter()
tasmax_per_opt = percentile_doy(tasmax_hist["tasmax"], per=90)
tasmax_per_opt.name = "tasmax_per"
tasmax_per_opt = tasmax_per_opt.compute()
print(f"  > Percentile computed in {time.perf_counter() - start_per:.4f}s")

tasmax_ssp_opt = tasmax_ssp.chunk({"time": -1})
wsdi_ds_opt = xr.merge([tasmax_ssp_opt, tasmax_per_opt])

# CWD (Optimized)
cwd_ds_opt = pr_ssp.chunk({"time": -1})

# DTR (Optimized)
tasmin_ssp_opt = tasmin_ssp.chunk({"time": -1})
dtr_ds_opt = xr.merge([tasmax_ssp_opt, tasmin_ssp_opt])

# HDD (Optimized)
hdd_ds_opt = tas_ssp_ds.chunk({"time": -1})

# SDII (Optimized)
sdii_ds_opt = pr_ssp.chunk({"time": -1})

  Pre-computing percentile for WSDI...
  > Percentile computed in 2.4377s


In [12]:
t_wsdi_opt = profile_run("WSDI (Optimized)", warm_spell_duration_index, {"ds": wsdi_ds_opt})
t_cwd_opt = profile_run("CWD (Optimized)", maximum_consecutive_wet_days, {"ds": cwd_ds_opt})
t_dtr_opt = profile_run("DTR (Optimized)", daily_temperature_range, {"ds": dtr_ds_opt})
t_hdd_opt = profile_run("HDD (Optimized)", heating_degree_days, {"ds": hdd_ds_opt})
t_sdii_opt = profile_run("SDII (Optimized)", daily_precipitation_intensity, {"ds": sdii_ds_opt})

results.append({"Indicator": "WSDI", "Mode": "Optimized", "Time": t_wsdi_opt})
results.append({"Indicator": "CWD", "Mode": "Optimized", "Time": t_cwd_opt})
results.append({"Indicator": "DTR", "Mode": "Optimized", "Time": t_dtr_opt})
results.append({"Indicator": "HDD", "Mode": "Optimized", "Time": t_hdd_opt})
results.append({"Indicator": "SDII", "Mode": "Optimized", "Time": t_sdii_opt})

  Running WSDI (Optimized)...
  > Done in 5.7216s
  Running CWD (Optimized)...
  > Done in 4.0565s
  Running DTR (Optimized)...
  > Done in 2.2120s
  Running HDD (Optimized)...
  > Done in 2.5596s
  Running SDII (Optimized)...
  > Done in 1.9346s


## 3. Summary of Results

In [13]:
df = pd.DataFrame(results)
pivot = df.pivot(index="Indicator", columns="Mode", values="Time")
pivot["Speedup"] = pivot["Lazy"] / pivot["Optimized"]
pivot

| Indicator | Lazy (s) | Optimized (s) | Speedup |
|-----------|----------|---------------|---------|
| CWD | 6.682069 | 4.056488 | 1.647255 |
| DTR | 2.846073 | 2.211951 | 1.28668 |
| HDD | 2.284253 | 2.559555 | 0.892442 |
| SDII | 1.817978 | 1.934581 | 0.939727 |
| WSDI | 29.223651 | 5.721642 | 5.107563 |
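
The speedup column compares indicator run times only. For WSDI, the optimized run used a percentile threshold computed beforehand (2.4377 s, printed separately in cell 11). The lazy timing presumably includes that work, assuming the lazily built percentile is evaluated inside the timed call. A small sketch of the end-to-end comparison, using only timings printed in this notebook (the variable names are illustrative):

```python
# Timings in seconds, copied from the outputs printed in this notebook
wsdi_lazy = 29.2237
wsdi_optimized = 5.7216
percentile_precompute = 2.4377

# Speedup as reported in the summary (indicator call only)
reported = wsdi_lazy / wsdi_optimized

# Speedup when the pre-computation cost is charged to the optimized mode
end_to_end = wsdi_lazy / (wsdi_optimized + percentile_precompute)

print(f"reported: {reported:.2f}x, end-to-end: {end_to_end:.2f}x")
```

On these numbers WSDI still benefits most from the optimizations, while HDD and SDII show speedups slightly below 1. With single runs, differences of that size may be within timing noise.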
