# CORDEX Regional Climate Data Loading and Exploration

## Overview

The **Coordinated Regional Climate Downscaling Experiment (CORDEX)** is an international effort to generate high-resolution regional climate projections using Regional Climate Models (RCMs) driven by Global Climate Models (GCMs). This notebook provides a comprehensive guide to loading, exploring, and understanding CORDEX data structures.

### What is CORDEX?

CORDEX produces regional climate projections at resolutions of ~12-50 km (compared to GCM resolutions of ~100-300 km), making them suitable for regional impact studies. The framework consists of:

- **GCMs (Global Climate Models)**: Provide boundary conditions (e.g., CNRM-CM5, MPI-ESM-LR)
- **RCMs (Regional Climate Models)**: Dynamically downscale GCM outputs (e.g., RCA4, RegCM4, WRF)
- **CORDEX Domains**: Geographic regions (e.g., EUR-44, NAM-44, AFR-44)
- **Scenarios**: RCP2.6, RCP4.5, RCP8.5 (Representative Concentration Pathways)

### Learning Objectives

By the end of this notebook, you will be able to:

1. Understand CORDEX data structure and naming conventions
2. Load and inspect CORDEX NetCDF files using xarray
3. Visualize regional climate data on geographic projections
4. Extract time series for specific locations or regions
5. Perform data quality checks and validation
6. Handle multiple files and large datasets efficiently using dask
7. Apply best practices for memory management with climate data
8. Export processed data for downstream analysis

### Notebook Structure

This notebook walks through an end-to-end workflow for climate data analysis, demonstrating:
- Proper handling of CF-compliant NetCDF files
- Efficient memory management for large datasets
- Professional visualization using cartopy
- Reproducible data processing pipelines

## 1. Setup and Configuration

Import required libraries for climate data analysis. This notebook uses:

- **xarray**: For labeled multi-dimensional arrays (CF-compliant NetCDF)
- **numpy/pandas**: For numerical operations and tabular data
- **matplotlib**: For visualization
- **cartopy**: For geographic projections and mapping
- **dask**: For lazy loading and parallel computation on large datasets

In [None]:
# Standard library
import os
import warnings
from datetime import datetime
from pathlib import Path

# Geographic projections and mapping
import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Parallel computing and lazy loading
import dask
from dask.diagnostics import ProgressBar

# Visualization
import matplotlib.pyplot as plt

# Core scientific computing libraries
import numpy as np
import pandas as pd
import xarray as xr

# Configure display options
warnings.filterwarnings("ignore")
xr.set_options(display_style="html")
np.set_printoptions(precision=3, suppress=True)

# Matplotlib configuration
plt.style.use("seaborn-v0_8-darkgrid")
plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams["font.size"] = 10

print(f"xarray version: {xr.__version__}")
print(f"dask version: {dask.__version__}")
print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")
print("\nSetup complete!")

## 2. Understanding CORDEX Data Structure

### CORDEX Naming Convention

CORDEX files follow a standardized naming convention:

```
<variable>_<domain>_<driving_model>_<experiment>_<ensemble>_<rcm_model>_<rcm_version>_<frequency>_<start_time>-<end_time>.nc
```

**Example:**
```
tas_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_20060101-20101231.nc
```

Breaking this down:
- **tas**: Near-surface air temperature (variable)
- **EUR-44**: European domain at 0.44° grid spacing (~50 km)
- **MPI-ESM-LR**: Driving GCM (Max Planck Institute Earth System Model)
- **rcp85**: Representative Concentration Pathway 8.5 W/m²
- **r1i1p1**: Realization, initialization, physics ensemble member
- **SMHI-RCA4**: Regional model (Rossby Centre RCA4)
- **v1**: Model version
- **day**: Daily temporal frequency
- **20060101-20101231**: Time range
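
The breakdown above can be sketched as a small parser. This is a minimal illustration that assumes the nine-field template shown in this section; the helper name `parse_cordex_filename` and the field labels are hypothetical and not part of any CORDEX tooling.

```python
from pathlib import Path

# Field labels follow the naming template above (hypothetical helper, not a CORDEX API)
CORDEX_FIELDS = [
    "variable", "domain", "driving_model", "experiment", "ensemble",
    "rcm_model", "rcm_version", "frequency", "time_range",
]


def parse_cordex_filename(filename):
    """Split a CORDEX-style file name into its underscore-separated fields."""
    parts = Path(filename).stem.split("_")
    if len(parts) != len(CORDEX_FIELDS):
        raise ValueError(
            f"Expected {len(CORDEX_FIELDS)} fields, got {len(parts)}: {filename}"
        )
    fields = dict(zip(CORDEX_FIELDS, parts))
    # The last field holds the start and end dates separated by a hyphen
    fields["start_time"], fields["end_time"] = fields.pop("time_range").split("-")
    return fields


info = parse_cordex_filename(
    "tas_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_20060101-20101231.nc"
)
print(info["variable"], info["domain"], info["start_time"], info["end_time"])
```

The parser rejects names that do not have exactly nine fields. That includes the synthetic file created later in this notebook, whose name starts with `tas_pr` and therefore has ten fields.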

### Model Hierarchy

```
GCM (Global Climate Model)
  ↓ boundary conditions
RCM (Regional Climate Model)
  ↓ high-resolution output
CORDEX Data (~50 km, ~25 km, or ~12.5 km grid spacing)
```

### Common CORDEX Variables

| Variable | Standard Name | Units | Description |
|----------|---------------|-------|-------------|
| tas | air_temperature | K | Near-surface air temperature |
| pr | precipitation_flux | kg m⁻² s⁻¹ | Precipitation |
| tasmax | air_temperature | K | Daily maximum temperature |
| tasmin | air_temperature | K | Daily minimum temperature |
| hurs | relative_humidity | % | Near-surface relative humidity |
| sfcWind | wind_speed | m s⁻¹ | Near-surface wind speed |
| psl | air_pressure_at_sea_level | Pa | Sea level pressure |
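
The units in the table are rarely the ones used for plots or reports, so two conversions come up repeatedly in this notebook. The sketch below shows both; the function names are hypothetical, and the precipitation conversion assumes a water density of 1000 kg m⁻³, so that 1 kg m⁻² corresponds to a 1 mm layer.

```python
import numpy as np

SECONDS_PER_DAY = 86400


def kelvin_to_celsius(temp_k):
    """Convert temperature from K to °C."""
    return np.asarray(temp_k) - 273.15


def flux_to_mm_per_day(pr_flux):
    """Convert precipitation flux from kg m-2 s-1 to mm/day.

    1 kg of water spread over 1 m2 forms a 1 mm layer (assuming
    1000 kg m-3), so only the time unit needs converting.
    """
    return np.asarray(pr_flux) * SECONDS_PER_DAY


print(kelvin_to_celsius(283.15))
print(flux_to_mm_per_day(2.0e-5))
```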

### Emission Scenarios (RCPs)

- **RCP2.6**: Strong mitigation, forcing peaks at ~3 W/m² before 2100, then declines to ~2.6 W/m² by 2100
- **RCP4.5**: Intermediate stabilization, ~4.5 W/m² by 2100
- **RCP8.5**: High emissions, business-as-usual, ~8.5 W/m² by 2100

## 3. Create Sample CORDEX Data

For demonstration purposes, we'll create synthetic CORDEX-like data with realistic structure, dimensions, and metadata. This allows us to work through the analysis workflow even without access to actual CORDEX files.

The synthetic data will represent:
- **Domain**: EUR-44 (European domain, 0.44° resolution)
- **Variables**: tas (temperature), pr (precipitation)
- **Time period**: 2006-2010 (5 years, daily data)
- **Spatial extent**: Central Europe

In [None]:
def create_sample_cordex_data(output_dir="./data", create_file=True):
    """
    Create synthetic CORDEX-like NetCDF data with realistic structure.

    Parameters:
    -----------
    output_dir : str
        Directory to save the sample NetCDF file
    create_file : bool
        If True, save to disk; if False, return xarray Dataset only

    Returns:
    --------
    ds : xarray.Dataset
        Sample CORDEX dataset
    """

    # Define spatial domain (Central Europe)
    lat = np.arange(42.0, 58.0, 0.44)  # 0.44° spacing (~50 km)
    lon = np.arange(2.0, 18.0, 0.44)

    # Define temporal dimension (5 years, daily)
    start_date = "2006-01-01"
    end_date = "2010-12-31"
    time = pd.date_range(start=start_date, end=end_date, freq="D")

    # Create meshgrid for spatial coordinates
    _lon_grid, lat_grid = np.meshgrid(lon, lat)

    # Generate synthetic temperature data (tas)
    # Base temperature with seasonal cycle and spatial gradient
    days_of_year = time.dayofyear.values
    years = time.year.values

    # Seasonal cycle (warmer in summer, colder in winter)
    seasonal_cycle = 15 * np.sin(2 * np.pi * (days_of_year - 80) / 365.25)

    # Warming trend (0.03 K/year for RCP8.5)
    trend = 0.03 * (years - 2006)

    # Spatial pattern (cooler in north, warmer in south): temperature falls with latitude
    base_temp = 273.15 + 10 - 0.5 * (lat_grid - lat.mean())

    # Combine components with random variability
    tas_data = np.zeros((len(time), len(lat), len(lon)))
    for i in range(len(time)):
        tas_data[i] = (
            base_temp + seasonal_cycle[i] + trend[i] + np.random.normal(0, 2, (len(lat), len(lon)))
        )

    # Generate synthetic precipitation data (pr)
    # Higher precipitation in winter, spatial variability
    seasonal_pr = 2.0e-5 * (1 - 0.4 * np.sin(2 * np.pi * (days_of_year - 80) / 365.25))

    pr_data = np.zeros((len(time), len(lat), len(lon)))
    for i in range(len(time)):
        # Precipitation with spatial pattern and random events
        base_pr = seasonal_pr[i] * (1 + 0.3 * np.sin(lat_grid / 10))
        random_pr = np.random.gamma(2, seasonal_pr[i] / 2, (len(lat), len(lon)))
        pr_data[i] = np.maximum(0, base_pr + random_pr)

    # Create xarray Dataset with proper metadata
    ds = xr.Dataset(
        {
            "tas": (
                ["time", "lat", "lon"],
                tas_data,
                {
                    "standard_name": "air_temperature",
                    "long_name": "Near-Surface Air Temperature",
                    "units": "K",
                    "cell_methods": "time: mean",
                    "missing_value": 1.0e20,
                },
            ),
            "pr": (
                ["time", "lat", "lon"],
                pr_data,
                {
                    "standard_name": "precipitation_flux",
                    "long_name": "Precipitation",
                    "units": "kg m-2 s-1",
                    "cell_methods": "time: mean",
                    "missing_value": 1.0e20,
                },
            ),
        },
        coords={
            "time": (
                ["time"],
                time,
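                # "calendar" is omitted: xarray writes units and calendar itself when it
                # encodes datetimes, and raises ValueError if the key is already in attrs
                {"standard_name": "time", "long_name": "time", "axis": "T"},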
            ),
            "lat": (
                ["lat"],
                lat,
                {
                    "standard_name": "latitude",
                    "long_name": "latitude",
                    "units": "degrees_north",
                    "axis": "Y",
                },
            ),
            "lon": (
                ["lon"],
                lon,
                {
                    "standard_name": "longitude",
                    "long_name": "longitude",
                    "units": "degrees_east",
                    "axis": "X",
                },
            ),
        },
        attrs={
            "Conventions": "CF-1.6",
            "title": "Sample CORDEX EUR-44 Regional Climate Model Data",
            "institution": "Climate Research Lab (Synthetic Data)",
            "source": "SMHI-RCA4 driven by MPI-ESM-LR",
            "experiment": "RCP8.5",
            "experiment_id": "rcp85",
            "driving_model_id": "MPI-ESM-LR",
            "driving_model_ensemble_member": "r1i1p1",
            "model_id": "SMHI-RCA4",
            "rcm_version_id": "v1",
            "CORDEX_domain": "EUR-44",
            "frequency": "day",
            "contact": "cordex@climate.research",
            "creation_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "history": "Synthetic data created for demonstration purposes",
        },
    )

    # Optionally save to disk
    if create_file:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        filename = "tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_20060101-20101231.nc"
        filepath = os.path.join(output_dir, filename)

        # Use compression for efficient storage
        encoding = {"tas": {"zlib": True, "complevel": 5}, "pr": {"zlib": True, "complevel": 5}}

        ds.to_netcdf(filepath, encoding=encoding)
        print(f"Sample data saved to: {filepath}")
        print(f"File size: {os.path.getsize(filepath) / 1e6:.2f} MB")

    return ds


# Create sample dataset
ds_sample = create_sample_cordex_data(output_dir="./data", create_file=True)
print("\nSample CORDEX dataset created successfully!")

## 4. Load CORDEX Data

Load the CORDEX NetCDF file using xarray. The `open_dataset` function reads the file metadata immediately but loads data arrays lazily (only when needed).

### Loading Strategies:

1. **`xr.open_dataset()`**: Single file, lazy loading
2. **`xr.open_mfdataset()`**: Multiple files, concatenate along dimension
3. **`chunks` parameter**: Control dask chunking for parallel operations
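
When choosing a value for `chunks`, it helps to estimate how much memory a single chunk occupies. The sketch below is a rough estimate under two assumptions: values are stored as 8-byte floats (real CORDEX files often use 4-byte floats) and 1 MB is taken as 10⁶ bytes. The helper name `chunk_size_mb` is hypothetical.

```python
import math


def chunk_size_mb(chunks, itemsize=8):
    """Return the in-memory size of one chunk in MB (1 MB = 1e6 bytes).

    chunks   : dict mapping dimension name -> chunk length
    itemsize : bytes per value (8 for float64, 4 for float32)
    """
    n_elements = math.prod(chunks.values())
    return n_elements * itemsize / 1e6


# Chunking used in the loading cell below
print(chunk_size_mb({"time": 365, "lat": 20, "lon": 20}))

# One year of the full sample grid (37 x 37 points) in a single chunk
print(chunk_size_mb({"time": 365, "lat": 37, "lon": 37}))
```

Chunks of about 1 MB, as in the first call, suit a small demonstration file. For real CORDEX domains, larger chunks usually reduce scheduling overhead, so rerun the estimate with the actual grid size before settling on values.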

In [None]:
# Define file path
data_dir = "./data"
filename = "tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_20060101-20101231.nc"
filepath = os.path.join(data_dir, filename)

# Load dataset with chunking for efficient memory usage
ds = xr.open_dataset(
    filepath,
    chunks={"time": 365, "lat": 20, "lon": 20},  # Chunk by year and spatial blocks
)

print("Dataset loaded successfully!\n")
print(f"File: {filename}")
print("Memory: Lazy loading enabled via dask")

# Display dataset structure
ds

## 5. Explore Data Structure

Examine the dataset's dimensions, coordinates, variables, and metadata. Understanding the data structure is crucial for correct analysis.

In [None]:
# Dataset dimensions
print("=" * 60)
print("DIMENSIONS")
print("=" * 60)
for dim, size in ds.sizes.items():  # ds.sizes is the stable name -> length mapping
    print(f"{dim:15s}: {size:6d}")

# Coordinates
print("\n" + "=" * 60)
print("COORDINATES")
print("=" * 60)
for coord in ds.coords:
    coord_data = ds.coords[coord]
    print(f"\n{coord}:")
    print(f"  Shape: {coord_data.shape}")
    print(f"  Dtype: {coord_data.dtype}")
    if coord == "time":
        print(f"  Range: {coord_data.values[0]} to {coord_data.values[-1]}")
        print(f"  Duration: {len(coord_data)} days ({len(coord_data) / 365.25:.1f} years)")
    else:
        print(f"  Range: [{coord_data.values.min():.2f}, {coord_data.values.max():.2f}]")
        print(f"  Resolution: {np.diff(coord_data.values).mean():.4f}°")

# Data variables
print("\n" + "=" * 60)
print("DATA VARIABLES")
print("=" * 60)
for var in ds.data_vars:
    var_data = ds[var]
    print(f"\n{var}:")
    print(f"  Long name: {var_data.attrs.get('long_name', 'N/A')}")
    print(f"  Standard name: {var_data.attrs.get('standard_name', 'N/A')}")
    print(f"  Units: {var_data.attrs.get('units', 'N/A')}")
    print(f"  Shape: {var_data.shape}")
    print(f"  Dtype: {var_data.dtype}")
    print(f"  Chunks: {var_data.chunks if var_data.chunks is not None else 'Not chunked'}")

In [None]:
# Global attributes (metadata)
print("=" * 60)
print("GLOBAL ATTRIBUTES (Metadata)")
print("=" * 60)

important_attrs = [
    "title",
    "institution",
    "source",
    "experiment",
    "experiment_id",
    "driving_model_id",
    "model_id",
    "CORDEX_domain",
    "frequency",
]

for attr in important_attrs:
    if attr in ds.attrs:
        print(f"{attr:25s}: {ds.attrs[attr]}")

print("\nAll attributes:")
for key, value in ds.attrs.items():
    if key not in important_attrs:
        print(f"  {key}: {value}")

In [None]:
# Time range analysis
print("=" * 60)
print("TEMPORAL COVERAGE")
print("=" * 60)

time_values = ds.time.values
time_index = pd.DatetimeIndex(time_values)

print(f"Start date: {time_index[0]}")
print(f"End date: {time_index[-1]}")
print(f"Total days: {len(time_index)}")
print(f"Total years: {len(time_index) / 365.25:.2f}")
print(f"\nYears covered: {sorted(time_index.year.unique().tolist())}")

# Check for temporal gaps
time_diffs = np.diff(time_index)
expected_diff = np.timedelta64(1, "D")
gaps = np.where(time_diffs != expected_diff)[0]

if len(gaps) > 0:
    print(f"\nWARNING: Found {len(gaps)} temporal gaps!")
    for gap_idx in gaps[:5]:  # Show first 5 gaps
        print(f"  Gap at index {gap_idx}: {time_diffs[gap_idx]}")
else:
    print("\nTemporal continuity: ✓ No gaps detected")

In [None]:
# Spatial extent analysis
print("=" * 60)
print("SPATIAL COVERAGE")
print("=" * 60)

lat = ds.lat.values
lon = ds.lon.values

print(f"Latitude range: [{lat.min():.2f}°N, {lat.max():.2f}°N]")
print(f"Longitude range: [{lon.min():.2f}°E, {lon.max():.2f}°E]")
print(f"\nGrid dimensions: {len(lat)} × {len(lon)} points")
print(f"Latitude resolution: ~{np.diff(lat).mean():.4f}° (~{np.diff(lat).mean() * 111:.1f} km)")
print(
    f"Longitude resolution: ~{np.diff(lon).mean():.4f}° (~{np.diff(lon).mean() * 111 * np.cos(np.radians(lat.mean())):.1f} km)"
)

# Calculate approximate area (flat-grid estimate; east-west distance shrinks with cos(latitude))
lat_extent = lat.max() - lat.min()
lon_extent = lon.max() - lon.min()
area_approx = lat_extent * lon_extent * (111 * 111) * np.cos(np.radians(lat.mean()))  # km²

print(f"\nApproximate domain area: {area_approx:,.0f} km²")
print(f"Domain extent: {lat_extent:.1f}° lat × {lon_extent:.1f}° lon")

## 6. Spatial Visualization

Visualize the CORDEX regional domain and climate variables using cartopy for geographic projections. We'll create professional maps showing:

1. Mean temperature field
2. Mean precipitation field
3. Seasonal composites
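
The seasonal composites pool all days into the four meteorological seasons (DJF, MAM, JJA, SON), which is the grouping that `time.season` provides in xarray. The sketch below reproduces that grouping with plain NumPy on a one-dimensional series; the helper name and the sample values are made up for illustration.

```python
import numpy as np

# Month number (1-12) -> meteorological season label
MONTH_TO_SEASON = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}


def seasonal_means(values, months):
    """Average values by season, given the month number of each value."""
    values = np.asarray(values, dtype=float)
    seasons = np.array([MONTH_TO_SEASON[int(m)] for m in months])
    return {
        season: float(values[seasons == season].mean())
        for season in ("DJF", "MAM", "JJA", "SON")
        if (seasons == season).any()
    }


# Made-up temperatures (°C) with the month each one belongs to
print(seasonal_means([0, 2, 10, 20, 12, 4], months=[1, 12, 4, 7, 10, 2]))
```

Because the grouping uses only the month, the DJF composite for 2006-2010 mixes January and February 2006 with December 2010, which belong to winters that are only partly covered by the file.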

In [None]:
# Calculate temporal means for visualization
tas_mean = ds.tas.mean(dim="time").compute()  # Compute to load into memory
pr_mean = ds.pr.mean(dim="time").compute()

# Convert units for better readability
tas_mean_celsius = tas_mean - 273.15  # K to °C
pr_mean_mmday = pr_mean * 86400  # kg m-2 s-1 to mm/day

print("Temporal means calculated.")
print(f"Mean temperature: {tas_mean_celsius.mean().values:.2f} °C")
print(f"Mean precipitation: {pr_mean_mmday.mean().values:.2f} mm/day")

In [None]:
# Plot mean temperature field
fig = plt.figure(figsize=(14, 10))
ax = plt.axes(projection=ccrs.PlateCarree())

# Add map features
ax.add_feature(cfeature.LAND, facecolor="lightgray", alpha=0.3)
ax.add_feature(cfeature.COASTLINE, linewidth=0.8)
ax.add_feature(cfeature.BORDERS, linewidth=0.5, linestyle=":")
ax.add_feature(cfeature.LAKES, alpha=0.5)
ax.add_feature(cfeature.RIVERS, linewidth=0.5)

# Plot temperature data
im = ax.pcolormesh(
    ds.lon,
    ds.lat,
    tas_mean_celsius,
    transform=ccrs.PlateCarree(),
    cmap="RdYlBu_r",
    vmin=tas_mean_celsius.quantile(0.05),
    vmax=tas_mean_celsius.quantile(0.95),
    shading="auto",
)

# Add colorbar
cbar = plt.colorbar(im, ax=ax, orientation="horizontal", pad=0.05, shrink=0.8)
cbar.set_label("Mean Near-Surface Air Temperature (°C)", fontsize=12)

# Add gridlines
gl = ax.gridlines(draw_labels=True, linewidth=0.5, alpha=0.5, linestyle="--")
gl.top_labels = False
gl.right_labels = False

# Set extent and title
ax.set_extent([ds.lon.min(), ds.lon.max(), ds.lat.min(), ds.lat.max()], crs=ccrs.PlateCarree())
plt.title(
    f"CORDEX EUR-44: Mean Temperature (2006-2010)\n"
    f"Model: {ds.attrs.get('model_id', 'N/A')} driven by {ds.attrs.get('driving_model_id', 'N/A')} ({ds.attrs.get('experiment_id', 'N/A')})",
    fontsize=14,
    pad=20,
)

plt.tight_layout()
plt.show()

In [None]:
# Plot mean precipitation field
fig = plt.figure(figsize=(14, 10))
ax = plt.axes(projection=ccrs.PlateCarree())

# Add map features
ax.add_feature(cfeature.LAND, facecolor="lightgray", alpha=0.3)
ax.add_feature(cfeature.COASTLINE, linewidth=0.8)
ax.add_feature(cfeature.BORDERS, linewidth=0.5, linestyle=":")
ax.add_feature(cfeature.LAKES, alpha=0.5)
ax.add_feature(cfeature.RIVERS, linewidth=0.5)

# Plot precipitation data with a sequential colormap on a linear scale
im = ax.pcolormesh(
    ds.lon,
    ds.lat,
    pr_mean_mmday,
    transform=ccrs.PlateCarree(),
    cmap="YlGnBu",
    vmin=0,
    vmax=pr_mean_mmday.quantile(0.95),
    shading="auto",
)

# Add colorbar
cbar = plt.colorbar(im, ax=ax, orientation="horizontal", pad=0.05, shrink=0.8)
cbar.set_label("Mean Precipitation (mm/day)", fontsize=12)

# Add gridlines
gl = ax.gridlines(draw_labels=True, linewidth=0.5, alpha=0.5, linestyle="--")
gl.top_labels = False
gl.right_labels = False

# Set extent and title
ax.set_extent([ds.lon.min(), ds.lon.max(), ds.lat.min(), ds.lat.max()], crs=ccrs.PlateCarree())
plt.title(
    f"CORDEX EUR-44: Mean Precipitation (2006-2010)\n"
    f"Model: {ds.attrs.get('model_id', 'N/A')} driven by {ds.attrs.get('driving_model_id', 'N/A')} ({ds.attrs.get('experiment_id', 'N/A')})",
    fontsize=14,
    pad=20,
)

plt.tight_layout()
plt.show()

In [None]:
# Seasonal composites
# Group data by season and calculate means
ds_seasonal = ds.groupby("time.season").mean(dim="time")

# Define season order
season_order = ["DJF", "MAM", "JJA", "SON"]
season_names = {"DJF": "Winter", "MAM": "Spring", "JJA": "Summer", "SON": "Autumn"}

# Create subplot figure
fig, axes = plt.subplots(2, 2, figsize=(16, 12), subplot_kw={"projection": ccrs.PlateCarree()})
axes = axes.flatten()

for i, season in enumerate(season_order):
    ax = axes[i]

    # Add map features
    ax.add_feature(cfeature.LAND, facecolor="lightgray", alpha=0.3)
    ax.add_feature(cfeature.COASTLINE, linewidth=0.5)
    ax.add_feature(cfeature.BORDERS, linewidth=0.3, linestyle=":")

    # Get seasonal data
    tas_season = ds_seasonal.tas.sel(season=season) - 273.15  # Convert to °C

    # Plot
    im = ax.pcolormesh(
        ds.lon,
        ds.lat,
        tas_season,
        transform=ccrs.PlateCarree(),
        cmap="RdYlBu_r",
        vmin=-5,
        vmax=25,
        shading="auto",
    )

    # Colorbar
    cbar = plt.colorbar(im, ax=ax, orientation="horizontal", pad=0.05, shrink=0.9)
    cbar.set_label("Temperature (°C)", fontsize=10)

    # Gridlines
    gl = ax.gridlines(draw_labels=True, linewidth=0.3, alpha=0.5, linestyle="--")
    gl.top_labels = False
    gl.right_labels = False
    if i < 2:
        gl.bottom_labels = False
    if i % 2 == 1:
        gl.left_labels = False

    # Title
    ax.set_title(f"{season_names[season]} ({season})", fontsize=12, pad=10)
    ax.set_extent([ds.lon.min(), ds.lon.max(), ds.lat.min(), ds.lat.max()], crs=ccrs.PlateCarree())

plt.suptitle("CORDEX EUR-44: Seasonal Mean Temperature (2006-2010)", fontsize=16, y=0.98)
plt.tight_layout()
plt.show()

## 7. Time Series Extraction

Extract and analyze time series data for specific locations or regional averages. This is essential for understanding temporal variability and trends.
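
Point extraction on a regular grid amounts to finding the grid coordinate closest to the requested location on each axis. The sketch below shows that lookup with plain NumPy, using the same latitude and longitude axes as the synthetic data; the helper name `nearest_index` is hypothetical.

```python
import numpy as np


def nearest_index(grid, value):
    """Return the index of the grid coordinate closest to value."""
    return int(np.abs(np.asarray(grid) - value).argmin())


# Same axes as the synthetic EUR-44 sample
lat = np.arange(42.0, 58.0, 0.44)
lon = np.arange(2.0, 18.0, 0.44)

# Berlin: 52.52°N, 13.40°E
i = nearest_index(lat, 52.52)
j = nearest_index(lon, 13.40)
print(f"Berlin -> lat[{i}] = {lat[i]:.2f}, lon[{j}] = {lon[j]:.2f}")
```

Treating latitude and longitude independently only works for one-dimensional coordinate axes like these. CORDEX output on its native rotated-pole grid usually carries two-dimensional `lat`/`lon` arrays, where the nearest cell has to be found from the distance to both coordinates at once.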

In [None]:
# Define points of interest (major European cities)
cities = {
    "Paris": {"lat": 48.85, "lon": 2.35},
    "Berlin": {"lat": 52.52, "lon": 13.40},
    "Vienna": {"lat": 48.20, "lon": 16.37},
    "Munich": {"lat": 48.14, "lon": 11.58},
}

# Extract time series for each city (nearest grid point)
city_data = {}
for city_name, coords in cities.items():
    # Select nearest grid point
    point_data = ds.sel(lat=coords["lat"], lon=coords["lon"], method="nearest")
    city_data[city_name] = point_data
    print(f"{city_name}: lat={point_data.lat.values:.2f}, lon={point_data.lon.values:.2f}")

print("\nTime series extracted for all cities.")

In [None]:
# Plot temperature time series
fig, axes = plt.subplots(2, 1, figsize=(16, 10), sharex=True)

# Temperature panel
ax = axes[0]
for city_name, data in city_data.items():
    tas_celsius = data.tas - 273.15
    tas_celsius.plot(ax=ax, label=city_name, linewidth=0.8, alpha=0.8)

ax.set_ylabel("Temperature (°C)", fontsize=12)
ax.set_title("Daily Near-Surface Air Temperature", fontsize=14, pad=10)
ax.legend(loc="best", ncol=4, framealpha=0.9)
ax.grid(True, alpha=0.3)
ax.set_xlabel("")

# Precipitation panel
ax = axes[1]
for city_name, data in city_data.items():
    pr_mmday = data.pr * 86400  # Convert to mm/day
    pr_mmday.plot(ax=ax, label=city_name, linewidth=0.8, alpha=0.8)

ax.set_ylabel("Precipitation (mm/day)", fontsize=12)
ax.set_xlabel("Date", fontsize=12)
ax.set_title("Daily Precipitation", fontsize=14, pad=10)
ax.legend(loc="best", ncol=4, framealpha=0.9)
ax.grid(True, alpha=0.3)

plt.suptitle("CORDEX EUR-44: Time Series for Selected Cities (2006-2010)", fontsize=16, y=0.995)
plt.tight_layout()
plt.show()

In [None]:
# Calculate and plot monthly climatology
fig, axes = plt.subplots(2, 1, figsize=(14, 10), sharex=True)

# Temperature climatology
ax = axes[0]
for city_name, data in city_data.items():
    tas_celsius = data.tas - 273.15
    monthly_mean = tas_celsius.groupby("time.month").mean()
    monthly_std = tas_celsius.groupby("time.month").std()

    months = monthly_mean.month.values
    ax.plot(months, monthly_mean.values, "o-", label=city_name, linewidth=2, markersize=6)
    ax.fill_between(
        months,
        monthly_mean.values - monthly_std.values,
        monthly_mean.values + monthly_std.values,
        alpha=0.2,
    )

ax.set_ylabel("Temperature (°C)", fontsize=12)
ax.set_title("Monthly Mean Temperature (2006-2010)", fontsize=14, pad=10)
ax.legend(loc="best", ncol=4, framealpha=0.9)
ax.grid(True, alpha=0.3)
ax.set_xticks(range(1, 13))
ax.set_xlabel("")

# Precipitation climatology
ax = axes[1]
for city_name, data in city_data.items():
    pr_mmday = data.pr * 86400
    monthly_mean = pr_mmday.groupby("time.month").mean()
    monthly_std = pr_mmday.groupby("time.month").std()

    months = monthly_mean.month.values
    ax.plot(months, monthly_mean.values, "o-", label=city_name, linewidth=2, markersize=6)
    ax.fill_between(
        months,
        monthly_mean.values - monthly_std.values,
        monthly_mean.values + monthly_std.values,
        alpha=0.2,
    )

ax.set_ylabel("Precipitation (mm/day)", fontsize=12)
ax.set_xlabel("Month", fontsize=12)
ax.set_title("Monthly Mean Precipitation (2006-2010)", fontsize=14, pad=10)
ax.legend(loc="best", ncol=4, framealpha=0.9)
ax.grid(True, alpha=0.3)
ax.set_xticks(range(1, 13))
ax.set_xticklabels(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
)

plt.suptitle("CORDEX EUR-44: Monthly Climatology (shaded area = ±1 std)", fontsize=16, y=0.995)
plt.tight_layout()
plt.show()

In [None]:
# Calculate regional average (spatial mean)
regional_mean = ds.mean(dim=["lat", "lon"])

# Calculate annual means
tas_annual = (regional_mean.tas - 273.15).groupby("time.year").mean()
pr_annual = (regional_mean.pr * 86400 * 365.25).groupby("time.year").mean()  # mm/year

# Plot annual means with trend
fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Temperature
ax = axes[0]
years = tas_annual.year.values
ax.plot(years, tas_annual.values, "o-", linewidth=2, markersize=8, label="Annual mean")

# Linear trend
z = np.polyfit(years, tas_annual.values, 1)
p = np.poly1d(z)
ax.plot(years, p(years), "--", linewidth=2, alpha=0.7, label=f"Trend: {z[0]:.3f} °C/year")

ax.set_ylabel("Temperature (°C)", fontsize=12)
ax.set_title("Regional Mean Annual Temperature", fontsize=14, pad=10)
ax.legend(loc="best", framealpha=0.9)
ax.grid(True, alpha=0.3)

# Precipitation
ax = axes[1]
ax.bar(years, pr_annual.values, width=0.6, alpha=0.7, label="Annual total")
ax.axhline(
    pr_annual.mean().values,
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {pr_annual.mean().values:.1f} mm/year",
)

ax.set_ylabel("Precipitation (mm/year)", fontsize=12)
ax.set_xlabel("Year", fontsize=12)
ax.set_title("Regional Mean Annual Precipitation", fontsize=14, pad=10)
ax.legend(loc="best", framealpha=0.9)
ax.grid(True, alpha=0.3, axis="y")

plt.suptitle("CORDEX EUR-44: Regional Average Annual Statistics", fontsize=16, y=0.995)
plt.tight_layout()
plt.show()

print(f"Temperature trend: {z[0]:.4f} °C/year")
print(f"Total warming (2006-2010): {z[0] * (years[-1] - years[0]):.3f} °C")

## 8. Data Quality Checks

Run quality control checks on each CORDEX variable:

1. Missing values detection
2. Physical range validation
3. Statistical outliers
4. Spatial consistency
5. Distribution percentiles

Temporal continuity was already checked in Section 5.
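
The range and outlier checks listed above use different criteria and can disagree. The sketch below applies both to a small made-up temperature sample with plain NumPy; the helper name `basic_qc` and the sample values are hypothetical.

```python
import numpy as np


def basic_qc(values, valid_min, valid_max, n_sigma=5.0):
    """Count missing, out-of-range, and n-sigma outlier values."""
    values = np.asarray(values, dtype=float)
    finite = values[~np.isnan(values)]
    mean, std = finite.mean(), finite.std()
    return {
        "n_missing": int(np.isnan(values).sum()),
        "n_out_of_range": int(((finite < valid_min) | (finite > valid_max)).sum()),
        "n_outliers": int((np.abs(finite - mean) > n_sigma * std).sum()),
    }


# Made-up sample (K): one missing value and one physically implausible value
sample = np.array([280.0, 285.0, np.nan, 290.0, 150.0])
print(basic_qc(sample, valid_min=200.0, valid_max=330.0))
```

In this sample the 150 K value fails the physical range check but is not flagged as a 5σ outlier, because it inflates the standard deviation it is compared against. This is one reason to run both checks.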

In [None]:
def quality_check_variable(data, var_name, valid_min, valid_max, units):
    """
    Perform quality checks on a climate variable.

    Parameters:
    -----------
    data : xarray.DataArray
        Variable data array
    var_name : str
        Variable name for reporting
    valid_min : float
        Minimum valid value
    valid_max : float
        Maximum valid value
    units : str
        Units for reporting
    """
    print("=" * 70)
    print(f"QUALITY CHECK: {var_name}")
    print("=" * 70)

    # 1. Missing values
    n_total = data.size
    n_missing = data.isnull().sum().values
    pct_missing = (n_missing / n_total) * 100

    print("\n1. Missing Values:")
    print(f"   Total points: {n_total:,}")
    print(f"   Missing: {n_missing:,} ({pct_missing:.4f}%)")
    if pct_missing == 0:
        print("   Status: ✓ PASS - No missing values")
    elif pct_missing < 1:
        print("   Status: ⚠ WARNING - Few missing values")
    else:
        print("   Status: ✗ FAIL - Significant missing data")

    # 2. Physical range validation
    data_min = float(data.min().values)
    data_max = float(data.max().values)
    data_mean = float(data.mean().values)
    data_std = float(data.std().values)

    print("\n2. Physical Range Validation:")
    print(f"   Valid range: [{valid_min}, {valid_max}] {units}")
    print(f"   Data range: [{data_min:.4f}, {data_max:.4f}] {units}")
    print(f"   Mean: {data_mean:.4f} {units}")
    print(f"   Std dev: {data_std:.4f} {units}")

    # Check for out-of-range values
    out_of_range = ((data < valid_min) | (data > valid_max)).sum().values
    pct_out_of_range = (out_of_range / n_total) * 100

    if out_of_range == 0:
        print("   Status: ✓ PASS - All values within valid range")
    else:
        print(f"   Out of range: {out_of_range:,} ({pct_out_of_range:.4f}%)")
        print("   Status: ✗ FAIL - Some values outside valid range")

    # 3. Statistical outliers (values beyond 5 sigma)
    print("\n3. Statistical Outliers (>5σ):")
    outliers = (np.abs(data - data_mean) > 5 * data_std).sum().values
    pct_outliers = (outliers / n_total) * 100

    print(f"   Outliers: {outliers:,} ({pct_outliers:.4f}%)")
    if pct_outliers < 0.1:
        print("   Status: ✓ PASS - Few statistical outliers")
    else:
        print("   Status: ⚠ WARNING - Many statistical outliers")

    # 4. Spatial consistency
    print("\n4. Spatial Consistency:")
    spatial_mean = data.mean(dim="time")
    spatial_std = spatial_mean.std().values
    spatial_range = spatial_mean.max().values - spatial_mean.min().values

    print(f"   Spatial std dev: {spatial_std:.4f} {units}")
    print(f"   Spatial range: {spatial_range:.4f} {units}")
    print("   Status: ✓ INFO - Check values against expected patterns")

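    # 5. Distribution summary
    # quantile() cannot reduce over a dimension that is split across dask chunks,
    # so load the values once and compute all percentiles in a single pass
    print("\n5. Distribution Percentiles:")
    percentiles = [1, 5, 25, 50, 75, 95, 99]
    percentile_values = np.nanpercentile(data.values, percentiles)
    for p, val in zip(percentiles, percentile_values):
        print(f"   {p:2d}th: {val:.4f} {units}")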

    print()


# Run quality checks
quality_check_variable(
    ds.tas,
    var_name="Near-Surface Air Temperature (tas)",
    valid_min=200.0,  # -73°C
    valid_max=330.0,  # 57°C
    units="K",
)

quality_check_variable(
    ds.pr,
    var_name="Precipitation (pr)",
    valid_min=0.0,
    valid_max=0.01,  # ~860 mm/day (extreme)
    units="kg m-2 s-1",
)

In [None]:
# Visualize data distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Temperature histogram
ax = axes[0, 0]
tas_sample = ds.tas.isel(time=slice(0, 365)).values.flatten()  # Sample first year
ax.hist(tas_sample - 273.15, bins=50, alpha=0.7, edgecolor="black")
ax.set_xlabel("Temperature (°C)", fontsize=11)
ax.set_ylabel("Frequency", fontsize=11)
ax.set_title("Temperature Distribution (Year 1)", fontsize=12)
ax.grid(True, alpha=0.3)

# Temperature box plot by year
ax = axes[0, 1]
annual_data = []
years = sorted(ds.time.dt.year.values.tolist())
unique_years = sorted(set(years))
for year in unique_years:
    year_data = ds.tas.sel(time=str(year)) - 273.15
    annual_data.append(year_data.values.flatten())

bp = ax.boxplot(annual_data, labels=unique_years, patch_artist=True)
for patch in bp["boxes"]:
    patch.set_facecolor("lightblue")
ax.set_xlabel("Year", fontsize=11)
ax.set_ylabel("Temperature (°C)", fontsize=11)
ax.set_title("Temperature Distribution by Year", fontsize=12)
ax.grid(True, alpha=0.3, axis="y")

# Precipitation histogram (log scale)
ax = axes[1, 0]
pr_sample = ds.pr.isel(time=slice(0, 365)).values.flatten() * 86400  # Sample first year
pr_sample = pr_sample[pr_sample > 0]  # Remove zeros for log scale
ax.hist(pr_sample, bins=50, alpha=0.7, edgecolor="black")
ax.set_xlabel("Precipitation (mm/day)", fontsize=11)
ax.set_ylabel("Frequency", fontsize=11)
ax.set_title("Precipitation Distribution (Year 1, excluding zeros)", fontsize=12)
ax.set_yscale("log")
ax.grid(True, alpha=0.3)

# Precipitation box plot by season
ax = axes[1, 1]
seasonal_pr = []
season_labels = ["DJF", "MAM", "JJA", "SON"]
for season in season_labels:
    season_data = ds.pr.sel(time=ds.time.dt.season == season) * 86400
    seasonal_pr.append(season_data.values.flatten())

bp = ax.boxplot(seasonal_pr, labels=season_labels, patch_artist=True)
for patch in bp["boxes"]:
    patch.set_facecolor("lightgreen")
ax.set_xlabel("Season", fontsize=11)
ax.set_ylabel("Precipitation (mm/day)", fontsize=11)
ax.set_title("Precipitation Distribution by Season", fontsize=12)
ax.grid(True, alpha=0.3, axis="y")

plt.suptitle("CORDEX EUR-44: Data Distribution Analysis", fontsize=16, y=0.995)
plt.tight_layout()
plt.show()

## 9. Multi-File Handling

Demonstrate efficient handling of multiple CORDEX files using `xr.open_mfdataset()`. This is essential when working with:

- Multiple time periods (concatenate along time)
- Multiple ensemble members
- Multiple scenarios or models

### Best Practices:

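1. Use glob patterns (or an explicit file list) to match file sets, and check that the pattern does not pick up unrelated or overlapping files
2. Specify a chunking strategy for memory efficiency
3. Use `combine='by_coords'` so files are ordered and concatenated according to their coordinate values
4. Enable parallel file opening with `parallel=True` (uses dask)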
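
Before handing a set of files to `xr.open_mfdataset()`, it can help to confirm that they cover non-overlapping periods, since overlapping time ranges typically prevent a clean concatenation along time. The sketch below parses the `YYYYMMDD-YYYYMMDD` suffix used by the sample filenames in this notebook. The helper names `parse_period` and `find_overlaps` are hypothetical and not part of xarray.

```python
import re
from datetime import date, datetime
from pathlib import Path

# Matches the trailing "_YYYYMMDD-YYYYMMDD.nc" used by the sample files here.
PERIOD_RE = re.compile(r"_(\d{8})-(\d{8})\.nc$")


def parse_period(filename):
    """Return (start, end) dates encoded at the end of a CORDEX-style filename."""
    match = PERIOD_RE.search(Path(filename).name)
    if match is None:
        raise ValueError(f"No date range found in {filename!r}")
    start, end = (datetime.strptime(s, "%Y%m%d").date() for s in match.groups())
    return start, end


def find_overlaps(filenames):
    """Return pairs of files whose time periods overlap (hypothetical helper)."""
    periods = sorted((parse_period(f), f) for f in filenames)
    overlaps = []
    for i, ((_, end_a), file_a) in enumerate(periods):
        for (start_b, _), file_b in periods[i + 1:]:
            if start_b <= end_a:
                overlaps.append((file_a, file_b))
    return overlaps


prefix = "tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_"
sample = [
    prefix + "20060101-20071231.nc",
    prefix + "20080101-20091231.nc",
    prefix + "20100101-20111231.nc",
]
print(find_overlaps(sample))
# Adding a file that spans 2006-2010 introduces overlaps with the others
print(find_overlaps(sample + [prefix + "20060101-20101231.nc"]))
```

An empty result suggests the files can be concatenated along time; any reported pair is worth excluding from the pattern or the file list first.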

In [None]:
# Create multiple sample files for demonstration
def create_multiple_cordex_files(output_dir="./data"):
    """
    Create multiple CORDEX files representing different time periods.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # Define time periods
    periods = [
        ("2006-01-01", "2007-12-31"),
        ("2008-01-01", "2009-12-31"),
        ("2010-01-01", "2011-12-31"),
    ]

    files_created = []

    for start_date, end_date in periods:
        # Spatial domain
        lat = np.arange(42.0, 58.0, 0.44)
        lon = np.arange(2.0, 18.0, 0.44)
        time = pd.date_range(start=start_date, end=end_date, freq="D")

        # Create simple synthetic data
        _lon_grid, lat_grid = np.meshgrid(lon, lat)

        # Temperature with seasonal cycle
        days_of_year = time.dayofyear.values
        years = time.year.values
        seasonal_cycle = 15 * np.sin(2 * np.pi * (days_of_year - 80) / 365.25)
        trend = 0.03 * (years - 2006)
        base_temp = 273.15 + 10 + 0.5 * (lat_grid - lat.mean())

        tas_data = np.zeros((len(time), len(lat), len(lon)))
        for i in range(len(time)):
            tas_data[i] = (
                base_temp
                + seasonal_cycle[i]
                + trend[i]
                + np.random.normal(0, 2, (len(lat), len(lon)))
            )

        # Precipitation
        seasonal_pr = 2.0e-5 * (1 - 0.4 * np.sin(2 * np.pi * (days_of_year - 80) / 365.25))
        pr_data = np.zeros((len(time), len(lat), len(lon)))
        for i in range(len(time)):
            base_pr = seasonal_pr[i] * (1 + 0.3 * np.sin(lat_grid / 10))
            random_pr = np.random.gamma(2, seasonal_pr[i] / 2, (len(lat), len(lon)))
            pr_data[i] = np.maximum(0, base_pr + random_pr)

        # Create dataset
        ds_period = xr.Dataset(
            {
                "tas": (
                    ["time", "lat", "lon"],
                    tas_data,
                    {
                        "standard_name": "air_temperature",
                        "long_name": "Near-Surface Air Temperature",
                        "units": "K",
                    },
                ),
                "pr": (
                    ["time", "lat", "lon"],
                    pr_data,
                    {
                        "standard_name": "precipitation_flux",
                        "long_name": "Precipitation",
                        "units": "kg m-2 s-1",
                    },
                ),
            },
            coords={"time": time, "lat": lat, "lon": lon},
            attrs={
                "Conventions": "CF-1.6",
                "institution": "Climate Research Lab",
                "source": "SMHI-RCA4 driven by MPI-ESM-LR",
                "experiment_id": "rcp85",
                "CORDEX_domain": "EUR-44",
            },
        )

        # Save file
        start_str = start_date.replace("-", "")
        end_str = end_date.replace("-", "")
        filename = (
            f"tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_{start_str}-{end_str}.nc"
        )
        filepath = os.path.join(output_dir, filename)

        encoding = {"tas": {"zlib": True, "complevel": 5}, "pr": {"zlib": True, "complevel": 5}}
        ds_period.to_netcdf(filepath, encoding=encoding)

        files_created.append(filepath)
        print(f"Created: {filename}")

    return files_created


# Create multiple files
print("Creating multiple CORDEX files...\n")
files = create_multiple_cordex_files(output_dir="./data")
print(f"\nTotal files created: {len(files)}")

In [None]:
# Load multiple files using xr.open_mfdataset
print("Loading multiple CORDEX files...\n")

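# Use the explicit list of files created above. A glob such as
# "./data/tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_*.nc" would also
# match the single 2006-2010 file stored in the same directory, and its
# overlapping time axis would break concatenation along time.
file_list = sorted(files)

# Open multiple datasets with chunking
ds_multi = xr.open_mfdataset(
    file_list,
    chunks={"time": 365, "lat": 20, "lon": 20},
    combine="by_coords",
    parallel=True,
    engine="netcdf4",
)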

print("Multi-file dataset loaded successfully!\n")
print(f"Time range: {ds_multi.time.values[0]} to {ds_multi.time.values[-1]}")
print(f"Total time steps: {len(ds_multi.time)}")
print(f"Duration: {len(ds_multi.time) / 365.25:.2f} years\n")

# Display dataset
ds_multi

In [None]:
# Compare single vs multi-file datasets
print("=" * 70)
print("COMPARISON: Single File vs Multi-File Dataset")
print("=" * 70)

print("\nSingle file dataset:")
print(f"  Time steps: {len(ds.time)}")
print(f"  Duration: {len(ds.time) / 365.25:.2f} years")
print(f"  Date range: {ds.time.values[0]} to {ds.time.values[-1]}")

print("\nMulti-file dataset:")
print(f"  Time steps: {len(ds_multi.time)}")
print(f"  Duration: {len(ds_multi.time) / 365.25:.2f} years")
print(f"  Date range: {ds_multi.time.values[0]} to {ds_multi.time.values[-1]}")

# Compare dimension names; the time lengths differ by design
print(f"\nDimension names match: {set(ds.sizes) == set(ds_multi.sizes)}")
print(f"Variables match: {set(ds.data_vars) == set(ds_multi.data_vars)}")

In [None]:
# Demonstrate efficient computation on multi-file dataset
print("Computing regional mean from multi-file dataset...\n")

# Calculate regional mean with progress bar
with ProgressBar():
    regional_mean_multi = ds_multi[["tas", "pr"]].mean(dim=["lat", "lon"]).compute()

print("Computation complete!\n")

# Plot time series
fig, axes = plt.subplots(2, 1, figsize=(16, 10), sharex=True)

# Temperature
ax = axes[0]
tas_celsius = regional_mean_multi.tas - 273.15
tas_celsius.plot(ax=ax, linewidth=1, color="red", alpha=0.7)

# Add annual mean
tas_annual = tas_celsius.resample(time="1Y").mean()
tas_annual.plot(ax=ax, linewidth=3, color="darkred", label="Annual mean")

ax.set_ylabel("Temperature (°C)", fontsize=12)
ax.set_title("Regional Mean Temperature (Multi-File Dataset)", fontsize=14, pad=10)
ax.legend(loc="best")
ax.grid(True, alpha=0.3)
ax.set_xlabel("")

# Precipitation
ax = axes[1]
pr_mmday = regional_mean_multi.pr * 86400
pr_mmday.plot(ax=ax, linewidth=1, color="blue", alpha=0.7)

# Add annual mean
pr_annual = pr_mmday.resample(time="1Y").mean()
pr_annual.plot(ax=ax, linewidth=3, color="darkblue", label="Annual mean")

ax.set_ylabel("Precipitation (mm/day)", fontsize=12)
ax.set_xlabel("Date", fontsize=12)
ax.set_title("Regional Mean Precipitation (Multi-File Dataset)", fontsize=14, pad=10)
ax.legend(loc="best")
ax.grid(True, alpha=0.3)

plt.suptitle("CORDEX EUR-44: Multi-File Dataset Analysis", fontsize=16, y=0.995)
plt.tight_layout()
plt.show()
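
The regional means above weight every grid cell equally. On a regular latitude-longitude grid, cells shrink toward the poles, so a common refinement is to weight each cell by the cosine of its latitude. The sketch below shows the idea with plain NumPy on a synthetic field. `area_weighted_mean` is a hypothetical helper and the field values are illustrative only; xarray offers a comparable route through `DataArray.weighted(...)`.

```python
import numpy as np


def area_weighted_mean(field, lat):
    """Cosine-latitude weighted mean of a (lat, lon) field (hypothetical helper).

    Assumes a regular lat-lon grid, where cell area scales with cos(latitude).
    Non-finite cells are excluded from both numerator and denominator.
    """
    weights = np.cos(np.deg2rad(lat))[:, np.newaxis]   # shape (lat, 1)
    weights = np.broadcast_to(weights, field.shape)    # shape (lat, lon)
    mask = np.isfinite(field)
    return float(np.sum(field[mask] * weights[mask]) / np.sum(weights[mask]))


lat = np.arange(42.0, 58.0, 0.44)
lon = np.arange(2.0, 18.0, 0.44)

# Synthetic temperature field (K) that cools toward the north
field = 283.15 - 0.5 * (lat[:, np.newaxis] - lat.mean()) + np.zeros((lat.size, lon.size))

print(f"Unweighted mean: {field.mean():.3f} K")
print(f"Weighted mean:   {area_weighted_mean(field, lat):.3f} K")
```

Because the southern rows are both warmer and larger in area, the weighted mean comes out slightly higher than the unweighted one for this field.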

## 10. Memory Management with Dask

Efficient memory management is crucial when working with large climate datasets. This section demonstrates:

1. Lazy loading with dask arrays
2. Optimal chunking strategies
3. Parallel computation
4. Memory monitoring

### Chunking Best Practices:

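- **Temporal chunks**: Typically 1 year (365 days) for daily data
- **Spatial chunks**: 10-50 grid points per dimension for small domains like the one used here
- **Chunk size**: For large datasets, aim for roughly 100-200 MB per chunk; the small sample files in this notebook produce much smaller chunks
- **Analysis dimension**: Keep chunks large (ideally unsplit) along the dimensions you aggregate over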
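
To check a candidate chunk shape against these guidelines, note that the in-memory size of one chunk is the product of its dimensions times the bytes per element. A minimal sketch follows; the helper `chunk_size_mb` is hypothetical, and 8 bytes per element assumes `float64` data.

```python
import math


def chunk_size_mb(chunk_shape, itemsize=8):
    """Size of one chunk in MB (1024 * 1024 bytes) for a shape and element size."""
    return math.prod(chunk_shape) * itemsize / (1024 * 1024)


# Candidate chunk shapes as (time, lat, lon)
for label, shape, itemsize in [
    ("365 x 20 x 20, float64", (365, 20, 20), 8),
    ("365 x 50 x 50, float64", (365, 50, 50), 8),
    ("365 x 50 x 50, float32", (365, 50, 50), 4),
]:
    print(f"{label}: {chunk_size_mb(shape, itemsize):.2f} MB")
```

For grids as small as the ones in this notebook, even a full year of data per chunk stays far below 100 MB, so the size target mainly matters for larger domains or longer records.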

In [None]:
# Inspect dask chunking
print("=" * 70)
print("DASK CHUNKING ANALYSIS")
print("=" * 70)

for var in ds_multi.data_vars:
    data = ds_multi[var]
    print(f"\n{var}:")
    print(f"  Shape: {data.shape}")
    print(f"  Dtype: {data.dtype}")

    if hasattr(data.data, "chunks"):
        print(f"  Chunks: {data.chunks}")
        print(f"  Number of chunks: {data.data.npartitions}")

        # Estimate chunk size
        chunk_elements = np.prod([c[0] if isinstance(c, tuple) else c for c in data.chunks])
        chunk_bytes = chunk_elements * data.dtype.itemsize
        chunk_mb = chunk_bytes / (1024 * 1024)

        print(f"  Approximate chunk size: {chunk_mb:.2f} MB")

        # Total memory if fully loaded
        total_mb = data.nbytes / (1024 * 1024)
        print(f"  Total size if loaded: {total_mb:.2f} MB")
    else:
        print("  Not chunked (loaded in memory)")

In [None]:
# Demonstrate different chunking strategies
print("=" * 70)
print("CHUNKING STRATEGIES COMPARISON")
print("=" * 70)

single_file = "./data/tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_20060101-20101231.nc"

# Strategy 1: Small temporal chunks, full spatial domain
# (good for spatial analysis: each chunk holds complete fields)
ds_chunk1 = xr.open_dataset(
    single_file,
    chunks={"time": 30, "lat": -1, "lon": -1},  # ~Monthly chunks, full spatial domain
)

# Strategy 2: Full time axis, small spatial blocks
# (good for time series analysis: each chunk holds complete series)
ds_chunk2 = xr.open_dataset(
    single_file,
    chunks={"time": -1, "lat": 10, "lon": 10},  # Full time, small spatial blocks
)

# Strategy 3: Balanced chunks (good for general analysis)
ds_chunk3 = xr.open_dataset(
    single_file,
    chunks={"time": 365, "lat": 20, "lon": 20},  # Balanced
)

print("\nStrategy 1: Spatial analysis optimized (~monthly temporal chunks, full spatial domain)")
print(f"  tas chunks: {ds_chunk1.tas.chunks}")
print(f"  Number of chunks: {ds_chunk1.tas.data.npartitions}")

print("\nStrategy 2: Time series optimized (full time, small spatial blocks)")
print(f"  tas chunks: {ds_chunk2.tas.chunks}")
print(f"  Number of chunks: {ds_chunk2.tas.data.npartitions}")

print("\nStrategy 3: Balanced (annual temporal, medium spatial chunks)")
print(f"  tas chunks: {ds_chunk3.tas.chunks}")
print(f"  Number of chunks: {ds_chunk3.tas.data.npartitions}")

print("\n" + "=" * 70)
print("Use Strategy 1 for: Spatial maps, spatial statistics")
print("Use Strategy 2 for: Time series extraction, temporal statistics")
print("Use Strategy 3 for: General-purpose analysis, mixed operations")
print("=" * 70)

In [None]:
# Demonstrate lazy vs eager evaluation
import time

print("=" * 70)
print("LAZY vs EAGER EVALUATION")
print("=" * 70)

# Lazy evaluation (doesn't load data)
print("\n1. Lazy evaluation (setup only):")
start_time = time.time()
result_lazy = ds_multi.tas.mean(dim=["lat", "lon"])
lazy_time = time.time() - start_time
print(f"   Time: {lazy_time:.6f} seconds")
print(f"   Result type: {type(result_lazy.data)}")
print("   Memory loaded: No (dask array)")

# Eager evaluation (loads and computes)
print("\n2. Eager evaluation (actual computation):")
start_time = time.time()
with ProgressBar():
    result_eager = ds_multi.tas.mean(dim=["lat", "lon"]).compute()
eager_time = time.time() - start_time
print(f"   Time: {eager_time:.6f} seconds")
print(f"   Result type: {type(result_eager.data)}")
print("   Memory loaded: Yes (numpy array)")

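# Not a speedup: the lazy step only builds a task graph, while compute() does the work
print(f"\nCompute time / graph-build time: {eager_time / max(lazy_time, 1e-9):.0f}x")
print("\nKey insight: Lazy evaluation lets you build complex computation")
print("graphs cheaply and defer loading data until results are needed.")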

In [None]:
# Demonstrate parallel computation with dask
print("=" * 70)
print("PARALLEL COMPUTATION DEMONSTRATION")
print("=" * 70)

# Complex operation: Calculate anomalies relative to long-term mean
print("\nCalculating temperature anomalies...")
print("Operation: (T - T_mean) for each time step\n")

# Calculate climatology (long-term mean)
with ProgressBar():
    print("Step 1: Computing long-term mean...")
    climatology = ds_multi.tas.mean(dim="time").compute()

# Calculate anomalies (lazy)
print("\nStep 2: Setting up anomaly calculation (lazy)...")
anomalies = ds_multi.tas - climatology
print(f"Anomalies type: {type(anomalies.data)}")
print(f"Anomalies shape: {anomalies.shape}")

# Compute a subset to demonstrate
print("\nStep 3: Computing anomalies for first year...")
with ProgressBar():
    anomalies_year1 = anomalies.isel(time=slice(0, 365)).compute()

print("\nAnomaly statistics (Year 1):")
print(f"  Mean: {float(anomalies_year1.mean()):.6f} K")
print(f"  Std: {float(anomalies_year1.std()):.6f} K")
print(f"  Min: {float(anomalies_year1.min()):.6f} K")
print(f"  Max: {float(anomalies_year1.max()):.6f} K")

In [None]:
# Visualize memory-efficient workflow
print("=" * 70)
print("MEMORY-EFFICIENT WORKFLOW EXAMPLE")
print("=" * 70)

print("\nScenario: Calculate seasonal means for large multi-year dataset")
print("without loading entire dataset into memory.\n")

# Build computation graph (lazy)
print("Building computation graph...")
seasonal_means = ds_multi.groupby("time.season").mean(dim="time")
print(f"  Type: {type(seasonal_means.tas.data)}")
print("  Memory loaded: No\n")

# Execute computation with progress bar
print("Executing computation...")
with ProgressBar():
    seasonal_means_computed = seasonal_means.compute()

print("\nResults:")
for season in ["DJF", "MAM", "JJA", "SON"]:
    tas_mean = float(seasonal_means_computed.tas.sel(season=season).mean()) - 273.15
    pr_mean = float(seasonal_means_computed.pr.sel(season=season).mean()) * 86400
    print(f"  {season}: T = {tas_mean:.2f}°C, P = {pr_mean:.3f} mm/day")

# Close datasets
ds_chunk1.close()
ds_chunk2.close()
ds_chunk3.close()

## 11. Save Processed Data

Export processed data for downstream analysis or sharing. Best practices include:

1. Compress output files (zlib compression)
2. Preserve metadata and attributes
3. Use appropriate file formats (NetCDF4, Zarr)
4. Document processing steps in file attributes
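
A small helper can keep compression settings and provenance notes consistent across output files. The sketch below builds a per-variable encoding dictionary of the kind passed to `Dataset.to_netcdf(encoding=...)` and adds a timestamped entry to a CF-style `history` attribute. `build_encoding` and `append_history` are hypothetical names, and the default settings mirror the ones used in this notebook.

```python
from datetime import datetime, timezone


def build_encoding(var_names, complevel=5, dtype="float32"):
    """Return a zlib-compression encoding dict with one entry per variable."""
    return {name: {"zlib": True, "complevel": complevel, "dtype": dtype} for name in var_names}


def append_history(attrs, message):
    """Return a copy of attrs with a timestamped line added to 'history'."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = f"{stamp}: {message}"
    previous = attrs.get("history", "")
    updated = dict(attrs)
    # Newest entry first, one entry per line
    updated["history"] = f"{entry}\n{previous}" if previous else entry
    return updated


attrs = append_history({"Conventions": "CF-1.6"}, "Spatial mean over entire domain")
encoding = build_encoding(["tas", "pr"])
print(attrs["history"])
print(encoding)
```

The returned dictionaries could then be applied with `dataset.attrs.update(attrs)` and `dataset.to_netcdf(path, encoding=encoding)`.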

In [None]:
# Create output directory
output_dir = "./processed_data"
Path(output_dir).mkdir(parents=True, exist_ok=True)

print("=" * 70)
print("SAVING PROCESSED DATA")
print("=" * 70)

In [None]:
# Example 1: Save regional mean time series
print("\n1. Saving regional mean time series...")

# Calculate regional means (unweighted; keep_attrs=True preserves source metadata)
regional_mean = ds.mean(dim=["lat", "lon"], keep_attrs=True)

# Add processing metadata
regional_mean.attrs.update(
    {
        "title": "CORDEX EUR-44 Regional Mean Time Series",
        "processing_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "processing_steps": "Spatial mean over entire domain",
        "source_file": "tas_pr_EUR-44_MPI-ESM-LR_rcp85_r1i1p1_SMHI-RCA4_v1_day_20060101-20101231.nc",
    }
)

# Save with compression
output_file = os.path.join(output_dir, "regional_mean_timeseries.nc")
encoding = {
    "tas": {"zlib": True, "complevel": 5, "dtype": "float32"},
    "pr": {"zlib": True, "complevel": 5, "dtype": "float32"},
}

regional_mean.to_netcdf(output_file, encoding=encoding)
file_size = os.path.getsize(output_file) / 1024
print(f"   Saved: {output_file}")
print(f"   Size: {file_size:.2f} KB")

In [None]:
# Example 2: Save seasonal climatology
print("\n2. Saving seasonal climatology...")

# Calculate seasonal means (keep_attrs=True preserves source metadata)
seasonal_clim = ds.groupby("time.season").mean(dim="time", keep_attrs=True)

# Add metadata
seasonal_clim.attrs.update(
    {
        "title": "CORDEX EUR-44 Seasonal Climatology (2006-2010)",
        "processing_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "processing_steps": "Seasonal means calculated from daily data",
        "time_period": "2006-2010",
    }
)

# Compute and save
output_file = os.path.join(output_dir, "seasonal_climatology.nc")
encoding = {
    "tas": {"zlib": True, "complevel": 5, "dtype": "float32"},
    "pr": {"zlib": True, "complevel": 5, "dtype": "float32"},
}

with ProgressBar():
    seasonal_clim.to_netcdf(output_file, encoding=encoding)

file_size = os.path.getsize(output_file) / 1024
print(f"   Saved: {output_file}")
print(f"   Size: {file_size:.2f} KB")

In [None]:
# Example 3: Save point time series for selected cities
print("\n3. Saving city time series...")

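# Create dataset with all city time series
city_datasets = []
for city_name, coords in cities.items():
    point_data = ds.sel(lat=coords["lat"], lon=coords["lon"], method="nearest")

    # Record the matched grid point, then drop the scalar lat/lon coordinates:
    # they differ between cities and would otherwise conflict in xr.merge
    grid_point = {"grid_lat": float(point_data.lat), "grid_lon": float(point_data.lon)}
    point_data = point_data.drop_vars(["lat", "lon"])

    # Rename variables to include city name
    point_data = point_data.rename({"tas": f"tas_{city_name}", "pr": f"pr_{city_name}"})
    for var in point_data.data_vars:
        point_data[var].attrs.update(grid_point)

    city_datasets.append(point_data)

# Merge all city data
cities_combined = xr.merge(city_datasets)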

# Add metadata
cities_combined.attrs.update(
    {
        "title": "CORDEX EUR-44 City Time Series",
        "cities": ", ".join(cities.keys()),
        "processing_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "note": "Nearest grid point extraction for selected European cities",
    }
)

# Save
output_file = os.path.join(output_dir, "city_timeseries.nc")
encoding = {
    var: {"zlib": True, "complevel": 5, "dtype": "float32"} for var in cities_combined.data_vars
}

cities_combined.to_netcdf(output_file, encoding=encoding)
file_size = os.path.getsize(output_file) / 1024
print(f"   Saved: {output_file}")
print(f"   Size: {file_size:.2f} KB")

In [None]:
# Example 4: Export to CSV for easy sharing/analysis
print("\n4. Exporting to CSV format...")

# Regional mean as CSV
regional_df = regional_mean.to_dataframe()
regional_df["tas_celsius"] = regional_df["tas"] - 273.15
regional_df["pr_mmday"] = regional_df["pr"] * 86400
regional_df = regional_df[["tas_celsius", "pr_mmday"]]
regional_df.columns = ["Temperature_C", "Precipitation_mm_day"]

output_file = os.path.join(output_dir, "regional_mean_timeseries.csv")
regional_df.to_csv(output_file)
file_size = os.path.getsize(output_file) / 1024
print(f"   Saved: {output_file}")
print(f"   Size: {file_size:.2f} KB")
print(f"   Rows: {len(regional_df)}")

In [None]:
# Summary of saved files
print("\n" + "=" * 70)
print("SUMMARY OF SAVED FILES")
print("=" * 70)

saved_files = list(Path(output_dir).glob("*"))
total_size = 0

for i, filepath in enumerate(sorted(saved_files), 1):
    file_size = os.path.getsize(filepath)
    total_size += file_size
    print(f"{i}. {filepath.name}")
    print(f"   Size: {file_size / 1024:.2f} KB")
    print(f"   Type: {filepath.suffix[1:].upper()}")

print(f"\nTotal files: {len(saved_files)}")
print(f"Total size: {total_size / 1024:.2f} KB ({total_size / (1024 * 1024):.2f} MB)")
print(f"\nOutput directory: {os.path.abspath(output_dir)}")

## Summary and Next Steps

### What We've Learned

This notebook covered the essential workflow for working with CORDEX regional climate data:

1. **Data Structure**: Understanding CORDEX naming conventions, model hierarchy, and CF-compliant NetCDF format
2. **Data Loading**: Using xarray for lazy loading and efficient memory management
3. **Exploration**: Examining dimensions, coordinates, attributes, and temporal/spatial coverage
4. **Visualization**: Creating professional maps with cartopy and analyzing spatial patterns
5. **Time Series**: Extracting point and regional time series, calculating climatologies
6. **Quality Control**: Checking for missing values, physical ranges, and statistical outliers
7. **Multi-File Handling**: Loading and concatenating multiple files efficiently
8. **Memory Management**: Leveraging dask for lazy evaluation and parallel computation
9. **Data Export**: Saving processed data with proper metadata and compression

### Best Practices Summary

- Always use chunking for large datasets
- Leverage lazy evaluation to build computation graphs
- Preserve metadata in processed files
- Document processing steps in file attributes
- Use compression to reduce file sizes
- Validate data quality before analysis
- Choose appropriate chunking strategies for your analysis type

### Next Steps

Continue your CORDEX analysis with:

1. **Bias Correction**: Compare RCM outputs with observations and apply bias correction methods
2. **Climate Indices**: Calculate climate extremes indices (Rx1day, CDD, WSDI, etc.)
3. **Ensemble Analysis**: Compare multiple RCM-GCM combinations and calculate ensemble statistics
4. **Trend Analysis**: Detect and quantify long-term climate trends
5. **Impact Studies**: Apply climate data to sector-specific impact models (hydrology, agriculture, etc.)
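
As a taste of the climate-indices step, the sketch below computes two of the indices named above from a daily precipitation series in mm/day: Rx1day (the largest one-day total) and CDD (the longest run of consecutive days below 1 mm). It uses plain NumPy on a short made-up series. The function names are hypothetical, and real analyses usually compute these per year or per month rather than over a whole record.

```python
import numpy as np


def rx1day(pr_mmday):
    """Maximum one-day precipitation total (mm) in the series."""
    return float(np.nanmax(pr_mmday))


def cdd(pr_mmday, threshold=1.0):
    """Longest run of consecutive days with precipitation below threshold (mm)."""
    longest = current = 0
    for is_dry in np.asarray(pr_mmday) < threshold:
        current = current + 1 if is_dry else 0
        longest = max(longest, current)
    return longest


# Made-up daily series (mm/day), for illustration only
pr = np.array([0.0, 0.2, 5.1, 0.0, 0.0, 0.9, 12.4, 0.0])
print(f"Rx1day: {rx1day(pr)} mm, CDD: {cdd(pr)} days")
```

Model output stored as a flux in kg m-2 s-1, as in this notebook, would first be multiplied by 86400 to obtain mm/day.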

### Additional Resources

- CORDEX website: https://cordex.org/
- xarray documentation: https://docs.xarray.dev/
- Dask documentation: https://docs.dask.org/
- CF Conventions: https://cfconventions.org/
- Cartopy documentation: https://scitools.org.uk/cartopy/

In [None]:
# Clean up: Close all datasets
ds.close()
ds_multi.close()

print("All datasets closed successfully.")
print("\nNotebook complete!")