## 2026 EY AI & Data Challenge - TerraClimate Data Extraction Notebook

This notebook demonstrates how to access the TerraClimate dataset. TerraClimate is a dataset of monthly climate and climatic water balance for global terrestrial surfaces from 1958 to the present. These data provide important inputs for ecological and hydrological studies at global scales that require high spatial resolution and time-varying data. All data have monthly temporal resolution and a ~4-km (1/24th degree) spatial resolution. The dataset is provided in Zarr format.

For more information, visit: https://planetarycomputer.microsoft.com/dataset/terraclimate#overview 
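
As a rough check on the stated resolution, the sketch below converts 1/24 degree into kilometres. It assumes a spherical Earth with a mean radius of 6371 km, and `cell_size_km` is an illustrative helper, not part of any library used in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def cell_size_km(lat_deg, res_deg=1 / 24):
    """Approximate north-south and east-west extent of one grid cell, in km."""
    north_south = math.radians(res_deg) * EARTH_RADIUS_KM
    # Meridians converge towards the poles, so east-west spacing shrinks with cos(latitude)
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


# A latitude inside the study region used later in this notebook
ns_km, ew_km = cell_size_km(-28.76)
print(f"{ns_km:.2f} km north-south, {ew_km:.2f} km east-west")
```

Under these assumptions a cell spans about 4.6 km north-south and roughly 4 km east-west at this latitude, which is consistent with the "~4-km" figure.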

In [1]:
import warnings
warnings.filterwarnings("ignore")

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Multi-dimensional arrays and datasets (e.g., NetCDF, Zarr)
import xarray as xr

from scipy.spatial import cKDTree

# Planetary Computer tools for STAC API access and authentication
import pystac_client
import planetary_computer as pc

from datetime import date
from tqdm import tqdm
import os

<h2>Extracting TerraClimate Data Using API Calls</h2>
<p align="justify"> The API-based method lets us access <b>TerraClimate</b> data for specific regions and time periods through the <a href="https://planetarycomputer.microsoft.com/">Microsoft Planetary Computer</a>, which keeps the extraction scalable and reproducible. </p>
<p align="justify"> Through the API, we can extract climate variables such as <b>Potential Evapotranspiration (PET)</b>, which represents the atmospheric demand for water. PET describes the surface moisture balance and can serve as a predictor in water quality models. </p>
<p align="justify"> Retrieval is automated and consistent, so the resulting high-resolution climate data can be combined with satellite-derived features for environmental and hydrological analysis. </p>



<h3>Loading and Mapping TerraClimate Data:</h3>

<p>This section demonstrates how <b>TerraClimate climate variables</b>, such as <b>Potential Evapotranspiration (PET)</b>, are loaded and mapped to sampling locations:</p>

<ul>
  <li>The <b>load_terraclimate_dataset</b> function opens the TerraClimate Zarr dataset (the <b>zarr-abfs</b> asset) from the Microsoft Planetary Computer. It uses the storage options listed in the asset metadata when they are present and falls back to the asset's open keyword arguments otherwise.</li>
  <li>The <b>filterg</b> function selects one variable, restricts it to the desired time range (2011–2015), and keeps the grid points inside the study region (latitude -35.18 to -21.72, longitude 14.97 to 32.79). The resulting data is converted to a pandas DataFrame with standardized column names (<b>Latitude</b>, <b>Longitude</b>, <b>Sample Date</b>).</li>
  <li>The <b>assign_nearest_climate</b> function maps each sampling location to its <b>nearest TerraClimate grid point</b> using a KD-tree and assigns the climate variable value whose timestamp is closest to the sample date.</li>
</ul>

<p>This workflow ensures efficient, reproducible retrieval of climate variables, while allowing participants to work with pre-extracted CSV files for faster benchmarking and analysis.</p>
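
The nearest-grid-point step can be illustrated on a toy grid. The sketch below is a minimal example with made-up coordinates (a hypothetical 3 x 3 grid and two hypothetical sampling sites), not real TerraClimate data. As in the notebook, distances are Euclidean in latitude/longitude space rather than great-circle distances.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical 3 x 3 grid of cell centres in degrees (not real TerraClimate coordinates)
grid_lats = np.array([-28.0, -27.0, -26.0])
grid_lons = np.array([27.0, 28.0, 29.0])
lat_mesh, lon_mesh = np.meshgrid(grid_lats, grid_lons, indexing="ij")
grid_points = np.column_stack([lat_mesh.ravel(), lon_mesh.ravel()])

# Two made-up sampling sites as (latitude, longitude)
sites = np.array([[-26.86, 28.88], [-27.67, 27.24]])

# Build the tree on the grid, then look up the single nearest grid point per site
tree = cKDTree(np.radians(grid_points))
dist, idx = tree.query(np.radians(sites), k=1)
nearest = grid_points[idx]

print(nearest)
```

Each row of `nearest` holds the grid coordinates matched to the corresponding site; here the first site maps to (-27.0, 29.0) and the second to (-28.0, 27.0).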


In [2]:
def load_terraclimate_dataset():
    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1",
        modifier=pc.sign_inplace,
    )
    collection = catalog.get_collection("terraclimate")
    asset = collection.assets["zarr-abfs"]

    if "xarray:storage_options" in asset.extra_fields:
        ds = xr.open_zarr(
            asset.href,
            storage_options=asset.extra_fields["xarray:storage_options"],
            consolidated=True,
        )
    else:
        ds = xr.open_dataset(
            asset.href,
            **asset.extra_fields["xarray:open_kwargs"],
        )

    return ds

In [3]:
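# --- Filtering function: one variable, 2011-2015, study-region bounding box ---
def filterg(ds, var):
    """Return `var` for 2011-2015 inside the study region as a long-format DataFrame."""
    ds_2011_2015 = ds[var].sel(time=slice("2011-01-01", "2015-12-31"))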

    df_var_append = []
    for i in tqdm(range(len(ds_2011_2015.time))):
        df_var = ds_2011_2015.isel(time=i).to_dataframe().reset_index()
        df_var_filter = df_var[
            (df_var['lat'] > -35.18) & (df_var['lat'] < -21.72) &
            (df_var['lon'] > 14.97) & (df_var['lon'] < 32.79)
        ]
        df_var_append.append(df_var_filter)

    df_var_final = pd.concat(df_var_append, ignore_index=True)
    print(f"Filtering for {var} completed")

    df_var_final['time'] = df_var_final['time'].astype(str)

    # Column mapping
    col_mapping = {"lat": "Latitude", "lon": "Longitude", "time": "Sample Date"}
    df_var_final = df_var_final.rename(columns=col_mapping)

    return df_var_final
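
The loop in `filterg` converts each global time slice to a DataFrame before discarding the points outside the region. One possible alternative, sketched here on synthetic data, subsets the array on its coordinates first so that only the study region is converted. `subset_then_frame` and the toy array are hypothetical and not part of this notebook; the sketch assumes the coordinates are named `time`, `lat`, and `lon`, as in the code above.

```python
import numpy as np
import pandas as pd
import xarray as xr


def subset_then_frame(da, lat_bounds, lon_bounds, start, end):
    """Subset a (time, lat, lon) DataArray to a box and period, then convert it."""
    lat_min, lat_max = lat_bounds
    lon_min, lon_max = lon_bounds
    lat_vals = da["lat"].values
    lon_vals = da["lon"].values
    # Integer positions of cells strictly inside the box; this works whether the
    # latitude axis is stored in ascending or descending order.
    lat_idx = np.flatnonzero((lat_vals > lat_min) & (lat_vals < lat_max))
    lon_idx = np.flatnonzero((lon_vals > lon_min) & (lon_vals < lon_max))
    sub = da.sel(time=slice(start, end)).isel(lat=lat_idx, lon=lon_idx)
    return sub.to_dataframe().reset_index()


# Synthetic stand-in for one climate variable (random values, descending latitude)
times = pd.date_range("2010-11-01", periods=6, freq="MS")
lats = np.array([-20.0, -25.0, -30.0, -40.0])
lons = np.array([10.0, 20.0, 30.0, 40.0])
rng = np.random.default_rng(0)
toy = xr.DataArray(
    rng.random((len(times), len(lats), len(lons))),
    coords={"time": times, "lat": lats, "lon": lons},
    dims=("time", "lat", "lon"),
    name="pet",
)

toy_df = subset_then_frame(
    toy, (-35.18, -21.72), (14.97, 32.79), "2011-01-01", "2015-12-31"
)
print(toy_df.shape)
```

On the toy array this keeps 4 time steps, 2 latitudes, and 2 longitudes, so the resulting frame has 16 rows. The date-string conversion and column renaming done in `filterg` would still need to be applied afterwards.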


In [4]:
# --- Climate variable assignment function ---
def assign_nearest_climate(sa_df, climate_df, var_name):
    """
    For each row of sa_df, find the nearest climate grid point and the closest
    timestamp at that point. Return a DataFrame with a single column, var_name,
    holding the matched values in the same row order as sa_df.
    """
    sa_coords = np.radians(sa_df[['Latitude', 'Longitude']].values)
    climate_coords = np.radians(climate_df[['Latitude', 'Longitude']].values)

    tree = cKDTree(climate_coords)
    dist, idx = tree.query(sa_coords, k=1)

    nearest_points = climate_df.iloc[idx].reset_index(drop=True)

    sa_df = sa_df.reset_index(drop=True)
    sa_df[['nearest_lat', 'nearest_lon']] = nearest_points[['Latitude', 'Longitude']]

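    # Sample dates are day-first strings (DD-MM-YYYY).
    sa_df['Sample Date'] = pd.to_datetime(sa_df['Sample Date'], dayfirst=True, errors='coerce')
    # Climate timestamps come from filterg as ISO (YYYY-MM-DD) strings, so they are
    # parsed without dayfirst to avoid any month/day swap.
    climate_df['Sample Date'] = pd.to_datetime(climate_df['Sample Date'], errors='coerce')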

    climate_values = []

    for i in tqdm(range(len(sa_df)), desc=f"Mapping {var_name.upper()} values"):
        sample_date = sa_df.loc[i, 'Sample Date']
        nearest_lat = sa_df.loc[i, 'nearest_lat']
        nearest_lon = sa_df.loc[i, 'nearest_lon']

        subset = climate_df[
            (climate_df['Latitude'] == nearest_lat) &
            (climate_df['Longitude'] == nearest_lon)
        ]

        if subset.empty:
            climate_values.append(np.nan)
            continue

        nearest_idx = (subset['Sample Date'] - sample_date).abs().idxmin()
        climate_values.append(subset.loc[nearest_idx, var_name])

    output_df = pd.DataFrame({var_name: climate_values})

    
    return output_df
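
The closest-timestamp rule used in the loop can be checked in isolation. The sketch below uses made-up sample dates and assumes that monthly records are stamped on the first day of each month; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly timestamps, one per month, on the first day of the month
climate_times = pd.date_range("2011-01-01", periods=4, freq="MS")

# Made-up sample dates in the day-first format used by the water quality files
sample_dates = pd.to_datetime(
    pd.Series(["02-01-2011", "20-01-2011", "14-03-2011"]), format="%d-%m-%Y"
)

# Absolute gap between every sample date and every timestamp, shape (3, 4)
gaps = np.abs(sample_dates.to_numpy()[:, None] - climate_times.to_numpy()[None, :])

# Position of the smallest gap per sample, then the matching timestamp
nearest_pos = gaps.argmin(axis=1)
matched = climate_times[nearest_pos]

print(list(matched.strftime("%Y-%m-%d")))
```

Under this rule a sample taken on 20 January is matched to the 1 February record, because it lies 12 days from 1 February but 19 days from 1 January. If the record for the calendar month of sampling is wanted instead, the sample dates could be truncated to the start of their month before matching.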

### Extracting features for the training dataset

In [5]:
Water_Quality_df=pd.read_csv('water_quality_training_dataset.csv')
Water_Quality_df.head()

| | Latitude | Longitude | Sample Date | Total Alkalinity | Electrical Conductance | Dissolved Reactive Phosphorus |
|---|---|---|---|---|---|---|
| 0 | -28.760833 | 17.730278 | 02-01-2011 | 128.912 | 555.0 | 10.0 |
| 1 | -26.861111 | 28.884722 | 03-01-2011 | 74.72 | 162.9 | 163.0 |
| 2 | -26.45 | 28.085833 | 03-01-2011 | 89.254 | 573.0 | 80.0 |
| 3 | -27.671111 | 27.236944 | 03-01-2011 | 82.0 | 203.6 | 101.0 |
| 4 | -27.356667 | 27.286389 | 03-01-2011 | 56.1 | 145.1 | 151.0 |


In [6]:
Water_Quality_df.shape

(9319, 6)

In [7]:
# Load the TerraClimate dataset, filter PET by time and region, then map the nearest PET value to each training sample
ds = load_terraclimate_dataset()
tc_parameter = filterg(ds,'pet')
Terraclimate_training_df = assign_nearest_climate(Water_Quality_df, tc_parameter, 'pet')

100%|██████████| 60/60 [14:02<00:00, 14.05s/it]


Filtering for pet completed


Mapping PET values: 100%|██████████| 9319/9319 [01:52<00:00, 82.91it/s]


In [8]:
Terraclimate_training_df['Latitude'] = Water_Quality_df['Latitude']
Terraclimate_training_df['Longitude'] = Water_Quality_df['Longitude']
Terraclimate_training_df['Sample Date'] = Water_Quality_df['Sample Date']
Terraclimate_training_df = Terraclimate_training_df[['Latitude', 'Longitude', 'Sample Date', 'pet']]
Terraclimate_training_df.to_csv('terraclimate_features_training.csv', index=False)

In [9]:
# Preview File
Terraclimate_training_df.head()

| | Latitude | Longitude | Sample Date | pet |
|---|---|---|---|---|
| 0 | -28.760833 | 17.730278 | 02-01-2011 | 174.199997 |
| 1 | -26.861111 | 28.884722 | 03-01-2011 | 124.099998 |
| 2 | -26.45 | 28.085833 | 03-01-2011 | 127.5 |
| 3 | -27.671111 | 27.236944 | 03-01-2011 | 129.699997 |
| 4 | -27.356667 | 27.286389 | 03-01-2011 | 129.199997 |


### Extracting features for the validation dataset

In [10]:
Validation_df=pd.read_csv('submission_template.csv')
Validation_df.head()

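| | Latitude | Longitude | Sample Date | Total Alkalinity | Electrical Conductance | Dissolved Reactive Phosphorus |
|---|---|---|---|---|---|---|
| 0 | -32.043333 | 27.822778 | 01-09-2014 | NaN | NaN | NaN |
| 1 | -33.329167 | 26.0775 | 16-09-2015 | NaN | NaN | NaN |
| 2 | -32.991639 | 27.640028 | 07-05-2015 | NaN | NaN | NaN |
| 3 | -34.096389 | 24.439167 | 07-02-2012 | NaN | NaN | NaN |
| 4 | -32.000556 | 28.581667 | 01-10-2014 | NaN | NaN | NaN |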


In [11]:
Validation_df.shape

(200, 6)

In [12]:
# Reuse the filtered PET table (tc_parameter) from the training step and map the nearest PET value to each validation sample
Terraclimate_validation_df = assign_nearest_climate(Validation_df, tc_parameter, 'pet')

Mapping PET values: 100%|██████████| 200/200 [00:02<00:00, 82.26it/s]


In [13]:
Terraclimate_validation_df['Latitude'] = Validation_df['Latitude']
Terraclimate_validation_df['Longitude'] = Validation_df['Longitude']
Terraclimate_validation_df['Sample Date'] = Validation_df['Sample Date']
Terraclimate_validation_df = Terraclimate_validation_df[['Latitude', 'Longitude', 'Sample Date', 'pet']]
Terraclimate_validation_df.to_csv('terraclimate_features_validation.csv', index=False)

In [14]:
# Preview File
Terraclimate_validation_df.head()

| | Latitude | Longitude | Sample Date | pet |
|---|---|---|---|---|
| 0 | -32.043333 | 27.822778 | 01-09-2014 | 161.900009 |
| 1 | -33.329167 | 26.0775 | 16-09-2015 | 177.600006 |
| 2 | -32.991639 | 27.640028 | 07-05-2015 | 158.400009 |
| 3 | -34.096389 | 24.439167 | 07-02-2012 | 130.0 |
| 4 | -32.000556 | 28.581667 | 01-10-2014 | 152.5 |
