# **Forecast Pipeline**

## **1. Setup**

This initial section prepares the Python environment by importing all the necessary libraries for the forecasting pipeline and defining a custom function for a key performance metric.

*   **Core Libraries**: It starts by importing `numpy` for numerical operations and `pandas` for data manipulation and analysis, which are fundamental for handling the weather dataset.
*   **Standard Utilities**: Libraries like `warnings` (to suppress unnecessary warnings and keep the output clean), `os` (for interacting with the operating system), and `re` (for regular expressions) are included for general-purpose tasks.
*   **Visualization**: `matplotlib` and `seaborn` are imported for plotting, which supports both data exploration and the analysis of results.
*   **Statistical Analysis**: `statsmodels` is used for statistical tests and plotting functions like the Autocorrelation Function (ACF) plot, which is vital for time series analysis. `scipy` is included for scientific and technical computing.
*   **Machine Learning Models**: Several models are imported from popular libraries:
    *   `scikit-learn`: `SGDRegressor`, `RandomForestRegressor`, and `StackingRegressor`.
    *   `xgboost`: `XGBRegressor` and `XGBClassifier` (the latter is a classifier rather than a regressor), both gradient-boosted tree models known for their high performance.
    *   `lightgbm`: `LGBMRegressor`, another powerful gradient-boosting framework.
*   **Model Evaluation & Preprocessing**:
    *   From `scikit-learn`, various metrics like `mean_squared_error` and `r2_score` are imported to evaluate model performance. `MinMaxScaler` is used for feature scaling.
    *   `TimeSeriesSplit` is specifically included for cross-validation on time series data, which respects the temporal order of observations.
*   **Hyperparameter Tuning**: `optuna` is imported for efficient hyperparameter optimization, helping to find the best settings for the models.
*   **Model Explainability**: `shap` is included to help explain the output of the machine learning models, providing insights into which features are most important for the predictions.
*   **Custom Metric**: A function `rmse` (Root Mean Squared Error) is defined. This is a standard metric for regression tasks that measures the average magnitude of the errors, giving more weight to larger errors.

In [1]:
# 1. Setup & Imports
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import re
import os

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import shap
from statsmodels.graphics.tsaplots import plot_acf

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import (
    mean_absolute_percentage_error,
    make_scorer,
    mean_squared_error,
    root_mean_squared_error,
    mean_absolute_error,
    r2_score
)
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.feature_selection import mutual_info_regression

from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor, XGBClassifier
from lightgbm import LGBMRegressor

import optuna
from optuna.samplers import TPESampler
from optuna.pruners import MedianPruner
from sklearn.ensemble import StackingRegressor


In [2]:
def rmse(y, yhat):
    return np.sqrt(mean_squared_error(y, yhat))
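
To illustrate why RMSE gives more weight to larger errors, the sketch below compares two hypothetical error patterns that share the same mean absolute error. The arrays are made-up values rather than data from the notebook, and the `mae` helper is introduced only for this comparison.

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error, computed with NumPy only."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean absolute error (hypothetical helper for this comparison)."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.mean(np.abs(y - yhat)))

y_true = np.array([25.0, 26.0, 27.0, 28.0])
even_errors = y_true + np.array([1.0, 1.0, 1.0, 1.0])    # four errors of 1 degree
one_big_error = y_true + np.array([0.0, 0.0, 0.0, 4.0])  # a single error of 4 degrees

print(mae(y_true, even_errors), rmse(y_true, even_errors))      # 1.0 1.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # 1.0 2.0
```

Both forecasts have the same MAE, but the one with a single large miss has twice the RMSE.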

## **2. Load and Inspect Local Weather Data**


This section loads the historical hourly weather data from a local Excel file and performs an initial inspection to understand its structure and identify any missing values. This is the first step in the data preprocessing pipeline.

### **2.1. Data Loading and Initial Preview**


*   **Data Loading**: The code uses the `pd.read_excel()` function from the pandas library to read the weather data from a file named `"HCMWeatherHourly.xlsx"`. This dataset is loaded into a pandas DataFrame called `df`, which is a powerful and flexible data structure for handling tabular data.

*   **Initial Data Preview**: Immediately after loading, the notebook displays the first 24 rows of the DataFrame using `df.head(24)`. Since the data is recorded hourly, this provides a convenient snapshot of the full 24-hour cycle for the first day in the dataset. This allows for a quick visual check of the column names (`name`, `datetime`, `temp`, etc.), the format of the data in each column, and confirms that the file was read correctly.

In [3]:
df = pd.read_excel("HCMWeatherHourly.xlsx")
df.head(24)


Unnamed: 0,name,datetime,temp,feelslike,dew,humidity,precip,precipprob,preciptype,snow,...,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,severerisk,conditions,icon,stations
0,"10.82, 106.67",2015-01-01T00:00:00,24.9,24.9,19.0,69.35,0.0,0,,0.0,...,1012.0,50.0,7.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48900099999,VVTS"
1,"10.82, 106.67",2015-01-01T01:00:00,24.9,24.9,20.0,73.79,0.0,0,,0.0,...,1012.0,50.0,7.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48894099999,48900099999,VVTS"
2,"10.82, 106.67",2015-01-01T02:00:00,24.0,24.0,20.0,78.35,0.0,0,,0.0,...,1012.0,50.0,7.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48900099999,VVTS"
3,"10.82, 106.67",2015-01-01T03:00:00,24.0,24.0,20.0,78.35,0.0,0,,0.0,...,1012.0,50.0,6.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48900099999,VVTS"
4,"10.82, 106.67",2015-01-01T04:00:00,24.0,24.0,20.9,83.33,0.0,0,,0.0,...,1012.0,50.0,6.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48894099999,48900099999,VVTS"
5,"10.82, 106.67",2015-01-01T05:00:00,24.0,24.0,19.9,78.35,0.0,0,,0.0,...,1012.0,50.0,6.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48900099999,VVTS"
6,"10.82, 106.67",2015-01-01T06:00:00,23.0,23.0,17.9,73.45,0.0,0,,0.0,...,1013.0,50.0,6.0,0.0,0.0,0.0,,Partially cloudy,partly-cloudy-night,"48900099999,VVTS"
7,"10.82, 106.67",2015-01-01T07:00:00,23.0,23.0,17.0,68.96,0.0,0,,0.0,...,1013.0,88.0,8.0,38.0,0.1,0.0,,Partially cloudy,partly-cloudy-day,"48894099999,48900099999,VVTS"
8,"10.82, 106.67",2015-01-01T08:00:00,24.0,24.0,17.0,64.92,0.0,0,,0.0,...,1015.0,88.0,8.0,226.0,0.8,2.0,,Partially cloudy,partly-cloudy-day,"48900099999,VVTS"
9,"10.82, 106.67",2015-01-01T09:00:00,26.0,26.0,18.0,61.38,0.0,0,,0.0,...,1014.0,25.0,8.0,436.7,1.6,4.0,,Partially cloudy,partly-cloudy-day,"48900099999,VVTS"


### **2.2. Missing Value Assessment**

*   **Missing Value Assessment**: The `df.isnull().sum()` command is then executed to perform a crucial data quality check.
    *   It systematically scans each column in the DataFrame and returns a count of the total number of missing or null (`NaN`) values within that column.
    *   This check reveals how complete the dataset is. Many estimators (for example `SGDRegressor`) cannot accept `NaN` inputs, while gradient-boosted trees such as XGBoost and LightGBM can, so locating the missing values is the first step toward choosing a cleaning strategy.
    *   The output shows that several columns like **`preciptype`** and **`severerisk`** have a very high number of missing values (81767 and 61603, respectively), suggesting they may not be reliable for modeling. Other columns like `precip` and `visibility` have a much smaller number of missing entries, which will need to be addressed in the next cleaning stage. Many other columns have no missing data at all.
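
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a toy DataFrame (the values and the `summary` name are illustrative, not taken from the notebook) that reports both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the weather table
toy = pd.DataFrame({
    "temp": [24.9, 24.0, 23.0, 26.0],
    "precip": [0.0, np.nan, 0.0, 1.2],
    "preciptype": [np.nan, np.nan, np.nan, "rain"],
})

is_missing = toy.isnull()
summary = pd.DataFrame({
    "n_missing": is_missing.sum(),           # same numbers as df.isnull().sum()
    "pct_missing": is_missing.mean() * 100,  # share of rows that are missing
}).sort_values("pct_missing", ascending=False)

print(summary)
```

Sorting by percentage puts candidates for removal (such as `preciptype` here) at the top.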

In [4]:
df.isnull().sum()

name                    0
datetime                0
temp                    0
feelslike               0
dew                     0
humidity                0
precip                 39
precipprob              0
preciptype          81767
snow                   42
snowdepth              42
windgust               49
windspeed               0
winddir                10
sealevelpressure        0
cloudcover              0
visibility            198
solarradiation         36
solarenergy            36
uvindex                36
severerisk          61603
conditions              0
icon                    0
stations                0
dtype: int64

## **3. Cleaning and Encoding**

This section focuses on essential data preprocessing tasks. It involves converting data types, removing irrelevant or problematic columns, creating new, more useful features from existing ones (feature engineering), and standardizing units of measurement to ensure the data is clean, consistent, and ready for modeling.

### **3.1. Data Type Conversion, Column Removal, and Inspection**


*   **Data Type Conversion and Column Removal**:
    *   The `datetime` column is converted from a generic object (string) into a proper `datetime` format using `pd.to_datetime()`. This is a critical step for any time series analysis.
    *   Several columns are dropped using `df.drop()` based on factors like high missing values (`severerisk`), irrelevance (`snow`), or redundancy (`name`, `stations`). The same call also drops `snowdepth`, `conditions`, and `icon`.

*   **Inspecting the Cleaned Data**:
    *   The `df.head(24)` command is called again to display the first 24 rows of the modified DataFrame. This allows for a quick visual confirmation that the irrelevant columns have been successfully removed.
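
As a small illustration of what the conversion enables, the sketch below parses ISO-style timestamp strings on a toy frame and then uses the `.dt` accessor, which is only available once the column has a datetime dtype. The toy values mimic the preview above; the `is_hourly` check is an addition for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["10.82, 106.67"] * 3,
    "datetime": ["2015-01-01T00:00:00", "2015-01-01T01:00:00", "2015-01-01T02:00:00"],
    "temp": [24.9, 24.9, 24.0],
})

# Parse strings into timestamps and drop a column that carries no signal
toy["datetime"] = pd.to_datetime(toy["datetime"])
toy = toy.drop(columns=["name"])

hours = toy["datetime"].dt.hour.tolist()
is_hourly = bool((toy["datetime"].diff().dropna() == pd.Timedelta(hours=1)).all())

print(hours)      # [0, 1, 2]
print(is_hourly)  # True
```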

In [5]:
df['datetime'] = pd.to_datetime(df['datetime'])

df.drop(columns = ['name','snow','snowdepth','conditions','icon','stations','severerisk'], inplace = True)

In [6]:
df.head(24)

Unnamed: 0,datetime,temp,feelslike,dew,humidity,precip,precipprob,preciptype,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex
0,2015-01-01 00:00:00,24.9,24.9,19.0,69.35,0.0,0,,9.4,3.6,240.0,1012.0,50.0,7.0,0.0,0.0,0.0
1,2015-01-01 01:00:00,24.9,24.9,20.0,73.79,0.0,0,,9.0,3.6,290.0,1012.0,50.0,7.0,0.0,0.0,0.0
2,2015-01-01 02:00:00,24.0,24.0,20.0,78.35,0.0,0,,6.5,5.4,320.0,1012.0,50.0,7.0,0.0,0.0,0.0
3,2015-01-01 03:00:00,24.0,24.0,20.0,78.35,0.0,0,,6.8,5.4,330.0,1012.0,50.0,6.0,0.0,0.0,0.0
4,2015-01-01 04:00:00,24.0,24.0,20.9,83.33,0.0,0,,9.0,5.4,340.0,1012.0,50.0,6.0,0.0,0.0,0.0
5,2015-01-01 05:00:00,24.0,24.0,19.9,78.35,0.0,0,,10.1,9.4,330.0,1012.0,50.0,6.0,0.0,0.0,0.0
6,2015-01-01 06:00:00,23.0,23.0,17.9,73.45,0.0,0,,10.8,7.6,10.0,1013.0,50.0,6.0,0.0,0.0,0.0
7,2015-01-01 07:00:00,23.0,23.0,17.0,68.96,0.0,0,,13.7,9.4,360.0,1013.0,88.0,8.0,38.0,0.1,0.0
8,2015-01-01 08:00:00,24.0,24.0,17.0,64.92,0.0,0,,19.1,13.0,30.0,1015.0,88.0,8.0,226.0,0.8,2.0
9,2015-01-01 09:00:00,26.0,26.0,18.0,61.38,0.0,0,,20.2,7.6,30.0,1014.0,25.0,8.0,436.7,1.6,4.0


### **3.2. One-Hot Encoding, Unit Conversion, and Final Preview**

*   **One-Hot Encoding for Precipitation Type**:
    *   The `preciptype` column originally contained text describing the type of precipitation, but most of its values are missing (81767 in the null count above).
    *   To handle this, a new binary feature called `has_rain` is created. Each `preciptype` entry is lower-cased and checked for the substring "rain"; the result is `1` when it is present and `0` otherwise, including for missing entries.
    *   The original `preciptype` column is then dropped to avoid redundancy.

*   **Unit Conversion and Standardization**:
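    *   The script standardizes the units of measurement so that the features are on consistent scales. The factors below assume the source file uses US customary units (inches, mph, miles); since `temp` in the preview looks like degrees Celsius, it is worth confirming the export's unit system before applying them.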
        *   **Precipitation**: Converted from inches to millimeters (mm).
        *   **Percentage to Fraction**: Features like `humidity`, `cloudcover`, and `precipprob` are converted from percentages (0-100) to fractional values (0-1).
        *   **Wind Speed**: Converted from miles per hour (mph) to meters per second (m/s).
        *   **Visibility**: Converted from miles (mi) to kilometers (km).
        *   **Solar Energy**: Converted from MJ/m²/day to an equivalent in Watts per square meter (W/m²), creating a new `solarenergy_wm2eq` column.

*   **Final Data Preview**:
    *   `df.columns` is printed to confirm the result of the cleaning and standardization steps: `preciptype` is gone, and the new `has_rain` and `solarenergy_wm2eq` columns are present.
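
The conversion factors used in this step can be derived from basic definitions. The sketch below recomputes them; the constant names are illustrative. It assumes, as the notebook does, that `solarenergy` is a daily total. If it is instead an hourly total (worth checking against the data provider's documentation), the divisor would be 3600 s rather than 86400 s.

```python
MM_PER_INCH = 25.4
M_PER_MILE = 1609.344
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

# 1 m/s expressed in mph: dividing a speed in mph by this value yields m/s
mph_per_ms = SECONDS_PER_HOUR / M_PER_MILE
# 1 mile expressed in km
km_per_mile = M_PER_MILE / 1000
# 1 MJ/m² spread evenly over one day, expressed as an average power in W/m²
wm2_per_mj_day = 1e6 / SECONDS_PER_DAY

print(round(mph_per_ms, 3))      # 2.237
print(round(km_per_mile, 3))     # 1.609
print(round(wm2_per_mj_day, 2))  # 11.57
```

The notebook's factor of 11.6 is this last value rounded to one decimal place.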

In [7]:
# --- One-hot encode preciptype to has_rain (rain=1, NaN/others=0)
df['has_rain'] = (
    df['preciptype']
    .fillna('')          # handle NaN safely
    .str.lower()         # normalize capitalization
    .str.contains('rain')
    .astype(int)         # True → 1, False → 0
)

# Optional: drop the original column
df.drop(columns=['preciptype'], inplace=True)
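
To see how the encoding behaves on edge cases, here is a small sketch applied to a toy Series. The sample strings are hypothetical (the notebook does not list the actual `preciptype` values), and `regex=False` is added so that the pattern is treated as a plain substring.

```python
import numpy as np
import pandas as pd

# Hypothetical entries: plain rain, missing, mixed types, other type, None
preciptype = pd.Series(["rain", np.nan, "Rain,snow", "snow", None])

has_rain = (
    preciptype
    .fillna("")                         # missing entries become empty strings
    .str.lower()                        # make the match case-insensitive
    .str.contains("rain", regex=False)  # plain substring test
    .astype(int)
)

print(has_rain.tolist())  # [1, 0, 1, 0, 0]
```

Missing entries and non-rain types both map to `0`, so the feature cannot distinguish "no precipitation" from "type not recorded".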

In [8]:
# Convert units of measurement
# Precipitation (in → mm)
df['precip'] = df['precip'] * 25.4

# Percentage-based to fraction
pct_cols = ['humidity','cloudcover','precipprob']
df[pct_cols] = df[pct_cols] / 100.0

# Wind (mph → m/s)
wind_cols = ['windspeed','windgust']
df[wind_cols] = df[wind_cols] / 2.237

# Visibility (mi → km)
df['visibility'] = df['visibility'] * 1.609

# Convert MJ/m²/day → W/m²-equivalent
df['solarenergy_wm2eq'] = df['solarenergy'] * 11.6


In [10]:
df.columns

Index(['datetime', 'temp', 'feelslike', 'dew', 'humidity', 'precip',
       'precipprob', 'windgust', 'windspeed', 'winddir', 'sealevelpressure',
       'cloudcover', 'visibility', 'solarradiation', 'solarenergy', 'uvindex',
       'has_rain', 'solarenergy_wm2eq'],
      dtype='object')

## **4. Feature Engineering**

This section focuses on creating new, physically meaningful features from the existing data. The goal is to explicitly represent meteorological concepts and cyclical patterns, making it easier for the machine learning model to understand the underlying physics that drive temperature changes.

*   **Initial Setup**:
    *   The first step is to sort the DataFrame by `datetime` and reset the index. This ensures that the data is in strict chronological order, which is essential for any time-series operations that follow.

*   **Feature Creation**:
    *   **1. Radiation Efficiency**: This feature models how effectively solar radiation penetrates the atmosphere. It combines `solarradiation` and `solarenergy` and adjusts for `cloudcover`, providing a single metric for the "strength" of the sun's heating effect.
    *   **2. Moisture Ratio**: The `dew_humidity_ratio` is engineered by dividing the dew point by humidity. This provides a measure of air moisture saturation.
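    *   **3. Wind Vector Components**: The wind direction (`winddir`) and speed (`windspeed`) are decomposed into two orthogonal components, `wind_u = windspeed * cos(winddir)` and `wind_v = windspeed * sin(winddir)`. This transforms a cyclical feature (direction) into a pair of continuous values that machine learning models can easily use. The notebook treats `wind_u` as the east-west and `wind_v` as the north-south component; if `winddir` is a compass bearing measured clockwise from north, the cosine term actually lies along the north-south axis, so the labels are best read as nominal. The components are also "de-biased" by subtracting their means (computed over the full dataset) to stabilize the model's learning process.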
    *   **4. Convective Potential (Storminess)**: Two features, `convective_potential` and `storminess`, are created to estimate the potential for convective storms by combining humidity, radiation, and wind data.
    *   **5. Seasonality (Annual and Diurnal)**: This part encodes the cyclical nature of time.
        *   **Annual Cycle**: The `dayofyear` is extracted and transformed into `doy_sin` and `doy_cos` components to represent the yearly cycle.
        *   **Diurnal Cycle**: The `hour` of the day is similarly converted into `hour_sin` and `hour_cos` to represent the 24-hour daily cycle.
        *   **Daylight Mask**: A binary feature, `is_daylight`, is created to explicitly tell the model whether it is daytime or nighttime.

*   **Verifying the New Features**:
    *   The `df.head(5)` command is called to display the first few rows of the DataFrame. This allows for a quick inspection of the newly created columns (like `radiation_efficiency`, `wind_u`, `hour_sin`, etc.) to ensure they have been calculated and added to the dataset correctly. The output shows the DataFrame has expanded to 31 columns.
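
The benefit of the sine/cosine encoding is that times which are close on the clock stay close in feature space, even across the wrap-around. The sketch below checks this for the hour of day; `encode_cyclical` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0, 12, 23])
hour_sin, hour_cos = encode_cyclical(hours, 24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distances in the encoded space
dist_23_to_0 = float(np.linalg.norm(points[2] - points[0]))
dist_12_to_0 = float(np.linalg.norm(points[1] - points[0]))

print(round(dist_23_to_0, 3))  # 0.261
print(round(dist_12_to_0, 3))  # 2.0
```

On the raw `hour` scale, 23:00 and 00:00 are 23 units apart; after encoding they are neighbors, while noon and midnight sit at opposite ends of the circle.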

In [11]:
# ============================================================================
# 🌡️ PHYSICS-BASED FEATURES: HOURLY VERSION (Leakage-Free, Stable)
# ============================================================================

# ---------------------------------------------------------------------------
# 0️⃣ Basic safety setup
# ---------------------------------------------------------------------------
df = df.sort_values("datetime").reset_index(drop=True)

# ---------------------------------------------------------------------------
# 1️⃣ Radiation Efficiency (hourly solar → cloud attenuation)
# ---------------------------------------------------------------------------
# Better physical formulation:
# Clear-sky radiation proxy ~ (solarenergy OR solarradiation)
# Cloudcover attenuates radiation linearly (approx.)
# Note: cloudcover was already rescaled to a 0-1 fraction in the unit
# conversion cell, so it is not divided by 100 again here.
df['radiation_efficiency'] = (
    (df['solarradiation'] + 0.1 * df['solarenergy'])
    / (1 + df['cloudcover'])
).fillna(0)

# ---------------------------------------------------------------------------
# 2️⃣ Moisture ratio: dew point relative to humidity
# ---------------------------------------------------------------------------
# Avoid division spikes when humidity ~ 0
df['dew_humidity_ratio'] = df['dew'] / (df['humidity'].replace(0, np.nan))
df['dew_humidity_ratio'] = df['dew_humidity_ratio'].fillna(0)

# ---------------------------------------------------------------------------
# 3️⃣ Wind Vector Components (u, v): hourly wind representation
# ---------------------------------------------------------------------------
df['wind_u'] = df['windspeed'] * np.cos(np.deg2rad(df['winddir']))
df['wind_v'] = df['windspeed'] * np.sin(np.deg2rad(df['winddir']))

# De-bias (stabilizes SHAP + avoids mean offset leakage)
df['wind_u'] -= df['wind_u'].mean()
df['wind_v'] -= df['wind_v'].mean()

# ---------------------------------------------------------------------------
# 4️⃣ Hourly Convective Potential (storminess)
# ---------------------------------------------------------------------------
# More meteorologically meaningful:
#   convective storms → high humidity + high radiation + gusty winds
df['convective_potential'] = (
    df['humidity'].clip(lower=0) *
    df['solarradiation'].clip(lower=0) *
    (df['windgust'] + 1)
)

# Earlier version relied only on wind² * precipprob, which is weaker.
# precipprob is already a 0-1 fraction after the unit conversion cell,
# so it is used directly rather than divided by 100 again.
df['storminess'] = (df['windspeed'] ** 2) * df['precipprob']

# ---------------------------------------------------------------------------
# 5️⃣ Seasonality (annual + diurnal)
# ---------------------------------------------------------------------------

# A. Annual seasonal cycle (day-of-year)
df['dayofyear'] = df['datetime'].dt.dayofyear
df['doy_sin'] = np.sin(2 * np.pi * df['dayofyear'] / 365.25)
df['doy_cos'] = np.cos(2 * np.pi * df['dayofyear'] / 365.25)

# B. Hour-of-day (diurnal heating cycle)
df['hour'] = df['datetime'].dt.hour
df['hour_sin']  = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos']  = np.cos(2 * np.pi * df['hour'] / 24)

# C. Whether solar radiation should be >0 (sunlight mask)
df['is_daylight'] = (df['solarradiation'] > 0).astype(int)


In [12]:
df.head(5)

Unnamed: 0,datetime,temp,feelslike,dew,humidity,precip,precipprob,windgust,windspeed,winddir,...,wind_v,convective_potential,storminess,dayofyear,doy_sin,doy_cos,hour,hour_sin,hour_cos,is_daylight
0,2015-01-01 00:00:00,24.9,24.9,19.0,0.6935,0.0,0.0,4.202056,1.609298,240.0,...,-0.864997,0.0,0.0,1,0.017202,0.999852,0,0.0,1.0,0
1,2015-01-01 01:00:00,24.9,24.9,20.0,0.7379,0.0,0.0,4.023245,1.609298,290.0,...,-0.98355,0.0,0.0,1,0.017202,0.999852,1,0.258819,0.965926,0
2,2015-01-01 02:00:00,24.0,24.0,20.0,0.7835,0.0,0.0,2.905677,2.413947,320.0,...,-1.022959,0.0,0.0,1,0.017202,0.999852,2,0.5,0.866025,0
3,2015-01-01 03:00:00,24.0,24.0,20.0,0.7835,0.0,0.0,3.039785,2.413947,330.0,...,-0.678278,0.0,0.0,1,0.017202,0.999852,3,0.707107,0.707107,0
4,2015-01-01 04:00:00,24.0,24.0,20.9,0.8333,0.0,0.0,4.023245,2.413947,340.0,...,-0.296923,0.0,0.0,1,0.017202,0.999852,4,0.866025,0.5,0
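
The wind decomposition used in this section is lossless: speed and direction can be recovered from the two components. The sketch below demonstrates the round trip using the notebook's mapping (`u = speed * cos(dir)`, `v = speed * sin(dir)`) on made-up values and without the mean-centering step. Note that the usual meteorological convention differs: with direction measured clockwise from north and denoting where the wind comes from, `u = -speed * sin(dir)` and `v = -speed * cos(dir)`. The helper names are hypothetical.

```python
import numpy as np

def to_components(speed, direction_deg):
    """Same mapping as the notebook: u = speed*cos(dir), v = speed*sin(dir)."""
    theta = np.deg2rad(direction_deg)
    return speed * np.cos(theta), speed * np.sin(theta)

def from_components(u, v):
    """Invert the mapping; the direction is returned in [0, 360)."""
    speed = np.hypot(u, v)
    direction_deg = np.degrees(np.arctan2(v, u)) % 360
    return speed, direction_deg

speed = np.array([3.0, 5.0, 2.0])
direction = np.array([30.0, 240.0, 350.0])

u, v = to_components(speed, direction)
speed_back, direction_back = from_components(u, v)

print(np.allclose(speed_back, speed))          # True
print(np.allclose(direction_back, direction))  # True
```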


## **5. Advanced Feature Engineering with Rolling Windows**

This section creates a comprehensive set of advanced features by building upon the previous physics-based concepts and introducing time-series specific techniques like rolling window statistics, gradients, and climate indicators. This is the most critical feature engineering step, designed to give the model a deep understanding of both recent trends and long-term climate patterns.

### **5.1. Hourly Feature Engineering**


*   **1. Physics-Based Base Features**:
    *   This part recomputes the core physics-based features from the previous section so that every base feature is present before more complex ones are built on top of them. Most formulas are unchanged, but `radiation_efficiency` is overwritten with a different formulation (a 0.05 weight on `solarenergy` and division by `cloudcover + 1e-3`). The block covers `radiation_efficiency`, `dew_humidity_ratio`, wind vectors (`wind_u`, `wind_v`), `convective_potential`, `storminess`, and seasonality (`doy_sin`, `hour_sin`, etc.).

*   **2. Rolling / Lag Features (Windows: 24h, 168h, 336h)**:
    *   This is a classic time series technique where statistics are calculated over a moving "window" of past data. It allows the model to see recent history and trends.
    *   **Rolling Wind Statistics**: The code calculates the `mean` and `variance` of the `wind_u` and `wind_v` components over 168-hour (7-day) and 336-hour (14-day) windows. This provides the model with insights into the prevailing wind patterns and their stability over the past one to two weeks.

*   **3. Pressure & Humidity Gradients**:
    *   Gradients measure the rate of change of a variable, which is a powerful predictor in time series forecasting.
    *   **Hourly Changes**: The code uses `.diff(1)` to calculate the change in `sealevelpressure`, `humidity`, and `dew` point from the previous hour. A rapid drop in pressure, for example, can be a strong indicator of an impending storm.
    *   **Rolling Statistics on Levels**: It then calculates the `mean` and `variance` of the raw `sealevelpressure` and `humidity` values (not of the hourly changes) over 24-hour and 168-hour windows to capture the average level and volatility of these key meteorological variables.

*   **4. Climate & Regime Indicators**:
    *   These features are designed to capture the broader climate context beyond just the immediate weather.
    *   **Seasonal Indicators**: `is_wet_season` is a binary flag based on the month, and `season_progress` indicates how far into the wet/dry season the current date is.
    *   **Soil Moisture**: `soil_wetness_index` is a sophisticated feature that models moisture retention in the ground. It calculates the total rainfall over the past 14 days (336 hours) and uses a negative exponential function to estimate soil saturation. Wetter soil can significantly impact local temperature and humidity.
    *   **Monsoon Logic**: Based on the direction of the north-south wind (`wind_v`), the code creates flags to identify the Southwest (`sw_monsoon_flag`) and Northeast (`ne_monsoon_flag`) monsoon periods, which are dominant climate drivers in the region.

*   **5. Rain Type Classification**:
    *   This part aims to classify precipitation type using the available data, creating more nuanced rain features.
    *   **Rain Ratios**: Features like `precip_cloud_ratio` and `precip_efficiency` are engineered to characterize the nature of the rainfall (e.g., light rain from heavy clouds vs. heavy rain with low humidity).
    *   **Convective vs. Stratiform Rain**: A rule-based system is implemented to classify rain as either `is_convective_rain` (intense, stormy rain) or `is_stratiform_rain` (steady, less intense rain). An hour is flagged as convective when precipitation and wind gust are both above their median values and humidity is above a fixed 60% threshold; any other hour with non-zero precipitation is flagged as stratiform.

*   **6. Advanced Physical Coherence Features**:
    *   These features capture the dynamic interplay between different weather variables over time.
    *   `wind_dir_consistency`: Measures how stable the wind direction has been over 24 and 168 hours.
    *   **Covariances**: It calculates the 24-hour rolling `covariance` for several physically related pairs of variables (e.g., `humidity` and `solarradiation`). This tells the model how these variables have been trending together in the recent past.
    *   **Mean Wind Direction**: A robust 24-hour mean wind direction is calculated by averaging the U and V components, which correctly handles the circular nature of wind direction data.

*   **7. Cleanup**:
    *   **Handling Missing Values**: Because the rolling windows are created with `min_periods=1`, rolling means are defined from the first row onward. `NaN` (Not a Number) values still arise from `.diff(1)` on the first row, from rolling variances and covariances that have only one observation, and from gaps in the source columns. The `df.fillna(0)` command fills all of these with zero.
    *   **Dropping Warm-Up Rows**: The `.iloc[max(windows):]` command is the final step. It removes the first 336 rows, which do not have a complete 336-hour (14-day) look-back window. This ensures that every row used for training has features computed from full windows rather than partial ones. This is a data-quality measure; protection against leakage comes from the windows being backward-looking, not from this trimming.
    *   The `print` statements at the end confirm that the feature engineering is complete and report the new total number of features, which has grown to 76.
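
The effect of `min_periods=1` on where `NaN` values appear can be checked on a short toy Series. This is a minimal sketch with made-up numbers and a window of 3 instead of the notebook's 24 to 336 hours.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])

mean_partial = s.rolling(3, min_periods=1).mean()  # partial windows allowed
var_partial = s.rolling(3, min_periods=1).var()    # sample variance needs 2+ points
mean_strict = s.rolling(3).mean()                  # default: full window required
change = s.diff(1)                                 # first row has no predecessor

print(mean_partial.round(3).tolist())  # [1.0, 1.5, 2.333, 4.667]
print(var_partial.isna().tolist())     # [True, False, False, False]
print(mean_strict.isna().tolist())     # [True, True, False, False]
print(change.isna().tolist())          # [True, False, False, False]
```

With partial windows the early rows get values, but those values are based on fewer observations than later rows, which is one reason to discard the warm-up period.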

In [13]:
# ============================================================================
# 🌡️ HOURLY FEATURE ENGINEERING (NO TEMP LEAKAGE, WINDOWS 1–14 DAYS)
# ============================================================================

df = df.sort_values("datetime").reset_index(drop=True)

# ===============================================================
# 0️⃣ PHYSICS-BASED BASE FEATURES
# ===============================================================

# --- Radiation efficiency (solar vs cloud attenuation) ---
df['radiation_efficiency'] = (
    (df['solarradiation'] + 0.05 * df['solarenergy'])
    / (df['cloudcover'] + 1e-3)
).fillna(0)

# --- Moisture ratio ---
df['dew_humidity_ratio'] = (df['dew'] / df['humidity'].replace(0, np.nan)).fillna(0)

# --- Wind vectors ---
df['wind_u'] = df['windspeed'] * np.cos(np.deg2rad(df['winddir']))
df['wind_v'] = df['windspeed'] * np.sin(np.deg2rad(df['winddir']))

# Center for SHAP clarity
df['wind_u'] -= df['wind_u'].mean()
df['wind_v'] -= df['wind_v'].mean()

# --- Convective potential ---
df['convective_potential'] = (
    df['humidity'].clip(lower=0) *
    df['solarradiation'].clip(lower=0) *
    (df['windgust'] + 1)
)

# precipprob is already a 0-1 fraction, so no further division by 100
df['storminess'] = (df['windspeed']**2) * df['precipprob']

# --- Seasonality ---
df['dayofyear'] = df['datetime'].dt.dayofyear
df['doy_sin'] = np.sin(2*np.pi*df['dayofyear']/365.25)
df['doy_cos'] = np.cos(2*np.pi*df['dayofyear']/365.25)

df['hour'] = df['datetime'].dt.hour
df['hour_sin'] = np.sin(2*np.pi*df['hour']/24)
df['hour_cos'] = np.cos(2*np.pi*df['hour']/24)
df['is_daylight'] = (df['solarradiation'] > 0).astype(int)

# ===============================================================
# 1️⃣ ROLLING / LAGS: WINDOWS 24h, 168h, 336h
# ===============================================================

windows = [24, 168, 336]    # 1 day, 7 days, 14 days

# Rolling wind statistics
for w in [168, 336]:
    df[f'wind_u_mean_w{w}h'] = df['wind_u'].rolling(w, 1).mean()
    df[f'wind_v_mean_w{w}h'] = df['wind_v'].rolling(w, 1).mean()
    df[f'wind_u_var_w{w}h']  = df['wind_u'].rolling(w, 1).var()
    df[f'wind_v_var_w{w}h']  = df['wind_v'].rolling(w, 1).var()

# ===============================================================
# 2️⃣ PRESSURE & HUMIDITY GRADIENTS: HOURLY
# ===============================================================

df['pressure_change_1h'] = df['sealevelpressure'].diff(1)
df['humidity_change_1h'] = df['humidity'].diff(1)
df['dew_change_1h']      = df['dew'].diff(1)

for w in [24, 168]:
    df[f'pressure_mean_w{w}h'] = df['sealevelpressure'].rolling(w,1).mean()
    df[f'pressure_var_w{w}h']  = df['sealevelpressure'].rolling(w,1).var()
    df[f'humidity_mean_w{w}h'] = df['humidity'].rolling(w,1).mean()
    df[f'humidity_var_w{w}h']  = df['humidity'].rolling(w,1).var()

# ===============================================================
# 3️⃣ CLIMATE & REGIME INDICATORS
# ===============================================================

m = df['datetime'].dt.month
df['is_wet_season'] = m.isin([5,6,7,8,9,10]).astype(int)

df['season_progress'] = np.where(
    m < 5, 0,
    np.where(m > 10, 1, (m - 5) / 5)
)

# Soil moisture over last 14 days (336 hours)
wet336 = df['precip'].rolling(336,1).sum()
df['soil_wetness_index'] = 1 - np.exp(-0.05 * wet336)

# Monsoon logic
df['sw_monsoon_flag'] = (df['wind_v'] > 0).astype(int)
df['ne_monsoon_flag'] = (df['wind_v'] < 0).astype(int)
df['wind_monsoon_index'] = df['wind_u'] + df['wind_v']
df['wind_monsoon_weighted'] = df['wind_monsoon_index'] * df['doy_sin']

# ===============================================================
# 4️⃣ RAIN TYPE CLASSIFICATION (NO PRECIPCOVER)
# ===============================================================

df['precip_intensity'] = df['precip']
df['precip_cloud_ratio'] = df['precip'] / (df['cloudcover'] + 1e-3)
df['precip_efficiency']  = df['precip'] / (df['humidity'] + 1e-3)

med_intensity = df['precip'].median()
med_gust      = df['windgust'].median()

df['is_convective_rain'] = (
    (df['precip'] > med_intensity) &
    (df['windgust'] > med_gust) &
    (df['humidity'] > 0.60)   # humidity is a 0-1 fraction after unit conversion
).astype(int)

df['is_stratiform_rain'] = (
    (df['precip'] > 0) &
    (df['is_convective_rain'] == 0)
).astype(int)

df['convective_prev'] = df['is_convective_rain'].shift(1).fillna(0)

# ===============================================================
# 5️⃣ ADVANCED PHYSICAL COHERENCE FEATURES
# ===============================================================

u, v = df['wind_u'], df['wind_v']

df['wind_dir_consistency_w24h'] = (
    np.sqrt(u.rolling(24,1).mean()**2 + v.rolling(24,1).mean()**2)
    / (df['windspeed'].rolling(24,1).mean() + 1e-3)
)

df['wind_dir_consistency_w168h'] = (
    np.sqrt(u.rolling(168,1).mean()**2 + v.rolling(168,1).mean()**2)
    / (df['windspeed'].rolling(168,1).mean() + 1e-3)
)

df['wind_pressure_coupling'] = df['windspeed'] * (
    df['sealevelpressure'].rolling(3,1).mean() -
    df['sealevelpressure'].rolling(12,1).mean()
)

df['humid_radiation_balance'] = (
    df['humidity'].rolling(3,1).mean() *
    df['solarradiation'].rolling(3,1).mean()
)

df['humid_radiation_ratio'] = df['humidity'] / (df['solarradiation'] + 1e-3)

# 24h covariances
pairs = [
    ('humidity','solarradiation'),
    ('wind_u','wind_v'),
    ('humidity','precip'),
    ('sealevelpressure','windspeed')
]
for a,b in pairs:
    df[f'{a}_{b}_cov24h'] = df[a].rolling(24,1).cov(df[b])

# Wind direction (24h)
mean_u_24 = df['wind_u'].rolling(24,1).mean()
mean_v_24 = df['wind_v'].rolling(24,1).mean()

df['mean_wind_dir_w24h'] = np.degrees(np.arctan2(mean_v_24, mean_u_24)) % 360
df['mean_wind_dir_w24h_rad'] = np.deg2rad(df['mean_wind_dir_w24h'])
df['mean_wind_dir_w24h_sin'] = np.sin(df['mean_wind_dir_w24h_rad'])
df['mean_wind_dir_w24h_cos'] = np.cos(df['mean_wind_dir_w24h_rad'])

# ===============================================================
# 6️⃣ CLEANUP
# ===============================================================
df = df.fillna(0).iloc[max(windows):].reset_index(drop=True)

print("Hourly feature engineering complete (24–336h windows).")
print("Total feature count:", len(df.columns))


Hourly feature engineering complete (24–336h windows).
Total feature count: 76
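
The wind-direction features in the cell above rely on averaging the vector components rather than the angles. The sketch below shows why, using two made-up observations on either side of the 0/360 boundary. `direction_stats` is a hypothetical helper that mirrors the notebook's formulas for the mean direction and the consistency ratio, but works on plain arrays instead of rolling windows and skips the mean-centering of the components.

```python
import numpy as np

def direction_stats(speed, direction_deg):
    """Vector-mean direction (degrees) and direction consistency (0 to ~1)."""
    theta = np.deg2rad(direction_deg)
    u = speed * np.cos(theta)
    v = speed * np.sin(theta)
    mean_dir = np.degrees(np.arctan2(v.mean(), u.mean())) % 360
    # Length of the mean vector relative to the mean speed
    consistency = np.hypot(u.mean(), v.mean()) / (speed.mean() + 1e-3)
    return float(mean_dir), float(consistency)

speed = np.array([2.0, 2.0])

# Two similar directions that straddle the wrap-around
steady = np.array([340.0, 40.0])
naive_mean = float(steady.mean())  # 190.0, pointing the wrong way
steady_dir, steady_cons = direction_stats(speed, steady)

# Two opposite directions cancel each other out
opposed_dir, opposed_cons = direction_stats(speed, np.array([0.0, 180.0]))

print(naive_mean, round(steady_dir, 1), round(steady_cons, 2))
print(round(opposed_cons, 2))
```

The vector mean of 340° and 40° is about 10°, whereas the arithmetic mean is 190°. The consistency ratio is close to 1 for steady winds and close to 0 when directions cancel.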


### **5.2. Exponentially Weighted Moving Average (EWMA) Smoothing**

This section applies Exponentially Weighted Moving Average (EWMA) smoothing to a selection of key features. This is a sophisticated time-series technique used to create smoothed versions of features that capture underlying trends while reducing short-term noise.

*   *Feature Selection*: A specific list of important numerical features (ewma_features) is defined, including humidity, dew, solarradiation, sealevelpressure, windspeed, and precip. These are core meteorological variables that are likely to have meaningful trends over time.

*   **Configuration of Smoothing Levels**: A dictionary named `ewma_configs` defines three smoothing levels, each with a different "memory". The tags (`72h`, `168h`, `336h`) are the nominal labels used in the code. The memory actually implied by each alpha is considerably shorter, because pandas relates the two by `alpha = 2 / (span + 1)`:
    *   **Short Memory (tag `72h`)**: An alpha of 0.10 (equivalent span of about 19 hours) weights recent data most heavily, creating a smoothed feature that closely tracks recent changes.
    *   **Medium Memory (tag `168h`)**: An alpha of 0.05 (equivalent span of about 39 hours) provides a more balanced view.
    *   **Long Memory (tag `336h`)**: An alpha of 0.02 (equivalent span of about 99 hours, roughly 4 days) spreads the weight furthest into the past, capturing the slowest-moving trend of the three.

*   **Application of EWMA**: The code iterates through each selected feature and applies each of the three EWMA configurations to it.
    *   For each combination, it calculates the exponentially weighted moving average using the `.ewm()` method in pandas. With `adjust=False`, the average is computed recursively: each new value is `alpha` times the current observation plus `(1 - alpha)` times the previous smoothed value.
    *   This process generates 18 new features in total (6 original features × 3 smoothing levels). Each new feature is named descriptively, such as `humidity_ewma_72h`, making it clear which feature and which smoothing configuration it represents.

*   **Purpose and Benefit**: Unlike a simple moving average, which weights every point in its window equally, an EWMA gives exponentially decaying weight to older data points. This makes it more responsive to recent changes while still providing a smoothed trend. By creating features with short, medium, and long memories, the model is given a multi-scale view of the trends in the data, which can improve its predictive power. The "Leakage-Safe" comment in the code reflects that each EWMA value depends only on the current and past observations, never on future ones, making it safe for training a forecasting model.
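
The recursion behind `adjust=False` and the memory implied by each alpha can be checked with a short sketch. The toy series is hypothetical. The span and half-life formulas are the standard pandas relations (`alpha = 2 / (span + 1)` and `alpha = 1 - exp(ln(0.5) / halflife)`), not something taken from the notebook itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series standing in for an hourly feature such as humidity.
s = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
alpha = 0.10

# pandas result with the same settings as the notebook.
ewma_pandas = s.ewm(alpha=alpha, adjust=False).mean()

# Manual recursion: y[0] = x[0]; y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)
ewma_manual = pd.Series(manual)

# Equivalent span and half-life (in hours) for each alpha used in ewma_configs.
memory = {}
for a in (0.10, 0.05, 0.02):
    span = 2 / a - 1
    half_life = np.log(0.5) / np.log(1 - a)
    memory[a] = (span, half_life)
    print(f"alpha={a:.2f}: span ~ {span:.0f}h, half-life ~ {half_life:.1f}h")
```

Because the recursion only ever reads the previous smoothed value and the current observation, no future rows enter the calculation.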

In [14]:
# ===============================================================
# 🌡️ EWMA Smoothing for HOURLY Data (Leakage-Safe)
# ===============================================================

ewma_features = [
    'humidity', 'dew', 'solarradiation',
    'sealevelpressure', 'windspeed', 'precip'
]

# Adjusted EWMA α values for hourly resolution
ewma_configs = {
    '72h': 0.10,     # short memory (~3 days)
    '168h': 0.05,    # medium memory (~7 days)
    '336h': 0.02     # long memory (~14 days)
}

for col in ewma_features:
    for tag, alpha in ewma_configs.items():
        df[f'{col}_ewma_{tag}'] = df[col].ewm(alpha=alpha, adjust=False).mean()

print("EWMA (hourly) features added to df.")


EWMA (hourly) features added to df.


In [15]:
df.isnull().sum()

datetime               0
temp                   0
feelslike              0
dew                    0
humidity               0
                      ..
windspeed_ewma_168h    0
windspeed_ewma_336h    0
precip_ewma_72h        0
precip_ewma_168h       0
precip_ewma_336h       0
Length: 94, dtype: int64

## **6. Targets and Split Data**

This crucial section prepares the data for multi-horizon forecasting. Instead of just predicting the next hour, the goal is to predict temperatures at multiple future time steps. This section defines these multiple targets, splits the data chronologically, and organizes the features and targets for training.

*   **1. Define Multi-Horizon Targets**:
    *   **Action**: The code defines a dictionary of `horizons` (t+1, t+3, t+6, t+12, and t+24 hours). It then iterates through this dictionary, creating a new `target` column for each horizon by shifting the `temp` column backward by the corresponding number of hours (`.shift(-h)`).
    *   **Handling Missing Values**: Shifting the data creates `NaN` values at the end of the DataFrame for each new target column. The `.dropna()` command is used to remove any row where *any* of the target horizons are missing. This ensures that every data point used for training has a valid target for all forecast horizons.

*   **2. Chronological Train/Test Split**:
    *   **Action**: The dataset is split into a training set (the first 80% of the data) and a testing set (the final 20%). The `split_idx` is calculated as 80% of the total length of the DataFrame.
    *   **Reasoning**: This is a **chronological split**, which is essential for any time-series task. The model is trained exclusively on older data and then evaluated on newer, unseen data. This simulates a real-world forecasting scenario and prevents "data leakage" (using future information to train).

*   **3. Separate Features and Targets**:
    *   **Feature Columns (`X`)**: The code creates a list of columns to be excluded (`exclude_cols`), which includes the `datetime` column, the original `temp` column, and all the newly created `target_` columns. The remaining columns are considered features and are separated into `X_train_full` and `X_test_full`.
    *   **Target Dictionaries (`y`)**: The target variables for each horizon are organized into dictionaries (`y_train_dict` and `y_test_dict`). This provides a convenient way to access the specific target series for each horizon during the model training loop (e.g., `y_train_dict['t3']` will give the training targets for the 3-hour-ahead forecast).

*   **4. Verify Split Integrity**:
    *   The output from this cell confirms the results of the splitting process. It shows the new shape of the data after creating multiple targets, the exact date ranges for the train and test periods (confirming no overlap), and the total number of features (92) that will be used for model training.
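
The target construction and chronological split described above can be sketched on a toy frame. The 10-row frame and the two-horizon dictionary are hypothetical stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 10 hourly rows, with temp equal to the row number.
toy = pd.DataFrame({
    "datetime": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(10), unit="h"),
    "temp": np.arange(10, dtype=float),
})

horizons = {"t1": 1, "t3": 3}
for name, h in horizons.items():
    # shift(-h) pulls the value from h rows later up to the current row
    toy[f"target_{name}"] = toy["temp"].shift(-h)

# The last max(h) rows have no future value for at least one horizon.
toy = toy.dropna(subset=[f"target_{n}" for n in horizons]).reset_index(drop=True)

# Chronological 80/20 split: no shuffling, train strictly precedes test.
split_idx = int(len(toy) * 0.8)
train, test = toy.iloc[:split_idx], toy.iloc[split_idx:]

print(toy.head(3))
print("train rows:", len(train), "test rows:", len(test))
```

Here the three final rows are dropped because `target_t3` is missing for them, which mirrors how the longest horizon determines how many rows the real dataset loses.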

In [None]:
# =====================================================================
# 1️⃣ DEFINE MULTI-HORIZON TARGETS (t+1, t+3, t+6, t+12, t+24)
# =====================================================================
horizons = {
    "t1": 1,
    "t3": 3,
    "t6": 6,
    "t12": 12,
    "t24": 24
}

for name, h in horizons.items():
    df[f"target_{name}"] = df['temp'].shift(-h)

# Drop rows where ANY horizon is missing
df = df.dropna(
    subset=[f"target_{h}" for h in horizons]
).reset_index(drop=True)

print("✅ Multi-horizon targets created.")
print("Data shape after target creation:", df.shape)


# =====================================================================
# 2️⃣ TRAIN / TEST SPLIT (CHRONOLOGICAL, LEAKAGE-SAFE)
# =====================================================================
split_idx = int(len(df) * 0.8)
train = df.iloc[:split_idx].copy()
test  = df.iloc[split_idx:].copy()

print("Train period:", train['datetime'].min(), "→", train['datetime'].max())
print("Test  period:", test['datetime'].min(), "→", test['datetime'].max())

# Feature columns: remove datetime, temp, all target columns
exclude_cols = ['datetime', 'temp'] + [f"target_{h}" for h in horizons]
feature_cols = [c for c in df.columns if c not in exclude_cols]

X_train_full = train[feature_cols].copy()
X_test_full  = test[feature_cols].copy()

# Target dicts per horizon
y_train_dict = {h: train[f"target_{h}"].copy() for h in horizons}
y_test_dict  = {h: test[f"target_{h}"].copy()  for h in horizons}

print(f"Number of features before selection: {len(feature_cols)}")


✅ Multi-horizon targets created.
Data shape after target creation: (94074, 99)
Train period: 2015-01-15 00:00:00 → 2023-08-16 18:00:00
Test  period: 2023-08-16 19:00:00 → 2025-10-08 17:00:00
Number of features before selection: 92


### **6.1. Defining Baselines and Calculating Their Performance**

This section establishes the performance benchmarks that the machine learning models must exceed. It defines two simple, common-sense baseline models, Persistence and Climatology, and calculates their Root Mean Squared Error (RMSE) for each of the five forecast horizons.

*   **1. Defining Baseline Models**:
    *   **Persistence Model**: This model follows the simple heuristic: "the weather tomorrow will be the same as the weather today." For a forecast horizon of `h` hours, it naively assumes the temperature at `t+h` will be the same as the temperature at `t`. In this implementation, it uses the current hour's temperature (`test['temp']`) as the prediction for all future horizons.
    *   **Climatology Model**: This model predicts the temperature based on the long-term average for that specific day of the year. For example, the prediction for July 4th is the average of all temperatures ever recorded on July 4th in the training data. The `climatology_pred` function implements this by creating a lookup map from the training data and applying it to the test data.

*   **2. Evaluating Baselines for Each Horizon**:
    *   The code iterates through each forecast horizon (`t1`, `t3`, `t6`, `t12`, `t24`).
    *   In each iteration, it calculates the RMSE for both the Persistence and Climatology models against the actual test data for that specific horizon.
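    *   The results are stored in a dictionary and printed. The output shows that **Persistence** is the stronger baseline at the short horizons (t1, t3) and again at t24 (RMSE of about 1.74 versus 2.83 for Climatology), where the temperature 24 hours earlier falls at the same point in the daily cycle. At the intermediate horizons (t6, t12), Persistence compares temperatures from different times of day and degrades to an RMSE of about 3.78 and 4.72, so **Climatology** (about 2.83 at every horizon) becomes the stronger baseline.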
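
Both baselines can be illustrated on toy data. This is a minimal sketch with hypothetical values: it builds the day-of-year lookup in the same spirit as `climatology_pred` and scores a one-step persistence forecast on a short series.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames (all non-leap years, so calendar
# dates map to the same day-of-year number in each of them).
train = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-07-04", "2023-07-04", "2022-12-25", "2023-12-25"]),
    "temp": [30.0, 32.0, 5.0, 7.0],
})
test = pd.DataFrame({
    "datetime": pd.to_datetime(["2025-07-04", "2025-12-25", "2025-03-01"]),
})

# Climatology: mean training temp per day-of-year, global mean as fallback.
doy_mean = train.groupby(train["datetime"].dt.dayofyear)["temp"].mean()
global_mean = train["temp"].mean()
climo_pred = test["datetime"].dt.dayofyear.map(doy_mean).fillna(global_mean).to_numpy()

# Persistence at horizon h: the forecast for t+h is the value observed at t.
series = pd.Series([1.0, 2.0, 4.0, 7.0])
h = 1
y_true = series.shift(-h).dropna().to_numpy()   # values at t+h
y_pers = series.iloc[:len(y_true)].to_numpy()   # values at t
pers_rmse = float(np.sqrt(np.mean((y_true - y_pers) ** 2)))

print("climatology predictions:", climo_pred)
print("persistence RMSE:", pers_rmse)
```

The third test date has no match in the training lookup, so it falls back to the global mean. One caveat of keying on day-of-year: in leap years every date after February is shifted by one, so a calendar date does not always map to the same key.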

In [17]:
# =====================================================================
# 3️⃣ BASELINE MODELS (PERSISTENCE & CLIMATOLOGY)
# =====================================================================

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def climatology_pred(train_frame, test_frame, target_col):
    """Mean temp per day-of-year from TRAIN mapped onto TEST dates."""
    dt_train = pd.to_datetime(train_frame['datetime'])
    dt_test  = pd.to_datetime(test_frame['datetime'])

    train_doy = dt_train.dt.dayofyear
    test_doy  = dt_test.dt.dayofyear

    climo_map = (
        train_frame
        .assign(doy=train_doy)
        .groupby('doy')[target_col]
        .mean()
        .to_dict()
    )
    global_mean = train_frame[target_col].mean()
    return np.array([climo_map.get(d, global_mean) for d in test_doy])


baseline_metrics = {}

for h in horizons:
    print(f"\n--- Baselines for horizon {h} ---")
    y_train_h = y_train_dict[h]
    y_test_h  = y_test_dict[h]

    # Persistence: T(t+h) ≈ T(t)
    # For simplicity we use same-hour temp as naive baseline on TEST:
    yhat_pers_test = test['temp'].values

    # Climatology: mean temp by day-of-year, built on TRAIN
    yhat_climo_test = climatology_pred(
        train.assign(target=y_train_h),
        test.assign(target=y_test_h),
        target_col='target'
    )

    baseline_metrics[h] = {
        "RMSE_persistence": rmse(y_test_h, yhat_pers_test),
        "RMSE_climatology": rmse(y_test_h, yhat_climo_test)
    }
    print("Persistence RMSE:", baseline_metrics[h]["RMSE_persistence"])
    print("Climatology RMSE:", baseline_metrics[h]["RMSE_climatology"])


--- Baselines for horizon t1 ---
Persistence RMSE: 1.1476882910877269
Climatology RMSE: 2.8316713801061533

--- Baselines for horizon t3 ---
Persistence RMSE: 2.4007678814098727
Climatology RMSE: 2.8319237309392706

--- Baselines for horizon t6 ---
Persistence RMSE: 3.7841430346263025
Climatology RMSE: 2.8320755424360335

--- Baselines for horizon t12 ---
Persistence RMSE: 4.723324835536202
Climatology RMSE: 2.831371809922301

--- Baselines for horizon t24 ---
Persistence RMSE: 1.7393551576957222
Climatology RMSE: 2.831654793258915


### **6.2. Unified Feature Selection**

This section defines and executes a sophisticated, multi-stage feature selection pipeline. The goal is to systematically reduce the large number of engineered features (from 92) down to a smaller, more powerful, and diverse subset. This entire process is performed using **only the training data for the t+1 horizon** to create a single, unified set of features that will be used to train models for *all* horizons.

*   **Step 1: Remove Temperature-Leaking Predictors**:
    *   It begins by explicitly removing any features that are direct derivatives or proxies of the target temperature, such as `feelslike`. This is a crucial step to prevent data leakage.

*   **Step 2: Mutual Information (MI) Filtering**:
    *   It calculates the Mutual Information score between each remaining feature and the `t+1` target variable.
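    *   Only the features with the highest MI scores are kept: the top 15%, with a floor of 30 features. With 91 candidate features, 15% is only 13, so the floor applies and 30 features pass this stage. This acts as a first-pass filter to eliminate the least relevant features.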

*   **Step 3: Redundancy Pruning (Remove Highly Correlated Features)**:
    *   A correlation matrix is created for the features that survived the MI filter.
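    *   The code then removes any feature whose absolute correlation with a higher-ranked feature exceeds 0.92. Because the columns are ordered by MI score, the member of each highly correlated pair with the lower MI score is the one dropped. This step eliminates redundant information (multicollinearity).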

*   **Step 4: Lightweight XGBoost for Importance Ranking**:
    *   A temporary, lightweight XGBoost model is trained on the now-reduced feature set. This model's purpose is to generate a reliable ranking of feature importance.

*   **Step 5: Family-Aware Selection**:
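    *   This is custom logic designed to ensure feature diversity. It groups features by their "family", taken as the part of the column name before the first underscore (e.g., 'humidity', 'dew', 'hour'). Families that are not listed in `VALID_PREFIXES` are skipped.
    *   It then iterates through the families in order of mean importance, selecting up to 4 of the top features from each and stopping once 80 or more features have been collected (`max_features=80`).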
    *   Finally, the code takes the resulting list and removes any duplicates while preserving the order.

*   **Execution and Output**:
    *   The `unified_feature_select` function is called using the base training/testing sets and the `t+1` horizon target.
    *   The output confirms that this entire process has successfully reduced the feature count from 92 down to a final, optimized set of **19 features**. These 19 features will now be the sole inputs for all subsequent models.
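
The redundancy-pruning step (Step 3) can be sketched on a toy frame. The column names and synthetic data are hypothetical. The upper-triangle logic follows the same pattern as `unified_feature_select`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Hypothetical features, listed in descending order of relevance.
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),  # nearly collinear with "a"
    "b": rng.normal(size=200),                        # unrelated to "a"
})

corr = toy.corr().abs()
# Keep only entries above the diagonal, so each pair is inspected once and a
# column is only ever compared against the columns that come before it.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.92).any()]

pruned = toy.drop(columns=to_drop)
print("dropped:", to_drop)
print("kept:", list(pruned.columns))
```

Since only the later column of a correlated pair can be flagged, the column order decides which of the two survives.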

In [18]:
# =====================================================================
# 4️⃣ UNIFIED FEATURE SELECTION (BASED ON t+1 ONLY)
# =====================================================================

drop_cols_leak = ['feelslike', 'feelslikemax', 'feelslikemin',
                  'tempmax', 'tempmin']

X_train_base = X_train_full.drop(
    columns=[c for c in drop_cols_leak if c in X_train_full.columns],
    errors='ignore'
)
X_test_base  = X_test_full.drop(
    columns=[c for c in drop_cols_leak if c in X_test_full.columns],
    errors='ignore'
)

print("\nBase feature shape (after dropping temp-like columns):", X_train_base.shape)

y_train_t1 = y_train_dict["t1"]  # use t+1 horizon to shape feature space


from sklearn.feature_selection import mutual_info_regression

def unified_feature_select(X_tr, y_tr, X_te,
                           mi_keep_ratio=0.15,
                           max_corr=0.92,
                           max_features=80):
    """
    1) Mutual Information filter
    2) Correlation pruning
    3) XGB importance + family-based cap
    """
    # 1) MI
    mi = mutual_info_regression(X_tr, y_tr, random_state=42)
    mi_s = pd.Series(mi, index=X_tr.columns).sort_values(ascending=False)

    top_k = max(30, int(mi_keep_ratio * len(mi_s)))
    top_feats = mi_s.head(top_k).index
    X_tr_fs = X_tr[top_feats].copy()
    X_te_fs = X_te[top_feats].copy()

    # 2) Correlation pruning
    corr = X_tr_fs.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if any(upper[col] > max_corr)]

    X_tr_fs = X_tr_fs.drop(columns=to_drop, errors='ignore')
    X_te_fs = X_te_fs.drop(columns=to_drop, errors='ignore')

    # 3) Lightweight XGB for ranking
    xgb_temp = XGBRegressor(
        n_estimators=250,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_lambda=3,
        reg_alpha=0.5,
        min_child_weight=5,
        random_state=42,
        n_jobs=-1,
        eval_metric='rmse'
    )
    xgb_temp.fit(X_tr_fs, y_tr)
    imp = pd.Series(xgb_temp.feature_importances_, index=X_tr_fs.columns) \
          .sort_values(ascending=False)

    def base_prefix(c):
        m = re.match(r'([A-Za-z0-9]+)_', c)
        return m.group(1) if m else c

    VALID_PREFIXES = [
        'humidity', 'dew', 'solarradiation', 'sealevelpressure',
        'windspeed', 'wind', 'wind_u', 'wind_v', 'pressure', 'storminess',
        'radiation', 'cloud', 'hour', 'doy', 'convective', 'soil', 'precip'
    ]

    grp = imp.groupby(imp.index.map(base_prefix)).mean().sort_values(ascending=False)

    selected = []
    for fam in grp.index:
        if fam not in VALID_PREFIXES:
            continue
        fam_feats = imp[imp.index.map(base_prefix) == fam].index
        selected.extend(fam_feats[:4])  # cap per family
        if len(selected) >= max_features:
            break

    selected = list(dict.fromkeys(selected))  # unique while preserving order
    return X_tr_fs[selected].copy(), X_te_fs[selected].copy(), selected


X_train_sel, X_test_sel, selected_features = unified_feature_select(
    X_train_base, y_train_t1, X_test_base
)

print("\n✅ Unified feature selection completed.")
print("Selected feature count:", len(selected_features))


Base feature shape (after dropping temp-like columns): (75259, 91)

✅ Unified feature selection completed.
Selected feature count: 19
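
One way to see how the family grouping behaves is to run the prefix regex from `unified_feature_select` on a few column names that appear earlier in the notebook. This sketch only re-runs the regex; it does not reproduce the selection itself:

```python
import re

def base_prefix(col):
    """Return the text before the first underscore, or the whole name if there is none."""
    m = re.match(r'([A-Za-z0-9]+)_', col)
    return m.group(1) if m else col

examples = [
    "humidity_ewma_72h",
    "wind_u",
    "hour",
    "mean_wind_dir_w24h_sin",
    "humid_radiation_ratio",
]
families = {col: base_prefix(col) for col in examples}
for col, fam in families.items():
    print(f"{col:>24} -> {fam}")
```

If this reading of the regex is right, two things follow. The `wind_u` and `wind_v` entries in `VALID_PREFIXES` can never match, because such columns are grouped under `wind`. And columns such as `mean_wind_dir_w24h_sin` and `humid_radiation_ratio` fall into the families `mean` and `humid`, which are not in the list and would therefore be skipped.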


### **6.3. Preparing for Hyperparameter Tuning**

These cells set up the necessary components for running hyperparameter optimization with the Optuna library.

*   **Time Series Cross-Validation**:
    *   A `TimeSeriesSplit` object (`tscv`) is created with 5 splits. This cross-validation strategy is crucial for time series data because it ensures that the validation sets are always in the future relative to the training sets in each fold, preventing data leakage and mimicking a real-world forecasting scenario.

*   **Custom Scorer for Optuna**:
    *   The Optuna studies in this notebook are created with `direction='maximize'`, and scikit-learn scorers follow a "greater is better" convention. Since the goal here is to *minimize* the Root Mean Squared Error (RMSE), a custom function `rmse_scorer` is created that returns the *negative* RMSE.
    *   Maximizing this negative RMSE is mathematically equivalent to minimizing the actual RMSE. The function is then wrapped using `make_scorer` from scikit-learn (with `greater_is_better=True`, because the sign has already been flipped) so it can be used directly within the cross-validation process.
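
A small sketch of how `TimeSeriesSplit` lays out its folds, and of the sign convention for the scorer. The 12-sample array and the `neg_rmse` helper are hypothetical stand-ins for the real training data and `rmse_scorer`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 time-ordered samples.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = [(tr.tolist(), va.tolist()) for tr, va in tscv.split(X)]
for i, (tr, va) in enumerate(folds):
    print(f"fold {i}: train={tr} validate={va}")

def neg_rmse(y_true, y_pred):
    """Negative RMSE, so that a higher score means a better fit."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return -float(np.sqrt(np.mean(err ** 2)))

good = neg_rmse([1, 2, 3], [1, 2, 3])   # perfect prediction
bad = neg_rmse([1, 2, 3], [1, 2, 5])    # one large error
print("good:", good, "bad:", bad)
```

Each fold trains on an expanding prefix of the series and validates on the block that immediately follows it, so validation indices are always later than training indices.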


In [19]:
# =====================================================================
# 5️⃣ TIME SERIES CV + SCORER FOR OPTUNA
# =====================================================================

tscv = TimeSeriesSplit(n_splits=5)

# We want to MAXIMIZE negative RMSE (i.e., minimize RMSE)
def rmse_scorer(y_true, y_pred):
    return -rmse(y_true, y_pred)

scorer = make_scorer(rmse_scorer, greater_is_better=True)

### **6.4. Optuna Hyperparameter Tuning Loop**

This is the main computational block of the notebook. It iterates through each forecast horizon and uses Optuna to find the best hyperparameters for three different models: Random Forest, XGBoost, and LightGBM.

*   **Outer Loop**: The code begins a `for` loop that iterates through each defined horizon (`t1`, `t3`, etc.).

*   **Inner Tuning Process (for each model within the loop)**:
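    *   **1. Define Objective Function**: An `objective` function (e.g., `objective_rf`) is defined for each model type. Inside this function:
        *   The `trial.suggest_*` methods are used to define a search space for a range of hyperparameters specific to that model (e.g., `n_estimators`, `max_depth`).
        *   The candidate model is scored with `cross_val_score` on the selected training features, using the `tscv` splitter and the negative-RMSE `scorer`. The mean score across the five folds is returned to Optuna.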
    *   **2. Create and Run Study**: An Optuna `study` is created with the direction set to `'maximize'`. The `study.optimize()` function is called, instructing Optuna to run the objective function for a set number of `n_trials` (40 in this case).
    *   **3. Store Best Parameters**: After the study is complete, the best set of hyperparameters found is stored in a dictionary corresponding to the current horizon (e.g., `rf_best_params['t1']`).

*   **Execution Flow and Output**:
    *   The code executes this entire tuning process sequentially for each horizon. For each horizon, it tunes a Random Forest, then an XGBoost, and finally a LightGBM model.
    *   The extensive output logs show the progress of each trial run by Optuna, displaying the parameters tested and the resulting score. At the end of each model's tuning for a specific horizon, it prints the best parameters found.
    *   The final message, "Hyperparameter optimization completed for all horizons," confirms that the entire process has finished successfully.

In [20]:
# =====================================================================
# 6️⃣ OPTUNA HYPERPARAMETER TUNING PER HORIZON (RF, XGB, LGBM)
# =====================================================================

rf_best_params  = {}
xgb_best_params = {}
lgbm_best_params = {}

for h in horizons:   # "t1", "t3", ...
    print(f"\n\n====================== Optimizing Horizon {h} ======================")
    y_train_h = y_train_dict[h]

    # --- RF objective ---
    def objective_rf(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 100, 400, step=50),
            'max_depth': trial.suggest_int('max_depth', 4, 8),
            'min_samples_split': trial.suggest_int('min_samples_split', 5, 20),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 3, 10),
            'max_features': trial.suggest_categorical('max_features', ['sqrt', 0.3, 0.5]),
            'random_state': 42,
            'n_jobs': -1
        }
        model = RandomForestRegressor(**params)
        scores = cross_val_score(
            model, X_train_sel, y_train_h,
            cv=tscv, scoring=scorer, n_jobs=-1
        )
        return np.mean(scores)

    study_rf = optuna.create_study(direction='maximize', sampler=TPESampler(seed=42))
    study_rf.optimize(objective_rf, n_trials=40)
    rf_best_params[h] = study_rf.best_params
    print("RF best params:", study_rf.best_params)

    # --- XGB objective ---
    def objective_xgb(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 200, 600, step=100),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
            'max_depth': trial.suggest_int('max_depth', 3, 8),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'reg_lambda': trial.suggest_float('reg_lambda', 1.0, 10.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 2.0),
            'random_state': 42,
            'n_jobs': -1,
            'objective': 'reg:squarederror'
        }
        model = XGBRegressor(**params)
        scores = cross_val_score(
            model, X_train_sel, y_train_h,
            cv=tscv, scoring=scorer, n_jobs=-1
        )
        return np.mean(scores)

    study_xgb = optuna.create_study(direction='maximize', sampler=TPESampler(seed=42))
    study_xgb.optimize(objective_xgb, n_trials=40)
    xgb_best_params[h] = study_xgb.best_params
    print("XGB best params:", study_xgb.best_params)

    # --- LGBM objective ---
    def objective_lgbm(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 300, 700, step=100),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 20, 80),
            'max_depth': trial.suggest_int('max_depth', -1, 10),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 5.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 2.0),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 40),
            'random_state': 42,
            'n_jobs': -1
        }
        model = LGBMRegressor(**params)
        scores = cross_val_score(
            model, X_train_sel, y_train_h,
            cv=tscv, scoring=scorer, n_jobs=-1
        )
        return np.mean(scores)

    study_lgbm = optuna.create_study(direction='maximize', sampler=TPESampler(seed=42))
    study_lgbm.optimize(objective_lgbm, n_trials=40)
    lgbm_best_params[h] = study_lgbm.best_params
    print("LGBM best params:", study_lgbm.best_params)


print("\n✅ Hyperparameter optimization completed for all horizons.")

[I 2025-11-16 14:40:19,234] A new study created in memory with name: no-name-0b083474-6d98-4842-b79e-b0c9ca269c34






[I 2025-11-16 14:40:23,374] Trial 0 finished with value: -0.9752266675413852 and parameters: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 0 with value: -0.9752266675413852.
[I 2025-11-16 14:40:29,110] Trial 1 finished with value: -1.0088974259515044 and parameters: {'n_estimators': 400, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 0 with value: -0.9752266675413852.
[I 2025-11-16 14:40:32,580] Trial 2 finished with value: -1.0893214885028397 and parameters: {'n_estimators': 150, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 0.5}. Best is trial 0 with value: -0.9752266675413852.
[I 2025-11-16 14:40:34,504] Trial 3 finished with value: -1.134548987673995 and parameters: {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 6, 'max_features': 'sqrt'}. Best is trial 0 with value: -0.97522666754

RF best params: {'n_estimators': 250, 'max_depth': 8, 'min_samples_split': 12, 'min_samples_leaf': 6, 'max_features': 0.5}


[I 2025-11-16 14:43:44,726] Trial 0 finished with value: -0.8961216455768157 and parameters: {'n_estimators': 300, 'learning_rate': 0.08927180304353628, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 1.5227525095137953, 'reg_alpha': 1.7323522915498704}. Best is trial 0 with value: -0.8961216455768157.
[I 2025-11-16 14:43:46,883] Trial 1 finished with value: -0.8809037700606692 and parameters: {'n_estimators': 500, 'learning_rate': 0.051059032093947576, 'max_depth': 3, 'min_child_weight': 7, 'subsample': 0.9329770563201687, 'colsample_bytree': 0.6849356442713105, 'reg_lambda': 2.636424704863906, 'reg_alpha': 0.36680901970686763}. Best is trial 1 with value: -0.8809037700606692.
[I 2025-11-16 14:43:48,785] Trial 2 finished with value: -0.8790714634558819 and parameters: {'n_estimators': 300, 'learning_rate': 0.03347776308515933, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8447411578889518, 'colsample_

XGB best params: {'n_estimators': 400, 'learning_rate': 0.028595600004321653, 'max_depth': 5, 'min_child_weight': 4, 'subsample': 0.8246788363459768, 'colsample_bytree': 0.9018361272678286, 'reg_lambda': 2.9569018598160604, 'reg_alpha': 0.8408872677477731}


[I 2025-11-16 14:45:18,678] Trial 0 finished with value: -0.888134120219957 and parameters: {'n_estimators': 400, 'learning_rate': 0.08927180304353628, 'num_leaves': 64, 'max_depth': 6, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 0.2904180608409973, 'reg_alpha': 1.7323522915498704, 'min_child_samples': 26}. Best is trial 0 with value: -0.888134120219957.
[I 2025-11-16 14:45:22,894] Trial 1 finished with value: -0.8775354142732746 and parameters: {'n_estimators': 600, 'learning_rate': 0.010485387725194618, 'num_leaves': 79, 'max_depth': 8, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'reg_lambda': 0.9170225492671691, 'reg_alpha': 0.6084844859190754, 'min_child_samples': 23}. Best is trial 1 with value: -0.8775354142732746.
[I 2025-11-16 14:45:26,197] Trial 2 finished with value: -0.8784042825619677 and parameters: {'n_estimators': 500, 'learning_rate': 0.019553708662745254, 'num_leaves': 57, 'max_depth': 0, 'subsample

LGBM best params: {'n_estimators': 600, 'learning_rate': 0.025873340793887655, 'num_leaves': 71, 'max_depth': 5, 'subsample': 0.6928056605696709, 'colsample_bytree': 0.8160380080454375, 'reg_lambda': 4.55835567581981, 'reg_alpha': 0.07260962646423674, 'min_child_samples': 26}




[I 2025-11-16 14:46:53,928] Trial 0 finished with value: -1.287196689169192 and parameters: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.287196689169192.
[I 2025-11-16 14:46:58,907] Trial 1 finished with value: -1.3223954516741006 and parameters: {'n_estimators': 400, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.287196689169192.
[I 2025-11-16 14:47:01,312] Trial 2 finished with value: -1.4742739250216725 and parameters: {'n_estimators': 150, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 0.5}. Best is trial 0 with value: -1.287196689169192.
[I 2025-11-16 14:47:02,357] Trial 3 finished with value: -1.4454212480070503 and parameters: {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 6, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.28719668916919

RF best params: {'n_estimators': 100, 'max_depth': 8, 'min_samples_split': 9, 'min_samples_leaf': 8, 'max_features': 0.5}


[I 2025-11-16 14:49:45,980] Trial 0 finished with value: -1.2236506403640193 and parameters: {'n_estimators': 300, 'learning_rate': 0.08927180304353628, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 1.5227525095137953, 'reg_alpha': 1.7323522915498704}. Best is trial 0 with value: -1.2236506403640193.
[I 2025-11-16 14:49:48,242] Trial 1 finished with value: -1.2048124768895907 and parameters: {'n_estimators': 500, 'learning_rate': 0.051059032093947576, 'max_depth': 3, 'min_child_weight': 7, 'subsample': 0.9329770563201687, 'colsample_bytree': 0.6849356442713105, 'reg_lambda': 2.636424704863906, 'reg_alpha': 0.36680901970686763}. Best is trial 1 with value: -1.2048124768895907.
[I 2025-11-16 14:49:50,125] Trial 2 finished with value: -1.1962981769379886 and parameters: {'n_estimators': 300, 'learning_rate': 0.03347776308515933, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8447411578889518, 'colsample_

XGB best params: {'n_estimators': 600, 'learning_rate': 0.01235257875882434, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.6101257854689734, 'colsample_bytree': 0.8783547221751052, 'reg_lambda': 3.6759679188373866, 'reg_alpha': 1.8708175660200725}


[I 2025-11-16 14:52:25,113] Trial 0 finished with value: -1.218089711399104 and parameters: {'n_estimators': 400, 'learning_rate': 0.08927180304353628, 'num_leaves': 64, 'max_depth': 6, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 0.2904180608409973, 'reg_alpha': 1.7323522915498704, 'min_child_samples': 26}. Best is trial 0 with value: -1.218089711399104.
[I 2025-11-16 14:52:30,380] Trial 1 finished with value: -1.194355820950572 and parameters: {'n_estimators': 600, 'learning_rate': 0.010485387725194618, 'num_leaves': 79, 'max_depth': 8, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'reg_lambda': 0.9170225492671691, 'reg_alpha': 0.6084844859190754, 'min_child_samples': 23}. Best is trial 1 with value: -1.194355820950572.
[I 2025-11-16 14:52:33,829] Trial 2 finished with value: -1.198080622764685 and parameters: {'n_estimators': 500, 'learning_rate': 0.019553708662745254, 'num_leaves': 57, 'max_depth': 0, 'subsample': 

LGBM best params: {'n_estimators': 600, 'learning_rate': 0.010485387725194618, 'num_leaves': 79, 'max_depth': 8, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'reg_lambda': 0.9170225492671691, 'reg_alpha': 0.6084844859190754, 'min_child_samples': 23}




[I 2025-11-16 14:54:28,313] Trial 0 finished with value: -1.4391621557200485 and parameters: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.4391621557200485.
[I 2025-11-16 14:54:33,232] Trial 1 finished with value: -1.4729799935845889 and parameters: {'n_estimators': 400, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.4391621557200485.
[I 2025-11-16 14:54:35,703] Trial 2 finished with value: -1.6032496448584985 and parameters: {'n_estimators': 150, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 0.5}. Best is trial 0 with value: -1.4391621557200485.
[I 2025-11-16 14:54:36,808] Trial 3 finished with value: -1.5922892471819372 and parameters: {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 6, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.4391621557

RF best params: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 13, 'min_samples_leaf': 7, 'max_features': 0.5}


[I 2025-11-16 14:57:40,905] Trial 0 finished with value: -1.4012040448938676 and parameters: {'n_estimators': 300, 'learning_rate': 0.08927180304353628, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 1.5227525095137953, 'reg_alpha': 1.7323522915498704}. Best is trial 0 with value: -1.4012040448938676.
[I 2025-11-16 14:57:43,205] Trial 1 finished with value: -1.3794116479442269 and parameters: {'n_estimators': 500, 'learning_rate': 0.051059032093947576, 'max_depth': 3, 'min_child_weight': 7, 'subsample': 0.9329770563201687, 'colsample_bytree': 0.6849356442713105, 'reg_lambda': 2.636424704863906, 'reg_alpha': 0.36680901970686763}. Best is trial 1 with value: -1.3794116479442269.
[I 2025-11-16 14:57:45,152] Trial 2 finished with value: -1.3690330107098736 and parameters: {'n_estimators': 300, 'learning_rate': 0.03347776308515933, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8447411578889518, 'colsample_

XGB best params: {'n_estimators': 400, 'learning_rate': 0.014498674093474198, 'max_depth': 7, 'min_child_weight': 3, 'subsample': 0.8604856450699415, 'colsample_bytree': 0.7357794629123185, 'reg_lambda': 2.9054324754273293, 'reg_alpha': 0.09383215734480531}


[I 2025-11-16 15:00:16,807] Trial 0 finished with value: -1.4007985659496556 and parameters: {'n_estimators': 400, 'learning_rate': 0.08927180304353628, 'num_leaves': 64, 'max_depth': 6, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 0.2904180608409973, 'reg_alpha': 1.7323522915498704, 'min_child_samples': 26}. Best is trial 0 with value: -1.4007985659496556.
[I 2025-11-16 15:00:22,661] Trial 1 finished with value: -1.3619616206401064 and parameters: {'n_estimators': 600, 'learning_rate': 0.010485387725194618, 'num_leaves': 79, 'max_depth': 8, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'reg_lambda': 0.9170225492671691, 'reg_alpha': 0.6084844859190754, 'min_child_samples': 23}. Best is trial 1 with value: -1.3619616206401064.
[I 2025-11-16 15:00:26,967] Trial 2 finished with value: -1.3740635803356303 and parameters: {'n_estimators': 500, 'learning_rate': 0.019553708662745254, 'num_leaves': 57, 'max_depth': 0, 'subsamp

LGBM best params: {'n_estimators': 500, 'learning_rate': 0.011724413210932693, 'num_leaves': 78, 'max_depth': 7, 'subsample': 0.827869997122202, 'colsample_bytree': 0.8578499492295439, 'reg_lambda': 4.9437795100783175, 'reg_alpha': 0.6890650670637457, 'min_child_samples': 19}




[I 2025-11-16 15:02:30,047] Trial 0 finished with value: -1.509565841079802 and parameters: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.509565841079802.
[I 2025-11-16 15:02:35,599] Trial 1 finished with value: -1.5394713689776123 and parameters: {'n_estimators': 400, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.509565841079802.
[I 2025-11-16 15:02:38,070] Trial 2 finished with value: -1.6462519669413223 and parameters: {'n_estimators': 150, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 0.5}. Best is trial 0 with value: -1.509565841079802.
[I 2025-11-16 15:02:39,213] Trial 3 finished with value: -1.639447536216198 and parameters: {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 6, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.509565841079802

RF best params: {'n_estimators': 100, 'max_depth': 8, 'min_samples_split': 15, 'min_samples_leaf': 10, 'max_features': 0.5}


[I 2025-11-16 15:05:04,228] Trial 0 finished with value: -1.5092229028374586 and parameters: {'n_estimators': 300, 'learning_rate': 0.08927180304353628, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 1.5227525095137953, 'reg_alpha': 1.7323522915498704}. Best is trial 0 with value: -1.5092229028374586.
[I 2025-11-16 15:05:06,577] Trial 1 finished with value: -1.4740451029504689 and parameters: {'n_estimators': 500, 'learning_rate': 0.051059032093947576, 'max_depth': 3, 'min_child_weight': 7, 'subsample': 0.9329770563201687, 'colsample_bytree': 0.6849356442713105, 'reg_lambda': 2.636424704863906, 'reg_alpha': 0.36680901970686763}. Best is trial 1 with value: -1.4740451029504689.
[I 2025-11-16 15:05:08,574] Trial 2 finished with value: -1.4607878291680314 and parameters: {'n_estimators': 300, 'learning_rate': 0.03347776308515933, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8447411578889518, 'colsample_

XGB best params: {'n_estimators': 600, 'learning_rate': 0.010804032574636408, 'max_depth': 6, 'min_child_weight': 1, 'subsample': 0.8604063212686379, 'colsample_bytree': 0.7747702774708582, 'reg_lambda': 9.99426194777396, 'reg_alpha': 1.3917249958948192}


[I 2025-11-16 15:07:24,001] Trial 0 finished with value: -1.5161252973781547 and parameters: {'n_estimators': 400, 'learning_rate': 0.08927180304353628, 'num_leaves': 64, 'max_depth': 6, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 0.2904180608409973, 'reg_alpha': 1.7323522915498704, 'min_child_samples': 26}. Best is trial 0 with value: -1.5161252973781547.
[I 2025-11-16 15:07:28,494] Trial 1 finished with value: -1.4580999858269337 and parameters: {'n_estimators': 600, 'learning_rate': 0.010485387725194618, 'num_leaves': 79, 'max_depth': 8, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'reg_lambda': 0.9170225492671691, 'reg_alpha': 0.6084844859190754, 'min_child_samples': 23}. Best is trial 1 with value: -1.4580999858269337.
[I 2025-11-16 15:07:31,897] Trial 2 finished with value: -1.4686308009376015 and parameters: {'n_estimators': 500, 'learning_rate': 0.019553708662745254, 'num_leaves': 57, 'max_depth': 0, 'subsamp

LGBM best params: {'n_estimators': 400, 'learning_rate': 0.010045634696149257, 'num_leaves': 55, 'max_depth': 9, 'subsample': 0.8331823551676786, 'colsample_bytree': 0.7150741289461808, 'reg_lambda': 2.390310897251647, 'reg_alpha': 0.627718597762933, 'min_child_samples': 19}




[I 2025-11-16 15:09:02,139] Trial 0 finished with value: -1.4776744822083565 and parameters: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.4776744822083565.
[I 2025-11-16 15:09:06,900] Trial 1 finished with value: -1.4916903531715444 and parameters: {'n_estimators': 400, 'max_depth': 7, 'min_samples_split': 16, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.4776744822083565.
[I 2025-11-16 15:09:09,258] Trial 2 finished with value: -1.5565653560087835 and parameters: {'n_estimators': 150, 'max_depth': 4, 'min_samples_split': 9, 'min_samples_leaf': 7, 'max_features': 0.5}. Best is trial 0 with value: -1.4776744822083565.
[I 2025-11-16 15:09:10,323] Trial 3 finished with value: -1.5444689081654048 and parameters: {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'min_samples_leaf': 6, 'max_features': 'sqrt'}. Best is trial 0 with value: -1.4776744822

RF best params: {'n_estimators': 200, 'max_depth': 8, 'min_samples_split': 16, 'min_samples_leaf': 7, 'max_features': 0.5}


[I 2025-11-16 15:11:49,844] Trial 0 finished with value: -1.537815492997082 and parameters: {'n_estimators': 300, 'learning_rate': 0.08927180304353628, 'max_depth': 7, 'min_child_weight': 5, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 1.5227525095137953, 'reg_alpha': 1.7323522915498704}. Best is trial 0 with value: -1.537815492997082.
[I 2025-11-16 15:11:51,966] Trial 1 finished with value: -1.4929727378352997 and parameters: {'n_estimators': 500, 'learning_rate': 0.051059032093947576, 'max_depth': 3, 'min_child_weight': 7, 'subsample': 0.9329770563201687, 'colsample_bytree': 0.6849356442713105, 'reg_lambda': 2.636424704863906, 'reg_alpha': 0.36680901970686763}. Best is trial 1 with value: -1.4929727378352997.
[I 2025-11-16 15:11:53,786] Trial 2 finished with value: -1.4781217119302914 and parameters: {'n_estimators': 300, 'learning_rate': 0.03347776308515933, 'max_depth': 5, 'min_child_weight': 3, 'subsample': 0.8447411578889518, 'colsample_by

XGB best params: {'n_estimators': 500, 'learning_rate': 0.01142780636664853, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 0.6321144475326688, 'colsample_bytree': 0.9257869962052698, 'reg_lambda': 9.974172429699522, 'reg_alpha': 1.2381368568217286}


[I 2025-11-16 15:13:58,232] Trial 0 finished with value: -1.5402587974570525 and parameters: {'n_estimators': 400, 'learning_rate': 0.08927180304353628, 'num_leaves': 64, 'max_depth': 6, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'reg_lambda': 0.2904180608409973, 'reg_alpha': 1.7323522915498704, 'min_child_samples': 26}. Best is trial 0 with value: -1.5402587974570525.
[I 2025-11-16 15:14:03,397] Trial 1 finished with value: -1.479955474219714 and parameters: {'n_estimators': 600, 'learning_rate': 0.010485387725194618, 'num_leaves': 79, 'max_depth': 8, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'reg_lambda': 0.9170225492671691, 'reg_alpha': 0.6084844859190754, 'min_child_samples': 23}. Best is trial 1 with value: -1.479955474219714.
[I 2025-11-16 15:14:07,358] Trial 2 finished with value: -1.4927997313744794 and parameters: {'n_estimators': 500, 'learning_rate': 0.019553708662745254, 'num_leaves': 57, 'max_depth': 0, 'subsample

LGBM best params: {'n_estimators': 400, 'learning_rate': 0.011223305056420567, 'num_leaves': 76, 'max_depth': 5, 'subsample': 0.8865438995340241, 'colsample_bytree': 0.7361862130020156, 'reg_lambda': 2.7713810787377406, 'reg_alpha': 0.5234777939577208, 'min_child_samples': 15}

✅ Hyperparameter optimization completed for all horizons.


In [21]:
# =====================================================================
# 7️⃣ TRAIN FINAL MODELS PER HORIZON + EVALUATE
# =====================================================================

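def evaluate_metrics(y_true, y_pred, baseline_rmse_pers, baseline_rmse_climo):
    """Return RMSE, MAE, R2 and RMSE-based skill scores against two baselines.

    Skill = 1 - RMSE_model / RMSE_baseline: positive means the model beats
    the baseline, 0 means parity, negative means it is worse.
    """
    res = {}
    res["RMSE"] = rmse(y_true, y_pred)
    res["MAE"]  = mean_absolute_error(y_true, y_pred)
    res["R2"]   = r2_score(y_true, y_pred)
    res["Skill_vs_Persistence"] = 1 - (res["RMSE"] / baseline_rmse_pers)
    res["Skill_vs_Climatology"] = 1 - (res["RMSE"] / baseline_rmse_climo)
    return res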

final_models = {}
results = []

for h in horizons:
    print(f"\n================ FINAL TRAINING FOR HORIZON {h} ================")
    y_train_h = y_train_dict[h]
    y_test_h  = y_test_dict[h]

    # choose model type by horizon (you can adjust this heuristic)
    if h == "t1":
        model = XGBRegressor(**xgb_best_params[h])
        model_type = "XGB"
    elif h in ["t3", "t6"]:
        model = LGBMRegressor(**lgbm_best_params[h])
        model_type = "LGBM"
    else:  # t12, t24
        model = RandomForestRegressor(**rf_best_params[h])
        model_type = "RF"

    model.fit(X_train_sel, y_train_h)
    y_pred_test = model.predict(X_test_sel)

    final_models[h] = model

    base = baseline_metrics[h]
    met = evaluate_metrics(
        y_test_h, y_pred_test,
        baseline_rmse_pers=base["RMSE_persistence"],
        baseline_rmse_climo=base["RMSE_climatology"]
    )

    res_row = {
        "Horizon": h,
        "Model": model_type,
        **met
    }
    results.append(res_row)

results_df = pd.DataFrame(results)
print("\n✅ Evaluation summary (test set):")
display(results_df.round(4))



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002397 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3729
[LightGBM] [Info] Number of data points in the train set: 75259, number of used features: 19
[LightGBM] [Info] Start training from score 28.377390

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002235 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3729
[LightGBM] [Info] Number of data points in the train set: 75259, number of used features: 19
[LightGBM] [Info] Start training from score 28.377724



✅ Evaluation summary (test set):


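|   | Horizon | Model | RMSE | MAE | R2 | Skill_vs_Persistence | Skill_vs_Climatology |
|---|---------|-------|------|-----|----|----------------------|----------------------|
| 0 | t1  | XGB  | 0.8613 | 0.5984 | 0.9142 | 0.2495 | 0.6958 |
| 1 | t3  | LGBM | 1.1907 | 0.8556 | 0.8361 | 0.504  | 0.5796 |
| 2 | t6  | LGBM | 1.3565 | 1.0118 | 0.7872 | 0.6415 | 0.521  |
| 3 | t12 | RF   | 1.5134 | 1.1718 | 0.7353 | 0.6796 | 0.4655 |
| 4 | t24 | RF   | 1.4733 | 1.118  | 0.7491 | 0.1529 | 0.4797 |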
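
The skill columns in the evaluation summary are easier to interpret next to the baseline errors they were computed from. The sketch below restates the skill formula used in `evaluate_metrics` and inverts it to recover the baseline RMSE implied by each row. The helper names `skill_score` and `implied_baseline_rmse` are hypothetical and not part of the notebook. The inputs are the rounded values from the summary, so the recovered baselines are approximate and may differ slightly from the entries in `baseline_metrics`.

```python
def skill_score(model_rmse: float, baseline_rmse: float) -> float:
    """RMSE-based skill: 1 - RMSE_model / RMSE_baseline (same formula as evaluate_metrics)."""
    if baseline_rmse <= 0:
        raise ValueError("baseline_rmse must be positive")
    return 1.0 - model_rmse / baseline_rmse


def implied_baseline_rmse(model_rmse: float, skill: float) -> float:
    """Invert the skill formula: RMSE_baseline = RMSE_model / (1 - skill)."""
    if skill >= 1.0:
        raise ValueError("skill must be below 1 to recover a finite baseline")
    return model_rmse / (1.0 - skill)


# Rounded values copied from the evaluation summary:
# horizon -> (model RMSE, skill vs persistence, skill vs climatology)
summary = {
    "t1":  (0.8613, 0.2495, 0.6958),
    "t3":  (1.1907, 0.5040, 0.5796),
    "t6":  (1.3565, 0.6415, 0.5210),
    "t12": (1.5134, 0.6796, 0.4655),
    "t24": (1.4733, 0.1529, 0.4797),
}

for horizon, (model_rmse, skill_pers, skill_climo) in summary.items():
    pers_rmse = implied_baseline_rmse(model_rmse, skill_pers)
    climo_rmse = implied_baseline_rmse(model_rmse, skill_climo)
    print(
        f"{horizon:>3}: model RMSE={model_rmse:.4f}  "
        f"implied persistence RMSE={pers_rmse:.4f}  "
        f"implied climatology RMSE={climo_rmse:.4f}"
    )
```

For `t1`, for instance, a model RMSE of 0.8613 with a persistence skill of 0.2495 implies a persistence RMSE of roughly 1.15. Reading the skill scores this way shows whether a low score comes from a weak model or from a baseline that is already strong at that horizon.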
