# 1. Overview & Objectives

This notebook implements and evaluates a broad range of **classical machine learning regression models**
for **univariate daily weather forecasting** using Meteostat data.

This notebook covers the following model groups:
- Tree-based models
- Linear and robust regression models
- Distance-based regressors
- Kernel-based methods
- Neural network regressors (non-sequential)

All models are trained using a **time-aware feature engineering approach**
based on lagged values of the target variable (`tavg`).

### Implemented model families
- Linear models: Linear, Ridge, Huber, Tweedie
- Distance-based: KNN, Radius Neighbors
- Tree-based: Decision Tree, Random Forest, XGBoost Random Forest
- Boosting: HistGradientBoosting, LightGBM, XGBoost
- Neural: MLP Regressor
- Kernel: Support Vector Regression

### Outputs
- CSV files with evaluation metrics for validation & test splits
- Stored best model configurations for later visualization

# 2. Imports & Setup

In [None]:
#Importing the helper notebooks

## Enable imports from .ipynb files
import import_ipynb  
import sys
sys.path.append("code")

## Importing the helper notebooks as modules
from splitting import split_time_series
from metrics import evaluate_and_save, load_best_models

# Notebook specific imports
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

from mlforecast import MLForecast
from mlforecast.target_transforms import Differences

from sklearn.linear_model import LinearRegression, Ridge, HuberRegressor, TweedieRegressor
from sklearn import svm
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor, XGBRFRegressor

# 3. Load Data & Train/Val/Test Split
The helper `split_time_series()` returns the train, validation, and test splits as a dictionary of DataFrames.

In [2]:
splits = split_time_series()

train_df = splits["train"]
val_df   = splits["val"]
test_df  = splits["test"]

TARGET_COL = "tavg"

# 4. Model Definition

#### Model families and names
- **Linear models**: LinearRegression, HuberRegression, RidgeRegression, TweedieRegression
- **Tree / Ensemble models**:
  - RandomForest (`RandomForest_ne{n_estimators}_md{max_depth}`)
  - LightGBM (`LightGBM_nl{num_leaves}_lr{learning_rate}_ne{n_estimators}`)
  - XGBoost (`XGBoost_md{max_depth}_lr{learning_rate}_ne{n_estimators}`)
  - XGBRF, DecisionTree, HistGradientBoosting
- **Kernel / distance-based models**:
  - SVR (`SVR_C{C}_g{gamma}_e{epsilon}`)
  - KNN (`KNN_k{n_neighbors}_w{weights}`)
  - RadiusNeighbors
- **Neural network**:
  - MLP (`MLP_h{hidden_units}_mi{max_iter}`)

#### Hyperparameters (searched values)
- **RandomForest**: `n_estimators ∈ {50,100,200,300,400,500}`, `max_depth ∈ {5,10,20,30}`
- **LightGBM**: `num_leaves ∈ {31,50,73}`, `learning_rate ∈ {0.05,0.1,0.15}`, `n_estimators ∈ {50,100,200,300}`
- **XGBoost**: `max_depth ∈ {3,5,9}`, `learning_rate ∈ {0.05,0.1}`, `n_estimators ∈ {100,200}`
- **SVR**: `C ∈ {1,10,20,30}`, `gamma ∈ {scale, 0.1}`, `epsilon ∈ {0.1,0.2,0.3}`
- **KNN**: `n_neighbors ∈ {3,4,5,6,10}`, `weights ∈ {uniform,distance}`
- **MLP**: `hidden_layer_sizes ∈ {(10,),(20,),(50,),(100,)}`, `max_iter ∈ {500,1000,1500}`
- **Fixed settings**: `random_state = 42` where applicable
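
Each grid above expands into one named configuration per combination of searched values. The following is a minimal sketch of that expansion for the RandomForest grid, using `itertools.product` in place of the nested loops in the cells below. `expand_grid` is a hypothetical helper introduced only for illustration; it is not part of the notebook.

```python
from itertools import product

def expand_grid(prefix, grid, fmt):
    """Return {config_name: params} for every combination in `grid` (hypothetical helper)."""
    keys = list(grid)
    configs = {}
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        configs[f"{prefix}_{fmt.format(**params)}"] = params
    return configs

rf_grid = {
    "n_estimators": [50, 100, 200, 300, 400, 500],
    "max_depth": [5, 10, 20, 30],
}
rf_configs = expand_grid("RandomForest", rf_grid, "ne{n_estimators}_md{max_depth}")
print(len(rf_configs))  # 6 * 4 = 24 configurations
```

The same pattern would give 36 LightGBM, 12 XGBoost, 24 SVR, 10 KNN, and 12 MLP configurations for the grids listed above.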

**Why `mlforecast`?**
Most scikit-learn regressors are **time-agnostic**: they treat every row as an independent sample. `mlforecast` turns a time series into a **supervised learning dataset of lag features**, so standard regressors can model temporal dependencies.
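
To make the lag-feature idea concrete, here is a minimal NumPy sketch of how a series can be turned into a feature matrix and a target vector. It illustrates the concept only and is not `mlforecast`'s implementation; `make_lag_matrix` is a hypothetical name.

```python
import numpy as np

def make_lag_matrix(y, lags):
    """Build (X, target) where column j of X holds y shifted back by lags[j]."""
    y = np.asarray(y, dtype=float)
    max_lag = max(lags)
    # Row i describes time t = max_lag + i; earlier times lack a full history
    # and are dropped.
    X = np.column_stack([y[max_lag - lag : len(y) - lag] for lag in lags])
    target = y[max_lag:]
    return X, target

y = np.arange(10.0)                      # toy series 0, 1, ..., 9
X, target = make_lag_matrix(y, lags=[1, 2, 3])
print(X[0], target[0])                   # [2. 1. 0.] 3.0
```

Any regressor with a `fit(X, y)` interface can then be trained on `X` and `target`.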

In [3]:
MODEL_PARAM_REGISTRY = {}

In [4]:
lin_models = {
    "LinearRegression": LinearRegression(),
    "HuberRegression": HuberRegressor(),
    "RidgeRegression": Ridge(),
    "TweedieRegression": TweedieRegressor(),
}

In [5]:
for name, model in lin_models.items():
    MODEL_PARAM_REGISTRY[name] = {
        "model_class": model.__class__,
        "params": model.get_params()
    }

In [6]:
rf_models = {}

for n_estimators in [50, 100, 200, 300, 400, 500]:
    for max_depth in [5, 10, 20, 30]:
        name = f"RandomForest_ne{n_estimators}_md{max_depth}"

        rf_models[name] = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=42
        )

        MODEL_PARAM_REGISTRY[name] = {
            "model_class": RandomForestRegressor,
            "params": {
                "n_estimators": n_estimators,
                "max_depth": max_depth,
                "random_state": 42
            }
        }

In [7]:
lgbm_models = {}

for num_leaves in [31, 50, 73]:
    for lr in [0.05, 0.1, 0.15]:
        for n_estimators in [50, 100, 200, 300]:
            name = f"LightGBM_nl{num_leaves}_lr{lr}_ne{n_estimators}"

            lgbm_models[name] = LGBMRegressor(
                num_leaves=num_leaves,
                learning_rate=lr,
                n_estimators=n_estimators,
                random_state=42
            )

            MODEL_PARAM_REGISTRY[name] = {
                "model_class": LGBMRegressor,
                "params": {
                    "num_leaves": num_leaves,
                    "learning_rate": lr,
                    "n_estimators": n_estimators,
                    "random_state": 42
                }
            }

In [8]:
xgb_models = {}

for max_depth in [3, 5, 9]:
    for lr in [0.05, 0.1]:
        for n_estimators in [100, 200]:
            name = f"XGBoost_md{max_depth}_lr{lr}_ne{n_estimators}"

            xgb_models[name] = XGBRegressor(
                max_depth=max_depth,
                learning_rate=lr,
                n_estimators=n_estimators,
                random_state=42
            )

            MODEL_PARAM_REGISTRY[name] = {
                "model_class": XGBRegressor,
                "params": {
                    "max_depth": max_depth,
                    "learning_rate": lr,
                    "n_estimators": n_estimators,
                    "random_state": 42
                }
            }

In [9]:
svr_models = {}

for C in [1, 10, 20, 30]:
    for gamma in ["scale", 0.1]:
        for epsilon in [0.1, 0.2, 0.3]:
            name = f"SVR_C{C}_g{gamma}_e{epsilon}"

            svr_models[name] = svm.SVR(
                C=C,
                gamma=gamma,
                epsilon=epsilon
            )

            MODEL_PARAM_REGISTRY[name] = {
                "model_class": svm.SVR,
                "params": {
                    "C": C,
                    "gamma": gamma,
                    "epsilon": epsilon
                }
            }

In [10]:
knn_models = {}

for n_neighbors in [3, 4, 5, 6, 10]:
    for weights in ["uniform", "distance"]:
        name = f"KNN_k{n_neighbors}_w{weights}"

        knn_models[name] = KNeighborsRegressor(
            n_neighbors=n_neighbors,
            weights=weights
        )

        MODEL_PARAM_REGISTRY[name] = {
            "model_class": KNeighborsRegressor,
            "params": {
                "n_neighbors": n_neighbors,
                "weights": weights
            }
        }

In [11]:
mlp_models = {}

for hls in [(10,), (20,), (50,), (100,)]:
    for max_iter in [500, 1000, 1500]:
        name = f"MLP_h{hls[0]}_mi{max_iter}"

        mlp_models[name] = MLPRegressor(
            hidden_layer_sizes=hls,
            max_iter=max_iter,
            random_state=42
        )

        MODEL_PARAM_REGISTRY[name] = {
            "model_class": MLPRegressor,
            "params": {
                "hidden_layer_sizes": hls,
                "max_iter": max_iter,
                "random_state": 42
            }
        }

In [12]:
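#Despite its name, `tree_models` ends up holding every non-linear model
#(trees, boosting, SVR, KNN, MLP); it is the model set for `tree_fcst` below
tree_models = {
    "RadiusNeighbors": RadiusNeighborsRegressor(radius=1e5),
    "DecisionTree": DecisionTreeRegressor(criterion="squared_error"),
    "HistGradientBoosting": HistGradientBoostingRegressor(loss="absolute_error"),
    "XGBRF": XGBRFRegressor(random_state=42),
}

#Register the four single-configuration models so a later lookup in
#MODEL_PARAM_REGISTRY cannot fail with a KeyError
for name, model in tree_models.items():
    MODEL_PARAM_REGISTRY[name] = {
        "model_class": model.__class__,
        "params": model.get_params()
    }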

tree_models.update(rf_models)
tree_models.update(lgbm_models)
tree_models.update(xgb_models)
tree_models.update(svr_models)
tree_models.update(knn_models)
tree_models.update(mlp_models)

### Feature Engineering


#### Feature Scaling for distance/kernel models

In [13]:
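#NOTE: `scaler` and `scale_models` are only defined here; the MLForecast objects
#below are built from the unscaled estimators, so no scaling takes place unless
#these are wired in explicitly (e.g. by wrapping the listed models in a Pipeline)
scaler = StandardScaler()
scale_models = ["SVR", "KNN", "RadiusNeighbors", "MLP"] #name prefixes of scale-sensitive models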
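
Distance- and kernel-based models are sensitive to feature scale, so features are usually standardized with statistics estimated on the training data only. The sketch below shows that step in plain NumPy; the function names are hypothetical, and how the notebook applies `scaler` is not shown in the cell above. In scikit-learn the same effect is commonly obtained with `make_pipeline(StandardScaler(), estimator)`.

```python
import numpy as np

def fit_standardizer(X_train):
    """Estimate per-feature mean and standard deviation on training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # population std (ddof=0), as StandardScaler uses
    std[std == 0.0] = 1.0       # leave constant features unscaled
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any data (train, validation, test) with the training statistics."""
    return (X - mean) / std

X_train = np.array([[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
X_new = np.array([[2.0, 300.0]])

mean, std = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, std)
Z_new = apply_standardizer(X_new, mean, std)   # reuses training statistics
```

Reusing the training statistics for later data keeps information from the validation and test periods out of the fitted model.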

# 5. Training  
For each model:  
- Fit on training data  
- Lag structure:
  - short-term memory: `t-1` to `t-29`
  - seasonal memory: `t-60`, `t-91`, `t-182`, `t-365`
  - captures: daily autocorrelation, weekly/monthly effects, annual seasonality

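**Additional lag transforms** (linear models only):
- Rolling statistics (in `mlforecast`, the dictionary key is the lag at which the window ends):
  - 7-day mean and std of the series lagged by 7 days
  - 30-day mean of the series lagged by 30 days
- Purpose:
  - encode local trends and volatility
  - improve linear model expressiveness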
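
A rolling statistic attached to a lag is computed over a window that ends `lag` steps before the time being predicted, so it never sees the current value. The following NumPy sketch illustrates this under that reading of `lag_transforms`; `rolling_stat_at_lag` is a hypothetical helper, not an `mlforecast` function.

```python
import numpy as np

def rolling_stat_at_lag(y, lag, window, stat=np.mean):
    """Apply `stat` to the `window` values that end `lag` steps before each time t."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)
    # The first usable t needs `lag` steps of offset plus a full window.
    for t in range(lag + window - 1, len(y)):
        end = t - lag + 1              # exclusive end index of the window
        out[t] = stat(y[end - window:end])
    return out

y = np.arange(1.0, 21.0)               # toy series 1, 2, ..., 20
feat = rolling_stat_at_lag(y, lag=7, window=7)
print(feat[13])                        # mean of y[0:7] = mean(1..7) = 4.0
```

Passing a different `stat`, for example a standard deviation function, gives the volatility feature in the same way.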

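**Stationarity Handling**
- First-order differencing (`Differences([1])`, i.e. `y_t - y_{t-1}`) is applied only to the linear models
- Reason:
  - daily temperature has a strong seasonal cycle, so its mean is not constant over time
  - a linear model on raw lags has to fit that moving level directly; differencing removes it and leaves the day-to-day changes
- The remaining models are trained on the undifferenced target, on the assumption that they can absorb the seasonal level through the lag features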
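
First-order differencing and its inverse can be sketched in a few lines of NumPy. The numbers are made up for illustration, and `pred_diffs` stands in for whatever a model would predict on the differenced scale.

```python
import numpy as np

y = np.array([10.0, 12.0, 11.0, 15.0, 14.0])   # toy temperature levels

# Forward transform: day-to-day changes y_t - y_{t-1}
dy = np.diff(y)                                 # [ 2., -1.,  4., -1.]

# Hypothetical model output on the differenced scale for the next 3 days
pred_diffs = np.array([0.5, -0.5, 1.0])

# Inverse transform: accumulate changes starting from the last observed level
pred_levels = y[-1] + np.cumsum(pred_diffs)     # [14.5, 14. , 15. ]
```

Because the inverse is a cumulative sum, an error in one predicted change carries over to every later level in the forecast.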

In [14]:
#Lags
LAGS = list(range(1, 30)) + [60, 91, 182, 365]

In [15]:
from mlforecast.lag_transforms import RollingMean, RollingStd

In [16]:
#MLForecast setup
lin_fcst = MLForecast(
    models=lin_models,
    freq="D",
    lags=LAGS,
    lag_transforms={
        7: [RollingMean(7), RollingStd(7)],
        30: [RollingMean(30)],
    },
    target_transforms=[Differences([1])]
)

tree_fcst = MLForecast(
    models=tree_models,
    freq="D",
    lags=LAGS,
    target_transforms=None
)

In [None]:
#Temporarily add a column because MLForecast requires a unique_id column for panel data
train_temp = train_df.copy()
train_temp["unique_id"] = "station_1"

In [18]:
#Fit models
lin_fcst.fit(
    train_temp,
    id_col="unique_id",
    time_col="time",
    target_col=TARGET_COL,
)

tree_fcst.fit(
    train_temp,
    id_col="unique_id",
    time_col="time",
    target_col=TARGET_COL,
)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002195 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8415
[LightGBM] [Info] Number of data points in the train set: 12784, number of used features: 33
[LightGBM] [Info] Start training from score 6.950070

MLForecast(models=[RadiusNeighbors, DecisionTree, HistGradientBoosting, XGBRF, RandomForest_ne50_md5, RandomForest_ne50_md10, RandomForest_ne50_md20, RandomForest_ne50_md30, RandomForest_ne100_md5, RandomForest_ne100_md10, RandomForest_ne100_md20, RandomForest_ne100_md30, RandomForest_ne200_md5, RandomForest_ne200_md10, RandomForest_ne200_md20, RandomForest_ne200_md30, RandomForest_ne300_md5, RandomForest_ne300_md10, RandomForest_ne300_md20, RandomForest_ne300_md30, RandomForest_ne400_md5, RandomForest_ne400_md10, RandomForest_ne400_md20, RandomForest_ne400_md30, RandomForest_ne500_md5, RandomForest_ne500_md10, RandomForest_ne500_md20, RandomForest_ne500_md30, LightGBM_nl31_lr0.05_ne50, LightGBM_nl31_lr0.05_ne100, LightGBM_nl31_lr0.05_ne200, LightGBM_nl31_lr0.05_ne300, LightGBM_nl31_lr0.1_ne50, LightGBM_nl31_lr0.1_ne100, LightGBM_nl31_lr0.1_ne200, LightGBM_nl31_lr0.1_ne300, LightGBM_nl31_lr0.15_ne50, LightGBM_nl31_lr0.15_ne100, LightGBM_nl31_lr0.15_ne200, LightGBM_nl31_lr0.15_ne300, Li

# 6. Forecasting  
- Produce forecasts for validation and test horizons
- Forecast Horizon = full validation / test length
- Forecasts produced:  
  - Autoregressively
  - Using model predictions as future lags

In [19]:
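#Forecasting
val_temp = val_df.copy()
val_temp["unique_id"] = "station_1" #MLForecast requires id_col for panel data, even for a single series
H_VAL = len(val_temp)

test_temp = test_df.copy()
test_temp["unique_id"] = "station_1"
H_TEST = len(test_temp)

#predict(h) always starts right after the last training timestamp, so the test
#window is the last H_TEST steps of one forecast of length H_VAL + H_TEST
#(assumes val directly follows train and test directly follows val, without gaps)
full_preds_lin = lin_fcst.predict(H_VAL + H_TEST)
full_preds_tree = tree_fcst.predict(H_VAL + H_TEST)

val_preds_lin = full_preds_lin.iloc[:H_VAL]
val_preds_tree = full_preds_tree.iloc[:H_VAL]
test_preds_lin = full_preds_lin.iloc[H_VAL:]
test_preds_tree = full_preds_tree.iloc[H_VAL:]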


# 7. Evaluation (Using Shared Metrics Function)  
- Apply `evaluate_and_save()` to each model's forecasts
- Save the results as CSV in `data/models/`
- Display the sorted results table
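
For reference, three of the reported metrics (MAE, RMSE, R2) have standard definitions that can be sketched in NumPy as below. This is not the implementation of `evaluate_and_save()`, which lives in the helper notebook and is not shown here; MAPE and OPE are left out because their exact definitions in the helper are not given. `regression_metrics` is a hypothetical name.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard MAE, RMSE and R2 for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }

metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 9.0])
print(metrics["MAE"], metrics["R2"])                   # 0.75 0.85
```

MAE and RMSE are in the units of the target, while R2 compares the model against always predicting the mean of `y_true`.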

In [20]:
OUT_FILE = "../data/models/ml_models_results.csv"
results = []

In [21]:
def evaluate_split(y_true, forecasts, split_name):
    """Evaluate every model column in `forecasts` against `y_true`."""
    #Drop the id and timestamp columns so that only model columns are evaluated
    forecasts = forecasts.drop(columns=["unique_id", "time"], errors="ignore")
    for model_name in forecasts.columns:
        metrics_dict = evaluate_and_save(
            y_true=y_true,
            y_pred=forecasts[model_name].values,
            model_name=model_name,
            impl_name="ml",
            split_name=split_name,
            out_filename="ml_models_results.csv" #temporary CSV to avoid overwriting
        )
        results.append(metrics_dict)

In [None]:
#Evaluate validation and test
evaluate_split(val_df[TARGET_COL].values, val_preds_lin, "val")
evaluate_split(val_df[TARGET_COL].values, val_preds_tree, "val")
evaluate_split(test_df[TARGET_COL].values, test_preds_lin, "test")
evaluate_split(test_df[TARGET_COL].values, test_preds_tree, "test")

#Combine results and keep top 3 per split
results_df = pd.DataFrame(results)
top_val = results_df[results_df["Split"]=="val"].sort_values("MAE").head(3)
top_test = results_df[results_df["Split"]=="test"].sort_values("MAE").head(3)
best_models = pd.concat([top_val, top_test], ignore_index=True)

In [23]:
top_model_names = best_models["Model"].unique()

BEST_MODEL_CONFIGS = {
    name: MODEL_PARAM_REGISTRY[name]
    for name in top_model_names
}

In [24]:
#Save top models to CSV
best_models.to_csv(OUT_FILE, index=False)

In [28]:
#Display results
print("Top 10 rows sorted by split and MAE:")
#display(results_df.sort_values(["Split", "MAE"]))
results_df.sort_values(["Split", "MAE"]).head(10)

Top 10 rows sorted by split and MAE:


| | Model | Impl | Split | MAE | RMSE | MAPE | OPE | R2 |
|---|---|---|---|---|---|---|---|---|
| 247 | MLP_h20_mi500 | ml | test | 3.98451 | 5.030306 | 6453597000.0 | 0.074321 | 0.759585 |
| 248 | MLP_h20_mi1000 | ml | test | 3.98451 | 5.030306 | 6453597000.0 | 0.074321 | 0.759585 |
| 249 | MLP_h20_mi1500 | ml | test | 3.98451 | 5.030306 | 6453597000.0 | 0.074321 | 0.759585 |
| 237 | KNN_k4_wdistance | ml | test | 4.224437 | 5.430456 | 3576457000.0 | 0.052938 | 0.719814 |
| 236 | KNN_k4_wuniform | ml | test | 4.369636 | 5.511018 | 9654018000.0 | 0.000145 | 0.711439 |
| 239 | KNN_k5_wdistance | ml | test | 4.458703 | 5.708083 | 3534339000.0 | 0.032122 | 0.690433 |
| 240 | KNN_k6_wuniform | ml | test | 4.491402 | 5.67343 | 5468750000.0 | 0.069097 | 0.694181 |
| 235 | KNN_k3_wdistance | ml | test | 4.778261 | 6.168883 | 3399476000.0 | 0.006476 | 0.638435 |
| 197 | LightGBM_nl73_lr0.15_ne300 | ml | test | 5.183773 | 6.787466 | 2784946000.0 | 0.120175 | 0.562288 |
| 253 | MLP_h100_mi500 | ml | test | 5.222226 | 6.75029 | 3831620000.0 | 0.120363 | 0.567069 |


In [26]:
print("Top 3 models per split saved to CSV:")
display(best_models)

Top 3 models per split saved to CSV:


| | Model | Impl | Split | MAE | RMSE | MAPE | OPE | R2 |
|---|---|---|---|---|---|---|---|---|
| 0 | MLP_h20_mi500 | ml | val | 4.270973 | 5.432609 | 1042164000.0 | 0.003407 | 0.74065 |
| 1 | MLP_h20_mi1000 | ml | val | 4.270973 | 5.432609 | 1042164000.0 | 0.003407 | 0.74065 |
| 2 | MLP_h20_mi1500 | ml | val | 4.270973 | 5.432609 | 1042164000.0 | 0.003407 | 0.74065 |
| 3 | MLP_h20_mi500 | ml | test | 3.98451 | 5.030306 | 6453597000.0 | 0.074321 | 0.759585 |
| 4 | MLP_h20_mi1000 | ml | test | 3.98451 | 5.030306 | 6453597000.0 | 0.074321 | 0.759585 |
| 5 | MLP_h20_mi1500 | ml | test | 3.98451 | 5.030306 | 6453597000.0 | 0.074321 | 0.759585 |


# 8. Conclusions  
#### Short wrap-up:  
- Which model family performed best here?  
- Any issues or instability?  
- Notes for integration in the final report  

**Best-performing model family**
- **Neural network regressor (MLP)** performed best overall
- Top model:
  - `MLP_h20_mi500` (tied with `MLP_h20_mi1000` and `MLP_h20_mi1500`, which have identical metrics)
  - Lowest MAE on both validation (4.27) and test (3.98) splits

**Other strong performers**
- KNN regressors ranked directly after MLP on the test split
  - `KNN_k4_wdistance` (MAE 4.22)
  - `KNN_k4_wuniform` (MAE 4.37)
- This suggests that:
  - local similarity in lag space is highly informative
  - temperature dynamics are smooth and locally consistent

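**Issues and instability**
- The three `MLP_h20` configurations have identical metrics for every `max_iter` value, so training stopped before 500 iterations and `max_iter` had no effect on them
- MAPE values are on the order of 10^9 and should not be used for ranking; this is typical when true values close to zero appear in the denominator

**Notes for final report**
- Linear models benefited from differencing but still performed worse, because they cannot capture nonlinear effects
- Classical ML models can perform **strongly** (best test MAE 3.98, R2 0.76) when
  - lag features are engineered carefully
  - time leakage is strictly avoided
- The results justify using
  - MLP as the primary classical benchmark
  - KNN as a robust, interpretable alternative among the ML models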