**Update (2025/10/12)**

With sincere thanks to @lwolwo for valuable feedback, I have corrected the logging of the RMSE score for each fold.

-   **Issue:** The log previously displayed the RMSE of the model from the last seed iteration only, rather than the RMSE of the seed-averaged predictions for the fold.
-   **Correction:** The metric is now computed from the out-of-fold predictions (`oof_preds[val_idx]`) instead of the last iteration's predictions (`val_preds`), so the logged score reflects the performance of the whole fold.
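
The difference between the two scores can be illustrated with a minimal sketch on synthetic numbers. The data, the `rmse` helper, and the three-seed setup are assumptions made for illustration; they are not the notebook's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical validation targets and per-seed predictions (3 seeds, 8 rows).
y_val = rng.normal(size=8)
seed_preds = [y_val + rng.normal(scale=0.5, size=8) for _ in range(3)]


def rmse(y_true, y_pred):
    """Root mean squared error of two equally sized arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Accumulate the seed average the same way the training loop does.
fold_oof = np.zeros_like(y_val)
for val_preds in seed_preds:
    fold_oof += val_preds / len(seed_preds)

last_seed_rmse = rmse(y_val, seed_preds[-1])  # what the log used to report
averaged_rmse = rmse(y_val, fold_oof)         # what the log reports now

print(f"last seed only: {last_seed_rmse:.4f}")
print(f"seed-averaged : {averaged_rmse:.4f}")
```

The accumulated `fold_oof` equals the plain mean over seeds, so scoring it measures the predictions that are actually written to the OOF file.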

Thank you again, @lwolwo, for your sharp observation!

# Stacking Baseline with a Neural Network

Welcome to this notebook. Our goal here is to create a simple and effective baseline for **Stacking**, a powerful ensembling technique used in many top Kaggle solutions. We'll implement Stacking with a Neural Network as the final model.

Let's get started!


## What is Stacking?

Simply put, Stacking (or Stacked Generalization) is an ensemble technique that combines multiple machine learning models to create a more powerful one. The process works as follows:

1.  **Level 1 Models (Base Models):** First, we train several different models (e.g., LightGBM, CatBoost, XGBoost, TabM) on the training data using cross-validation (CV).
2.  **Generate New Features:** During the cross-validation process, we collect the predictions made on the validation sets. These are called **Out-of-Fold (OOF)** predictions. We also get predictions for the test set from each base model.
3.  **Level 2 Model (Meta-Model):** We then train a final model, called a meta-model, using the OOF predictions as its input features. At inference time, the base models' test predictions are fed to the meta-model to produce the final predictions.

This approach often leads to better performance because the meta-model learns the strengths and weaknesses of each base model and combines their predictions intelligently. Many top solutions in Kaggle competitions rely on this method.
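
The three steps can be sketched as follows. This is a minimal illustration on synthetic data: the `make_oof` helper, the two scikit-learn base models, and the `LinearRegression` meta-model are stand-ins chosen for brevity, not the LightGBM/XGBoost/RealMLP/TabM base models or the Keras meta-model used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the real train/test sets.
X_train = rng.normal(size=(200, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 4))

# One shared splitter, reused by every base model.
kf = KFold(n_splits=5, shuffle=True, random_state=42)


def make_oof(model_factory, X, y, X_new, cv):
    """Return OOF predictions for X and fold-averaged predictions for X_new."""
    oof = np.zeros(len(X))
    new_pred = np.zeros(len(X_new))
    for train_idx, val_idx in cv.split(X):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        # Each row is predicted by the one model that never trained on it.
        oof[val_idx] = model.predict(X[val_idx])
        new_pred += model.predict(X_new) / cv.get_n_splits()
    return oof, new_pred


# Level 1: hypothetical base models.
base_models = {
    "ridge": lambda: Ridge(alpha=1.0),
    "tree": lambda: DecisionTreeRegressor(max_depth=4, random_state=0),
}

oof_cols, test_cols = [], []
for name, factory in base_models.items():
    oof, test_pred = make_oof(factory, X_train, y_train, X_test, kf)
    oof_cols.append(oof)
    test_cols.append(test_pred)

# Level 2: OOF columns are the training features, test columns the inference features.
meta_X = np.column_stack(oof_cols)
meta_X_test = np.column_stack(test_cols)

meta_model = LinearRegression().fit(meta_X, y_train)
final_pred = meta_model.predict(meta_X_test)
print(meta_X.shape, meta_X_test.shape, final_pred.shape)
```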

In this notebook, we'll use a **Neural Network (NN)** as our Level 2 meta-model.


## Our Approach in This Notebook

We will load OOF predictions and test predictions that have been pre-generated by my public notebooks. You'll see us reading `.csv` files with names like `oof_...csv` and `test_...csv` at the beginning. These files contain the Level 1 predictions that will serve as the input features for our NN meta-model.


## A Crucial Warning: Preventing Data Leakage

This is the most important rule in Stacking:

> **All Out-of-Fold (OOF) predictions must be generated using the exact same cross-validation splits.**

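In our case, all base models used `KFold` with `n_splits=5`, `shuffle=True`, and `random_state=42` (`random_state` only takes effect when `shuffle=True`). You **must** keep these parameters, and the row order of the training data, consistent across all your base models.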
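
One way to check that two notebooks really share the same splits is to compare the fold assignment of every row. The sketch below assumes scikit-learn's `KFold`; the `fold_ids` helper and the row count of 1000 are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold


def fold_ids(n_rows, n_splits=5, seed=42):
    """Return an array giving the validation-fold index of every row."""
    folds = np.empty(n_rows, dtype=int)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (_, val_idx) in enumerate(kf.split(np.zeros(n_rows))):
        folds[val_idx] = fold
    return folds


folds_a = fold_ids(1000)          # settings used by one base model
folds_b = fold_ids(1000)          # same settings in another notebook
folds_c = fold_ids(1000, seed=7)  # a different seed

print(np.array_equal(folds_a, folds_b))  # identical settings give identical folds
print(np.array_equal(folds_a, folds_c))  # a different seed gives different folds
```

The assignment depends only on the number of rows and the splitter settings, which is why the training rows must also be in the same order in every notebook.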

### Why is this so critical?

If you use different CV splits for different models, you will introduce **data leakage**. Let's break down why:

Imagine you have two base models, Model A and Model B, with different CV splits.
* For **Model A**, a specific data point (let's call it `row_100`) was in the training set for Fold 1 and the validation set for Fold 2.
* For **Model B** (with a different split), `row_100` was in the validation set for Fold 1.

When you create the OOF features for your meta-model:
* The OOF prediction for `row_100` from Model A comes from Fold 2, where Model A has **never seen** `row_100` during its training for that fold. This is correct.
* However, the OOF prediction for `row_100` from Model B comes from Fold 1.

Now, if you feed these OOFs to your meta-model, it learns from Model A's prediction for `row_100` (which was made when `row_100` was unseen data) and Model B's prediction. The problem is, when Model A was being trained on Fold 1, it *saw* the target value for `row_100`. This means some information about the target value has "leaked" into the training process of your meta-model through an indirect path.

This is a classic case of **data leakage**. Your local CV score will be artificially inflated, giving you a false sense of confidence, but your model will likely perform poorly on the private test set because it's overfitted to this leaked information.

**To summarize: Always use the same CV folds for all base models.**


## A Final Note: "Trust Your CV"

You may have seen the popular "sport" on Kaggle of blending high-scoring public submissions with various weights to climb the Public Leaderboard (LB). While it can be a fun exercise, it's important to remember what we're ultimately competing on: the **Private LB**, which is calculated on a hidden 80% of the test data.

A model's true strength lies in its ability to **generalize** to unseen data. The most reliable way to measure this is through a carefully constructed and stable cross-validation framework. This is why you'll often hear the phrase:

> **"Trust your CV."**

A good local CV score is a much better indicator of private LB performance than a high public LB score achieved through blending. Let's build a model with strong fundamentals!

In [1]:
import warnings
warnings.simplefilter('ignore')

In [2]:
import pandas as pd, numpy as np

train = pd.read_csv('/kaggle/input/playground-series-s5e10/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s5e10/test.csv')

train.head(3)

|  | id | road_type | num_lanes | curvature | speed_limit | lighting | weather | road_signs_present | public_road | time_of_day | holiday | school_season | num_reported_accidents | accident_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | urban | 2 | 0.06 | 35 | daylight | rainy | False | True | afternoon | False | True | 1 | 0.13 |
| 1 | 1 | urban | 4 | 0.99 | 35 | daylight | clear | True | False | evening | True | True | 0 | 0.35 |
| 2 | 2 | rural | 4 | 0.63 | 70 | dim | clear | False | True | morning | True | False | 2 | 0.3 |


In [3]:
TARGET = 'accident_risk'

In [4]:
import glob, os

all_oof_data = []
all_test_data = []

oof_files = glob.glob('/kaggle/input/**/oof_*.csv', recursive=True)
print(f"Found {len(oof_files)} oof files.")

for oof_path in oof_files:
    test_path = oof_path.replace('oof_', 'test_')

    base_name = os.path.basename(oof_path)
    model_name = base_name.replace('oof_', '').replace('.csv', '')

    all_oof_data.append({
        'df': pd.read_csv(oof_path),
        'name': model_name
    })
    all_test_data.append({
        'df': pd.read_csv(test_path),
        'name': model_name
    })

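def merge_dataframes_by_id(data_list, id_col='id', feature_col=TARGET):
    """Merge per-model prediction frames on `id_col`, renaming each `feature_col` to `<feature_col>_<model name>`."""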
    first_data = data_list[0]
    merged_df = first_data['df'].rename(columns={
        feature_col: f"{feature_col}_{first_data['name']}"
    })

    for data in data_list[1:]:
        renamed_df = data['df'].rename(columns={
            feature_col: f"{feature_col}_{data['name']}"
        })
        merged_df = pd.merge(merged_df, renamed_df, on=id_col, how='outer')

    return merged_df

oof_df = merge_dataframes_by_id(all_oof_data)
test_df = merge_dataframes_by_id(all_test_data)

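# Attach the target by id rather than by position, so row order cannot misalign it.
oof_df = oof_df.merge(train[['id', TARGET]], on='id', how='left')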

oof_df.head(3)

Found 5 oof files.


|  | id | accident_risk_20seedslgb_plus_origcol | accident_risk_xgb_plus_origcol | accident_risk_realmlp_plus_origcol | accident_risk_tabm_overresid | accident_risk_tabm_plus_origcol_tuned | accident_risk |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.128497 | 0.128963 | 0.130021 | 0.13067 | 0.128814 | 0.13 |
| 1 | 1 | 0.323496 | 0.32294 | 0.326364 | 0.325544 | 0.32326 | 0.35 |
| 2 | 2 | 0.387007 | 0.387817 | 0.383694 | 0.3888 | 0.382161 | 0.3 |
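
Because the frames are joined with an outer merge, a model file with missing or extra ids would silently introduce `NaN` values or extra rows. A few sanity checks on the merged frame can catch this early. The `check_stacking_frame` helper and the toy frames below are hypothetical and only sketch the idea.

```python
import pandas as pd


def check_stacking_frame(stack_df, reference_ids, id_col="id"):
    """Return a list of problems found in a merged prediction frame."""
    problems = []
    if stack_df[id_col].duplicated().any():
        problems.append("duplicate ids")
    if stack_df.isna().any().any():
        problems.append("missing predictions")
    ref = pd.Series(reference_ids).to_numpy()
    if len(stack_df) != len(ref) or not (stack_df[id_col].to_numpy() == ref).all():
        problems.append("ids not aligned with reference")
    return problems


# Toy reference frame standing in for train.csv.
ref_df = pd.DataFrame({"id": [0, 1, 2], "accident_risk": [0.13, 0.35, 0.30]})

pred_a = pd.DataFrame({"id": [0, 1, 2], "pred_a": [0.12, 0.33, 0.38]})
pred_b = pd.DataFrame({"id": [0, 1, 2], "pred_b": [0.13, 0.32, 0.39]})
pred_b_short = pred_b.iloc[:2]  # a model file that lacks id 2

good = pred_a.merge(pred_b, on="id", how="outer")
bad = pred_a.merge(pred_b_short, on="id", how="outer")

print(check_stacking_frame(good, ref_df["id"]))
print(check_stacking_frame(bad, ref_df["id"]))
```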


In [5]:
FEATURES = [col for col in oof_df.columns if col not in ['id',TARGET]]

X = oof_df[FEATURES]
y = oof_df[TARGET]

In [6]:
from sklearn.model_selection import KFold

N_SPLITS = 5
kf = KFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

In [7]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler



In [8]:
# Train the NN meta-model: within each CV fold, average the predictions of one model per seed.

oof_preds = np.zeros(len(X))
test_preds = np.zeros(len(test_df))

SEED = [32, 12, 377, 485, 5900, 2392, 3948, 189, 304598, 38950]

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f'---Fold {fold+1}/{N_SPLITS}---')

    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    X_test = test_df[FEATURES].copy()

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    X_test_scaled = scaler.transform(X_test)

    for seed in SEED:
        np.random.seed(seed)
        tf.random.set_seed(seed)
        model = Sequential([
            Input(shape=(X_train_scaled.shape[1],)), 
            Dense(64, activation='relu'),            
            Dense(32, activation='relu'),            
            Dense(1)                                 
        ])
    
        model.compile(optimizer='adam', loss='mean_squared_error')
    
        early_stopping = EarlyStopping(
            monitor='val_loss', 
            patience=20,       
            restore_best_weights=True 
        )
    
        model.fit(X_train_scaled, y_train,
                  validation_data=(X_val_scaled, y_val),
                  epochs=200,       
                  batch_size=512,
                  callbacks=[early_stopping],
                  verbose=0         
                 )
    
        val_preds = model.predict(X_val_scaled).flatten() 
        oof_preds[val_idx] += val_preds / len(SEED)
    
        test_preds += model.predict(X_test_scaled).flatten() / len(SEED)

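    # RMSE of the seed-averaged OOF predictions for this fold.
    # np.sqrt(MSE) is used because `squared=False` is not accepted by recent scikit-learn releases.
    fold_rmse = np.sqrt(mean_squared_error(y_val, oof_preds[val_idx]))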
    print(f"Fold {fold+1} RMSE: {fold_rmse}")

test_preds /= N_SPLITS

overall_oof_rmse = np.sqrt(mean_squared_error(y, oof_preds))
print(f"\nOverall OOF RMSE: {overall_oof_rmse:.5f}")

---Fold 1/5---



In [9]:
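# Take ids from the same frames the predictions were computed on, so ids and predictions stay aligned.
pd.DataFrame({'id': oof_df['id'], TARGET: oof_preds}).to_csv('oof_nn_ensemble.csv', index=False)
pd.DataFrame({'id': test_df['id'], TARGET: test_preds}).to_csv('test_nn_ensemble.csv', index=False)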