# 🧠 Overview

In this notebook, we **compare four tree-based regression models** —  
🌳 **XGBoost**, 🌿 **LightGBM**, 🐱 **CatBoost**, and 🌲 **Random Forest** —  
to evaluate their predictive performance on the same dataset.

After evaluating each model individually,  
we construct an **ensemble model** that combines the predictions of all four algorithms.  
The **blending weights** are optimized using **Optuna**, aiming to achieve the **best overall performance**.

---

### 🎯 Objectives

The main goals of this notebook are to:

- Analyze the **performance differences** among popular tree-based algorithms.  
- Demonstrate how **ensemble learning with optimized weights** can further enhance predictive accuracy.  
- Provide a **practical workflow** for combining models efficiently.

---

### 🏆 Result Summary

The final **weighted ensemble** achieved a **score of 0.05555** on the evaluation metric (public leaderboard).  
This suggests that optimized blending can yield a small improvement  
over individual tree-based learners, even when their standalone performances are similar.

---

📈 *By the end, we show that carefully tuned ensemble models can outperform individual learners —  
highlighting the real-world power of **model blending** in predictive modeling.*

In [None]:
# ===============================
# 📚 Library Imports
# ===============================

# Basic libraries
import os
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import shap

# Preprocessing
from sklearn.preprocessing import LabelEncoder, RobustScaler, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Machine Learning Models
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
from sklearn.ensemble import RandomForestRegressor

# Evaluation Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Optimization
import optuna

# Statistical tools
from scipy import stats

# Model saving & loading
import joblib

# Display input file paths (Kaggle environment)
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# 📂 Data Loading
train = pd.read_csv("/kaggle/input/playground-series-s5e10/train.csv")
predict = pd.read_csv("/kaggle/input/playground-series-s5e10/test.csv")

In [None]:
train.head()

## 📊 Data Overview

We begin by **exploring the dataset** to understand its structure and key characteristics.  
Using `info()` and `describe()`, we examine:

- **Data types** of each feature  
- **Missing values** (if any)  
- **Summary statistics** such as mean, standard deviation, and range  

Additionally, a **correlation heatmap** is visualized to identify potential relationships between features.

---

✅ This step checks whether the dataset is **clean**, **well-structured**, and **free from strong multicollinearity**,  
which gives a reliable foundation for model training.
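
The two checks described above (missing values and strongly correlated features) can also be made explicit in code rather than read off a heatmap. The following is a minimal sketch on a made-up toy frame; the helper name `high_correlation_pairs` and the 0.8 threshold are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return numeric feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame standing in for the training data (values are made up)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                        # independent noise
})
toy.loc[::50, "c"] = np.nan  # inject a few missing values

missing_counts = toy.isnull().sum()   # missing values per column
flagged = high_correlation_pairs(toy) # pairs above the threshold
print(missing_counts)
print(flagged)
```

On the real data, the same helper could be called on `train.drop(columns=['id', 'accident_risk'])`; an empty result would support the "no strong multicollinearity" reading of the heatmap.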


In [None]:
# Display basic information about the training dataset
train.info()

In [None]:
train.describe().T

In [None]:
def plot_correlation_heatmap(df, figsize=(12, 8)):
    """
    Display the correlation between all numerical features as a heatmap.
    """
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    corr_matrix = df[numeric_cols].corr()
    
    plt.figure(figsize=figsize)
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
                center=0, square=True, linewidths=1)
    plt.title('Feature Correlation Heatmap', fontsize=16)
    plt.tight_layout()
    plt.show()
    
    return corr_matrix

# Display the correlation heatmap of all features in the training data
corr_matrix = plot_correlation_heatmap(train.drop(columns=['id', 'accident_risk']))

## 🧩 Feature Engineering & One-Hot Encoding

We enhance the dataset by **adding an interaction feature** (`curvature` × `num_reported_accidents`) and transforming categorical variables into a numerical format suitable for model training.

- **Boolean columns** are converted into integer values (`0` and `1`).  
- **One-hot encoding** is applied to categorical features to ensure full compatibility with all tree-based models.  
- These transformations help models capture **nonlinear relationships** and handle **categorical diversity** effectively.

---

🧠 *Feature engineering plays a crucial role in improving model performance by providing richer, more informative inputs for learning.*
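
One practical detail when one-hot encoding the training and prediction frames separately: if a category is missing from one of them, the resulting columns no longer match and `predict()` can fail or silently misalign. The sketch below shows one way to guard against this with `DataFrame.reindex`; the toy values are made up, and only the column names `weather` and `curvature` come from this notebook.

```python
import pandas as pd

# Toy frames: the prediction frame has no "foggy" rows
train_toy = pd.DataFrame({"weather": ["clear", "rainy", "foggy"], "curvature": [0.1, 0.5, 0.9]})
test_toy = pd.DataFrame({"weather": ["clear", "rainy"], "curvature": [0.3, 0.7]})

train_enc = pd.get_dummies(train_toy, columns=["weather"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["weather"], dtype=int)

# Reindex the prediction frame to the training columns;
# categories absent from it become all-zero columns, extra ones are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

In the real workflow the target and `id` columns would be excluded before aligning, since they are not model inputs.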


In [None]:
def create_features(df):
    
    df['curavture_accidents'] = df['curvature'] * df['num_reported_accidents']
    
    return df

train = create_features(train)
predict = create_features(predict)

In [None]:
# One-Hot Encoding
def one_hot_encode(df, columns):
    df = pd.get_dummies(df, columns=columns, drop_first=False)
    return df

# categorical variables
encode_columns = ['road_type', 'lighting', 'weather', 'time_of_day']

train = one_hot_encode(train, encode_columns)
predict = one_hot_encode(predict, encode_columns)

In [None]:
# Convert boolean columns

def bool_to_int(df):
    bool_columns = df.select_dtypes(include='bool').columns
    for col in bool_columns:
        df[col] = df[col].astype(int)
    return df

train = bool_to_int(train)
predict = bool_to_int(predict)

## ⚖️ Data Scaling and Train-Test Split

To keep selected numerical features on a comparable scale, we apply **robust scaling** (`RobustScaler`, which centers on the median and scales by the interquartile range).  
Tree-based models split on thresholds and are largely insensitive to monotonic rescaling, so this step matters mainly if scale-sensitive models are added to the comparison later.

After scaling, the dataset is **split into training and validation sets**, allowing for an **unbiased comparison** of model performance under identical conditions.

---

📏 *Proper scaling and data splitting ensure fair and reliable evaluation across all tree-based models.*
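
As a small worked example of what robust scaling computes, the sketch below fits a `RobustScaler` on toy training values and applies the same statistics to unseen rows, then reproduces the result by hand as `(x - median) / IQR`. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
train_col = rng.exponential(size=(100, 1))  # skewed toy feature
new_col = rng.exponential(size=(20, 1))     # unseen rows, e.g. validation or test data

scaler = RobustScaler()                     # defaults: center on the median, scale by the 25-75% IQR
train_scaled = scaler.fit_transform(train_col)
new_scaled = scaler.transform(new_col)      # reuse the statistics learned from the training rows

# The same transform written out by hand
median = np.median(train_col)
iqr = np.percentile(train_col, 75) - np.percentile(train_col, 25)
manual = (new_col - median) / iqr

print(np.allclose(new_scaled, manual))
```

Fitting on the training rows only and calling `transform` on everything else keeps both frames on the same scale and avoids leaking statistics from unseen data.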


In [None]:
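# Since the residual errors are large, use RobustScaler, which is less sensitive to outliers
def robust_scale(train_df, predict_df):
    scaler = RobustScaler()

    numeric_columns = ['curvature','curavture_accidents']

    # Fit on the training data only, then reuse the same statistics for the prediction data
    train_df[numeric_columns] = scaler.fit_transform(train_df[numeric_columns])
    predict_df[numeric_columns] = scaler.transform(predict_df[numeric_columns])
    return train_df, predict_df

train, predict = robust_scale(train, predict)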

In [None]:
def split_data(df,test_size=0.2,random_state=42):
    X = df.drop(columns=['id','accident_risk'])
    y = df['accident_risk']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(train)

predict_X = predict.copy()
predict_X = predict_X.drop(columns=['id'])

# 🌲 Model Training

We train **four optimized tree-based models** —  
**XGBoost**, **LightGBM**, **CatBoost**, and **Random Forest** — on the same dataset to evaluate their predictive performance.

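Each model is built using **hyperparameters obtained from a prior tuning run (for example with Optuna)**; the values are hard-coded in the builder functions below, and the search itself is not part of this notebook. Using tuned settings for every model keeps the comparison reasonably fair.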

---

🚀 *This step establishes a solid foundation for model evaluation and highlights how hyperparameter optimization can enhance accuracy and consistency.*


## XGBOOST

In [None]:
# Build an optimized XGBoost regression model
def build_xgboost_model(
    n_estimators=431,
    max_depth=10,
    learning_rate=0.08116012956704159,
    subsample=0.671117833063129,
    colsample_bytree=0.928949134207643,
    gamma=0.02035246955467515,
    min_child_weight=3,
    random_state=42
):
    """Build an optimized XGBoost regression model"""
    model = xgb.XGBRegressor(
        objective='reg:squarederror',  # Squared-error loss for regression
        n_estimators=n_estimators,      # Number of boosting trees
        max_depth=max_depth,            # Maximum tree depth
        learning_rate=learning_rate,    # Step size shrinkage used to prevent overfitting
        subsample=subsample,            # Fraction of samples used for each tree
        colsample_bytree=colsample_bytree,  # Fraction of features used per tree
        gamma=gamma,                    # Minimum loss reduction required to make a further partition
        min_child_weight=min_child_weight,  # Minimum sum of instance weight needed in a child
        random_state=random_state       # Random seed for reproducibility
    )
    return model


# Build the model
xgb_model = build_xgboost_model()

# Train the model
xgb_model.fit(X_train, y_train)


## CATBOOST

In [None]:
# Build an optimized CatBoost model (for regression tasks)
def build_catboost_model(
    iterations=474,
    depth=8,
    learning_rate=0.10105022929242771,
    l2_leaf_reg=3.905251076718455,
    border_count=106,
    random_seed=42,
    verbose=100
):
    """Build an optimized CatBoost regression model"""
    model = cb.CatBoostRegressor(
        iterations=iterations,           # Number of boosting iterations (trees)
        depth=depth,                     # Maximum depth of the trees
        learning_rate=learning_rate,     # Learning rate (controls step size)
        l2_leaf_reg=l2_leaf_reg,         # L2 regularization term (helps prevent overfitting)
        border_count=border_count,       # Number of splits for numerical features
        loss_function='RMSE',            # Loss function suitable for regression
        random_seed=random_seed,         # Random seed for reproducibility
        verbose=verbose                  # Logging frequency during training
    )
    return model


# Build the model
cat_model = build_catboost_model()

# Train the model
cat_model.fit(X_train, y_train)


## LIGHTGBM

In [None]:
# Build an optimized LightGBM regression model
def build_lightgbm_model(
    n_estimators=381,
    max_depth=7,
    learning_rate=0.039237697728597476,
    subsample=0.790446416831238,
    colsample_bytree=0.9592823831701073,
    num_leaves=119,
    min_child_samples=29,
    random_state=42
):
    """Build an optimized LightGBM regression model"""
    model = lgb.LGBMRegressor(
        objective='regression',          # For regression tasks
        n_estimators=n_estimators,       # Number of boosting trees
        max_depth=max_depth,             # Maximum depth of each tree
        learning_rate=learning_rate,     # Learning rate
        subsample=subsample,             # Fraction of samples used for each tree
        colsample_bytree=colsample_bytree,  # Fraction of features used per tree
        num_leaves=num_leaves,           # Maximum number of leaves per tree
        min_child_samples=min_child_samples, # Minimum number of samples per leaf
        random_state=random_state,       # Random seed for reproducibility
        verbose=-1                       # Suppress training logs
    )
    return model


# Build the model
lgb_model = build_lightgbm_model()

# Train the model
lgb_model.fit(X_train, y_train)


## RANDOM_FOREST

In [None]:
# Build an optimized Random Forest regression model
def build_random_forest_model(
    n_estimators=195,
    max_depth=11,
    min_samples_split=10,
    min_samples_leaf=2,
    max_features=0.9342067158084462,
    bootstrap=True,
    random_state=42
):
    """Build an optimized Random Forest regression model"""
    model = RandomForestRegressor(
        n_estimators=n_estimators,       # Number of trees in the forest
        max_depth=max_depth,             # Maximum depth of each decision tree
        min_samples_split=min_samples_split, # Minimum number of samples required to split a node
        min_samples_leaf=min_samples_leaf,   # Minimum number of samples required at a leaf node
        max_features=max_features,       # Fraction of features considered for each split
        bootstrap=bootstrap,             # Whether bootstrap samples are used when building trees
        random_state=random_state,       # Random seed for reproducibility
        n_jobs=-1                        # Use all CPU cores for parallel processing
    )
    return model


# Build the model
rf_model = build_random_forest_model()

# Train the model
rf_model.fit(X_train, y_train)


# 📊 Model Predictions and Performance Evaluation

We generate **predictions** for each trained model  
(**XGBoost**, **LightGBM**, **CatBoost**, and **Random Forest**) and evaluate their performance on the **held-out validation split** (`X_test`, `y_test`) using key regression metrics:  
**Mean Squared Error (MSE)**, **Mean Absolute Error (MAE)**, and **R² score**.

This step helps to **quantify each model’s predictive capability** and identify their respective **strengths and weaknesses** before constructing the final **ensemble model**.

---

📈 *By comparing these metrics side by side, we gain valuable insights into which algorithm performs best individually — and where blending may provide further improvements.*
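
To make the three metrics concrete, the sketch below computes them by hand with NumPy on four made-up values and checks the results against scikit-learn. It also shows RMSE, which is simply the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.70])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)                 # mean of squared errors
mae = np.mean(np.abs(residuals))              # mean of absolute errors
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
rmse = np.sqrt(mse)                           # same units as the target

# The hand-written formulas agree with scikit-learn
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Because RMSE is a monotonic function of MSE, ranking models by either metric gives the same order.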


In [None]:
# Predict on the held-out validation split
xgb_pred = xgb_model.predict(X_test)
cat_pred = cat_model.predict(X_test)
lgb_pred = lgb_model.predict(X_test)
rf_pred = rf_model.predict(X_test)

In [None]:
def evalute_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    results = pd.DataFrame([{
        'MSE': mse,
        'MAE': mae,
        'R2': r2
    }])

    return results

# 📝 Interpretation

All four **tree-based models** achieve **very similar performance**, with only **minor variations** across MSE, MAE, and R² values.

Among them, **LightGBM** slightly outperforms the others —  
showing the **lowest MSE** and **highest R²**, which indicates **marginally better predictive accuracy and generalization** compared to XGBoost, CatBoost, and Random Forest.

---

📊 *This consistency among models suggests that all four algorithms are capturing much the same underlying patterns in the data.*


In [None]:
xgb_results = evalute_metrics(y_test, xgb_pred).assign(Model="XGBoost")
cat_results = evalute_metrics(y_test, cat_pred).assign(Model="CatBoost")
lgb_results = evalute_metrics(y_test, lgb_pred).assign(Model="LightGBM")
rf_results  = evalute_metrics(y_test, rf_pred).assign(Model="RandomForest")

# Combine the results and display them in a readable table
results = pd.concat([xgb_results, cat_results, lgb_results, rf_results], ignore_index=True)
results = results[['Model', 'MSE', 'MAE', 'R2']]

display(results.sort_values('MSE'))

# 🧠Ensemble Learning with Optimized Weights (Optuna)

To enhance prediction performance, we perform a **weighted ensemble** of four tree-based models:  
**XGBoost**, **LightGBM**, **CatBoost**, and **Random Forest**.

We use **Optuna** to automatically search for the optimal combination of weights that minimizes the **Mean Squared Error (MSE)** on the validation set.  
Each model’s prediction is linearly combined according to the optimized weights.

---

#### 🔍Optimization Process
1. Define search space for each model’s weight (`0.0–1.0`).
2. Normalize weights so that their total equals 1.
3. Compute the weighted average of predictions.
4. Evaluate the result with **MSE**.
5. Repeat the process with Optuna’s **TPE sampler** to find the best combination.
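
Steps 1 to 4 amount to a few lines of array arithmetic. The following sketch walks through them once for a single set of raw weights; the three models and all numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Toy predictions from three hypothetical models on four validation rows
preds = np.array([
    [0.30, 0.50, 0.70, 0.20],   # model A
    [0.34, 0.46, 0.74, 0.24],   # model B
    [0.26, 0.54, 0.66, 0.16],   # model C
])
y_val = np.array([0.32, 0.48, 0.72, 0.22])

raw_weights = np.array([0.4, 1.0, 0.6])       # step 1: values sampled from [0, 1]
weights = raw_weights / raw_weights.sum()     # step 2: normalize so they sum to 1
blend = weights @ preds                       # step 3: weighted average per row
mse = np.mean((y_val - blend) ** 2)           # step 4: score the blend

print(weights, blend, mse)
```

Multiplying all raw weights by the same constant leaves the normalized weights unchanged, so many different trials map to the same blend; the optimizer still works, but part of the search space is redundant.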

---

#### 📊Final Ensemble Output
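The final ensemble uses the best weight combination found by Optuna to produce the final prediction, and its performance is reported as **RMSE**. Because the weights are tuned on the same hold-out split that is used for scoring, this RMSE is an optimistic estimate.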
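
One way to get a less optimistic estimate is to fit the weights on one part of the hold-out data and score the blend on another. The sketch below illustrates the idea on synthetic predictions. Note that it swaps in non-negative least squares (`scipy.optimize.nnls`) for the Optuna search, purely to keep the example short; it is not the method used in this notebook, and the three base models are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n = 400
y = rng.uniform(0.0, 1.0, size=n)
# Three hypothetical base models: the target plus independent noise of different sizes
P = np.column_stack([y + rng.normal(scale=s, size=n) for s in (0.05, 0.06, 0.08)])

# First half fits the weights, second half is used only for scoring
fit_idx, eval_idx = np.arange(0, n // 2), np.arange(n // 2, n)

# Non-negative least squares on the fitting half, then normalize to sum to 1
raw, _ = nnls(P[fit_idx], y[fit_idx])
weights = raw / raw.sum()

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

blend_rmse = rmse(y[eval_idx], P[eval_idx] @ weights)
single_rmse = [rmse(y[eval_idx], P[eval_idx, j]) for j in range(P.shape[1])]

print(weights, blend_rmse, single_rmse)
```

The same split could be combined with the Optuna objective by computing the MSE on the fitting rows inside the objective and reporting the final RMSE on the scoring rows.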

---


In [None]:
import os
import optuna
import numpy as np
from sklearn.metrics import mean_squared_error

# --- Optimization function ---
def optimize_weight(trial):
    w_xgb = trial.suggest_float('xgb_weight', 0.0, 1.0)
    w_lgb = trial.suggest_float('lgb_weight', 0.0, 1.0)
    w_rf  = trial.suggest_float('rf_weight', 0.0, 1.0)
    w_cat = trial.suggest_float('cat_weight', 0.0, 1.0)

    total_weight = w_xgb + w_lgb + w_rf + w_cat
    if total_weight == 0:
        return np.inf  # Invalid case

    # Normalize weights
    w_xgb /= total_weight
    w_lgb /= total_weight
    w_rf  /= total_weight
    w_cat /= total_weight

    # Weighted ensemble prediction
    final_pred = (
        w_xgb * xgb_pred +
        w_lgb * lgb_pred +
        w_rf  * rf_pred +
        w_cat * cat_pred
    )

    mse = mean_squared_error(y_test, final_pred)
    return mse


# --- Run optimization with Optuna ---
optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(optimize_weight, n_trials=1000, show_progress_bar=True)

# --- Results ---
best_params = study.best_params
best_w_xgb = best_params['xgb_weight']
best_w_lgb = best_params['lgb_weight']
best_w_rf  = best_params['rf_weight']
best_w_cat = best_params['cat_weight']

# Normalize again
total_weight = best_w_xgb + best_w_lgb + best_w_rf + best_w_cat
best_w_xgb /= total_weight
best_w_lgb /= total_weight
best_w_rf  /= total_weight
best_w_cat /= total_weight

# --- Final ensemble prediction ---
final_pred = (
    best_w_xgb * xgb_pred +
    best_w_lgb * lgb_pred +
    best_w_rf  * rf_pred +
    best_w_cat * cat_pred
)

rmse = np.sqrt(mean_squared_error(y_test, final_pred))

# --- Display results ---
print("\n=== Optimized Weights ===")
print(f"XGBoost: {best_w_xgb:.4f}")
print(f"LightGBM: {best_w_lgb:.4f}")
print(f"RandomForest: {best_w_rf:.4f}")
print(f"CatBoost: {best_w_cat:.4f}")
print(f"\nFinal RMSE: {rmse:.4f}")

In [None]:
xgb_results = evalute_metrics(y_test, xgb_pred).assign(Model="XGBoost")
cat_results = evalute_metrics(y_test, cat_pred).assign(Model="CatBoost")
lgb_results = evalute_metrics(y_test, lgb_pred).assign(Model="LightGBM")
rf_results  = evalute_metrics(y_test, rf_pred).assign(Model="RandomForest")
ensemble_results = evalute_metrics(y_test, final_pred).assign(Model="Optimized Ensemble")

# Combine the results and display them in a readable table
results = pd.concat([xgb_results, cat_results, lgb_results, rf_results], ignore_index=True)
results = results[['Model', 'MSE', 'MAE', 'R2']]

display(results.sort_values('MSE'))


In [None]:
results = pd.concat(
    [xgb_results, lgb_results, rf_results, cat_results, ensemble_results],
    ignore_index=True
)
results = results[['Model', 'MSE', 'MAE', 'R2']]
display(results)

# 🔍 Analysis
- All four tree-based models show **very similar performance**, with MSE around `0.00316` and R² around `0.885`.  
- Among single models, **LightGBM** slightly outperforms others in terms of R².  
- The **ensemble model** achieves the **lowest MSE and highest R²**, although the margin over the best single model is small.  
- This demonstrates the benefit of **ensemble learning**, even when base models perform similarly.
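
A quick diagnostic for why blending gains can be small is the correlation between the base models' errors: the more the models err together, the less averaging can cancel out. The sketch below uses synthetic predictions with a deliberately large shared error component; the model names are labels only, and none of the numbers come from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = rng.uniform(0.0, 1.0, size=500)          # stand-in for the true target
shared_error = rng.normal(scale=0.05, size=500)   # error all models make together

# Hypothetical predictions: mostly shared error plus a little model-specific noise
preds = pd.DataFrame({
    name: signal + shared_error + rng.normal(scale=0.01, size=500)
    for name in ["xgb", "lgb", "cat", "rf"]
})

residuals = preds.sub(signal, axis=0)   # each model's error against the target
error_corr = residuals.corr()           # pairwise error correlation

print(error_corr.round(3))
```

For models with equal error variance σ² and pairwise error correlation ρ, an equal-weight average of n models has error variance σ²(ρ + (1 − ρ)/n), so a ρ close to 1 leaves little room for improvement. On the real data, `residuals` could be built from `xgb_pred`, `lgb_pred`, `cat_pred`, and `rf_pred` minus `y_test`.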

# 📤 Submission

In [None]:
xgb_pred_new = xgb_model.predict(predict_X)

lgb_pred_new = lgb_model.predict(predict_X)

rf_pred_new = rf_model.predict(predict_X)

cat_pred_new = cat_model.predict(predict_X)

final_pred_new = (best_w_xgb * xgb_pred_new 
                  + best_w_lgb * lgb_pred_new 
                  + best_w_rf * rf_pred_new
                  + best_w_cat * cat_pred_new 
                  )

predict_df = pd.DataFrame(final_pred_new, columns=['accident_risk'])
submission = pd.concat([predict['id'], predict_df], axis=1)

display(submission.head())
print(submission.isnull().sum())

In [None]:
# --- Save to CSV for Kaggle submission ---
submission.to_csv('submission.csv', index=False)
print("\n✅ Submission file saved as 'submission.csv'")

# 🏁 Conclusion

In this notebook, we compared four **tree-based regression models** —  
**XGBoost**, **LightGBM**, **CatBoost**, and **Random Forest** — to evaluate their predictive performance on the given dataset.

Each model achieved **comparable accuracy**, with **LightGBM** performing slightly better among the individual models.

By applying **Optuna-based weight optimization**, we constructed a **weighted ensemble model** that achieved a **marginal improvement** in overall performance,  
reaching the **lowest MSE (0.003152)** and **highest R² (0.885845)** among all approaches on the hold-out split.

---

### ✅ Key Takeaways
- Even when single models perform similarly, **ensemble learning** can capture **complementary strengths**.  
- **Proper hyperparameter tuning** and **balanced weighting** lead to more **robust generalization**.  
- **Tree-based models** continue to be strong baselines for **structured tabular data** tasks.

---

### 🚀 Future Directions
For further improvement, incorporating **neural network–based features** or exploring **meta-model stacking** could enhance predictive accuracy beyond the current ensemble approach.
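
As a starting point for the stacking idea, here is a minimal sketch using scikit-learn's `StackingRegressor`, which trains a meta-model on out-of-fold predictions of the base models. The data is synthetic, and the choice of base models and a `Ridge` meta-model are assumptions for illustration rather than a recommendation for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real training set
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("gbr", GradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-model fitted on out-of-fold base predictions
    cv=5,                              # folds used to generate those predictions
)
stack.fit(X_tr, y_tr)
val_r2 = stack.score(X_val, y_val)     # R² on the held-out rows

print(val_r2)
```

Compared with a fixed weighted average, the meta-model learns its combination from out-of-fold predictions, which reduces the risk of tuning the blend on the same rows used to score it.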

---

✨ *Overall, this experiment highlights the power of ensemble learning and the importance of model optimization in achieving consistent, high-quality predictions.*
