# Random Forest Regressor — Baseline (Accident Risk)

Goal: predict continuous `accident_risk` in [0, 1].  
This notebook mirrors the structure of the Ridge baseline:  
- data loading & prep  
- preprocessing + model  
- holdout evaluation (RMSE / MAE / R²)  
- permutation importance  
- final training & submission

## 1. Imports & data loading

In [1]:
import os, sys, json
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Make local package visible (project root -> 'common/')
# If you run Jupyter from project root, this line is optional.
sys.path.append(os.path.abspath(os.path.join('..', '..')))

from common.prep import (
    load_data, infer_target, split_features, make_preprocessor,
    holdout_split, eval_regression,
    permutation_importance_df, save_json, save_csv_df, save_submission
)

# Load data
train, test, sample = load_data()
target_col = infer_target(train, test)
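feature_cols, cat_cols, num_cols = split_features(train, target_col)
# NOTE: `id` shows up in the permutation importance table (section 4), so
# feature_cols appears to include the row identifier; consider excluding it.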

# Split features/target
X = train[feature_cols]
y = train[target_col].astype(float)
X_test = test[feature_cols]

print(f"Train shape: {X.shape} | Test shape: {X_test.shape}")
print(f"Target column: {target_col}")
print(f"Numeric: {len(num_cols)} | Categorical: {len(cat_cols)}")

Train shape: (517754, 13) | Test shape: (172585, 13)
Target column: accident_risk
Numeric: 9 | Categorical: 4


## 2. Preprocessing & model

In [2]:
# Preprocessing:
# - numeric: median imputation
# - categorical: most-frequent imputation + OneHotEncoder
# For tree models, scaling is not needed.
prep = make_preprocessor(num_cols, cat_cols, scale_numeric=False)

# Model: RandomForestRegressor (robust, non-linear baseline)
# Params kept simple and readable; tune later if needed.
model = RandomForestRegressor(
    n_estimators=400,
    min_samples_leaf=2,
    n_jobs=-1,
    random_state=42
)

# Full pipeline: preprocessing -> model
pipe = Pipeline([
    ("prep", prep),
    ("clf", model)
])
pipe
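
`make_preprocessor` comes from the project's `common.prep` module, which is not shown in this notebook. As a rough sketch of what the comments in the cell above describe, the preprocessing could be built from scikit-learn's `ColumnTransformer`, `SimpleImputer`, and `OneHotEncoder`. The name `make_preprocessor_sketch`, the `handle_unknown="ignore"` setting, and the toy frame are assumptions made for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_preprocessor_sketch(num_cols, cat_cols, scale_numeric=False):
    """Hypothetical stand-in for common.prep.make_preprocessor."""
    # Numeric branch: median imputation, optional scaling (off for trees).
    num_steps = [("impute", SimpleImputer(strategy="median"))]
    if scale_numeric:
        num_steps.append(("scale", StandardScaler()))

    # Categorical branch: most-frequent imputation, then one-hot encoding.
    # handle_unknown="ignore" (an assumption) encodes unseen categories as zeros.
    cat_steps = [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]

    return ColumnTransformer([
        ("num", Pipeline(num_steps), list(num_cols)),
        ("cat", Pipeline(cat_steps), list(cat_cols)),
    ])


# Toy frame with made-up values; the column names are illustrative only.
demo = pd.DataFrame({
    "speed_limit": [30.0, np.nan, 60.0, 45.0],
    "lighting": ["daylight", "night", np.nan, "daylight"],
})
demo_prep = make_preprocessor_sketch(["speed_limit"], ["lighting"])
demo_matrix = demo_prep.fit_transform(demo)
print(demo_matrix.shape)
```

Leaving `scale_numeric=False` matches the notebook's choice: tree splits depend only on the ordering of feature values, so rescaling numeric columns does not change them.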


## 3. Holdout evaluation

In [3]:
# === Holdout validation (80/20 split) ===
X_tr, X_va, y_tr, y_va = holdout_split(X, y)

# Fit on training split
pipe.fit(X_tr, y_tr)

# Predict on validation split
valid_pred = pipe.predict(X_va)

# Evaluate (RMSE, MAE, R²)
metrics = eval_regression(y_va, valid_pred)
print("Holdout metrics:", metrics)

# Save holdout metrics
save_json(
    {"model": "RandomForestReg", **metrics},
    "../../outputs/holdout_reports/random_forest_reg_holdout.json"
)

Holdout metrics: {'rmse': 0.05754604926249958, 'mae': 0.044635213413326824, 'r2': 0.8800693724665071}
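
For reference, the three reported metrics can be computed directly from their definitions. The sketch below is an assumption about what `eval_regression` returns, based only on the dictionary keys printed above; `eval_regression_sketch` is a hypothetical name and the toy numbers are invented.

```python
import numpy as np


def eval_regression_sketch(y_true, y_pred):
    """Hypothetical stand-in for common.prep.eval_regression."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    # RMSE and MAE are in the same units as the target.
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))

    # R² = 1 - SS_res / SS_tot; undefined when y_true is constant.
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "r2": r2}


# Invented toy values, not taken from the competition data.
toy_metrics = eval_regression_sketch([0.1, 0.4, 0.7], [0.2, 0.4, 0.6])
print(toy_metrics)
```

Because RMSE and MAE share the target's units, the holdout values above are absolute errors on the [0, 1] risk scale rather than percentages of the true value.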


## 4. Permutation Importance (on holdout)

In [4]:
# Measures how much shuffling each feature increases holdout RMSE (larger = more important).
imp_df = permutation_importance_df(
    pipe, X_va, y_va,
    num_cols=num_cols, cat_cols=cat_cols,
    n_repeats=5
)

# Save & preview
save_csv_df(imp_df, "../../outputs/feature_importance/random_forest_reg_perm_importance.csv")

print("Top-10 most important features:")
imp_df.head(10)

Top-10 most important features:


| | feature | perm_importance_rmse |
|---|---|---|
| 0 | speed_limit | 0.085131 |
| 1 | road_signs_present | 0.080892 |
| 2 | public_road | 0.076235 |
| 3 | holiday | 0.026733 |
| 4 | lighting=daylight | 0.012078 |
| 5 | road_type=rural | 0.000281 |
| 6 | num_reported_accidents | 0.000173 |
| 7 | curvature | 5.9e-05 |
| 8 | id | 3.4e-05 |
| 9 | school_season | 3.2e-05 |
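
The idea behind the numbers above can be sketched in a few lines of NumPy: shuffle one column at a time, re-score the model, and record how much RMSE rises relative to the unshuffled baseline. This is an illustrative sketch, not the project's `permutation_importance_df` (whose internals are not shown and which evidently reports one-hot columns such as `lighting=daylight`); the function names and toy data are assumptions. scikit-learn also ships `sklearn.inspection.permutation_importance`, which permutes the raw input columns of a fitted estimator.

```python
import numpy as np


def rmse(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def permutation_importance_rmse(predict, X, y, n_repeats=5, seed=0):
    """Mean RMSE increase per column when that column is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = rmse(y, predict(X))

    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between column j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(rmse(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Toy check: the "model" uses only column 0, so column 1 should score zero.
toy_rng = np.random.default_rng(42)
X_toy = toy_rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0]
toy_importances = permutation_importance_rmse(
    lambda X: 3.0 * X[:, 0], X_toy, y_toy
)
print(toy_importances)
```

Scores close to zero, like those from `road_type=rural` downward in the table, mean that shuffling the feature barely changes holdout RMSE.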


## 5. Final model training and submission

In [5]:
# Train on all available data
pipe.fit(X, y)

# Predict on test
test_pred = pipe.predict(X_test)

# Save Kaggle submission (values clipped to [0,1])
out_path = save_submission(sample, test_pred, out_name="random_forest_reg.csv")
print("Saved submission:", out_path)

Saved submission: ../../outputs/submissions/random_forest_reg.csv
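
The clipping step mentioned in the cell comment can be illustrated as follows. This is a sketch of the frame that `save_submission` presumably writes, assuming the sample file has an `id` column and an `accident_risk` column; `build_submission_sketch` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd


def build_submission_sketch(sample, preds, target_col="accident_risk"):
    """Hypothetical stand-in for the frame that save_submission writes."""
    out = sample.copy()
    # A forest averages training targets, so its predictions already stay
    # within the observed target range; clipping is mainly a safeguard for
    # models that can extrapolate, such as linear ones.
    out[target_col] = np.clip(np.asarray(preds, dtype=float), 0.0, 1.0)
    return out


# Invented values that deliberately fall outside [0, 1].
sample_demo = pd.DataFrame({"id": [1, 2, 3], "accident_risk": [0.0, 0.0, 0.0]})
submission_demo = build_submission_sketch(sample_demo, [-0.05, 0.42, 1.10])
print(submission_demo)
```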


### 🧾 Summary

- **Model:** RandomForestRegressor  
- **Holdout metrics:**  
  - RMSE = **0.0575**  
  - MAE = **0.0446**  
  - R² = **0.8801**

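On the holdout split the model reaches an **RMSE of about 0.058** and an MAE of about 0.045 in `accident_risk` units (the target spans [0, 1]), a substantial improvement over Ridge Regression.  
It can capture non-linear feature interactions and road-related thresholds; the permutation importances in section 4 are dominated by `speed_limit`, `road_signs_present`, and `public_road`.  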

However, Random Forest requires **more computational resources**,  
making it slower for retraining or large-scale tuning.

**Next:** test a more efficient alternative — `HistGradientBoostingRegressor`,  
which can often match this accuracy with lower resource usage.
