### Random Forest Classifier
- Random Forest is an ensemble learning algorithm that builds multiple decision trees and outputs the majority class for classification tasks.

- It reduces overfitting relative to a single tree by combining (voting over) many decision trees, each trained on a bootstrap sample of the rows and restricted to a random subset of features at each split.

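- The algorithm works with both numerical and categorical features and needs no feature scaling; in scikit-learn, however, categorical columns must still be encoded as numbers (for example, with one-hot encoding).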

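- Random Forest is fairly robust to noise and outliers because the errors of individual trees tend to average out. scikit-learn's implementation has historically required missing values to be imputed first (native NaN support arrived only in recent releases), so the pipeline below imputes them explicitly.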

- It provides impurity-based estimates of feature importance, useful for feature selection and interpretability, although these estimates can favor high-cardinality features.

- The model often reaches strong accuracy with little tuning and suits a wide range of classification problems.

- It benefits from tuning hyperparameters such as the number of trees (`n_estimators`), maximum depth (`max_depth`), and features considered per split (`max_features`).

- Random Forest can be computationally expensive and memory intensive when using many trees or large datasets.

- Predictions can be slower compared to simpler models because each input is evaluated by many trees.

- It is less interpretable than a single decision tree and is often considered a "black-box" model.

- It is suitable when accuracy and robustness are prioritized over interpretability and speed.
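
The "voting" described above can be checked directly. The sketch below uses synthetic data from `make_classification` (an assumption made purely for illustration, not the accident dataset) and suggests that scikit-learn's forest averages the class probabilities of its trees and then picks the class with the highest average, rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# Compare the per-tree average with the forest's own probabilities
print(np.allclose(mean_proba, forest.predict_proba(X_demo)))

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]
print(np.array_equal(manual_pred, forest.predict(X_demo)))
```

With fully grown trees each tree's probabilities are usually 0 or 1, so averaging them behaves much like a majority vote.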

In [None]:
# Load necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("C:/Users/win10/Desktop/Project_Aug25/data/accidents_cleaned.csv")
df.head()

Unnamed: 0,Severity,Start_Lat,Start_Lng,Distance(mi),City,County,State,Zipcode,Country,Timezone,...,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Duration_Minutes,Hour,DayOfWeek,Month,IsWeekend,IsDay
0,1,26.7069,-80.11936,0.0,West Palm Beach,Palm Beach,FL,33417-4638,US,US/Eastern,...,0,0,1,0,60.0,9,4,4,0,1
1,2,38.781024,-121.26582,0.045,Roseville,Placer,CA,95678-1907,US,US/Pacific,...,1,0,0,0,103.133333,10,3,4,0,1
2,3,33.985249,-84.269348,0.0,Alpharetta,Fulton,GA,30022,US,US/Eastern,...,0,0,0,0,30.0,16,4,8,0,1
3,3,47.118706,-122.556908,0.0,Tacoma,Pierce,WA,98433,US,US/Pacific,...,0,0,0,0,33.733333,15,4,9,0,1
4,2,33.451355,-111.890343,0.0,Scottsdale,Maricopa,AZ,85256,US,US/Mountain,...,0,0,0,0,76.433333,16,0,6,0,1


In [3]:
# Separate features and target variable
target = 'Severity'
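# Drop the target and the leftover CSV index column ('Unnamed: 0'), which carries no signal
X = df.drop(columns=[target, 'Unnamed: 0'], errors='ignore')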
y = df[target]

In [4]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['number', 'bool']).columns.tolist()

In [None]:
# Numeric transformer pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the column median
    ('scaler', StandardScaler())                     # not required by tree models, but harmless
])

# Categorical transformer pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing for numeric and categorical
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])
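
To see what this preprocessor produces, the sketch below applies the same two sub-pipelines to a tiny, hypothetical DataFrame. The column names only mimic the real dataset and the values are made up. It shows that each distinct category, including the imputed `'missing'` placeholder, becomes its own output column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "Distance(mi)": [0.0, 0.045, np.nan, 1.2],
    "Hour": [9, 10, 16, 15],
    "State": ["FL", "CA", "GA", np.nan],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["Distance(mi)", "Hour"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["State"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# 2 numeric columns plus one column per category (CA, FL, GA, missing)
print(transformed.shape)
```

By the same logic, columns with many distinct values (`City` and `Zipcode` look like candidates here) may expand into a very large number of one-hot columns, which can make fitting slow.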

In [6]:
# Create the full pipeline with RandomForestClassifier
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, n_estimators=100, n_jobs=-1))
])
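
The pipeline above fixes `n_estimators=100` and leaves the other settings at their defaults. One possible way to tune them is a randomized search over the pipeline, sketched below on synthetic data. The search grid, `n_iter`, and `cv` values are illustrative assumptions, not recommendations for the accident data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)

demo_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# The "classifier__" prefix routes each parameter to the forest step
param_distributions = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__max_depth": [None, 5, 10],
    "classifier__max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    demo_clf,
    param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because every candidate is refit once per fold, the cost grows quickly on large datasets; searching on a subsample first is a common compromise.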

In [7]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
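
The `stratify=y` argument keeps the class proportions the same in both splits, which matters when some classes are rare. A minimal sketch with hypothetical, deliberately imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 2, 15% class 3, 5% class 4
y_demo = np.array([2] * 800 + [3] * 150 + [4] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Return each class's share of the given label array."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

print(class_shares(y_tr))
print(class_shares(y_te))
```

Without `stratify`, a rare class could by chance be under-represented in the test set, or missing from it entirely.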

In [8]:
# Fit the model
clf.fit(X_train, y_train)

KeyboardInterrupt: 

In [None]:
# Predict on the held-out test set and report the metrics
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
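
To make these metrics easier to read, here is a small worked example on hand-written labels (hypothetical values, not model output). It also computes balanced accuracy, which may be more informative than plain accuracy if the severity classes turn out to be imbalanced.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hand-written labels for illustration only
y_true_demo = [1, 1, 2, 2, 2, 3]
y_pred_demo = [1, 2, 2, 2, 3, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])
print(cm)
# [[1 1 0]
#  [0 2 1]
#  [0 0 1]]

# Per-class recall: correct predictions (diagonal) divided by row totals
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)

# Accuracy counts all correct predictions: 4 of 6 here
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc)

# Balanced accuracy is the unweighted mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)
print(bal_acc)
```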

In [None]:
# Feature Importance extraction after preprocessing

## Extract feature names after OneHotEncoding
cat_features = (
    clf.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(categorical_cols)
)

all_features = np.concatenate([numerical_cols, cat_features])

importances = clf.named_steps['classifier'].feature_importances_
top_n = 20  # plot only the largest importances; one-hot encoding can create many columns
indices = np.argsort(importances)[::-1][:top_n]

plt.figure(figsize=(12,6))
plt.title("Feature Importances from Random Forest Classifier")
plt.bar(range(len(indices)), importances[indices], align='center')
plt.xticks(range(len(indices)), all_features[indices], rotation=90)
plt.tight_layout()
plt.show()