### Random Forest Classifier
- Random Forest is an ensemble learning algorithm that builds multiple decision trees and outputs the majority class for classification tasks.

- It reduces overfitting by averaging (voting) many decision trees trained on random subsets of data and features.

- It works with both numerical and categorical features and needs no feature scaling, although scikit-learn requires categorical columns to be encoded numerically first.

- Random Forest is fairly robust to noise and outliers. Missing values still need handling in most implementations, so this notebook imputes them during preprocessing.

- It provides estimates of feature importance, useful for feature selection and interpretability.

- The model often performs well with little tuning and suits a wide range of classification problems.

- For best performance it requires tuning hyperparameters such as the number of trees (`n_estimators`), the maximum depth (`max_depth`), and the number of features considered per split (`max_features`).

- Random Forest can be computationally expensive and memory intensive when using many trees or large datasets.

- Predictions can be slower compared to simpler models because each input is evaluated by many trees.

- It is less interpretable than a single decision tree and is often considered a "black-box" model.

- It is suitable when accuracy and robustness are prioritized over interpretability and speed.
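
The voting step described above can be sketched on synthetic data. This is a minimal illustration, not part of the accident analysis: the names `X_demo`, `y_demo`, and `forest` are made up for the example. Note that scikit-learn's `RandomForestClassifier` averages the per-tree class probabilities rather than counting hard votes, so the sketch reproduces the forest's prediction that way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data, used only for illustration (not the accident dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# An odd number of trees makes exact ties unlikely in a two-class problem
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

# Average the per-tree class probabilities, then pick the most probable class
tree_probas = np.stack([tree.predict_proba(X_te) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print("Matches forest.predict:", np.array_equal(manual_pred, forest.predict(X_te)))
```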

In [None]:
import warnings
from sklearn.exceptions import UndefinedMetricWarning

warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

In [None]:
# Load necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
accident_df = pd.read_csv("/content/drive/MyDrive/data/accidents_cleaned.csv")
# Draw a reproducible random sample of 100,000 rows
df = accident_df.sample(100000, random_state=42)

In [None]:
# Separate features and target variable
target = 'Severity'
X = df.drop(columns=[target])
y = df[target]

In [None]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64', 'bool']).columns.tolist()

In [None]:
# Numeric transformer pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values with the column median
    ('scaler', StandardScaler())                     # standardize; optional for tree models, which are scale-invariant
])

# Categorical transformer pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing for numeric and categorical
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

In [None]:
# Create the full pipeline with RandomForestClassifier
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, n_estimators=100, n_jobs=-1))
])
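
The pipeline above uses fixed hyperparameters. One possible way to tune the number of trees, maximum depth, and maximum features is a randomized search over the `classifier__*` parameters. The sketch below is an assumption-laden illustration: it runs on synthetic data with a stand-in pipeline (`demo_clf`), the search ranges are guesses rather than tuned values, and `f1_macro` is just one scoring choice that weights all classes equally.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: a real run would use X_train, y_train)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Stand-in pipeline that reuses the step name 'classifier' from the pipeline above
demo_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Illustrative search space; the ranges are guesses, not recommended values
param_distributions = {
    "classifier__n_estimators": randint(20, 60),
    "classifier__max_depth": [None, 5, 10],
    "classifier__max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    demo_clf,
    param_distributions=param_distributions,
    n_iter=5,             # number of sampled configurations
    cv=3,                 # 3-fold cross-validation
    scoring="f1_macro",   # treats every class equally
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The `step__parameter` naming is how scikit-learn addresses parameters of a step inside a `Pipeline`, so the same dictionary keys would apply to `clf`.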

In [None]:
# Split the data into training and testing sets
# Note: test_size=0.7 holds out 70% of the rows for testing and trains on the remaining 30%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=42, stratify=y
)

In [None]:
# Fit the model
clf.fit(X_train, y_train)

In [None]:
# Predict on the held-out test set and report the evaluation metrics
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.8127142857142857

Classification Report:
               precision    recall  f1-score   support

           1       0.00      0.00      0.00       570
           2       0.81      0.99      0.89     55690
           3       0.80      0.13      0.23     11868
           4       0.67      0.00      0.00      1872

    accuracy                           0.81     70000
   macro avg       0.57      0.28      0.28     70000
weighted avg       0.80      0.81      0.75     70000


Confusion Matrix:
 [[    0   563     7     0]
 [    0 55332   356     2]
 [    0 10314  1554     0]
 [    0  1848    20     4]]
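
To see where the headline accuracy comes from, the per-class precision and recall can be recomputed directly from the confusion matrix printed above. This is a small worked check, assuming the usual scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = np.array([
    [0,   563,    7, 0],
    [0, 55332,  356, 2],
    [0, 10314, 1554, 0],
    [0,  1848,   20, 4],
])
labels = [1, 2, 3, 4]

true_totals = cm.sum(axis=1)   # support: number of true samples per class
pred_totals = cm.sum(axis=0)   # number of predictions made for each class
correct = np.diag(cm)          # correctly classified samples per class

recall = correct / true_totals
# Class 1 is never predicted, so its precision is 0/0; report 0, as sklearn does by default
precision = np.divide(
    correct, pred_totals, out=np.zeros(len(labels)), where=pred_totals > 0
)
accuracy = correct.sum() / cm.sum()

for label, p, r in zip(labels, precision, recall):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The recomputed values agree with the classification report: almost all of the 0.81 accuracy comes from class 2, while classes 1 and 4 are essentially never recovered. The undefined precision for class 1 is also the case that triggers `UndefinedMetricWarning`.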
