### Gradient Boosting Classifier
- Gradient Boosting is an ensemble machine learning algorithm that builds a sequence of weak learners, typically decision trees, where each subsequent model tries to correct the errors of the previous models.

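- It optimizes a loss function iteratively: each new model is fitted to the negative gradient of the loss with respect to the current predictions (for squared error, these are simply the residuals), so adding it moves the ensemble toward lower loss and gradually produces a strong predictive model.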

- Gradient Boosting is effective for both classification and regression problems and often yields high accuracy.

- Unlike Random Forest, which builds trees independently, Gradient Boosting builds trees sequentially. This lets it capture complex patterns but also makes it more prone to overfitting, particularly with many trees or a high learning rate.

- Hyperparameters like learning rate, number of trees (n_estimators), and max depth are critical and require tuning.

- Gradient Boosting can be slower to train because each tree depends on the ones before it and so cannot be built in parallel, but it often produces more accurate models on structured (tabular) data.

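- With appropriate preprocessing (for example, one-hot encoding) it handles both numerical and categorical features. The method itself works with any differentiable loss function, although scikit-learn's `GradientBoostingClassifier` only offers built-in losses.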
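
The sequential error-correcting idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: it assumes a squared-error loss on a 1-D regression toy problem (the classifier used below optimizes a log-loss instead), uses depth-1 trees (stumps) as weak learners, and the names `fit_stump`, `stump_predict`, and `boost` as well as the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate):
    """Fit stumps one after another, each on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate before adding it.
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


# Toy data (assumption): a noisy step-like target.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

base, stumps, mse_history = boost(xs, ys, n_rounds=20, learning_rate=0.3)
print(f"Training MSE after 1 round:   {mse_history[0]:.3f}")
print(f"Training MSE after 20 rounds: {mse_history[-1]:.3f}")
```

The training error shrinks as rounds are added, which is also why the number of trees and the learning rate have to be tuned together: more rounds or larger steps fit the training data more closely and can overfit.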

In [None]:
import warnings
from sklearn.exceptions import UndefinedMetricWarning

warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

In [None]:
# Load necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
accident_df = pd.read_csv("/content/drive/MyDrive/data/accidents_cleaned.csv")
# Work on a random 100,000-row sample; fix the seed so the sample is reproducible
df = accident_df.sample(100000, random_state=42)

In [None]:
# Separate features and target variable
target = 'Severity'
X = df.drop(columns=[target])
y = df[target]

In [None]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64', 'bool']).columns.tolist()

In [None]:
# Numeric transformer pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing
    ('scaler', StandardScaler())                     # scale numeric
])

# Categorical transformer pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing for numeric and categorical
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

In [None]:
# Chain the preprocessing and a GradientBoostingClassifier (in place of RandomForest) into one pipeline
clf_gb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42, n_estimators=100))
])
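
Since learning rate, number of trees, and max depth need tuning, one possible approach is a small grid search over the pipeline. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the accident dataset, and the pipeline `demo_pipeline`, the grid values, and the `f1_macro` scoring choice are assumptions. The `classifier__` prefix addresses parameters of the pipeline step named `'classifier'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): not the accident dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# A reduced pipeline with the same step name as above, so the parameter keys match.
demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingClassifier(random_state=42, n_estimators=20)),
])

# Illustrative grid (assumption): two values per hyperparameter keeps the search fast.
param_grid = {
    'classifier__learning_rate': [0.05, 0.2],
    'classifier__max_depth': [2, 3],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring='f1_macro')
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated macro F1:", round(search.best_score_, 3))
```

On the full data the same pattern would apply to `clf_gb`, though each grid point retrains the model, so the search cost grows quickly with the grid size.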

In [None]:
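# Split the data: test_size=0.6 holds out 60% for testing and trains on the remaining 40%;
# stratify=y keeps the class proportions the same in both sets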
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.6, random_state=42, stratify=y
)
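
To see what `stratify` does, here is a small sketch on made-up labels (the 80/20 class mix and the variable names are assumptions for illustration). With the same `test_size=0.6`, both parts should keep the original class ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels (assumption): 80 samples of class 2 and 20 samples of class 3.
labels = [2] * 80 + [3] * 20
features = list(range(len(labels)))

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.6, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("Train class counts:", dict(train_counts))
print("Test class counts: ", dict(test_counts))
```

Without `stratify`, a rare class could by chance end up under-represented in one of the two parts, which matters for a target as imbalanced as `Severity`.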

In [None]:
# Fit the Gradient Boosting model
clf_gb.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_gb = clf_gb.predict(X_test)

In [None]:
# Evaluate performance
print("Gradient Boosting Classifier Accuracy:", accuracy_score(y_test, y_pred_gb))
print("Classification Report:\n", classification_report(y_test, y_pred_gb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))

Gradient Boosting Classifier Accuracy: 0.8221166666666667
Classification Report:
               precision    recall  f1-score   support

           1       0.34      0.09      0.14       535
           2       0.83      0.97      0.90     47772
           3       0.69      0.27      0.39     10151
           4       0.48      0.02      0.04      1542

    accuracy                           0.82     60000
   macro avg       0.58      0.34      0.37     60000
weighted avg       0.80      0.82      0.78     60000

Confusion Matrix:
 [[   48   475    12     0]
 [   77 46468  1207    20]
 [   10  7348  2782    11]
 [    8  1447    58    29]]
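
The figures in the classification report can be re-derived from the confusion matrix printed above, which helps in reading it: rows are true classes, columns are predicted classes. The sketch below copies the matrix values from that output; the variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
confusion = [
    [48, 475, 12, 0],
    [77, 46468, 1207, 20],
    [10, 7348, 2782, 11],
    [8, 1447, 58, 29],
]
class_labels = [1, 2, 3, 4]

total = sum(sum(row) for row in confusion)
# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(confusion[i][i] for i in range(len(class_labels))) / total

precision, recall = {}, {}
for i, label in enumerate(class_labels):
    true_count = sum(confusion[i])                      # row sum: samples truly in this class
    predicted_count = sum(row[i] for row in confusion)  # column sum: samples predicted as this class
    recall[label] = confusion[i][i] / true_count
    precision[label] = confusion[i][i] / predicted_count

macro_recall = sum(recall.values()) / len(recall)

print(f"Accuracy: {accuracy:.4f}")
for label in class_labels:
    print(f"Class {label}: precision={precision[label]:.2f}, recall={recall[label]:.2f}")
print(f"Macro-averaged recall: {macro_recall:.2f}")
```

The high accuracy is driven almost entirely by class 2, which makes up most of the test set; the macro-averaged recall weights every class equally and is therefore much lower.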
