### Gradient Boosting Classifier
- Gradient Boosting is an ensemble machine learning algorithm that builds a sequence of weak learners, typically shallow decision trees, where each new model is fit to correct the errors of the ensemble built so far.

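- It optimizes a differentiable loss function: each new tree is fit to the negative gradient of the loss with respect to the current predictions, and its scaled output is added to the ensemble, producing a strong predictive model.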

- Gradient Boosting is effective for both classification and regression problems and often yields high accuracy.

- Unlike Random Forest, which builds trees independently, Gradient Boosting builds trees sequentially. This lets it capture complex patterns, but it can overfit when too many trees or too high a learning rate are used.

- Hyperparameters like learning rate, number of trees (n_estimators), and max depth are critical and require tuning.

- Because trees are built one after another, Gradient Boosting is often slower to train than Random Forest, but it frequently achieves higher accuracy on structured (tabular) data.

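- It handles numerical data directly; categorical data must first be encoded (for example, one-hot encoded) for scikit-learn's `GradientBoostingClassifier`. Gradient boosting in general supports any differentiable loss, although `GradientBoostingClassifier` itself only offers the built-in `log_loss` and `exponential` losses.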
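
The sequential error correction described above can be sketched from scratch for binary classification. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the data is synthetic (`make_classification`), the loss is log loss, and each tree's raw output is added directly with a fixed `learning_rate` rather than re-optimizing leaf values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss_from_raw(y_true, raw_scores):
    # Mean binary log loss computed from raw (log-odds) scores
    p = sigmoid(raw_scores)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


# Synthetic stand-in data (assumption, not the accidents dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

learning_rate = 0.1  # illustrative values
n_rounds = 50

# Start from a constant prediction: the log-odds of the positive class
p0 = y_toy.mean()
raw = np.full(len(y_toy), np.log(p0 / (1 - p0)))
initial_loss = log_loss_from_raw(y_toy, raw)

trees = []
for _ in range(n_rounds):
    # Negative gradient of log loss w.r.t. the raw score is (y - p)
    residuals = y_toy - sigmoid(raw)
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    # Move the ensemble a small step in the direction the tree suggests
    raw += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_loss = log_loss_from_raw(y_toy, raw)
print(f"log loss: {initial_loss:.3f} -> {final_loss:.3f} after {n_rounds} trees")
```

Each round fits a small regression tree to what the current ensemble still gets wrong, which is why the training loss falls as trees are added.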

In [None]:
# Load necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("C:/Users/win10/Desktop/Project_Aug25/data/accidents_cleaned.csv")
df.head()

In [None]:
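# Separate features and target variable
target = 'Severity'
X = df.drop(columns=[target])
y = df[target]

# Hold out a test set for evaluation.
# test_size=0.2 and stratify=y are assumed values; adjust as needed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)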

In [None]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['number', 'bool']).columns.tolist()  # 'number' covers every int/float width

In [None]:
# Numeric transformer pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fill missing
    ('scaler', StandardScaler())                     # scale numeric
])

# Categorical transformer pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing for numeric and categorical
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

In [None]:
# Create pipeline with GradientBoostingClassifier instead of RandomForest
clf_gb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(random_state=42, n_estimators=100))
])
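
Since learning rate, number of trees, and tree depth interact, one possible way to tune a pipeline like the one above is a small grid search over the `classifier__` parameters. The sketch below is self-contained and uses a made-up DataFrame: the column names `speed`, `visibility`, and `weather`, the target rule, and the grid values are all hypothetical, and imputation is omitted because the toy data has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "speed": rng.normal(50, 10, n),
    "visibility": rng.normal(5, 2, n),
    "weather": rng.choice(["clear", "rain", "fog"], n),
})
demo_target = ((demo["speed"] > 55) | (demo["weather"] == "fog")).astype(int)

demo_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["speed", "visibility"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["weather"]),
])

demo_pipeline = Pipeline(steps=[
    ("preprocessor", demo_preprocessor),
    ("classifier", GradientBoostingClassifier(random_state=42)),
])

# Step name + double underscore + parameter name addresses the classifier
param_grid = {
    "classifier__learning_rate": [0.05, 0.1],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [2, 3],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(demo, demo_target)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

With the real data, the same `param_grid` keys would apply to `clf_gb`, since its steps are also named `preprocessor` and `classifier`.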

In [None]:
# Fit the Gradient Boosting model
clf_gb.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_gb = clf_gb.predict(X_test)

In [None]:
# Evaluate performance
print("Gradient Boosting Classifier Accuracy:", accuracy_score(y_test, y_pred_gb))
print("Classification Report:\n", classification_report(y_test, y_pred_gb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))
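
The final accuracy alone does not show whether 100 trees is too many or too few. One way to check is `staged_predict`, which yields predictions after each boosting stage. The sketch below uses synthetic data and a bare classifier as assumptions; with the pipeline above, the test features would first need to pass through the fitted preprocessor before calling `staged_predict` on the classifier step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

gb = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# staged_predict yields held-out predictions after 1, 2, ..., n_estimators trees
stage_accuracy = [
    accuracy_score(y_te, y_stage) for y_stage in gb.staged_predict(X_te)
]
best_stage = int(np.argmax(stage_accuracy)) + 1

print(f"Accuracy with all {len(stage_accuracy)} trees: {stage_accuracy[-1]:.3f}")
print(f"Best accuracy {max(stage_accuracy):.3f} reached at {best_stage} trees")
```

If the best stage comes well before the last one, fewer trees or a lower learning rate may generalize better. In practice the stage should be chosen on a separate validation split rather than on the test set.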

In [None]:
# Feature Importance extraction after preprocessing

# Extract feature names after one-hot encoding (from the fitted clf_gb pipeline)
cat_features = (
    clf_gb.named_steps['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(categorical_cols)
)

# Order matches the ColumnTransformer output: numeric columns first, then one-hot columns
all_features = np.concatenate([numerical_cols, cat_features])

importances = clf_gb.named_steps['classifier'].feature_importances_
indices = np.argsort(importances)[::-1]  # most important first

plt.figure(figsize=(12,6))
plt.title("Feature Importances from Gradient Boosting Classifier")
plt.bar(range(len(importances)), importances[indices], align='center')
plt.xticks(range(len(importances)), all_features[indices], rotation=90)
plt.tight_layout()
plt.show()

#### Task: Batch Training of Advanced Models
- Explore and implement a method to train advanced machine learning models by dividing the preprocessed dataset into smaller batches.

- Train the model incrementally on these batches rather than all data at once.

- Combine or update the model progressively to obtain a final, fully trained model after processing all batches.
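
One possible starting point for this task is sketched below. `GradientBoostingClassifier` has no `partial_fit`, so this sketch swaps in `warm_start=True`: each call to `fit` keeps the existing trees and adds new ones trained on the current batch. The synthetic data, batch count, and `TREES_PER_BATCH` are assumptions, and every batch must contain all classes, which is why stratified folds are used as batches.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the preprocessed dataset (assumption)
X_all, y_all = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

N_BATCHES = 5          # hypothetical settings
TREES_PER_BATCH = 20

# warm_start=True makes later fit() calls extend the ensemble instead of resetting it
model = GradientBoostingClassifier(
    n_estimators=TREES_PER_BATCH, warm_start=True, random_state=42
)

# The held-out part of each stratified fold serves as one batch,
# so every batch contains every class
splitter = StratifiedKFold(n_splits=N_BATCHES, shuffle=True, random_state=42)
for i, (_, batch_idx) in enumerate(splitter.split(X_fit, y_fit)):
    if i > 0:
        # Ask for additional trees before fitting on the next batch
        model.set_params(n_estimators=model.n_estimators + TREES_PER_BATCH)
    model.fit(X_fit[batch_idx], y_fit[batch_idx])
    print(f"Batch {i + 1}: {model.estimators_.shape[0]} trees so far")

batch_accuracy = model.score(X_hold, y_hold)
print("Hold-out accuracy after all batches:", round(batch_accuracy, 3))
```

Earlier trees are never revisited, so the result is not identical to training on all data at once. Estimators that implement `partial_fit`, such as `SGDClassifier`, are an alternative when true incremental updates are required.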