# Baseline Model Training and Per-Class Evaluation

This notebook contains the **main modelling pipeline** for the 3-class credit risk dataset used in the project.  
It focuses on **building and evaluating baseline classifiers** on the original data, before any sampling is applied in the companion notebook.

## Objectives

- Build baseline models to predict the 3-level `Risk_Level` (Low, Moderate, High).
- Tune and evaluate three algorithms:
  - **XGBoost**
  - **Artificial Neural Network (ANN, MLPClassifier)**
  - **AdaBoost** with a DecisionTree base estimator
- Report **per-class metrics** that are suitable for imbalanced credit risk data.

## Workflow in this notebook

1. **Load and inspect the dataset**
   - Read `FinalData_3_RiskLevels.xlsx` from Google Drive.
   - Display sample rows and the distribution of the target `Risk_Level`.
   - Count total observations and highlight the multi-class setting.

2. **Preprocessing**
   - Separate features `X` and target `y = Risk_Level`.
   - Encode `Risk_Level` with `LabelEncoder`.
   - One-hot encode the input features (`pd.get_dummies`, `drop_first=True`).
   - Split data into **train** and **test** sets using `train_test_split`.

3. **XGBoost pipeline**
   - Train an initial XGBoost model as a baseline.
   - Define a hyperparameter search space, e.g.:
     - `learning_rate`
     - `max_depth`
     - `gamma`
     - `colsample_bytree`
     - `reg_alpha`, `reg_lambda`
   - Use **RandomizedSearchCV** with cross-validation to find the **best XGBoost configuration**.
   - Evaluate the tuned model on the test set.

4. **AdaBoost pipeline**
   - Use `DecisionTreeClassifier` as the base estimator.
   - Define a hyperparameter search space for AdaBoost, e.g.:
     - `n_estimators`
     - `learning_rate`
     - Tree depth and minimum split/leaf sizes of the base estimator (`max_depth`, `min_samples_split`, `min_samples_leaf`).
   - Run **RandomizedSearchCV** to obtain the best AdaBoost model.
   - Evaluate AdaBoost on the same test set.

5. **ANN (MLPClassifier) pipeline**
   - Define an MLPClassifier as a shallow ANN baseline.
   - Tune hyperparameters such as:
     - `hidden_layer_sizes`
     - `activation`
     - `learning_rate_init`
     - `batch_size`
     - `max_iter`
   - Use **RandomizedSearchCV** to select the best ANN configuration.
   - Evaluate the best ANN model on the test data.

6. **Per-class metric calculation**
   - Implement a helper function `calculate_metrics` to compute, for each risk class:
     - **Accuracy**
     - **Sensitivity (Recall)**
     - **Specificity**
     - **F-measure**
     - **G-Mean**
     - **AUC**
   - Apply this function to:
     - XGBoost predictions
     - ANN predictions
     - AdaBoost predictions
   - Organise all results into a single **summary table** (`results` DataFrame) across:
     - Algorithm (XGBoost / ANN / AdaBoost)
     - Risk Level (Low / Moderate / High)
     - All metrics above

7. **Usage in the written report**
   - The resulting `results` table is used as the **main quantitative summary** in the academic report.
   - These baseline numbers are later compared against models trained on **resampled data** (SMOTE vs ADASYN) in the sampling notebook.
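
The per-class metrics listed in step 6 all follow from one-vs-rest counts (TP, FN, FP, TN) for each class. The sketch below is a dependency-free illustration of that idea, not the notebook's own `calculate_metrics` helper; the function name `one_vs_rest_metrics` and the toy confusion matrix are assumptions made for the example.

```python
import math

def one_vs_rest_metrics(cm, k):
    """Per-class metrics for class k from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                 # truly k, predicted as another class
    fp = sum(row[k] for row in cm) - tp  # predicted k, truly another class
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical 3x3 confusion matrix (rows = true class, columns = predicted class).
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
]
toy_metrics = {k: one_vs_rest_metrics(toy_cm, k) for k in range(3)}
for k, m in toy_metrics.items():
    print(k, {name: round(value, 3) for name, value in m.items()})
```

Because accuracy in this view counts true negatives, it can look high for a minority class even when sensitivity is poor, which is why the notebook reports sensitivity, specificity and G-Mean alongside it.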


In [None]:
import pandas as pd
import numpy as np
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = "/content/drive/MyDrive/RiskPaper/FinalData_3_RiskLevels.xlsx"
df = pd.read_excel(file_path)

In [None]:
print(df.head())

   TMDB_NUV  TMDB_Mean_UVR  TMDB_Popularity  \
0     34315          8.434          443.779   
1     29865          7.612          538.680   
2     28865          8.246          362.473   
3     27416          7.908           67.192   
4     24858          8.255          299.860   

                                Genre  \
0   Adventure, Drama, Science Fiction   
1           Action, Adventure, Comedy   
2  Adventure, Action, Science Fiction   
3  Action, Science Fiction, Adventure   
4  Adventure, Science Fiction, Action   

                                            Keywords   R1   R2   R3   R4   R5  \
0  rescue, future, spacecraft, race against time,...  340  121  137  153  183   
1  superhero, anti hero, mercenary, based on comi...  112   41   44   36   50   
2  sacrifice, magic, superhero, based on comic, s...  235   88   96  102  164   
3  spacecraft, based on comic, space, orphan, adv...   82   21   34   34   53   
4  superhero, time travel, space travel, time mac...  597  189  2

In [None]:
class_counts = df["Risk_Level"].value_counts()
print(class_counts)

Risk_Level
2    861
0    432
1    265
Name: count, dtype: int64
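
To make the degree of imbalance explicit, the counts printed above can be turned into class shares and a majority-to-minority ratio. This is a small sketch using only the three numbers from the output; the variable names are illustrative and the labels are kept in their encoded form.

```python
# Class counts copied from the value_counts() output above (encoded label -> rows).
counts_from_output = {2: 861, 0: 432, 1: 265}

total_rows = sum(counts_from_output.values())
class_share = {label: n / total_rows for label, n in counts_from_output.items()}
imbalance_ratio = max(counts_from_output.values()) / min(counts_from_output.values())

for label in sorted(class_share):
    print(f"class {label}: {class_share[label]:.1%} of {total_rows} rows")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

The largest class holds a little over half of the rows and is roughly three times the size of the smallest, so plain accuracy alone would understate errors on the minority class.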


In [None]:
total_observations = df.shape[0]
print(f"Total number of observations: {total_observations}")

Total number of observations: 1558


In [None]:
# Baseline XGBoost with default hyperparameters

In [None]:
# Separate features and target
X = df.drop("Risk_Level", axis=1)  # Features
y = df["Risk_Level"]  # Target (dependent variable)

In [None]:
# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
X_encoded = pd.get_dummies(X, drop_first=True)

In [None]:
# Split the data into training and test sets using the encoded features
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)
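
The sizes of the two partitions follow directly from `test_size=0.2` and the 1558 rows counted earlier. The sketch below assumes scikit-learn rounds the requested test fraction up to a whole number of rows; treat that rounding rule as an assumption and confirm with `X_test.shape` in the notebook.

```python
import math

n_samples = 1558   # total observations reported above
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # assumed rounding rule: round up
n_train = n_samples - n_test
print(f"train rows: {n_train}, test rows: {n_test}")
```

Note that the split above is not stratified. `train_test_split` accepts a `stratify` argument (for example `stratify=y_encoded`) if the class proportions should be preserved in both partitions, although using it would change the reported numbers.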

In [None]:
# Initialize and train the XGBoost classifier
xgb_model = XGBClassifier(objective="multi:softmax", num_class=3, eval_metric="mlogloss")
xgb_model.fit(X_train, y_train)

In [None]:
# Make predictions and evaluate
y_pred = xgb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 92.63%


In [None]:
# RandomizedSearchCV for hyperparameter tuning (XGBClassifier is already imported above)
from sklearn.model_selection import RandomizedSearchCV

In [None]:
param_dist_xgboost = {
    'learning_rate': np.linspace(0.01, 0.3, 10),    # Learning rates between 0.01 and 0.3
    'max_depth': np.arange(3, 11, 1),               # Maximum tree depth from 3 to 10
    'gamma': np.linspace(0, 5, 5),                  # Gamma from 0 to 5
    'colsample_bytree': np.linspace(0.5, 1.0, 5),   # Column subsampling from 0.5 to 1.0
    'reg_alpha': np.logspace(-3, 1, 5),             # L1 regularization from 0.001 to 10
    'reg_lambda': np.logspace(-3, 1, 5)             # L2 regularization from 0.001 to 10
}
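
It can help to see how much of this search space a randomized search actually visits. The sketch below counts the discrete combinations defined by `param_dist_xgboost` and compares them with the `n_iter=50` draws used in this notebook; the dictionary `grid_sizes` is written by hand from the definitions above.

```python
import math

# Number of candidate values per hyperparameter, read off param_dist_xgboost.
grid_sizes = {
    "learning_rate": 10,    # np.linspace(0.01, 0.3, 10)
    "max_depth": 8,         # np.arange(3, 11, 1) -> 3, 4, ..., 10
    "gamma": 5,             # np.linspace(0, 5, 5)
    "colsample_bytree": 5,  # np.linspace(0.5, 1.0, 5)
    "reg_alpha": 5,         # np.logspace(-3, 1, 5)
    "reg_lambda": 5,        # np.logspace(-3, 1, 5)
}

n_combinations = math.prod(grid_sizes.values())
n_iter = 50  # number of sampled configurations used in this notebook
coverage = n_iter / n_combinations
print(f"{n_combinations} combinations, {coverage:.2%} sampled")
```

With such low coverage, the selected configuration is best read as a reasonable draw rather than the optimum of the grid; a different `random_state` may well select different values.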

In [None]:
# Initialize the XGBoost classifier
xgb_model = XGBClassifier(objective="multi:softmax", num_class=3, eval_metric="mlogloss")

In [None]:
# Set up RandomizedSearchCV
random_search_xgboost = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist_xgboost,
    n_iter=50,  # Number of random combinations to try
    scoring='accuracy',  # Use accuracy for scoring
    cv=5,  # 5-fold cross-validation
    verbose=3,  # Print progress
    random_state=42,
    n_jobs=-1  # Use all available cores
)

In [None]:
# Fit the model
random_search_xgboost.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [None]:
# Best hyperparameters found
print(f"Best Hyperparameters: {random_search_xgboost.best_params_}")
print(f"Best Score: {random_search_xgboost.best_score_}")

Best Hyperparameters: {'reg_lambda': 0.01, 'reg_alpha': 0.001, 'max_depth': 9, 'learning_rate': 0.23555555555555557, 'gamma': 0.0, 'colsample_bytree': 0.875}
Best Score: 0.9237493975903615


In [None]:
# Retrieve the model that RandomizedSearchCV refit on the full training set with the best parameters
best_xgboost_model = random_search_xgboost.best_estimator_

In [None]:
# Evaluate the tuned model on the test set
y_pred_xgboost = best_xgboost_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_xgboost)  # score the tuned predictions, not the baseline y_pred
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Accuracy: 92.63%


In [None]:
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Define the base estimator (DecisionTreeClassifier)
base_estimator = DecisionTreeClassifier()

In [None]:
# Define the hyperparameters for tuning
param_dist_adaboost = {
    'n_estimators': np.arange(50, 501, 50),               # Number of weak learners
    'learning_rate': np.linspace(0.01, 1.0, 10),          # Learning rate
    'estimator__max_depth': np.arange(1, 11, 1),          # Max depth of each decision tree
    'estimator__min_samples_split': np.arange(2, 21, 2),  # Min samples to split
    'estimator__min_samples_leaf': np.arange(1, 11, 1),   # Min samples at a leaf
    'algorithm': ['SAMME', 'SAMME.R']                     # AdaBoost algorithm type ('SAMME.R' is deprecated in recent scikit-learn releases)
}

In [None]:
# Initialize the AdaBoost classifier with a decision tree as the base estimator
ada_model = AdaBoostClassifier(estimator=base_estimator)

In [None]:
# Set up RandomizedSearchCV
random_search_adaboost  = RandomizedSearchCV(
    estimator=ada_model,
    param_distributions=param_dist_adaboost,
    n_iter=50,  # Number of random combinations to try
    scoring='accuracy',  # Use accuracy for scoring
    cv=5,  # 5-fold cross-validation
    verbose=3,  # Print progress
    random_state=42,
    n_jobs=-1  # Use all available cores
)

In [None]:
# Fit the model
random_search_adaboost.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [None]:
# Best hyperparameters found
print(f"Best Hyperparameters: {random_search_adaboost.best_params_}")
print(f"Best Score: {random_search_adaboost.best_score_}")

Best Hyperparameters: {'n_estimators': 250, 'learning_rate': 0.56, 'estimator__min_samples_split': 14, 'estimator__min_samples_leaf': 5, 'estimator__max_depth': 9, 'algorithm': 'SAMME'}
Best Score: 0.9373911646586345


In [None]:
# Retrieve the model that RandomizedSearchCV refit on the full training set with the best parameters
best_adaboost_model = random_search_adaboost.best_estimator_

In [None]:
# Evaluate the tuned model on the test set
y_pred_adaboost = best_adaboost_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_adaboost)  # score the AdaBoost predictions, not y_pred
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Accuracy: 97.12%


In [None]:
# ANN
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Define the ANN model
ann_model = MLPClassifier()

In [None]:
# Define the hyperparameter search space
param_dist_ann = {
    'learning_rate_init': np.logspace(-4, -1, 10),  # Initial learning rate
    'hidden_layer_sizes': [(64,), (128,), (64, 64), (128, 128), (64, 128, 64)],  # Number of layers and neurons
    'activation': ['relu', 'tanh', 'logistic'],    # Activation function
    'batch_size': np.arange(32, 257, 32),          # Mini-batch size
    'max_iter': np.arange(100, 1001, 100),         # Maximum number of training epochs
    'alpha': np.logspace(-4, -1, 10)               # L2 regularization strength (alpha)
}

In [None]:
# Set up RandomizedSearchCV
random_search_ann = RandomizedSearchCV(
    estimator=ann_model,
    param_distributions=param_dist_ann,
    n_iter=50,  # Number of random combinations to try
    scoring='accuracy',  # Use accuracy for scoring
    cv=5,  # 5-fold cross-validation
    verbose=3,  # Print progress
    random_state=42,
    n_jobs=-1  # Use all available cores
)

In [None]:
# Fit the model
random_search_ann.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


In [None]:
# Best hyperparameters found
print(f"Best Hyperparameters: {random_search_ann.best_params_}")
print(f"Best Score: {random_search_ann.best_score_}")

Best Hyperparameters: {'max_iter': 400, 'learning_rate_init': 0.021544346900318822, 'hidden_layer_sizes': (64, 64), 'batch_size': 256, 'alpha': 0.046415888336127774, 'activation': 'relu'}
Best Score: 0.9815453815261044


In [None]:
# Retrieve the model that RandomizedSearchCV refit on the full training set with the best parameters
best_ann_model = random_search_ann.best_estimator_

In [None]:
# Evaluate the tuned model on the test set
y_pred_ann = best_ann_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_ann)  # score the ANN predictions, not y_pred
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Accuracy: 97.12%


In [None]:
# results

In [None]:
# Function to calculate one-vs-rest metrics for a single class
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, confusion_matrix

def calculate_metrics(y_test, y_pred, class_label):
    # Binarize the labels for the specific class (Low=0, Moderate=1, High=2)
    y_test_bin = (y_test == class_label).astype(int)
    y_pred_bin = (y_pred == class_label).astype(int)

    # Accuracy of the binary "this class vs. the rest" problem
    accuracy = accuracy_score(y_test_bin, y_pred_bin)

    # Sensitivity (Recall)
    sensitivity = recall_score(y_test_bin, y_pred_bin)

    # Specificity; labels=[0, 1] keeps the matrix 2x2 even if one value never occurs
    tn, fp, fn, tp = confusion_matrix(y_test_bin, y_pred_bin, labels=[0, 1]).ravel()
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

    # F-Measure (F1-score)
    f1 = f1_score(y_test_bin, y_pred_bin)

    # G-Mean
    g_mean = np.sqrt(sensitivity * specificity)

    # AUC from hard 0/1 predictions, which reduces to (sensitivity + specificity) / 2
    auc = roc_auc_score(y_test_bin, y_pred_bin)

    return accuracy, sensitivity, specificity, f1, g_mean, auc
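
One property of the AUC computed here is worth spelling out: because `roc_auc_score` receives hard 0/1 predictions rather than scores, the ROC curve has a single operating point and the area reduces to the mean of sensitivity and specificity. The sketch below checks this on made-up labels with a plain rank-based AUC; `auc_from_scores` and the toy vectors are assumptions for illustration, not part of the notebook.

```python
def auc_from_scores(y_true, scores):
    """Rank-based AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical binarized labels and hard predictions for one class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hard = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hard))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hard))
sensitivity = tp / sum(y_true)
specificity = tn / (len(y_true) - sum(y_true))

auc_hard = auc_from_scores(y_true, y_hard)
balanced = (sensitivity + specificity) / 2
print(auc_hard, balanced)
```

If a threshold-free AUC is wanted instead, one option is to pass class probabilities, for example `model.predict_proba(X_test)[:, class_label]`, as the second argument of `roc_auc_score`; this would change the reported AUC values.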

In [None]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, confusion_matrix

In [None]:
# XGBoost Metrics
metrics_low_xgb = calculate_metrics(y_test, y_pred_xgboost, class_label=0)
metrics_moderate_xgb = calculate_metrics(y_test, y_pred_xgboost, class_label=1)
metrics_high_xgb = calculate_metrics(y_test, y_pred_xgboost, class_label=2)

# ANN Metrics
metrics_low_ann = calculate_metrics(y_test, y_pred_ann, class_label=0)
metrics_moderate_ann = calculate_metrics(y_test, y_pred_ann, class_label=1)
metrics_high_ann = calculate_metrics(y_test, y_pred_ann, class_label=2)

# AdaBoost Metrics
metrics_low_ada = calculate_metrics(y_test, y_pred_adaboost, class_label=0)
metrics_moderate_ada = calculate_metrics(y_test, y_pred_adaboost, class_label=1)
metrics_high_ada = calculate_metrics(y_test, y_pred_adaboost, class_label=2)
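
The nine calls above follow one pattern, so they could also be generated in a loop that yields one row per algorithm and class. The sketch below shows that shape with a stand-in metric function so it runs on its own; `collect_rows` and `dummy_metrics` are hypothetical names, and in the notebook `calculate_metrics` would be passed in place of the stand-in.

```python
def collect_rows(predictions, class_names, metric_fn, y_true):
    """Build one row per (algorithm, class) pair.

    predictions: dict mapping algorithm name -> predicted labels
    class_names: dict mapping encoded label -> display name
    metric_fn:   callable with the signature and return order of calculate_metrics
    """
    columns = ["Accuracy", "Sensitivity", "Specificity", "F-Measure", "G-Mean", "AUC"]
    rows = []
    for algorithm, y_pred in predictions.items():
        for label, name in class_names.items():
            values = metric_fn(y_true, y_pred, class_label=label)
            rows.append({"Algorithm": algorithm, "Risk Level": name,
                         **dict(zip(columns, values))})
    return rows

def dummy_metrics(y_true, y_pred, class_label):
    # Stand-in: returns the one-vs-rest accuracy six times so the sketch is runnable.
    hits = sum((t == class_label) == (p == class_label) for t, p in zip(y_true, y_pred))
    return (hits / len(y_true),) * 6

rows = collect_rows(
    predictions={"ModelA": [0, 1, 2, 2], "ModelB": [0, 1, 1, 2]},
    class_names={0: "Low", 1: "Moderate", 2: "High"},
    metric_fn=dummy_metrics,
    y_true=[0, 1, 2, 2],
)
print(len(rows), rows[4])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which avoids indexing into metric tuples by position.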

In [None]:
results = pd.DataFrame({
    "Algorithm": ["XGBoost", "XGBoost", "XGBoost", "ANN", "ANN", "ANN", "AdaBoost", "AdaBoost", "AdaBoost"],
    "Risk Level": ["Low", "Moderate", "High", "Low", "Moderate", "High", "Low", "Moderate", "High"],
    "Accuracy": [metrics_low_xgb[0], metrics_moderate_xgb[0], metrics_high_xgb[0],
                 metrics_low_ann[0], metrics_moderate_ann[0], metrics_high_ann[0],
                 metrics_low_ada[0], metrics_moderate_ada[0], metrics_high_ada[0]],
    "Sensitivity": [metrics_low_xgb[1], metrics_moderate_xgb[1], metrics_high_xgb[1],
                    metrics_low_ann[1], metrics_moderate_ann[1], metrics_high_ann[1],
                    metrics_low_ada[1], metrics_moderate_ada[1], metrics_high_ada[1]],
    "Specificity": [metrics_low_xgb[2], metrics_moderate_xgb[2], metrics_high_xgb[2],
                    metrics_low_ann[2], metrics_moderate_ann[2], metrics_high_ann[2],
                    metrics_low_ada[2], metrics_moderate_ada[2], metrics_high_ada[2]],
    "F-Measure": [metrics_low_xgb[3], metrics_moderate_xgb[3], metrics_high_xgb[3],
                  metrics_low_ann[3], metrics_moderate_ann[3], metrics_high_ann[3],
                  metrics_low_ada[3], metrics_moderate_ada[3], metrics_high_ada[3]],
    "G-Mean": [metrics_low_xgb[4], metrics_moderate_xgb[4], metrics_high_xgb[4],
               metrics_low_ann[4], metrics_moderate_ann[4], metrics_high_ann[4],
               metrics_low_ada[4], metrics_moderate_ada[4], metrics_high_ada[4]],
    "AUC": [metrics_low_xgb[5], metrics_moderate_xgb[5], metrics_high_xgb[5],
            metrics_low_ann[5], metrics_moderate_ann[5], metrics_high_ann[5],
            metrics_low_ada[5], metrics_moderate_ada[5], metrics_high_ada[5]]
})

# Display the results
print(results)

  Algorithm Risk Level  Accuracy  Sensitivity  Specificity  F-Measure  \
0   XGBoost        Low  0.964744     0.928571     0.978070   0.934132   
1   XGBoost   Moderate  0.926282     0.771930     0.960784   0.792793   
2   XGBoost       High  0.961538     0.976608     0.943262   0.965318   
3       ANN        Low  0.974359     0.976190     0.973684   0.953488   
4       ANN   Moderate  0.971154     0.894737     0.988235   0.918919   
5       ANN       High  0.996795     0.994152     1.000000   0.997067   
6  AdaBoost        Low  0.971154     0.940476     0.982456   0.946108   
7  AdaBoost   Moderate  0.945513     0.842105     0.968627   0.849558   
8  AdaBoost       High  0.974359     0.982456     0.964539   0.976744   

     G-Mean       AUC  
0  0.952999  0.953321  
1  0.861196  0.866357  
2  0.959790  0.959935  
3  0.974937  0.974937  
4  0.940325  0.941486  
5  0.997072  0.997076  
6  0.961237  0.961466  
7  0.903154  0.905366  
8  0.973456  0.973498  
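
The one-vs-rest accuracies in this table can be converted back into an overall multi-class accuracy per algorithm, which is a useful cross-check against the `Test Accuracy` lines printed earlier in the notebook. Every misclassified sample counts as an error in exactly two of the three binary views (a false negative for its true class and a false positive for the predicted class), so the number of mistakes is half the summed binary errors. The sketch assumes a test set of 312 rows (20% of 1558, rounded up).

```python
n_test = 312  # assumed test-set size: 20% of 1558 rows, rounded up

# Per-class (one-vs-rest) accuracies copied from the results table above,
# ordered Low, Moderate, High.
per_class_accuracy = {
    "XGBoost": [0.964744, 0.926282, 0.961538],
    "ANN": [0.974359, 0.971154, 0.996795],
    "AdaBoost": [0.971154, 0.945513, 0.974359],
}

overall_accuracy = {}
for name, accs in per_class_accuracy.items():
    binary_errors = sum(round((1 - a) * n_test) for a in accs)
    misclassified = binary_errors // 2  # each mistake appears in two binary views
    overall_accuracy[name] = (n_test - misclassified) / n_test

for name, acc in overall_accuracy.items():
    print(f"{name}: {acc:.2%}")
```

Under these assumptions the table corresponds to roughly 92.63% overall accuracy for XGBoost, 97.12% for ANN and 94.55% for AdaBoost.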
