## **🔹 Checking GPU Availability in Google Colab**

Before running heavy computations, it's essential to verify whether a **GPU** is available for acceleration. The following code checks:

- If a **GPU is available** (`True` or `False`).
- The **name of the GPU** (if one is detected).



In [1]:
import torch
print("GPU Available:", torch.cuda.is_available())
print("GPU Name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU")


GPU Available: True
GPU Name: Tesla T4


In [2]:
!pip install imbalanced-learn xgboost lightgbm



In [19]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, precision_score, recall_score, f1_score, matthews_corrcoef, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler


## **🔹 Uploading and Preparing the Dataset in Google Colab**

Because a Google Colab notebook runs on a remote virtual machine, it cannot read files from your local disk directly, so we first **upload the dataset manually**. Once uploaded, the file is loaded into a Pandas DataFrame.

### **📌 Steps:**
1. **Upload the dataset** manually using Google Colab’s file upload feature.
2. **Load the dataset** into a Pandas DataFrame for processing.
3. **Separate Features and Target Variable:**
   - Independent features (`X_features`) are extracted.
   - The target variable (`default_encoded`) is assigned to `y_target`.
4. **Split the dataset into training and testing sets**:
   - An **80-20 split** trains on 80% of the rows and holds out 20% for evaluation.
   - **Stratification** (`stratify=y_target`) keeps the class ratio the same in both sets.
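
To see what stratification preserves, the class ratios of the two splits can be checked by hand. The sketch below uses the row counts that appear in this notebook's outputs (652 defaulters among 36,168 training rows in the LightGBM log, 163 among 9,043 test rows in the classification reports); the variable names are illustrative only.

```python
# Row counts taken from this notebook's outputs (LightGBM log, classification reports)
train_rows, train_pos = 36168, 652
test_rows, test_pos = 9043, 163

total_rows = train_rows + test_rows
test_fraction = test_rows / total_rows   # share of rows held out for testing

train_ratio = train_pos / train_rows     # share of defaulters in the training set
test_ratio = test_pos / test_rows        # share of defaulters in the test set

print(f"Test fraction:        {test_fraction:.4f}")
print(f"Train positive ratio: {train_ratio:.4%}")
print(f"Test positive ratio:  {test_ratio:.4%}")
```

Both splits contain roughly 1.8% defaulters, which is what `stratify=y_target` is meant to guarantee.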

### **📌 Why is This Important?**
- **Ensures proper data loading in Colab**, where local file access is limited.
- **Prepares the dataset for model training** by creating separate feature and target variables.
- **Preserves the class ratio in the training and test sets**, so evaluation is not skewed by a split that happens to contain too few defaulters.

### **💡 Next Steps**
- **Verify the dataset structure** by displaying the first few rows.
- **Check for missing values** before proceeding to model training.


In [4]:
# Upload File Manually in Google Colab
from google.colab import files

uploaded = files.upload()
filename = list(uploaded.keys())[0]  # Get uploaded file name
print(f"File '{filename}' uploaded successfully!")

# Load the CSV into a DataFrame
X = pd.read_csv(filename)

# Separate Features and Target Variable
X_features = X.drop(columns=["default_encoded"])  # Features
y_target = X["default_encoded"]  # Target variable

# Split into Train-Test (80-20 Split)
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_target, test_size=0.2, stratify=y_target, random_state=42
)


Saving data_processing.csv to data_processing.csv
File 'data_processing.csv' uploaded successfully!


## **🔹 Model 1: Basic Logistic Regression (Class Weights)**
### **📌 Sampling Strategy:**  
- **No data is removed** from the dataset.  
- Uses **`class_weight="balanced"`** to weight each class inversely to its frequency in the loss, which **handles class imbalance without modifying the dataset**.  
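
As a rough illustration of what the `"balanced"` setting does, scikit-learn documents the weight of each class as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python; `balanced_class_weights` is a hypothetical helper written for this example, and the label list only mimics the class counts of this notebook's training split.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Labels with the same class counts as the training split in this notebook
labels = [0] * 35516 + [1] * 652
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.4f}")
```

A misclassified defaulter therefore costs roughly 54 times as much as a misclassified non-defaulter, which helps explain the high recall and low precision this model reports.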

### **📌 Preprocessing:**  
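- **StandardScaler()** standardizes each feature to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, which avoids leaking test statistics into training.  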
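
The fit-on-train, apply-to-test pattern can be sketched for a single feature as follows. This is a simplified stand-in for `StandardScaler`, not its implementation; `fit_scaler`, `transform`, and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]


train_col = [10.0, 20.0, 30.0, 40.0]  # hypothetical training values
test_col = [25.0, 50.0]               # hypothetical test values

mean, std = fit_scaler(train_col)                # statistics come from training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)     # test data reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The test column is not guaranteed to have zero mean after scaling, which is expected: it is measured against the training distribution.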

### **📌 GPU Usage:**  
- **No GPU is used** for this model.  
- Logistic Regression runs on **CPU only**, as scikit-learn's `LogisticRegression` has no GPU backend.  

### **📌 Purpose:**  
- Handles **class imbalance** effectively without complex sampling strategies.  
- Keeps the approach simple for **quick execution and evaluation**.  
- Evaluated using key metrics like **AUC-PR, Recall, Precision, F1-Score, and MCC**.  
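
Four of these metrics follow directly from the confusion matrix, as the sketch below shows. The counts are an assumption: they are back-calculated from the precision and recall this notebook reports for the logistic regression model (163 defaulters and 8,880 non-defaulters in the test set), not read from an actual confusion matrix. AUC-PR is omitted because it depends on the predicted probabilities rather than on hard predictions.

```python
from math import sqrt

# Confusion-matrix counts back-calculated from the reported logistic regression
# results (recall 0.852761 on 163 positives, precision 0.051424); an assumption.
tp, fn = 139, 24      # defaulters caught / missed
fp, tn = 2564, 6316   # non-defaulters flagged / correctly cleared

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"Precision={precision:.6f}  Recall={recall:.6f}  F1={f1:.6f}  MCC={mcc:.6f}")
```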

### **💡 Next Steps:**  
- Compare this model with **Random Forest and XGBoost** to analyze performance differences.  
- Apply **hyperparameter tuning later** to improve performance.  


In [7]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, average_precision_score, classification_report

# Standardize Features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression (Basic Version)
log_reg = LogisticRegression(class_weight="balanced", solver="saga", max_iter=1000, random_state=42, n_jobs=-1)
log_reg.fit(X_train_scaled, y_train)

# Make Predictions
y_pred = log_reg.predict(X_test_scaled)
y_prob = log_reg.predict_proba(X_test_scaled)[:, 1]

# Compute Metrics
log_reg_metrics = {
    "AUC-PR": average_precision_score(y_test, y_prob),
    "Recall": recall_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "F1-Score": f1_score(y_test, y_pred),
    "MCC": matthews_corrcoef(y_test, y_pred)
}

# Collect metrics in a one-row DataFrame
metrics_df = pd.DataFrame([log_reg_metrics])

# Print Summary
print("\n🔹 Basic Logistic Regression Results:")
print(classification_report(y_test, y_pred))



🔹 Basic Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      0.71      0.83      8880
           1       0.05      0.85      0.10       163

    accuracy                           0.71      9043
   macro avg       0.52      0.78      0.46      9043
weighted avg       0.98      0.71      0.82      9043



## **🔹 Model 2: Basic Random Forest (SMOTE Oversampling)**

### **📌 Sampling Strategy:**  
- **SMOTE (Synthetic Minority Oversampling Technique)** is applied to balance the dataset.  
- SMOTE **creates synthetic examples** for the minority class (`default_encoded = 1`) to prevent bias toward the majority class (`default_encoded = 0`).  
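
The core idea behind a synthetic example is linear interpolation between a minority sample and a nearby minority sample. The sketch below shows only that interpolation step under simplifying assumptions: the neighbor is given rather than searched for (imbalanced-learn's `SMOTE` picks it among the k nearest minority samples), and `smote_like_sample` and the data points are hypothetical.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and its neighbor."""
    lam = rng.random()  # interpolation factor, uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
x = [1.0, 10.0]         # hypothetical minority-class sample
neighbor = [3.0, 14.0]  # hypothetical nearest minority-class neighbor

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Every synthetic point lies between two real defaulters in feature space, so SMOTE adds no information beyond what the original minority samples already contain.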

### **📌 Preprocessing & Model Training:**  
- **No hyperparameter tuning** is performed to keep the model simple and fast.  
- A **basic Random Forest model** is trained with **100 trees (`n_estimators=100`)**.  

### **📌 GPU Usage:**  
- **No GPU is used** for this model.  
- Random Forest runs on **CPU only** in `sklearn`, as it does not support GPU acceleration.  

### **📌 Purpose:**  
- Uses **SMOTE to balance the class distribution of the training set only**; the test set keeps its original class ratio, so evaluation reflects the original imbalance.  
- Focuses on **fast execution** rather than extensive optimization.  
- Evaluated using key metrics such as **AUC-PR, Recall, Precision, F1-Score, and MCC**.  

### **💡 Next Steps**  
- Compare this **basic Random Forest model** with **Logistic Regression (Class Weights)** and **XGBoost (Class Weights, GPU)**.  
- If needed, apply **hyperparameter tuning later** for better performance.  


In [8]:
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, average_precision_score, classification_report

# Apply SMOTE for Class Balancing
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Train Basic Random Forest Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_resampled, y_train_resampled)

# Make Predictions
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

# Compute Metrics
rf_metrics = {
    "AUC-PR": average_precision_score(y_test, y_prob_rf),
    "Recall": recall_score(y_test, y_pred_rf),
    "Precision": precision_score(y_test, y_pred_rf),
    "F1-Score": f1_score(y_test, y_pred_rf),
    "MCC": matthews_corrcoef(y_test, y_pred_rf)
}

# Collect metrics in a one-row DataFrame
metrics_df_rf = pd.DataFrame([rf_metrics])

# Print Summary
print("\n🔹 Basic Random Forest (SMOTE Oversampling) Results:")
print(classification_report(y_test, y_pred_rf))



🔹 Basic Random Forest (SMOTE Oversampling) Results:
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      8880
           1       0.16      0.20      0.18       163

    accuracy                           0.97      9043
   macro avg       0.57      0.59      0.58      9043
weighted avg       0.97      0.97      0.97      9043



## **🔹 Model 3: Basic XGBoost (Class Weights, GPU)**

### **📌 Sampling Strategy:**  
- **No SMOTE or undersampling is used**—instead, XGBoost handles class imbalance using **`scale_pos_weight`**.  
- The **`scale_pos_weight`** is calculated as:  
  \[
  \frac{\text{Number of Non-Defaulters}}{\text{Number of Defaulters}}
  \]  
  This ensures that the model **assigns higher weight** to the minority class (`default_encoded = 1`).  
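
As a worked example of the ratio above, the class counts printed in this notebook's LightGBM training log (652 positives and 35,516 negatives in the training split) give the following value. The variable names are illustrative.

```python
n_negative = 35516  # non-defaulters in the training split (from the LightGBM log)
n_positive = 652    # defaulters in the training split (from the LightGBM log)

scale_pos_weight = n_negative / n_positive
print(f"scale_pos_weight = {scale_pos_weight:.2f}")
```

Each defaulter is therefore weighted about 54 times as heavily as a non-defaulter.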

### **📌 Preprocessing & Model Training:**  
- **No hyperparameter tuning** is applied to keep the model simple and efficient.  
- A **basic XGBoost model** is trained with **GPU acceleration enabled (`device="cuda"`)**.  

### **📌 GPU Usage:**  
- ✅ **Yes, this model is trained on the GPU**.  
- The following settings enable GPU training (XGBoost 2.0 or later):  
  - **`device="cuda"`** → Runs on GPU instead of CPU.  
  - **`tree_method="hist"`** → Histogram-based tree construction, which runs on the GPU when `device="cuda"`.  
- **Typically faster than CPU-based training** on large datasets; the speedup was not measured in this notebook.  

### **📌 Purpose:**  
- Uses **class weights (`scale_pos_weight`) instead of data resampling**, keeping all data intact.  
- Focuses on **fast execution** without hyperparameter tuning.  
- Evaluated using key metrics such as **AUC-PR, Recall, Precision, F1-Score, and MCC**.  

### **💡 Next Steps**  
- Compare this **basic XGBoost model** with **Logistic Regression (Class Weights)** and **Random Forest (SMOTE Oversampling)**.  
- Apply **hyperparameter tuning later** for better performance if needed.  


In [9]:
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, average_precision_score, classification_report

# Compute scale_pos_weight for imbalance handling
scale_pos_weight = (len(y_train) - sum(y_train)) / sum(y_train)

# Train Basic XGBoost Model (GPU Accelerated)
xgb_model = XGBClassifier(
    scale_pos_weight=scale_pos_weight,
    use_label_encoder=False,  # ignored by recent XGBoost releases (see the warning in the output); can be removed
    eval_metric="aucpr",
    tree_method="hist",
    device="cuda",
    random_state=42
)

xgb_model.fit(X_train, y_train)

# Make Predictions
y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Compute Metrics
xgb_metrics = {
    "AUC-PR": average_precision_score(y_test, y_prob_xgb),
    "Recall": recall_score(y_test, y_pred_xgb),
    "Precision": precision_score(y_test, y_pred_xgb),
    "F1-Score": f1_score(y_test, y_pred_xgb),
    "MCC": matthews_corrcoef(y_test, y_pred_xgb)
}

# Collect metrics in a one-row DataFrame
metrics_df_xgb = pd.DataFrame([xgb_metrics])

# Print Summary
print("\n🔹 Basic XGBoost (Class Weights, GPU) Results:")
print(classification_report(y_test, y_pred_xgb))


Parameters: { "use_label_encoder" } are not used.




🔹 Basic XGBoost (Class Weights, GPU) Results:
              precision    recall  f1-score   support

           0       0.99      0.95      0.97      8880
           1       0.12      0.34      0.17       163

    accuracy                           0.94      9043
   macro avg       0.55      0.65      0.57      9043
weighted avg       0.97      0.94      0.96      9043



Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




## **🔹 Model 4: Basic LightGBM (Class Weights, CPU)**

### **📌 Sampling Strategy:**  
- **No SMOTE or undersampling is used**—instead, LightGBM handles class imbalance using **`scale_pos_weight`**.  
- The **`scale_pos_weight`** is calculated as:  
  \[
  \frac{\text{Number of Non-Defaulters}}{\text{Number of Defaulters}}
  \]  
  This ensures that the model **assigns higher weight** to the minority class (`default_encoded = 1`).  

### **📌 Preprocessing & Model Training:**  
- **No hyperparameter tuning** is applied to keep the model simple and efficient.  
- A **basic LightGBM model** is trained with **CPU processing (`device="cpu"`)** to ensure compatibility and stability.  

### **📌 GPU Usage:**  
- ❌ **No GPU is used** for this model.  
- LightGBM runs on **CPU only** to avoid OpenCL issues and ensure smooth execution.  

### **📌 Purpose:**  
- Uses **class weights (`scale_pos_weight`) instead of data resampling**, keeping all data intact.  
- Focuses on **fast execution** without hyperparameter tuning.  
- Evaluated using key metrics such as **AUC-PR, Recall, Precision, F1-Score, and MCC**.  

### **💡 Next Steps**  
- Compare this **basic LightGBM model** with **Logistic Regression, Random Forest, and XGBoost**.  
- Apply **hyperparameter tuning later** for better performance if needed.  


In [15]:
# Compute class weight ratio
scale_pos_weight = (len(y_train) - sum(y_train)) / sum(y_train)

# Train Basic LightGBM Model (CPU)
lgbm_model = LGBMClassifier(
    scale_pos_weight=scale_pos_weight,
    boosting_type="gbdt",
    objective="binary",
    metric="average_precision",
    device="cpu",  # Run on CPU to avoid OpenCL issues with the GPU build
    random_state=42
)

lgbm_model.fit(X_train, y_train)

# Make Predictions
y_pred_lgbm = lgbm_model.predict(X_test)
y_prob_lgbm = lgbm_model.predict_proba(X_test)[:, 1]

# Compute Metrics
lgbm_metrics = {
    "AUC-PR": average_precision_score(y_test, y_prob_lgbm),
    "Recall": recall_score(y_test, y_pred_lgbm),
    "Precision": precision_score(y_test, y_pred_lgbm),
    "F1-Score": f1_score(y_test, y_pred_lgbm),
    "MCC": matthews_corrcoef(y_test, y_pred_lgbm)
}

# Collect metrics in a one-row DataFrame
metrics_df_lgbm = pd.DataFrame([lgbm_metrics])

# Print Summary
print("\n🔹 Basic LightGBM (Class Weights, CPU) Results:")
print(classification_report(y_test, y_pred_lgbm))


[LightGBM] [Info] Number of positive: 652, number of negative: 35516
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011984 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 984
[LightGBM] [Info] Number of data points in the train set: 36168, number of used features: 14
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.018027 -> initscore=-3.997694
[LightGBM] [Info] Start training from score -3.997694

🔹 Basic LightGBM (Class Weights, CPU) Results:
              precision    recall  f1-score   support

           0       0.99      0.89      0.94      8880
           1       0.08      0.53      0.14       163

    accuracy                           0.88      9043
   macro avg       0.54      0.71      0.54      9043
weighted avg       0.97      0.88      0.92      9043



In [17]:
# Combine all model metrics into one DataFrame for easy comparison
all_metrics = pd.DataFrame([
    log_reg_metrics,  # Logistic Regression Metrics
    rf_metrics,       # Random Forest Metrics
    xgb_metrics,      # XGBoost Metrics
    lgbm_metrics      # LightGBM Metrics
])

# Print the table for reference
print("\n🔹 Model Performance Comparison:")
print(all_metrics)



🔹 Model Performance Comparison:
     AUC-PR    Recall  Precision  F1-Score       MCC
0  0.101233  0.852761   0.051424  0.096999  0.163919
1  0.109390  0.202454   0.163366  0.180822  0.165130
2  0.103068  0.337423   0.115546  0.172144  0.172783
3  0.127667  0.533742   0.081157  0.140891  0.174021


## **🔹 Model Performance Comparison Summary**
Below is the performance comparison of all four trained models using key evaluation metrics.

| **Model**   | **AUC-PR** | **Recall** | **Precision** | **F1-Score** | **MCC**  |
|------------|-----------|-----------|------------|-----------|-----------|
| **Logistic Regression (Class Weights)** | 0.101233 | **0.852761** | 0.051424 | 0.096999 | 0.163919 |
| **Random Forest (SMOTE Oversampling)** | 0.109390 | 0.202454 | **0.163366** | **0.180822** | 0.165130 |
| **XGBoost (Class Weights, GPU)** | 0.103068 | 0.337423 | 0.115546 | 0.172144 | 0.172783 |
| **LightGBM (Class Weights, CPU)** | **0.127667** | 0.533742 | 0.081157 | 0.140891 | **0.174021** |

---

### **📌 Key Observations**
- **AUC-PR (Higher is better)** → **LightGBM (0.127667) performed the best**, indicating better ranking of positive (default) cases, although every model stays below 0.13.
- **Recall (Higher is better)** → **Logistic Regression (0.852761) has the highest recall**: it captures about 85% of defaulters, but only about 5% of its positive predictions are correct.
- **Precision (Higher is better)** → **Random Forest (0.163366) has the best precision**, meaning the largest share of its positive predictions are true defaulters, but it finds only about 20% of them.
- **F1-Score (Higher is better)** → **Random Forest (0.180822) has the highest F1-Score**, the best precision-recall balance at the default 0.5 decision threshold.
- **MCC (Higher is better)** → **LightGBM (0.174021) achieves the best MCC score**, but all four models fall between 0.164 and 0.174, so the differences are small.
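
The per-metric winners can be double-checked programmatically, and the AUC-PR values become easier to read next to a baseline: a classifier that ranks at random has an expected AUC-PR roughly equal to the share of positives, here 163 of 9,043 test rows. The sketch below copies the figures from the comparison table; the dictionary layout and the names `best_per_metric` and `lift` are illustrative.

```python
# Metric values copied from the comparison table above
metrics = {
    "Logistic Regression": {"AUC-PR": 0.101233, "Recall": 0.852761,
                            "Precision": 0.051424, "F1-Score": 0.096999, "MCC": 0.163919},
    "Random Forest": {"AUC-PR": 0.109390, "Recall": 0.202454,
                      "Precision": 0.163366, "F1-Score": 0.180822, "MCC": 0.165130},
    "XGBoost": {"AUC-PR": 0.103068, "Recall": 0.337423,
                "Precision": 0.115546, "F1-Score": 0.172144, "MCC": 0.172783},
    "LightGBM": {"AUC-PR": 0.127667, "Recall": 0.533742,
                 "Precision": 0.081157, "F1-Score": 0.140891, "MCC": 0.174021},
}

# Best model for each metric (higher is better for all five)
metric_names = ["AUC-PR", "Recall", "Precision", "F1-Score", "MCC"]
best_per_metric = {
    name: max(metrics, key=lambda model: metrics[model][name])
    for name in metric_names
}

# Expected AUC-PR of a random ranker is roughly the positive-class prevalence
test_prevalence = 163 / 9043
lift = {model: vals["AUC-PR"] / test_prevalence for model, vals in metrics.items()}

for name, model in best_per_metric.items():
    print(f"{name:<10} best: {model}")
for model, value in lift.items():
    print(f"{model:<20} AUC-PR is {value:.1f}x the random baseline")
```

All four models rank defaulters several times better than chance, yet their absolute AUC-PR values remain low.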

---

### **✅ Final Verdict**
✔ **LightGBM (Class Weights, CPU)** appears to be the **best overall model** based on **AUC-PR and MCC**, though its margin over the other models is small.  
✔ **Logistic Regression** is **too biased towards recall**, leading to many false positives (precision of about 0.05).  
✔ **Random Forest** has the **best precision and F1-Score**, but misses roughly 80% of defaulters (recall of 0.20).  
✔ **XGBoost performs moderately well** but does not outperform LightGBM on AUC-PR, recall, or MCC.

---

### **🚀 Next Steps**
- Fine-tune **LightGBM and Random Forest** to further improve **Precision and AUC-PR**.
- Consider **feature selection** and **hyperparameter tuning** for better performance.
- **Test ensemble models** combining **LightGBM and Random Forest** for optimal results.
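
One simple way to combine two models is soft voting, which averages their predicted probabilities before applying a threshold. The sketch below is a minimal illustration under stated assumptions, not a tuned ensemble: `soft_vote` is a hypothetical helper, the probabilities are made up, and equal weights with a 0.5 threshold are arbitrary choices.

```python
def soft_vote(prob_lists, weights=None):
    """Return the weighted average of per-sample positive-class probabilities."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # equal weights by default
    return [
        sum(w * p for w, p in zip(weights, sample_probs))
        for sample_probs in zip(*prob_lists)
    ]


# Hypothetical positive-class probabilities for three test rows
probs_lgbm = [0.80, 0.10, 0.40]
probs_rf = [0.60, 0.30, 0.20]

ensemble_probs = soft_vote([probs_lgbm, probs_rf])
ensemble_preds = [int(p >= 0.5) for p in ensemble_probs]

print(ensemble_probs)
print(ensemble_preds)
```

In this notebook the inputs would be `y_prob_lgbm` and `y_prob_rf`, and the averaged probabilities could be scored with `average_precision_score` in the same way as the individual models.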

