# Model Comparison and Feature Importance

In this notebook, I will:
1. Compare multiple machine learning algorithms beyond Logistic Regression, such as Random Forest and XGBoost, to see if I can improve predictive performance.
2. Implement basic hyperparameter tuning using GridSearchCV or RandomizedSearchCV to optimize model settings.
3. Evaluate models using metrics that are crucial for credit risk analysis, such as AUC-ROC and Precision-Recall curves, since identifying "bad" credit risks is more important than overall accuracy.
4. Explore feature importance and model interpretability. Understanding which features drive the model’s decisions is vital in a credit risk context, where explainability is often a regulatory and business requirement.

In [2]:
import joblib
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, RocCurveDisplay, PrecisionRecallDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Load the data
X_train = joblib.load("../data/X_train.pkl")
X_test = joblib.load("../data/X_test.pkl")
y_train = joblib.load("../data/y_train.pkl")
y_test = joblib.load("../data/y_test.pkl")

### Random Forest Baseline

I'll train a basic Random Forest with default parameters and compare its performance to the Logistic Regression baseline. Random Forests are often good at handling complex interactions between features and may provide better recall for the "bad" class.

In [3]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]) # Calculate AUC
print("Random Forest Accuracy:", acc_rf)
print("Random Forest AUC:", auc_rf)

print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.745
Random Forest AUC: 0.7909604519774011

Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.91      0.83       141
           1       0.62      0.34      0.44        59

    accuracy                           0.74       200
   macro avg       0.70      0.63      0.64       200
weighted avg       0.73      0.74      0.72       200



The Random Forest's accuracy and AUC provide a quick snapshot. If the AUC is higher than Logistic Regression’s AUC, it suggests the Random Forest may be better at ranking which customers are more likely to be bad credit risks. I will pay special attention to the recall for the bad class and the AUC-ROC, as these metrics align better with credit risk priorities.
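The ranking interpretation of AUC can be made concrete with a small sketch. The labels and scores below are hypothetical, and `pairwise_auc` is an illustrative helper rather than a scikit-learn function. It computes AUC as the share of (bad, good) customer pairs in which the bad customer receives the higher score, which is the quantity `roc_auc_score` reports.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the share of (bad, good) pairs where the bad customer scores higher.

    Ties count as half a win.
    """
    bad = [s for y, s in zip(y_true, scores) if y == 1]
    good = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if b > g else 0.5 if b == g else 0.0
        for b, g in product(bad, good)
    )
    return wins / (len(bad) * len(good))


# Hypothetical labels (1 = bad credit) and predicted probabilities of "bad"
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc = pairwise_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Eight of the nine (bad, good) pairs are ordered correctly here, so the AUC is 8/9. The value depends only on the ordering of the scores, not on any classification threshold.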

#### Model Interpretation

**Random Forest (Baseline):**  
- **Accuracy:** 0.745  
- **AUC:** 0.79  

**Interpretation:**

The baseline Random Forest model gives the following results:

- **Accuracy (74.5%)**: Slightly lower than the Logistic Regression models, suggesting it is less effective at overall classification.
- **Class 0 (Good Credit)**:
  - **Recall (0.91)**: Very high, indicating the model identifies most good credit customers correctly.
  - **Precision (0.77)**: Slightly lower than Logistic Regression.
- **Class 1 (Bad Credit)**:
  - **Recall (0.34)**: Much lower than Logistic Regression, meaning the model misses most bad credit customers.
  - **Precision (0.62)**: Lower than Logistic Regression, indicating it is less reliable when predicting bad credit.

**Key Takeaways:**
While the Random Forest model excels at identifying good credit customers, its recall for the bad credit class is significantly worse than Logistic Regression, which makes it less suitable for a credit risk context without further tuning.
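To see what these rates mean in customer counts, the sketch below recomputes the report from a confusion matrix. The four counts are not notebook output: they are reconstructed from the rounded report (supports of 141 and 59, accuracy 0.745, recall 0.34 for class 1), so treat them as an inference.

```python
# Counts reconstructed from the classification report (an inference, not notebook output).
# 0 = good credit, 1 = bad credit
tn, fp = 129, 12  # the 141 good customers: correctly cleared vs. wrongly flagged
fn, tp = 39, 20   # the 59 bad customers: missed vs. correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_bad = tp / (tp + fn)        # share of bad customers the model catches
precision_bad = tp / (tp + fp)     # share of "bad" predictions that are right
recall_good = tn / (tn + fp)
precision_good = tn / (tn + fn)

print(f"accuracy       = {accuracy:.3f}")
print(f"recall (bad)   = {recall_bad:.2f}  -> {fn} of {tp + fn} bad customers missed")
print(f"precision (bad)= {precision_bad:.3f}")
print(f"recall (good)  = {recall_good:.2f}")
print(f"precision (good)= {precision_good:.2f}")
```

Under this reconstruction, the model misses 39 of the 59 bad-credit customers while wrongly flagging only 12 good ones, which is why accuracy looks acceptable even though recall for the bad class is low.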

### Hyperparameter Tuning for Random Forest

I will use GridSearchCV to search for optimal hyperparameters for the Random Forest. This process demonstrates how I can improve the model further. I'll tune parameters like `n_estimators`, `max_depth`, and `min_samples_leaf`.

In [4]:
param_grid = {
    'n_estimators': [100,300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 5]
}

grid_search = GridSearchCV(
    estimator = RandomForestClassifier(random_state=42),
    param_grid = param_grid,
    scoring = 'roc_auc', # Using AUC as the scoring metric
    cv = 3,
    n_jobs = -1
)

grid_search.fit(X_train, y_train)

print("Best Params:", grid_search.best_params_)
print("Best Score (AUC):", grid_search.best_score_)

Best Params: {'max_depth': 10, 'min_samples_leaf': 5, 'n_estimators': 100}
Best Score (AUC): 0.7623129915833197


**Interpreting the Grid Search Results:**

- `n_estimators`: Number of trees in the forest. More trees can improve performance but increase training time.
- `max_depth`: Maximum depth of the trees. Deeper trees can model complex relationships but may overfit.
- `min_samples_leaf`: Minimum samples per leaf. Increasing this can reduce overfitting.

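The best parameters are those that give the highest mean AUC across the cross-validation folds. Because `GridSearchCV` refits the best configuration on the full training set by default (`refit=True`), `best_estimator_` can be evaluated on the test set directly.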
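As a rough sense of the search cost, the sketch below enumerates the same grid with the standard library. It mirrors the `param_grid` and `cv = 3` used above; the variable names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid and fold count as the GridSearchCV call above
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
cv_folds = 3

# Every combination of the listed values is one candidate model
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(candidates) * cv_folds  # one fit per candidate per fold

print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} fits")
```

With the default `refit=True`, one additional fit on the full training set follows the search. Adding a fourth value to any list grows the count multiplicatively, which is the usual reason to switch to `RandomizedSearchCV` for larger grids.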

In [8]:
best_rf = grid_search.best_estimator_ 
y_pred_best_rf = best_rf.predict(X_test)

acc_best_rf = accuracy_score(y_test, y_pred_best_rf)
auc_best_rf = roc_auc_score(y_test, best_rf.predict_proba(X_test)[:,1])

print("Optimized Random Forest Accuracy:", acc_best_rf)
print("Optimized Random Forest AUC:", auc_best_rf)
print("\nClassification Report:\n", classification_report(y_test, y_pred_best_rf))

Optimized Random Forest Accuracy: 0.75
Optimized Random Forest AUC: 0.8015386464719317

Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.96      0.84       141
           1       0.71      0.25      0.38        59

    accuracy                           0.75       200
   macro avg       0.73      0.61      0.61       200
weighted avg       0.74      0.75      0.71       200



#### Model Interpretation

**Random Forest (Optimized):**  
- **Accuracy:** 0.75  
- **AUC:** 0.80154  

**Interpretation:**

After tuning the Random Forest with GridSearchCV, the model's performance is as follows:

- **Accuracy (75%)**: Slight improvement over the baseline Random Forest.
- **Class 0 (Good Credit)**:
  - **Recall (0.96)**: Increased significantly, making it very effective at identifying good credit customers.
  - **Precision (0.75)**: Similar to the baseline model.
- **Class 1 (Bad Credit)**:
  - **Recall (0.25)**: Decreased from the baseline Random Forest, which is concerning.
  - **Precision (0.71)**: Improved slightly.

**Key Takeaways:**
While the optimized Random Forest improves overall metrics like accuracy and AUC, its recall for the bad credit class drops even further, making it less effective for practical use in credit risk management.
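One common response to low recall is to lower the decision threshold instead of relying on the cutoff of roughly 0.5 that `predict` applies. The sketch below uses hypothetical labels and probabilities to show the trade-off; `precision_recall_at` is an illustrative helper. With the notebook's objects, the probabilities would come from `best_rf.predict_proba(X_test)[:, 1]`.

```python
def precision_recall_at(y_true, proba_bad, threshold):
    """Precision and recall for the bad class when flagging proba_bad >= threshold."""
    flagged = [p >= threshold for p in proba_bad]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    fn = sum(1 for y, f in zip(y_true, flagged) if y == 1 and not f)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data for illustration only (1 = bad credit)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba_bad = [0.72, 0.45, 0.38, 0.22, 0.41, 0.30, 0.15, 0.05]

results = {t: precision_recall_at(y_true, proba_bad, t) for t in (0.5, 0.4, 0.3)}
for t, (precision, recall) in results.items():
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data, recall rises from 0.25 to 0.75 as the threshold drops from 0.5 to 0.3, while precision falls from 1.00 to 0.60. Any threshold chosen this way should be selected on validation data, not on the test set.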

### Trying XGBoost

XGBoost often performs well in tabular data tasks. I'll train a basic XGBoost classifier and see if it outperforms Random Forest. If it does well, I could also consider tuning its parameters.

In [10]:
xgb = XGBClassifier(random_state=42, eval_metric='logloss')  # use_label_encoder dropped: current XGBoost releases ignore it
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

acc_xgb = accuracy_score(y_test, y_pred_xgb)
auc_xgb = roc_auc_score(y_test, xgb.predict_proba(X_test)[:,1])

print("XGBoost Accuracy:", acc_xgb)
print("XGBoost AUC:", auc_xgb)
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))

XGBoost Accuracy: 0.785
XGBoost AUC: 0.8139199423007573

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.89      0.85       141
           1       0.67      0.53      0.59        59

    accuracy                           0.79       200
   macro avg       0.75      0.71      0.72       200
weighted avg       0.78      0.79      0.78       200



#### Model Interpretation

**XGBoost:**  
- **Accuracy:** 0.785  
- **AUC:** 0.81392  

**Interpretation:**

XGBoost delivers the following results:

- **Accuracy (78.5%)**: Higher than both Logistic Regression and Random Forest models.
- **Class 0 (Good Credit)**:
  - **Recall (0.89)**: Slightly lower than the optimized Random Forest but still strong.
  - **Precision (0.82)**: Better than Random Forest.
- **Class 1 (Bad Credit)**:
  - **Recall (0.53)**: A significant improvement over both Random Forest models, on par with the improved Logistic Regression.
  - **Precision (0.67)**: Matches the improved Logistic Regression.

**Key Takeaways:**
XGBoost offers a good balance between overall performance and recall for the bad credit class. Its ability to identify more bad credit customers compared to Random Forest makes it a strong candidate for credit risk modeling.
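The three models can be compared on a recall-weighted basis with a short sketch. The counts below are reconstructed from the rounded classification reports (an inference, not notebook output), and the F2 score is one possible way to weight recall above precision, not a metric the notebook itself reports.

```python
# (tp, fp, fn) for the bad-credit class, reconstructed from the three reports above
models = {
    "Random Forest (baseline)": (20, 12, 39),
    "Random Forest (optimized)": (15, 6, 44),
    "XGBoost": (31, 15, 28),
}


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


summary = {
    name: {"missed_bad": fn, "f2": f_beta(tp, fp, fn)}
    for name, (tp, fp, fn) in models.items()
}
for name, row in summary.items():
    print(f"{name:<27} missed bad customers={row['missed_bad']:>2}  F2={row['f2']:.3f}")

best = max(summary, key=lambda name: summary[name]["f2"])
print("Highest F2:", best)
```

Under these reconstructed counts, XGBoost misses 28 of the 59 bad-credit customers, compared with 39 and 44 for the two Random Forest models.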

## Feature Importance

Understanding which features matter most is crucial. I'll examine the feature importances from the Random Forest and XGBoost models. This helps stakeholders understand what drives credit risk decisions.

In [11]:
importances = best_rf.feature_importances_
feature_names = X_train.columns 

# Create a DataFrame with feature importances
feat_imp_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
feat_imp_df = feat_imp_df.sort_values('importance', ascending=False)

feat_imp_df.head(10)

|    | feature | importance |
|----|---------|------------|
| 9  | Status_of_existing_checking_account_A14 | 0.137842 |
| 1  | Credit_amount | 0.129003 |
| 4  | Age_in_years | 0.100649 |
| 0  | Duration_in_month | 0.097807 |
| 2  | Installment_rate_in_percentage_of_disposable_i... | 0.037479 |
| 13 | Credit_history_A34 | 0.036118 |
| 3  | Present_residence_since | 0.033003 |
| 41 | Housing_A152 | 0.030111 |
| 7  | Status_of_existing_checking_account_A12 | 0.023187 |
| 40 | Other_installment_plans_A143 | 0.022276 |


### Feature Importance Analysis

Understanding which features contribute most to a model's decisions is crucial for interpretability. For my **Random Forest model**, I examined the feature importances, which indicate how much each feature contributes to reducing impurity in the splits.

#### Top 5 Most Important Features:
1. **No Checking Account (Status_of_existing_checking_account_A14)**: The absence of a checking account appears to be the most influential feature in predicting credit risk.
2. **Credit Amount**: The amount of the loan requested is another significant predictor, which aligns with real-world expectations.
3. **Age in Years**: Customer age influences creditworthiness, possibly because it correlates with financial stability.
4. **Duration in Months**: The length of the credit request period is a key factor.
5. **Installment Rate**: The proportion of income used for installment payments also impacts credit risk.

By identifying these features, I can better understand the drivers of credit decisions. This insight is valuable for stakeholders and aligns with business requirements for transparency and trust in the model's outputs.
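Impurity-based importances are computed on the training data, so it can help to cross-check them against permutation importance on held-out data. The sketch below is a minimal illustration on synthetic data from `make_classification`, not the notebook's dataset; in the notebook, the same call could be made with `best_rf`, `X_test`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the notebook's X_train / X_test are not used here)
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances: derived from training splits, normalized to sum to 1
impurity_imp = model.feature_importances_

# Permutation importance: mean drop in held-out AUC when one column is shuffled
perm = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)

for i in range(X.shape[1]):
    print(
        f"feature_{i}: impurity={impurity_imp[i]:.3f}  "
        f"permutation={perm.importances_mean[i]:.3f}"
    )
```

If a feature ranks highly on impurity but its permutation importance is near zero, its impurity score may reflect overfitting or a bias toward features with many distinct values.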

In [12]:
# Mapping original feature names to clean, readable names
clean_feature_names = {
    "Status_of_existing_checking_account_A14": "No Checking Account",
    "Credit_amount": "Credit Amount",
    "Age_in_years": "Age",
    "Duration_in_month": "Credit Duration",
    "Installment_rate_in_percentage_of_disposable_income": "Installment Rate",
    "Credit_history_A34": "Critical Credit History",
    "Present_residence_since": "Years at Residence",
    "Housing_A152": "Own Housing",
    "Status_of_existing_checking_account_A12": "Low Balance Checking",
    "Other_installment_plans_A143": "No Other Installment Plans"
}

# Replace feature names with clean names
feat_imp_df['feature'] = feat_imp_df['feature'].replace(clean_feature_names)

# Sort and display the top 10 features
feat_imp_df = feat_imp_df.sort_values('importance', ascending=False)
feat_imp_df.head(10)

|    | feature | importance |
|----|---------|------------|
| 9  | No Checking Account | 0.137842 |
| 1  | Credit Amount | 0.129003 |
| 4  | Age | 0.100649 |
| 0  | Credit Duration | 0.097807 |
| 2  | Installment Rate | 0.037479 |
| 13 | Critical Credit History | 0.036118 |
| 3  | Years at Residence | 0.033003 |
| 41 | Own Housing | 0.030111 |
| 7  | Low Balance Checking | 0.023187 |
| 40 | No Other Installment Plans | 0.022276 |


### Feature Importance Analysis (Cleaned)

The table below highlights the top 10 most important features from the **Random Forest model**. These features contribute most to the model's decision-making process by reducing impurity at splits:

| Feature                     | Importance |
|-----------------------------|------------|
| No Checking Account         | 0.137842   |
| Credit Amount               | 0.129003   |
| Age                         | 0.100649   |
| Credit Duration             | 0.097807   |
| Installment Rate            | 0.037479   |
| Critical Credit History     | 0.036118   |
| Years at Residence          | 0.033003   |
| Own Housing                 | 0.030111   |
| Low Balance Checking        | 0.023187   |
| No Other Installment Plans  | 0.022276   |

These results provide insight into the factors driving credit risk predictions. For example:
- Checking account status and loan amount are the features the model relies on most. Impurity-based importances measure how heavily a feature is used in splits, not the direction of its effect, so on their own they do not show whether a given value raises or lowers predicted risk.
- Age and credit duration also rank highly, indicating that customer demographics and loan terms carry substantial weight in the model.

This step ensures that the model aligns with business expectations and highlights which features are most relevant for decision-making.

### Feature Importance Analysis (XGBoost)

After analyzing the feature importances for the **Random Forest model**, I will now examine the importances for the **XGBoost model**. Understanding how different models prioritize features provides deeper insights into the dataset and helps identify consistent patterns.

#### Why Analyze XGBoost Feature Importances?
XGBoost uses gradient boosting, a different approach than Random Forest. As a result, the way it prioritizes features can differ. Comparing feature importances across models helps:
- Validate whether key features are consistent across models.
- Identify any unique features emphasized by XGBoost.

In this step, I will:
1. Extract feature importances from the XGBoost model.
2. Map feature names to clean, readable names for clarity.
3. Display the top 10 most important features.
4. Visualize the results for better interpretation.

In [17]:
# XGBoost feature importances
xgb_importances = xgb.feature_importances_
feature_names = X_train.columns

# Create a DataFrame with feature importances
xgb_feat_imp_df = pd.DataFrame({'feature': feature_names, 'importance': xgb_importances})

# Replace feature names with clean names
xgb_feat_imp_df['feature'] = xgb_feat_imp_df['feature'].replace(clean_feature_names)

# Sort and display the top 10 features
xgb_feat_imp_df = xgb_feat_imp_df.sort_values('importance', ascending=False)
xgb_feat_imp_df.head(10)

|    | feature | importance |
|----|---------|------------|
| 9  | No Checking Account | 0.102722 |
| 35 | Other_debtors_or_guarantors_A103 | 0.067574 |
| 8  | Status_of_existing_checking_account_A13 | 0.038816 |
| 25 | Savings_account_bonds_A64 | 0.03275 |
| 38 | Property_A124 | 0.031116 |
| 26 | Savings_account_bonds_A65 | 0.030579 |
| 19 | Purpose_A45 | 0.029664 |
| 13 | Critical Credit History | 0.02866 |
| 29 | Present_employment_since_A74 | 0.027591 |
| 40 | No Other Installment Plans | 0.026665 |
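The plan for this step lists a visualization, which the cell above does not produce. A minimal sketch with matplotlib follows. The values are copied from the output above so the block runs on its own; in the notebook, `xgb_feat_imp_df.head(10)` would supply them instead. The `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Top-10 XGBoost importances, copied from the output above
top10 = [
    ("No Checking Account", 0.102722),
    ("Other_debtors_or_guarantors_A103", 0.067574),
    ("Status_of_existing_checking_account_A13", 0.038816),
    ("Savings_account_bonds_A64", 0.03275),
    ("Property_A124", 0.031116),
    ("Savings_account_bonds_A65", 0.030579),
    ("Purpose_A45", 0.029664),
    ("Critical Credit History", 0.02866),
    ("Present_employment_since_A74", 0.027591),
    ("No Other Installment Plans", 0.026665),
]

# Reverse so the most important feature is drawn at the top of the chart
features, importances = zip(*reversed(top10))

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(features, importances)
ax.set_xlabel("Importance")
ax.set_title("XGBoost: Top 10 Feature Importances")
fig.tight_layout()
```

Inside a notebook, the figure renders at the end of the cell; elsewhere, `fig.savefig(...)` writes it to disk.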


### Feature Importance Results (XGBoost)

#### Key Observations:
1. **No Checking Account**: As in the Random Forest model, this is the top-ranked feature. The importance score shows that the model relies on it heavily, but not whether it pushes predictions toward higher or lower risk; the same caveat applies to every feature below.
2. **Guarantors**: `Other_debtors_or_guarantors_A103` ranks second, so whether a guarantor is attached to the loan plays a key role in distinguishing risk levels.
3. **Savings and Property**: Features related to savings accounts (`Savings_account_bonds_A64`, `Savings_account_bonds_A65`) and property (`Property_A124`) all appear in the top ten, alongside a second checking account category (`Status_of_existing_checking_account_A13`).
4. **Purpose of Credit**: The model also identifies a credit purpose category (`Purpose_A45`, repairs) as an important factor, reflecting the potential risk associated with specific types of loans.
5. **Employment Stability**: Employment duration (`Present_employment_since_A74`, 4-7 years) is another important feature, suggesting that employment history carries information about credit risk.

#### Comparison to Random Forest:
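- Both models rank **No Checking Account** as the most important feature, reinforcing its influence on credit risk predictions.
- XGBoost places more emphasis on guarantors and savings-related features, while Random Forest highlights credit amount, age, and credit duration.
- These differences partly reflect how each score is computed. Random Forest importances are impurity-based and tend to favor continuous features with many distinct values, which matches the three numeric columns near the top of its list. XGBoost's `feature_importances_` is gain-based by default for tree boosters, so the two rankings are not measured on the same scale.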
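The overlap between the two rankings can be quantified with a short sketch. The two lists are copied from the top-10 outputs above; the Jaccard index is an illustrative choice of overlap measure.

```python
# Top-10 feature names as displayed in the two outputs above
rf_top10 = [
    "No Checking Account", "Credit Amount", "Age", "Credit Duration",
    "Installment Rate", "Critical Credit History", "Years at Residence",
    "Own Housing", "Low Balance Checking", "No Other Installment Plans",
]
xgb_top10 = [
    "No Checking Account", "Other_debtors_or_guarantors_A103",
    "Status_of_existing_checking_account_A13", "Savings_account_bonds_A64",
    "Property_A124", "Savings_account_bonds_A65", "Purpose_A45",
    "Critical Credit History", "Present_employment_since_A74",
    "No Other Installment Plans",
]

# Features both models place in their top 10, in Random Forest rank order
shared = [name for name in rf_top10 if name in xgb_top10]

# Jaccard index: shared features divided by all distinct features in either list
jaccard = len(shared) / len(set(rf_top10) | set(xgb_top10))

print("Shared features:", shared)
print(f"Jaccard overlap: {jaccard:.2f}")
```

Only three features appear in both lists, so agreement on the top feature does not extend far down the rankings.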

### Why Feature Importance Matters in Credit Risk

Identifying the most important features in credit risk analysis is crucial for both stakeholders and regulators. These insights can:
1. Help financial institutions make more informed lending decisions.
2. Highlight key customer attributes that influence creditworthiness.
3. Ensure the model aligns with business priorities, such as minimizing risky loans.

### SHAP: Enhancing Feature Interpretability

To complement the feature importance analysis, I used **SHAP (SHapley Additive exPlanations)** to understand how features contribute to individual predictions. SHAP values allow me to:
1. Explain **global importance**: Which features matter most across all predictions.
2. Explain **local importance**: How specific features influence individual predictions.

This deeper interpretability is particularly valuable in credit risk contexts, where regulatory requirements often demand transparency and explainability in decision-making.
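The idea behind a SHAP value can be illustrated with an exact Shapley computation on a toy model. Everything in this sketch is hypothetical: `toy_score` is a made-up two-feature scoring function, and the brute-force enumeration is not the algorithm the `shap` library uses for tree models. It only shows the property that per-feature contributions sum to the difference between the prediction and the baseline.

```python
from itertools import permutations


def toy_score(features):
    """Hypothetical score with an interaction term; not the notebook's model."""
    a, b = features["feature_a"], features["feature_b"]
    return 0.2 * a + 0.1 * b + 0.3 * a * b


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature from baseline to actual
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


instance = {"feature_a": 1, "feature_b": 1}
baseline = {"feature_a": 0, "feature_b": 0}

phi = shapley_values(toy_score, instance, baseline)
print(phi)
print("sum of contributions:", sum(phi.values()))
print("prediction - baseline:", toy_score(instance) - toy_score(baseline))
```

The interaction term is split evenly between the two features, and the contributions add up to the gap between the instance's score and the baseline score. This additivity is what makes local explanations of individual credit decisions possible.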

In [18]:
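import shap

# The original run failed with:
#   TypeError: Cannot cast array data from dtype('O') to dtype('float64')
# Assumption: the cause is non-numeric dtypes (e.g. object or boolean dummy columns),
# and every column can be cast to float.
X_train_num = X_train.astype(float)
X_test_num = X_test.astype(float)

# Create an explainer for the XGBoost model, using the training set as background data
explainer = shap.Explainer(xgb, X_train_num)

# Calculate SHAP values for the test set
shap_values = explainer(X_test_num)

# Visualize the overall feature importance with SHAP
shap.summary_plot(shap_values, X_test_num, plot_type="bar")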