## Data Import

* Feature Importance: Analyze the contribution of each feature to the final model’s predictions using coefficients and importance scores.
* Precision-Recall Curve: Plot the Precision-Recall curve to evaluate the model’s performance in distinguishing between bankrupt and non-bankrupt companies.
* SHAP Analysis: Use SHAP (SHapley Additive exPlanations) to visualize and explain individual predictions, highlighting key factors influencing bankruptcy risk.
* Business Insights and Implications: Translate model findings into actionable insights for stakeholders, including recommendations for risk mitigation and investment strategies.

Required Data
* Trained Model: Final trained model (likely a Logistic Regression or Cox Proportional Hazards model)
* Test Dataset: A held-out portion of your data not used during training
* Feature Matrix: The processed features used for prediction (X_test)
* Target Variable: The actual bankruptcy outcomes (y_test)
* Prediction Scores: Probability outputs from your model on the test data

### Performance Metrics Visualization: ROC Curves

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Assuming you have:
# model = your trained model
# X_test = your test features
# y_test = your test labels

# Get prediction probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]  # For binary classification

# Calculate ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()

The ROC curve shows the model's performance across all classification thresholds, which matters here because the default threshold (0.5) may not be optimal when the classes are imbalanced.
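
One common way to pick a threshold other than 0.5 from the ROC curve is Youden's J statistic, which selects the point that maximizes TPR minus FPR. The sketch below is only an illustration: the helper name `youden_threshold` and the toy labels and scores are made up for this example, and Youden's J is just one possible criterion (it weights false positives and false negatives equally, which may not match the real cost of a missed bankruptcy).

```python
import numpy as np
from sklearn.metrics import roc_curve


def youden_threshold(y_true, y_score):
    """Return (threshold, tpr, fpr) at the ROC point that maximizes TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = np.argmax(tpr - fpr)  # index of the largest Youden's J
    return thresholds[best], tpr[best], fpr[best]


# Toy data, invented for illustration only (3 bankrupt firms out of 10)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.70, 0.90])

thr, tpr_at, fpr_at = youden_threshold(y_toy, scores_toy)
print(f"threshold={thr:.2f}, TPR={tpr_at:.2f}, FPR={fpr_at:.2f}")
```

With your own model you would pass `y_test` and the predicted probabilities instead of the toy arrays, ideally choosing the threshold on a validation split rather than on the test set.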

In [None]:
# Create ROC curve using the test data
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Get probability predictions
y_pred_prob = log_reg.predict_proba(X_test)[:,1]

# Calculate ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Calculate AUC (named roc_auc so it does not shadow sklearn.metrics.auc imported earlier)
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Create plot
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {roc_auc:.3f})')

# Add labels and title
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Logistic Regression Bankruptcy Prediction')

# Add legend and grid
plt.legend()
plt.grid(alpha=0.3)

# Set axis limits
plt.xlim([0, 1])
plt.ylim([0, 1.05])

plt.show()


When to Use Each:

- ROC curves (TPR vs FPR) are good for balanced datasets
- Precision-Recall curves are better for imbalanced datasets like yours
- For your bankruptcy prediction with few positive cases, precision and recall are more informative than FPR
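
To see why the two views can disagree, the sketch below scores a simulated dataset with 2% positives. Everything in it is an assumption made for illustration (the class sizes, the Gaussian score distributions, the variable names); it is not your bankruptcy data. The point is only that ROC AUC can look comfortable while average precision stays much lower on the same scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Simulated scores, invented for illustration: 4900 negatives, 100 positives
rng = np.random.default_rng(0)
n_neg, n_pos = 4900, 100
y_sim = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
scores_sim = np.concatenate([
    rng.normal(0.0, 1.0, n_neg),  # scores for the majority class
    rng.normal(1.5, 1.0, n_pos),  # positives score higher on average
])

roc_auc_sim = roc_auc_score(y_sim, scores_sim)
ap_sim = average_precision_score(y_sim, scores_sim)
prevalence = y_sim.mean()  # the "no skill" level for average precision

print(f"ROC AUC:           {roc_auc_sim:.3f}")
print(f"Average precision: {ap_sim:.3f}")
print(f"Positive rate:     {prevalence:.3f}")
```

Average precision should be read against the positive rate (its no-skill level), whereas ROC AUC is read against 0.5.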

For your bankruptcy prediction model, a precision-recall curve is particularly valuable given your imbalanced dataset. Here's how to create one:

In [None]:
# Import necessary libraries
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
import numpy as np

# Get probability predictions for the positive class (bankruptcy)
y_pred_prob = log_reg.predict_proba(X_test)[:,1]

# Calculate precision, recall, and thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Calculate average precision (AP) - similar to area under PR curve
ap = average_precision_score(y_test, y_pred_prob)

# Create the precision-recall curve
plt.figure(figsize=(8, 6))

# Plot the precision-recall curve
plt.plot(recall, precision, label=f'Logistic Regression (AP = {ap:.3f})')

# Plot the baseline (proportion of positive class)
baseline = np.sum(y_test) / len(y_test)
plt.axhline(y=baseline, color='r', linestyle='--', 
            label=f'Baseline (No Skill) = {baseline:.3f}')

# Add labels and title
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Bankruptcy Prediction')

# Set axis limits
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])

# Add legend and grid
plt.legend(loc='best')
plt.grid(alpha=0.3)

plt.show()

# Optional: Plot precision and recall as functions of threshold
plt.figure(figsize=(8, 6))

# precision and recall have one more element than thresholds: the final
# (precision=1, recall=0) point has no threshold, so drop it before plotting
plt.plot(thresholds, precision[:-1], 'b--', label='Precision')
plt.plot(thresholds, recall[:-1], 'g-', label='Recall')

plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision and Recall vs. Threshold')
plt.legend()
plt.grid(alpha=0.3)

plt.show()


How to Interpret the Precision-Recall Curve:
- The curve: Shows the trade-off between precision and recall at different classification thresholds
- Baseline: The horizontal line represents a "no skill" model that randomly predicts the positive class at the rate it appears in the dataset
- Area under the curve: The higher the average precision (AP), the better the model
- Curve shape:
  - A curve that stays near the upper-right corner (high precision and high recall) indicates a good model
  - A curve that drops quickly indicates the model struggles to maintain precision as recall increases
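
The trade-off described above can be turned into a concrete operating point. The sketch below picks the threshold with the highest F1 score along the precision-recall curve. The helper name `best_f1_threshold` and the toy data are hypothetical, and F1 is only one possible criterion; if missing a bankruptcy is costlier than a false alarm, a recall-weighted rule may suit you better.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_f1_threshold(y_true, y_score):
    """Return (threshold, precision, recall, f1) at the highest-F1 point."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The last (precision=1, recall=0) pair has no threshold, so drop it
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    # Guard against 0/0 when both precision and recall are zero
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    best = np.argmax(f1)
    return thresholds[best], p[best], r[best], f1[best]


# Toy data, invented for illustration only
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.70, 0.90])

thr, prec_at, rec_at, f1_at = best_f1_threshold(y_toy, scores_toy)
print(f"threshold={thr:.2f}, precision={prec_at:.2f}, recall={rec_at:.2f}, F1={f1_at:.3f}")
```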

### Model Interpretation



1. SHAP Summary Plot
2. Feature Importance Bar Charts
3. Coefficient Plots for Logistic Regression
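
For the third item, a coefficient plot can be sketched as follows. This is a minimal example under stated assumptions: the helper names `coefficient_table` and `plot_coefficients` are hypothetical, the data is randomly generated for illustration, and the features are standardized so that coefficient magnitudes are comparable. With your own pipeline you would pass the fitted logistic regression step and the feature names that match its input columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def coefficient_table(model, feature_names):
    """Return (name, coefficient) pairs sorted by absolute coefficient, largest first."""
    coefs = np.ravel(model.coef_)  # binary logistic regression: shape (1, n_features)
    order = np.argsort(-np.abs(coefs))
    return [(feature_names[i], float(coefs[i])) for i in order]


def plot_coefficients(table, title="Logistic Regression Coefficients"):
    """Draw a horizontal bar chart of the table; returns the matplotlib Figure."""
    import matplotlib.pyplot as plt  # imported here so the table helper has no plotting dependency

    names = [name for name, _ in table][::-1]  # reversed so the largest bar is on top
    values = [value for _, value in table][::-1]
    colors = ["tab:red" if v > 0 else "tab:blue" for v in values]
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(names, values, color=colors)
    ax.axvline(0, color="k", lw=1)
    ax.set_xlabel("Coefficient (change in log-odds per standard deviation)")
    ax.set_title(title)
    fig.tight_layout()
    return fig


# Synthetic data, invented for illustration: the first feature raises risk,
# the second lowers it, the third is pure noise
rng = np.random.default_rng(0)
demo_names = ["debt_to_equity_ratio", "current_ratio", "noise_feature"]
X_demo = rng.normal(size=(500, 3))
logit = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (logit + rng.logistic(size=500) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X_demo)
demo_model = LogisticRegression().fit(X_scaled, y_demo)

demo_table = coefficient_table(demo_model, demo_names)
for name, coef in demo_table:
    print(f"{name:>22s}: {coef:+.3f}")
```

Calling `plot_coefficients(demo_table)` followed by `plt.show()` displays the chart. Positive coefficients push the prediction toward bankruptcy, negative ones away from it.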

### Creating SHAP Summary Plots from a Pickled Model

To create SHAP (SHapley Additive exPlanations) summary plots from your saved pickle model, you'll need to:

- Load the pickled model
- Create a SHAP explainer for your model
- Calculate SHAP values
- Generate the summary plot

In [None]:
# Import necessary libraries
import pickle
import shap
import matplotlib.pyplot as plt
import numpy as np

# Load the pickled model
with open('best_model.pkl', 'rb') as file:
    best_model = pickle.load(file)

# Get a sample of your data for SHAP analysis (using the same X_test you used for evaluation)
# If X_test is too large, you might want to sample it
X_sample = X_test.copy()
if len(X_test) > 1000:  # Optional: limit sample size for faster computation
    X_sample = X_test.sample(1000, random_state=42)

# Create the SHAP explainer based on your model type
# For scikit-learn tree ensembles that expose estimators_ (e.g. RandomForest, GradientBoosting)
if hasattr(best_model, 'predict_proba') and hasattr(best_model, 'estimators_'):
    explainer = shap.TreeExplainer(best_model)
# For other models (like logistic regression)
else:
    # Create a background dataset for the explainer (using training data)
    X_background = shap.sample(X_train, 100, random_state=42)  # Sample for efficiency
    explainer = shap.KernelExplainer(best_model.predict_proba, X_background)

# Calculate SHAP values
shap_values = explainer.shap_values(X_sample)

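# Older SHAP versions return a list of arrays (one per class); newer versions
# return a single array with a trailing class axis: (n_samples, n_features, n_classes)
# For binary classification, we typically want the values for class 1 (bankruptcy)
if isinstance(shap_values, list):
    shap_values_to_plot = shap_values[1]  # Values for class 1 (bankruptcy)
elif np.ndim(shap_values) == 3:
    shap_values_to_plot = shap_values[:, :, 1]  # Values for class 1 (bankruptcy)
else:
    shap_values_to_plot = shap_values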

# Create the SHAP summary plot
# show=False keeps the figure open so the title below is applied to it
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values_to_plot, X_sample, feature_names=X_sample.columns, show=False)
plt.title('SHAP Summary Plot for Bankruptcy Prediction Model')
plt.tight_layout()
plt.show()

# Optional: Create a SHAP bar plot to see average impact magnitude
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values_to_plot, X_sample, plot_type='bar',
                  feature_names=X_sample.columns, show=False)
plt.title('SHAP Feature Importance for Bankruptcy Prediction Model')
plt.tight_layout()
plt.show()
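
The bar variant of the summary plot ranks features by their mean absolute SHAP value. If you want the same ranking as numbers (for a report table, say), it can be computed directly from the SHAP array. The sketch below uses a hypothetical helper, `rank_by_mean_abs_shap`, and a tiny hand-written matrix in place of real SHAP output; with your model you would pass the class-1 SHAP values and the matching column names.

```python
import numpy as np


def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Return (name, mean |SHAP|) pairs sorted from most to least influential.

    shap_matrix is expected to have shape (n_samples, n_features).
    """
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    if shap_matrix.ndim != 2 or shap_matrix.shape[1] != len(feature_names):
        raise ValueError("shap_matrix must be 2-D with one column per feature name")
    importance = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(-importance)
    return [(feature_names[i], float(importance[i])) for i in order]


# Hand-written values for illustration only, not real SHAP output
toy_shap = np.array([
    [0.30, -0.10, 0.02],
    [-0.20, 0.05, -0.01],
    [0.40, -0.15, 0.00],
])
toy_names = ["debt_to_equity_ratio", "current_ratio", "operating_cash_flow"]

ranking = rank_by_mean_abs_shap(toy_shap, toy_names)
for name, value in ranking:
    print(f"{name:>22s}: {value:.3f}")
```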


In [None]:
import shap
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Assuming you have:
# 1. A trained pipeline called 'best_model' from your bankruptcy prediction project
# 2. X_train and X_test datasets with your financial indicators
# 3. Feature names stored in your dataset

# --- Extract Steps from Your Best Pipeline ---
preprocessor = best_model.named_steps['preprocessor']  # Adjust if your pipeline has different step names
classifier = best_model.named_steps['classifier']  # This might be 'regressor' or another name in your pipeline

# --- Transform the Data ---
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

# If your transformed data is not a DataFrame but you have feature names
# Create a list of feature names that match your transformed data
# This might require extracting column names from your preprocessor if it includes OneHotEncoder or similar
feature_names = []  # Fill this with your actual feature names after preprocessing

# --- Create a Background Sample for SHAP ---
background = shap.sample(X_train_transformed, 100, random_state=42)

# --- Create the SHAP Explainer ---
# For classification models, you might want to use the predict_proba method instead
# to get probability estimates of bankruptcy
explainer = shap.KernelExplainer(
    lambda x: classifier.predict_proba(x)[:,1],  # For binary classification, focus on positive class
    background
)

# --- Local Explanation (for a single company) ---
# Select one observation from the test set
observation = X_test_transformed[0:1, :]
shap_values = explainer.shap_values(observation)

# Initialize the JavaScript visualization
shap.initjs()

# Create and display a force plot for the selected company
# show=False keeps the matplotlib figure open so the title below is applied to it
force_plot = shap.force_plot(
    explainer.expected_value, 
    shap_values, 
    features=observation, 
    feature_names=feature_names, 
    matplotlib=True,
    show=False
)
plt.title("SHAP Force Plot: Factors Influencing Bankruptcy Prediction")
plt.tight_layout()
plt.show()

# --- Global Explanation (for understanding overall model behavior) ---
# Compute SHAP values for a sample of the test data
X_test_sample = shap.sample(X_test_transformed, 200, random_state=42)
shap_values_test = explainer.shap_values(X_test_sample)

# Create a summary plot to show overall feature importance
# show=False keeps each figure open so its title is applied before display
plt.figure(figsize=(12, 8))
shap.summary_plot(
    shap_values_test, 
    features=X_test_sample, 
    feature_names=feature_names,
    plot_type="bar",  # Feature importance bar chart (mean absolute SHAP value)
    show=False
)
plt.title("Feature Importance Based on SHAP Values")
plt.tight_layout()
plt.show()

# Create a detailed summary plot showing distribution of SHAP values
plt.figure(figsize=(12, 10))
shap.summary_plot(
    shap_values_test, 
    features=X_test_sample, 
    feature_names=feature_names,
    show=False
)
plt.title("SHAP Summary Plot: Impact of Features on Bankruptcy Prediction")
plt.tight_layout()
plt.show()

# --- Additional Visualization: SHAP Dependence Plots ---
# For the top 3 most important features (based on your README: debt-to-equity ratio, current ratio, operating cash flow)
# Assuming these features are in your dataset and you know their indices after transformation
important_features = [
    # Replace these indices with the actual indices of your important features
    feature_names.index("debt_to_equity_ratio"),
    feature_names.index("current_ratio"),
    feature_names.index("operating_cash_flow")
]

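for idx in important_features:
    # Pass an explicit Axes so SHAP draws on this figure instead of opening its own,
    # and use show=False so the title is applied before display
    fig, ax = plt.subplots(figsize=(10, 6))
    shap.dependence_plot(
        idx, 
        shap_values_test, 
        X_test_sample,
        feature_names=feature_names,
        ax=ax,
        show=False
    )
    ax.set_title(f"SHAP Dependence Plot: {feature_names[idx]}")
    plt.tight_layout()
    plt.show()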


### Key Findings Explanation