# 📊 Preparing Tableau Dashboards for Credit Risk Insights

In this final notebook, I prepare targeted CSV files from our validation dataset and SHAP outputs to support **interactive Tableau dashboards**. These dashboards are designed to communicate key insights from the credit risk modeling pipeline to both technical and non-technical audiences.

### 🎯 Objectives

- Export curated datasets to power five core Tableau visualizations:
  - 🔍 SHAP Global Importance  
    Summary of the top features influencing model predictions.
  - 🎯 Score by Actual Outcome  
    Distribution of predicted probabilities grouped by ground truth.
  - 🧠 SHAP by Risk Group  
    Aggregated SHAP values segmented by low, medium, and high-risk bands.
  - 📈 Feature Impact on Score  
    Visualizes how selected features influence loan default probability.
  - ✏️ Confusion Matrix & Metrics  
    Includes precision, recall, and classification breakdown.

Each export is tailored to maximize clarity, interactivity, and storytelling impact inside Tableau Public.

> This notebook acts as the **bridge between machine learning outputs and stakeholder communication**, enabling the delivery of interpretable, transparent credit risk insights.

---
### 📦 Load Required Libraries

I begin by importing the essential libraries for this final notebook.

In [1]:
# Core data manipulation
import pandas as pd
import numpy as np

# Model loading and evaluation tools
from sklearn.model_selection import train_test_split
import joblib

# SHAP explanations and LightGBM model compatibility
import shap
import lightgbm as lgb

# Suppress all UserWarnings (this also hides SHAP's warning about binary classifiers)
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

### 🧠 Load Trained Model and Validation Data

Next, I load the final trained LightGBM model along with the validation dataset and its labels. These are used to generate the prediction outputs and SHAP values behind the Tableau-ready exports.

In [2]:
# Load the final trained LightGBM model
model = joblib.load("../models/lgbm_model.joblib")

# Load the validation feature set
X_valid = pd.read_parquet("../data/processed/X_valid.parquet")

# Load the corresponding validation labels
y_valid = pd.read_parquet("../data/processed/y_valid.parquet").squeeze()

### 📊 Prepare Global SHAP Summary for Tableau Dashboarding

In this section, I generate a summary of global feature importance using SHAP values and prepare it for visualization in Tableau.

- First, I predict loan default probabilities on the validation set and assign class labels based on the chosen threshold (0.3).
- I then compute **SHAP values** using `TreeExplainer`, which is optimized for our LightGBM model.
- To summarize global importance, I calculate the **mean absolute SHAP value per feature**, giving us a ranked list of the most influential variables.
- Finally, I save this summary to a CSV file (`global_shap_importance.csv`) in the `data/final/` directory. This file can be easily loaded into Tableau to build an interactive feature importance dashboard.

In [3]:
# Predict probabilities and assign predicted labels using chosen threshold
y_pred_proba = model.predict_proba(X_valid)[:, 1]
# Apply threshold of 0.3 to convert probabilities into class labels
y_pred_thresh = (y_pred_proba >= 0.3).astype(int)

# Add prediction columns to X_valid
X_valid_final = X_valid.copy()
X_valid_final["loan_default_proba"] = y_pred_proba
X_valid_final["predicted_label"] = y_pred_thresh
X_valid_final["actual_label"] = y_valid.values

# Initialize SHAP explainer (TreeExplainer for LightGBM)
explainer = shap.TreeExplainer(model)

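# Compute SHAP values (expected shape: [n_samples, n_features])
shap_values = explainer.shap_values(X_valid)
# Some SHAP versions return one array per class for binary classifiers;
# in that case, keep the contributions toward the positive (default) class
if isinstance(shap_values, list):
    shap_values = shap_values[1]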

# Compute global SHAP feature importance (mean absolute value)
global_importance = (
    pd.DataFrame(shap_values, columns=X_valid.columns)
    .abs()
    .mean()
    .reset_index()
    .rename(columns={"index": "feature_name", 0: "mean_abs_shap_value"})
    .sort_values("mean_abs_shap_value", ascending=False)
)

# Save global SHAP importance summary for Tableau
global_importance.to_csv("../data/final/global_shap_importance.csv", index=False)
print("✅ Global SHAP feature importance saved to global_shap_importance.csv")

✅ Global SHAP feature importance saved to global_shap_importance.csv
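
The reason for ranking by the mean *absolute* SHAP value, rather than the plain mean, is easier to see on a toy matrix. The sketch below uses invented numbers and hypothetical feature names (`feature_a`, `feature_b`), not values from the model: a feature whose contributions alternate in sign averages out to zero, even though it moves individual predictions the most.

```python
import pandas as pd

# Toy SHAP matrix: 4 applicants x 2 hypothetical features (values invented for illustration)
toy_shap = pd.DataFrame(
    {
        "feature_a": [0.5, -0.5, 0.5, -0.5],  # strong effect, alternating sign
        "feature_b": [0.1, 0.1, 0.1, 0.1],    # weak effect, constant sign
    }
)

# Signed mean: positive and negative contributions cancel out
signed_mean = toy_shap.mean()

# Mean absolute value: measures the size of the contribution regardless of direction
mean_abs = toy_shap.abs().mean()

print("Top feature by signed mean:  ", signed_mean.idxmax())
print("Top feature by mean absolute:", mean_abs.idxmax())
```

The signed mean would rank `feature_b` first, while the mean absolute value correctly ranks `feature_a` first.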


### 📊 Save Risk Distribution Data for Tableau

To support visual analysis of loan default risk scores in Tableau, I prepare a compact dataset that pairs the model’s predicted probabilities with the actual outcomes:

- `loan_default_proba`: Model’s predicted probability of default
- `actual_label`: Ground truth indicating default or not
- `label`: Readable label version of the actual class (e.g., “Default”, “No Default”)

This file can be used to create histograms, density plots, or stratified risk profiles in Tableau dashboards.

In [4]:
# Add readable labels to indicate default vs no default
X_valid_final["label"] = X_valid_final["actual_label"].map({0: "No Default", 1: "Default"})

# Select relevant columns for visualization
risk_df = X_valid_final[[
    "loan_default_proba",
    "actual_label",
    "label"
]]

# Save to CSV
risk_df.to_csv("../data/final/risk_distribution.csv", index=False)
print("✅ Risk distribution saved to risk_distribution.csv")

✅ Risk distribution saved to risk_distribution.csv


### 📊 Aggregate SHAP Values by Risk Band for Tableau

To enable segmented interpretation of model behavior, I prepare an aggregated SHAP summary grouped by predicted risk bands:

- Assign each applicant to a **risk band** based on their predicted probability of default:
  - Low Risk: 0.0–0.2
  - Medium Risk: 0.2–0.5
  - High Risk: 0.5–1.0
- Compute **mean absolute SHAP values** across all validation samples to identify the top 15 most influential features.
- Reshape the SHAP values of those 15 features into long format, then compute the **average (signed) SHAP value per feature per band**, so the sign shows whether a feature pushes risk up or down within each band.
- Export the result as `agg_shap_by_risk_band.csv`, ready for Tableau heatmaps or bar plots to show which features drive different levels of credit risk.

This segmentation reveals **how different applicant profiles are evaluated by the model**, supporting transparent, risk-aware storytelling.
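
The band edges deserve a quick check, because `pd.cut` builds right-closed intervals by default and leaves the lowest edge open. The sketch below uses made-up probabilities to show where boundary values land. Whether any validation score is exactly 0 is not something this notebook shows, so treat this as a precaution rather than a known problem.

```python
import pandas as pd

bins = [0, 0.2, 0.5, 1.0]
labels = ["Low Risk", "Medium Risk", "High Risk"]

# Made-up probabilities that sit on or near the band edges
probs = pd.Series([0.0, 0.2, 0.35, 0.5, 0.9])

# Default behaviour: intervals are (0, 0.2], (0.2, 0.5], (0.5, 1.0]
default_bands = pd.cut(probs, bins=bins, labels=labels)

# include_lowest=True also places a probability of exactly 0 in the first band
inclusive_bands = pd.cut(probs, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"proba": probs, "default": default_bands, "inclusive": inclusive_bands}))
```

With the defaults, a score of exactly 0.2 counts as Low Risk, 0.5 counts as Medium Risk, and a score of exactly 0 receives no band at all.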

In [5]:
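# Create risk bands from predicted default probabilities
# (intervals are right-closed; include_lowest=True keeps a probability of exactly 0 in "Low Risk")
bins = [0, 0.2, 0.5, 1.0]
labels = ["Low Risk", "Medium Risk", "High Risk"]
X_valid_final["risk_band"] = pd.cut(
    X_valid_final["loan_default_proba"], bins=bins, labels=labels, include_lowest=True
)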

# Compute top 15 features by mean absolute SHAP value
shap_df = pd.DataFrame(shap_values, columns=X_valid.columns)
mean_abs_shap = shap_df.abs().mean().sort_values(ascending=False)
top_features = mean_abs_shap.head(15).index.tolist()

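# Add risk band to SHAP values DataFrame
# (assign by position: shap_df has a fresh 0..n-1 index, while X_valid_final keeps X_valid's index)
shap_df["risk_band"] = X_valid_final["risk_band"].to_numpy()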

# Melt to long format for Tableau
shap_melted = shap_df[["risk_band"] + top_features].melt(
    id_vars="risk_band", var_name="feature", value_name="shap_value"
)

# Aggregate SHAP values by risk band and feature
agg_df = shap_melted.groupby(["risk_band", "feature"], observed=True).mean().reset_index()

# Save to CSV for Tableau
agg_df.to_csv("../data/final/agg_shap_by_risk_band.csv", index=False)
print("✅ SHAP aggregation saved to agg_shap_by_risk_band.csv")

✅ SHAP aggregation saved to agg_shap_by_risk_band.csv
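
One pitfall in this step is index alignment. A DataFrame built from a NumPy array gets a fresh `0..n-1` index, whereas a validation frame produced by a shuffled split usually keeps its original row labels. The toy example below, with hypothetical frames and values, shows how assigning a column across the two can silently produce missing values. Whether this applies here depends on the index stored in `X_valid.parquet`, which this notebook does not display.

```python
import pandas as pd

# Hypothetical validation frame whose index survived a shuffled split
features = pd.DataFrame({"risk_band": ["Low", "High", "Medium"]}, index=[42, 7, 19])

# SHAP values rebuilt from a NumPy array get a default RangeIndex (0, 1, 2)
shap_toy = pd.DataFrame({"feature_a": [0.1, 0.9, 0.4]})

# Assigning a Series aligns on index labels: none match, so every value is NaN
aligned_by_label = shap_toy.copy()
aligned_by_label["risk_band"] = features["risk_band"]

# Assigning a NumPy array aligns by row position instead
aligned_by_position = shap_toy.copy()
aligned_by_position["risk_band"] = features["risk_band"].to_numpy()

print(aligned_by_label)
print(aligned_by_position)
```

Assigning by position is safe here only because the SHAP rows are computed from the same frame, in the same order, as the predictions.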


### 📊 Prepare SHAP vs. Risk Score Data for Tableau

In this section, I generate a **long-format dataset** that connects SHAP feature contributions to the model’s predicted probability of default.

- I select **8 key features** based on their relevance to financial behavior and SHAP impact: external scores, credit ratios, demographics, and employment type.
- SHAP values for each selected feature are combined with the model’s `loan_default_proba` scores.
- The data is **reshaped into long format**, with one row per (applicant, feature) pair, suitable for scatterplots or faceted visualizations in Tableau.
- This enables detailed analysis of how each feature influences risk scores across the applicant population.

The final CSV (`shap_vs_risk_long.csv`) allows stakeholders to explore which features drive default probability at different risk levels, uncovering nuanced patterns and potential biases.

In [6]:
# Select key features to analyze SHAP contributions vs. risk probability
selected_features = [
    "EXT_SOURCE_1",
    "EXT_SOURCE_2",
    "EXT_SOURCE_3",
    "credit_annuity_ratio",
    "credit_goods_ratio",
    "CODE_GENDER_M",
    "DAYS_BIRTH",
    "ORGANIZATION_TYPE_TE"
]

# Subset SHAP values for selected features
shap_selected = pd.DataFrame(shap_values, columns=X_valid.columns)[selected_features]

# Combine with predicted probabilities
shap_vs_risk_df = shap_selected.copy()
shap_vs_risk_df["loan_default_proba"] = y_pred_proba

# Melt into long format for Tableau visualization
shap_vs_risk_long = shap_vs_risk_df.melt(
    id_vars="loan_default_proba",
    value_vars=selected_features,
    var_name="feature",
    value_name="shap_value"
)

# Save to CSV
shap_vs_risk_long.to_csv("../data/final/shap_vs_risk_long.csv", index=False)
print("✅ SHAP vs risk score (long format) saved to shap_vs_risk_long.csv")

✅ SHAP vs risk score (long format) saved to shap_vs_risk_long.csv
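
To make the long format concrete, here is a minimal sketch on a two-applicant toy frame with hypothetical feature names and invented values. Each applicant contributes one row per feature, so the output has `n_applicants × n_features` rows.

```python
import pandas as pd

# Wide format: one row per applicant, one SHAP column per feature (values invented)
wide = pd.DataFrame(
    {
        "loan_default_proba": [0.10, 0.65],
        "feature_a": [-0.20, 0.30],
        "feature_b": [0.05, 0.40],
    }
)

# Long format: one row per (applicant, feature) pair
long = wide.melt(
    id_vars="loan_default_proba",
    value_vars=["feature_a", "feature_b"],
    var_name="feature",
    value_name="shap_value",
)

print(long)
print(f"{len(wide)} applicants x 2 features = {len(long)} rows")
```

Note that the long frame carries no applicant identifier, so rows belonging to the same applicant can only be matched through their shared probability. An ID column would need to be added to `id_vars` if Tableau has to link them.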


### 📊 Export Row-Level Confusion Matrix Predictions for Tableau

In this step, I prepare a clean, row-level dataset showing each applicant’s predicted and actual class along with the model-assigned probability of default. This file powers a **confusion matrix visualization** in Tableau and contains three columns:

- `loan_default_proba`: Predicted probability of default  
- `predicted_label`: Model-assigned class label (0 = No Default, 1 = Default)  
- `actual_label`: Ground truth from the validation set  

This format is optimized for **flexible Tableau interactivity**, enabling visual breakdowns by true vs false positives/negatives, model confidence, and threshold-based performance tuning.

The output is saved as `confusion_prediction_only.csv` in the `data/final/` directory.

In [7]:
# Select only prediction-relevant columns
df_pred = X_valid_final[["loan_default_proba", "predicted_label", "actual_label"]].copy()

# Save to final output directory
df_pred.to_csv("../data/final/confusion_prediction_only.csv", index=False)
print("✅ Saved: confusion_prediction_only.csv")

✅ Saved: confusion_prediction_only.csv
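
Tableau can derive the confusion category with a calculated field, but it can also be precomputed. The sketch below is one possible approach, not part of the export above: it adds a hypothetical `outcome` column to a toy frame that uses the same `predicted_label` and `actual_label` column names.

```python
import numpy as np
import pandas as pd

# Toy frame covering all four combinations of predicted vs. actual label
toy_pred = pd.DataFrame(
    {
        "predicted_label": [1, 1, 0, 0],
        "actual_label": [1, 0, 1, 0],
    }
)

actual_pos = toy_pred["actual_label"] == 1
predicted_pos = toy_pred["predicted_label"] == 1

# Each condition pairs with the label at the same position in `choices`
conditions = [
    actual_pos & predicted_pos,
    ~actual_pos & predicted_pos,
    actual_pos & ~predicted_pos,
    ~actual_pos & ~predicted_pos,
]
choices = ["True Positive", "False Positive", "False Negative", "True Negative"]

# `outcome` is a hypothetical column name chosen for this illustration
toy_pred["outcome"] = np.select(conditions, choices, default="Unknown")

print(toy_pred)
```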


### 📊 Export Confusion Matrix Summary for Tableau

To complement the row-level predictions, I generate a concise **confusion matrix summary** to visualize true/false positives and negatives in Tableau:

- Calculates `True Positive`, `False Positive`, `False Negative`, and `True Negative` counts using logical conditions.
- Adds placeholder columns (`loan_default_proba`, `actual_label`, `predicted_label`) to maintain schema consistency with the prediction-level export.
- Includes a `Source` column to distinguish this summary from row-level predictions when combining both for dashboarding.

This file (`confusion_summary.csv`) provides a high-level view of model performance, useful for heatmaps, confusion matrix visuals, or dashboard tiles.

In [8]:
# Calculate confusion matrix components directly from predicted vs actual labels
TP = ((y_valid == 1) & (y_pred_thresh == 1)).sum()
TN = ((y_valid == 0) & (y_pred_thresh == 0)).sum()
FP = ((y_valid == 0) & (y_pred_thresh == 1)).sum()
FN = ((y_valid == 1) & (y_pred_thresh == 0)).sum()

# Create a summary DataFrame in Tableau-friendly format
confusion_summary = pd.DataFrame({
    "Metric": ["True Positive", "False Positive", "False Negative", "True Negative"],
    "Count": [TP, FP, FN, TN],
    "loan_default_proba": [pd.NA] * 4,
    "actual_label": [pd.NA] * 4,
    "predicted_label": [pd.NA] * 4,
    "Source": ["Summary"] * 4
})

# Save to CSV
confusion_summary.to_csv("../data/final/confusion_summary.csv", index=False)
print("✅ Summary saved: confusion_summary.csv")

✅ Summary saved: confusion_summary.csv
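
The four counts are enough to derive the headline metrics for a dashboard tile. Below is a minimal sketch with a hypothetical helper, `summarize_confusion`, and made-up counts. The actual counts from the validation set are not reproduced here.

```python
def summarize_confusion(tp, fp, fn, tn):
    """Derive standard classification metrics from confusion matrix counts."""
    total = tp + fp + fn + tn
    # Guard each ratio against an empty denominator
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts for illustration only
metrics = summarize_confusion(tp=30, fp=20, fn=10, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision answers "of the applicants flagged as defaults, how many actually defaulted?", while recall answers "of the actual defaults, how many were flagged?". Lowering the decision threshold typically trades the former for the latter.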


---

## ✅ Final Notes

All required datasets for Tableau dashboarding have now been successfully exported and saved in the `data/final/` directory:

| CSV File                         | Purpose                                |
|----------------------------------|----------------------------------------|
| global_shap_importance.csv       | SHAP global feature importance         |
| risk_distribution.csv            | Predicted risk vs. actual outcome      |
| agg_shap_by_risk_band.csv        | SHAP mean by risk group                |
| shap_vs_risk_long.csv            | SHAP vs. score (long format)           |
| confusion_prediction_only.csv    | Row-level prediction and outcome       |
| confusion_summary.csv            | Summary of confusion matrix counts     |

These files support a range of visualizations, from feature importance and risk stratification to confusion matrix insights.
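
Before opening Tableau, it can be worth confirming that every export is actually on disk. The helper below, `missing_exports`, is a hypothetical convenience function rather than part of the pipeline. It assumes the same relative `../data/final` path used by the cells above.

```python
from pathlib import Path

# File names taken from the summary table above
EXPECTED_EXPORTS = [
    "global_shap_importance.csv",
    "risk_distribution.csv",
    "agg_shap_by_risk_band.csv",
    "shap_vs_risk_long.csv",
    "confusion_prediction_only.csv",
    "confusion_summary.csv",
]


def missing_exports(directory, filenames):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


missing = missing_exports("../data/final", EXPECTED_EXPORTS)
if missing:
    print("Missing exports:", ", ".join(missing))
else:
    print("All exports found.")
```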

> This marks the completion of the modeling-to-visualization pipeline. The next stage focuses on building a transparent, interactive credit risk dashboard in **Tableau Public**, translating ML insights into accessible, decision-ready narratives.