## Model Explainability with SHAP

# Task 3: Model Explainability - Step-by-Step To-Do List

## Phase 1: Setup & Data Preparation
1. **Environment Setup**
   - [ ] Install SHAP: `pip install shap`
   - [ ] Import required libraries
   - [ ] Create necessary directories

2. **Load Model & Data**
   - [ ] Load trained XGBoost model
   - [ ] Load test dataset
   - [ ] Verify data shapes and types

## Phase 2: Feature Importance Analysis
3. **Built-in Feature Importance**
   - [ ] Extract feature importances from XGBoost
   - [ ] Create bar plot of top 10 features
   - [ ] Save visualization

## Phase 3: SHAP Analysis
4. **Global Explainability**
   - [ ] Initialize SHAP explainer
   - [ ] Calculate SHAP values (sample if needed for performance)
   - [ ] Generate SHAP summary plot
   - [ ] Save visualization

5. **Local Explainability**
   - [ ] Identify example predictions:
     - [ ] 1 True Positive
     - [ ] 1 False Positive
     - [ ] 1 False Negative
   - [ ] Create SHAP force plots for each
   - [ ] Save visualizations
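
The example selection in step 5 can be sketched with boolean masks over the true and predicted labels. This is a minimal illustration on toy labels, not the notebook's actual predictions; the helper name `pick_examples` is hypothetical, and it assumes binary labels where 1 marks fraud.

```python
import numpy as np

def pick_examples(y_true, y_pred):
    """Return the first row index of a TP, FP and FN (None if a category is empty)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    masks = {
        "true_positive": (y_true == 1) & (y_pred == 1),
        "false_positive": (y_true == 0) & (y_pred == 1),
        "false_negative": (y_true == 1) & (y_pred == 0),
    }
    picks = {}
    for name, mask in masks.items():
        idx = np.flatnonzero(mask)  # positions where the mask is True
        picks[name] = int(idx[0]) if idx.size else None
    return picks

# Toy labels for illustration only
y_true = [0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
examples = pick_examples(y_true, y_pred)
print(examples)
```

The returned positions could then be used to pick the rows whose SHAP values feed the force plots.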

## Phase 4: Interpretation
6. **Compare Feature Importance Methods**
   - [ ] Create comparison table
   - [ ] Document top 5 fraud drivers

7. **Business Recommendations**
   - [ ] List 3+ actionable recommendations
   - [ ] Connect to SHAP insights
   - [ ] Add potential business impact
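
The comparison table in step 6 could look roughly like the sketch below, which ranks features under both methods side by side. The feature names and scores are placeholders invented for illustration, and `compare_importances` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def compare_importances(gain, mean_abs_shap, top_n=5):
    """Build a table with both scores and their ranks, ordered by SHAP rank."""
    table = pd.DataFrame({
        "gain": pd.Series(gain),
        "mean_abs_shap": pd.Series(mean_abs_shap),
    }).fillna(0.0)  # a feature missing from one method gets a score of 0
    table["gain_rank"] = table["gain"].rank(ascending=False, method="min").astype(int)
    table["shap_rank"] = table["mean_abs_shap"].rank(ascending=False, method="min").astype(int)
    return table.sort_values("shap_rank").head(top_n)

# Placeholder scores, NOT results from the fraud model
gain_scores = {"feature_a": 50.0, "feature_b": 30.0, "feature_c": 20.0, "feature_d": 5.0}
shap_scores = {"feature_a": 0.9, "feature_b": 0.4, "feature_c": 0.6, "feature_d": 0.1}

comparison = compare_importances(gain_scores, shap_scores)
print(comparison)
```

Features whose two ranks disagree strongly are the ones worth discussing when documenting the top fraud drivers.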

## Phase 5: Documentation
8. **Update Notebook**
   - [ ] Add clear section headers
   - [ ] Include markdown explanations
   - [ ] Add figure captions

9. **Repository Updates**
   - [ ] Update README
   - [ ] Ensure all paths are relative
   - [ ] Add requirements.txt if missing

## Phase 6: Final Checks
10. **Verification**
    - [ ] Verify all visualizations are clear
    - [ ] Ensure code is well-commented
    - [ ] Cross-validate findings
    - [ ] Commit and push final changes

## Environment setup

Install the required packages, import the libraries, and set up the project paths:

In [3]:
# ===========================================
# SHAP Model Explainability - Environment Setup
# ===========================================

print("🔧 Setting up environment for SHAP analysis...")

# 1. Install required packages
!pip install shap pandas numpy matplotlib scikit-learn

# 2. Import necessary libraries
import shap
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import joblib
import os
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

# 3. Set up paths
NOTEBOOK_DIR = Path.cwd()
ROOT_DIR = NOTEBOOK_DIR.parent
DATA_DIR = ROOT_DIR / "data" / "processed"
MODEL_DIR = ROOT_DIR / "models"
REPORTS_DIR = ROOT_DIR / "reports" / "figures"

# 4. Create directories if they don't exist
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

print("✅ Environment setup complete!")
print(f"Notebook directory: {NOTEBOOK_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Model directory: {MODEL_DIR}")
print(f"Reports directory: {REPORTS_DIR}")

# 5. Set display options for better readability
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
plt.style.use('seaborn-v0_8')  # Updated style for newer matplotlib versions
%matplotlib inline

print("\n✅ Environment is ready for SHAP analysis!")
print("Next step: Loading the model and data...")

🔧 Setting up environment for SHAP analysis...
✅ Environment setup complete!
Notebook directory: c:\Users\My Device\Desktop\Week-5_KAIM\fraud-detection\notebooks
Data directory: c:\Users\My Device\Desktop\Week-5_KAIM\fraud-detection\data\processed
Model directory: c:\Users\My Device\Desktop\Week-5_KAIM\fraud-detection\models
Reports directory: c:\Users\My Device\Desktop\Week-5_KAIM\fraud-detection\reports\figures

✅ Environment is ready for SHAP analysis!
Next step: Loading the model and data...





## Loading the model and the data

In [5]:
# ===========================================
# Check Available Model Files
# ===========================================

print("🔍 Checking available model files...")

# List all files in the models directory
model_files = list(MODEL_DIR.glob("*"))
if model_files:
    print("📋 Available model files:")
    for i, file in enumerate(model_files, 1):
        print(f"   {i}. {file.name} (Size: {file.stat().st_size / (1024*1024):.2f} MB)")
else:
    print("❌ No files found in the models directory.")
    print(f"   Directory path: {MODEL_DIR}")

# Also check the parent directory in case the model is there
parent_dir_files = list((MODEL_DIR.parent).glob("*"))
print("\n📋 Files in parent directory:")
for i, file in enumerate(parent_dir_files, 1):
    print(f"   {i}. {file.name} (Size: {file.stat().st_size / (1024*1024):.2f} MB)")

print("\nPlease provide the correct model filename from the list above.")

🔍 Checking available model files...
📋 Available model files:
   1. fraud_detection_xgboost_v1_20251227.pkl (Size: 0.00 MB)
   2. MODEL_CARD.md (Size: 0.00 MB)
   3. model_metadata_v1.json (Size: 0.00 MB)

📋 Files in parent directory:
   1. .gitignore (Size: 0.00 MB)
   2. data (Size: 0.00 MB)
   3. models (Size: 0.00 MB)
   4. notebooks (Size: 0.00 MB)
   5. README.md (Size: 0.00 MB)
   6. reports (Size: 0.00 MB)
   7. results (Size: 0.00 MB)
   8. scripts (Size: 0.00 MB)
   9. src (Size: 0.00 MB)
   10. tests (Size: 0.00 MB)

Please provide the correct model filename from the list above.
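
Every entry in the listing above shows as 0.00 MB, because small files round down to zero at megabyte precision and directories are reported the same way as files. A possible alternative, sketched here with the hypothetical helpers `format_size` and `describe_entry`, picks the unit per entry and labels directories explicitly. The demo runs against a temporary directory, not the project tree.

```python
import tempfile
from pathlib import Path

def format_size(num_bytes):
    """Format a byte count using B, KB, MB or GB, whichever fits first."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.0f} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024

def describe_entry(path):
    """Describe a directory entry; directories get no size."""
    if path.is_dir():
        return f"{path.name}/ (directory)"
    return f"{path.name} ({format_size(path.stat().st_size)})"

# Demo on a throwaway directory with one 2 KiB file and one subdirectory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "model.pkl").write_bytes(b"\0" * 2048)
    (root / "data").mkdir()
    listing = sorted(describe_entry(p) for p in root.iterdir())

for line in listing:
    print(line)
```

Reporting sizes this way would make an unexpectedly small model file stand out immediately.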


In [6]:
# ===========================================
# Loading Model and Data
# ===========================================

print("📂 Loading model and data...")

# 1. Load the trained model
try:
    model_path = MODEL_DIR / "fraud_detection_xgboost_v1_20251227.pkl"  # Updated filename
    model = joblib.load(model_path)
    print(f"✅ Model loaded successfully from: {model_path}")
    print(f"   Model type: {type(model).__name__}")
    
    # Print model parameters for verification
    print("\nModel parameters:")
    print("-" * 40)
    for param, value in model.get_params().items():
        print(f"{param}: {value}")
    print("-" * 40)
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("\n📋 Available files in models directory:")
    print("\n".join([f"   - {f.name}" for f in MODEL_DIR.glob("*")]))
    raise

# 2. Load the test data
try:
    test_data_path = DATA_DIR / "test_data.csv"  # We'll verify this next
    test_data = pd.read_csv(test_data_path)
    
    # Separate features and target
    X_test = test_data.drop('Class', axis=1)
    y_test = test_data['Class']
    
    print(f"\n‚úÖ Test data loaded: {len(X_test)} samples")
    print(f"   Features: {X_test.shape[1]}")
    print("\nClass distribution:")
    print(y_test.value_counts().to_string())
    
    # Display first few rows of features
    print("\nFirst few rows of features:")
    display(X_test.head())
    
except Exception as e:
    print(f"\n❌ Error loading test data: {e}")
    print("\n📋 Available files in data directory:")
    print("\n".join([f"   - {f.name}" for f in DATA_DIR.glob("*")]))
    print("\nPlease provide the correct test data filename from the list above.")

📂 Loading model and data...
✅ Model loaded successfully from: c:\Users\My Device\Desktop\Week-5_KAIM\fraud-detection\models\fraud_detection_xgboost_v1_20251227.pkl
   Model type: XGBClassifier

Model parameters:
----------------------------------------
objective: binary:logistic
base_score: None
booster: None
callbacks: None
colsample_bylevel: None
colsample_bynode: None
colsample_bytree: 0.8
device: None
early_stopping_rounds: None
enable_categorical: False
eval_metric: aucpr
feature_types: None
feature_weights: None
gamma: None
grow_policy: None
importance_type: None
interaction_constraints: None
learning_rate: 0.05
max_bin: None
max_cat_threshold: None
max_cat_to_onehot: None
max_delta_step: None
max_depth: 6
max_leaves: None
min_child_weight: None
missing: nan
monotone_constraints: None
multi_strategy: None
n_estimators: 1000
n_jobs: -1
num_parallel_tree: None
random_state: 42
reg_alpha: None
reg_lambda: None
sampling_method: None
scale_pos_weight: 577.2868020304569
subsample:

In [7]:
# ===========================================
# Loading Model and Data
# ===========================================

print("📂 Loading model and data...")

# 1. Load the trained model (already loaded successfully)
print("✅ Model already loaded successfully")

# 2. Load the test data
try:
    # Load features and target separately
    X_test = pd.read_csv(DATA_DIR / "cc_X_test.csv")
    y_test = pd.read_csv(DATA_DIR / "cc_y_test.csv")
    
    # If y_test is a DataFrame with a single column, convert to Series
    if isinstance(y_test, pd.DataFrame) and len(y_test.columns) == 1:
        y_test = y_test.iloc[:, 0]
    
    print("\n✅ Test data loaded successfully")
    print(f"   Features shape: {X_test.shape}")
    print(f"   Target shape: {y_test.shape if hasattr(y_test, 'shape') else len(y_test)}")
    
    # Display class distribution
    print("\nClass distribution in test set:")
    print(y_test.value_counts().to_string())
    
    # Display first few rows of features
    print("\nFirst few rows of features:")
    display(X_test.head())
    
    # Verify feature names match training
    print("\nFeature verification:")
    print(f"Number of features: {X_test.shape[1]}")
    print("First 5 feature names:", list(X_test.columns[:5]))
    
except Exception as e:
    print(f"\n❌ Error loading test data: {e}")
    print("\n📋 Available files in data directory:")
    print("\n".join([f"   - {f.name}" for f in DATA_DIR.glob("*")]))
    raise

print("\n✅ Data loading completed successfully!")
print("\nNext step: Initializing SHAP explainer...")

📂 Loading model and data...
✅ Model already loaded successfully

✅ Test data loaded successfully
   Features shape: (56962, 30)
   Target shape: (56962,)

Class distribution in test set:
Class
0    56864
1       98

First few rows of features:


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,160760.0,-0.674466,1.408105,-1.110622,-1.328366,1.388996,-1.308439,1.885879,-0.614233,0.311652,0.650757,-0.857785,-0.229961,-0.199817,0.266371,-0.046544,-0.741398,-0.605617,-0.392568,-0.162648,0.394322,0.080084,0.810034,-0.224327,0.707899,-0.135837,0.045102,0.533837,0.291319,23.0
1,19847.0,-2.829816,-2.765149,2.537793,-1.07458,2.842559,-2.153536,-1.795519,-0.25002,3.073504,-1.000418,1.850842,-1.549779,1.252337,0.963974,-0.481027,-0.147319,-0.209328,1.058898,0.397057,-0.515765,-0.295555,0.109305,-0.813272,0.042996,-0.02766,-0.910247,0.110802,-0.511938,11.85
2,88326.0,-3.576495,2.318422,1.306985,3.263665,1.127818,2.865246,1.444125,-0.718922,1.874046,7.398491,2.081146,-0.064145,0.577556,-2.430201,1.505993,-1.237941,-0.390405,-1.231804,0.098738,2.034786,-1.060151,0.016867,-0.132058,-1.483996,-0.296011,0.062823,0.552411,0.509764,76.07
3,141734.0,2.060386,-0.015382,-1.082544,0.386019,-0.024331,-1.074935,0.207792,-0.33814,0.455091,0.047859,-0.652497,0.750829,0.665603,0.158608,0.027348,-0.171173,-0.291228,-1.008531,0.09704,-0.192024,-0.281684,-0.639426,0.331818,-0.067584,-0.283675,0.203529,-0.063621,-0.060077,0.99
4,38741.0,1.209965,1.384303,-1.343531,1.763636,0.662351,-2.113384,0.854039,-0.475963,-0.629658,-1.579654,1.462573,0.208823,0.734537,-3.538625,0.926076,0.835029,2.845937,1.040947,-1.045263,0.009083,-0.164015,-0.328294,-0.154631,0.619449,0.818998,-0.330525,0.046884,0.104527,1.5



Feature verification:
Number of features: 30
First 5 feature names: ['Time', 'V1', 'V2', 'V3', 'V4']

✅ Data loading completed successfully!

Next step: Initializing SHAP explainer...
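
The class distribution printed above is heavily imbalanced. The arithmetic below, using only the two counts from that output, makes the imbalance explicit. The resulting negative-to-positive ratio is close to the `scale_pos_weight` listed in the model parameters, which was presumably computed the same way on the training split (an assumption, since the training data is not shown here).

```python
# Counts copied from the "Class distribution in test set" output
n_legit = 56864
n_fraud = 98

total = n_legit + n_fraud
fraud_rate = n_fraud / total        # share of fraudulent transactions
neg_pos_ratio = n_legit / n_fraud   # legitimate transactions per fraud case

print(f"Total samples:      {total}")
print(f"Fraud rate:         {fraud_rate:.3%}")
print(f"Negative/positive:  {neg_pos_ratio:.2f}")
```

With so few positive rows, accuracy alone says little, which is consistent with the model using `aucpr` as its evaluation metric.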


## Feature Importance Analysis

In [14]:
# ===========================================
# Load XGBoost Model with Booster
# ===========================================
import xgboost as xgb

print("🔄 Loading XGBoost model with Booster...")
model_path = MODEL_DIR / "fraud_detection_xgboost_v1_20251227.pkl"
try:
    # Try loading with joblib first
    try:
        model = joblib.load(model_path)
        print("✅ Model loaded with joblib")
    except Exception:
        # If joblib fails, try loading with XGBoost's Booster
        model = xgb.Booster()
        model.load_model(str(model_path))
        print("✅ Model loaded with XGBoost Booster")
    
    # Wrapper that exposes one prediction interface for both model types
    class XGBoostWrapper:
        def __init__(self, model):
            self.model = model
        
        def predict_proba(self, X):
            # A raw Booster only accepts DMatrix input; with a
            # binary:logistic objective it returns P(class 1) directly
            if isinstance(self.model, xgb.Booster):
                return self.model.predict(xgb.DMatrix(X))
            # scikit-learn style estimators return one column per class
            return self.model.predict_proba(X)[:, 1]
            
        def predict(self, X):
            proba = self.predict_proba(X)
            return (proba > 0.5).astype(int)
    
    # Wrap the model
    model = XGBoostWrapper(model)
    
    # Test prediction
    try:
        # Convert X_test to numpy array if it's a DataFrame
        X_test_array = X_test.values if hasattr(X_test, 'values') else X_test
        _ = model.predict(X_test_array[:1])  # Test with one sample
        print("✅ Model prediction test successful")
    except Exception as e:
        print(f"⚠️ Prediction test failed: {str(e)}")
        raise

except Exception as e:
    print(f"❌ Error loading model: {str(e)}")
    print("\nTroubleshooting steps:")
    print("1. Let's verify the model file exists and is not corrupted")
    print("2. Checking file size...")
    if model_path.exists():
        print(f"   - File exists. Size: {model_path.stat().st_size / (1024*1024):.2f} MB")
    else:
        print("   - File does not exist at the specified path")
    raise

print("\n✅ Model loading completed. Ready for analysis!")

# Save the model wrapper for later use
model_wrapper = model
print("\nModel wrapper created. You can use 'model_wrapper' for predictions.")

🔄 Loading XGBoost model with Booster...
✅ Model loaded with joblib
⚠️ Prediction test failed: need to call fit or load_model beforehand
❌ Error loading model: need to call fit or load_model beforehand

Troubleshooting steps:
1. Let's verify the model file exists and is not corrupted
2. Checking file size...
   - File exists. Size: 0.00 MB


NotFittedError: need to call fit or load_model beforehand
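
One plausible reading of the `NotFittedError` above is that the pickled estimator carries its hyperparameters but no trained booster, which would also fit the 0.00 MB file size reported earlier. A check like the following could run right after loading, before any wrapper is built. It is a duck-typed sketch: `has_trained_booster` is a hypothetical helper, and the two stub classes only stand in for a fitted and an unfitted estimator so the example runs without the real model file.

```python
def has_trained_booster(model):
    """Return True if model.get_booster() succeeds, False otherwise."""
    get_booster = getattr(model, "get_booster", None)
    if get_booster is None:
        return False
    try:
        get_booster()
    except Exception:
        # An unfitted XGBoost estimator raises here instead of returning a booster
        return False
    return True

class FittedStub:
    """Stand-in for an estimator that has a trained booster."""
    def get_booster(self):
        return object()

class UnfittedStub:
    """Stand-in for an estimator that was saved before training."""
    def get_booster(self):
        raise RuntimeError("need to call fit or load_model beforehand")

print(has_trained_booster(FittedStub()))
print(has_trained_booster(UnfittedStub()))
```

If the check fails for the real model, the likely fix lies in the training notebook (saving the estimator after `fit`), not in this one.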

In [10]:
# ===========================================
# Feature Importance Analysis - Alternative Approach
# ===========================================
import seaborn as sns  # required by sns.barplot below

print("📊 Analyzing Feature Importance using Booster...")
try:
    # Try to get the booster object
    if hasattr(model, 'get_booster'):
        booster = model.get_booster()
        # Get feature importance as a dictionary
        importance_dict = booster.get_score(importance_type='gain')
        
        # Create a DataFrame for visualization
        importance_df = pd.DataFrame({
            'Feature': list(importance_dict.keys()),
            'Importance': list(importance_dict.values())
        }).sort_values('Importance', ascending=False)
        
        # Display top 20 features
        print("\nTop 20 Most Important Features (Gain):")
        display(importance_df.head(20))
        
        # Plot feature importance
        plt.figure(figsize=(12, 8))
        sns.barplot(
            x='Importance', 
            y='Feature', 
            data=importance_df.head(20),
            palette='viridis'
        )
        plt.title('Top 20 Most Important Features (Gain)', fontsize=14)
        plt.xlabel('Importance (Gain)', fontsize=12)
        plt.ylabel('Feature', fontsize=12)
        plt.tight_layout()
        
        # Save the plot
        importance_plot_path = REPORTS_DIR / "feature_importance_gain.png"
        plt.savefig(importance_plot_path, dpi=300, bbox_inches='tight')
        plt.close()
        print(f"✅ Feature importance (Gain) plot saved to: {importance_plot_path}")
        
        # Save feature importances to CSV
        importance_csv_path = REPORTS_DIR / "feature_importances_gain.csv"
        importance_df.to_csv(importance_csv_path, index=False)
        print(f"✅ Feature importances (Gain) saved to: {importance_csv_path}")
        
    else:
        print("❌ Could not access booster object. Trying alternative method...")
        raise AttributeError("Booster object not accessible")
    
except Exception as e:
    print(f"⚠️ Could not generate feature importance using booster: {str(e)}")
    print("Proceeding with SHAP values for feature importance...")
    use_shap_for_importance = True

print("\n✅ Feature importance analysis completed!")
print("\nNext step: Proceeding with SHAP analysis...")

📊 Analyzing Feature Importance using Booster...
⚠️ Could not generate feature importance using booster: need to call fit or load_model beforehand
Proceeding with SHAP values for feature importance...

✅ Feature importance analysis completed!

Next step: Proceeding with SHAP analysis...
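
The SHAP fallback mentioned in the output above usually means ranking features by their mean absolute SHAP value. The sketch below shows that aggregation on a tiny hand-written matrix. It assumes the SHAP values arrive as a 2-D array with one row per sample and one column per feature; `mean_abs_shap` is a hypothetical helper and the numbers are illustrative, not model output.

```python
import numpy as np

def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| across samples, largest first."""
    values = np.asarray(shap_values, dtype=float)
    if values.ndim != 2 or values.shape[1] != len(feature_names):
        raise ValueError("expected shape (n_samples, n_features)")
    scores = np.abs(values).mean(axis=0)   # one score per feature
    order = np.argsort(scores)[::-1]       # indices sorted by descending score
    return [(feature_names[i], float(scores[i])) for i in order]

# Illustrative values only: 2 samples x 3 features
toy_shap = [[0.5, -0.1, 0.0],
            [-0.3, 0.2, 0.1]]
ranking = mean_abs_shap(toy_shap, ["a", "b", "c"])
print(ranking)
```

Taking the absolute value first matters: positive and negative contributions would otherwise cancel and hide influential features.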
