Step 1: Install and Import

In [2]:
%pip install shap

Note: you may need to restart the kernel to use updated packages.

Collecting shap
  Downloading shap-0.49.1-cp312-cp312-win_amd64.whl.metadata (25 kB)
Collecting scipy (from shap)
  Downloading scipy-1.16.2-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting scikit-learn (from shap)
  Downloading scikit_learn-1.7.2-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting pandas (from shap)
  Downloading pandas-2.3.3-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting slicer==0.0.8 (from shap)
  Using cached slicer-0.0.8-py3-none-any.whl.metadata (4.0 kB)
Collecting numba>=0.54 (from shap)
  Downloading numba-0.62.1-cp312-cp312-win_amd64.whl.metadata (2.9 kB)
Collecting cloudpickle (from shap)
  Using cached cloudpickle-3.1.1-py3-none-any.whl.metadata (7.1 kB)
Collecting llvmlite<0.46,>=0.45.0dev0 (from numba>=0.54->shap)
  Downloading llvmlite-0.45.1-cp312-cp312-win_amd64.whl.metadata (5.0 kB)
Collecting pytz>=2020.1 (from pandas->shap)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas->shap)
  Downloading tzdat

Step 2: Load Model and Training Data

In [None]:
import shap
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split

# --- Load Your Model ---
model = joblib.load('random_forest_model.pkl')

# --- Load Your Data ---
# We need the training data (X) to create the explainer
# Load the *balanced* dataset you used for training
df = pd.read_csv('balanced_dataset.csv')

# --- Recreate your Train/Test Split ---
# This is to get an 'X_train' that SHAP can use as a reference
# Make sure to use the same features your model was trained on

# Define features (X) and target (y)
# Adjust these columns based on your final model
X = df.drop(columns=['Substances_Used', 'substances_used_label'])
y = df['Substances_Used']

# Use the same random_state!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Model, SHAP, and Data are ready.")
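
The comments above stress using the same features the model was trained on. A minimal sketch of how that could be checked before building the explainer is shown here. The helper name `check_feature_alignment` and the feature names in the example are hypothetical and are not part of the original notebook.

```python
def check_feature_alignment(expected, actual):
    """Compare the feature names a model expects with the columns of X.

    Returns a dict listing names missing from X, unexpected extra names,
    and whether both lists are identical, including order.
    """
    expected = [str(name) for name in expected]
    actual = [str(name) for name in actual]
    missing = [name for name in expected if name not in actual]
    extra = [name for name in actual if name not in expected]
    same_order = expected == actual
    return {"missing": missing, "extra": extra, "same_order": same_order}


# Hypothetical feature names, for illustration only
report = check_feature_alignment(
    expected=["age", "gender", "stress_level"],
    actual=["age", "stress_level", "gender", "row_id"],
)
print(report)
```

With a scikit-learn model that was fitted on a DataFrame, the expected names should be available as `model.feature_names_in_`, so the call would likely be `check_feature_alignment(model.feature_names_in_, X.columns)`. If the attribute is absent, the model was probably fitted on a plain array and the column order has to be verified by hand.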

Step 3: Create the SHAP Explainer

In [None]:
# 1. Initialize JavaScript visualization in the notebook
shap.initjs()

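# 2. Create the explainer object
# Passing X_train as background data makes TreeExplainer measure each
# feature's effect relative to that reference set. With a large training
# set this can be slow; a random subsample of X_train is a common alternative.
explainer = shap.TreeExplainer(model, X_train)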

# 3. Calculate SHAP values for your *test* data
# This can take a moment
shap_values = explainer.shap_values(X_test)

print("SHAP values calculated.")
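
For a classifier, the layout of `shap_values` depends on the installed shap version: older releases return a list with one array per class, while recent ones (to my knowledge, from 0.45 onward) return a single 3D array. The sketch below shows one way to pick out a single class under either layout. The helper name `select_class` is hypothetical and the arrays are dummy data, not real SHAP output.

```python
import numpy as np


def select_class(shap_values, class_index):
    """Return a 2D (n_samples, n_features) array for one class."""
    if isinstance(shap_values, list):
        # Older layout: one (n_samples, n_features) array per class
        return np.asarray(shap_values[class_index])
    values = np.asarray(shap_values)
    if values.ndim == 3:
        # Newer layout: (n_samples, n_features, n_classes)
        return values[:, :, class_index]
    # Already 2D: single-output model, nothing to select
    return values


# Dummy data: 5 samples, 3 features, 2 classes
as_array = np.arange(30, dtype=float).reshape(5, 3, 2)
as_list = [as_array[:, :, 0], as_array[:, :, 1]]

print(select_class(as_array, 1).shape)
print(select_class(as_list, 1).shape)
```

Printing `np.shape(shap_values)` right after the explainer call is a quick way to see which layout the installed version produces.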

Step 4: Global Interpretability (Which features matter most overall?)

In [None]:
import matplotlib.pyplot as plt

# We explain class 1 ("Yes").
# The layout of shap_values depends on the shap version: older releases return
# a list with one (n_samples, n_features) array per class, newer ones return a
# single 3D array of shape (n_samples, n_features, n_classes).
if isinstance(shap_values, list):
    shap_values_class1 = shap_values[1]
else:
    shap_values_class1 = shap_values[:, :, 1]

# Create a summary plot (bar plot of mean absolute SHAP value per feature)
print("### Global Feature Importance")
print("Which features have the most impact on the prediction?")
shap.summary_plot(shap_values_class1, X_test, plot_type="bar", show=True)
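
The bar summary plot ranks features by their mean absolute SHAP value across all explained rows. The same ranking can be reproduced as plain numbers, which is handy for tables or logs. This is a minimal sketch: `rank_features` is a hypothetical helper, and the matrix and feature names are dummy values, not results from the model.

```python
import numpy as np


def rank_features(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = np.abs(np.asarray(shap_matrix)).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]


# Dummy SHAP values: 3 samples x 3 features (illustration only)
dummy = np.array([
    [0.10, -0.40, 0.00],
    [-0.30, 0.20, 0.05],
    [0.20, -0.60, -0.05],
])
ranking = rank_features(dummy, ["age", "stress_level", "gender"])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the notebook, the first argument would be the 2D class-1 SHAP array and the second `X_test.columns.tolist()`.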

Step 5: Local Interpretability (Why did this one person get this score?)

In [None]:
# Let's explain the prediction for the *first person* in the test set
row_index = 0
X_sample = X_test.iloc[[row_index]]

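# Get the SHAP values for this single sample, for class 1 ("Yes").
# Handle both return layouts: a list with one array per class (older shap)
# or a single 3D array of shape (n_samples, n_features, n_classes) (newer shap).
sv = explainer.shap_values(X_sample)
shap_values_sample = sv[1] if isinstance(sv, list) else sv[:, :, 1]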

# Get the model's base value (the average prediction)
base_value = explainer.expected_value[1]

print("---")
print("### Explaining Prediction for a Single User")

# Create a waterfall plot
shap.waterfall_plot(shap.Explanation(
    values=shap_values_sample[0],
    base_values=base_value,
    data=X_sample.iloc[0],
    feature_names=X_test.columns.tolist()
), show=True)
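
A waterfall plot starts at the base value and adds one SHAP value per feature until it reaches the model's output for that row. That additivity can be checked numerically. The sketch below uses made-up numbers, and `reconstruct_prediction` is a hypothetical helper rather than a shap function.

```python
import numpy as np


def reconstruct_prediction(base_value, shap_row):
    """Base value plus the sum of one row's SHAP values."""
    return float(base_value + np.sum(shap_row))


# Dummy numbers for illustration: base value and three feature contributions
base = 0.5
row = np.array([0.20, -0.05, 0.10])
reconstructed = reconstruct_prediction(base, row)
print(round(reconstructed, 2))
```

In the notebook, `reconstruct_prediction(base_value, shap_values_sample[0])` should come out close to `model.predict_proba(X_sample)[0, 1]`, assuming the explainer is explaining the class-1 probability of a scikit-learn random forest. A large gap usually points to a mismatch in features or class index.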