In [0]:
import os
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse

# Your CWD is .../notebooks, so project root is one level up
project_root = os.path.dirname(os.getcwd())
load_dir = os.path.join(project_root, "etl_pipeline")

pipeline = joblib.load(os.path.join(load_dir, "stedi_feature_pipeline.pkl"))
model = joblib.load(os.path.join(load_dir, "stedi_best_model.pkl"))

X_train_transformed = joblib.load(os.path.join(load_dir, "X_train_transformed.pkl"))
X_test_transformed  = joblib.load(os.path.join(load_dir, "X_test_transformed.pkl"))
y_train = joblib.load(os.path.join(load_dir, "y_train.pkl"))
y_test  = joblib.load(os.path.join(load_dir, "y_test.pkl"))

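def to_float_matrix(arr) -> np.ndarray:
    """Coerce a loaded feature matrix (dense, sparse, or object-wrapped) to a float array."""
    # Handle sparse input first: scipy sparse matrices are not ndarrays.
    if issparse(arr):
        return arr.toarray().astype(float)
    arr = np.asarray(arr)
    if arr.ndim == 0:
        # A 0-d object array can wrap a single object, such as a sparse matrix.
        inner = arr.item()
        if issparse(inner):
            return inner.toarray().astype(float)
        return np.array(inner, dtype=float)
    if arr.dtype == object:
        # Each element is one row (dense or sparse); give every row shape (1, n) before stacking.
        rows = [
            x.toarray() if issparse(x) else np.atleast_2d(np.asarray(x, dtype=float))
            for x in arr
        ]
        return np.vstack(rows)
    return arr.astype(float)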

X_train = to_float_matrix(X_train_transformed)
X_test  = to_float_matrix(X_test_transformed)
y_train = np.ravel(y_train)
y_test  = np.ravel(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


In [0]:
try:
    feature_names = pipeline.named_steps["preprocess"].get_feature_names_out()
except Exception:
    feature_names = np.array([f"feature_{i}" for i in range(X_train.shape[1])])

len(feature_names), feature_names[:10]


In [0]:
import numpy as np

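# Label the plot with the class of the model that was actually loaded.
model_name = type(model).__name__

if hasattr(model, "feature_importances_"):
    importances = model.feature_importances_
    importance_label = f"Feature importance ({model_name})"
elif hasattr(model, "coef_"):
    # A binary linear classifier stores coef_ with shape (1, n_features); use that single row.
    importances = np.abs(model.coef_[0])
    importance_label = f"Absolute coefficient magnitude ({model_name})"
else:
    raise ValueError("Model has neither feature_importances_ nor coef_.")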

importance_order = np.argsort(importances)[::-1]

for idx in importance_order[:10]:
    print(feature_names[idx], ":", importances[idx])

#Does this importance pattern make sense?
The importance pattern makes sense in one respect: distance_cm being the top numeric feature is reasonable, since a step should show up as a change in measured distance. What’s more surprising is that many of the next most “important” features are device_id one-hot categories (spotter-14, spotter-16, spotter-1, etc.). That suggests the model may be learning device-specific quirks (sensor calibration, placement, user behavior, collection conditions) instead of learning only what separates step from no_step.

I wouldn’t fully trust predictions based on this pattern without more checks, because reliance on device_id can act like a shortcut and can reduce generalization to new devices or users. It also raises a fairness concern: if some devices are associated with different environments or populations, the model could perform well for devices it has seen and poorly for others, while overall accuracy still looks high.
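
One way to check how much weight device identity carries overall is to add up the importances of all one-hot columns that came from the same raw column. The sketch below is only an illustration: the helper `group_importances`, the `cat__` prefix, the `sensor_type_ultrasonic` category, and all numeric values are made up for the example, and it assumes feature names follow the `prefix__column_category` pattern.

```python
import numpy as np

def group_importances(feature_names, importances, source_columns):
    """Sum per-feature importances by the raw column each feature came from."""
    totals = {col: 0.0 for col in source_columns}
    totals["other"] = 0.0
    # Try longer column names first so "device_id" wins over a shorter prefix.
    candidates = sorted(source_columns, key=len, reverse=True)
    for name, value in zip(feature_names, importances):
        # Drop a ColumnTransformer-style prefix such as "num__" or "cat__", if present.
        base = str(name).split("__", 1)[-1]
        match = next(
            (col for col in candidates if base == col or base.startswith(col + "_")),
            "other",
        )
        totals[match] += float(value)
    return totals

# Made-up names and values, for illustration only (not results from this notebook).
demo_names = np.array([
    "num__distance_cm",
    "cat__device_id_spotter-14",
    "cat__device_id_spotter-16",
    "cat__sensor_type_ultrasonic",
])
demo_importances = np.array([0.40, 0.25, 0.20, 0.15])

grouped = group_importances(
    demo_names, demo_importances, ["distance_cm", "device_id", "sensor_type"]
)
print(grouped)
```

With the real `feature_names` and `importances` from the cell above, a large `device_id` total relative to `distance_cm` would support the concern that the model leans on device identity.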

In [0]:
%restart_python

In [0]:
import matplotlib.pyplot as plt

top_n = 10
top_idx = importance_order[:top_n]

plt.figure(figsize=(10, 5))
plt.barh([feature_names[i] for i in top_idx], importances[top_idx])
plt.xlabel(importance_label)
plt.title("Top Global Feature Importance")
plt.gca().invert_yaxis()
plt.show()

In [0]:
%pip install numpy==2.3

In [0]:
%restart_python

In [0]:
# If import shap fails, run these two lines in a separate cell:
# %pip install shap
# dbutils.library.restartPython()

import shap
shap.initjs()

# Tree models vs linear models need different explainers
if hasattr(model, "feature_importances_"):
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
else:
    explainer = shap.LinearExplainer(model, X_train)
    shap_values = explainer.shap_values(X_test)


In [0]:
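# For binary classification, tree explainers return per-class SHAP values: a list of 2-D
# arrays in older shap releases, or one 3-D array (samples, features, classes) in newer ones.
# Linear explainers return a single 2-D array.
if isinstance(shap_values, list):
    shap_values_pos = shap_values[1]
elif getattr(shap_values, "ndim", 0) == 3:
    shap_values_pos = shap_values[:, :, 1]
else:
    shap_values_pos = shap_values

shap.summary_plot(shap_values_pos, X_test, feature_names=feature_names, rng=42)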

#SHAP summary plot
The SHAP summary plot matches the global importance results: num__distance_cm is clearly the strongest driver of predictions, and many device_id one-hot features also have noticeable influence. In general, larger values of distance_cm tend to push the model more strongly in one direction, which fits the idea that steps involve larger movement changes.

It’s concerning that device_id features appear so prominent, because that suggests the model may be learning device-specific patterns rather than a general “step vs. no_step” concept. This also fits what we observed earlier: the model behaves like it defaults to predicting step, so these SHAP effects may be explaining a majority-class shortcut rather than balanced reasoning across both classes.
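
To compare the SHAP view with the global importance ranking numerically rather than by eye, the mean absolute SHAP value per feature can be ranked the same way. This is a minimal sketch: the helper `mean_abs_shap`, the `cat__` feature names, and the SHAP numbers are invented for illustration, and it assumes a 2-D array of SHAP values for a single class.

```python
import numpy as np

def mean_abs_shap(shap_matrix):
    """Average |SHAP| per feature over rows; expects shape (n_samples, n_features)."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    if shap_matrix.ndim != 2:
        raise ValueError("expected a 2-D array of SHAP values for a single class")
    return np.abs(shap_matrix).mean(axis=0)

# Made-up SHAP values for 3 rows and 3 features, for illustration only.
demo_shap = np.array([
    [0.30, -0.10, 0.02],
    [-0.50, 0.20, 0.01],
    [0.40, -0.30, -0.03],
])
demo_names = ["num__distance_cm", "cat__device_id_spotter-14", "cat__sensor_type_a"]

scores = mean_abs_shap(demo_shap)
# Sort features from largest to smallest average contribution.
ranking = [demo_names[j] for j in np.argsort(scores)[::-1]]
print(ranking)
```

If this ranking and the model's own importance ranking agree on the top features, the two views tell the same story; where they disagree, the SHAP ranking reflects contributions on the test rows specifically.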

In [0]:
i = 0  # pick any row index you want

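# Pick the positive-class base value and SHAP row for whichever output format shap returned.
if isinstance(shap_values, list):
    base_value, row_values = explainer.expected_value[1], shap_values[1][i]
elif getattr(shap_values, "ndim", 0) == 3:
    base_value, row_values = explainer.expected_value[1], shap_values[i, :, 1]
else:
    base_value, row_values = explainer.expected_value, shap_values[i]

# Keep force_plot as the cell's last expression so the notebook displays it.
shap.force_plot(base_value, row_values, X_test[i], feature_names=feature_names)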

#Reflection
Globally, the model is most influenced by num__distance_cm, followed by several device_id indicators and sensor_type. Distance makes intuitive sense because steps should correlate with movement, but the heavy reliance on device identity suggests the model may be learning device-specific quirks rather than purely learning step behavior.

Locally, the SHAP force plot shows a small number of features doing most of the work for a single prediction, with distance and device-related indicators pushing the output in the same direction as the overall model. That makes the decision explainable, but it also reinforces the concern that the model may be defaulting to majority-class behavior rather than carefully separating step from no_step.

A human would expect motion features to matter most, so the distance result matches intuition, but the prominence of device_id is a warning sign for generalization and fairness. For the dashboard, I plan to include the global feature importance bar chart, the SHAP summary plot, and one SHAP force plot example to show both overall behavior and a single-row explanation.
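
Both concerns raised here (majority-class behavior and uneven performance across devices) can be checked with two small numbers: the accuracy of always predicting the most frequent label, and the accuracy computed separately per device. The sketch below uses made-up labels, predictions, and device ids. The helpers `majority_baseline_accuracy` and `accuracy_by_group` are hypothetical, and a per-row device id array is an assumption, since this notebook only loads the transformed features.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent label."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return counts.max() / counts.sum()

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group label (e.g. each device)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        str(g): float(np.mean(y_true[groups == g] == y_pred[groups == g]))
        for g in np.unique(groups)
    }

# Made-up labels, predictions, and device ids, for illustration only.
demo_true = np.array(["step", "step", "no_step", "step", "no_step", "step"])
demo_pred = np.array(["step", "step", "step", "step", "step", "step"])
demo_devices = np.array(["spotter-1", "spotter-1", "spotter-1",
                         "spotter-14", "spotter-14", "spotter-14"])

baseline = majority_baseline_accuracy(demo_true)
per_device = accuracy_by_group(demo_true, demo_pred, demo_devices)
print(baseline, per_device)
```

A model whose accuracy sits close to the majority baseline, or whose per-device accuracy varies widely, would be a reason to treat the feature importance and SHAP results with caution.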

In [0]:
import os, joblib

print("cwd:", os.getcwd())

project_root = os.path.dirname(os.getcwd())
load_dir = os.path.join(project_root, "etl_pipeline")

print("load_dir:", load_dir)
print("etl_pipeline contents:", os.listdir(load_dir))

model_path = os.path.join(load_dir, "stedi_best_model.pkl")
print("model_path:", model_path, "exists:", os.path.exists(model_path))

model = joblib.load(model_path)
type(model)


In [0]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
