In [0]:
import os, joblib
import numpy as np
from scipy.sparse import issparse

project_root = os.path.dirname(os.getcwd())
load_dir = os.path.join(project_root, "etl_pipeline")

pipeline = joblib.load(os.path.join(load_dir, "stedi_feature_pipeline.pkl"))
X_train_transformed = joblib.load(os.path.join(load_dir, "X_train_transformed.pkl"))
X_test_transformed  = joblib.load(os.path.join(load_dir, "X_test_transformed.pkl"))
y_train = joblib.load(os.path.join(load_dir, "y_train.pkl"))
y_test  = joblib.load(os.path.join(load_dir, "y_test.pkl"))

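def to_float_matrix(arr):
    """Return a dense float array from a dense, sparse, or object-wrapped matrix."""
    if issparse(arr):
        # Plain SciPy sparse matrix: densify directly
        return arr.toarray().astype(float)
    arr = np.asarray(arr)
    if arr.ndim == 0:
        # 0-d object array wrapping a single matrix: unwrap it first
        inner = arr.item()
        if issparse(inner):
            inner = inner.toarray()
        return np.asarray(inner, dtype=float)
    if arr.dtype == object:
        # Object array of per-row matrices/vectors: densify each row, then stack
        return np.vstack([
            x.toarray() if issparse(x) else np.asarray(x, dtype=float)
            for x in arr
        ]).astype(float)
    return arr.astype(float)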

X_train = to_float_matrix(X_train_transformed)
X_test  = to_float_matrix(X_test_transformed)
y_train = np.ravel(y_train)
y_test  = np.ravel(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


In the explainability results, num__distance_cm was by far the most influential feature, and several one-hot device_id features (such as spotter-14 and spotter-16) also ranked highly, suggesting the model may be relying on device-specific patterns in addition to motion-derived signals. A concerning behavior is that the majority class dominates the model’s predictions: despite high overall accuracy, it predicts “step” for every sample, which the confusion matrix confirms. The biggest weakness, therefore, is the model’s complete failure to identify no_step: every no_step sample becomes a false positive, which makes the model unsafe for distinguishing stepping from non-stepping periods without further refinement.

In [0]:
# Focused refinement grid for Logistic Regression (small + purposeful)
params = {
    "C": [0.003, 0.01, 0.03, 0.1, 0.3, 1.0],     # centered around the previous best C=0.01
    "class_weight": [None, "balanced"],          # address severe class imbalance
    "penalty": ["l2"],
    "solver": ["lbfgs"],
    "max_iter": [1000]
}


SHAP values and global feature importance show the model relies most on num__distance_cm, with several device_id one-hot features also ranking highly. Because the confusion matrix shows the model predicts “step” for every sample, I’m running a small, focused grid around the previous best C=0.01 and adding class_weight='balanced' to reduce the impact of class imbalance and improve minority-class performance rather than chasing accuracy alone.
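
To make the effect of class_weight='balanced' concrete, here is a minimal sketch of how scikit-learn derives the per-class weights, following its documented formula n_samples / (n_classes * np.bincount(y)). The label counts (95 step, 5 no_step) and the names `y_demo`, `classes`, and `manual` are hypothetical and chosen only for illustration; they are not the STEDI data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels for illustration only (not the STEDI data)
y_demo = np.array(["step"] * 95 + ["no_step"] * 5)
classes = np.array(["no_step", "step"])

# Weights scikit-learn would use for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same quantity by hand: n_samples / (n_classes * count_per_class)
counts = np.array([(y_demo == c).sum() for c in classes])
manual = len(y_demo) / (len(classes) * counts)

for label, w in zip(classes, weights):
    print(f"{label}: weight={w:.4f}")
```

With these counts the rare class is weighted roughly 19 times more heavily than the common one, so misclassifying a no_step sample costs the optimizer far more than misclassifying a step sample.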

In [0]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# focused refinement grid (from Step 3)
params = {
    "C": [0.003, 0.01, 0.03, 0.1, 0.3, 1.0],
    "class_weight": [None, "balanced"],
    "penalty": ["l2"],
    "solver": ["lbfgs"],
    "max_iter": [1000]
}

grid = GridSearchCV(
    LogisticRegression(),
    params,
    scoring="accuracy",  # candidates are ranked by accuracy, so a majority-class-only model can still win
    cv=3,
    n_jobs=-1
)

grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)
new_model = grid.best_estimator_

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    RocCurveDisplay
)
import numpy as np
import matplotlib.pyplot as plt

# Rebuild the Week 5 winner as your "old" baseline
old_model = LogisticRegression(C=0.01, penalty="l2", solver="lbfgs", max_iter=1000)
old_model.fit(X_train, y_train)

def evaluate(name, model, X, y):
    y_pred = model.predict(X)

    acc = accuracy_score(y, y_pred)
    bal_acc = balanced_accuracy_score(y, y_pred)
    f1_macro = f1_score(y, y_pred, average="macro")

    print(f"\n{name}")
    print(f"accuracy={acc:.4f}  balanced_accuracy={bal_acc:.4f}  macro_f1={f1_macro:.4f}")
    print(classification_report(y, y_pred))

    # Confusion matrix heatmap (extra credit-friendly)
    labels = ["no_step", "step"] if set(np.unique(y)).issubset({"no_step", "step"}) else None
    cm = confusion_matrix(y, y_pred, labels=labels)
    ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels).plot(values_format="d")
    plt.title(f"Confusion Matrix: {name}")
    plt.show()

    # ROC curve (extra credit-friendly, requires predict_proba)
    if hasattr(model, "predict_proba") and labels is not None:
        step_index = list(model.classes_).index("step")
        proba_step = model.predict_proba(X)[:, step_index]
        y_bin = (y == "step").astype(int)
        RocCurveDisplay.from_predictions(y_bin, proba_step)
        plt.title(f"ROC Curve: {name}")
        plt.show()

evaluate("Old model (Week 5 baseline)", old_model, X_test, y_test)
evaluate("New tuned model (GridSearch best)", new_model, X_test, y_test)

print("\nOld CV best score (Week 5): 0.9511214840660257")
print("New CV best score (this grid):", grid.best_score_)
print("New best params:", grid.best_params_)


Old model score: The Week 5 baseline Logistic Regression model had a best cross-validation accuracy of 0.9511. On the test set it achieved accuracy = 0.9511, balanced accuracy = 0.5000, and macro F1 = 0.4875, and the confusion matrix shows it predicts “step” for every example.

New tuned model score: The refined GridSearch model achieved best CV accuracy = 0.9511 with C = 0.003 and class_weight = None. On the test set it produced identical results: accuracy = 0.9511, balanced accuracy = 0.5000, and macro F1 = 0.4875, again predicting “step” for every example. Because the search ranked candidates by accuracy (scoring="accuracy"), a configuration that predicts only the majority class could still rank first, which is likely why class weighting was not selected.

The new tuning did not produce a meaningful improvement, so I will not switch models: the tuned model still fails to detect the no_step class.
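
The reported balanced accuracy and macro F1 follow directly from an all-“step” predictor. The sketch below reproduces them by hand, assuming only the reported test accuracy of 0.9511; the variable names are illustrative.

```python
# Reported test accuracy. Because every prediction is "step", this is also
# the fraction of test samples whose true label is "step".
p_step = 0.9511

# Every true "step" is caught; no true "no_step" is ever predicted.
recall_step, recall_no_step = 1.0, 0.0
balanced_accuracy = (recall_step + recall_no_step) / 2

# Precision for "step" equals the share of true "step" samples.
precision_step = p_step
f1_step = 2 * precision_step * recall_step / (precision_step + recall_step)
f1_no_step = 0.0  # no "no_step" predictions, so scikit-learn reports F1 as 0
macro_f1 = (f1_step + f1_no_step) / 2

print(f"balanced_accuracy={balanced_accuracy:.4f}  macro_f1={macro_f1:.4f}")
# prints: balanced_accuracy=0.5000  macro_f1=0.4875
```

This is why accuracy alone looks strong here while the class-averaged metrics sit at the level of a constant prediction.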

In [0]:
import os, joblib

def first_writable_dir(candidates):
    for d in candidates:
        try:
            os.makedirs(d, exist_ok=True)
            test_path = os.path.join(d, ".write_test")
            with open(test_path, "w") as f:
                f.write("ok")
            os.remove(test_path)
            return d
        except Exception:
            pass
    raise RuntimeError("No writable directory found in candidates.")

cwd = os.getcwd()

candidates = [
    os.path.join(cwd, "exports", "model"),   # best if you're in a Repo
    os.path.join(cwd, "model"),              # also good in a Repo
    "/tmp/stedi_exports/model",              # driver-local fallback
    "/local_disk0/tmp/stedi_exports/model",  # another common driver-local fallback
]

save_dir = first_writable_dir(candidates)
save_path = os.path.join(save_dir, "stedi_best_model.pkl")

joblib.dump(old_model, save_path)

print("Saved model to:", save_path)
print("File size (bytes):", os.path.getsize(save_path))


In [0]:
import joblib

# Reload from the path written by the save cell above instead of a hard-coded location
reloaded = joblib.load(save_path)

print(type(reloaded))
print("Sample preds:", reloaded.predict(X_test[:5]))


In [0]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_pred = reloaded.predict(X_test)

print("Unique predictions + counts:", np.unique(y_pred, return_counts=True))

# If your labels are strings like 'no_step'/'step'
cm = confusion_matrix(y_test, y_pred, labels=["no_step", "step"])
print("Confusion matrix [[no_step, step] rows x [no_step, step] cols]:\n", cm)

print("\nClassification report:\n", classification_report(y_test, y_pred))


I saved the final selected model to a writable workspace path: /Workspace/Users/tcm082@ensign.edu/csai382_lab_2_4_-tmuhlestein-/notebooks/exports/model/stedi_best_model.pkl. This ensures the notebook contains an exported model artifact even though DBFS/FileStore and the earlier shared pipeline directories were not writable in this environment. I also verified the file by reloading it and confirming it produces predictions successfully.
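
A slightly stronger check than "it produces predictions" is to confirm the reloaded model matches the in-memory one. The sketch below shows that round-trip comparison on a small stand-in model trained on hypothetical random data; `X_demo`, `y_demo`, `demo_path`, and `round_trip_ok` are illustrative names, and the temporary directory stands in for the workspace export path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data and model (not the STEDI features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.where(X_demo[:, 0] > 0, "step", "no_step")
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, demo_path)
    restored = joblib.load(demo_path)

# The restored model should have identical coefficients and predictions
round_trip_ok = (
    np.allclose(model.coef_, restored.coef_)
    and np.array_equal(model.predict(X_demo), restored.predict(X_demo))
)
print("Round trip preserved the model:", round_trip_ok)
```

The same comparison could be applied to old_model and reloaded on X_test in this notebook.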

# Refinement Summary
I performed a second, focused hyperparameter search on Logistic Regression, centered on the previous best C value, and tested class weighting to address the severe class imbalance revealed by the confusion matrix and explainability results. The refined GridSearchCV produced the same best CV accuracy as the original model and selected C=0.003 with class_weight=None, which did not change the model’s behavior on the minority class. On the test set, the refined model still predicted “step” for every sample, so I did not update the final model. This decision is responsible because it avoids claiming improvement when the refinement did not address the model’s most important weakness.
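
One possible follow-up, which was not run in this notebook, is to rank grid candidates by a class-aware metric instead of accuracy. The sketch below shows how that could look with GridSearchCV's multi-metric scoring and refit="balanced_accuracy". It uses synthetic imbalanced data from make_classification rather than the STEDI features, and the names `X_demo`, `y_demo`, `demo_params`, and `demo_grid` are hypothetical, so the scores it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic ~95/5 imbalanced data for illustration only (not the STEDI data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.95, 0.05], random_state=0
)

demo_params = {"C": [0.01, 0.1, 1.0], "class_weight": [None, "balanced"]}

demo_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    demo_params,
    # Track both metrics, but choose the winner by balanced accuracy
    scoring={"accuracy": "accuracy", "balanced_accuracy": "balanced_accuracy"},
    refit="balanced_accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
demo_grid.fit(X_demo, y_demo)

print("Best params (by balanced accuracy):", demo_grid.best_params_)
print("Best mean balanced accuracy:", demo_grid.best_score_)
print("Mean accuracy per candidate:", demo_grid.cv_results_["mean_test_accuracy"])
```

Keeping accuracy in the scoring dictionary makes the trade-off visible: a candidate can be compared on both metrics even though only balanced accuracy decides the refit.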

# Reflection
Careless hyperparameter tuning can create unfair or unsafe models when it optimizes a metric like accuracy, which can look strong while the model ignores a minority class. That’s why it’s important to examine confusion matrices, per-class metrics, and explainability instead of trusting a single score. SHAP and feature importance help reveal whether the model is learning real movement patterns or relying on shortcuts that may fail on new users or devices. Gospel principles of integrity and stewardship guide me to report results honestly and make decisions that reflect the model’s actual behavior: “by their fruits ye shall know them.”