# Scikit-learn Baseline Comparison

## Purpose
This notebook justifies the use of PySpark (distributed computing) over
scikit-learn (single-node) for the PAMAP2 activity recognition task.

We train a **scikit-learn LogisticRegression** on the same feature set and
compare:
- **Accuracy** and **weighted F1** against the Spark Logistic Regression
- **Training time** to highlight scalability differences

The Spark model is NOT retrained here; we read the saved results from
`data/model_results.json` produced by the model training pipeline.

In [1]:
import json
import time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, isnan, when

print("Imports ready.")

Imports ready.


## 1. Load Feature Data

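We use a temporary Spark session to read the Parquet feature file,
then convert to pandas for scikit-learn training. The split reuses the
parameters of the Spark training pipeline (seed=42, 80/20) so that both
models are evaluated on the same rows. Note that `randomSplit` only
reproduces the same split when the DataFrame is read and partitioned
the same way as in the training pipeline.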
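
If the split ever needs to be independent of how Spark partitions the data, one option is to derive it from a stable hash of a per-row key. The sketch below is only an illustration of that idea, not what this notebook or the training pipeline does; the `subject-window` key format and the `assign_split` helper are hypothetical.

```python
import hashlib


def assign_split(row_key: str, train_fraction: float = 0.8) -> str:
    """Deterministically assign a row to 'train' or 'test' from its key."""
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # stable bucket in [0, 9999]
    return "train" if bucket < train_fraction * 10_000 else "test"


# Hypothetical keys: one per (subject, window index) pair
keys = [f"{subject}-{window}" for subject in range(1, 10) for window in range(200)]
splits = [assign_split(k) for k in keys]
train_share = splits.count("train") / len(splits)
print(f"train share: {train_share:.3f}")
```

Because the assignment depends only on the key, rerunning it in any process or on any partitioning yields the same train/test membership.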

In [2]:
# -- Start a lightweight Spark session just to read Parquet ----
spark = (
    SparkSession.builder
    .appName("sklearn_baseline")
    .master("local[2]")
    .config("spark.driver.memory", "2g")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("ERROR")

INPUT_PATH = r"C:/Users/johnu/Desktop/BigDataProject/data/pamap2_features.parquet"
df = spark.read.parquet(INPUT_PATH)

# -- Identify feature columns (same logic as training script) --
META = {"subject_id", "activity_id"}
feature_cols = sorted([
    c for c in df.columns
    if c not in META
    and isinstance(df.schema[c].dataType, DoubleType)
])

# -- Replace NaN with 0 (same as training pipeline) -----------
for c in feature_cols:
    df = df.withColumn(c, when(isnan(col(c)), 0.0).otherwise(col(c)))
# NaN and null are distinct in Spark: drop rows that still hold nulls
df_clean = df.na.drop(subset=feature_cols)

# -- Same train/test split as Spark pipeline ------------------
train_df, test_df = df_clean.randomSplit([0.8, 0.2], seed=42)

print(f"Total rows  : {df_clean.count():,}")
print(f"Train rows  : {train_df.count():,}")
print(f"Test rows   : {test_df.count():,}")
print(f"Features    : {len(feature_cols)}")

Total rows  : 5,447
Train rows  : 4,416
Test rows   : 1,031
Features    : 172


In [3]:
# -- Convert to pandas for scikit-learn -----------------------
# Select only the columns we need to minimize memory usage
columns_needed = feature_cols + ["activity_id"]

train_pd = train_df.select(columns_needed).toPandas()
test_pd  = test_df.select(columns_needed).toPandas()

X_train = train_pd[feature_cols].values
y_train = train_pd["activity_id"].values
X_test  = test_pd[feature_cols].values
y_test  = test_pd["activity_id"].values

print(f"X_train shape: {X_train.shape}")
print(f"X_test  shape: {X_test.shape}")
print(f"Classes      : {np.unique(y_train).shape[0]}")

# Stop Spark -- we no longer need it
spark.stop()
print("Spark session stopped (scikit-learn uses single-node only).")

X_train shape: (4416, 172)
X_test  shape: (1031, 172)
Classes      : 18
Spark session stopped (scikit-learn uses single-node only).


## 2. Train Scikit-learn Logistic Regression

We apply **StandardScaler** (zero-mean, unit-variance), then fit a
multinomial Logistic Regression with L2 regularisation, mirroring the
configuration used in the Spark pipeline.
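
The two scalers are close but not identical under their defaults. scikit-learn's `StandardScaler` centres the data and divides by the population standard deviation, whereas Spark ML's `StandardScaler` divides by the sample standard deviation and does not centre unless `withMean=True` is set. Whether the Spark pipeline enabled `withMean` is not shown in this notebook, so the sketch below assumes library defaults and uses only the standard library on made-up values.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data only

mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)     # divides by n     (scikit-learn style)
sample_std = statistics.stdev(values)   # divides by n - 1 (Spark ML style)

# scikit-learn default: centre, then scale by the population std
sklearn_style = [(v - mean) / pop_std for v in values]

# Spark ML default (withMean=False, withStd=True): scale only
spark_default_style = [v / sample_std for v in values]

print(f"mean={mean:.3f}  pop_std={pop_std:.3f}  sample_std={sample_std:.3f}")
```

With 4,416 training rows the gap between the two standard deviations is negligible; the centring default is the more visible difference.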

In [4]:
# -- Scale features (same as Spark StandardScaler) ------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

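# -- Train LogisticRegression ---------------------------------
# multinomial + L2 penalty mirrors the Spark LR configuration.
# We test the regParam values from Spark's grid search (0.01, 0.1),
# mapped as C = 1/regParam. This mapping is approximate: Spark averages
# the loss over the n training rows while sklearn sums it, so an exact
# match would need C = 1/(n * regParam).
best_acc = 0
best_model = None
best_C = None
best_train_time = None

for C_val in [1/0.01, 1/0.1]:  # approximate mapping: C = 1/regParam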
    t0 = time.time()
    lr = LogisticRegression(
        multi_class="multinomial",
        solver="lbfgs",
        penalty="l2",
        C=C_val,
        max_iter=100,
        random_state=42,
    )
    lr.fit(X_train_scaled, y_train)
    train_time = time.time() - t0
    
    acc = accuracy_score(y_test, lr.predict(X_test_scaled))
    print(f"  C={C_val:>6.1f}  acc={acc:.4f}  train_time={train_time:.2f}s")
    
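    # NOTE: the best C is picked on test accuracy, so the reported
    # score is slightly optimistic; a validation split would avoid this.
    if acc > best_acc:
        best_acc = acc
        best_model = lr
        best_C = C_val
        best_train_time = train_time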

print(f"\nBest C = {best_C:.1f}")

  C= 100.0  acc=0.8933  train_time=0.19s

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



  C=  10.0  acc=0.9001  train_time=0.17s

Best C = 10.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [5]:
# -- Evaluate the best sklearn model --------------------------
y_pred = best_model.predict(X_test_scaled)

sklearn_acc = accuracy_score(y_test, y_pred)
sklearn_f1  = f1_score(y_test, y_pred, average="weighted")

print(f"Scikit-learn Logistic Regression")
print(f"  Accuracy    : {sklearn_acc:.4f}")
print(f"  Weighted F1 : {sklearn_f1:.4f}")
print(f"  Train time  : {best_train_time:.2f}s")

Scikit-learn Logistic Regression
  Accuracy    : 0.9001
  Weighted F1 : 0.9006
  Train time  : 0.17s
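
For reference, weighted F1 is the per-class F1 score averaged with each class weighted by its number of true instances. The sketch below recomputes it in plain Python on a tiny made-up label set; it is meant to illustrate the definition behind `f1_score(..., average="weighted")`, not to replace it.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_label - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_label * f1  # weight each class by its support
    return total / len(y_true)


# Illustrative labels only (not PAMAP2 data)
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2]
print(f"weighted F1 = {weighted_f1(y_true_demo, y_pred_demo):.4f}")
```

Because frequent classes carry more weight, weighted F1 tracks accuracy closely when the classifier performs similarly across classes, which is consistent with the near-identical accuracy and F1 reported above.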


## 3. Comparison: Spark vs Scikit-learn

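We load the saved Spark results and build a side-by-side comparison
table. The timings are not like-for-like: the Spark training time
(53.2 s) covers 3-fold cross-validation over 2 hyperparameter values
(6 cross-validation fits, plus a final refit of the best model if
Spark's `CrossValidator` was used), whereas the scikit-learn figure is
the fit time of the best model alone, with one fit per C value and no
cross-validation.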
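
To put the two timings on a more comparable footing, they can be normalised per model fit. The sketch below uses only the figures reported in this notebook (53.2 s for Spark; 0.19 s and 0.17 s for the two scikit-learn fits). The extra refit counted for Spark is an assumption that holds if the training script used `CrossValidator`, and the Spark total may also include per-fold evaluation time.

```python
SPARK_TOTAL_S = 53.2                 # from the training script output
CV_FOLDS, GRID_POINTS = 3, 2
SKLEARN_FIT_TIMES_S = [0.19, 0.17]   # per-C fit times printed above

cv_fits = CV_FOLDS * GRID_POINTS     # fits performed during cross-validation
spark_fits = cv_fits + 1             # assumption: plus one final refit
spark_per_fit_s = SPARK_TOTAL_S / spark_fits

sklearn_total_s = sum(SKLEARN_FIT_TIMES_S)
sklearn_per_fit_s = sklearn_total_s / len(SKLEARN_FIT_TIMES_S)

print(f"Spark   : {spark_fits} fits, {spark_per_fit_s:.2f} s per fit")
print(f"sklearn : {len(SKLEARN_FIT_TIMES_S)} fits, {sklearn_per_fit_s:.2f} s per fit")
print(f"Per-fit ratio (Spark / sklearn): {spark_per_fit_s / sklearn_per_fit_s:.0f}x")
```

Even per fit, Spark remains far slower at this data size, which matches the overhead argument made in the discussion.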

In [6]:
# -- Load Spark results (no retraining) -----------------------
RESULTS_PATH = r"C:/Users/johnu/Desktop/BigDataProject/data/model_results.json"
with open(RESULTS_PATH) as f:
    spark_results = json.load(f)

# Find Spark LR result
spark_lr = next(r for r in spark_results if r["model"] == "Logistic Regression")

# -- Build comparison table -----------------------------------
comparison = pd.DataFrame([
    {
        "Framework": "PySpark MLlib",
        "Model": "Logistic Regression",
        "Accuracy": spark_lr["accuracy"],
        "Weighted_F1": spark_lr["f1_weighted"],
        "Train_Time_s": 53.2,  # from training script output
        "CV_Folds": 3,
        "Grid_Points": 2,
        "Distributed": "Yes (local[4])",
    },
    {
        "Framework": "scikit-learn",
        "Model": "Logistic Regression",
        "Accuracy": round(sklearn_acc, 4),
        "Weighted_F1": round(sklearn_f1, 4),
        "Train_Time_s": round(best_train_time, 2),
        "CV_Folds": "N/A (single fit)",
        "Grid_Points": 2,
        "Distributed": "No (single core)",
    },
])

print("=" * 72)
print("  SPARK vs SCIKIT-LEARN COMPARISON")
print("=" * 72)
print(comparison.to_string(index=False))

  SPARK vs SCIKIT-LEARN COMPARISON
    Framework               Model  Accuracy  Weighted_F1  Train_Time_s         CV_Folds  Grid_Points      Distributed
PySpark MLlib Logistic Regression    0.8792       0.8749         53.20                3            2   Yes (local[4])
 scikit-learn Logistic Regression    0.9001       0.9006          0.17 N/A (single fit)            2 No (single core)


In [7]:
# -- Save comparison for Tableau export -----------------------
import os
os.makedirs(r"C:/Users/johnu/Desktop/BigDataProject/results", exist_ok=True)

comparison.to_csv(
    r"C:/Users/johnu/Desktop/BigDataProject/results/sklearn_comparison.csv",
    index=False,
)
print("Saved to results/sklearn_comparison.csv")

Saved to results/sklearn_comparison.csv


## 4. Discussion: Why PySpark?

### Observations

On this **5,447-row feature dataset**, scikit-learn trains faster because
the data fits comfortably in memory and Spark's JVM startup, task
serialisation, and shuffle overhead dominate at small scale.

However, the raw PAMAP2 dataset is **1.6 GB / 3.85 million rows**.
The feature extraction step (sliding windows) is itself a distributed
operation that benefits from Spark's parallelism. At production scale:

| Factor | scikit-learn | PySpark |
|--------|--------------|---------|
| **Data size limit** | Bound by one machine's RAM (~8-16 GB) | Scales to TB+ across a cluster |
| **Feature engineering** | In memory, on a single node | Distributed window aggregation |
| **Cross-validation** | Sequential by default; `n_jobs` parallelises across local cores | Parallelisable across executors |
| **Pipeline integration** | `Pipeline` over in-memory arrays; data loading is a separate step | Unified Pipeline with stages |
| **Model serving** | Pickle/joblib | MLlib model save/load |
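
As a rough single-node picture of the windowed feature extraction mentioned above, the sketch below slides a fixed-length window over one signal and emits summary statistics per window. The window length, step, and choice of mean/std features are illustrative assumptions; the project's actual window parameters are not given in this notebook.

```python
import statistics


def window_features(signal, window_size, step):
    """Return one feature dict per full window over the signal."""
    features = []
    for start in range(0, len(signal) - window_size + 1, step):
        window = signal[start:start + window_size]
        features.append({
            "start": start,
            "mean": statistics.fmean(window),
            "std": statistics.pstdev(window),
        })
    return features


# Illustrative signal only (not PAMAP2 data)
signal = [float(i) for i in range(10)]
rows = window_features(signal, window_size=4, step=2)
for row in rows:
    print(row)
```

Each window is independent of the others, which is why this step parallelises well once the raw data is partitioned, for example by subject.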

### Conclusion

For the **extracted features** (5K rows), scikit-learn is viable and,
in this run, both faster (0.17 s vs 53.2 s) and slightly more accurate
(0.9001 vs 0.8792). But the **end-to-end pipeline**, from 1.6 GB of raw
sensor data through windowed feature extraction to cross-validated
model selection, is where distributed processing pays off, and
single-node scikit-learn does not offer it. PySpark is justified for
the full pipeline scope, and the scalability benchmarks (notebook 05)
report meaningful speedup with additional cores.