## Data Mining and Visualization

### *Project 2 - Activity Detection using Embedded Machine Learning*

### By Shaheer Khan

------------------------------------------------------------------------------------------

The task is to build a machine-learning model that turns dozens of raw “wiggles-in-time” (accelerometer + gyroscope CSV files) into a small program that can look at those wiggles and say “this person is walking / cycling / sitting”.


This dataset was collected for use in the Edge Impulse documentation. Here it is used to build an Embedded Machine Learning - Activity Detection project in several steps.

------------------------------------------------------------------------------------------


### 1 - *Load and Peek* 
----------------------------------------------------------------------------------

In [58]:
from pathlib import Path
import pandas as pd
import numpy as np
import json
from math import sqrt
from collections import Counter
from sklearn.model_selection import GroupKFold
import scipy.stats as st
import itertools  

In [3]:
DATA_DIR = Path("/Users/shaheer/documents/semester 5/dmv/project 2/archive")     

if not DATA_DIR.exists():
    raise FileNotFoundError(f"Folder {DATA_DIR} not found. "
                            "Point DATA_DIR to your unzipped archive.")

def inspect_dataset(root: Path):
    rows = []
    for rec_folder in sorted(root.iterdir()):
        if not rec_folder.is_dir():
            continue
        acc = list(rec_folder.glob("accelerometer.csv"))
        gyr = list(rec_folder.glob("gyroscope.csv"))
        rows.append({
            "recording": rec_folder.name,
            "acc_files": len(acc),
            "gyro_files": len(gyr)
        })
    return pd.DataFrame(rows)

df_overview = inspect_dataset(DATA_DIR)
display(df_overview.head())      # shows first few rows
print("\nSummary:\n", df_overview.describe(include='all'))
missing = df_overview[(df_overview.acc_files != 1) | (df_overview.gyro_files != 1)]
if not missing.empty:
    print("\nWARNING: some recordings are incomplete:\n", missing)
else:
    print("\nAll recordings have exactly one acc + one gyro file.")

| | recording | acc_files | gyro_files |
|---|---|---|---|
| 0 | Cycling-2023-09-14_06-22-31 | 1 | 1 |
| 1 | Cycling-2023-09-14_06-33-47 | 1 | 1 |
| 2 | Cycling-2023-09-14_06-47-00 | 1 | 1 |
| 3 | Cycling-2023-09-16_07-43-07 | 1 | 1 |
| 4 | Cycling-2023-09-16_09-25-09 | 1 | 1 |



Summary:
                           recording  acc_files  gyro_files
count                            12       12.0        12.0
unique                           12        NaN         NaN
top     Cycling-2023-09-14_06-22-31        NaN         NaN
freq                              1        NaN         NaN
mean                            NaN        1.0         1.0
std                             NaN        0.0         0.0
min                             NaN        1.0         1.0
25%                             NaN        1.0         1.0
50%                             NaN        1.0         1.0
75%                             NaN        1.0         1.0
max                             NaN        1.0         1.0

All recordings have exactly one acc + one gyro file.


### 2 - *Read & Fuse each pair* 
----------------------------------------------------------------------------------

In [6]:
# STEP 1 - Fuse accelerometer + gyroscope CSVs → one 6-axis file per recording
 
RAW_DIR   = Path("/Users/shaheer/documents/semester 5/dmv/project 2/archive")         
FUSED_DIR = Path("/Users/shaheer/documents/semester 5/dmv/project 2/fused")
FUSED_DIR.mkdir(exist_ok=True)

def load_and_standardise(path, sensor_type):
    """
    Reads a CSV and returns a dataframe with:
    ['timestamp', 'ax', 'ay', 'az']  or  ['timestamp', 'gx', 'gy', 'gz']
    """
    df = pd.read_csv(path)
    
    # Detect likely timestamp column
    ts_col = [c for c in df.columns if 'time' in c.lower()][0]
    df = df.rename(columns={ts_col: 'timestamp'})
    
    # Rename axis columns to ax/ay/az or gx/gy/gz
    axis_cols = [c for c in df.columns if c != 'timestamp'][:3]
    new_names = [f"{sensor_type[0]}{axis}" for axis in ('x', 'y', 'z')]
    df = df.rename(columns=dict(zip(axis_cols, new_names)))
    
    # Keep only needed columns
    return df[['timestamp'] + new_names]

for folder in sorted(RAW_DIR.iterdir()):
    if not folder.is_dir():
        continue
    
    acc_file  = next(folder.glob("accelerometer.csv"),  None)
    gyro_file = next(folder.glob("gyroscope.csv"), None)
    if not acc_file or not gyro_file:
        print(f"Skipping {folder.name}: missing sensor file.")
        continue
    
    # 1. Load & clean
    df_acc  = load_and_standardise(acc_file,  'acc')
    df_gyro = load_and_standardise(gyro_file, 'gyro')
    
    # 2. Normalise timestamps (start at 0 ms)
    t0 = min(df_acc.timestamp.min(), df_gyro.timestamp.min())
    df_acc['timestamp']  -= t0
    df_gyro['timestamp'] -= t0
    
    # 3. Merge on timestamp (outer join)
    fused = pd.merge(df_acc, df_gyro, on='timestamp', how='outer').sort_values('timestamp')
    
    # 4. Save
    out_path = FUSED_DIR / f"{folder.name}_fused.csv"
    fused.to_csv(out_path, index=False)
    
    # 5. Quick log
    print(f"{folder.name}: {len(fused):5d} samples: {out_path.name}")

print("\nAll done — fused files live in ./fused/")

Cycling-2023-09-14_06-22-31: 71848 samples: Cycling-2023-09-14_06-22-31_fused.csv
Cycling-2023-09-14_06-33-47: 170718 samples: Cycling-2023-09-14_06-33-47_fused.csv
Cycling-2023-09-14_06-47-00: 144458 samples: Cycling-2023-09-14_06-47-00_fused.csv
Cycling-2023-09-16_07-43-07: 408516 samples: Cycling-2023-09-16_07-43-07_fused.csv
Cycling-2023-09-16_09-25-09: 520472 samples: Cycling-2023-09-16_09-25-09_fused.csv
Cycling-2023-10-18_06-36-17: 212000 samples: Cycling-2023-10-18_06-36-17_fused.csv
Cycling-2023-10-18_06-51-26: 135721 samples: Cycling-2023-10-18_06-51-26_fused.csv
Sitting-2023-09-14_08-37-45: 185402 samples: Sitting-2023-09-14_08-37-45_fused.csv
Sitting-2023-09-14_09-11-15: 209622 samples: Sitting-2023-09-14_09-11-15_fused.csv
Sitting-2023-10-18_09-05-37: 117012 samples: Sitting-2023-10-18_09-05-37_fused.csv
Walking-2023-09-14_21-51-59: 115810 samples: Walking-2023-09-14_21-51-59_fused.csv
Walking-2023-09-16_18-14-40: 484348 samples: Walking-2023-09-16_18-14-40_fused.csv

All done — fused files live in ./fused/
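
The outer join above keeps every accelerometer and gyroscope timestamp as its own row, so a row carries all six axes only when both sensors happen to report at an identical timestamp. One alternative, sketched here on made-up timestamps and not part of the original pipeline, is nearest-timestamp alignment with `pandas.merge_asof`. The 5 ms tolerance and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): the two sensors tick at slightly different times (ms).
acc = pd.DataFrame({"timestamp": [0.0, 20.0, 40.0, 60.0],
                    "ax": [0.1, 0.2, 0.3, 0.4]})
gyro = pd.DataFrame({"timestamp": [1.0, 21.0, 39.0, 80.0],
                     "gx": [1.0, 2.0, 3.0, 4.0]})

# Outer join on exact timestamps: nothing coincides, so every row is half empty.
outer = pd.merge(acc, gyro, on="timestamp", how="outer").sort_values("timestamp")

# Nearest-neighbour join: each accelerometer row takes the closest gyroscope
# sample within 5 ms (tolerance chosen for illustration only).
aligned = pd.merge_asof(acc, gyro, on="timestamp",
                        direction="nearest", tolerance=5.0)

print("outer rows:", len(outer),
      "| rows with a NaN:", int(outer.isna().any(axis=1).sum()))
print(aligned)
```

In the toy frame the last accelerometer sample has no gyroscope reading within tolerance, so its `gx` stays NaN instead of borrowing a distant value.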

In [6]:
# STEP 2 - Resample fused CSVs at 50 Hz (every 20 ms) and interpolate gaps

FUSED_DIR      = Path("/Users/shaheer/documents/semester 5/dmv/project 2/fused")
RESAMPLED_DIR  = Path("/Users/shaheer/documents/semester 5/dmv/project 2/resampled")
RESAMPLED_DIR.mkdir(exist_ok=True)

TARGET_HZ      = 50                         # 50 hz
PERIOD_MS      = 1000 / TARGET_HZ           # 20 ms

# Helper: detect unit & convert timestamp column to milliseconds
def convert_to_ms(ts_series):
    """
    Return a NumPy array of timestamps in milliseconds starting at 0.
    Auto-detects whether the raw unit is ns, µs, ms, or s.
    """
    ts = ts_series.to_numpy().astype("float64")
    ts -= ts[0]                       # normalise start = 0
    median_step = np.median(np.diff(ts))

    if median_step > 1_000_000:       # nanoseconds → divide by 1 000 000
        ts = ts / 1_000_000.0
    elif median_step > 1_000:         # microseconds → divide by 1 000
        ts = ts / 1_000.0
    elif median_step < 1:             # seconds → multiply by 1 000
        ts = ts * 1_000.0
    # else already in milliseconds
    return ts

# Resample a single fused file 
def resample_file(path: Path):
    df  = pd.read_csv(path)
    df["timestamp"] = convert_to_ms(df["timestamp"])

    # Uniform 20 ms grid
    new_ts = np.arange(
        0, df["timestamp"].iloc[-1] + PERIOD_MS, PERIOD_MS, dtype="float32"
    )

    # Re-index & interpolate
    df = (
        df.set_index("timestamp")
          .reindex(new_ts)
          .interpolate("linear")
          .reset_index()
          .rename(columns={"index": "timestamp"})
    )
    # Trim any rows that still contain NaNs (lead/trail gaps)
    df = df.dropna(how="any")

    # Save
    out_path = RESAMPLED_DIR / path.name.replace("_fused", "_resampled")
    df.to_csv(out_path, index=False)

    # Log line (NaN rows are counted after the dropna above, so this always reports 0)
    print(
        f"{path.stem:40s} → {len(df):5d} rows  "
        f"(gaps filled: {df.isna().any(axis=1).sum()})"
    )

# Loop over all fused files 
for fused_csv in sorted(FUSED_DIR.glob("*_fused.csv")):
    resample_file(fused_csv)

print("\nAll done — uniformly-sampled files are in", RESAMPLED_DIR.resolve())

Cycling-2023-09-14_06-22-31_fused        →     0 rows  (gaps filled: 0)
Cycling-2023-09-14_06-33-47_fused        → 19903 rows  (gaps filled: 0)
Cycling-2023-09-14_06-47-00_fused        →  6562 rows  (gaps filled: 0)
Cycling-2023-09-16_07-43-07_fused        → 41026 rows  (gaps filled: 0)
Cycling-2023-09-16_09-25-09_fused        → 52079 rows  (gaps filled: 0)
Cycling-2023-10-18_06-36-17_fused        → 14310 rows  (gaps filled: 0)
Cycling-2023-10-18_06-51-26_fused        → 12086 rows  (gaps filled: 0)
Sitting-2023-09-14_08-37-45_fused        → 21066 rows  (gaps filled: 0)
Sitting-2023-09-14_09-11-15_fused        →  9308 rows  (gaps filled: 0)
Sitting-2023-10-18_09-05-37_fused        →     0 rows  (gaps filled: 0)
Walking-2023-09-14_21-51-59_fused        →  9220 rows  (gaps filled: 0)
Walking-2023-09-16_18-14-40_fused        → 58618 rows  (gaps filled: 0)

All done — uniformly-sampled files are in /Users/shaheer/documents/semester 5/dmv/project 2/resampled
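
Two recordings end up with 0 rows in the log above. A likely contributor is that `reindex(new_ts)` keeps only rows whose timestamp already equals a grid value exactly, so most measured samples never reach the interpolation step. The sketch below shows one common alternative under stated assumptions: interpolate on the union of measured and grid timestamps, then keep the grid rows. `resample_uniform` is a hypothetical helper, not the notebook's `resample_file`, and the toy signals are made up.

```python
import numpy as np
import pandas as pd

def resample_uniform(fused, period_ms=20.0):
    """Interpolate a fused frame (timestamps in ms) onto a uniform grid.

    Hypothetical helper for illustration; not the notebook's resample_file().
    """
    # One row per timestamp, sorted; duplicate timestamps are averaged.
    df = fused.groupby("timestamp").mean().sort_index()
    grid = np.arange(0.0, df.index[-1] + period_ms / 2, period_ms)

    # Interpolate on the union of measured and grid timestamps, weighting by
    # time distance, then keep only the grid rows. limit_area="inside" avoids
    # extrapolating before the first or after the last real sample.
    out = (df.reindex(df.index.union(grid))
             .interpolate(method="index", limit_area="inside")
             .reindex(grid))
    out.index.name = "timestamp"
    return out.dropna(how="any").reset_index()

# Toy fused frame: the sensors never share a timestamp (values are made up).
acc = pd.DataFrame({"timestamp": [0.0, 10.0, 30.0, 50.0]})
acc["ax"] = acc["timestamp"]            # ax(t) = t
gyro = pd.DataFrame({"timestamp": [5.0, 25.0, 45.0]})
gyro["gx"] = 2 * gyro["timestamp"]      # gx(t) = 2t
fused = pd.merge(acc, gyro, on="timestamp", how="outer").sort_values("timestamp")

resampled = resample_uniform(fused)
print(resampled)
```

The grid row at t = 0 is dropped because no gyroscope sample precedes it; the rows at 20 ms and 40 ms recover the linear toy signals.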


### 3 - *Window & Feature* 
----------------------------------------------------------------------------------

In [18]:
# STEP 3 – windowing (2 s, 50 % overlap) + classic statistical features

# parameters 
RESAMPLED_DIR   = Path("/Users/shaheer/documents/semester 5/dmv/project 2/resampled")
WINDOW_SEC      = 2               # 2-second windows
TARGET_HZ       = 50              # resampled to 50 Hz
WIN_SAMPLES     = WINDOW_SEC * TARGET_HZ      # 100
STRIDE_SAMPLES  = WIN_SAMPLES // 2            # 50  (50 % overlap)
AXES            = ["ax", "ay", "az", "gx", "gy", "gz"]
FEATURE_FUNCS   = {
    "mean": np.mean,
    "std":  np.std,
    "min":  np.min,
    "max":  np.max,
    "rms":  lambda x: sqrt(np.mean(np.square(x)))
}

# helper: compute 30-feature vector from a window (shape 100×6) 
def extract_features(window_ndarray):
    feats = []
    for i, axis in enumerate(AXES):          # axis order ax…gz
        col = window_ndarray[:, i]
        for name, fn in FEATURE_FUNCS.items():
            feats.append(fn(col))
    return feats                             # length = 6 axes × 5 funcs = 30

# main loop 
X, y, groups = [], [], []                # add groups list here

for csv_path in sorted(RESAMPLED_DIR.glob("*_resampled.csv")):
    df = pd.read_csv(csv_path)
    if df.empty:
        continue

    activity = csv_path.stem.split("-")[0]

    data = df[AXES].to_numpy(dtype=np.float32)
    for start in range(0, len(data) - WIN_SAMPLES + 1, STRIDE_SAMPLES):
        window = data[start : start + WIN_SAMPLES]
        feats  = extract_features(window)
        X.append(feats)
        y.append(activity)
        groups.append(csv_path.stem)     # tag each window

# convert & save 
X       = np.asarray(X, dtype=np.float32)
y       = np.asarray(y)
groups  = np.asarray(groups)            # to numpy

np.save("X.npy", X)
np.save("y.npy", y)
np.save("groups.npy", groups)           # save groups

meta = {"feature_order": [f"{axis}_{stat}"
                          for axis in AXES
                          for stat in FEATURE_FUNCS]}
with open("features_meta.json", "w") as f:
    json.dump(meta, f, indent=2)

print("Step 3 done → shapes:",
      "X", X.shape, ", y", y.shape, ", groups", groups.shape)

Step 3 done → shapes: X (4870, 30) , y (4870,) , groups (4870,)
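
As a sanity check on the window total, the count can be reproduced from the row counts printed by the resampling step. This is a small sketch in plain Python; `n_windows` is a hypothetical helper that mirrors the `range(...)` used in the windowing loop, and the row counts are copied from the log above.

```python
def n_windows(n_rows, win=100, stride=50):
    """Count the window starts produced by range(0, n_rows - win + 1, stride)."""
    return len(range(0, n_rows - win + 1, stride))

# Rows per resampled recording, copied from the resampling log above.
rows = {
    "Cycling-2023-09-14_06-22-31": 0,
    "Cycling-2023-09-14_06-33-47": 19903,
    "Cycling-2023-09-14_06-47-00": 6562,
    "Cycling-2023-09-16_07-43-07": 41026,
    "Cycling-2023-09-16_09-25-09": 52079,
    "Cycling-2023-10-18_06-36-17": 14310,
    "Cycling-2023-10-18_06-51-26": 12086,
    "Sitting-2023-09-14_08-37-45": 21066,
    "Sitting-2023-09-14_09-11-15": 9308,
    "Sitting-2023-10-18_09-05-37": 0,
    "Walking-2023-09-14_21-51-59": 9220,
    "Walking-2023-09-16_18-14-40": 58618,
}

per_class = {}
for name, n in rows.items():
    label = name.split("-")[0]
    per_class[label] = per_class.get(label, 0) + n_windows(n)

total = sum(per_class.values())
print(per_class, "| total:", total)   # total: 4870
```

The two empty recordings contribute no windows, so only 10 of the 12 recordings are represented in `X`.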


In [19]:
# STEP 4 – Checking Class Balance

# load 
DATA_DIR = Path("/Users/shaheer/documents/semester 5/dmv/project 2")
X       = np.load(DATA_DIR / "X.npy")
y       = np.load(DATA_DIR / "y.npy")
groups  = np.load(DATA_DIR / "groups.npy")   

print("Shapes →  X:", X.shape, " |  y:", y.shape, " |  groups:", groups.shape)

# class-balance summary 
counts = Counter(y)
total  = len(y)
balance_df = (pd.DataFrame({"Activity": counts.keys(),
                            "Windows": counts.values(),
                            "Percent": [round(100*c/total,2) for c in counts.values()]})
                .sort_values("Windows", ascending=False)
                .reset_index(drop=True))
display(balance_df)
print("\nTotal windows:", total)


Shapes →  X: (4870, 30)  |  y: (4870,)  |  groups: (4870,)


| | Activity | Windows | Percent |
|---|---|---|---|
| 0 | Cycling | 2911 | 59.77 |
| 1 | Walking | 1354 | 27.8 |
| 2 | Sitting | 605 | 12.42 |



Total windows: 4870
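
The Random Forest cells in this notebook pass `class_weight="balanced"`. scikit-learn documents that mode as `n_samples / (n_classes * count)` per class. The sketch below applies the formula to the full-dataset counts above to show the scale of the correction. Inside cross-validation the weights are recomputed from each training fold, so the actual values vary by fold.

```python
# Window counts from the class-balance table above.
counts = {"Cycling": 2911, "Walking": 1354, "Sitting": 605}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: rarer classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(f"{label:8s} weight = {w:.3f}")
```

With these weights every class contributes the same total weight (`n_samples / n_classes`), which is the sense in which the classes are "balanced".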


### 4 - *Quick baseline with a Random Forest* 
----------------------------------------------------------------------------------

In [21]:
# STEP 5 – Baseline classifier: Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib
import warnings


warnings.filterwarnings("ignore")   # keep output clean

# load feature matrix & labels
DATA_DIR = Path("/Users/shaheer/documents/semester 5/dmv/project 2")
X       = np.load(DATA_DIR / "X.npy")
y       = np.load(DATA_DIR / "y.npy")
groups  = np.load(DATA_DIR / "groups.npy")

gkf = GroupKFold(n_splits=5)
fold_scores = []

print("Grouped 5-fold cross-validation (each fold = unseen recordings)\n")

for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups), 1):
    rf = RandomForestClassifier(
            n_estimators=400,
            class_weight="balanced",
            random_state=fold)          # different seed per fold for variety

    rf.fit(X[train_idx], y[train_idx])
    y_pred = rf.predict(X[test_idx])

    rpt = classification_report(y[test_idx], y_pred, digits=3, output_dict=True)
    fold_scores.append(rpt["accuracy"])

    print(f"Fold {fold}  accuracy = {rpt['accuracy']:.3f}")
    # Pretty-print per-class F1
    per_class = {lbl: round(met['f1-score'], 3)
                 for lbl, met in rpt.items() if lbl in np.unique(y)}
    print("          per-class F1:", per_class)

print("\nMean grouped-fold accuracy:", round(np.mean(fold_scores), 3))

# train on all data and save the model
final_rf = RandomForestClassifier(
              n_estimators=400,
              class_weight="balanced",
              random_state=42)
final_rf.fit(X, y)
joblib.dump(final_rf, DATA_DIR / "activity_rf.pkl")
print("\nSaved final model →", (DATA_DIR / 'activity_rf.pkl').resolve())

Grouped 5-fold cross-validation (each fold = unseen recordings)

Fold 1  accuracy = 0.000
          per-class F1: {'Cycling': 0.0, 'Walking': 0.0}
Fold 2  accuracy = 0.226
          per-class F1: {'Cycling': 0.369, 'Walking': 0.0}
Fold 3  accuracy = 0.700
          per-class F1: {'Cycling': 0.823, 'Walking': 0.0}
Fold 4  accuracy = 0.188
          per-class F1: {'Cycling': 0.317, 'Sitting': 0.0, 'Walking': 0.0}
Fold 5  accuracy = 0.788
          per-class F1: {'Cycling': 0.882, 'Walking': 0.0}

Mean grouped-fold accuracy: 0.38

Saved final model → /Users/shaheer/documents/semester 5/dmv/project 2/activity_rf.pkl
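
Several folds above report an F1 of 0.0 for a class. One way to investigate is to list, for each fold, which classes are in the test set and which of those are missing from training. The sketch below rebuilds only the labels and groups from the per-recording window counts (derived from the resampling log) and needs no features. It assumes the same `GroupKFold(n_splits=5)` setup as the cell above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Windows per usable recording, derived from the resampled row counts.
windows = {
    "Cycling-2023-09-14_06-33-47": 397, "Cycling-2023-09-14_06-47-00": 130,
    "Cycling-2023-09-16_07-43-07": 819, "Cycling-2023-09-16_09-25-09": 1040,
    "Cycling-2023-10-18_06-36-17": 285, "Cycling-2023-10-18_06-51-26": 240,
    "Sitting-2023-09-14_08-37-45": 420, "Sitting-2023-09-14_09-11-15": 185,
    "Walking-2023-09-14_21-51-59": 183, "Walking-2023-09-16_18-14-40": 1171,
}

groups = np.repeat(list(windows), list(windows.values()))
y = np.array([g.split("-")[0] for g in groups])
X = np.zeros((len(y), 1))            # placeholder; the split ignores feature values

folds = list(GroupKFold(n_splits=5).split(X, y, groups))
for fold, (tr, te) in enumerate(folds, 1):
    # Group-aware splitting: no recording appears on both sides.
    assert set(groups[tr]).isdisjoint(groups[te])
    test_classes = sorted(set(y[te]))
    unseen = sorted(set(y[te]) - set(y[tr]))
    print(f"Fold {fold}: test classes {test_classes}, "
          f"absent from training: {unseen or 'none'}")
```

With only 10 recordings and two each for Sitting and Walking, a fold's test set usually contains one or two classes, so per-fold accuracy swings widely.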


### Improving Results 
### Path A
### 3 (Redo) - *Build richer features & Evaluate* 
----------------------------------------------------------------------------------

In [25]:
RESAMPLED_DIR  = Path("/Users/shaheer/documents/semester 5/dmv/project 2/resampled")
WINDOW_SEC, HZ = 2, 50
WIN_SAMPLES    = WINDOW_SEC * HZ
STRIDE         = WIN_SAMPLES // 2

# Original 6 axes + magnitudes
AXES = ["ax","ay","az","gx","gy","gz"]

#  helper functions 
def band_energy(sig, lo, hi):
    fft   = np.fft.rfft(sig * np.hanning(len(sig)))
    freqs = np.fft.rfftfreq(len(sig), d=1/HZ)
    band  = (freqs >= lo) & (freqs < hi)
    return float(np.sum(np.abs(fft[band])**2) / band.sum())

def feats_1d(x):
    return [
        np.mean(x), np.std(x), np.median(x),
        np.percentile(x,75)-np.percentile(x,25),   # IQR
        np.ptp(x),                                 # range
        st.skew(x), st.kurtosis(x,fisher=False),
        band_energy(x,0,3), band_energy(x,3,6)     # two cadence bands
    ]

# build windows 
X_enriched, y, groups = [], [], []

for csv in sorted(RESAMPLED_DIR.glob("*_resampled.csv")):
    df = pd.read_csv(csv)
    if df.empty:
        continue

    # add magnitudes
    df["acc_mag"]  = np.sqrt((df[["ax","ay","az"]]**2).sum(axis=1))
    df["gyro_mag"] = np.sqrt((df[["gx","gy","gz"]]**2).sum(axis=1))
    sig_cols = AXES + ["acc_mag","gyro_mag"]          # 8 signals
    data     = df[sig_cols].to_numpy(np.float32)

    label = csv.stem.split("-")[0]

    for start in range(0, len(data)-WIN_SAMPLES+1, STRIDE):
        win = data[start:start+WIN_SAMPLES]
        row = []
        for i in range(win.shape[1]):                 # 8 signals
            row.extend(feats_1d(win[:, i]))
        X_enriched.append(row)
        y.append(label)
        groups.append(csv.stem)

X_enriched = np.asarray(X_enriched, np.float32)
y          = np.asarray(y)
groups     = np.asarray(groups)

print("Enriched feature matrix shape:", X_enriched.shape)

Enriched feature matrix shape: (4870, 72)
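
To illustrate what the two band-energy features respond to, the sketch below runs the same `band_energy` function on two synthetic sine waves. The 2 Hz and 4.5 Hz test frequencies are assumptions chosen to fall inside each band; they are not measured cadences from the dataset.

```python
import numpy as np

HZ = 50  # sampling rate of the resampled signals

def band_energy(sig, lo, hi):
    """Mean spectral power of a Hann-windowed signal in [lo, hi) Hz."""
    fft   = np.fft.rfft(sig * np.hanning(len(sig)))
    freqs = np.fft.rfftfreq(len(sig), d=1/HZ)
    band  = (freqs >= lo) & (freqs < hi)
    return float(np.sum(np.abs(fft[band])**2) / band.sum())

t = np.arange(100) / HZ                      # one 2-second window
slow = np.sin(2 * np.pi * 2.0 * t)           # synthetic 2 Hz oscillation
fast = np.sin(2 * np.pi * 4.5 * t)           # synthetic 4.5 Hz oscillation

for name, sig in [("2.0 Hz", slow), ("4.5 Hz", fast)]:
    print(f"{name}: 0-3 Hz = {band_energy(sig, 0, 3):.3g}, "
          f"3-6 Hz = {band_energy(sig, 3, 6):.3g}")
```

Each sine concentrates its energy in the band containing its frequency. With a 100-sample window the frequency resolution is 0.5 Hz, so each band averages six FFT bins.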


In [26]:
# save under new file names (y.npy is rewritten with the same labels)

np.save("X_enriched.npy", X_enriched)
np.save("y.npy",            y)            
np.save("groups_enriched.npy", groups)


In [27]:
# grouped 5-fold RF evaluation 

gkf = GroupKFold(n_splits=5)
fold_acc = []
for fold,(tr,te) in enumerate(gkf.split(X_enriched, y, groups),1):
    rf = RandomForestClassifier(n_estimators=400,
                                class_weight="balanced",
                                random_state=fold)
    rf.fit(X_enriched[tr], y[tr])
    y_pred = rf.predict(X_enriched[te])
    rpt = classification_report(y[te], y_pred, digits=3, output_dict=True)
    fold_acc.append(rpt["accuracy"])
    print(f"Fold {fold}  accuracy {rpt['accuracy']:.3f}")

print("\nMean grouped-recording accuracy:",
      round(float(np.mean(fold_acc)),3))

# save final model 
rf_final = RandomForestClassifier(n_estimators=400,
                                  class_weight="balanced",
                                  random_state=42)
rf_final.fit(X_enriched, y)
import joblib
joblib.dump(rf_final, "activity_rf_enriched.pkl")
print("Saved: activity_rf_enriched.pkl")

Fold 1  accuracy 0.000
Fold 2  accuracy 0.468
Fold 3  accuracy 0.720
Fold 4  accuracy 0.251
Fold 5  accuracy 0.788

Mean grouped-recording accuracy: 0.445
Saved: activity_rf_enriched.pkl


### Improving Results 
### Path B
### 3 (Redo) - *Raw-window pipeline + 1-D CNN* 
----------------------------------------------------------------------------------

In [34]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from tensorflow.keras import layers, models
import os, time

# parameters 
RESAMPLED_DIR  = Path("resampled")
WIN_SEC, HZ    = 2, 50
WIN_SAMPLES    = WIN_SEC * HZ         # 100
STRIDE         = WIN_SAMPLES // 2     # 50 (50 % overlap)
AXES           = ["ax","ay","az","gx","gy","gz"]

# 1. build raw-window dataset 
X_raw, y, groups = [], [], []

for csv in sorted(RESAMPLED_DIR.glob("*_resampled.csv")):
    df = pd.read_csv(csv)
    if df.empty:                   # skip trimmed-out recordings
        continue

    data   = df[AXES].to_numpy(np.float32)
    label  = csv.stem.split("-")[0]

    for start in range(0, len(data) - WIN_SAMPLES+1, STRIDE):
        X_raw.append(data[start:start + WIN_SAMPLES])
        y.append(label)
        groups.append(csv.stem)

X_raw   = np.asarray(X_raw, dtype=np.float32)     # (N,100,6)
y       = np.asarray(y)
groups  = np.asarray(groups)

np.save("X_raw.npy",   X_raw)
np.save("y.npy",       y)            # same labels reused
np.save("groups_raw.npy", groups)

print("Raw window tensor:", X_raw.shape)

# 2. encode labels → integers 
le = LabelEncoder(); y_int = le.fit_transform(y)
n_classes = len(le.classes_)
print("Classes:", list(le.classes_))

Raw window tensor: (4870, 100, 6)
Classes: ['Cycling', 'Sitting', 'Walking']


In [40]:
import tensorflow as tf
from tensorflow.keras import optimizers
from sklearn.model_selection import StratifiedGroupKFold   # needed for step 4 below

# 3. tiny CNN builder fn 
def build_cnn():
    model = models.Sequential([
        layers.Input(shape=(WIN_SAMPLES, len(AXES))),
        layers.Conv1D(32, 5, activation='relu'),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 5, activation='relu'),
        layers.MaxPooling1D(2),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(n_classes, activation='softmax')
    ])
    # use legacy.Adam 
    model.compile(
        optimizer=optimizers.legacy.Adam(1e-3),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model

# 4. 5-fold stratified group CV  (robust)  
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (tr, te) in enumerate(sgkf.split(X_raw, y_int, groups), 1):
    cnn = build_cnn()
    print(f"\n— Fold {fold}: training on {len(tr)} windows —")
    cnn.fit(X_raw[tr], y_int[tr], epochs=12, batch_size=64, verbose=0)
    
    y_pred = cnn.predict(X_raw[te], verbose=0).argmax(1)
    acc    = (y_pred == y_int[te]).mean()            # ← manual accuracy
    fold_scores.append(acc)
    
    rpt = classification_report(
            y_int[te], y_pred,
            labels=np.arange(n_classes),             
            target_names=le.classes_,
            zero_division=0, digits=3)
    
    print(f"   accuracy {acc:.3f}\n{rpt}")

print("\nMean grouped-fold accuracy:", round(float(np.mean(fold_scores)), 3))


# 5. train on full data & save model 
final_cnn = build_cnn()
final_cnn.fit(X_raw, y_int, epochs=12, batch_size=64, verbose=0)
final_cnn.save("activity_cnn.h5")
with open("label_map.json","w") as f: json.dump(le.classes_.tolist(), f)
print("\nSaved final CNN → activity_cnn.h5  and label_map.json")


— Fold 1: training on 4687 windows —
   accuracy 0.000
              precision    recall  f1-score   support

     Cycling      0.000     0.000     0.000       0.0
     Sitting      0.000     0.000     0.000       0.0
     Walking      0.000     0.000     0.000     183.0

   micro avg      0.000     0.000     0.000     183.0
   macro avg      0.000     0.000     0.000     183.0
weighted avg      0.000     0.000     0.000     183.0


— Fold 2: training on 4630 windows —
   accuracy 1.000
              precision    recall  f1-score   support

     Cycling      1.000     1.000     1.000       240
     Sitting      0.000     0.000     0.000         0
     Walking      0.000     0.000     0.000         0

   micro avg      1.000     1.000     1.000       240
   macro avg      0.333     0.333     0.333       240
weighted avg      1.000     1.000     1.000       240


— Fold 3: training on 3514 windows —
   accuracy 0.000
              precision    recall  f1-score   support

     Cycling   
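
For reference, the size of the CNN defined in `build_cnn` can be worked out by hand. The sketch below is plain Python (no TensorFlow) and assumes the Keras defaults used above: `'valid'` padding and stride 1 for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`.

```python
def conv1d_len(n, kernel):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return n - kernel + 1

def pool_len(n, size):
    """Output length of max pooling with stride == pool size."""
    return n // size

n = 100                                   # WIN_SAMPLES
n = pool_len(conv1d_len(n, 5), 2)         # Conv1D(32, 5) + pool: 100 -> 96 -> 48
n = pool_len(conv1d_len(n, 5), 2)         # Conv1D(64, 5) + pool: 48 -> 44 -> 22
flat = n * 64                             # Flatten: 22 steps x 64 channels

params = {
    "conv1":  5 * 6 * 32 + 32,            # kernel x in_channels x filters + biases
    "conv2":  5 * 32 * 64 + 64,
    "dense":  flat * 64 + 64,
    "output": 64 * 3 + 3,                 # three activity classes
}
print("flattened length:", flat)                   # 1408
print("total parameters:", sum(params.values()))   # 101667
```

Most of the parameters sit in the first dense layer, which is worth keeping in mind for a model trained on fewer than 5 000 windows.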

### Improving Results 
### Path C 
### Cell A - *Build 186-feature dataset*
---------------------------------------------------------------------------------

In [42]:
# Attempt 3 – spectral DSP features (186 dims = 6 axes × 31 bins)  … run once

from pathlib import Path
import numpy as np, pandas as pd, scipy.signal as sg, json

In [54]:
RESAMPLED_DIR = Path("/Users/shaheer/documents/semester 5/dmv/project 2/resampled")
HZ_IN, HZ_OUT = 50, 5
DECIM         = HZ_IN // HZ_OUT          # 10
FFT_LEN       = 64                       # as in EI GUI
B, A = sg.butter(6, 2.68/(HZ_IN/2), btype='low')

WIN_SEC, STRIDE_SEC = 2, 1               # 50 % overlap
WIN_SAMPLES  = WIN_SEC   * HZ_IN         # 100
STRIDE_SAMP  = STRIDE_SEC * HZ_IN        # 50
AXES         = ["ax","ay","az","gx","gy","gz"]

def spectral_feat(window_50hz):
    feats = []
    for i in range(6):
        sig = window_50hz[:, i].copy()
        if i < 3:                       # scale accel axes
            sig *= 0.04
        sig_f = sg.lfilter(B, A, sg.decimate(sig, DECIM, ftype='fir', zero_phase=True))
        spec = np.abs(np.fft.rfft(sig_f, n=FFT_LEN))[:31]   # keep 31 bins
        feats.extend(np.log(spec + 1e-6))
    return feats

CHUNK_SEC   = 30
CHUNK_SAMP  = CHUNK_SEC * HZ_IN

X186, y, groups = [], [], []

for csv in sorted(RESAMPLED_DIR.glob("*_resampled.csv")):
    df = pd.read_csv(csv)
    if df.empty: continue
    label = csv.stem.split("-")[0]
    data  = df[AXES].to_numpy(np.float32)

    for start in range(0, len(data)-WIN_SAMPLES+1, STRIDE_SAMP):
        win = data[start:start+WIN_SAMPLES]
        X186.append(spectral_feat(win))
        y.append(label)
        chunk_id = f"{csv.stem}_chunk{start//CHUNK_SAMP}"
        groups.append(chunk_id)

X186   = np.asarray(X186, np.float32)
y      = np.asarray(y)
groups = np.asarray(groups)

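np.save("X186.npy", X186); np.save("y.npy", y); np.save("groups186.npy", groups)
with open("dsp_meta.json", "w") as f:
    # 6 axes × 31 bins = 186; take the value from the array instead of hard-coding it
    json.dump({"feature": "log-FFT power", "dims": int(X186.shape[1])}, f)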

print("X186.npy", X186.shape,  "|  labels", np.unique(y, return_counts=True))
print("Total pseudo-recordings:", len(np.unique(groups)))

X186.npy (4870, 186) |  labels (array(['Cycling', 'Sitting', 'Walking'], dtype='<U7'), array([2911,  605, 1354]))
Total pseudo-recordings: 168
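
Both numbers in this output can be reproduced with a short check. The sketch below assumes the constants from the cell above (64-point FFT, 31 bins kept, 30-second chunks, 50-sample stride) and the per-recording window counts derived from the resampling log.

```python
import numpy as np

FFT_LEN, KEEP_BINS, N_AXES = 64, 31, 6
STRIDE, CHUNK_SAMP = 50, 30 * 50          # 1-second stride, 30-second chunks at 50 Hz

# A real FFT of length 64 has 64 // 2 + 1 bins; the cell keeps the first 31.
n_bins = np.fft.rfft(np.zeros(10), n=FFT_LEN).shape[0]
dims = N_AXES * KEEP_BINS

# Windows per usable recording, in the order of the resampling log.
windows = [397, 130, 819, 1040, 285, 240, 420, 185, 183, 1171]

def n_chunks(n_win):
    """Distinct values of start // CHUNK_SAMP over the window starts."""
    last_start = (n_win - 1) * STRIDE
    return last_start // CHUNK_SAMP + 1

total_chunks = sum(n_chunks(w) for w in windows)
print("rfft bins:", n_bins, "| feature dims:", dims, "| chunks:", total_chunks)
```

Chunk ids are assigned by window start, so the last window of one chunk (for example, start 1450, covering samples 1450 to 1549) still overlaps the first window of the next chunk (start 1500). Neighbouring chunks are therefore not fully independent.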


### Cell B - *Evaluating a MLP*
---------------------------------------------------------------------------------

In [56]:
# Attempt 3 eval – MLP on DSP features (auto dims)

import numpy as np, tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from sklearn.model_selection import train_test_split, StratifiedGroupKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

In [55]:
# --- load ---
X   = np.load("X186.npy")        # shape (N, 186)
y   = np.load("y.npy")
grp = np.load("groups186.npy")

le = LabelEncoder(); y_int = le.fit_transform(y)
n_classes  = len(le.classes_)
input_dim  = X.shape[1]

def build_mlp(dim):
    m = models.Sequential([
        layers.Input(shape=(dim,)),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(32, activation='relu'),
        layers.Dense(n_classes, activation='softmax')
    ])
    m.compile(
        optimizer=optimizers.legacy.Adam(5e-4),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return m

# random 80/20 shuffle 
X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_int, test_size=0.2, stratify=y_int, random_state=42)

mlp = build_mlp(input_dim)
mlp.fit(X_tr, y_tr, epochs=50, batch_size=64, verbose=0)
rand_acc = mlp.evaluate(X_te, y_te, verbose=0)[1]
print(f"\nRandom-shuffle validation accuracy → {rand_acc*100:.1f} %")

# 5-fold StratifiedGroupKFold over 30-s chunks
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
fold_acc = []

for f,(tr,te) in enumerate(sgkf.split(X, y_int, grp), 1):
    m = build_mlp(input_dim)
    m.fit(X[tr], y_int[tr], epochs=30, batch_size=64, verbose=0)
    acc = m.evaluate(X[te], y_int[te], verbose=0)[1]
    fold_acc.append(acc)
    print(f"Fold {f} accuracy {acc*100:.1f} %")

print(f"\nMean grouped-chunk accuracy → {np.mean(fold_acc)*100:.1f} %")

# train on full data & save
final_mlp = build_mlp(input_dim)
final_mlp.fit(X, y_int, epochs=50, batch_size=64, verbose=0)
final_mlp.save("activity_mlp_dsp.h5")
print("Saved final model: activity_mlp_dsp.h5   (classes:", list(le.classes_), ")")


Random-shuffle validation accuracy → 74.3 %
Fold 1 accuracy 72.0 %
Fold 2 accuracy 76.0 %
Fold 3 accuracy 74.9 %
Fold 4 accuracy 71.6 %
Fold 5 accuracy 84.6 %

Mean grouped-chunk accuracy → 75.8 %
Saved final model: activity_mlp_dsp.h5   (classes: ['Cycling', 'Sitting', 'Walking'] )


---------------------------------------------------------------------------------------

## Activity Detection: Technical Report
-------------------------------------------------------------------------------------------------------


#### 1. Introduction
This project implements an end-to-end machine-learning pipeline for classifying human activities—Cycling, Sitting, and Walking—using accelerometer and gyroscope data. Our dataset, drawn from Edge Impulse’s Activity Detection collection, comprises 12 recordings captured at approximately 50 Hz. Through a sequence of preprocessing, feature engineering, and modeling stages, we demonstrate progressive improvements from a naïve baseline to a robust, deployable classifier.

####  2. Data Preparation
We began by verifying file integrity: every recording folder contained exactly one accelerometer.csv and one gyroscope.csv. Each CSV was loaded into pandas, its timestamp column shifted to start at zero, and the accelerometer (ax, ay, az) and gyroscope (gx, gy, gz) readings merged via an outer join on timestamp. The combined six-axis signals were then resampled to a uniform 50 Hz grid, with missing values linearly interpolated and any remaining rows containing NaNs dropped. Two recordings (Cycling-2023-09-14_06-22-31 and Sitting-2023-10-18_09-05-37) came out of this step with zero rows, leaving 10 usable recordings. Cutting these into two-second windows (100 samples per window with 50 % overlap) yielded 4 870 windows: 2 911 Cycling, 1 354 Walking, and 605 Sitting.

####  3. Baseline Modeling and Group Leakage
Our first model applied a Random Forest to a minimal 30-feature set: mean, standard deviation, minimum, maximum, and root-mean-square for each of the six axes. Under a conventional random 80/20 split, the classifier achieved an over-optimistic 100 % accuracy, but five-fold cross-validation grouped by recording (GroupKFold) collapsed to a mean accuracy of just 38 %. The gap exposes severe information leakage in the random split: overlapping windows from the same session appear in both the train and test sets.

#### 4. Enriched Statistical Features
To capture richer signal characteristics, we extended the feature set to 72 dimensions: nine statistics (mean, standard deviation, median, interquartile range, peak-to-peak range, skewness, kurtosis, and band energy in 0–3 Hz and 3–6 Hz) on each of eight signals, namely the six axes plus accelerometer and gyroscope magnitude. The Random Forest rose modestly to 44.5 % mean grouped accuracy, suggesting that statistical features alone could not overcome the recording-level class imbalance: Sitting and Walking each come from only two usable recordings, so holding one out leaves a single session of that class for training, and holding out both leaves none.

####  5. Spectral-DSP Features and MLP
Inspired by Edge Impulse’s spectral pipeline, we implemented a digital signal-processing block that, for each two-second window and axis, decimates the 50 Hz signal to 5 Hz, applies a sixth-order Butterworth low-pass filter (2.68 Hz cutoff, with coefficients designed at the 50 Hz rate), and computes a 64-point zero-padded FFT. We then took the log magnitude of the first 31 frequency bins on each of the six axes (after scaling accelerometer axes by 0.04 to match gyroscope magnitudes), producing a 186-dimensional feature vector. A tiny multilayer perceptron (MLP) with a 64-unit dense layer, 20 % dropout, a 32-unit dense layer, and a three-way softmax was trained at a 5 × 10⁻⁴ learning rate, for 50 epochs on the random split and the final model and for 30 epochs per cross-validation fold.

Under a random 80/20 split, the DSP-MLP pipeline reached 74.3 % validation accuracy in the run recorded above. When evaluated with StratifiedGroupKFold over 30-second “pseudo-recordings” (168 chunks, which lets the folds be stratified by activity), the model averaged 75.8 % accuracy, with individual folds between 71.6 % and 84.6 %. Because chunks cut from the same recording can fall on both sides of a split, this measures generalization across chunks within sessions rather than across sessions.

####  6. Discussion and Limitations
Our experiments revealed two principal lessons. First, train/test leakage via overlapping windows can drastically inflate performance metrics. Group-aware cross-validation is essential whenever data are grouped by session or user. Second, spectral features distilled periodic patterns of walking cadence and pedal rotation far more effectively than time-domain summaries alone. By scaling and filtering the raw signals before FFT, we ensured the MLP had balanced, informative inputs.
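
The first lesson can be made concrete with a small simulation. With 50 % overlap, window *i* shares samples with windows *i − 1* and *i + 1*, so a random split leaks whenever a test window has a neighbour in the training set. The sketch below assumes one recording of 1 171 windows (the size of the longest Walking recording), a random 20 % test split, and an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability only
n_windows = 1171                          # windows in one recording

# Random 80/20 assignment, ignoring which windows are neighbours.
is_test = rng.random(n_windows) < 0.2
is_train = ~is_test

# Does the previous / next window (which overlaps by 50 samples) sit in training?
prev_in_train = np.r_[False, is_train[:-1]]
next_in_train = np.r_[is_train[1:], False]

leaky = is_test & (prev_in_train | next_in_train)
frac = leaky.sum() / is_test.sum()
print(f"{frac:.1%} of test windows share raw samples with a training window")
```

A test window escapes overlap only if both neighbours also land in the test set, which happens with probability of roughly 0.2 × 0.2, so about 96 % of test windows are expected to overlap training data.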

A notable limitation is that Sitting contributed only two usable recordings (a third was emptied by the resampling step), so even after pseudo-chunk grouping, the model’s performance on Sitting windows is likely to show higher variance. Collecting additional Sitting and Walking sessions, or applying realistic sensor-noise and orientation augmentations, would likely improve robustness.
