# Initial Visualization of Data 

The dataset, located under `PD_IMU_XLS`, is a repository of IMU recordings of different body movements from 19 healthy controls and 15 Parkinson's patients, collected by Joseph et al. IMUs are placed at 4 locations across both arms: Upper Left, Upper Right, Lower Left, and Lower Right, as will be shown in an accompanying appendix. A further IMU is placed on the head, but it is ignored here because the design focuses on arm wearables.

The IMUs used by Joseph et al. are 9-axis units, with 3 axes each for the accelerometer (acceleration in 3D Cartesian coordinates), the gyroscope (rotational speed in 3D Cartesian coordinates), and the magnetometer (magnetic field relative to magnetic north in 3D). The magnetometer is not used here.
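
As a minimal sketch of keeping only the accelerometer and gyroscope axes, the snippet below drops the magnetometer channels from one sample. The column names (`acc_x` ... `mag_z`) and the sample values are hypothetical; the actual headers in the `.xls` sheets may differ.

```python
# Hypothetical column names; the real headers in the .xls sheets may differ.
ALL_AXES = [
    "acc_x", "acc_y", "acc_z",   # accelerometer
    "gyr_x", "gyr_y", "gyr_z",   # gyroscope
    "mag_x", "mag_y", "mag_z",   # magnetometer (not used)
]

def keep_motion_axes(sample: dict) -> dict:
    """Return only the accelerometer and gyroscope readings of one sample."""
    return {name: value for name, value in sample.items()
            if not name.startswith("mag_")}

# One made-up 9-axis sample, only to show the shape of the result.
sample = dict(zip(ALL_AXES, [0.1, 0.0, 9.8, 1.5, -0.3, 0.2, 30.0, -12.0, 45.0]))
motion = keep_motion_axes(sample)
print(list(motion))  # the six remaining channel names
```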

Reference dataset: https://doi.org/10.5061/dryad.fbg79cp1d.

In this notebook, we go over the visualization of the data and the extraction of its main features, so that modelling works with a scoped set of parameters instead of raw 3D time series, which is crucial for streamlined modelling.

**Steps executed when _Run All_ is pressed:**

| Step | Function Called | Output Folder | Purpose |
|------|-----------------|---------------|---------|
| 1 Ingest  | `data.ingest.ingest_xls()` | `data/raw/`, `metadata.json` | Load every `*.xls` file, keep only the **Lower Arm** sheet, save as `.pkl`. |
| 2 Convert  | `data.ingest.to_h5()` | `data/interim/` | Compress the raw DataFrames to HDF5, attach sample-rate metadata. |
| 3 Pre-proc | `data.preproc.process()` | `data/interim_pre/` | Detrend & 0.5–20 Hz band-pass filter each axis. |
| 4 Windows  | `data.window.main()` | `data/processed/X_windows.npz` | Slide 256-sample windows (50 % overlap) across each trial. |
| 5 Features | `features.build.main()` | `data/processed/features.npz` | Extract **48 IMU features** (time + frequency). |
| 6 Split    | `data.split.main()` | `data/processed/split/` | Subject-wise **train/val/test** `.npz` files (PD vs. Control). |
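
The windowing in step 4 can be sketched as follows. This is an illustrative, standard-library version under stated assumptions, not the code in `data.window`: the helper names are hypothetical, and the choice to drop a trailing partial window is an assumption.

```python
def window_starts(n_samples: int, size: int = 256, overlap: float = 0.5) -> list:
    """Start indices of fixed-size windows; a trailing partial window is dropped."""
    step = int(size * (1 - overlap))  # 128 samples for 50 % overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    return list(range(0, n_samples - size + 1, step))

def sliding_windows(trial: list, size: int = 256, overlap: float = 0.5) -> list:
    """Cut one channel of a trial into overlapping windows."""
    return [trial[s:s + size] for s in window_starts(len(trial), size, overlap)]

trial = list(range(1000))          # stand-in for one channel of one trial
windows = sliding_windows(trial)
print(len(windows), len(windows[0]))
```

With 50 % overlap, each window starts 128 samples after the previous one, so consecutive windows share half of their samples.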

After running, the notebook shows:

* matrix shape `(N_windows, 48)`  
* a DataFrame preview (10 rows)  
* a histogram of dominant frequencies (axis 0)  
* optional export to `features.csv`.

These artefacts flow directly into the modelling notebook **01_modelling.ipynb**.
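
To make the "dominant frequency" feature concrete, here is a minimal sketch that finds the strongest non-DC frequency bin of one window. It uses a naive DFT from the standard library for self-containment; the pipeline itself presumably relies on an FFT routine, and the function name and the synthetic 6.25 Hz test tone are assumptions made for illustration. The 200 Hz sampling rate and 256-sample window length match the values used in this notebook.

```python
import cmath
import math

def dominant_frequency(signal, fs):
    """Frequency (Hz) of the largest non-DC bin of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]      # remove the DC offset
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):             # positive frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs / n                     # bin index -> Hz

fs = 200.0   # sampling rate in Hz
n = 256      # one window
tone = [math.sin(2 * math.pi * 6.25 * i / fs) for i in range(n)]  # synthetic tone
print(dominant_frequency(tone, fs))
```

The frequency resolution of one window is `fs / n`, about 0.78 Hz here, so the estimate can only land on multiples of that value.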

In [1]:
import os
import sys
from pathlib import Path

# Machine-specific path: point this at your local clone of pgm-tremors
os.chdir('/Users/toufikjrab/Projects/COMP 588/pgm-tremors')
REPO_ROOT = Path.cwd()
# OR if not in pgm-tremors folder:
# REPO_ROOT = Path("/Users/toufikjrab/Projects/COMP 588/pgm-tremors")

DATA_DIR = REPO_ROOT / "src" / "data" / "PD_IMU_XLS" / "Data"
SRC = REPO_ROOT / "src"
if str(SRC) not in sys.path:
    sys.path.append(str(SRC))   # so that modules under src/ are importable
print("Repo root:", REPO_ROOT)
print("SRC path added:", SRC)


Repo root: /Users/toufikjrab/Projects/COMP 588/pgm-tremors
SRC path added: /Users/toufikjrab/Projects/COMP 588/pgm-tremors/src


In [2]:
from src.data import preproc, window, split
from src.data.ingest import ingest_xls as ingest

# Step 1: ingest the .xls files; this also writes the metadata file
ingest(DATA_DIR)

[warn] no Lower Left in cardigan 4.xls
subjects: 100%|██████████| 35/35 [01:36<00:00,  2.77s/it]
[ingest] 340 trials  →  data/raw


In [3]:
import json

# Sanity check: confirm that ingest wrote the metadata file and peek at its first entry
meta_path = REPO_ROOT / "data" / "metadata.json"
print("metadata exists:", meta_path.exists())
with open(meta_path) as f:
    print("first entry ->", json.load(f)[0])

metadata exists: True
first entry -> {'subject': 'CT001', 'pd': 0, 'activity': 'Cardigan_1', 'sheet': 'Lower Left', 'pkl': 'data/raw/CT001/Cardigan_1.pkl'}
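
The `subject` field in each metadata entry is what a subject-wise split would group by, so that no person appears in more than one of train/val/test. The sketch below illustrates the idea under stated assumptions: the function name, the 15 % fractions, the seed, and the toy entries are hypothetical, and it does not reproduce `data.split`. It also does not stratify by the `pd` label.

```python
import random

def subject_wise_split(entries, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole subjects to train/val/test so no subject spans two sets."""
    subjects = sorted({e["subject"] for e in entries})
    random.Random(seed).shuffle(subjects)          # reproducible shuffle
    n_test = max(1, round(len(subjects) * test_frac))
    n_val = max(1, round(len(subjects) * val_frac))
    groups = {
        "test": set(subjects[:n_test]),
        "val": set(subjects[n_test:n_test + n_val]),
        "train": set(subjects[n_test + n_val:]),
    }
    return {name: [e for e in entries if e["subject"] in ids]
            for name, ids in groups.items()}

# Made-up entries that only mimic the metadata fields shown above.
entries = [
    {"subject": f"{prefix}{i:03d}", "pd": int(prefix == "PD"), "activity": f"trial_{t}"}
    for prefix in ("CT", "PD") for i in range(1, 6) for t in range(2)
]
parts = subject_wise_split(entries)
ids = {name: {e["subject"] for e in rows} for name, rows in parts.items()}
print({name: len(s) for name, s in ids.items()})   # subjects per split
```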


In [4]:
from src.features import build, imu_freq, imu_time

In [44]:
import importlib
import src.data.preproc as preproc
import src.data.window as window
import src.data.split as split

# Pick up local edits to these modules without restarting the kernel
importlib.reload(preproc)
importlib.reload(window)
importlib.reload(split)

<module 'src.data.split' from '/Users/toufikjrab/Projects/COMP 588/pgm-tremors/src/data/split.py'>

In [45]:
from pathlib import Path
print(len(list(Path("data/interim_pre").rglob("*.npy"))), "npy files in interim_pre")


340 npy files in interim_pre


In [46]:
# Continue the pipeline: pre-process, window, and featurize the ingested trials

# Pre-processing (table step 3): detrend + band-pass filter
preproc.process()

# Windows (table step 4): slide fixed-size windows over each trial
window.main()

# Features (table step 5): time + frequency features at 200 Hz sampling
build.main(200.0)

preproc: 100%|██████████| 340/340 [00:00<00:00, 471.14it/s]


[preproc] 340 files  →  data/interim_pre


windows: 100%|██████████| 340/340 [00:00<00:00, 10554.05it/s]


[window] (7564, 256, 6) saved


  freqs, _, Pxy = _spectral_helper(x, y, fs, window, nperseg, noverlap,


[features] (7564, 48) → data/processed/features.npz


In [47]:
# Split (table step 6): subject-wise train/val/test split
split.main()

FileNotFoundError: [Errno 2] No such file or directory: 'data/processed/split/train.npz'
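
One plausible cause of this error, stated as an assumption because `split.py` is not shown here, is that the output folder `data/processed/split/` does not exist yet when the files are written. The sketch below reproduces that situation in a temporary directory (so no real data is touched) and shows that creating the parent folder first avoids the error. The helper name `write_with_parents` is hypothetical.

```python
import tempfile
from pathlib import Path

def write_with_parents(path: Path, payload: bytes) -> None:
    """Create the parent folder if needed, then write the file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "processed" / "split" / "train.npz"
    try:
        target.write_bytes(b"placeholder")      # parent folder is missing
        failed_without_mkdir = False
    except FileNotFoundError:
        failed_without_mkdir = True
    write_with_parents(target, b"placeholder")  # succeeds after mkdir
    created = target.exists()

print(failed_without_mkdir, created)
```

If the assumption holds, running `Path("data/processed/split").mkdir(parents=True, exist_ok=True)` before `split.main()` would be the equivalent step in this notebook.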

In [19]:
# Notebook setup
import pathlib, sys
REPO_ROOT = pathlib.Path().resolve().parent.parent    # assumes the working directory is two levels below the repo root
SRC       = REPO_ROOT / "src"
sys.path.append(str(SRC))

# <-- this is the ONLY line to change to match your local dataset location -->
DATA_ROOT = SRC / "data" / "PD_IMU_XLS" / "Data"      # ...CT0xx / PD0xx

print("Dataset folder  :", DATA_ROOT)
print("Outputs will go :", REPO_ROOT / 'data')


Dataset folder  : /Users/toufikjrab/Projects/src/data/PD_IMU_XLS/Data
Outputs will go : /Users/toufikjrab/Projects/data
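
The paths printed above sit in `/Users/toufikjrab/Projects`, above the repository, which suggests that `.parent.parent` climbed too far for the working directory in use. A less fragile option is to walk upwards until a marker folder is found. The sketch below is one way to do that; the helper name `find_repo_root` and the choice of `src` as the marker are assumptions, and the demo builds a throwaway folder tree rather than touching the real project.

```python
import tempfile
from pathlib import Path

def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a folder containing `marker/` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains '{marker}/'")

# Demo on a temporary tree that mimics <repo>/src and <repo>/notebooks.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / "pgm-tremors"
    (repo / "src").mkdir(parents=True)
    (repo / "notebooks").mkdir()
    from_root = find_repo_root(repo)
    from_notebooks = find_repo_root(repo / "notebooks")

print(from_root == from_notebooks)
```

In the notebook, `REPO_ROOT = find_repo_root(pathlib.Path.cwd())` would then give the same result whether the kernel starts in the repository root or in `notebooks/`.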


In [22]:
from src.data.ingest    import ingest_xls as ingest   # ingest.py exposes ingest_xls, not ingest
from src.data           import preproc, window, split
from src.features.build import main as build_feats

ingest(DATA_ROOT)            # → data/raw + metadata.json
preproc.process()            # → data/interim_pre
window.main()                # → data/processed/X_windows.npz
build_feats(200.0)           # → data/processed/features.npz
split.main()                 # → data/processed/split/{train,val,test}.npz