## Things achieved through preprocessing:
- removed rows with annotation 0
- defined windows of length 4 s (250 samples per window, calculated from the sampling interval) with a 0.5 s step (31 samples) for each run
- performed feature engineering by creating window-level metrics: RMS, variance, band power 0.5–3 Hz, band power 3–8 Hz, Freeze Index
- created labels: FoG vs. non-FoG
- merged everything into one CSV for modeling

In [21]:
import os
import re
import numpy as np
from scipy.signal import welch
import glob
import pandas as pd

In [2]:
# defining input and output folders

INPUT_FOLDER = "./daphnet+freezing+of+gait/dataset_fog_release/dataset"     
OUTPUT_FOLDER = "./modified_dataset" 
FILE_PATTERN = re.compile(r"S(\d+)R(\d+)\.txt$")
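
A quick way to see what `FILE_PATTERN` extracts is to run it on a couple of file names. This is only a sketch: `parse_ids` is a hypothetical helper that is not used elsewhere in the notebook.

```python
import re

FILE_PATTERN = re.compile(r"S(\d+)R(\d+)\.txt$")

def parse_ids(filename):
    """Return (subject_id, run_id) as ints, or None if the name does not match."""
    m = FILE_PATTERN.match(filename)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(parse_ids("S01R02.txt"))   # (1, 2)
print(parse_ids("README.txt"))   # None
```

Leading zeros are dropped by `int()`, so `S01R02.txt` becomes subject 1, run 2.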


## Including subject_id and run_id as columns in each individual txt file

In [5]:
# create output folder if it doesn't exist
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

for filename in os.listdir(INPUT_FOLDER):
    match = FILE_PATTERN.match(filename)
    if not match:
        continue  # skip files that don't match the pattern

    ss = int(match.group(1))
    rr = int(match.group(2))

    input_path = os.path.join(INPUT_FOLDER, filename)
    output_path = os.path.join(OUTPUT_FOLDER, filename)

    data = np.loadtxt(input_path)

    if data.ndim == 1:
        data = data.reshape(1, -1)

    ss_col = np.full((data.shape[0], 1), ss)
    rr_col = np.full((data.shape[0], 1), rr)

    new_data = np.hstack((data, ss_col, rr_col))

    np.savetxt(output_path, new_data, fmt="%d")

    print(f"Processed {filename} → {output_path}")

print("All files processed.")


Processed S01R01.txt → ./modified_dataset\S01R01.txt
Processed S01R02.txt → ./modified_dataset\S01R02.txt
Processed S02R01.txt → ./modified_dataset\S02R01.txt
Processed S02R02.txt → ./modified_dataset\S02R02.txt
Processed S03R01.txt → ./modified_dataset\S03R01.txt
Processed S03R02.txt → ./modified_dataset\S03R02.txt
Processed S03R03.txt → ./modified_dataset\S03R03.txt
Processed S04R01.txt → ./modified_dataset\S04R01.txt
Processed S05R01.txt → ./modified_dataset\S05R01.txt
Processed S05R02.txt → ./modified_dataset\S05R02.txt
Processed S06R01.txt → ./modified_dataset\S06R01.txt
Processed S06R02.txt → ./modified_dataset\S06R02.txt
Processed S07R01.txt → ./modified_dataset\S07R01.txt
Processed S07R02.txt → ./modified_dataset\S07R02.txt
Processed S08R01.txt → ./modified_dataset\S08R01.txt
Processed S09R01.txt → ./modified_dataset\S09R01.txt
Processed S10R01.txt → ./modified_dataset\S10R01.txt
All files processed.


## Removing rows with Annotation 0

In [6]:
for filename in os.listdir(OUTPUT_FOLDER):
    if not filename.endswith(".txt"):
        continue

    file_path = os.path.join(OUTPUT_FOLDER, filename)

    data = np.loadtxt(file_path)

    if data.ndim == 1:
        data = data.reshape(1, -1)

    # keep rows where 11th column != 0
    filtered_data = data[data[:, 10] != 0]

    if filtered_data.size == 0:
        open(file_path, "w").close()
    else:
        np.savetxt(file_path, filtered_data.astype(int), fmt="%d")

    print(f"Filtered {filename}: {data.shape[0]} → {filtered_data.shape[0]} rows")

print("All files updated.")

Filtered S01R01.txt: 151987 → 92802 rows
Filtered S01R02.txt: 52095 → 28801 rows
Filtered S02R01.txt: 72561 → 25601 rows
Filtered S02R02.txt: 89645 → 64961 rows
Filtered S03R01.txt: 144190 → 90882 rows
Filtered S03R02.txt: 38774 → 16641 rows
Filtered S03R03.txt: 70269 → 21121 rows
Filtered S04R01.txt: 195737 → 132482 rows
Filtered S05R01.txt: 109886 → 67844 rows
Filtered S05R02.txt: 100746 → 65922 rows
Filtered S06R01.txt: 175707 → 107523 rows
Filtered S06R02.txt: 44227 → 19842 rows
Filtered S07R01.txt: 119525 → 74241 rows
Filtered S07R02.txt: 50335 → 28801 rows
Filtered S08R01.txt: 136589 → 49284 rows
Filtered S09R01.txt: 172311 → 111365 rows
Filtered S10R01.txt: 193303 → 142722 rows
All files updated.
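
The filtering step is a boolean mask on the annotation column. The toy rows below are made up for illustration (they are not dataset values). In the Daphnet annotation scheme, 0 marks samples outside the experiment, 1 marks no freeze, and 2 marks freeze.

```python
import numpy as np

ANNOT_COL = 10  # 11th column: annotation

# Three toy rows: timestamp (ms), nine acceleration values, annotation.
rows = np.array([
    [16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # annotation 0: outside the experiment
    [32, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # annotation 1: no freeze
    [48, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],   # annotation 2: freeze
])

# Keep only rows whose annotation is non-zero.
kept = rows[rows[:, ANNOT_COL] != 0]
print(kept.shape[0], "of", rows.shape[0], "rows kept")
```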


## Dividing each file into windows and creating labels - FoG and non FoG

A window label answers this question:
“During this 4-second window, was the person mostly freezing or not?”

So the goal is to create labels as follows:
- window label = 1 (FoG)
- window label = 0 (non-FoG)

Steps:
1. Start with a window length of 4 s and a step size of 0.5 s, as in the paper
2. In each window, if at least 50% of the samples have annotation 2, set the window label to 1 (FoG), else 0 (non-FoG)
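
The labeling rule can be sketched in a few lines. `window_label` is a hypothetical helper for illustration. It uses `>=` for the comparison, which matches the windowing code later in this notebook, so a window with exactly 50% freeze samples is labeled 1.

```python
def window_label(annotations, threshold=0.5):
    """Return 1 (FoG) if the share of samples annotated 2 reaches threshold, else 0."""
    if not annotations:
        return 0
    freeze_ratio = sum(1 for a in annotations if a == 2) / len(annotations)
    return int(freeze_ratio >= threshold)

mostly_walking = [1] * 200 + [2] * 50    # 20% freeze samples
mostly_freezing = [1] * 100 + [2] * 150  # 60% freeze samples

print(window_label(mostly_walking))
print(window_label(mostly_freezing))
```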

## Checking a sample file to confirm the sampling interval and sampling frequency
The sampling interval should be 16 ms (62.5 Hz).

In [12]:
# build the path with os.path.join to avoid invalid "\m" and "\S" escape sequences
data = np.loadtxt(os.path.join(OUTPUT_FOLDER, "S05R01.txt"))

t_ms = data[:, 0]  

dt = np.median(np.diff(t_ms)) / 1000
fs = 1.0 / dt

print(f"Sampling interval dt = {dt:.6f} s")
print(f"Sampling frequency fs = {fs:.2f} Hz")

Sampling interval dt = 0.016000 s
Sampling frequency fs = 62.50 Hz
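
The median is a sensible choice here because the annotation-0 rows were already removed, so timestamps may jump wherever rows were dropped. A minimal sketch on made-up timestamps, where `sampling_summary` is a hypothetical helper, shows that one large jump does not affect the estimate:

```python
import numpy as np

def sampling_summary(t_ms):
    """Return (dt in seconds, fs in Hz, number of intervals that differ from the median)."""
    diffs = np.diff(np.asarray(t_ms, dtype=float))
    median_ms = float(np.median(diffs))
    n_irregular = int(np.sum(diffs != median_ms))
    dt = median_ms / 1000.0
    return dt, 1.0 / dt, n_irregular

# Made-up timestamps: regular 16 ms steps with one jump where rows were dropped.
t_ms = [0, 16, 32, 48, 4048, 4064, 4080]
dt, fs, n_irregular = sampling_summary(t_ms)
print(f"dt = {dt:.3f} s, fs = {fs:.2f} Hz, irregular intervals = {n_irregular}")
```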


## Using this, I will determine the number of rows in each window

62.5 samples are recorded every second.

Window length: 4 seconds × 62.5 Hz = 250 samples,
i.e. a 4-second window contains 250 consecutive samples.

Window step: 0.5 seconds × 62.5 Hz = 31.25 ≈ 31 samples,
i.e. after making one window, we move forward 31 rows and make the next window.

(4 s and 0.5 s are taken from the paper)

E.g.:

- Window 0: samples 0 to 249
- Window 1: samples 31 to 280
- Window 2: samples 62 to 311
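
The arithmetic and the window boundaries above can be reproduced with a short sketch. `full_window_bounds` is a hypothetical helper, and it keeps only windows that fit entirely in the signal. This is an assumption that differs from the notebook's own `iter_windows`, which also yields shorter windows at the end of a file.

```python
FS = 62.5        # Hz
WINDOW_S = 4.0   # window length in seconds
STEP_S = 0.5     # step in seconds

window_len = int(round(WINDOW_S * FS))  # 250 samples
step = int(STEP_S * FS)                 # 31.25 truncated to 31 samples

def full_window_bounds(n_samples, window_len, step):
    """Yield (start, end_exclusive) for every window that fits entirely in n_samples."""
    start = 0
    while start + window_len <= n_samples:
        yield start, start + window_len
        start += step

# A toy signal length of 320 samples gives exactly the three windows listed above.
bounds = list(full_window_bounds(320, window_len, step))
for i, (start, end) in enumerate(bounds):
    print(f"Window {i}: samples {start} to {end - 1}")
```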

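### Creating the windows

For each window:

- using the ankle_vert values in the window (250 for a full window), compute RMS, variance, band power 0.5–3 Hz, band power 3–8 Hz, and Freeze Index (band power 3–8 Hz divided by band power 0.5–3 Hz)
- create a label column: if at least 50% of the samples have annotation 2, the label is 1, else 0

Then generate a CSV for each txt file with the window-level metrics. Note that the code below also emits shorter windows at the end of each file; the n_samples column records each window's actual length.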
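
To build intuition for the Freeze Index, here is a sketch on two synthetic signals. It uses a plain Hann-windowed periodogram from NumPy's FFT rather than the Welch estimate used in the notebook. `freeze_index_sketch` and both test signals are assumptions made for illustration, not part of the pipeline.

```python
import numpy as np

FS = 62.5  # Hz, sampling rate measured earlier in the notebook

def freeze_index_sketch(x, fs=FS):
    """Ratio of 3-8 Hz power to 0.5-3 Hz power from a Hann-windowed periodogram."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                                   # remove the DC offset
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    lo = spec[(freqs >= 0.5) & (freqs <= 3.0)].sum()   # locomotor band
    hi = spec[(freqs >= 3.0) & (freqs <= 8.0)].sum()   # freeze band
    return hi / (lo + 1e-12)

t = np.arange(250) / FS                        # one 4-second window
walking_like = np.sin(2 * np.pi * 1.5 * t)     # energy in the 0.5-3 Hz band
freezing_like = np.sin(2 * np.pi * 6.0 * t)    # energy in the 3-8 Hz band

print("walking-like FI: ", freeze_index_sketch(walking_like))
print("freezing-like FI:", freeze_index_sketch(freezing_like))
```

The 1.5 Hz signal yields an index well below 1 and the 6 Hz signal an index well above 1, which is the contrast the feature is meant to capture.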

In [16]:
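ANKLE_VERT_COL = 2   # vertical ankle acceleration (third column)
ANNOT_COL = 10       # annotation column: 1 = no freeze, 2 = freeze

WINDOW_LEN = 250     # 4 s at 62.5 Hz
STEP = 31            # ~0.5 s at 62.5 Hz
FS = 62.5            # Hz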



def bandpower(x, fs, fmin, fmax):
    if len(x) < 4:
        return 0.0

    nperseg = min(256, len(x))
    freqs, psd = welch(x, fs=fs, nperseg=nperseg, detrend="constant")

    mask = (freqs >= fmin) & (freqs <= fmax)
    if not np.any(mask):
        return 0.0

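    # np.trapz is deprecated in favor of np.trapezoid since NumPy 2.0; use whichever exists
    integrate = getattr(np, "trapezoid", None) or np.trapz
    return float(integrate(psd[mask], freqs[mask]))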


def compute_features(x):
    x = x.astype(float)

    rms = float(np.sqrt(np.mean(x * x))) if len(x) else 0.0
    var = float(np.var(x)) if len(x) else 0.0

    bp_lo = bandpower(x, FS, 0.5, 3.0)
    bp_hi = bandpower(x, FS, 3.0, 8.0)

    freeze_index = bp_hi / (bp_lo + 1e-12)

    return rms, var, bp_lo, bp_hi, freeze_index


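def iter_windows(n, win_len, step):
    # yields (window_id, start, end_exclusive); windows near the end of the
    # file are truncated at n, so they can be shorter than win_len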
    w = 0
    start = 0
    while start < n:
        end = min(start + win_len, n)
        yield w, start, end
        w += 1
        start += step


for filename in os.listdir(OUTPUT_FOLDER):
    if not filename.lower().endswith(".txt"):
        continue

    path = os.path.join(OUTPUT_FOLDER, filename)
    data = np.loadtxt(path)

    if data.size == 0:
        continue
    if data.ndim == 1:
        data = data.reshape(1, -1)

    ankle = data[:, ANKLE_VERT_COL]
    annot = data[:, ANNOT_COL].astype(int)

    out_rows = []

    for window_id, start, end in iter_windows(len(data), WINDOW_LEN, STEP):
        sig = ankle[start:end]
        ann = annot[start:end]

        rms, var, bp_lo, bp_hi, fi = compute_features(sig)

        freeze_ratio = float(np.mean(ann == 2)) if len(ann) else 0.0
        label = int(freeze_ratio >= 0.5)

        out_rows.append([
            window_id,
            start,
            end,
            end - start,
            rms,
            var,
            bp_lo,
            bp_hi,
            fi,
            freeze_ratio,
            label
        ])

    out_csv = os.path.join(
        OUTPUT_FOLDER,
        filename.replace(".txt", "_windows.csv")
    )

    header = (
        "window_id,start_idx,end_idx_excl,n_samples,"
        "rms,var,bp_0p5_3,bp_3_8,freeze_index,freeze_ratio,label"
    )

    with open(out_csv, "w") as f:
        f.write(header + "\n")
        for r in out_rows:
            f.write(",".join(map(str, r)) + "\n")

    print(f"{filename} → {os.path.basename(out_csv)} ({len(out_rows)} windows)")

print("Done.")

S01R01.txt → S01R01_windows.csv (2994 windows)
S01R02.txt → S01R02_windows.csv (930 windows)
S02R01.txt → S02R01_windows.csv (826 windows)
S02R02.txt → S02R02_windows.csv (2096 windows)
S03R01.txt → S03R01_windows.csv (2932 windows)
S03R02.txt → S03R02_windows.csv (537 windows)
S03R03.txt → S03R03_windows.csv (682 windows)
S04R01.txt → S04R01_windows.csv (4274 windows)
S05R01.txt → S05R01_windows.csv (2189 windows)
S05R02.txt → S05R02_windows.csv (2127 windows)
S06R01.txt → S06R01_windows.csv (3469 windows)
S06R02.txt → S06R02_windows.csv (641 windows)
S07R01.txt → S07R01_windows.csv (2395 windows)
S07R02.txt → S07R02_windows.csv (930 windows)
S08R01.txt → S08R01_windows.csv (1590 windows)
S09R01.txt → S09R01_windows.csv (3593 windows)
S10R01.txt → S10R01_windows.csv (4604 windows)
Done.


## Including subject_id and run_id in the CSV

In [18]:
CSV_SUFFIX = "_windows.csv"


# Regex to extract ss and rr
NAME_RE = re.compile(r"S(\d+)R(\d+)_windows\.csv$", re.IGNORECASE)

for filename in os.listdir(OUTPUT_FOLDER):
    if not filename.lower().endswith(CSV_SUFFIX):
        continue

    match = NAME_RE.match(filename)
    if not match:
        print(f"Skipped (name does not match pattern): {filename}")
        continue

    ss = int(match.group(1))
    rr = int(match.group(2))

    path = os.path.join(OUTPUT_FOLDER, filename)

    with open(path, "r") as f:
        lines = f.readlines()

    if not lines:
        print(f"Skipped empty file: {filename}")
        continue

    header = lines[0].strip()
    data_lines = lines[1:]

    new_header = header + ",subject_id,run_id"

    new_lines = [new_header + "\n"]
    for line in data_lines:
        line = line.strip()
        if not line:
            continue
        new_lines.append(f"{line},{ss},{rr}\n")

    with open(path, "w") as f:
        f.writelines(new_lines)

    print(f"Updated {filename}: subject_id={ss}, run_id={rr}")

print("Done.")


Updated S01R01_windows.csv: subject_id=1, run_id=1
Updated S01R02_windows.csv: subject_id=1, run_id=2
Updated S02R01_windows.csv: subject_id=2, run_id=1
Updated S02R02_windows.csv: subject_id=2, run_id=2
Updated S03R01_windows.csv: subject_id=3, run_id=1
Updated S03R02_windows.csv: subject_id=3, run_id=2
Updated S03R03_windows.csv: subject_id=3, run_id=3
Updated S04R01_windows.csv: subject_id=4, run_id=1
Updated S05R01_windows.csv: subject_id=5, run_id=1
Updated S05R02_windows.csv: subject_id=5, run_id=2
Updated S06R01_windows.csv: subject_id=6, run_id=1
Updated S06R02_windows.csv: subject_id=6, run_id=2
Updated S07R01_windows.csv: subject_id=7, run_id=1
Updated S07R02_windows.csv: subject_id=7, run_id=2
Updated S08R01_windows.csv: subject_id=8, run_id=1
Updated S09R01_windows.csv: subject_id=9, run_id=1
Updated S10R01_windows.csv: subject_id=10, run_id=1
Done.


## Merging all files into one unified CSV

In [20]:
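# folder holding the per-run *_windows.csv files; the earlier cells write them
# to OUTPUT_FOLDER, so this assumes they were copied to ./csvs beforehand
folder_path = "./csvs"
output_file = "merged.csv"

# sort so the merge order is deterministic
all_files = sorted(glob.glob(os.path.join(folder_path, "*.csv")))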

df_list = [pd.read_csv(file) for file in all_files]
merged_df = pd.concat(df_list, ignore_index=True)

merged_df.to_csv(output_file, index=False)

print(f"Merged {len(all_files)} files into {output_file}")


Merged 17 files into merged.csv
