# ECG Preprocessing Command Guide

This notebook includes a command‑line preprocessing pipeline for loading raw ECG data, balancing classes, segmenting signals,  
 and generating train/val/test splits or K‑fold datasets.

To see all available CLI arguments:

```bash
python ecg_data_loader.py --help
```

### Recommended Usage
It is recommended to run preprocessing from a terminal, because it:  
- handles large datasets more reliably  
- avoids notebook kernel memory limits  
- keeps logs properly  
- is safer for long‑running jobs  

The script can also be run inside a notebook, but a very large dataset can cause the kernel to freeze or restart.  

#### Full command-line form:
```bash
python ecg_data_loader.py \
  --dataset_path "/path/to/raw_dataset" \
  --name ecg-arrhythmia-1 \
  --fs 500 250 100 \
  --out_root "any-path" \
  --max_samples 1000 \
  --split_ratio 0.7 0.2 0.1 \
  --test_ratio 0.3 \
  --folds 7 \
  --balance_mode global
```
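
The flags in this command could be declared with `argparse` roughly as follows. This is only a sketch reconstructed from the defaults documented in this guide, not the script's actual parser; the function name `build_parser` and the exact types are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in this guide."""
    p = argparse.ArgumentParser(prog="ecg_data_loader.py")
    p.add_argument("--dataset_path", required=True, help="Raw dataset folder")
    p.add_argument("--name", required=True, help="Run name (log file, output folder, loader choice)")
    p.add_argument("--fs", type=int, nargs="+", default=[500, 250, 100], help="Target sampling rates in Hz")
    p.add_argument("--out_root", default="prepared_data", help="Root output directory")
    p.add_argument("--max_samples", type=int, default=None, help="Optional cap on samples")
    p.add_argument("--split_ratio", type=float, nargs=3, default=[0.7, 0.2, 0.1], help="train val test")
    p.add_argument("--test_ratio", type=float, default=None, help="Fraction of patients held out for testing")
    p.add_argument("--folds", type=int, default=None, help="Number of cross-validation folds")
    p.add_argument("--balance_mode", choices=["global", "train"], default="global")
    return p


# Parse a sample command line without touching sys.argv.
args = build_parser().parse_args(
    ["--dataset_path", "/path/to/raw_dataset", "--name", "ecg-arrhythmia-1", "--folds", "7"]
)
print(args.fs, args.folds, args.balance_mode)
```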


### Supported Dataset Generation Modes

This CLI can generate ECG datasets in four distinct modes, depending on whether data are split into train/validation/test sets,  
 cross-validation folds, or a combination of both. All modes enforce patient-level separation to prevent data leakage.  

### Command‑line Arguments  
<span style="color:orange;">--dataset_path</span> (Required)  
Path to the raw dataset folder you want to load.  
This must point to the directory containing the original ECG files (MIT‑BIH, PTB‑XL, ECG‑Arrhythmia, etc.).  

<span style="color:orange;">--name</span> (Required)  
Name of the run.  
Used for:  
- naming the log file  
- naming the output folder under prepared_data/  
- identifying which dataset loader to use (the name must contain part of the dataset folder name)  

Example:  
--name ecg-arrhythmia-1 → the loader detects the ECG‑Arrhythmia dataset.  
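
A minimal sketch of how a run name could be mapped to a loader by substring matching. The keyword table and the function `detect_dataset` are hypothetical; the real script may use different keys or matching rules.

```python
# Hypothetical keyword table; the actual script's keys may differ.
LOADER_KEYWORDS = {
    "bih": "MIT-BIH",
    "ptb": "PTB-XL",
    "arrhythmia": "ECG-Arrhythmia",
}


def detect_dataset(run_name: str) -> str:
    """Return the dataset whose keyword appears in the run name."""
    lowered = run_name.lower()
    for keyword, dataset in LOADER_KEYWORDS.items():
        if keyword in lowered:
            return dataset
    raise ValueError(f"No dataset loader matches run name: {run_name!r}")


print(detect_dataset("ecg-arrhythmia-1"))
```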

<span style="color:orange;">--fs</span> (Optional)  
Target sampling rates for preprocessing.    
The pipeline will resample each ECG record to each of these frequencies.  
Defaults: 500, 250, 100 Hz  
You can specify any number of target sampling rates.  
Example:  
--fs 125 500  
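
To illustrate what resampling to several target rates involves, the sketch below computes the integer up/down factors for each target rate. It assumes a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)` would consume them; whether the pipeline actually resamples this way is not stated in this guide, and `resample_factors` is a hypothetical helper.

```python
from math import gcd


def resample_factors(source_fs: int, target_fs: int) -> tuple[int, int]:
    """Return (up, down) so that source_fs * up / down == target_fs."""
    divisor = gcd(source_fs, target_fs)
    return target_fs // divisor, source_fs // divisor


# Factors for a 500 Hz recording and a few target rates.
for target in (500, 250, 125, 100, 62):
    up, down = resample_factors(500, target)
    print(f"500 Hz -> {target} Hz: up={up}, down={down}")
```

Rates that do not divide the source rate evenly, such as 62 Hz from 500 Hz, give larger factors (31 and 250), which makes the resampling step more expensive.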

<span style="color:orange;">--out_root</span> (Optional)  
Root directory where processed data will be saved.  
Default: prepared_data  
Inside this folder, the script creates:  
prepared_data/<name>/<fs>hz/  
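
The directory layout can be expressed with `pathlib`. This is a sketch of the naming scheme only; `output_dirs` is a hypothetical helper and does not create anything on disk.

```python
from pathlib import Path


def output_dirs(out_root: str, name: str, rates: list[int]) -> list[Path]:
    """Build one <out_root>/<name>/<fs>hz path per target sampling rate."""
    return [Path(out_root) / name / f"{fs}hz" for fs in rates]


dirs = output_dirs("prepared_data", "ptbxl", [500, 250, 100])
for d in dirs:
    print(d.as_posix())
```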

<span style="color:orange;">--max_samples</span> (Optional)  
Caps the number of samples processed.  
In train/val/test mode → caps the number of segments  
In fold mode → caps the number of segments per fold  
Only needed when using MIT-BIH.  


<span style="color:orange;">--split_ratio</span> (Optional)   
Train/validation/test split ratios.  
Default:  
0.7 0.2 0.1 → 70% train, 20% val, 10% test  
Any other ratio can be given here.  
Ignored when --folds is used.   
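
A patient-level split by ratio can be sketched as below: patients, not segments, are shuffled and partitioned, so no patient ends up in two subsets. The function `split_patients`, the fixed seed, and the rounding rule are assumptions for illustration, not the script's implementation.

```python
import random


def split_patients(patient_ids, ratios=(0.7, 0.2, 0.1), seed=0):
    """Partition unique patient IDs into train/val/test lists."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    # Whatever remains after train and val goes to test.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train, val, test = split_patients([f"p{i}" for i in range(10)])
print(len(train), len(val), len(test))
```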


<span style="color:orange;">--test_ratio</span> (Optional)    
Fraction of patients reserved as a hold-out test set.    
When used together with --folds, this enables:  
Hold-out test + cross-validation.    
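
As a worked example of the hold-out arithmetic, the sketch below reserves a fraction of patients and returns the two counts. Rounding to the nearest integer is an assumption; it happens to reproduce the patient counts in the example log at the end of this guide (10069 patients, test_ratio 0.2), but the script's actual rule may differ. `holdout_counts` is a hypothetical helper.

```python
def holdout_counts(n_patients: int, test_ratio: float) -> tuple[int, int]:
    """Return (train+val patients, test patients) for a given test ratio."""
    n_test = round(n_patients * test_ratio)
    return n_patients - n_test, n_test


n_rest, n_test = holdout_counts(10069, 0.2)
print(n_rest, n_test)  # 8055 2014
```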

<span style="color:orange;">--folds</span> (Optional)  
Enables patient‑safe K‑fold cross‑validation.    
Example:  
--folds 7 → 7‑fold CV  
When this is set:  
- --split_ratio is ignored  
- each fold contains unique patients  
- no patient appears in more than one fold  
- balancing is done per fold  
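
The patient-safe property can be sketched as follows: folds are assigned per patient, and every record inherits its patient's fold. The round-robin assignment and the name `assign_folds` are assumptions made for illustration; the script may distribute patients differently.

```python
import random


def assign_folds(record_patients, k, seed=0):
    """Return one fold index per record, keeping each patient in a single fold."""
    patients = sorted(set(record_patients))
    random.Random(seed).shuffle(patients)
    # Deal shuffled patients into folds in round-robin order.
    fold_of = {patient: i % k for i, patient in enumerate(patients)}
    return [fold_of[patient] for patient in record_patients]


records = ["p1", "p1", "p2", "p3", "p3", "p4", "p5"]
folds = assign_folds(records, k=3)
print(list(zip(records, folds)))
```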

<span style="color:orange;">--balance_mode</span> (Optional)  
Controls how class balancing is applied.  
Options:  
- global (default): balances AFIB/NORMAL before splitting  
- train: balances only inside the training set  
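
Balancing by downsampling can be sketched as below, using the min(AFIB, NORMAL) rule that appears in the example log in this guide. Random selection with a fixed seed and the function name `balance_indices` are assumptions; in `global` mode this would run on all segments before splitting, in `train` mode only on the training subset.

```python
import random


def balance_indices(labels, seed=0):
    """Return sorted indices that keep min(AFIB, NORMAL) segments of each class."""
    rng = random.Random(seed)
    afib = [i for i, label in enumerate(labels) if label == "AFIB"]
    normal = [i for i, label in enumerate(labels) if label == "NORMAL"]
    keep = min(len(afib), len(normal))
    return sorted(rng.sample(afib, keep) + rng.sample(normal, keep))


labels = ["AFIB"] * 3 + ["NORMAL"] * 10
kept = balance_indices(labels)
print(len(kept))  # 6
```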
  



#### An example to run in the terminal:  
```bash
python ecg_data_loader.py --dataset_path "C:\path" --name ptbxl --fs 125 --max_samples 5000 --split_ratio 0.8 0.1 0.1
# or
python ecg_data_loader.py --dataset_path "C:\path" --name ptbxl --fs 125 --max_samples 5000 --folds 5
```
Or simply run with all defaults: fs (500, 250 and 100), out_root (prepared_data in the current folder), max_samples (all AFIB samples plus the same number of NORMAL samples),  
split_ratio (train 70%, val 20%, test 10%), folds (None). Set --name to bih, ptbxl, or arrhythmia:
```bash
python ecg_data_loader.py --dataset_path "C:/path" --name bih
```



#### Running with uv (if the environment uses it)
```bash
uv run python ecg_data_loader.py --dataset_path "C:/path" --name bih
```

#### Running Inside a Notebook
```bash
%run ecg_data_loader.py --dataset_path "C:/path" --name bih
```  
or  
```bash
%run ecg_data_loader.py \
--dataset_path "C:/path" \
--name bih
```

## Output Structure
After running the command, the following structure is created:
```bash
# Hold-out test + cross-validation (--test_ratio with --folds):
prepared_data/
    <name>/
        <fs>hz/
            sample_<fs>hz.csv
            data.pt
            test/
                test.pt

# Split mode:
prepared_data/
    <name>/
        <fs>hz/
            sample_<fs>hz.csv
            train.pt
            val.pt
            test.pt
            data.pt   (if using folds)
logs/
    <name>.log
```
## Notes
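The multi-line examples use `\` for line continuation, which works in bash and in the notebook `%run` example on Windows. This was not tested on macOS. In the Windows Command Prompt, the continuation character is `^` instead.  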
--name controls both the output folder and the log filename.  
--folds overrides --split_ratio.  
--max_samples behaves differently in fold vs non‑fold mode:  
    No folds: caps number of records  
    Folds: caps number of segments  



An Example run from the Terminal with output:

An example run from the notebook with output:  
In this example, 20% of all AFIB and NORMAL samples found are held out for testing without balancing, so the test set reflects the real-world class distribution.  
The remainder is balanced (AFIB = NORMAL) and divided equally into 5 folds for training and validation.  

In [1]:
%run "C:\Users\MY\OneDrive\Desktop\Bachelor_Project\SEARCH_AF_detection_OsloMet_BachelorGroup\ecg_afib_detection\ecg_preprocessing\ecg_data_prepare.py"  \
    --dataset_path "C:\Users\MY\Downloads\Dataset\ptb-xl-a-large-publicly-available-electrocardiography-dataset-1.0.3" \
    --name ptb-xl-test-folds \
    --fs 62 100 125 250 500 \
    --folds 5 \
    --test_ratio 0.2
 

[2026-01-23 12:17:30,377] INFO: Logger initialized successfully



FULL DATASET OVERVIEW
  Total records : 21799
  Sampling rates: [500]
  Leads         : [12]
  Unique labels : 71
  Labels found  :
    NORM(9514), LVOLT(182), SR(16748), SBRAD(637), IMI(2676), ABQRS(3327), SARRH(772), AFLT(73)
    AFIB(1514), NDT(1825), NST_(767), DIG(181), LVH(2132), LPFB(177), LNGQT(117), LAFB(1623)
    IRBBB(1118), RAO/RAE(99), RVH(126), IVCD(787), LMI(201), ASMI(2357), AMI(353), ISCAL(659)
    1AVB(793), STACH(826), PACE(294), ISCLA(140), SEHYP(29), ISCIL(179), ILMI(478), ISC_(1272)
    PVC(1143), CRBBB(541), CLBBB(536), ALMI(288), ANEUR(104), ISCAS(169), TAB_(35), HVOLT(62)
    PAC(398), LOWT(438), STD_(1009), EL(96), NT_(423), QWAVE(548), INVT(294), LPR(340), VCLVH(875)
    LAO/LAE(426), ILBBB(77), ISCIN(218), SVTAC(27), INJAS(214), INJAL(145), IPMI(33), ISCAN(44)
    INJLA(17), BIGU(82), TRIGU(20), IPLMI(51), 3AVB(16), INJIL(15), 2AVB(14), PRC(S)(10), PSVT(24)
    PMI(17), STE_(28), WPW(79), INJIN(18), SVARR(157)



[2026-01-23 12:21:01,010] INFO: Logger initialized successfully
[2026-01-23 12:21:01,010] INFO: DATASET SUMMARY (RAW RECORDS)
[2026-01-23 12:21:01,010] INFO:   Total records  : 10991
[2026-01-23 12:21:01,022] INFO:   Total patients : 10069
[2026-01-23 12:21:01,023] INFO:   AFIB records   : 1514
[2026-01-23 12:21:01,023] INFO:   NORMAL records : 9477
[2026-01-23 12:21:01,023] INFO: HOLD-OUT TEST MODE ENABLED (test_ratio=0.2)
[2026-01-23 12:21:01,114] INFO: Patients after split: train+val=8055, test=2014
[2026-01-23 12:21:01,115] INFO: FOLD MODE ENABLED (K=5)


62Hz:   0%|          | 0/8806 [00:00<?, ?it/s]

[2026-01-23 12:21:12,476] INFO: [62Hz] SEGMENTS CREATED: 8806
[2026-01-23 12:21:12,476] INFO: [62Hz] QC SUMMARY: 5/8806 records had extreme-value clipping, 0/8806 records had at least 1 flatline lead
[2026-01-23 12:21:12,476] INFO: [62Hz] SEGMENT DISTRIBUTION (BEFORE BALANCING)
[2026-01-23 12:21:12,476] INFO:   AFIB segments   : 1223
[2026-01-23 12:21:12,490] INFO:   NORMAL segments : 7583
[2026-01-23 12:21:12,490] INFO:   TOTAL segments  : 8806
[2026-01-23 12:21:12,493] INFO: [62Hz] GLOBAL SEGMENT BALANCE APPLIED
[2026-01-23 12:21:12,493] INFO:   AFIB segments   : 1223
[2026-01-23 12:21:12,493] INFO:   NORMAL segments : 1223
[2026-01-23 12:21:12,493] INFO:   TOTAL segments  : 2446
[2026-01-23 12:21:12,493] INFO: [62Hz] SEGMENT BALANCING APPLIED (FOLD LEVEL)
[2026-01-23 12:21:12,493] INFO:   Rule: min(AFIB, NORMAL) per fold
[2026-01-23 12:21:12,493] INFO: [62Hz] SEGMENT BALANCING APPLIED
[2026-01-23 12:21:12,493] INFO:   AFIB kept   : 1196
[2026-01-23 12:21:12,493] INFO:   NORMAL kept 

100Hz:   0%|          | 0/8806 [00:00<?, ?it/s]

[2026-01-23 12:21:34,576] INFO: [100Hz] SEGMENTS CREATED: 8806
[2026-01-23 12:21:34,592] INFO: [100Hz] QC SUMMARY: 5/8806 records had extreme-value clipping, 0/8806 records had at least 1 flatline lead
[2026-01-23 12:21:34,594] INFO: [100Hz] SEGMENT DISTRIBUTION (BEFORE BALANCING)
[2026-01-23 12:21:34,594] INFO:   AFIB segments   : 1223
[2026-01-23 12:21:34,596] INFO:   NORMAL segments : 7583
[2026-01-23 12:21:34,596] INFO:   TOTAL segments  : 8806
[2026-01-23 12:21:34,798] INFO: [100Hz] GLOBAL SEGMENT BALANCE APPLIED
[2026-01-23 12:21:34,800] INFO:   AFIB segments   : 1223
[2026-01-23 12:21:34,800] INFO:   NORMAL segments : 1223
[2026-01-23 12:21:34,800] INFO:   TOTAL segments  : 2446
[2026-01-23 12:21:34,804] INFO: [100Hz] SEGMENT BALANCING APPLIED (FOLD LEVEL)
[2026-01-23 12:21:34,804] INFO:   Rule: min(AFIB, NORMAL) per fold
[2026-01-23 12:21:34,806] INFO: [100Hz] SEGMENT BALANCING APPLIED
[2026-01-23 12:21:34,806] INFO:   AFIB kept   : 1196
[2026-01-23 12:21:34,809] INFO:   NORMAL

125Hz:   0%|          | 0/8806 [00:00<?, ?it/s]

[2026-01-23 12:22:31,087] INFO: [125Hz] SEGMENTS CREATED: 8806
[2026-01-23 12:22:31,087] INFO: [125Hz] QC SUMMARY: 5/8806 records had extreme-value clipping, 0/8806 records had at least 1 flatline lead
[2026-01-23 12:22:31,093] INFO: [125Hz] SEGMENT DISTRIBUTION (BEFORE BALANCING)
[2026-01-23 12:22:31,093] INFO:   AFIB segments   : 1223
[2026-01-23 12:22:31,093] INFO:   NORMAL segments : 7583
[2026-01-23 12:22:31,093] INFO:   TOTAL segments  : 8806
[2026-01-23 12:22:31,619] INFO: [125Hz] GLOBAL SEGMENT BALANCE APPLIED
[2026-01-23 12:22:31,620] INFO:   AFIB segments   : 1223
[2026-01-23 12:22:31,621] INFO:   NORMAL segments : 1223
[2026-01-23 12:22:31,621] INFO:   TOTAL segments  : 2446
[2026-01-23 12:22:31,624] INFO: [125Hz] SEGMENT BALANCING APPLIED (FOLD LEVEL)
[2026-01-23 12:22:31,625] INFO:   Rule: min(AFIB, NORMAL) per fold
[2026-01-23 12:22:31,627] INFO: [125Hz] SEGMENT BALANCING APPLIED
[2026-01-23 12:22:31,627] INFO:   AFIB kept   : 1196
[2026-01-23 12:22:31,628] INFO:   NORMAL

250Hz:   0%|          | 0/8806 [00:00<?, ?it/s]

[2026-01-23 12:23:26,136] INFO: [250Hz] SEGMENTS CREATED: 8806
[2026-01-23 12:23:26,136] INFO: [250Hz] QC SUMMARY: 5/8806 records had extreme-value clipping, 0/8806 records had at least 1 flatline lead
[2026-01-23 12:23:26,147] INFO: [250Hz] SEGMENT DISTRIBUTION (BEFORE BALANCING)
[2026-01-23 12:23:26,147] INFO:   AFIB segments   : 1223
[2026-01-23 12:23:26,147] INFO:   NORMAL segments : 7583
[2026-01-23 12:23:26,147] INFO:   TOTAL segments  : 8806
[2026-01-23 12:23:26,768] INFO: [250Hz] GLOBAL SEGMENT BALANCE APPLIED
[2026-01-23 12:23:26,768] INFO:   AFIB segments   : 1223
[2026-01-23 12:23:26,768] INFO:   NORMAL segments : 1223
[2026-01-23 12:23:26,768] INFO:   TOTAL segments  : 2446
[2026-01-23 12:23:26,783] INFO: [250Hz] SEGMENT BALANCING APPLIED (FOLD LEVEL)
[2026-01-23 12:23:26,783] INFO:   Rule: min(AFIB, NORMAL) per fold
[2026-01-23 12:23:26,785] INFO: [250Hz] SEGMENT BALANCING APPLIED
[2026-01-23 12:23:26,785] INFO:   AFIB kept   : 1196
[2026-01-23 12:23:26,785] INFO:   NORMAL

500Hz:   0%|          | 0/8806 [00:00<?, ?it/s]

[2026-01-23 12:23:59,625] INFO: [500Hz] SEGMENTS CREATED: 8806
[2026-01-23 12:23:59,626] INFO: [500Hz] QC SUMMARY: 5/8806 records had extreme-value clipping, 4835/8806 records had at least 1 flatline lead
[2026-01-23 12:23:59,627] INFO: [500Hz] SEGMENT DISTRIBUTION (BEFORE BALANCING)
[2026-01-23 12:23:59,627] INFO:   AFIB segments   : 1223
[2026-01-23 12:23:59,628] INFO:   NORMAL segments : 7583
[2026-01-23 12:23:59,628] INFO:   TOTAL segments  : 8806
[2026-01-23 12:24:00,288] INFO: [500Hz] GLOBAL SEGMENT BALANCE APPLIED
[2026-01-23 12:24:00,288] INFO:   AFIB segments   : 1223
[2026-01-23 12:24:00,288] INFO:   NORMAL segments : 1223
[2026-01-23 12:24:00,288] INFO:   TOTAL segments  : 2446
[2026-01-23 12:24:00,288] INFO: [500Hz] SEGMENT BALANCING APPLIED (FOLD LEVEL)
[2026-01-23 12:24:00,288] INFO:   Rule: min(AFIB, NORMAL) per fold
[2026-01-23 12:24:00,299] INFO: [500Hz] SEGMENT BALANCING APPLIED
[2026-01-23 12:24:00,299] INFO:   AFIB kept   : 1196
[2026-01-23 12:24:00,299] INFO:   NOR