#  ECG Preprocessing Command Guide

This notebook includes a command‑line preprocessing pipeline for loading raw ECG data, balancing classes, segmenting signals, and generating train/val/test splits or K‑fold datasets.

To see all available CLI arguments:

```bash
python ecg_data_loader.py --help
```

### Recommended Usage
Run preprocessing from a terminal where possible, because it:
- handles large datasets more reliably
- avoids notebook kernel memory limits
- keeps logs properly
- is safer for long‑running jobs

The script can also be run inside a notebook, but very large datasets can cause the kernel to freeze or restart.  

#### Full command-line form:
```bash
python ecg_data_loader.py \
  --dataset_path "/path/to/raw_dataset" \
  --name ecg-arrhythmia-1 \
  --fs 500 250 100 \
  --out_root "any-path" \
  --max_samples 1000 \
  --split_ratio 0.7 0.2 0.1 \
  --folds 7 \
  --balance_mode global
```
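
For readers who want to see how flags of this shape are typically declared, the following is a minimal `argparse` sketch. It is an assumption about how such a parser could look, not the actual contents of `ecg_data_loader.py`; the flag names and defaults are taken from this guide, while `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags described in this guide."""
    parser = argparse.ArgumentParser(description="ECG preprocessing (sketch)")
    parser.add_argument("--dataset_path", required=True)
    parser.add_argument("--name", required=True)
    # nargs="+" accepts one or more sampling rates, e.g. --fs 500 250 100
    parser.add_argument("--fs", type=int, nargs="+", default=[500, 250, 100])
    parser.add_argument("--out_root", default="prepared_data")
    parser.add_argument("--max_samples", type=int, default=None)
    # Exactly three values: train, validation, test
    parser.add_argument("--split_ratio", type=float, nargs=3, default=[0.7, 0.2, 0.1])
    parser.add_argument("--folds", type=int, default=None)
    parser.add_argument(
        "--balance_mode", choices=["global", "train", "none"], default="global"
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--dataset_path", "/path/to/raw_dataset", "--name", "ecg-arrhythmia-1", "--folds", "7"]
)
print(args.fs, args.folds, args.balance_mode)
```

Flags that are not passed fall back to the defaults listed in the argument descriptions.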

### Command‑line Arguments  
<span style="color:orange;">--dataset_path</span> (Required)  
Path to the raw dataset folder you want to load.  
This must point to the directory containing the original ECG files (MIT‑BIH, PTB‑XL, ECG‑Arrhythmia, etc.).  

<span style="color:orange;">--name</span> (Required)  
Name of the run.  
Used for:
- naming the log file
- naming the output folder under `prepared_data/`
- identifying which dataset loader to use (the name must contain part of the dataset folder name)

Example:  
`--name ecg-arrhythmia-1` → the loader detects the ECG‑Arrhythmia dataset.  
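
The name-based loader selection can be pictured as a simple substring match. This is only a sketch under stated assumptions: the keyword table `LOADER_KEYWORDS` and the function `detect_dataset` are hypothetical, and the real script may match on different keys.

```python
# Hypothetical keyword table; the actual script may use different keys.
LOADER_KEYWORDS = {
    "bih": "MIT-BIH",
    "ptbxl": "PTB-XL",
    "arrhythmia": "ECG-Arrhythmia",
}


def detect_dataset(run_name: str) -> str:
    """Return the dataset whose keyword appears in the run name."""
    lowered = run_name.lower()
    for keyword, dataset in LOADER_KEYWORDS.items():
        if keyword in lowered:
            return dataset
    raise ValueError(f"No known dataset keyword found in run name: {run_name!r}")


print(detect_dataset("ecg-arrhythmia-1"))
print(detect_dataset("mit-bih-AF-SR-folds"))
```

A run name that contains none of the keywords raises an error in this sketch, which is one reasonable way to fail early on a mistyped `--name`.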

<span style="color:orange;">--fs</span> (Optional)  
Target sampling rates for preprocessing.    
The pipeline will resample each ECG record to each of these frequencies.  
Defaults: 500, 250, 100 Hz  
You can specify any number of target sampling rates.  
Example:  
--fs 1000 500  
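
To illustrate what resampling one record to several target rates involves, here is a dependency-free sketch that uses plain linear interpolation. This is a deliberately simplified stand-in: it applies no anti-aliasing filter, and it is not claimed to be the method the pipeline uses (production code would more commonly rely on something like `scipy.signal.resample_poly`). The function name `resample_linear` is hypothetical.

```python
def resample_linear(signal, fs_in, fs_out):
    """Resample a 1-D signal by linear interpolation (illustrative only)."""
    if len(signal) < 2:
        return [float(x) for x in signal]
    # Integer arithmetic keeps the output length free of float rounding.
    n_out = (len(signal) - 1) * fs_out // fs_in + 1
    out = []
    for i in range(n_out):
        pos = i * fs_in / fs_out             # fractional index into the input
        lo = min(int(pos), len(signal) - 2)  # left neighbour, clamped at the end
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out


record = list(range(11))  # toy ramp "recorded" at 500 Hz
for target_fs in (500, 250, 100):
    print(target_fs, resample_linear(record, 500, target_fs))
```

Each target rate yields its own copy of the record, which matches the one-folder-per-rate layout described for the output.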

<span style="color:orange;">--out_root</span> (Optional)  
Root directory where processed data will be saved.  
Default: prepared_data  
Inside this folder, the script creates:  
`prepared_data/<name>/<fs>hz/`  
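
The folder layout above can be reproduced with `pathlib`. This is a small sketch; `output_dir` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path


def output_dir(out_root: str, name: str, fs: int) -> Path:
    """Build <out_root>/<name>/<fs>hz, following the layout in this guide."""
    return Path(out_root) / name / f"{fs}hz"


for fs in (500, 250, 100):
    print(output_dir("prepared_data", "ecg-arrhythmia-1", fs).as_posix())
```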

<span style="color:orange;">--max_samples</span> (Optional)  
Caps the number of samples processed.       
In train/val/test mode → caps number of segments    
In fold mode → caps number of segments per fold          


<span style="color:orange;">--split_ratio</span> (Optional)   
Train/validation/test split ratios.  
Default:  
`0.7 0.2 0.1` → 70% train, 20% val, 10% test  
Any other ratios can be specified here.  
Ignored when `--folds` is used.    
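
As a rough illustration of how three ratios turn into three subsets, the sketch below shuffles a list of items and slices it. It assumes a simple shuffled split; whether the script splits by record, segment, or patient is not stated here, and `split_by_ratio` is a hypothetical name.

```python
import random


def split_by_ratio(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and slice them into train/val/test by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("split ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to test
    return train, val, test


train, val, test = split_by_ratio(range(100))
print(len(train), len(val), len(test))
```

Giving the remainder to the test set avoids losing items to rounding when the counts do not divide evenly.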

<span style="color:orange;">--folds</span> (Optional)  
Enables patient‑safe K‑fold cross‑validation.    
Example:  
--folds 7 → 7‑fold CV  
When this is set:
- `--split_ratio` is ignored
- each fold contains unique patients
- no patient appears in more than one fold
- balancing is done per fold
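
The patient-safe property can be sketched as follows: group records by patient first, then assign whole patients to folds. This is one possible approach under stated assumptions, not the script's confirmed implementation; `patient_folds` and the toy record/patient IDs are hypothetical. scikit-learn's `GroupKFold` offers similar grouping behaviour if that library is available.

```python
from collections import defaultdict


def patient_folds(record_to_patient, k):
    """Assign records to k folds so that each patient lands in exactly one fold."""
    by_patient = defaultdict(list)
    for record, patient in record_to_patient.items():
        by_patient[patient].append(record)

    folds = [[] for _ in range(k)]
    # Place patients with the most records first, always into the smallest fold,
    # which keeps fold sizes roughly even.
    for patient in sorted(by_patient, key=lambda p: -len(by_patient[p])):
        smallest = min(folds, key=len)
        smallest.extend(by_patient[patient])
    return folds


records = {"r1": "p1", "r2": "p1", "r3": "p2", "r4": "p3", "r5": "p3", "r6": "p4"}
for index, fold in enumerate(patient_folds(records, 3)):
    print(index, fold)
```

Because assignment happens at the patient level, segments cut from the same patient can never end up on both sides of a train/validation boundary.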

<span style="color:orange;">--balance_mode</span> (Optional)  
Controls how class balancing is applied.  
Options:
- `global` (default): balances AFIB/NORMAL before splitting
- `train`: balances only inside the training set
- `none`: no balancing at all
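
A minimal sketch of class balancing, assuming it works by randomly downsampling every class to the size of the smallest one. That assumption is consistent with the example log in this guide (23 AFIB and 18 NORMAL records reduced to 18 each), but the sampling strategy itself is not confirmed. `balance_by_downsampling` is a hypothetical name.

```python
import random


def balance_by_downsampling(records, seed=0):
    """records: list of (record_id, label). Keep an equal number per label."""
    by_label = {}
    for record in records:
        by_label.setdefault(record[1], []).append(record)
    n_keep = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # seeded so the selection is reproducible
    kept = []
    for group in by_label.values():
        kept.extend(rng.sample(group, n_keep))
    return kept


records = [(f"a{i}", "AFIB") for i in range(23)] + [(f"n{i}", "NORMAL") for i in range(18)]
kept = balance_by_downsampling(records)
print(len(kept))
```

In `global` mode a step like this would run once before splitting; in `train` mode it would run only on the training subset.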



#### An example to run in the terminal:  
```bash
# Train/val/test split:
python ecg_data_loader.py --dataset_path "C:\path" --name ptbxl --fs 125 --max_samples 5000 --split_ratio 0.8 0.1 0.1
# Or 5-fold cross-validation:
python ecg_data_loader.py --dataset_path "C:\path" --name ptbxl --fs 125 --max_samples 5000 --folds 5
```
Or simply run with all defaults:
- `fs`: 500, 250, and 100 Hz
- `out_root`: `prepared_data` under the current folder
- `max_samples`: all AFIB samples plus an equal number of NORMAL samples
- `split_ratio`: 70% train, 20% val, 10% test
- `folds`: none

Set `--name` to one of `bih`, `ptbxl`, or `arrhythmia`:
```bash
# --name can be bih, ptbxl, or arrhythmia
python ecg_data_loader.py --dataset_path "C:/path" --name bih
```



#### Running with uv (if the environment uses it)
```bash
uv run python ecg_data_loader.py --dataset_path "C:/path" --name bih
```

#### Running Inside a Notebook
```bash
!python ecg_data_loader.py --dataset_path "C:/path" --name bih
```  
or  
```bash
!python ecg_data_loader.py \
--dataset_path "C:/path" \
--name bih
```
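
As an alternative to the `!python` shell escape, the command can be assembled and launched from Python with `subprocess`. This is a sketch only; `build_command` and `run_preprocessing` are hypothetical helpers, and the flags are the ones documented above.

```python
import subprocess
import sys


def build_command(dataset_path, name, folds=None):
    """Assemble the argument list for ecg_data_loader.py."""
    cmd = [sys.executable, "ecg_data_loader.py",
           "--dataset_path", dataset_path, "--name", name]
    if folds is not None:
        cmd += ["--folds", str(folds)]
    return cmd


def run_preprocessing(cmd):
    """Run the command and raise CalledProcessError if the script exits non-zero."""
    return subprocess.run(cmd, check=True)


cmd = build_command("C:/path", "bih", folds=5)
print(" ".join(cmd))
# run_preprocessing(cmd)  # uncomment once the path points at a real dataset
```

Passing the arguments as a list avoids shell quoting problems with Windows paths that contain spaces or backslashes.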

## Output Structure
After running the command, the following structure is created:
```bash
prepared_data/
    <name>/
        <fs>hz/
            sample_<fs>hz.csv
            train.pt
            val.pt
            test.pt
            data.pt   (if using folds)
logs/
    <name>.log
```
## Notes
- For multi-line commands, use `^` for line continuation in the Windows Command Prompt and `\` on macOS/Linux (bash).
- `--name` controls both the output folder and the log filename.
- `--folds` overrides `--split_ratio`.
- `--max_samples` behaves differently in fold vs non‑fold mode:
    - No folds: caps number of records
    - Folds: caps number of segments



## Example Run with Output

```bash
!python ecg_data_loader.py \
    --dataset_path "C:\Users\MY\Downloads\Dataset\mit-bih-combined" \
    --name mit-bih-AF-SR-folds \
    --max_samples 5000 \
    --folds 7
```


```
FULL DATASET OVERVIEW
  Total records : 41
  Sampling rates: [128, 250]
  Leads         : [2]
  Unique labels : 5
  Labels found  :
    (N(21), (AFIB(23), (AFL(8), (J(3), UNKNOWN(18)

[2025-12-23 15:28:28,196] INFO: DATASET SUMMARY (RAW RECORDS)
[2025-12-23 15:28:28,196] INFO:   Total records  : 41
[2025-12-23 15:28:28,196] INFO:   Total patients : 41
[2025-12-23 15:28:28,196] INFO:   AFIB records   : 23
[2025-12-23 15:28:28,196] INFO:   NORMAL records : 18
[2025-12-23 15:28:28,196] INFO: BALANCING RULE APPLIED (RECORD LEVEL)
[2025-12-23 15:28:28,202] INFO:   AFIB kept   : 18
[2025-12-23 15:28:28,202] INFO:   NORMAL kept : 18
[2025-12-23 15:28:28,202] INFO:   Total kept  : 36
[2025-12-23 15:28:28,202] INFO: FOLD MODE ENABLED (K=7)
```

500Hz:   0%|          | 0/36 [00:00<?, ?it/s]
500Hz:   3%|▎         | 1/36 [00:01<00:48,  1.40s/it]
500Hz:   6%|▌         | 2/36 [00:02<00:47,  1.38s/it]
500Hz:   8%|▊         | 3/36 [00:04<00:45,  1.38s/it]
500Hz:  11%|█         | 4/36 [00:05<00:44,  1.38s/it]
500Hz:  14%|█▍        | 5/36 [00:09<01:13,  2.38s/it]
500Hz:  17%|█▋        | 6/36 [00:11<01:01,  2.06s/it]
500Hz:  19%|█▉        | 7/36 [00:12<00:53,  1.84s/it]
500Hz:  22%|██▏   