### TODO: Data Preprocessing

#### Primary tasks (must-have)
- [ ] Load and clean the NF-UNSW-NB15-v3 dataset (or skip cleaning if it is confirmed pre-cleaned) and log basic stats via `df.shape` / `df.info()`.
- [ ] Load the synthetic dataset so it can be checked against the real data.
- [ ] Define target label column (e.g., `Label` / `attack_cat`) and check class imbalance via `value_counts(normalize=True)`.
- [ ] Double-check for nulls, duplicates, and obvious inconsistencies; visualize anomalies (e.g., `df.isna().sum().plot(kind="bar")`).
- [ ] Validate column types: convert numeric columns, standardize categorical dtypes, fix malformed values (ensure pandas/models interpret data correctly).
- [ ] Split data using **Stratified Sampling** to maintain the same percentage of rare attack classes in train/validation/test sets.
- [ ] Check feature correlations; remove or merge highly correlated columns to reduce redundancy and potential leakage.
- [ ] Save train/validation/test splits as CSV files in a structured folder, with separate subfolders for the real and synthetic datasets (e.g., in Google Drive), for reproducibility.
- [ ] Data split percentages:
  - **Real Data:** Training 80%, Validation 10%, Test 10%
  - **Synthetic Data:** Training 10–20%, Validation 5%, Test 0%
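
The quality checks in the list above (nulls, duplicates, dtype coercion, class balance, correlated columns) could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper names (`audit_frame`, `coerce_numeric`, `drop_highly_correlated`), the toy column names, and the 0.95 correlation threshold are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def audit_frame(df, label_col="Label"):
    """Collect basic quality stats: shape, nulls, duplicates, class shares."""
    return {
        "shape": df.shape,
        "nulls_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_share": df[label_col].value_counts(normalize=True).to_dict(),
    }


def coerce_numeric(df, columns):
    """Convert the given columns to numeric; malformed values become NaN."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


def drop_highly_correlated(df, label_col="Label", threshold=0.95):
    """Drop the later column of each feature pair whose |corr| exceeds threshold."""
    features = df.drop(columns=[label_col]).select_dtypes(include="number")
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy frame with hypothetical columns (not the real NF-UNSW-NB15-v3 schema)
toy = pd.DataFrame({
    "bytes_in": ["10", "20", "bad", "5", "5"],    # one malformed value
    "bytes_out": [1.0, 2.0, 3.0, 4.0, 4.0],
    "bytes_out_x2": [2.0, 4.0, 6.0, 8.0, 8.0],    # exact multiple of bytes_out
    "Label": [0, 0, 1, 0, 0],
})

clean = coerce_numeric(toy, ["bytes_in"])
report = audit_frame(clean)
pruned, dropped = drop_highly_correlated(clean)
print(report)
print("dropped:", dropped)
```

If correlation pruning is used, it is probably safer to compute the correlations on the training split only and then drop the same columns from the validation and test splits.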

#### Secondary tasks (second priority)
- [ ] Produce a quick EDA snapshot (pairplots, histograms) to highlight feature distributions and potential distribution shifts.
- [ ] Ensure deterministic splits: set a shared `RANDOM_STATE` (e.g., 42) in all splitters (`train_test_split`, `StratifiedKFold`) to guarantee reproducibility across runs. For `StratifiedKFold`, `random_state` only takes effect together with `shuffle=True`.
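
One way to confirm that the splits really are deterministic is to run each splitter twice and compare the results. The sketch below does this on a small made-up frame (the `feature` column and the 10% rare-class ratio are assumptions for illustration only) and also checks that stratification keeps the rare-class share equal across train and test.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

RANDOM_STATE = 42

# Made-up data: 200 rows, 10% belonging to a rare class
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=200),
    "Label": [0] * 180 + [1] * 20,
})


def split_once(df):
    """Stratified 80/20 split with the shared seed."""
    return train_test_split(
        df, test_size=0.2, stratify=df["Label"], random_state=RANDOM_STATE
    )


train_a, test_a = split_once(toy)
train_b, test_b = split_once(toy)

# Same seed and same input should give identical row selections
same_rows = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)

# Stratification should preserve the rare-class share in both parts
rare_share_train = train_a["Label"].mean()
rare_share_test = test_a["Label"].mean()


def fold_indices():
    """Test-fold indices from a seeded, shuffled StratifiedKFold."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    return [test_idx.tolist() for _, test_idx in skf.split(toy, toy["Label"])]


folds_a = fold_indices()
folds_b = fold_indices()

print("identical splits:", same_rows)
print("rare share train/test:", rare_share_train, rare_share_test)
print("identical folds:", folds_a == folds_b)
```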

In [None]:
# Example: deterministic stratified splits, saved as CSVs.
# Assumes real_df and synth_df are already-loaded DataFrames with a 'Label' column.
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Real data: 80% train, then the remaining 20% halved into 10% validation / 10% test
train_real, temp_real = train_test_split(
    real_df, test_size=0.2, stratify=real_df['Label'], random_state=RANDOM_STATE
)
val_real, test_real = train_test_split(
    temp_real, test_size=0.5, stratify=temp_real['Label'], random_state=RANDOM_STATE
)

# Synthetic data: 20% train, 5% validation, no test split.
# Assumption: the percentages in the checklist are shares of the synthetic dataset,
# so the remaining rows stay unused. Adjust train_size within 10-20% as needed.
train_synth, val_synth = train_test_split(
    synth_df, train_size=0.20, test_size=0.05,
    stratify=synth_df['Label'], random_state=RANDOM_STATE
)

# Create the output folders first; to_csv does not create missing directories
Path('splits/real').mkdir(parents=True, exist_ok=True)
Path('splits/synthetic').mkdir(parents=True, exist_ok=True)

# Save CSVs
train_real.to_csv('splits/real/train_real.csv', index=False)
val_real.to_csv('splits/real/val_real.csv', index=False)
test_real.to_csv('splits/real/test_real.csv', index=False)

train_synth.to_csv('splits/synthetic/train_synth.csv', index=False)
val_synth.to_csv('splits/synthetic/val_synth.csv', index=False)
