# Notebook 1 — Pre-processing Pipeline

This notebook prepares the IoT-PoT dataset for PSO-tuned LightGBM training. It ingests the raw CSV chunks, cleans and encodes the features, applies SMOTETomek balancing, and saves the processed parquet file at the path set in `configs/experiment_default.yaml`. If your dataset variant differs, update the configuration and feature lists in that file.

## Steps

1. Make the code under `src/` importable, then load the YAML configuration.
2. Inspect the raw IoT-PoT files (or any CSVs that match the glob in the config).
3. Set `RUN_PREPROCESS = True` in the preprocessing cell to generate the balanced feature store.
4. Verify the saved parquet file, feature list, and class distribution.

In [None]:
from pathlib import Path
import sys

import pandas as pd

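# Assumes the notebook is started from the repository root (the folder that contains src/).
PROJECT_ROOT = Path.cwd()
SRC_DIR = PROJECT_ROOT / "src"
# Later cells import `src.config` and `src.data.preprocessing`, so the project root
# itself must be importable; src/ is added as well for flat imports.
for path in (PROJECT_ROOT, SRC_DIR):
    if path.exists() and str(path) not in sys.path:
        sys.path.append(str(path))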

print(f"Project root: {PROJECT_ROOT}")

In [None]:
from src.config import DEFAULT_CONFIG_PATH, load_config

config = load_config(DEFAULT_CONFIG_PATH)
config
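
The later cells read `config.data.raw_glob`, `config.data.processed_file`, and `config.data.label_column`, and the text below mentions `sample_frac`. The sketch below mirrors that shape with plain dataclasses so the expected fields are easy to check before a long run. It is an illustration only: the class names, the `validate` helper, the placement of `sample_frac` under `data`, and the example values are all assumptions, not the project's actual `load_config` output.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DataConfig:
    raw_glob: str            # glob pattern for the raw CSV chunks
    processed_file: str      # where the processed parquet file is written
    label_column: str        # name of the target column
    sample_frac: float = 1.0  # assumed location and default; check the real YAML


@dataclass
class ExperimentConfig:
    data: DataConfig


def validate(cfg):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not 0.0 < cfg.data.sample_frac <= 1.0:
        problems.append("sample_frac should be in the interval (0, 1]")
    if Path(cfg.data.processed_file).suffix != ".parquet":
        problems.append("processed_file is expected to end in .parquet")
    if not cfg.data.label_column:
        problems.append("label_column must not be empty")
    return problems


# Placeholder values, not the project's real paths.
example = ExperimentConfig(
    data=DataConfig(
        raw_glob="data/raw/*.csv",
        processed_file="data/processed/dataset.parquet",
        label_column="label",
        sample_frac=0.1,
    )
)
print(validate(example))
```

A similar check could be run against the real `config` object, provided its attributes match the names used in this notebook.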

In [None]:
from glob import glob

raw_files = sorted(glob(config.data.raw_glob))
print(f"Found {len(raw_files)} raw file(s)")
raw_files[:5]

### Run preprocessing (set the flag when you are ready)

The cell below calls `src.data.preprocessing.preprocess`, which will:

- Load/concatenate all matching raw files
- Clean headers and engineer features
- Encode categorical columns and scale numerics
- Apply SMOTETomek to tackle class imbalance
- Persist the processed parquet + metadata JSON

Large runs can take several minutes; lower `sample_frac` in the YAML to prototype on a subset of the data.
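
To give a feel for what the SMOTETomek step in the list above does, here is a toy, pure-Python sketch of the two ideas it combines: SMOTE-style oversampling (new minority points interpolated between a minority sample and one of its nearest minority neighbours) followed by Tomek-link cleaning (removing pairs of mutual nearest neighbours that carry different labels). It is not how `preprocess` or imbalanced-learn implements the step; the function names `smote_like` and `drop_tomek_links` and the sample points are made up for illustration, and removing both members of each link is a choice made for this sketch. In a real pipeline the library class `imblearn.combine.SMOTETomek` would normally be used instead.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards a near minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def drop_tomek_links(points, labels):
    """Remove both members of every Tomek link from the dataset."""
    idx = range(len(points))
    # Nearest neighbour of each point among all other points.
    nearest = [
        min((j for j in idx if j != i), key=lambda j: math.dist(points[i], points[j]))
        for i in idx
    ]
    # A Tomek link: two points that are each other's nearest neighbour but differ in label.
    linked = {
        i for i in idx
        if nearest[nearest[i]] == i and labels[nearest[i]] != labels[i]
    }
    keep = [i for i in idx if i not in linked]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Made-up 2-D data: label 0 is the majority class, label 1 the minority class.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.4), (0.9, 0.9)]
minority = [(1.0, 1.0), (1.2, 1.1), (1.1, 1.3)]

synthetic = smote_like(minority, n_new=len(majority) - len(minority))
points = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
balanced_points, balanced_labels = drop_tomek_links(points, labels)

print(f"before: {len(majority)} vs {len(minority)}")
print(f"after:  {balanced_labels.count(0)} vs {balanced_labels.count(1)}")
```

Oversampling first and cleaning second means the cleaning pass can also remove synthetic points that landed next to the other class, which is the usual motivation for combining the two.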

In [None]:
from src.data.preprocessing import preprocess

RUN_PREPROCESS = False  # flip to True to execute the full job

if RUN_PREPROCESS:
    processed_path = preprocess(config)
    print(f"Saved processed dataset to {processed_path}")
else:
    print("Skipping preprocessing. Set RUN_PREPROCESS=True when ready.")

In [None]:
processed_file = Path(config.data.processed_file)
if processed_file.exists():
    processed_df = pd.read_parquet(processed_file)
    display(processed_df.head())
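    # Relative class frequencies; only the five most frequent classes are shown.
    class_share = processed_df[config.data.label_column].value_counts(normalize=True)
    print("Class distribution (top 5 classes):")
    print(class_share.head())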
else:
    print(f"Processed file not found at {processed_file}. Run the preprocessing cell first.")
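
Step 4 also asks for a check of the feature list. One way to do that is to compare the columns of the processed frame with the feature names recorded in the metadata JSON. The sketch below assumes the metadata stores them under a `feature_columns` key; that key, the helper name `compare_features`, and the sidecar file naming shown in the comment are hypothetical, since the notebook does not show the metadata schema.

```python
import json
from pathlib import Path


def compare_features(metadata_path, actual_columns, label_column):
    """Return (missing, unexpected) feature names relative to the metadata file."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    expected = set(metadata["feature_columns"])  # hypothetical key name
    actual = set(actual_columns) - {label_column}
    return sorted(expected - actual), sorted(actual - expected)


# Possible use in this notebook, assuming the JSON sits next to the parquet file
# with the same stem (an assumption, adjust to the real file name):
# missing, unexpected = compare_features(
#     processed_file.with_suffix(".json"),
#     processed_df.columns,
#     config.data.label_column,
# )
```

Two empty lists would indicate that the parquet file and the recorded feature list agree.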

✅ **Next**: Open `2. Training.ipynb` to run PSO-tuned LightGBM on the balanced dataset. Keep the metadata JSON written alongside the parquet file so the run can be reproduced.