Data Splitting
--------------
This script splits the cleaned SMS dataset into Train, Validation, and Test
subsets for use in model training and evaluation.

Steps:
1. Load Clean Data
   - Reads the cleaned dataset from ../DATA/clean/sms_clean.csv.

2. Stratified Split
   - Splits the dataset into Train (70%), Validation (15%), and Test (15%).
   - Uses stratification on the "Label" column to preserve the ham/spam ratio
     across all splits.
   - Stratification is needed because the cleaned dataset is imbalanced
     (~88% ham vs ~12% spam, per the spam ratios printed by the script).
     Without stratification, a split could by chance contain very few spam
     examples, which makes evaluation unreliable and leaves the model too
     few minority-class examples to learn from.

3. Save Splits
   - Writes the split datasets to ../DATA/splits/train.csv,
     ../DATA/splits/val.csv, and ../DATA/splits/test.csv.

Outputs:
   - ../DATA/splits/train.csv (70% of data, stratified)
   - ../DATA/splits/val.csv   (15% of data, stratified)
   - ../DATA/splits/test.csv  (15% of data, stratified)

This step ensures that all subsequent training and evaluation stages
work with splits that share the class ratio of the full dataset, so each
split contains a proportional share of the minority spam class.
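
To make the idea of stratification concrete, here is a minimal sketch that uses only the standard library. It is not scikit-learn's implementation. The helper `stratified_split` and the synthetic labels are hypothetical, made up for illustration. The sketch shuffles each class separately and cuts it at the same fractions, so every split ends up with roughly the same spam share.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions, seed=42):
    """Split row indices so each part keeps roughly the same label mix.

    labels    -- list of class labels, one per row
    fractions -- dict of split name -> fraction, summing to 1.0
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    names = list(fractions)
    splits = {name: [] for name in names}
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for i, name in enumerate(names):
            if i == len(names) - 1:
                # The last split takes the remainder so no row is dropped.
                end = len(indices)
            else:
                end = start + round(fractions[name] * len(indices))
            splits[name].extend(indices[start:end])
            start = end
    return splits


# Synthetic labels (made-up data): 12% spam, 88% ham.
labels = ["spam"] * 120 + ["ham"] * 880
parts = stratified_split(labels, {"train": 0.70, "val": 0.15, "test": 0.15})

for name, idx in parts.items():
    spam_share = sum(labels[i] == "spam" for i in idx) / len(idx)
    print(f"{name}: {len(idx)} rows, spam share {spam_share:.3f}")
```

Because each class is cut at the same fractions, the spam share of every split matches the overall share up to rounding. A plain random split gives no such guarantee, which matters most for the smaller validation and test sets.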

In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split


def main():
    clean_path = Path("../DATA/clean/sms_clean.csv")
    splits_dir = Path("../DATA/splits")
    splits_dir.mkdir(parents=True, exist_ok=True)

    # Load cleaned data
    df = pd.read_csv(clean_path)

    # Target for stratification: spam=1, ham=0
    y = (df["Label"].str.lower() == "spam").astype(int)

    # 70% train, 30% temp (stratified)
    train_df, temp_df = train_test_split(
        df, test_size=0.30, stratify=y, random_state=42
    )

    # Split temp into 15% val, 15% test (stratified)
    y_temp = (temp_df["Label"].str.lower() == "spam").astype(int)
    val_df, test_df = train_test_split(
        temp_df, test_size=0.50, stratify=y_temp, random_state=42
    )

    # Save splits
    train_df.to_csv(splits_dir / "train.csv", index=False)
    val_df.to_csv(splits_dir / "val.csv", index=False)
    test_df.to_csv(splits_dir / "test.csv", index=False)

    # Quick sanity check: split sizes and spam ratios
    def ratio(d):
        return float((d["Label"].str.lower() == "spam").mean())
    print("Rows (total):", len(df))
    print("train:", len(train_df), "| spam_ratio=", f"{ratio(train_df):.4f}")
    print("val  :", len(val_df),   "| spam_ratio=", f"{ratio(val_df):.4f}")
    print("test :", len(test_df),  "| spam_ratio=", f"{ratio(test_df):.4f}")
    print(f"Wrote splits to: {splits_dir}")


if __name__ == "__main__":
    main()


Rows (total): 5158
train: 3610 | spam_ratio= 0.1244
val  : 774 | spam_ratio= 0.1253
test : 774 | spam_ratio= 0.1240
Wrote splits to: ../DATA/splits
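
The row counts above can be checked by hand. The sketch below assumes that, for a float `test_size` with `train_size` left unset, the test part is rounded up and the train part takes the remaining rows. That is my understanding of how `train_test_split` sizes its outputs, and it agrees with the printed counts. The helper `split_sizes` is hypothetical and only reproduces the arithmetic.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a float test_size.

    Assumption: the second ("test") part is ceil(test_size * n_rows)
    and the first part gets whatever rows remain.
    """
    n_second = math.ceil(test_size * n_rows)
    return n_rows - n_second, n_second


# Stage 1: 70% train, 30% temp.
n_train, n_temp = split_sizes(5158, 0.30)

# Stage 2: the temp part is halved, so each half is 0.30 * 0.50 = 15% of the data.
n_val, n_test = split_sizes(n_temp, 0.50)

print(n_train, n_val, n_test)  # 3610 774 774
```

The second stage uses `test_size=0.50` rather than `0.15` because it is applied to the 30% remainder, not to the full dataset.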
