Data Cleaning
---

This script cleans and normalizes the raw SMS dataset to produce a
ready-to-use CSV for later steps (splitting, training, evaluation).

Pipeline steps:
1. Load Raw Data
   - Reads the original SMS Spam Collection dataset from ../DATA/SMSSpamCollection.csv.

2. Standardize Labels
   - Strips surrounding whitespace from the "Label" column and converts values to lowercase (e.g., "Ham" -> "ham").

3. Normalize Text
   - Strips leading/trailing whitespace from each message.
   - Collapses any run of whitespace (spaces, tabs, newlines) into a single space.
   - Replaces curly quotes/apostrophes (“ ” ‘ ’) with straight ASCII quotes (' ").

4. Remove Duplicates
   - Identifies and removes duplicate messages based on the "SMS_Message" column.
   - Keeps the first occurrence of each unique message.
   - Ensures that the cleaned dataset contains only one row per unique message text.

5. Save Clean Data
   - Writes the cleaned dataset to ../DATA/clean/sms_clean.csv.
   - This file will be used by the split script (02) to create train/val/test sets.
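
Steps 2 to 4 can be illustrated on a toy frame. The sketch below is not the script's own function: `clean_toy` and the sample rows are made up for illustration, and only the column names `Label` and `SMS_Message` come from the dataset. It shows why normalization runs before deduplication: messages that differ only in spacing or quote style collapse to the same text and are then dropped as duplicates.

```python
import pandas as pd

def clean_toy(df_in):
    """Hypothetical helper mirroring steps 2-4 on a small frame."""
    df = df_in.copy()
    # Step 2: standardize labels
    df["Label"] = df["Label"].str.strip().str.lower()
    # Step 3: normalize text (trim, collapse whitespace, straighten quotes)
    df["SMS_Message"] = (
        df["SMS_Message"]
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.replace("[\u2018\u2019]", "'", regex=True)
        .str.replace("[\u201c\u201d]", '"', regex=True)
    )
    # Step 4: keep the first occurrence of each message
    return df.drop_duplicates(subset=["SMS_Message"], keep="first").reset_index(drop=True)

toy = pd.DataFrame({
    "Label": ["Ham", "ham ", "SPAM"],
    "SMS_Message": ["  Don\u2019t  be late ", "Don't be late", "WIN a \u201cfree\u201d prize"],
})
toy_clean = clean_toy(toy)
print(toy_clean)
```

The first two toy rows only become identical after normalization, so the cleaned frame keeps two of the three rows.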

Outputs:
   - ../DATA/clean/sms_clean.csv (cleaned dataset with standardized labels, normalized text, and duplicates removed)
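
Since later steps read this CSV back from disk, a quick round-trip check can confirm that messages containing commas or quotes survive the write. This is a minimal sketch on made-up rows written to a temporary directory, not part of the original script.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy rows with a comma and embedded double quotes (illustrative only)
toy = pd.DataFrame({
    "Label": ["ham", "spam"],
    "SMS_Message": ['He said "ok", then left', "Free entry, reply now"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "clean" / "sms_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)   # same call style as the notebook
    reloaded = pd.read_csv(out_path)

same_rows = reloaded.to_dict("records") == toy.to_dict("records")
print("Round trip preserved rows:", same_rows)
```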

Also prints:
   - Total number of rows (after removing duplicates)
   - Null counts per column
   - Class distribution (ham vs spam)

In [1]:
from pathlib import Path
import pandas as pd

# Paths
raw_path = Path("../DATA/SMSSpamCollection.csv")
clean_path = Path("../DATA/clean/sms_clean.csv")
clean_path.parent.mkdir(parents=True, exist_ok=True)

# Load raw data
df = pd.read_csv(raw_path)
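
Because the rest of the notebook relies on the columns `Label` and `SMS_Message`, a check right after loading can fail early with a clear message. This is a minimal sketch: `check_columns` is a hypothetical helper, and the inline CSV stands in for the real file.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["Label", "SMS_Message"]

def check_columns(df, required=REQUIRED_COLUMNS):
    """Raise a ValueError listing any required column that is missing."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected column(s): {missing}")
    return df

# Inline stand-in for the raw CSV (illustrative rows)
sample_csv = io.StringIO("Label,SMS_Message\nham,See you at 5\nspam,Free entry now\n")
sample = check_columns(pd.read_csv(sample_csv))
print(sample.shape)
```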

In [2]:
def normalize_sms_df(df_in):
    df = df_in.copy()

    # Standardize labels
    df["Label"] = df["Label"].astype(str).str.strip().str.lower()

    # Normalize text
    df["SMS_Message"] = (
        df["SMS_Message"]
        .astype(str)
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)  # collapse whitespace
        .str.replace("’", "'", regex=False)    # curly apostrophe / closing single quote -> straight
        .str.replace("‘", "'", regex=False)    # curly opening single quote -> straight
        .str.replace("“", '"', regex=False)    # curly opening double quote -> straight
        .str.replace("”", '"', regex=False)    # curly closing double quote -> straight
    )

    # Drop duplicate messages, keep the first occurrence
    df = df.drop_duplicates(subset=["SMS_Message"], keep="first").reset_index(drop=True)

    return df
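
One caveat about `.astype(str)`: in pandas 2.x and earlier it turns a missing value into the literal string "nan", so a null check run after `normalize_sms_df` can report zero even when the raw file had gaps. Newer pandas versions may behave differently, so treat this as an assumption about the installed version. A cautious option is to count nulls on the raw frame first. `raw_null_counts` and the inline CSV below are made up for illustration.

```python
import io

import pandas as pd

def raw_null_counts(df):
    """Count missing values per column before any string conversion."""
    return df.isna().sum().to_dict()

# Inline stand-in for a raw file whose second row has an empty message
raw_csv = io.StringIO("Label,SMS_Message\nham,hello there\nspam,\n")
raw = pd.read_csv(raw_csv)
counts = raw_null_counts(raw)
print(counts)
```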

In [3]:
# Normalize and clean
df_clean = normalize_sms_df(df)

# Summary of the cleaned data
print("Rows after removing duplicates:", len(df_clean))
print("Null counts:", df_clean.isna().sum().to_dict())
print(df_clean["Label"].value_counts())

# Sanity check: no duplicate messages should remain after cleaning
duplicates = df_clean[df_clean.duplicated(subset=["SMS_Message"], keep=False)]
print(f"Total duplicate rows in cleaned dataset: {len(duplicates)}")

# Save cleaned dataset
df_clean.to_csv(clean_path, index=False)
print(f"Wrote cleaned CSV to: {clean_path}")

Rows after removing duplicates: 5158
Null counts: {'Label': 0, 'SMS_Message': 0}
Label
ham     4516
spam     642
Name: count, dtype: int64
Total duplicate rows in cleaned dataset: 0
Wrote cleaned CSV to: ../DATA/clean/sms_clean.csv
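
The class counts in this output can be turned into shares with a small worked check. The numbers are copied from the printed output; nothing else is assumed.

```python
# Counts copied from the printed output of the cleaning cell
counts = {"ham": 4516, "spam": 642}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The total matches the 5158 rows reported after deduplication, and spam comes out at about 12.4% of the cleaned messages.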


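Note for the split step (02): the cleaned dataset is imbalanced (4516 ham vs. 642 spam, roughly 12% spam). The train/val/test split should therefore be stratified by label so that each subset keeps about the same ham/spam ratio. With a purely random split, a small val or test set can end up with too few spam messages, which makes metrics on the minority class unreliable.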
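
A stratified hold-out can be sketched with pandas alone. The split script (02) is not shown in this notebook, so the code below is not its implementation: `stratified_holdout`, the 20% fraction, the seed, and the toy data are all assumptions chosen for illustration.

```python
import pandas as pd

def stratified_holdout(df, label_col="Label", frac=0.2, seed=42):
    """Hypothetical helper: sample `frac` of each class as a held-out set."""
    holdout = df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)
    train = df.drop(index=holdout.index)  # everything not sampled stays in train
    return train, holdout

# Toy frame with a 90/10 class imbalance (illustrative only)
toy = pd.DataFrame({
    "Label": ["ham"] * 90 + ["spam"] * 10,
    "SMS_Message": [f"message {i}" for i in range(100)],
})
train, holdout = stratified_holdout(toy)
print(holdout["Label"].value_counts().to_dict())
```

Sampling within each label group keeps the 90/10 ratio in both parts. scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.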