# SMS Spam Classification: Data Preparation

This notebook prepares the data for an SMS spam classifier. We load the dataset, clean the text, encode the labels, split the data into train, validation, and test sets, and save each split for later modeling.

## 1. Import Required Libraries
We start by importing all necessary libraries for data loading, preprocessing, and splitting.

In [12]:
# Import required libraries
import pandas as pd
import numpy as np
import os
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## 2. Load SMS Spam Dataset
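We will load the SMS Spam Collection dataset (originally from the UCI repository) from a local copy in the 'data' folder. The file is tab-separated with no header row, so the column names are supplied explicitly.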

In [13]:
# Load the SMS Spam Collection dataset from the data folder
file_path = 'data/SMSSpamCollection'
df = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'text'])
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
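Because the messages are free text, it may be worth checking how double-quote characters inside them are parsed. By default `pd.read_csv` treats `"` as a quoting character, so an unbalanced quote can merge several lines into one row. The sketch below uses a made-up three-line sample (not rows from the dataset) to illustrate the effect and one possible workaround, `quoting=csv.QUOTE_NONE`. Whether this changes the row count for the real file is something to verify locally.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the first message opens a quote that only closes on the next line.
sample = 'ham\t"Hello there\nspam\tWin a prize now"\nham\tSee you later\n'

# Default parsing: the quoted span swallows the line break, so two lines merge into one row.
default_df = pd.read_csv(
    io.StringIO(sample), sep='\t', header=None, names=['label', 'text']
)

# QUOTE_NONE: quote characters are kept as literal text, giving one row per line.
raw_df = pd.read_csv(
    io.StringIO(sample), sep='\t', header=None, names=['label', 'text'],
    quoting=csv.QUOTE_NONE,
)

print(f"Rows with default quoting: {len(default_df)}")
print(f"Rows with QUOTE_NONE: {len(raw_df)}")
```

With `QUOTE_NONE` the quote characters stay in the text, but a cleaning step that strips punctuation would remove them afterwards.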


## 3. Preprocess the Data
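We will clean the text (lowercase it, remove everything except letters, digits, and whitespace, and collapse repeated whitespace) and encode the labels as integers (ham = 0, spam = 1) so the data is ready for splitting.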

In [14]:
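# Text cleaning and label encoding
def clean_text(text):
    """Lowercase the text, drop non-alphanumeric characters, and collapse whitespace."""
    text = text.lower()
    # Keep only lowercase letters, digits, and whitespace (punctuation is removed)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['text'] = df['text'].apply(clean_text)
# LabelEncoder assigns integers in sorted label order: ham -> 0, spam -> 1
le = LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])
df.head()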

| | label | text | label_num |
|---|---|---|---|
| 0 | ham | go until jurong point crazy available only in ... | 0 |
| 1 | ham | ok lar joking wif u oni | 0 |
| 2 | spam | free entry in 2 a wkly comp to win fa cup fina... | 1 |
| 3 | ham | u dun say so early hor u c already then say | 0 |
| 4 | ham | nah i dont think he goes to usf he lives aroun... | 0 |
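To make the effect of the cleaning step concrete, here is a small sketch that applies the same three operations to a few made-up strings (they are illustrative, not taken from the dataset). It shows that punctuation is deleted rather than replaced by a space, so "Don't" becomes "dont" and "0800-123" becomes "0800123", and that a message made only of punctuation ends up as an empty string.

```python
import re


def clean_text(text):
    # Same three steps as the cleaning function in this notebook
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# Hypothetical inputs chosen to exercise each step
examples = ["Free entry!! Call 0800-123 NOW", "Don't   worry", ":-)"]
cleaned = [clean_text(s) for s in examples]

for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

If merged tokens such as "0800123" turn out to be a problem for a given model, one alternative is to replace punctuation with a space instead of deleting it. That would be a change to the preprocessing, not something this notebook does.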


In [15]:
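# Sanity check: remove rows with missing text values before splitting.
# clean_text would already have raised on a NaN, so no rows are expected to be dropped here.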
print(f"Rows before dropping NaN in text: {df.shape[0]}")
df = df.dropna(subset=['text'])
print(f"Rows after dropping NaN in text: {df.shape[0]}")

Rows before dropping NaN in text: 5572
Rows after dropping NaN in text: 5572
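Note that `dropna` only removes missing values. A message that cleaning has reduced to an empty string is still a valid (empty) string and is kept. The sketch below, on a made-up three-row frame, shows the difference and one possible way to filter such rows. Whether the real dataset contains any is an assumption to check, not something the output above confirms.

```python
import pandas as pd

# Hypothetical frame: the second row stands in for a message that became empty after cleaning
toy = pd.DataFrame({
    'label': ['ham', 'ham', 'spam'],
    'text': ['ok lar', '', 'free entry'],
})

# dropna leaves the empty string in place because '' is not NaN
after_dropna = toy.dropna(subset=['text'])

# Filtering on string length removes it
non_empty = toy[toy['text'].str.len() > 0].reset_index(drop=True)

print(f"Rows after dropna: {len(after_dropna)}")
print(f"Rows after dropping empty strings: {len(non_empty)}")
```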


## 4. Split Data into Train, Validation, and Test Sets
We will split the data using stratified sampling so that each split keeps roughly the same ham/spam ratio as the full dataset.

In [16]:
# Split into train, validation, and test sets (80/10/10 split)
train_df, temp_df = train_test_split(df, test_size=0.2, stratify=df['label_num'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['label_num'], random_state=42)

print(f"Train shape: {train_df.shape}")
print(f"Validation shape: {val_df.shape}")
print(f"Test shape: {test_df.shape}")

Train shape: (4457, 3)
Validation shape: (557, 3)
Test shape: (558, 3)
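One way to confirm that stratification worked is to compare the spam share across the splits. The sketch below repeats the same two-stage split on a synthetic frame with a made-up 13% spam share (an illustrative figure, not measured from the dataset) and prints the share in each split. The same comparison could be run on the real `train_df`, `val_df`, and `test_df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 1000 rows, 130 labelled as spam (1)
toy = pd.DataFrame({
    'text': [f'message {i}' for i in range(1000)],
    'label_num': [1] * 130 + [0] * 870,
})

# Same two-stage 80/10/10 stratified split as in the cell above
train_toy, temp_toy = train_test_split(
    toy, test_size=0.2, stratify=toy['label_num'], random_state=42
)
val_toy, test_toy = train_test_split(
    temp_toy, test_size=0.5, stratify=temp_toy['label_num'], random_state=42
)

# The mean of a 0/1 label column is the share of the positive class
spam_share = {
    name: split['label_num'].mean()
    for name, split in [('train', train_toy), ('validation', val_toy), ('test', test_toy)]
}
for name, share in spam_share.items():
    print(f"{name}: {share:.3f}")
```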


## 5. Save Data Splits to CSV Files
We will save the train, validation, and test splits as CSV files in the 'data' folder.

In [17]:
# Defensive check: drop NaN values in 'text' column from all splits before saving
# (none are expected, since missing values were already removed before splitting)
train_df = train_df.dropna(subset=['text'])
val_df = val_df.dropna(subset=['text'])
test_df = test_df.dropna(subset=['text'])
os.makedirs('data', exist_ok=True)
train_df.to_csv('data/train.csv', index=False)
val_df.to_csv('data/validation.csv', index=False)
test_df.to_csv('data/test.csv', index=False)
print("Data splits saved to 'data' folder.")

Data splits saved to 'data' folder.
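When these files are read back, keep in mind that `pd.read_csv` parses empty fields as NaN by default, so a text value that was an empty string when saved comes back as a missing value. The sketch below demonstrates this round trip in memory on a made-up two-row frame. Passing `keep_default_na=False` is one option to preserve empty strings, and dropping NaN rows after loading is another.

```python
import io

import pandas as pd

# Hypothetical frame with one empty text value
toy = pd.DataFrame({
    'label': ['ham', 'ham'],
    'text': ['ok lar', ''],
    'label_num': [0, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default read: the empty field is parsed as NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: the empty field stays an empty string
kept_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(f"NaN text values with default read: {default_read['text'].isna().sum()}")
print(f"Empty strings with keep_default_na=False: {kept_read['text'].eq('').sum()}")
```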


---

## Next Steps
Data preparation is complete. Proceed to the train.ipynb notebook to train and evaluate your models.