## AML Assignment 1
#### Name: Shruti Sharma
#### Roll: MDS202435


### Dataset Overview
We use the SMS Spam Collection Dataset from the UCI Machine Learning Repository.

The dataset consists of SMS messages labeled as either spam or ham (non-spam).

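**Number of instances:** 5,574 SMS messages according to the dataset description; the file as loaded in this notebook yields 5,572 rows (see the `df.shape` output).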

**Classes:**
+ ham: legitimate messages
+ spam: unsolicited or promotional messages

**Task:** Binary text classification

In [17]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re

In [18]:
# Loading the tab-separated dataset
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])

In [19]:
df.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [20]:
df.shape

(5572, 2)
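
The loaded frame has 5,572 rows, two fewer than the 5,574 instances listed in the overview. One possible cause, which is an assumption and has not been checked against the actual file here, is that `pd.read_csv` treats double quotes as field quoting by default, so a message that opens a quote can absorb the lines that follow it. The made-up sample below is a minimal sketch of that effect and of how `quoting=csv.QUOTE_NONE` avoids it; the variable names are hypothetical.

```python
import csv
import io

import pandas as pd

# Made-up, tab-separated sample: the first message opens a double quote
# that is only closed at the end of the second line.
raw = 'ham\t"hello there\nspam\twin now"\nham\tok\n'

# Default quoting: the quoted span is read as one field, merging two lines.
default_df = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["label", "text"]
)

# QUOTE_NONE: quote characters are treated as ordinary text, one row per line.
literal_df = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["label", "text"],
    quoting=csv.QUOTE_NONE,
)

print(len(default_df), len(literal_df))
```

If this turns out to explain the gap, passing `quoting=csv.QUOTE_NONE` when loading `SMSSpamCollection` and re-checking `df.shape` would be a reasonable next step.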

In [21]:
df.isnull().sum()

label    0
text     0
dtype: int64

In [22]:
# Preprocessing dataset

# Label Encoding
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Basic text cleaning
df['text'] = df['text'].str.lower()
df['text'] = df['text'].apply(lambda x: re.sub(r'[^a-z0-9\s]', '', x))
df['text'] = df['text'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())

The preprocessing pipeline maps class labels to integers (ham to 0, spam to 1) for model compatibility, lowercases the text, removes every character other than lowercase letters, digits, and whitespace, and collapses repeated whitespace. These steps simplify the raw SMS messages while keeping their word-level content, although symbols that may carry signal (for example currency signs and exclamation marks) are also removed, and contractions such as "don't" become "dont".
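
The cleaning steps can also be sketched as one reusable function, which makes it easy to try them on individual messages. The name `clean_text` and the sample strings are hypothetical; the regular expressions are the ones used in the cell above.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, drop non-alphanumeric characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # remove punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace


print(clean_text("Free entry!!  Win £1000 now..."))  # free entry win 1000 now
print(clean_text("Nah I don't think so"))            # nah i dont think so
```

With such a helper, the two `apply` calls could be replaced by `df['text'].apply(clean_text)`.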

In [23]:
df.head()

|   | label | text |
|---|-------|------|
| 0 | 0 | go until jurong point crazy available only in ... |
| 1 | 0 | ok lar joking wif u oni |
| 2 | 1 | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | 0 | u dun say so early hor u c already then say |
| 4 | 0 | nah i dont think he goes to usf he lives aroun... |


In [26]:
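# Splitting into train/val/test sets (70% / 15% / 15%)
val_size = 0.15
train_size = 0.7

# Stage 1: hold out 30% of the data, stratified by label
train_df, temp_df = train_test_split(
    df, train_size=train_size, stratify=df["label"], random_state=42
)

# Stage 2: share of the held-out data that goes to validation (about 0.5)
val_ratio = val_size / (1 - train_size)

val_df, test_df = train_test_split(
    temp_df, train_size=val_ratio, stratify=temp_df["label"], random_state=42
)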


The dataset is split into training, validation, and test sets using a two-stage stratified sampling approach. First, 70% of the data is allocated to the training set, while the remaining 30% is temporarily held out. This held-out set is then split roughly in half into validation and test sets (about 15% of the full dataset each; rounding can leave the two a row or so apart), and stratification keeps the class distribution of every split close to the original. A fixed random seed (`random_state=42`) makes the split reproducible.
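
A quick way to sanity-check this scheme is to run it on synthetic data and compare each split's size and spam fraction. The sketch below uses a made-up 1,000-row frame with 13% spam (an arbitrary illustrative figure, not the real class balance) and hypothetical variable names; it mirrors the two-stage call pattern rather than reproducing the notebook's actual splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SMS data: 1,000 rows, 13% labeled spam (1).
toy_df = pd.DataFrame({"label": [1] * 130 + [0] * 870, "text": ["msg"] * 1000})

train_size, val_size = 0.7, 0.15
train_part, temp_part = train_test_split(
    toy_df, train_size=train_size, stratify=toy_df["label"], random_state=42
)
val_part, test_part = train_test_split(
    temp_part,
    train_size=val_size / (1 - train_size),
    stratify=temp_part["label"],
    random_state=42,
)

# Report each split's row count, share of the data, and spam fraction.
for name, part in [("train", train_part), ("val", val_part), ("test", test_part)]:
    share = round(len(part) / len(toy_df), 3)
    print(name, len(part), share, round(part["label"].mean(), 3))
```

The same loop can be pointed at `train_df`, `val_df`, and `test_df` to confirm that the real splits are close to 70/15/15 and share a similar spam fraction.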

In [27]:
output_dir = "."

In [28]:
train_df.to_csv(f"{output_dir}/train.csv", index=False)
val_df.to_csv(f"{output_dir}/validation.csv", index=False)
test_df.to_csv(f"{output_dir}/test.csv", index=False)
