# Simple Spam/Ham Classifier

## Project Goal

The goal is to build a simple Machine Learning (ML) model that automatically recognizes whether a text message is Spam (unwanted) or Ham (legitimate, non-spam). This example illustrates the core steps in an ML project: loading the data, preparing it, and training and evaluating a classifier using the scikit-learn library.

## Import the necessary libraries

We will use the following libraries for this project:

*Pandas*: For easily loading and manipulating data in a table format.

*Scikit-learn*: The primary Machine Learning library, containing all the algorithms for training models.


`sklearn.model_selection.train_test_split`: A function that splits the data into training and testing sets.   

`sklearn.feature_extraction.text.CountVectorizer`: A specialized tool that turns text into numbers so the ML model can understand it.

`sklearn.naive_bayes.MultinomialNB`: A Naive Bayes classifier, which is a simple and effective algorithm for text classification.   

`sklearn.metrics`: A module that provides functions for evaluating the performance of the model.

In [1]:
# Step 1: Import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

## Data Loading

For this project we will use the [SMS Spam Collection Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset).

This dataset contains a collection of SMS messages tagged as either 'ham' (legitimate) or 'spam'. It is commonly used for training and evaluating spam classification models.


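Download the CSV file and save it locally (for example in a `./data` folder), then point `data_path` at that location. The cell below reads it from `/content/sample_data/spam.csv`.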

In [7]:
data_path = '/content/sample_data/spam.csv'
df = pd.read_csv(data_path, encoding='latin-1')

# Check the first few rows of the dataset
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


## Data Exploration and Preparation

In [8]:
df.shape

(5572, 5)

In [9]:
df.describe()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | ham | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |


In [10]:
# Renaming columns for clarity (v1 is the label, v2 is the message text)
df = df.rename(columns={'v1': 'Label', 'v2': 'Message'})

In [11]:
# Selecting only the needed columns and showing the first 5 rows
df = df[['Label', 'Message']]
df.head()

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [12]:
# Checking the class distribution (how many spam, how many non-spam)
print("\nClass distribution (Label):")
print(df['Label'].value_counts())


Class distribution (Label):
Label
ham     4825
spam     747
Name: count, dtype: int64


Note: the dataset is imbalanced. 'ham' (legitimate) messages outnumber 'spam' messages 4825 to 747, so roughly 87% of all messages are ham.
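Because the two classes are not equally frequent, a useful reference point is a trivial classifier that always answers 'ham'. The sketch below uses only the counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# A classifier that always answers "ham" is correct on every ham message
baseline_accuracy = ham_count / total

print(f"Total messages: {total}")
print(f"Spam share: {spam_count / total:.1%}")
print(f"Always-'ham' baseline accuracy: {baseline_accuracy:.1%}")
```

Any accuracy reported for a trained model is worth reading against this baseline rather than against 50%.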

## Data Preprocessing (Feature Engineering)

ML models only understand numbers, so we must convert each message (text) into a numerical vector. This is called vectorization. CountVectorizer first builds a vocabulary of every unique word found across the entire dataset, then represents each message as a vector that counts how many times each vocabulary word appears in that message.
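To make the idea concrete, here is a minimal pure-Python sketch of word-count vectorization. It only approximates CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters); the helper names `tokenize` and `to_counts` and the three sample messages are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Approximates CountVectorizer's defaults: lowercase, then keep
    # runs of two or more word characters (single letters are dropped)
    return re.findall(r"\b\w\w+\b", text.lower())

messages = [
    "Free entry to win a prize",
    "Ok, call me later",
    "Win a FREE prize, call now",
]

# Step 1 ("fit"): build a sorted vocabulary over all messages
vocabulary = sorted({token for msg in messages for token in tokenize(msg)})

# Step 2 ("transform"): one row of word counts per message
def to_counts(text):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [to_counts(msg) for msg in messages]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, which is why the real matrix in this notebook has as many columns as there are unique words in the dataset.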

In [15]:
# Define X (Messages) and y (Classification)
X = df['Message']
y = df['Label']

# Create the vectorizer object
vectorizer = CountVectorizer()

# Apply the vectorizer: fit_transform() learns the vocabulary and converts the text into word counts
X_vectorized = vectorizer.fit_transform(X)

print("\nDimensionality of vectorized data (Number of messages, Number of unique words):")
print(X_vectorized.shape)


Dimensionality of vectorized data (Number of messages, Number of unique words):
(5572, 8672)


## Data Splitting and Model Training

To check that the model generalizes beyond the examples it learned from, we must test it on data it has never seen. Therefore, we split the data into two parts: a Training Set (for learning) and a Test Set (for evaluation).

We will use the Multinomial Naive Bayes (MNB) algorithm, which is fast and highly effective for text classification.
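For intuition about what the algorithm computes, the following is a simplified pure-Python sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the functions `train_nb` and `predict_nb` and the hand-tokenized toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothing adds alpha to every word count so no probability is zero
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy, hand-tokenized training data (hypothetical, not from the dataset)
docs = [
    ["win", "free", "prize"],
    ["free", "offer", "now"],
    ["meeting", "tomorrow"],
    ["call", "me", "tomorrow"],
    ["send", "report"],
]
labels = ["spam", "spam", "ham", "ham", "ham"]

log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb(["free", "prize", "now"], log_prior, log_likelihood))
print(predict_nb(["meeting", "report", "tomorrow"], log_prior, log_likelihood))
```

The model scores each class by combining how common the class is with how typical each word is for that class, then picks the higher score. Working in log space avoids multiplying many small probabilities.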

In [17]:
# Split data (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)

# Create the ML model (Multinomial Naive Bayes)
model = MultinomialNB()

# Training the model (This is the most critical step!)
# We tell the model: "Here are the messages (X_train) and here are their correct classifications (y_train). Learn from them!"
model.fit(X_train, y_train)

print("\nModel successfully trained with Multinomial Naive Bayes!")


Model successfully trained with Multinomial Naive Bayes!
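In the cells above, the vectorizer is fitted on all messages before the split, so its vocabulary also includes words that occur only in test messages. One stricter variant, sketched here on a tiny made-up dataset, splits the raw text first and lets a scikit-learn pipeline fit the vectorizer on the training part only. The `stratify` argument is an addition that keeps the spam/ham ratio similar in both parts; the messages and labels are hypothetical stand-ins for `df['Message']` and `df['Label']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical stand-in for df['Message'] and df['Label']
messages = [
    "Win a free prize now", "Free entry, claim your prize",
    "Limited offer, win cash now", "Claim your free cash prize",
    "Are we still meeting tomorrow?", "Can you send me the report?",
    "Call me when you get home", "See you at lunch tomorrow",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the spam/ham ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(X_train, y_train)

# predict() vectorizes with the training vocabulary, then classifies
predictions = pipeline.predict(X_test)
print(list(zip(X_test, predictions)))
```

A pipeline also removes the need to remember which vectorizer belongs to which model, because both are stored and applied together.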


## Model Evaluation

Once the model has "learned," we need to assess how well it performs. We use the model to predict the labels of the unseen test messages and compare the predictions with the true labels (y_test).
We will use the accuracy_score metric, which returns the fraction of predictions that the model got right.

In [18]:
# Making predictions on the test set
y_pred = model.predict(X_test)

# Evaluating Accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"\nModel Accuracy on Test Data: {accuracy:.4f} (or {accuracy*100:.2f}%)")


Model Accuracy on Test Data: 0.9785 (or 97.85%)
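Accuracy alone does not show which kind of mistake a model makes, and since most messages in this dataset are ham, a high accuracy can hide missed spam. Precision and recall for the spam class separate the two error types. The sketch below computes them by hand on made-up labels (not the notebook's predictions); `sklearn.metrics` also offers `precision_score`, `recall_score`, and `confusion_matrix` for the same purpose.

```python
def spam_precision_recall(y_true, y_pred, positive="spam"):
    """Count outcomes for the positive class and derive precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels for illustration only (not the notebook's results)
toy_true = ["ham", "ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
toy_pred = ["ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
precision, recall = spam_precision_recall(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were caught?".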


## Testing the Model with New Messages

Now we can feed our own, new messages into the model to see how it classifies them.

IMPORTANT: We must use the same vectorizer that was used to train the model, ensuring the new text is converted into numbers in exactly the same way (using .transform(), not .fit_transform()).
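The difference between the two calls can be seen on a toy example. This is a small sketch with made-up texts and a separate `demo_vectorizer`, so it does not touch the notebook's fitted `vectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["free prize now", "see you tomorrow"]  # hypothetical training texts
new_texts = ["free pizza tomorrow"]                   # "pizza" was never seen

demo_vectorizer = CountVectorizer()
X_demo_train = demo_vectorizer.fit_transform(train_texts)  # learns six words

# transform() reuses the learned vocabulary: same columns, unseen words ignored
X_demo_new = demo_vectorizer.transform(new_texts)

# Re-fitting on the new text alone would build a different, three-word vocabulary
X_demo_refit = CountVectorizer().fit_transform(new_texts)

print(X_demo_train.shape, X_demo_new.shape, X_demo_refit.shape)
print(X_demo_new.toarray())
```

Only the matrix produced by `transform()` has the same number of columns as the training data, which is what the trained model expects as input.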

In [19]:
# Define new messages to test
new_messages = [
    "Hey! Are we still on for the meeting tomorrow?",
    "Congratulations! You've won a $1,000 Walmart gift card. Click here to claim your prize.",
    "Don't forget to submit your assignment by tomorrow.",
    "Limited time offer! Get 50% off on all products. Visit our website now!",
    "Can you send me the report you mentioned last week?",
]

# Convert the new messages into numerical format using ONLY transform()
# The vectorizer already knows the vocabulary, so we don't 'fit' again.
new_messages_vectorized = vectorizer.transform(new_messages)

# Making the final prediction
predictions = model.predict(new_messages_vectorized)

print("\nNew Message Classification Results:")
for message, prediction in zip(new_messages, predictions):
    print(f" -> Message: '{message}' \n    -> Classification: {prediction.upper()}")


New Message Classification Results:
 -> Message: 'Hey! Are we still on for the meeting tomorrow?' 
    -> Classification: HAM
 -> Message: 'Congratulations! You've won a $1,000 Walmart gift card. Click here to claim your prize.' 
    -> Classification: SPAM
 -> Message: 'Don't forget to submit your assignment by tomorrow.' 
    -> Classification: HAM
 -> Message: 'Limited time offer! Get 50% off on all products. Visit our website now!' 
    -> Classification: SPAM
 -> Message: 'Can you send me the report you mentioned last week?' 
    -> Classification: HAM
