# Email Spam Classification

## Project Overview
This project, authored by **Roy Njuguna**, focuses on classifying email messages as either "ham" (non-spam) or "spam." The dataset consists of text messages collected from various sources, including personal and promotional emails. The main goal is to build a machine learning model that can accurately distinguish between spam and legitimate emails, providing users with an effective tool to filter unwanted communications.

## Dataset Information
The dataset contains two key columns:

- **Category**: This indicates whether the email is "ham" or "spam."
- **Message**: The actual text content of the email message.

### Sample Data
| Category | Message                                                      |
|----------|--------------------------------------------------------------|
| ham      | Go until jurong point, crazy.. Available only ...          |
| ham      | Ok lar... Joking wif u oni...                               |
| spam     | Free entry in 2 a wkly comp to win FA Cup fina...         |
| ham      | U dun say so early hor... U c already then say...          |
| ham      | Nah I don't think he goes to usf, he lives aro...          |

## Objective
The primary objective of this project is to develop a machine learning model that can classify email messages as either spam or ham. This classification will help users manage their inboxes more effectively by filtering out unwanted spam.

## Data Source
The dataset was collected from various sources and is commonly used in spam classification tasks. It serves as a benchmark for testing different machine learning algorithms in natural language processing (NLP).

---

This project aims to provide a practical solution for spam detection in email communications. To interact with the model and test its functionality, please visit the following link:

[Try My Email Spam Classification Web App](https://huggingface.co/spaces/roy123njuguna/Spam-Mail-Prediction)


Import the dependencies

In [None]:
!pip install gradio

In [24]:
import pandas as pd  # Importing pandas for data manipulation and analysis
import numpy as np  # Importing NumPy for numerical operations and handling arrays
from sklearn.model_selection import train_test_split  # Importing function to split datasets into training and testing sets
from sklearn.feature_extraction.text import TfidfVectorizer  # Importing TfidfVectorizer for converting text to TF-IDF feature vectors
from sklearn.linear_model import LogisticRegression  # Importing Logistic Regression model for classification tasks
from sklearn.metrics import accuracy_score  # Importing accuracy_score to evaluate the performance of the model

# Import additional libraries
import joblib  # Used for saving and loading the trained machine learning models
import gradio as gr  # Gradio is used to create an easy-to-use user interface for model predictions

Data Collection and preprocessing

In [2]:
# Load the email data from a CSV file into a pandas DataFrame
raw_mail_data = pd.read_csv('/content/mail_data.csv')

In [3]:
# Display the first few rows of the DataFrame to understand the structure and contents of the email data
raw_mail_data.head()

|   | Category | Message                                           |
|---|----------|---------------------------------------------------|
| 0 | ham      | Go until jurong point, crazy.. Available only ... |
| 1 | ham      | Ok lar... Joking wif u oni...                     |
| 2 | spam     | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham      | U dun say so early hor... U c already then say... |
| 4 | ham      | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# Replace NaN values in the DataFrame with empty strings for cleaner data handling
mail_data = raw_mail_data.fillna('')

In [6]:
# Output the dimensions of the mail_data DataFrame (rows, columns)
mail_data.shape

(5572, 2)

Label encoding

In [8]:
# Label spam messages as 0 and ham messages as 1 for binary classification
# Use .loc to access the 'Category' column and update values based on conditions

# Set the 'Category' to 0 where the message is labeled as 'spam'
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0

# Set the 'Category' to 1 where the message is labeled as 'ham'
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

spam is represented by 0 and ham is 1
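
The same encoding can also be written with `Series.map` and an explicit dictionary, which keeps the label convention in one place. The snippet below is a minimal sketch on a made-up three-row frame; `toy_data` and `label_map` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row frame standing in for mail_data
toy_data = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at 5", "WIN a free prize now", "Ok lar"],
})

# Same convention as the notebook: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy_data["Category"] = toy_data["Category"].map(label_map)

# Any label missing from the mapping becomes NaN, so check before casting
assert toy_data["Category"].notna().all()
toy_data["Category"] = toy_data["Category"].astype(int)

print(toy_data["Category"].tolist())
```

One possible advantage of this form is that the column ends up with an integer dtype straight away, so a later `astype('int')` on the labels would no longer be needed.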

Separating the data into text and labels

In [12]:
# Extract the 'Message' column from the mail_data DataFrame and assign it to variable X
X = mail_data['Message']

# Extract the 'Category' column from the mail_data DataFrame and assign it to variable Y
Y = mail_data['Category']

Split the data into train and test data

In [13]:
# Split the dataset into training and testing sets
# 80% of the data will be used for training the model, and 20% for testing its performance
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
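
Spam datasets often contain far fewer spam messages than ham messages. In that situation, passing `stratify` to `train_test_split` keeps the class ratio the same in both splits. The sketch below uses invented toy data (`messages`, `labels`) to show the effect; it is an optional variation, not what the notebook's split does.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham (1) and 2 spam (0)
messages = [f"message {i}" for i in range(10)]
labels = [1] * 8 + [0] * 2

# stratify=labels keeps the ham/spam ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.5, random_state=2, stratify=labels
)

# Each half receives 4 ham messages and 1 spam message
print(sorted(y_tr), sorted(y_te))
```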

Feature extraction
- Convert text data to numerical values

In [14]:
# Transform text data into feature vectors suitable for input to the logistic regression model
# The TfidfVectorizer converts the text into TF-IDF features, which reflect the importance of words in the dataset
# - min_df=1: Include all words that appear in at least one document
# - stop_words='english': Remove common English stop words (like 'the', 'is', etc.) from the analysis
# - lowercase=True: Convert all text to lowercase to ensure uniformity

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit the vectorizer on the training data and transform it into feature vectors
X_train_features = feature_extraction.fit_transform(X_train)

# Transform the test data into feature vectors using the same vectorizer fitted on the training data
X_test_features = feature_extraction.transform(X_test)

# Convert the labels in Y_train and Y_test from their original format to integers (0 and 1)
# This is necessary for the logistic regression model to process the target values correctly
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
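
To make the fit/transform distinction concrete, here is a small sketch on a made-up corpus (`train_docs` and `new_docs` are illustrative, not project data). It uses the same vectorizer settings as above and shows that `transform` reuses the vocabulary learned during `fit_transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-message training corpus and one unseen message
train_docs = ["Free prize waiting for you", "The lunch was great"]
new_docs = ["The prize ceremony starts after lunch"]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary from the training documents only
train_matrix = toy_vectorizer.fit_transform(train_docs)
print(sorted(toy_vectorizer.vocabulary_))

# transform reuses that vocabulary; words never seen in training are ignored,
# so both matrices have the same number of columns
new_matrix = toy_vectorizer.transform(new_docs)
print(train_matrix.shape, new_matrix.shape)
```

Fitting the vectorizer on the training split only, as the notebook does, avoids leaking information from the test messages into the features.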



Model training using logistic regression

In [15]:
# Create a Logistic Regression model
model = LogisticRegression()

In [27]:
# Train the model with the training data
model.fit(X_train_features, Y_train)

In [29]:
joblib.dump(model, "spam_model.pkl")  # Save the trained model
joblib.dump(feature_extraction, "vectorizer.pkl")  # Save the TfidfVectorizer

['vectorizer.pkl']
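
Both files are needed at prediction time, because the model only understands vectors produced by the vectorizer it was trained with. The following sketch shows a save/load round trip on a tiny made-up training set, written to a temporary directory. All names prefixed with `toy_` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training set (0 = spam, 1 = ham)
toy_texts = ["win a free prize", "free cash prize", "lunch at noon", "see you at home"]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Load both artifacts, as a separate serving script would
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# The restored pair should reproduce the original prediction
sample = ["free prize"]
original = toy_model.predict(toy_vectorizer.transform(sample))
restored = loaded_model.predict(loaded_vectorizer.transform(sample))
print(original, restored)
```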

Model evaluation

In [21]:
# Make predictions on the training data and calculate accuracy
X_train_prediction = model.predict(X_train_features)
training_accuracy = accuracy_score(Y_train, X_train_prediction)

print('Accuracy on training data is: ', training_accuracy)

Accuracy on training data is:  0.9685887368184878


In [20]:
# Predict on test data and compute accuracy
X_test_prediction = model.predict(X_test_features)
test_accuracy = accuracy_score(Y_test, X_test_prediction)

print('Accuracy on test data is: ', test_accuracy)

Accuracy on test data is:  0.9533632286995516
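
Accuracy alone can hide how the model treats the smaller class, so a confusion matrix and per-class recall are a useful complement. The sketch below uses invented labels and predictions (`y_true`, `y_pred`), not the notebook's results, with the same convention of 0 for spam and 1 for ham.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical labels and predictions (0 = spam, 1 = ham)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Share of actual spam messages that were caught
spam_recall = recall_score(y_true, y_pred, pos_label=0)
print(spam_recall)

print(classification_report(y_true, y_pred, labels=[0, 1], target_names=["spam", "ham"]))
```

In this toy case the accuracy is 0.8, yet only half of the spam messages are caught, which is the kind of gap a single accuracy figure does not show. With the notebook's variables, the equivalent call would be `confusion_matrix(Y_test, X_test_prediction, labels=[0, 1])`.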


## Making a Predictive System for Spam Detection

This section outlines how to build a predictive system for spam detection from the trained machine learning model.

### Overview of the Predictive System
The predictive system lets users enter a text message and receive a prediction of whether it is spam or not. This helps users identify unwanted communications.

### Components of the Predictive System
1. **User Input**: The system will require input data in the form of text messages that the user wants to evaluate for spam content.

2. **Data Preprocessing**: Before making predictions, the input text must be transformed into a suitable format for the model. This involves:
   - Converting the input text into feature vectors using techniques like TF-IDF.
   - Ensuring the input text is processed in the same way as the training data.

3. **Prediction**: Using the trained machine learning model, the system will predict whether the input message is spam or not based on the transformed feature vectors. The model outputs either a spam (0) or not spam (1) classification.

4. **Output**: The system will present the prediction result to the user in a clear and concise manner, indicating whether the message is likely to be spam or not.
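
The four components above can be sketched as a single function. One way to guarantee that input text is processed exactly like the training data is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline`. This is an alternative arrangement to the notebook's separate `feature_extraction` and `model` objects, and the training messages below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (0 = spam, 1 = ham)
toy_messages = [
    "win a free prize now",
    "free cash prize waiting",
    "claim your free reward",
    "lunch at noon tomorrow",
    "see you at home",
    "meeting moved to monday",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then the classifier in one call
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(toy_messages, toy_labels)


def classify_message(message: str) -> str:
    """Return a readable label for one raw text message."""
    prediction = spam_pipeline.predict([message])[0]
    return "Not spam" if prediction == 1 else "Spam"


print(classify_message("Claim your free prize"))
```

Because preprocessing lives inside the pipeline, only one object would need to be saved and loaded.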

### Implementation
To implement the predictive system, we will utilize libraries such as Gradio for building a user-friendly interface that allows easy interaction with the model. This will enable users to enter their messages and receive immediate feedback on whether they are spam.
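
A minimal sketch of how such an interface could be wired up is shown below. It assumes the `spam_model.pkl` and `vectorizer.pkl` files saved earlier are available. The helper names (`load_artifacts`, `make_predict_fn`, `build_demo`) and the widget labels are hypothetical, and the code of the deployed app may differ.

```python
import joblib


def load_artifacts(model_path="spam_model.pkl", vectorizer_path="vectorizer.pkl"):
    """Load the saved model and vectorizer (file names as saved in the notebook)."""
    return joblib.load(model_path), joblib.load(vectorizer_path)


def make_predict_fn(model, vectorizer):
    """Build the function Gradio calls for each submitted message."""
    def predict_spam(message):
        if not message.strip():
            return "Please enter a message"
        features = vectorizer.transform([message])
        prediction = model.predict(features)[0]
        # Same convention as the notebook: 0 = spam, 1 = ham
        return "Not spam" if prediction == 1 else "Spam"
    return predict_spam


def build_demo(predict_fn):
    """Create the web interface; Gradio is imported only when this is called."""
    import gradio as gr
    return gr.Interface(
        fn=predict_fn,
        inputs=gr.Textbox(lines=4, label="Message"),
        outputs=gr.Textbox(label="Prediction"),
        title="Spam Mail Prediction",
    )
```

To start the app, one would call `build_demo(make_predict_fn(*load_artifacts())).launch()` from a script or notebook cell.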

### Conclusion
The predictive system serves as a valuable tool for enhancing communication efficiency, helping users avoid spam and unwanted messages effectively.


In [22]:
# Input email for prediction
input_mail = ["As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589"]

# Convert the input text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# Make the prediction
prediction = model.predict(input_data_features)

print(prediction)

# Determine if the message is spam or not
if prediction[0] == 1:
    print('Not spam')
else:
    print('Spam')


[0]
Spam


## Test the Web App

To experience the functionality of the spam detection model, you can test the web application created for this project. The web app allows users to input text messages and receive predictions regarding whether the message is spam or not.

### Access the Web App
Follow the link below to access and test the web app for the spam detection model:

[Spam Detection Web App](https://huggingface.co/spaces/roy123njuguna/Spam-Mail-Prediction)

### Instructions for Use
1. Click on the link above to open the web app.
2. Enter the message you want to check for spam in the provided input field.
3. Submit the message to receive a prediction on whether it is likely to be spam.

### Feedback
I welcome any feedback on your experience with the web app; it will help me improve its functionality and user interface.
