# Predicting Spam Mails Project

**Author:** [Syed Muhammad Ebad](https://www.kaggle.com/syedmuhammadebad)  
**Date:** 17-Oct-2024  
[Send me an email](mailto:mohammadebad1@hotmail.com)  
[Visit my GitHub profile](https://github.com/smebad)

**Dataset:** [Spam Mails Dataset](https://www.kaggle.com/datasets/venky73/spam-mails-dataset)

## Introduction
Spam emails pose a significant challenge to maintaining secure and efficient communication. In this project, we aim to build a classification model that predicts whether an email is spam or not using the Spam Mails Dataset. We will utilize text vectorization techniques, particularly TF-IDF, and train a logistic regression model for prediction.

---


## 1. Importing Necessary Libraries

In [72]:
# importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

## 2. Loading and Reviewing the Dataset

In [73]:
# loading the dataset
df = pd.read_csv("spam_ham_dataset.csv")
df.head()

|   | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


### Observations:
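* The dataset contains three main columns: text (the raw e-mail), label (the string label, ham or spam), and label_num (0 for ham and 1 for spam). A fourth column, Unnamed: 0, appears to be a leftover row identifier and is not used for modelling.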

## 3. Dataset Information

In [74]:
# data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [75]:
# checking for duplicate values
df.duplicated().sum()

0

In [76]:
# checking for class imbalance (number of ham vs. spam e-mails)
df['label'].value_counts()

label
ham     3672
spam    1499
Name: count, dtype: int64

### Observations:
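* The dataset is imbalanced: 3,672 of the 5,171 e-mails (about 71%) are ham and 1,499 (about 29%) are spam. A model can therefore score well on accuracy while still missing spam, so the imbalance should be kept in mind when judging performance.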
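
To put the imbalance in context, it helps to compute the accuracy of a trivial classifier that always predicts the majority class. The sketch below does this in plain Python using the counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 3672, "spam": 1499}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"Always-'{majority_label}' baseline accuracy: {baseline_accuracy:.1%}")
```

Any model trained on this data should be compared against that baseline of roughly 71% rather than against 50%.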

## 4. Splitting the Dataset

In [77]:
X = df['text']
y = df['label_num']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
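
The split above is purely random, so the ham/spam ratio in the test set can drift slightly from the ratio in the full dataset. A stratified split keeps the ratio the same in both parts; in scikit-learn this is available through the `stratify` argument of `train_test_split` (for example `stratify=y`). The sketch below illustrates the idea in plain Python on toy labels. The function `stratified_split` and the list `toy_labels` are hypothetical and not part of the notebook, and the accuracies reported later come from the unstratified split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 70/30 ham/spam mix (not the real dataset).
toy_labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(toy_labels)

spam_share_train = sum(toy_labels[i] for i in train_idx) / len(train_idx)
spam_share_test = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), spam_share_train, spam_share_test)
```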

## 5. Feature Extraction using TF-IDF

In [78]:
# feature extraction
tfidf = TfidfVectorizer(lowercase = True, stop_words = 'english', max_df = 0.7)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
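
To make the weighting concrete, here is a minimal plain-Python sketch of TF-IDF as `TfidfVectorizer` computes it with default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit L2 length. It is a simplification: tokens are split on whitespace, and stop-word removal and the `max_df = 0.7` filter (which drops terms appearing in more than 70% of documents) are left out. The function `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalised rows for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Toy documents for illustration only (not taken from the dataset).
toy_docs = ["cheap meds cheap offer", "meeting notes attached", "cheap offer inside"]
rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the third toy document, "inside" receives a higher weight than "cheap" because it occurs in fewer documents, which is the effect the inverse document frequency is meant to produce.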

## 6. Training the Logistic Regression Model

In [79]:
# training the model
lr = LogisticRegression()
lr.fit(X_train_tfidf, y_train)
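
Once fitted, a logistic regression model scores an e-mail by taking a weighted sum of its TF-IDF features plus an intercept, passing the result through the sigmoid function, and predicting spam when the resulting probability is at least 0.5. In scikit-learn the learned weights and intercept are exposed as `lr.coef_` and `lr.intercept_`. The sketch below shows the decision rule in plain Python; the weights, the bias, and the function `predict_proba_spam` are hypothetical and chosen only for illustration.

```python
import math


def predict_proba_spam(features, weights, bias):
    """Sigmoid of the weighted feature sum: the estimated probability of spam."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for illustration only; not the values learned by `lr`.
toy_weights = {"cheap": 2.0, "offer": 1.5, "meeting": -2.5}
toy_bias = -0.5

p = predict_proba_spam({"cheap": 0.8, "offer": 0.6}, toy_weights, toy_bias)
label = 1 if p >= 0.5 else 0  # 1 = spam, 0 = ham
print(f"P(spam) = {p:.3f}, predicted label = {label}")
```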

## 7. Model Testing and Accuracy

In [80]:
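# testing the model: accuracy on the held-out test set
y_test_pred = lr.predict(X_test_tfidf)
print(f"Test accuracy : {accuracy_score(y_test, y_test_pred)}")
# accuracy on the training set, for comparison (a large gap would suggest overfitting)
y_train_pred = lr.predict(X_train_tfidf)
print(f"Train accuracy : {accuracy_score(y_train, y_train_pred)}")

Test accuracy : 0.9903381642512077
Train accuracy : 0.9961315280464217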
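
Because the classes are imbalanced, accuracy alone does not show how well the spam class is handled. Precision (how many e-mails flagged as spam really are spam) and recall (how many spam e-mails are caught) fill that gap; scikit-learn provides them as `precision_score`, `recall_score`, and `f1_score` in `sklearn.metrics`. The sketch below computes them by hand so the definitions are visible. The labels are hypothetical toy values, not the notebook's predictions, and `binary_metrics` is an illustrative helper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for illustration (1 = spam, 0 = ham).
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```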


## 8. Model Prediction on New Data

In [81]:
# predicting with the model (using a ham e-mail example from the dataset)
test_data = ['''Subject: hpl nom for january 9 , 2001
( see attached file : hplnol 09 . xls )
- hplnol 09 . xls''']
test_data_tfidf = tfidf.transform(test_data)

if lr.predict(test_data_tfidf)[0] == 0:
    print("Ham e-mail")
else:
    print("Spam e-mail")

Ham e-mail
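
For repeated use, the transform-then-predict steps above can be wrapped in a small helper. The sketch below is one way to do it; `classify_email` is a hypothetical function, and the two stub classes only mimic the `transform` and `predict` interface so the example runs without the fitted `tfidf` and `lr` objects.

```python
def classify_email(text, vectorizer, model):
    """Return "Ham e-mail" or "Spam e-mail" for one raw e-mail string."""
    features = vectorizer.transform([text])   # vectorizers expect a list of texts
    prediction = model.predict(features)[0]   # take the single prediction out of the array
    return "Spam e-mail" if prediction == 1 else "Ham e-mail"


# Hypothetical stand-ins: they imitate the interface only, not real model behaviour.
class KeywordVectorizer:
    def transform(self, texts):
        return [[text.lower().count("cheap")] for text in texts]


class KeywordModel:
    def predict(self, rows):
        return [1 if row[0] > 0 else 0 for row in rows]


spam_result = classify_email("Subject: cheap software", KeywordVectorizer(), KeywordModel())
ham_result = classify_email("Subject: hpl nom for january 9 , 2001", KeywordVectorizer(), KeywordModel())
print(spam_result)
print(ham_result)
```

In the notebook itself the call would be `classify_email(test_data[0], tfidf, lr)`, using the vectorizer and model fitted in the earlier cells.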


## 9. Conclusion:
In this project, we built a spam email classifier using Logistic Regression. Here's a summary of the key steps and observations:

1. Dataset Overview: We worked with a dataset containing text-based email data and their labels (ham or spam).
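2. Text Vectorization: TF-IDF was used to transform the text data into numerical vectors, weighting each word by how often it appears in an email and down-weighting words that appear in many emails.
3. Model Training: Logistic Regression was applied to classify emails, reaching about 99.6% accuracy on the training set and about 99.0% on the test set.
4. Results: The gap between training and test accuracy is small (about 0.6 percentage points), which suggests the classifier generalizes well to unseen emails from this dataset. Because roughly 71% of the emails are ham, per-class metrics such as precision and recall would give a fuller picture than accuracy alone.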