Name: Sondos Mohamed

Task 4

**Building an Email Spam Detector**

Email spam, or junk mail, is a widespread issue affecting users globally, often inundating inboxes with unwanted and potentially harmful content. In this project, initiated as part of Oasis Infobyte's Internship, we aim to develop an Email Spam Detector using Python and machine learning techniques.




**Objective**

The primary objective is to create a robust system that can automatically identify and classify emails as either spam or non-spam. By leveraging machine learning algorithms, we'll train the model on a dataset of emails, enabling it to learn patterns and characteristics associated with spam content. Through this project, we endeavor to contribute to the enhancement of email security and user experience.

Let's embark on the journey of creating an effective Email Spam Detector to mitigate the impact of unwanted emails!

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Convert text into numeric feature vectors
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Load the dataset
raw_email = pd.read_csv('/content/spam.csv', encoding='latin1')
raw_email.head(5)

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
raw_email.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [4]:
raw_email.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64
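
The null counts above show that the three `Unnamed` columns are empty in almost every row, so they add little to the analysis. One option, sketched here on a small stand-in frame (the rows are made up for illustration and are not taken from `spam.csv`), is to keep only the label and text columns:

```python
import pandas as pd

# Toy stand-in for raw_email; the rows are illustrative, not from spam.csv
toy = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["see you at lunch", "win a free prize now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Keep only the label (v1) and message text (v2) columns
trimmed = toy[["v1", "v2"]].copy()
print(trimmed.columns.tolist())
```

The notebook itself keeps all five columns, which is harmless because only the label and message columns are used for modelling.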

In [5]:
# Replace null values with empty strings
df = raw_email.fillna('')

In [6]:
df.head(5)

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [7]:
df.shape

(5572, 5)

In [8]:
df.rename(columns={"v1": "Category", "v2": "Message"}, inplace=True)

In [9]:
# Add a numeric label column (spam = 0, ham = 1).
# 'Category' keeps the original string labels, which the model below is trained on.
df['Label'] = df['Category'].map({'spam': 0, 'ham': 1})


In [10]:
# Split features and target variable
X = df['Message']
y = df['Category']

In [11]:
# Convert text to numerical features using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

In [12]:
X

<5572x8672 sparse matrix of type '<class 'numpy.int64'>'
	with 73916 stored elements in Compressed Sparse Row format>

In [13]:
y

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: Category, Length: 5572, dtype: object

In [14]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
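
Note that the vectorizer in this notebook is fitted on the full dataset before the split, so the vocabulary also includes words that occur only in test messages. A possible alternative, sketched below on made-up messages, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from the training portion alone. The `stratify` argument and the toy data are assumptions added for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages for illustration only
messages = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward today", "urgent call now to win",
    "see you at lunch", "call me when you are home",
    "are we still meeting today", "thanks for the notes",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first; stratify keeps the spam/ham ratio in both parts
msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training messages only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(msg_train, lab_train)
print("Held-out accuracy:", pipeline.score(msg_test, lab_test))
```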

In [15]:
# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)
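
For intuition about what `MultinomialNB` learns, the following is a simplified pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. It is not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the toy documents are hypothetical. Each class gets a log prior plus a smoothed log probability for every vocabulary word, and prediction sums those log values.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed word log-likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs_in_class in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)  # Laplace smoothing
        model[label] = {
            "log_prior": math.log(n_docs_in_class / len(labels)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model

def predict_nb(model, tokens):
    """Return the class with the highest summed log score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in tokens:
            # Words never seen in training are skipped
            if w in params["log_likelihood"]:
                score += params["log_likelihood"][w]
        scores[label] = score
    return max(scores, key=scores.get)

# Invented, pre-tokenized training documents
toy_docs = [
    ["win", "free", "prize"], ["free", "entry", "now"],
    ["see", "you", "at", "lunch"], ["call", "you", "later"],
]
toy_labels = ["spam", "spam", "ham", "ham"]

nb = train_nb(toy_docs, toy_labels)
print(predict_nb(nb, ["free", "prize", "now"]))
print(predict_nb(nb, ["see", "you", "later"]))
```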

In [18]:
# Predict on the training data
y_train_pred = model.predict(X_train)

In [20]:
# Predict on the testing data
y_test_pred = model.predict(X_test)

In [22]:
# Calculate accuracy on the training and testing data
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)

Training Accuracy: 0.9943908458604442
Testing Accuracy: 0.97847533632287
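
Accuracy alone can hide mistakes on the less frequent class, so it may be worth also looking at a confusion matrix, precision, and recall for the spam label. The sketch below uses invented label lists rather than this notebook's predictions; on the real data the same calls could be applied to `y_test` and `y_test_pred`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration; not results from this notebook
y_true = ["ham", "ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "ham", "ham", "ham", "ham", "spam", "ham", "spam", "spam"]

# Rows are true classes, columns are predicted classes, in the order given
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
precision = precision_score(y_true, y_pred, pos_label="spam")
recall = recall_score(y_true, y_pred, pos_label="spam")

print(cm)
print("Spam precision:", precision)
print("Spam recall:", recall)
```

Precision answers how many messages flagged as spam really were spam, while recall answers how many of the actual spam messages were caught.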


In [None]:
# Build a simple interactive prediction loop
while True:
    email = input("Enter an email (or 'quit' to exit): ")
    if email.strip().lower() == "quit":
        break
    # Reuse the fitted vectorizer so the features match the training vocabulary
    X_email = vectorizer.transform([email])
    y_email_pred = model.predict(X_email)
    if y_email_pred[0] == 'spam':
        print("This email is classified as spam.")
    else:
        print("This email is classified as ham.")

This email is classified as spam.
Enter an email (or 'quit' to exit): Premoda <premoda-ecom@sta-egypt.com> ​ You ​ View this email in your browser        Shop Summer Looks        This email was sent to sondos.ammar@outlook.com why did I get this?    unsubscribe from this list    update subscription preferences Premoda · City Stars · Store · Cairo 0000 · Egypt
This email is classified as spam.
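
The interactive loop is handy for a quick demo, but the same prediction step can be wrapped in a function so it can be reused or tested without typing input. The sketch below is one way to do that; the helper name `classify_message` and the tiny training set are hypothetical, and the returned probability comes from `predict_proba`, which the notebook itself does not use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def classify_message(text, vectorizer, model):
    """Return (label, probability) for a single message."""
    features = vectorizer.transform([text])
    probabilities = model.predict_proba(features)[0]
    best = probabilities.argmax()
    return model.classes_[best], float(probabilities[best])

# Tiny made-up training set, standing in for the fitted notebook objects
toy_messages = [
    "win a free prize now", "free entry to win cash",
    "see you at lunch", "call me when you are home",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(
    toy_vectorizer.fit_transform(toy_messages), toy_labels
)

label, confidence = classify_message("free prize win", toy_vectorizer, toy_model)
print(label, round(confidence, 3))
```

With the notebook's own objects, the call would be `classify_message(email, vectorizer, model)`.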
