Oasis Infobytes : Data Science Internship

Task-4 : Email Spam Detection with Machine Learning

We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email
that is sent to a massive number of users at one time, frequently containing cryptic
messages, scams, or most dangerously, phishing content.

In this project, we use Python to build an email spam detector, then use machine learning to
train it to classify emails as spam or non-spam (ham).

Machine learning model: this approach uses a machine learning algorithm to analyze labelled email data and learn the patterns and characteristics that distinguish spam from legitimate mail. The model is fitted on the training data, and its accuracy can improve when it is retrained on more labelled examples.

In this project we use label encoding, a train/test split, feature extraction by TF-IDF vectorisation, and logistic regression.

In [1]:
# importing the dependencies
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [2]:
Email_data = pd.read_csv("spam.csv", encoding="ISO-8859-1")

In [3]:
Email_data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
Email_data.tail()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [5]:
Email_data.shape

(5572, 5)

In [6]:
Email_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [7]:
Email_data.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


In [8]:
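# Note: without parentheses this displays the bound method (and with it the DataFrame's repr), not summary statistics
Email_data.describe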

<bound method NDFrame.describe of         v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        N

In [9]:
Email_data.isnull().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [10]:
Email_data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [11]:
Email_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],axis=1,inplace=True)

In [12]:
Email_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [13]:
Email_data.rename(columns={'v1':'Target', 'v2':'Mails'},inplace=True)

In [14]:
Email_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Target  5572 non-null   object
 1   Mails   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [15]:
# To Check duplicate values in dataset
Email_data.duplicated().sum()

403

In [16]:
Email_data =Email_data.drop_duplicates(keep='first')

In [17]:
# Confirm that no duplicate rows remain
Email_data.duplicated().sum()

0

In [18]:
print(Email_data.Target.value_counts())


ham     4516
spam     653
Name: Target, dtype: int64
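The counts above show that the classes are imbalanced. As a small worked example (a sketch that simply re-enters the two counts printed above), the class shares and the accuracy of a trivial "always predict ham" baseline can be computed like this; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"ham": 4516, "spam": 653})

# Share of each class in the de-duplicated dataset.
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a trivial baseline that always predicts the majority class.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly 87% of the messages are ham, so any accuracy figure reported later is best read against that baseline.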


# Label Encoding

Encode the categorical target: convert the text labels ('spam' / 'ham') into numerical values.

In [19]:
# Label spam mail as 0 and ham mail as 1.
# Building a new DataFrame with assign() (instead of replacing in place on the
# result of drop_duplicates) avoids pandas' SettingWithCopyWarning.
Email_data = Email_data.assign(Target=Email_data['Target'].map({'spam': 0, 'ham': 1}))


In [20]:
Email_data.head()

Unnamed: 0,Target,Mails
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
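The notebook encodes the labels by hand. For comparison, here is a minimal sketch of the same step with scikit-learn's `LabelEncoder`, run on a few made-up labels rather than the real dataset. Note that `LabelEncoder` numbers classes in sorted order, so it would give ham = 0 and spam = 1, the reverse of the manual mapping used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the notebook's 'Target' column (illustrative data only).
labels = pd.Series(["ham", "spam", "ham", "ham", "spam"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder assigns integers in sorted class order: 'ham' -> 0, 'spam' -> 1.
# This is the reverse of the manual mapping in this notebook (spam -> 0, ham -> 1).
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original string labels.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Whichever encoding is chosen, the later "if prediction is 1 then HAM" check has to agree with it.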


In [21]:
# Separating the data as texts and label

In [22]:
X = Email_data["Mails"]
Y = Email_data["Target"]

In [23]:
X.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Mails, dtype: object

In [24]:
Y.head()

0    1
1    1
2    0
3    1
4    1
Name: Target, dtype: int64

Splitting the data into training and test sets

In [25]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2,random_state = 3)
# test_size=0.2 means that 20 percent of the data is testing data and 80 percent of the data is training data
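# Optional variation (not what this notebook does): because spam is the minority
# class, passing stratify= keeps the spam/ham ratio the same in both splits.
# The sketch below uses made-up toy data and illustrative variable names.

```python
from sklearn.model_selection import train_test_split

# Toy data: 25 messages, 5 labelled spam (0) and 20 labelled ham (1).
toy_texts = [f"message {i}" for i in range(25)]
toy_labels = [0] * 5 + [1] * 20

# stratify=toy_labels preserves the 20% / 80% class ratio in train and test.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=3, stratify=toy_labels
)

print("train size:", len(toy_X_train), "test size:", len(toy_X_test))
print("spam in train:", toy_y_train.count(0), "spam in test:", toy_y_test.count(0))
```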

In [26]:
print("X_train.shape : ",X_train.shape,"\nX_train.head : ",X_train.head())
print("\nX_test.shape : ",X_test.shape,"\nX_test.head : ",X_test.head())
print("\nY_train.shape : ",Y_train.shape,"\nY_train.head : ",Y_train.head())
print("\nY_test.shape : ",Y_test.shape,"\nY_test.head : ",Y_test.head())


X_train.shape :  (4135,) 
X_train.head :  4443                       COME BACK TO TAMPA FFFFUUUUUUU
982     Congrats! 2 mobile 3G Videophones R yours. cal...
3822    Please protect yourself from e-threats. SIB ne...
3924       As if i wasn't having enough trouble sleeping.
4927    Just hopeing that wasnÛ÷t too pissed up to re...
Name: Mails, dtype: object

X_test.shape :  (1034,) 
X_test.head :  4994    Just looked it up and addie goes back Monday, ...
4292    You best watch what you say cause I get drunk ...
4128                 Me i'm not workin. Once i get job...
4429          Yar lor... How u noe? U used dat route too?
660     Under the sea, there lays a rock. In the rock,...
Name: Mails, dtype: object

Y_train.shape :  (4135,) 
Y_train.head :  4443    1
982     0
3822    1
3924    1
4927    1
Name: Target, dtype: int64

Y_test.shape :  (1034,) 
Y_test.head :  4994    1
4292    1
4128    1
4429    1
660     1
Name: Target, dtype: int64


In [27]:
Y_train.head()

4443    1
982     0
3822    1
3924    1
4927    1
Name: Target, dtype: int64

# Feature Extraction

Transform the text data into TF-IDF feature vectors that can be used as input to the logistic regression model

In [28]:
# lowercase expects a boolean, not the string 'True'
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

In [29]:
# Fit the vectorizer on X_train only, then transform both X_train and X_test with it
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
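To make the fit/transform distinction concrete, here is a small sketch on made-up sentences (not the real dataset). It assumes the same vectorizer settings as above; the variable names are illustrative. The vocabulary is learned from the training texts only, so a word seen only at test time is ignored and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example texts; 'unseenword' appears only in the test text.
train_docs = [
    "Free prize, claim your prize now",
    "Lunch meeting tomorrow",
    "Claim a free lunch",
]
test_docs = ["Free lunch unseenword"]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training texts.
train_matrix = toy_vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; unknown words are dropped.
test_matrix = toy_vectorizer.transform(test_docs)

vocabulary = toy_vectorizer.vocabulary_
print(sorted(vocabulary))
print(train_matrix.shape, test_matrix.shape)
```

Fitting the vectorizer on the test texts as well would leak information from the test set into the features, which is why only `transform` is applied to `X_test`.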

In [30]:
# Convert Y_train and Y_test values as integers
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [31]:
X_train

4443                       COME BACK TO TAMPA FFFFUUUUUUU
982     Congrats! 2 mobile 3G Videophones R yours. cal...
3822    Please protect yourself from e-threats. SIB ne...
3924       As if i wasn't having enough trouble sleeping.
4927    Just hopeing that wasnÛ÷t too pissed up to re...
                              ...                        
806      sure, but make sure he knows we ain't smokin yet
990                                          26th OF JULY
1723    Hi Jon, Pete here, Ive bin 2 Spain recently & ...
3519    No it will reach by 9 only. She telling she wi...
1745    IåÕm cool ta luv but v.tired 2 cause i have be...
Name: Mails, Length: 4135, dtype: object

In [32]:
print("X_train_features:\n",X_train_features)

X_train_features:
   (0, 2697)	0.7205755344386542
  (0, 6409)	0.5950532917415522
  (0, 1825)	0.35592482233751443
  (1, 5438)	0.27399320458839144
  (1, 4583)	0.27399320458839144
  (1, 4438)	0.22516921191243092
  (1, 5036)	0.27399320458839144
  (1, 2274)	0.27399320458839144
  (1, 2920)	0.23390504161994488
  (1, 3610)	0.27399320458839144
  (1, 4984)	0.19732502227978832
  (1, 4180)	0.23390504161994488
  (1, 7137)	0.24133495616477563
  (1, 6940)	0.27399320458839144
  (1, 203)	0.27399320458839144
  (1, 6941)	0.27399320458839144
  (1, 453)	0.25698446420786897
  (1, 4333)	0.15929709793058355
  (1, 1885)	0.22516921191243092
  (2, 953)	0.26160275768603725
  (2, 4856)	0.26160275768603725
  (2, 5786)	0.26160275768603725
  (2, 2459)	0.22436535516409714
  (2, 4960)	0.26160275768603725
  (2, 5976)	0.1902832473629628
  :	:
  (4132, 6862)	0.11085392369947865
  (4132, 5612)	0.14854309693836068
  (4132, 3865)	0.16898098428277844
  (4133, 6457)	0.6154177820886059
  (4133, 5320)	0.5530764956488926
  (4133,

# Training the Model

Logistic Regression

In [33]:
Model = LogisticRegression()

In [34]:
Model.fit(X_train_features,Y_train)

LogisticRegression()
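As an alternative way to organise the same two steps, the vectorizer and the classifier can be chained in a scikit-learn `Pipeline`, so that raw text goes in and predictions come out. This is only a sketch on made-up messages with illustrative names; the notebook itself keeps the two objects separate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages; 0 = spam, 1 = ham, matching this notebook's encoding.
toy_mails = [
    "Win a free prize now",
    "Claim your free cash reward",
    "Urgent! You have won a prize",
    "Are we still meeting for lunch?",
    "See you at home tonight",
    "Can you send me the report?",
]
toy_targets = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF and then logistic regression in one object.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(toy_mails, toy_targets)

# Raw strings can be passed directly; vectorisation happens inside the pipeline.
toy_prediction = spam_pipeline.predict(["You have won free cash"])
print(toy_prediction)
```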

# Model Evaluation

Prediction on Training data

In [35]:
# Prediction on Training data
Training_data_Model_prediction = Model.predict(X_train_features)
Training_data_Model_prediction

array([1, 1, 1, ..., 1, 1, 1])

In [36]:
# Finding the accuracy of the training data
acc_score_for_training_data = accuracy_score(Y_train,Training_data_Model_prediction)
print("Accuracy on training data : ",acc_score_for_training_data)

Accuracy on training data :  0.962273276904474


Prediction on Testing data

In [37]:
# Prediction on Testing data
Testing_data_Model_prediction = Model.predict(X_test_features)
Testing_data_Model_prediction

array([1, 1, 1, ..., 1, 0, 1])

In [38]:
# Finding the accuracy of the testing data
acc_score_for_testing_data = accuracy_score(Y_test,Testing_data_Model_prediction)
print("Accuracy on testing data : ",acc_score_for_testing_data)

Accuracy on testing data :  0.960348162475822


Conclusion: the accuracy on the training data (about 0.962) and on the testing data (about 0.960) are nearly equal, which suggests that the model is not overfitting.
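Because ham messages far outnumber spam (4516 versus 653 after de-duplication), accuracy alone can hide weak spam detection. The sketch below uses made-up labels, not the notebook's predictions, to show how a confusion matrix plus precision and recall for the spam class (encoded as 0 here) expose this; on the real data the same calls could be applied to `Y_test` and `Testing_data_Model_prediction`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 8 ham (1) and 2 spam (0); one spam message is missed.
toy_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])

# pos_label=0 because spam is encoded as 0 in this notebook.
spam_precision = precision_score(toy_true, toy_pred, pos_label=0)
spam_recall = recall_score(toy_true, toy_pred, pos_label=0)

print("accuracy:", toy_accuracy)
print(toy_cm)
print("spam precision:", spam_precision, "spam recall:", spam_recall)
```

In this toy case accuracy is 0.9 even though only half of the spam messages are caught.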

# Building a Predictive System

In [39]:
input_mail = ["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.,,,"]

In [40]:
# Convert this text into feature vectors 
input_mail_features = feature_extraction.transform(input_mail)

# Making Predictions
prediction = Model.predict(input_mail_features)
print(prediction)
# predict returns an array with one label per input mail; 1 means ham, 0 means spam
if prediction[0] == 1:
    print("This is a HAM mail")
else:
    print("This is a SPAM mail")


[1]
This is a HAM mail
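The prediction step can be wrapped in a small helper so that it is reusable for any message. The function below is a hypothetical sketch, not part of the original notebook: `classify_mail` and its parameters are illustrative names, and the vectorizer and model are fitted on a few made-up messages only so that the block runs on its own. With the notebook's objects one could call it as `classify_mail(text, feature_extraction, Model)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_mail(text, vectorizer, model, spam_label=0):
    """Return ('SPAM' or 'HAM', estimated spam probability) for one message."""
    features = vectorizer.transform([text])
    predicted = model.predict(features)[0]
    # Locate the probability column that corresponds to the spam class.
    spam_index = list(model.classes_).index(spam_label)
    spam_probability = float(model.predict_proba(features)[0][spam_index])
    label = "SPAM" if predicted == spam_label else "HAM"
    return label, spam_probability


# Tiny made-up training set (0 = spam, 1 = ham) so the example is self-contained.
demo_mails = [
    "Free entry to win cash prizes",
    "Congratulations, claim your free reward",
    "Call now to win a free phone",
    "I will call you after the meeting",
    "Thanks for your help yesterday",
    "Let us have dinner at home",
]
demo_targets = [0, 0, 0, 1, 1, 1]

demo_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
demo_model = LogisticRegression()
demo_model.fit(demo_vectorizer.fit_transform(demo_mails), demo_targets)

demo_label, demo_probability = classify_mail(
    "Thank you, you have been wonderful", demo_vectorizer, demo_model
)
print(demo_label, round(demo_probability, 3))
```

Returning the probability alongside the label makes it possible to apply a stricter or looser threshold than the default 0.5 if false alarms or missed spam are more costly.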
