Importing Libraries

Importing Libraries: This section imports the necessary libraries for the code, including numpy, pandas, sklearn, seaborn, scipy, and matplotlib. These libraries are used for various tasks such as data manipulation, machine learning algorithms, data visualization, and statistical operations.


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import seaborn as sns
import scipy.stats as st
import matplotlib.pyplot as plt

Data Collection 

Data Collection: In this section, the code loads the data from a CSV file hosted on GitHub into a pandas DataFrame called raw_mail_data.

In [None]:
#Loading the data from csv file to pandas dataframe
url='https://raw.githubusercontent.com/tejaschaudhari192/Spam-Email-Detection/main/mail_data.csv'
raw_mail_data=pd.read_csv(url)

In [None]:
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]
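
The loading step can be wrapped in a small check that fails early if the file does not have the expected layout. This is only a sketch: the helper name load_mail_data and the in-memory sample CSV are assumptions for illustration, and only the column names Category and Message come from the output above.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["Category", "Message"]  # column names seen in the printed DataFrame


def load_mail_data(source):
    """Read the CSV and raise an error if an expected column is missing."""
    df = pd.read_csv(source)
    missing = [name for name in EXPECTED_COLUMNS if name not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Stand-in for the hosted file: a tiny in-memory CSV with the same two columns.
sample_csv = io.StringIO(
    "Category,Message\n"
    "ham,Ok lar... Joking wif u oni...\n"
    "spam,Free entry in 2 a wkly comp to win FA Cup fina...\n"
)
sample_df = load_mail_data(sample_csv)
print(sample_df.shape)
```

The same helper should accept the url variable in place of the in-memory sample, since pd.read_csv reads from URLs as well as file-like objects.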


Exploratory Data Analysis

Exploratory Data Analysis: This section explores the loaded data. It prints the first 5 and last 5 rows of the DataFrame, checks the number of rows and columns, summarizes the data with describe(), and examines the unique values in each column.

In [None]:
#Replace null values with an empty string
mail_data=raw_mail_data.where(pd.notnull(raw_mail_data),'')

In [None]:
#Checking that no null values remain
mail_data.isnull().sum()

Category    0
Message     0
dtype: int64

In [None]:
#Printing First 5 rows of the dataframe
mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [None]:
mail_data.tail()

     Category                                            Message
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name


In [None]:
#Checking the number of rows and columns in the dataframe
mail_data.shape

(5572, 2)

In [None]:
raw_mail_data.describe()

       Category                 Message
count      5572                    5572
unique        2                    5157
top         ham  Sorry, I'll call later
freq       4825                      30


In [None]:
mail_data.columns

Index(['Category', 'Message'], dtype='object')

In [None]:
mail_data.nunique()

Category       2
Message     5157
dtype: int64
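
The gap between the row count (5572) and the number of unique messages (5157) means some messages appear more than once. The sketch below works out that gap and shows, on a hypothetical four-row frame, how pandas counts and drops repeated messages. The notebook itself does not remove duplicates, so treat the last step as optional.

```python
import pandas as pd

# Repeated-message rows implied by the outputs above: 5572 rows, 5157 distinct messages.
print(5572 - 5157)  # 415

# Hypothetical mini-frame (not from the dataset) with one repeated message.
demo = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham"],
    "Message": ["Sorry, I'll call later", "Ok lar...", "Free entry", "Sorry, I'll call later"],
})

# duplicated() flags rows whose message already appeared in an earlier row.
repeat_count = int(demo["Message"].duplicated().sum())
distinct_count = demo["Message"].nunique()
print(repeat_count, distinct_count)  # 1 3

# drop_duplicates keeps the first occurrence of each message.
deduplicated = demo.drop_duplicates(subset="Message")
print(deduplicated.shape)  # (3, 2)
```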

In [None]:
mail_data['Category'].unique()

array(['ham', 'spam'], dtype=object)

In [None]:
mail_data['Message'].unique()

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       'Ok lar... Joking wif u oni...',
       "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       ..., 'Pity, * was in mood for that. So...any other suggestions?',
       "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
       'Rofl. Its true to its name'], dtype=object)

Cleaning the data: The code replaces any null values in raw_mail_data with empty strings using the where() function and stores the result in a new DataFrame called mail_data. The isnull().sum() check then confirms that no null values remain in either column.

No further cleaning is needed because the dataset is already clean. Both attributes, Category and Message, are essential for spam mail detection, so no column is removed.
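
To make the null handling concrete, here is a minimal sketch on a hypothetical three-row frame with one missing message. It applies the same where() pattern as the notebook and compares it with fillna(''), which expresses the same intent more directly. The demo data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing message (not from the dataset).
demo = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar...", np.nan, "Rofl. Its true to its name"],
})

# Count missing values per column before cleaning.
missing_before = demo.isnull().sum()
print(missing_before)

# Same pattern as the notebook: keep non-null cells, replace the rest with ''.
cleaned_where = demo.where(pd.notnull(demo), "")

# fillna('') is a shorter way to request the same replacement.
cleaned_fillna = demo.fillna("")

print(cleaned_where["Message"].tolist())
print(cleaned_fillna["Message"].tolist())
```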

Label Encoding

Label Encoding: This section assigns numeric labels to the spam and ham categories. Spam mail becomes 0 and ham mail becomes 1. The loc indexer selects the rows with a given category name and assigns the corresponding label to them.

In [None]:
#Label spam mail as 0 and ham mail as 1
mail_data.loc[mail_data['Category'] == 'spam','Category']=0
mail_data.loc[mail_data['Category'] == 'ham','Category']=1
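
An alternative way to express the same encoding is a dictionary passed to map(). The sketch below uses a hypothetical three-row frame and a separate Label column, both assumptions for illustration. Unlike assignment through loc, which leaves the column with dtype object (as the printed Y shows later), this produces an integer column directly.

```python
import pandas as pd

# Hypothetical mini-frame using the same two category names.
demo = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Same convention as the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
demo["Label"] = demo["Category"].map(label_map)

# Category names missing from label_map would become NaN, so check before casting.
if demo["Label"].isna().any():
    raise ValueError("found a category that is not in label_map")
demo["Label"] = demo["Label"].astype(int)

print(demo["Label"].tolist())  # [1, 0, 1]
```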

In [None]:
#Separating the data as texts and label
X=mail_data['Message']
Y=mail_data['Category']

In [None]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [None]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


In [None]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=3)

In [None]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)
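
The shapes above follow from test_size=0.2: scikit-learn rounds the test share up, so 0.2 × 5572 = 1114.4 becomes 1115 test rows and leaves 4457 for training. Because ham messages far outnumber spam, one optional refinement is to pass stratify so both splits keep the same class ratio. The sketch below shows this on hypothetical data; the notebook's own split does not use stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (1) and 20 spam (0).
messages = pd.Series([f"message {i}" for i in range(100)])
labels = pd.Series([1] * 80 + [0] * 20)

# stratify=labels keeps the ham/spam ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, random_state=3, stratify=labels
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # share of ham in each split
```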


Feature Extraction

Feature Extraction: In this section, the text data is transformed into feature vectors using the TfidfVectorizer from sklearn. The vectorizer converts the text into a numerical representation that can be used as input to the logistic regression model. The fit_transform() function is used on the training data (X_train), and the transform() function is used on the testing data (X_test), so the vocabulary and weights are learned from the training messages only.

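Separating the data: Before the features are extracted, the code splits the data into training and testing sets using the train_test_split() function from sklearn. It takes the message data (X) and the corresponding category data (Y) as inputs and returns four sets: X_train, X_test, Y_train, and Y_test. With test_size=0.2, 1115 of the 5572 messages are held out for testing and 4457 are used for training; random_state=3 makes the split reproducible.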

In [None]:
# transform the text data to feature vectors that can be used as input to the Logistic regression

feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test values to integers (they have dtype object after label encoding)

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
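
The difference between fit_transform() and transform() is easier to see on a toy corpus. In the sketch below, the texts are hypothetical and not taken from the dataset. A word that appears only in the test text never enters the vocabulary, and the test matrix has the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the dataset.
train_texts = ["win free ringtone prize", "lunch meeting tomorrow", "free lunch tomorrow"]
test_texts = ["free pizza tomorrow"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training texts only.
train_features = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary; "pizza" was never seen, so it gets no column.
test_features = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_features.shape, test_features.shape)
print("pizza" in vectorizer.vocabulary_)
```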

In [None]:
print(X_train)

3075                  Don know. I did't msg him recently.
1787    Do you know why god created gap between your f...
1614                         Thnx dude. u guys out 2nite?
4304                                      Yup i'm free...
3266    44 7732584351, Do you want a New Nokia 3510i c...
                              ...                        
789     5 Free Top Polyphonic Tones call 087018728737,...
968     What do u want when i come back?.a beautiful n...
1667    Guess who spent all last night phasing in and ...
3321    Eh sorry leh... I din c ur msg. Not sad alread...
1688    Free Top ringtone -sub to weekly ringtone-get ...
Name: Message, Length: 4457, dtype: object


In [None]:
print(X_train_features)

  (0, 5413)	0.6198254967574347
  (0, 4456)	0.4168658090846482
  (0, 2224)	0.413103377943378
  (0, 3811)	0.34780165336891333
  (0, 2329)	0.38783870336935383
  (1, 4080)	0.18880584110891163
  (1, 3185)	0.29694482957694585
  (1, 3325)	0.31610586766078863
  (1, 2957)	0.3398297002864083
  (1, 2746)	0.3398297002864083
  (1, 918)	0.22871581159877646
  (1, 1839)	0.2784903590561455
  (1, 2758)	0.3226407885943799
  (1, 2956)	0.33036995955537024
  (1, 1991)	0.33036995955537024
  (1, 3046)	0.2503712792613518
  (1, 3811)	0.17419952275504033
  (2, 407)	0.509272536051008
  (2, 3156)	0.4107239318312698
  (2, 2404)	0.45287711070606745
  (2, 6601)	0.6056811524587518
  (3, 2870)	0.5864269879324768
  (3, 7414)	0.8100020912469564
  (4, 50)	0.23633754072626942
  (4, 5497)	0.15743785051118356
  :	:
  (4454, 4602)	0.2669765732445391
  (4454, 3142)	0.32014451677763156
  (4455, 2247)	0.37052851863170466
  (4455, 2469)	0.35441545511837946
  (4455, 5646)	0.33545678464631296
  (4455, 6810)	0.29731757715898277
  (4
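
Each line of the sparse output above has the form (row, column) weight: the row is the message's position in X_train, the column is a vocabulary index, and the weight is the TF-IDF value. A minimal sketch of how column indices can be mapped back to words, using a hypothetical two-message corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-message corpus used only to illustrate the output format.
texts = ["free ringtone prize", "lunch meeting tomorrow"]
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# vocabulary_ maps term -> column index; invert it to look terms up by column.
index_to_term = {int(idx): term for term, idx in vectorizer.vocabulary_.items()}

# Walk the non-zero entries of row 0 and print them next to their terms.
row = features[0]
row_terms = []
for col, weight in zip(row.indices, row.data):
    term = index_to_term[int(col)]
    row_terms.append(term)
    print((0, int(col)), term, round(float(weight), 4))
```

With the default settings each row is L2-normalized, so the squared weights in a row sum to 1.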

Training the Model

Training the Model - Logistic Regression: A LogisticRegression model from sklearn is created with its default settings. Its fit() function is then called with the training features (X_train_features) and the corresponding labels (Y_train) to train the model.

Logistic Regression

In [None]:
#Creating the logistic regression model (LogisticRegression is imported in the first cell)
model = LogisticRegression()

In [None]:
#Training the logistic regression model with the training data
model.fit(X_train_features, Y_train)
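
As an optional variation, the vectorizer and the classifier can be chained with make_pipeline from sklearn.pipeline, so that a single fit() or predict() call handles both steps on raw text. The sketch below uses hypothetical toy messages with the notebook's label convention (0 = spam, 1 = ham). The notebook itself keeps the two objects separate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = spam, 1 = ham, matching the notebook's labels.
texts = [
    "win a free prize now", "free ringtone claim your prize",
    "urgent free cash prize", "are we meeting for lunch",
    "see you at lunch tomorrow", "running late for the meeting",
]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies the vectorizer and then the classifier in one call.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(texts, labels)

pipeline_predictions = spam_pipeline.predict(["free prize", "lunch meeting"])
print(pipeline_predictions)
```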

Evaluating the trained model

Evaluating the trained model: The code evaluates the trained model by making predictions on both the training data and testing data. The predict() function is used to predict the categories for the features, and the accuracy_score() function is used to calculate the accuracy of the predictions by comparing them with the actual labels.

In [None]:
#Prediction on training data
Prediction_on_training_data=model.predict(X_train_features)
accuracy_on_training_data=accuracy_score(Y_train,Prediction_on_training_data)

In [None]:
print("Accuracy_on_training_data:",accuracy_on_training_data)

Accuracy_on_training_data: 0.9670181736594121


In [None]:
#Prediction on test data
Prediction_on_test_data=model.predict(X_test_features)
accuracy_on_test_data=accuracy_score(Y_test,Prediction_on_test_data)

In [None]:
print("Accuracy_on_test_data:",accuracy_on_test_data)

Accuracy_on_test_data: 0.9659192825112107
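
Because most messages are ham, it helps to read the accuracy against a majority-class baseline and to look at per-class metrics. The sketch below computes the baseline from the describe() output (4825 ham out of 5572 messages in the full dataset; the share in the test split may differ slightly). It then shows confusion_matrix, precision_score, and recall_score on hypothetical labels and predictions, which are not results from this model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Majority-class baseline from the describe() output: 4825 of 5572 messages are ham.
baseline_accuracy = 4825 / 5572
print(round(baseline_accuracy, 4))  # 0.8659

# Hypothetical labels and predictions (0 = spam, 1 = ham) to show the metrics.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [spam, ham].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Treat spam (0) as the class of interest.
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)
print(spam_precision, round(spam_recall, 4))
```

In this toy case 8 of 10 predictions are correct, yet only one of the three spam messages is caught, which accuracy alone would not reveal.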


Building a predictive system

Building a predictive system: This section demonstrates how to use the trained model and feature extraction to make predictions on new input data. It defines an input mail and converts it into feature vectors using the same vectorizer. The predict() function is then used to predict the category of the input mail, and the result is printed as either "Ham mail" or "Spam mail".

In [None]:
input_mail = ["XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL"]

# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# making prediction

prediction = model.predict(input_data_features)
print(prediction)


if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')

[0]
Spam mail
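
The prediction steps can be collected into a reusable function. In the sketch below, classify_message is a hypothetical helper name, and the toy vectorizer and model are stand-ins for the notebook's fitted feature_extraction and model. It also returns the estimated spam probability from predict_proba(), which the notebook does not print.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_message(text, vectorizer, model):
    """Return ('Ham mail' or 'Spam mail', estimated probability of spam)."""
    features = vectorizer.transform([text])
    label = model.predict(features)[0]
    # predict_proba columns follow model.classes_; find the column for spam (0).
    spam_column = list(model.classes_).index(0)
    spam_probability = model.predict_proba(features)[0][spam_column]
    return ("Ham mail" if label == 1 else "Spam mail"), float(spam_probability)


# Hypothetical stand-ins for the notebook's fitted feature_extraction and model.
toy_texts = ["win a free prize now", "free ringtone prize",
             "lunch meeting tomorrow", "running late for the meeting"]
toy_labels = [0, 0, 1, 1]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

result_label, result_probability = classify_message(
    "claim your free prize", toy_vectorizer, toy_model
)
print(result_label, round(result_probability, 3))
```

With the notebook's objects, the call would presumably be classify_message(input_mail[0], feature_extraction, model).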


Model and Vectorizer Serialization: Finally, the code uses the pickle library to save the trained model and vectorizer as pickle files (model.pkl and vectorizer.pkl, respectively). This allows the model to be loaded and used later without retraining.

In [None]:
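import pickle

#Saving the trained model and the fitted vectorizer; the with blocks close each file
with open('model.pkl','wb') as model_file:
    pickle.dump(model,model_file)
with open('vectorizer.pkl','wb') as vectorizer_file:
    pickle.dump(feature_extraction,vectorizer_file)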
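
To use the saved files later, both objects are loaded with pickle.load() and applied in the same order as before: vectorizer first, then model. The sketch below is self-contained, so it trains hypothetical stand-in objects and writes them to a temporary directory instead of the notebook's model.pkl and vectorizer.pkl. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the trained objects (0 = spam, 1 = ham).
texts = ["win a free prize now", "free ringtone prize",
         "lunch meeting tomorrow", "running late for the meeting"]
labels = [0, 0, 1, 1]
vectorizer = TfidfVectorizer(stop_words="english")
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Save to a temporary directory so the sketch does not overwrite real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Later (for example in a separate script): load both objects again.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

# The loaded objects are used exactly like the originals.
new_features = loaded_vectorizer.transform(["free prize"])
loaded_prediction = loaded_model.predict(new_features)
print(loaded_prediction)
```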