# **SIMPLBYTE**

## Task 1. EMAIL SPAM detection using Machine Learning

We’ve all been the recipient of spam emails before. Spam mail, or junk mail, is a type of email that is sent to a massive number of users at one time, frequently containing cryptic messages, scams, or most dangerously, phishing content.

In this project, we use Python to build an email spam detector, then apply machine learning so that the detector learns to classify emails as spam or non-spam. Let’s get started!

### Steps

* Collect a dataset of spam and non-spam emails.
* Clean and preprocess the text data.
* Extract numerical features from the preprocessed data.
* Train a machine learning model on the features.
* Evaluate the model's performance on a test dataset.
* Refine the model through tuning and experimentation.
* Deploy the model for automated spam detection.
 

## 1. Import all necessary libraries


In [54]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import cv2

## 2. Import dataframe


In [55]:
df = pd.read_csv('spam.csv')
df.head()

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


## 3. Check for null values


In [56]:
df.isnull().sum()

Unnamed: 0    0
label         0
text          0
label_num     0
dtype: int64

There are no null values in our data.

## 4. Check for duplicate rows


In [57]:
df.duplicated().sum()


0

There are no duplicate rows in our data.
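
Note that `df.duplicated()` compares entire rows, so two rows with identical email text but different values in the `Unnamed: 0` id column are not counted as duplicates. The sketch below uses a small invented frame (not the real dataset) to show how restricting the check to the `text` column can give a different answer.

```python
import pandas as pd

# Toy frame with invented values: rows 0 and 2 share the same text but have different ids.
toy = pd.DataFrame({
    "Unnamed: 0": [10, 11, 12],
    "label": ["spam", "ham", "spam"],
    "text": ["win a prize now", "meeting at noon", "win a prize now"],
})

# Compares every column, so the differing id hides the repeated email.
full_row_dupes = int(toy.duplicated().sum())

# Compares only the email body.
text_dupes = int(toy.duplicated(subset=["text"]).sum())

print(full_row_dupes, text_dupes)
```

Running the same `subset=["text"]` check on the real `df` would show whether any emails are repeated under different ids.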

## 5. Summary of data


In [58]:
df.describe()


,Unnamed: 0,label_num
count,5171.0,5171.0
mean,2585.0,0.289886
std,1492.883452,0.453753
min,0.0,0.0
25%,1292.5,0.0
50%,2585.0,0.0
75%,3877.5,1.0
max,5170.0,1.0


df.info()

In [59]:
df['label'].unique()

array(['ham', 'spam'], dtype=object)

## 6. Handling the imbalanced dataset

In [60]:
df['label'].value_counts()

ham     3672
spam    1499
Name: label, dtype: int64

In [61]:
minority = df[df['label'] == 'spam']
majority = df[df['label'] == 'ham']


In [62]:
minority


,Unnamed: 0,label,text,label_num
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
7,4185,spam,Subject: looking for medication ? we ` re the ...,1
10,4922,spam,Subject: vocable % rnd - word asceticism\r\nvc...,1
11,3799,spam,Subject: report 01405 !\r\nwffur attion brom e...,1
13,3948,spam,Subject: vic . odin n ^ ow\r\nberne hotbox car...,1
...,...,...,...,...
5159,4381,spam,Subject: pictures\r\nstreamlined denizen ajar ...,1
5161,4979,spam,Subject: penny stocks are about timing\r\nnoma...,1
5162,4162,spam,Subject: anomaly boys from 3881\r\nuosda apapr...,1
5164,4365,spam,Subject: slutty milf wants to meet you\r\ntake...,1


In [63]:
majority

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
5,2949,ham,Subject: ehronline web address change\r\nthis ...,0
...,...,...,...,...
5165,2849,ham,"Subject: fw : crosstex energy , driscoll ranch...",0
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0


In [64]:
from sklearn.utils import resample

# Upsample the minority class (spam) with replacement to match the majority class size
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

In [65]:
minority_upsampled.count()

Unnamed: 0    3672
label         3672
text          3672
label_num     3672
dtype: int64

In [66]:
df=pd.concat([majority,minority_upsampled]).reset_index(drop=True)

In [67]:
df['label'].value_counts()

ham     3672
spam    3672
Name: label, dtype: int64
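
One caveat with this approach: because the upsampling happens before the train/test split, copies of the same spam email can end up in both partitions, which tends to inflate test scores. A common alternative is to split first and upsample only the training part. The sketch below illustrates that order on invented toy data; the names `train_bal`, `train_major`, and `train_minor` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (invented): 12 ham, 4 spam, all texts unique.
toy = pd.DataFrame({
    "text": [f"ham message {i}" for i in range(12)]
            + [f"spam message {i}" for i in range(4)],
    "label": ["ham"] * 12 + ["spam"] * 4,
})

# 1) Split first, keeping the class ratio in both parts.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=42
)

# 2) Upsample the minority class inside the training part only.
train_major = train[train["label"] == "ham"]
train_minor = train[train["label"] == "spam"]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=42
)
train_bal = pd.concat([train_major, train_minor_up]).reset_index(drop=True)

# 3) Check that no test email also appears in the balanced training set.
overlap = set(train_bal["text"]) & set(test["text"])
print(train_bal["label"].value_counts().to_dict(), len(overlap))
```

The test set keeps its natural class ratio here, so its scores reflect how the model would behave on unbalanced incoming mail.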

## 7. Check column name


In [68]:
df.columns

Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')

## 8. Shape of dataset


In [69]:
df.shape

(7344, 4)

## 9. Check the datatype


In [70]:
df.dtypes

Unnamed: 0     int64
label         object
text          object
label_num      int64
dtype: object

## 10. Splitting the data into features and target


In [71]:
x = df[["text"]]
y = df[["label"]]

In [72]:
x

,text
0,Subject: enron methanol ; meter # : 988291\r\n...
1,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,Subject: re : indian springs\r\nthis deal is t...
4,Subject: ehronline web address change\r\nthis ...
...,...
7339,"Subject: vulgar\r\nmuniz ,\r\ngovenment don ' ..."
7340,"Subject: confidence is back\r\nhello ,\r\nmy b..."
7341,"Subject: hey ,\r\nhello , it ' s me lauren . ...."
7342,"Subject: cheap v . iagra , phentermine , xa . ..."


In [73]:
y

,label
0,ham
1,ham
2,ham
3,ham
4,ham
...,...
7339,spam
7340,spam
7341,spam
7342,spam


## 11. Training and testing data


In [74]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

In [75]:
x_train.shape

(5875, 1)

In [76]:

x_test

,text
1313,"Subject: wells\r\ndaren , eog is having proble..."
4381,Subject: my discovery\r\nfinally !\r\ni have a...
4020,Subject: do you care ?\r\nthem kind so work ga...
4823,Subject: strong buy alert : monthly newsletter...
5535,Subject: tutored best n @ k . ed lolitas . . ....
...,...
1415,Subject: hpl 3 - rivers gas flowing through ki...
325,Subject: = ? ansi _ x 3 . 4 - 1968 ? q ? new _...
2310,Subject: hl & p february premlinary flow numbe...
5917,"Subject: attract the opposite sex , the ultima..."


In [77]:
y_train

,label
2658,ham
6901,spam
1164,ham
2580,ham
4525,spam
...,...
6708,spam
4568,spam
478,ham
1285,ham


In [78]:
y_test

,label
1313,ham
4381,spam
4020,spam
4823,spam
5535,spam
...,...
1415,ham
325,ham
2310,ham
5917,spam


## 12. Feature Extraction

In [79]:
from sklearn import preprocessing

# LabelEncoder maps each distinct email text to an arbitrary integer id
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(x_train)
x_train = label_encoder.transform(x_train).reshape(-1, 1)


In [80]:
# Refitting on the test set assigns ids unrelated to those learned from the training set
label_encoder.fit(x_test)
x_test = label_encoder.transform(x_test).reshape(-1, 1)

In [81]:
# Encode the target: spam -> 0, ham -> 1 (the opposite of the label_num column)
y_train.replace({"spam": 0, "ham": 1}, inplace=True)
y_test.replace({"spam": 0, "ham": 1}, inplace=True)

In [82]:
y_train.shape

(5875, 1)

In [83]:
y_test.shape

(1469, 1)

In [84]:
x_train.shape

(5875, 1)

In [85]:
x_test.shape

(1469, 1)
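
A note on this step: `LabelEncoder` is documented by scikit-learn as a tool for encoding target labels. Applied to email text, it gives each distinct message an arbitrary integer id, so the single resulting feature carries no information about the words in the email, which is consistent with the near-chance scores reported later in the notebook. A common alternative for text is a bag-of-words representation such as `TfidfVectorizer`, fitted on the training text only and then reused on the test text. The following is a minimal sketch on invented sentences, offered as one possible approach rather than the notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example texts, standing in for the train and test email bodies.
train_texts = [
    "cheap pills buy now",
    "meeting moved to monday",
    "buy cheap software now",
]
test_texts = ["buy pills on monday", "totally unseen words"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training text only.
X_train = vectorizer.fit_transform(train_texts)

# Reuse the same vocabulary for the test text; do not refit.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns (one per vocabulary word), and a test email made only of words never seen in training becomes an all-zero row.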

## 13. Model fitting



In [86]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(x_train,y_train)

MultinomialNB()

In [87]:
y_Pred = model.predict(x_test)
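
`MultinomialNB` models word counts (or similar non-negative frequencies), so it is usually paired with a text vectorizer. The sketch below, on invented toy messages, chains `CountVectorizer` and `MultinomialNB` with scikit-learn's `make_pipeline` so that raw strings can be passed directly to `fit` and `predict`. It is an illustration under those assumptions, not the notebook's actual model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages and labels.
texts = [
    "win cash prize now",
    "cheap pills online",
    "claim your free prize",
    "meeting agenda attached",
    "gas nomination for monday",
    "please review the contract",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline vectorizes the text, then fits Naive Bayes on the word counts.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# New messages go through the same vectorizer before prediction.
preds = clf.predict(["free cash prize", "contract for monday meeting"])
print(preds)
```

Keeping the vectorizer inside the pipeline means it is fitted only on whatever data is passed to `fit`, which avoids accidentally learning vocabulary from the test set.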

## 14. Check model accuracy

**Accuracy on the training dataset**


In [88]:
model.score(x_train, y_train)

0.5014468085106383

**Accuracy on the test dataset**

In [89]:
model.score(x_test, y_test)

0.494213750850919

In [90]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_Pred)
r2

-1.0234159779614327

In [91]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

MAE = mean_absolute_error(y_test, y_Pred)
MAE

0.505786249149081

In [92]:
np.sqrt(mean_squared_error(y_test, y_Pred))  # root mean squared error

0.7111865079914558

In [93]:
MSE = mean_squared_error(y_test, y_Pred)
MSE

0.505786249149081
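
R², MAE, and MSE are regression metrics. With 0/1 labels, MAE and MSE both reduce to the error rate (1 minus accuracy), which is why the two values above are identical and equal 1 - 0.4942. For a classifier, a confusion matrix with precision and recall is usually more informative. The sketch below uses invented labels to show the calls; note that it treats 1 as spam, whereas the notebook maps spam to 0.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels for illustration; here 1 = spam and 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)      # share of correct predictions
cm = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted class
prec = precision_score(y_true, y_pred)    # of the emails flagged as spam, share that are spam
rec = recall_score(y_true, y_pred)        # of the actual spam emails, share that were flagged

print(acc, prec, rec)
print(cm)
```

In spam filtering the two error types have different costs (a legitimate email sent to the spam folder versus a spam email reaching the inbox), so looking at precision and recall separately helps more than a single accuracy figure.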

In [94]:
slope = model.coef_
slope

array([[0.]])

In [95]:
intercept = model.intercept_
intercept

array([-0.69604499])
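
For `MultinomialNB`, these two attributes are not a slope and intercept in the regression sense. The value -0.696 is close to log(0.5) ≈ -0.693, which is what a log class prior would look like for nearly balanced classes. As far as I know, `coef_` and `intercept_` were deprecated for Naive Bayes models in scikit-learn 0.24 and are absent from recent releases; treat that version detail as an assumption. The documented attributes are `class_log_prior_` and `feature_log_prob_`, shown in this sketch on invented count data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented count features: 4 samples, 2 features, balanced classes.
X = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
y = np.array([0, 0, 1, 1])

nb = MultinomialNB().fit(X, y)

# P(class), estimated from class frequencies in the training labels.
class_priors = np.exp(nb.class_log_prior_)

# P(feature | class) with Laplace smoothing; one row per class.
feature_probs = np.exp(nb.feature_log_prob_)

print(class_priors)
print(feature_probs)
```

Each row of `feature_probs` sums to 1, since it is a probability distribution over features for one class.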

## 15. Result summary

In [96]:
import statsmodels.api as sm

x_train_Sm = sm.add_constant(x_train)

# The model is fit on x_train (no constant term), hence the uncentered R-squared below
ls = sm.OLS(y_train, x_train).fit()
print(ls.summary())

                                 OLS Regression Results                                
Dep. Variable:                  label   R-squared (uncentered):                   0.374
Model:                            OLS   Adj. R-squared (uncentered):              0.374
Method:                 Least Squares   F-statistic:                              3505.
Date:                Sun, 07 May 2023   Prob (F-statistic):                        0.00
Time:                        12:43:04   Log-Likelihood:                         -4917.1
No. Observations:                5875   AIC:                                      9836.
Df Residuals:                    5874   BIC:                                      9843.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------