<a href="https://colab.research.google.com/github/ravi-prakash1907/A-tracking-of-COVID-19/blob/master/Spam%20Filtering/Spam_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam E-Mail Detection

### Required Libraries

In [64]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
import nltk
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

## for uploading the data file in Colab
from google.colab import files

#### Loading Data

In [7]:
uploaded = files.upload()

Saving emails.csv to emails.csv


In [32]:
# reading the csv
df = pd.read_csv('emails.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


In [33]:
df.drop(columns=['Unnamed: 0','label'], inplace=True)
df.head()

Unnamed: 0,text,label_num
0,Subject: enron methanol ; meter # : 988291\r\n...,0
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,"Subject: photoshop , windows , office . cheap ...",1
4,Subject: re : indian springs\r\nthis deal is t...,0


In [34]:
df.rename(columns={'label_num':'spam'}, inplace=True)
df.head()

Unnamed: 0,text,spam
0,Subject: enron methanol ; meter # : 988291\r\n...,0
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,"Subject: photoshop , windows , office . cheap ...",1
4,Subject: re : indian springs\r\nthis deal is t...,0


In [35]:
print("Shape: ",df.shape)
print("\nColumn Names: ",df.columns)
print("\nCount: ",df.count())

Shape:  (5171, 2)

Column Names:  Index(['text', 'spam'], dtype='object')

Count:  text    5171
spam    5171
dtype: int64


In [37]:
## Removing the duplicate entries
df.drop_duplicates(inplace=True)
print("Shape: ",df.shape)

Shape:  (4993, 2)


In [39]:
## Checking for the Missing values
df.isnull().sum()

text    0
spam    0
dtype: int64

_**All counts are zero, so the dataset has no missing values.**_

In [41]:
# download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

#### Cleaning the E-Mails

In [45]:
def processText(text):
  # remove punctuation characters
  nopunc = [char for char in text if char not in string.punctuation]
  nopunc = ''.join(nopunc)

  # load the stop-word list once as a set (fast lookups) instead of once per word
  stopWords = set(stopwords.words('english'))

  # remove stopwords (case-insensitive comparison)
  cleanWords = [word for word in nopunc.split() if word.lower() not in stopWords]

  # return a list of clean text words
  return cleanWords

In [46]:
# show a list of tokens
df['text'].head().apply(processText)

0    [Subject, enron, methanol, meter, 988291, foll...
1    [Subject, hpl, nom, january, 9, 2001, see, att...
2    [Subject, neon, retreat, ho, ho, ho, around, w...
3    [Subject, photoshop, windows, office, cheap, m...
4    [Subject, indian, springs, deal, book, teco, p...
Name: text, dtype: object

_Each email is split into **tokens**, i.e. a list of its words with punctuation and stopwords removed._
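
The cleaning step can be tried without downloading the NLTK corpus. The following is a minimal sketch, not the notebook's actual function: `process_text_sketch`, `TOY_STOPWORDS`, and the sample sentence are hypothetical and exist only for illustration, and the tiny stop-word set stands in for `stopwords.words('english')`.

```python
import string

# Hypothetical, deliberately tiny stop-word set (assumption: it stands in
# for NLTK's much longer English list used by processText).
TOY_STOPWORDS = {"the", "is", "a", "to", "for", "we", "re", "this"}

def process_text_sketch(text, stop_words=TOY_STOPWORDS):
    # Drop every punctuation character, keeping letters, digits and whitespace.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Split on whitespace and keep words that are not stop words.
    return [w for w in no_punct.split() if w.lower() not in stop_words]

# Made-up sample text, loosely shaped like the emails in the dataset.
sample = "Subject: re : indian springs\r\nthis deal is to book"
print(process_text_sketch(sample))
# ['Subject', 'indian', 'springs', 'deal', 'book']
```

Note that the comparison is done on the lower-cased word while the returned token keeps its original case, which is why `Subject` stays capitalised.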

#### Text-to-Matrix Transformation


##### Example:  
_How does the Text-to-Matrix Transformation Work?_

In [48]:
message4 = 'hello world hello world hello world play'
message5 = 'test test test test one hello test'
print(message4)
print()

# convert text to a matrix (pass a flat list of strings, one per document)
bow4  = CountVectorizer(analyzer=processText).fit_transform([message4, message5])
print(bow4)

hello world hello world hello world play

  (0, 0)	3
  (0, 4)	3
  (0, 2)	1
  (1, 0)	1
  (1, 3)	5
  (1, 1)	1


In [51]:
print("Shape: ",bow4.shape)
print("\nRows in shape ({}) specify the number of unique inputs.".format(bow4.shape[0]))
print("Columns in shape ({}) specify the number of total unique words.".format(bow4.shape[1]))

Shape:  (2, 5)

Rows in shape (2) specify the number of unique inputs.
Columns in shape (5) specify the number of total unique words.
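
To make the sparse output above easier to read, here is a minimal pure-Python sketch of the same bag-of-words counting. It is an illustration of the idea, not how `CountVectorizer` is implemented, and it assumes plain whitespace tokenisation with no stop-word removal (none of the words in these two messages were removed in the output above). The names `messages`, `vocabulary`, and `matrix` are chosen for this example.

```python
from collections import Counter

messages = [
    "hello world hello world hello world play",
    "test test test test one hello test",
]

# Tokenise each message on whitespace.
tokenized = [m.split() for m in messages]

# Column order: the sorted set of all distinct tokens.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per message, one column per vocabulary word, cell = word count.
matrix = [[Counter(doc)[tok] for tok in vocabulary] for doc in tokenized]

print(vocabulary)   # ['hello', 'one', 'play', 'test', 'world']
for row in matrix:
    print(row)
# [3, 0, 1, 0, 3]
# [1, 1, 0, 5, 0]
```

Each `(row, column)  count` line in the sparse printout corresponds to one non-zero cell here, for example `(1, 3)  5` is the five occurrences of `test` in the second message.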


##### Transformation

In [52]:
## matrix creation
message_bow = CountVectorizer(analyzer=processText).fit_transform(df['text'])

#### Splitting

In [54]:
X_train, X_test, y_train, y_test = train_test_split(message_bow, df['spam'], test_size = 0.3, random_state = 0)

In [55]:
message_bow.shape

(4993, 50381)

In [56]:
type(message_bow)

scipy.sparse.csr.csr_matrix

#### Model Building  
Using the _Naive Bayes classifier_ (multinomial)

In [59]:
## Training
classifier = MultinomialNB().fit(X_train,y_train)
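
For intuition about what the multinomial Naive Bayes model learns from word counts, the sketch below implements the textbook version with additive (Laplace) smoothing, `alpha=1.0`, which is also the default of `MultinomialNB`. It is a simplified illustration rather than scikit-learn's implementation, and the function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: one class id per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        # log P(class): fraction of training documents in this class
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        # log P(word | class) with additive smoothing
        log_likelihood[c] = {
            tok: math.log((counts[tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Made-up toy data: 1 = spam, 0 = ham.
toy_docs = [
    ["cheap", "software", "offer"],
    ["cheap", "pills", "offer", "offer"],
    ["meeting", "schedule", "gas"],
    ["gas", "nomination", "schedule"],
]
toy_labels = [1, 1, 0, 0]

model = train_multinomial_nb(toy_docs, toy_labels)
print(predict_nb(model, ["cheap", "offer"]))               # 1
print(predict_nb(model, ["gas", "schedule", "meeting"]))   # 0
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters with a vocabulary of about 50,000 words as in this dataset.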

#### Prediction

In [61]:
## predictions
print(classifier.predict(X_train))

## actuals
print(y_train.values)

[0 0 1 ... 0 0 1]
[0 0 1 ... 0 0 1]


In [72]:
## Testing (predictions on the held-out test data)
print(classifier.predict(X_test))

[0 0 0 ... 1 0 0]


#### Evaluation on **training data**

In [68]:
pred = classifier.predict(X_train)

## report
print(classification_report(y_train, y_pred=pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      2442
           1       0.99      0.99      0.99      1053

    accuracy                           0.99      3495
   macro avg       0.99      0.99      0.99      3495
weighted avg       0.99      0.99      0.99      3495



In [70]:
## confusion matrix
print("Confusion Matrix:\n\n", confusion_matrix(y_train, y_pred=pred))

Confusion Matrix:

 [[2428   14]
 [  11 1042]]


In [71]:
## accuracy
print("Model Accuracy:\n\n", accuracy_score(y_train, y_pred=pred))

Model Accuracy:

 0.9928469241773963


#### Checking performance on **test data**

In [75]:
## prediction
pred = classifier.predict(X_test)
print("Predictions: ",pred)

## report
print(classification_report(y_test, y_pred=pred))

## confusion matrix
print("\nConfusion Matrix:\n\n", confusion_matrix(y_test, y_pred=pred))

## accuracy
print("\nModel Accuracy:\n\n", accuracy_score(y_test, y_pred=pred))

Predictions:  [0 0 0 ... 1 0 0]
              precision    recall  f1-score   support

           0       0.99      0.98      0.98      1089
           1       0.95      0.97      0.96       409

    accuracy                           0.98      1498
   macro avg       0.97      0.97      0.97      1498
weighted avg       0.98      0.98      0.98      1498


Confusion Matrix:

 [[1067   22]
 [  14  395]]

Model Accuracy:

 0.9759679572763685


So the model reaches an accuracy of ~**97.6%** on the held-out test data, compared with ~99.3% on the training data.
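
As a cross-check, the reported test metrics can be recomputed by hand from the test confusion matrix printed above. The sketch assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels, with class `1` (spam) treated as the positive class; the variable names are chosen for this example.

```python
# Test-set confusion matrix copied from the output above:
# [[1067   22]
#  [  14  395]]
tn, fp = 1067, 22   # true ham:  predicted ham / predicted spam
fn, tp = 14, 395    # true spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))    # 0.976
print(round(precision, 2))   # 0.95
print(round(recall, 2))      # 0.97
print(round(f1, 2))          # 0.96
```

The 22 false positives are legitimate emails flagged as spam, which in a real mail filter is usually the more costly kind of error, so precision on the spam class is worth watching alongside overall accuracy.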