# SPAM Email Detection
### Objective: Create a program to detect whether an email is spam (1) or not spam (0).



Spam email, also called junk email, consists of unsolicited messages sent in bulk by email.
The name comes from Spam luncheon meat, by way of a Monty Python sketch in which Spam is everywhere and unavoidable.

## 1. Import Dependencies

In [0]:
# import libraries
import numpy as np
import pandas as pd
import nltk  # natural language toolkit
from nltk.corpus import stopwords
import string

In [0]:
# load dataset
from google.colab import files
uploaded = files.upload()

In [0]:
# read csv file
df = pd.read_csv("emails (1).csv")

## 2. Explore Dataset

In [0]:
# display top 5 rows
df.head()

|   | text | spam |
|---|------|------|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


In [0]:
# get dimensions of dataset
print("Rows    : ", df.shape[0])
print("Columns : ", df.shape[1])

Rows    :  5728
Columns :  2


In [0]:
# summary statistics of the numeric column
df.describe()

|       | spam     |
|-------|----------|
| count | 5728.0   |
| mean  | 0.238827 |
| std   | 0.426404 |
| min   | 0.0      |
| 25%   | 0.0      |
| 50%   | 0.0      |
| 75%   | 0.0      |
| max   | 1.0      |
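
Because `spam` is a 0/1 column, its mean is the fraction of rows labelled spam: 0.238827 of 5728 rows is roughly 1,368 spam emails. The sketch below shows the same idea on a small made-up label series (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# illustrative labels only: 1 = spam, 0 = not spam
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="spam")

# for a 0/1 column the mean equals the share of rows labelled 1
spam_fraction = labels.mean()

# share of rows per class, indexed by the label value
class_share = labels.value_counts(normalize=True)

print(spam_fraction)
print(class_share)
```

A class share this uneven is worth keeping in mind when reading accuracy figures, since a model that always predicts 0 would already be right on most rows.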


In [0]:
# meta data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
text    5728 non-null object
spam    5728 non-null int64
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [0]:
# features
print(df.columns)

Index(['text', 'spam'], dtype='object')


In [0]:
# remove duplicate records
df.drop_duplicates(inplace=True)

In [0]:
# get dimensions of dataset
print("Rows    : ", df.shape[0])
print("Columns : ", df.shape[1])

Rows    :  5695
Columns :  2


In [0]:
print("# duplicate rows deleted : ", 5728-5695)

# duplicate rows deleted :  33
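
The count above is typed in by hand from the two printed shapes. A minimal sketch of computing it directly is shown below, using a small made-up DataFrame (the rows and the name `toy` are illustrative assumptions, not part of the dataset).

```python
import pandas as pd

# toy data standing in for the email dataset (illustrative rows only)
toy = pd.DataFrame({
    "text": ["Subject: hello", "Subject: win cash", "Subject: hello", "Subject: win cash"],
    "spam": [0, 1, 0, 1],
})

rows_before = len(toy)

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
rows_after = len(deduped)

print("# duplicate rows deleted : ", rows_before - rows_after)
```

Recording the row count before the drop keeps the reported number correct if the input file changes.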


In [0]:
# find missing data
df.isnull().sum()  # no null values present in the dataset

text    0
spam    0
dtype: int64

## 3. Text Processing

In [0]:
# download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
# UDF for text processing
def process_text(text):
    # remove punctuation from text
    no_punc = ''.join(char for char in text if char not in string.punctuation)

    # remove stopwords (load the stopword list once per call, as a set for fast lookups)
    stop_words = set(stopwords.words('english'))
    clean_words = [word for word in no_punc.split() if word.lower() not in stop_words]

    # return a list of clean text words
    return clean_words
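
To see what this function does to a single message without downloading the NLTK data, here is a small sketch of the same two steps. The name `process_text_demo` and the tiny `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses `stopwords.words('english')`.

```python
import string

# tiny stand-in stopword set (assumption); the notebook uses stopwords.words('english')
STOP_WORDS = {"the", "is", "a", "to", "you", "your", "of"}

def process_text_demo(text, stop_words=STOP_WORDS):
    # drop punctuation characters, then split on whitespace
    no_punc = ''.join(ch for ch in text if ch not in string.punctuation)
    # keep words whose lowercase form is not a stopword
    return [w for w in no_punc.split() if w.lower() not in stop_words]

tokens = process_text_demo("Subject: claim your FREE prize, today!")
print(tokens)
```

Note that the original casing is preserved in the returned words, so `FREE` and `free` would end up as two different tokens.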

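#### Tokenization
- **Tokenizing** means splitting text into minimal meaningful units (tokens). It is usually the first step before further text processing.
- A **lemma** (linguistics) is the dictionary form of a word, the form that heads its dictionary entry. `process_text` does not lemmatize: it returns the words as they appear in the email, minus punctuation and stopwords.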
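
The difference between a token and a lemma can be sketched as follows. The lookup table `LEMMA_TABLE` and both helper functions are hypothetical and exist only for illustration; a real lemmatizer relies on a full dictionary rather than four entries.

```python
# hypothetical lookup table mapping inflected forms to lemmas (illustration only)
LEMMA_TABLE = {"logos": "logo", "made": "make", "ordered": "order", "products": "product"}

def tokenize(text):
    # whitespace tokenization: each run of non-space characters is one token
    return text.split()

def lemmatize(tokens, table=LEMMA_TABLE):
    # replace a token by its lemma when the table knows it, otherwise keep it
    return [table.get(tok.lower(), tok) for tok in tokens]

tokens = tokenize("hand made original logos")
lemmas = lemmatize(tokens)

print(tokens)
print(lemmas)
```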


In [0]:
# tokenization - a list of tokens per email (tokens are not lemmatized)
df['text'].head().apply(process_text)

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object

In [0]:
# sample check - display all words in one record
df['text'].head().apply(process_text)[0]

['Subject',
 'naturally',
 'irresistible',
 'corporate',
 'identity',
 'lt',
 'really',
 'hard',
 'recollect',
 'company',
 'market',
 'full',
 'suqgestions',
 'information',
 'isoverwhelminq',
 'good',
 'catchy',
 'logo',
 'stylish',
 'statlonery',
 'outstanding',
 'website',
 'make',
 'task',
 'much',
 'easier',
 'promise',
 'havinq',
 'ordered',
 'iogo',
 'company',
 'automaticaily',
 'become',
 'world',
 'ieader',
 'isguite',
 'ciear',
 'without',
 'good',
 'products',
 'effective',
 'business',
 'organization',
 'practicable',
 'aim',
 'hotat',
 'nowadays',
 'market',
 'promise',
 'marketing',
 'efforts',
 'become',
 'much',
 'effective',
 'list',
 'clear',
 'benefits',
 'creativeness',
 'hand',
 'made',
 'original',
 'logos',
 'specially',
 'done',
 'reflect',
 'distinctive',
 'company',
 'image',
 'convenience',
 'logo',
 'stationery',
 'provided',
 'formats',
 'easy',
 'use',
 'content',
 'management',
 'system',
 'letsyou',
 'change',
 'website',
 'content',
 'even',
 'structu

#### Bag of Words

In [0]:
# convert the text into a matrix of token counts

# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# bag of words - sparse matrix with one row per email and one column per unique token;
# each cell holds how many times that token occurs in that email
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(df['text'])
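
To make the shape of the bag-of-words matrix concrete, here is a minimal sketch on a three-document toy corpus. The documents and the `simple_analyzer` function are assumptions for illustration; in the notebook the analyzer is `process_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus (illustrative only)
docs = ["free money free offer", "meeting agenda for monday", "free meeting"]

def simple_analyzer(text):
    # stand-in for process_text: plain whitespace split
    return text.split()

vectorizer = CountVectorizer(analyzer=simple_analyzer)
bow = vectorizer.fit_transform(docs)  # sparse matrix: documents x unique tokens

# vocabulary_ maps each token to its column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(bow.toarray())
```

Each row is one document and each column one token, so the 37229 columns reported for `messages_bow` correspond to 37229 distinct tokens across all emails.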

## 4. Train Test Split
- Split the dataset into training and test datasets

In [0]:
# import train_test_split
from sklearn.model_selection import train_test_split

# split dataset
X_train, X_test, y_train, y_test = train_test_split(
    messages_bow, df['spam'], train_size=0.8, test_size=0.2, random_state=0)

In [0]:
# get dimensions of messages_bow dataset
print("Rows    : ", messages_bow.shape[0])
print("Columns : ", messages_bow.shape[1])

Rows    :  5695
Columns :  37229


## 5. Naive Bayes Classifier
- create and train a Multinomial Naive Bayes classifier for prediction
- well suited to text classification with discrete features such as word counts
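
The idea behind the multinomial model can be sketched in plain Python: score each class by its log prior plus the log likelihood of every token, with add-one (Laplace) smoothing so unseen class/token pairs do not zero out the product. This is a simplified illustration on made-up data, not scikit-learn's implementation; `alpha=1.0` mirrors the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter

# toy training data (illustrative): token lists with labels 1 = spam, 0 = not spam
train = [
    (["free", "money", "now"], 1),
    (["free", "offer"], 1),
    (["meeting", "agenda"], 0),
    (["project", "meeting", "now"], 0),
]

vocab = {tok for tokens, _ in train for tok in tokens}
class_docs = Counter(label for _, label in train)   # number of documents per class
word_counts = {c: Counter() for c in class_docs}    # token counts per class
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, c, alpha=1.0):
    # log P(c) + sum of log P(token | c), with Laplace smoothing
    total = sum(word_counts[c].values())
    score = math.log(class_docs[c] / len(train))
    for tok in tokens:
        if tok in vocab:  # tokens never seen in training are ignored
            score += math.log((word_counts[c][tok] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # pick the class with the highest (unnormalized) log posterior
    return max(class_docs, key=lambda c: log_posterior(tokens, c))

print(predict(["free", "money"]))
print(predict(["meeting", "agenda"]))
```

Working in log space avoids numerical underflow when an email contains many tokens.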

#### Model Training

In [0]:
# import Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# initialize MultinomialNB classifier and train model on dataset
classifier = MultinomialNB().fit(X_train, y_train)
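
In this notebook the vectorizer is fitted on all emails before the split, so words that occur only in test emails still enter the vocabulary. One possible variant, sketched below on a made-up corpus, splits the raw text first and wraps both steps in a scikit-learn `Pipeline`. The corpus and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy corpus (illustrative only): 1 = spam, 0 = not spam
texts = ["free money now", "win free prize", "cheap offer free", "claim prize now",
         "meeting at noon", "project status report", "lunch meeting today", "status of report"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# split the raw text first, so the vocabulary is learned from training emails only
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

model = Pipeline([
    ("bow", CountVectorizer()),  # default word analyzer; process_text could be passed as analyzer
    ("nb", MultinomialNB()),
])
model.fit(text_train, label_train)

# the pipeline vectorizes new text with the training vocabulary before predicting
test_pred = model.predict(text_test)
print(test_pred)
```

With a pipeline, new emails can be passed to `predict` as raw strings, because the fitted vectorizer is applied automatically.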

## 6. Model Evaluation

### Train Dataset

#### Predictions

In [0]:
# print predictions
print(classifier.predict(X_train))
# print actual target values
print(y_train.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]


#### Model Evaluation Metrics

In [0]:
# evaluate the model on train dataset
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_train)
print("Classification Report : \n")
print(classification_report(y_train, pred))
print('\nConfusion Matrix : \n')
print(confusion_matrix(y_train, pred))
print("\nAccuracy : ", accuracy_score(y_train,pred))

Classification Report : 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix : 

[[3445   12]
 [   1 1098]]

Accuracy :  0.9971466198419666


### Test Dataset

#### Predictions

In [0]:
# print predictions
print(classifier.predict(X_test))
# print actual target values
print(y_test.values)

[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]


#### Model Evaluation Metrics

In [0]:
# evaluate the model on test dataset
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(X_test)
print("Classification Report : \n")
print(classification_report(y_test, pred))
print('\nConfusion Matrix : \n')
print(confusion_matrix(y_test, pred))
print("\nAccuracy : ", accuracy_score(y_test,pred))

Classification Report : 

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix : 

[[862   8]
 [  1 268]]

Accuracy :  0.9920983318700615
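
The figures in the report can be reproduced by hand from the test confusion matrix, whose rows are actual classes and whose columns are predicted classes. The sketch below does the arithmetic for the spam class (label 1); the variable names are chosen here for illustration.

```python
# test confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 862, 8    # actual not-spam: correctly kept, wrongly flagged as spam
fn, tp = 1, 268    # actual spam: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of emails flagged as spam, the share that really is spam
recall = tp / (tp + fn)      # of actual spam emails, the share that was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

Read this way, the 8 false positives are legitimate emails that would land in the spam folder, while only 1 spam email slips through.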
