# **Spam Detection**
By Ralph Cajipe

This notebook shows how to detect whether an email is spam or not using a Gradient Boosting Classifier model in Python. The code uses the scikit-learn implementation.

The code in this notebook is inspired by the following work:


*   Build a Spam Classifier from [Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) by Aurélien Géron
*   [Spam Detection](https://towardsdatascience.com/3-super-simple-projects-to-learn-natural-language-processing-using-python-8ef74c757cd9#:~:text=Project%202%3A%20Spam%20Detection) by Eric Kleppen



Of more than 300 billion emails sent every day, at least half are spam. Email providers have the huge task of filtering  out the spam and making sure their users receive the messages that matter. 
Spam detection is messy. The line between spam and non-spam messages is fuzzy, and the criteria change over time. Of the various efforts to automate spam detection, machine learning has so far proven to be the most effective and is the approach favored by email providers. Although we still see spammy emails, a quick look at the junk folder shows how much spam gets weeded out of our inboxes every day thanks to machine learning algorithms. 

Spam detection is a **binary classification** problem, since an email is either **spam (1)** or **not spam (0)**. The following steps will help you build a machine learning model that can identify whether or not an email is spam. You will use the Python library Scikit-Learn to explore tokenization, vectorization, and statistical classification algorithms. 


# **Part I**

## Data Preprocessing

In [None]:
import pandas as pd
import re  # the standard-library module covers every pattern used in this notebook

In [None]:
df = pd.read_excel('/content/emails.xlsx')
df.head()

Unnamed: 0,text,spam,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109
0,Subject: naturally irresistible your corporate...,1,,,,,,,,,...,,,,,,,,,,
1,Subject: the stock trading gunslinger fanny i...,1,,,,,,,,,...,,,,,,,,,,
2,Subject: unbelievable new homes made easy im ...,1,,,,,,,,,...,,,,,,,,,,
3,Subject: 4 color printing special request add...,1,,,,,,,,,...,,,,,,,,,,
4,"Subject: do not have money , get software cds ...",1,,,,,,,,,...,,,,,,,,,,


In [None]:
df.shape

(5732, 110)

## Keep a copy of the original dataframe for backup!

In [None]:
df_original = df.copy()

## Check for non-integer values and missing values in the 'spam' column:

In [None]:
# Check for non-integer values
non_integer_rows = df[df['spam'].apply(lambda x: type(x) != int)]
print("Rows with non-integer values in 'spam' column:")
print(non_integer_rows)

# Check for missing values
missing_value_rows = df[df['spam'].isna()]
print("Rows with missing values in 'spam' column:")
print(missing_value_rows)


Rows with non-integer values in 'spam' column:
                                                   text  \
1380  Subject: from the enron india newsdesk - april...   
1381                                                NaN   
1382  e dpc contributed only 0 . 7 per  cent of the ...   
2652  Subject: from the enron india newsdesk - april...   
2653                                                NaN   
2654  lf against undeserved claims in the event of e...   

                                                   spam  \
1380                                                NaN   
1381                                                NaN   
1382   its termination would not  have such a phenom...   
2652                                                NaN   
2653                                                NaN   
2654                                  mr suresh prabhu    

                                             Unnamed: 2  \
1380                                                NaN   
1381   

## Once you've identified the problem rows, you can drop them

In [None]:
# Drop rows with non-integer values
df.drop(non_integer_rows.index, inplace=True)

# Drop rows with missing values
df.dropna(subset=['spam'], inplace=True)
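
A possible alternative to the type check above is to let pandas coerce the labels and then keep only the rows whose label is 0 or 1. The sketch below is an illustration on a made-up toy frame (the rows and the names `toy`, `labels`, and `toy_clean` are assumptions, not part of the notebook's data); it uses `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into `NaN`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "text": ["Subject: win cash", "Subject: meeting notes", "broken row", None],
    "spam": [1, 0, "mr suresh prabhu", None],
})

# Coerce labels to numbers; anything that is not numeric becomes NaN
labels = pd.to_numeric(toy["spam"], errors="coerce")

# Keep only rows whose label is exactly 0 or 1
valid = labels.isin([0, 1])
toy_clean = toy.loc[valid].assign(spam=labels[valid].astype(int))

print(toy_clean)
```

This handles stray text and missing values in one pass, and it does not depend on whether pandas stored the labels as Python `int` objects or NumPy integers.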


## Exploratory Analysis

In [None]:
print("spam count: " +str(len(df.loc[df.spam==1])))
print("not spam count: " +str(len(df.loc[df.spam==0])))
print(df.shape)
df['spam'] = df['spam'].astype(int)

df = df.drop_duplicates()
print(df.shape)

df = df.reset_index(inplace = False)[['text','spam']]

spam count: 1368
not spam count: 4358
(5726, 110)
(5693, 110)


In [None]:
df.shape

(5693, 2)

In [None]:
df['spam'].unique()

array([1, 0])

In [None]:
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [None]:
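clean_desc = []
for w in range(len(df.text)):
    desc = df['text'][w].lower()
    
    # Replace every non-letter character (punctuation, digits, symbols) with a space
    desc = re.sub('[^a-zA-Z]', ' ', desc)
    
    # Remove tags. Note: the step above has already replaced '<' and '>' with spaces,
    # so this pattern finds nothing here; to really strip tags it would have to run first.
    desc = re.sub("</?.*?>", " <> ", desc)
    
    # Collapse any remaining digits, special chars and runs of whitespace into one space
    desc = re.sub("(\\d|\\W)+", " ", desc)
    
    clean_desc.append(desc)
    
# Assign the cleaned descriptions to the data frame
df['text'] = clean_desc
# reset_index() keeps the old index as an extra 'index' column (visible in the output)
df = df.reset_index()        
df.head(3)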

Unnamed: 0,index,text,spam
0,0,subject naturally irresistible your corporate ...,1
1,1,subject the stock trading gunslinger fanny is ...,1
2,2,subject unbelievable new homes made easy im wa...,1
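
The cleaning steps can also be packaged as a small reusable function. The sketch below is a variant, not the notebook's exact loop: the function name `clean_text` and the sample string are hypothetical, and tag-like spans are removed before the other non-letter characters so that the tag pattern has something to match.

```python
import re


def clean_text(raw: str) -> str:
    """Lowercase the text, drop tag-like spans, and keep only letters."""
    text = raw.lower()
    text = re.sub(r"<[^>]*>", " ", text)   # strip anything between < and > first
    text = re.sub(r"[^a-z]+", " ", text)   # collapse runs of non-letters into one space
    return text.strip()


print(clean_text("Subject: <b>FREE</b> software, 50% off!!!"))
# prints: subject free software off
```

Applied to a dataframe column, this would be a one-liner such as `df['text'].map(clean_text)`. Because it strips tags and trims whitespace, its output can differ slightly from the loop above.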


## TL;DR ("too long; didn't read") - Part I
The first part of this notebook prepares the data for a spam detection model. It loads an Excel file of emails, finds the rows whose 'spam' value is missing or not an integer, and drops them together with duplicate rows. It then lowercases the text of the emails and replaces punctuation, digits, and special characters with spaces. The final output is a cleaned dataframe with the columns 'text' and 'spam', which can be used for training the model.

# **Part II**

## Import Dependencies for Learning

Import the **Scikit-Learn** functionality needed to transform and model the data. Use `CountVectorizer`, `train_test_split`,  `ensemble` models, and some metrics. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import train_test_split 
from sklearn import ensemble  
from sklearn.metrics import classification_report, accuracy_score 


## Transforming Text to Numbers
A demo (not connected to our actual DataFrame!)

**Tokenization** is the process of breaking down a sentence into individual words. The individual words are called tokens. Using Scikit-Learn’s `CountVectorizer()`, it is easy to transform a body of text into a sparse matrix of numbers that can be passed to machine learning algorithms. To simplify the concept of count vectorization, imagine you have two sentences: 

<br>

> The dog is white <br>
> The cat is black 


Converting the sentences to a vector space model first collects the distinct words from all sentences into a vocabulary, and then represents each sentence by how many times each vocabulary word appears in it. 

> The dog cat is white black <br>

> The dog is white = [1,1,0,1,1,0] <br>

> The cat is black = [1,0,1,1,0,1] <br>
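
The two vectors above can be reproduced by hand in a few lines of plain Python. This is only an illustrative sketch of the counting idea, not how `CountVectorizer` is implemented; the names `vocab` and `to_vector` are hypothetical, and the vocabulary is written out in the same column order as above.

```python
from collections import Counter

# Vocabulary in the column order used in the example above
vocab = ["the", "dog", "cat", "is", "white", "black"]


def to_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]


print(to_vector("The dog is white", vocab))  # prints: [1, 1, 0, 1, 1, 0]
print(to_vector("The cat is black", vocab))  # prints: [1, 0, 1, 1, 0, 1]
```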

`CountVectorizer` performs this transformation in code. Add a third sentence to show that it counts the tokens. 


In [None]:
# List of sentences 
text = ["the dog is white", "the cat is black", "the cat and the dog are friends"] 


In [None]:
# Instantiate the class 
cv = CountVectorizer() 


In [None]:
# Tokenize and build vocab 
cv.fit(text) 
print(cv.vocabulary_) 


{'the': 7, 'dog': 4, 'is': 6, 'white': 8, 'cat': 3, 'black': 2, 'and': 0, 'are': 1, 'friends': 5}


The vocabulary learned from the three sentences: each token is mapped to its column index in the count matrix. ☝

In [None]:
# Transform the text 
vector = cv.transform(text) 


In [None]:
print(vector.toarray())

[[0 0 0 0 1 0 1 1 1]
 [0 0 1 1 0 0 1 1 0]
 [1 1 0 1 1 1 0 2 0]]


Notice in the last vector☝, you can see a 2 since the word “the” appears twice. The **`CountVectorizer`** counts the tokens and produces a sparse matrix in which each sentence is represented by its word counts.

## Bag of Words Method 

Because the model doesn’t take word placement into account, and instead mixes the words up as if they were tiles in a Scrabble game, this is called the **bag of words** method. Create the sparse matrix, then split the data using Scikit-Learn’s **`train_test_split()`**. 

In [None]:
# Instantiate the class 
cv = CountVectorizer() 

In [None]:
# Tokenize and build vocab 
cv.fit(df['text']) 

CountVectorizer()

In [None]:
# Transform the text 
text_vec = cv.transform(df['text'])

In [None]:
'''
⚠️
Keep the fitted CountVectorizer (cv) from the cells above. New emails must 
be transformed with that same fitted object so that their columns line up 
with the vocabulary the model is trained on. 
The one-liner below would produce the same matrix, but it fits an anonymous 
vectorizer that is discarded immediately, leaving no fitted object to 
transform new emails with. 
So please, leave the line below commented out.
'''
#⚠️ text_vec = CountVectorizer().fit_transform(df['text'])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(text_vec, df['spam'], 
                                                    test_size = 0.45, 
                                                    random_state = 42, 
                                                    shuffle = True)

Notice that the sparse matrix **`text_vec`** is used as the features `X` and the **`df['spam']`** column as the labels `y`. The data is shuffled and 45% of it is held out as the test set. 
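
Since the dataset has roughly three times as many non-spam emails as spam emails, one option worth trying is a stratified split, which keeps the class ratio similar in both parts. The notebook itself does not do this; the sketch below is an optional variation on made-up toy arrays (`X_demo` and `y_demo` are assumptions) using the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 5 of class 1 and 15 of class 0 (made up for illustration)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 5 + [0] * 15)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo,
    test_size=0.45,
    random_state=42,
    shuffle=True,
    stratify=y_demo,  # keep the spam / not-spam ratio similar in both splits
)

print(Xd_train.shape, Xd_test.shape)
print("spam share in train:", yd_train.mean(), "| in test:", yd_test.mean())
```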

## Build The Classifier

It is highly recommended to experiment with several classifiers and determine which one works best for this scenario. In  this example, use the **`GradientBoostingClassifier()`** model from the Scikit-Learn Ensemble collection. 

In [None]:
classifier = ensemble.GradientBoostingClassifier(
    n_estimators = 100, # How many decision trees to build
    learning_rate = 0.5, # Learning rate
    max_depth = 6
)

Each algorithm has its own set of parameters you can tweak. Searching for the values that work best is called **hyper-parameter tuning**.
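
One common way to tune such parameters is a cross-validated grid search. The sketch below is an illustration only: it runs on synthetic data from `make_classification` rather than on the emails, and the grid values, `n_estimators=20`, and the names `X_demo`, `y_demo`, and `search` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem (not the email data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Candidate values to try for two of the classifier's parameters
param_grid = {
    "learning_rate": [0.1, 0.5],
    "max_depth": [2, 4],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation for every combination
    scoring="f1",  # rank combinations by F1 score on the positive class
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data the same pattern would apply to `X_train` and `y_train`, though each extra grid value multiplies the training time.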

## Generate Predictions on Unseen Data
Finally, **fit** the data, call `predict` and generate the classification report. Using `classification_report()`, it is  easy to build a text report showing the main classification metrics. 

In [None]:
classifier.fit(X_train, y_train)

GradientBoostingClassifier(learning_rate=0.5, max_depth=6)

In [None]:
predictions = classifier.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1926
           1       0.96      0.90      0.93       636

    accuracy                           0.97      2562
   macro avg       0.96      0.94      0.95      2562
weighted avg       0.97      0.97      0.97      2562



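**Notice our model achieved 97% accuracy on the test set. Not bad!** Recall for the spam class is lower, at 0.90, which means roughly one in ten spam emails in the test set was labeled as not spam.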
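
For reference, the precision, recall, and F1 columns of the report follow directly from the counts of true positives, false positives, and false negatives for a class. The sketch below computes them by hand; the function name and the counts passed to it are hypothetical and are not taken from the model's actual predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts for one class."""
    precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
    recall = tp / (tp + fn)     # of all real spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 spam caught, 5 false alarms, 10 spam missed
p, r, f1 = precision_recall_f1(tp=90, fp=5, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.95 recall=0.90 f1=0.92
```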

In [None]:
'''
Pick one email by its row position i in df_original and store its raw text 
in a variable called original_text. 
The raw text is passed to the transform() method of the already fitted 
CountVectorizer (cv), so it is encoded with the training vocabulary, and the 
resulting vector is given to classifier.predict(). 
The print statements display the text of the email and the model's prediction for it.
Note: df_original contains every row, so this email may also have been part of 
the training split. Treat this as a usage demo, not as an extra evaluation.
'''

# Generating predictions for a specific email
i = 5

# Store the original text of the email
original_text = df_original.iloc[i]['text']

# Transform the email using the fitted CountVectorizer
test_vec = cv.transform([original_text])

# Generate predictions
predictions = classifier.predict(test_vec)

# Print the text of the email
print("Text of email:")
print(original_text)

# Print the predictions
print("\nPrediction:")
if predictions[0] == 0:
    print("✅Not spam")
else:
    print("⚠️Spam")


Text of email:
Subject: great nnews  hello , welcome to medzonline sh groundsel op  we are pleased to introduce ourselves as one of the ieading online phar felicitation maceuticai shops .  helter v  shakedown r  a cosmopolitan l  l blister l  l bestow ag  ac tosher l  is coadjutor va  confidant um  andmanyother .  - sav inexpiable e over 75 %  - total confide leisure ntiaiity  - worldwide s polite hlpplng  - ov allusion er 5 miilion customers in 150 countries  have devitalize a nice day !

Prediction:
⚠️Spam
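
One way to avoid carrying the vectorizer and the classifier around separately is to bundle them in a scikit-learn `Pipeline`, so that raw text goes in and a label comes out. The sketch below is an assumption-laden illustration, not the notebook's method: the toy emails, their labels, `n_estimators=20`, and the helper `label_email` are all made up for the example.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: 1 = spam, 0 = not spam
toy_texts = [
    "win free cash now", "free prize claim now", "cheap meds free offer",
    "claim your free cash prize", "meeting notes attached", "see you at lunch",
    "project schedule for monday", "notes from the monday meeting",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together
spam_pipeline = make_pipeline(
    CountVectorizer(),
    GradientBoostingClassifier(n_estimators=20, random_state=42),
)
spam_pipeline.fit(toy_texts, toy_labels)


def label_email(raw_text, model=spam_pipeline):
    """Return a readable label for a single raw email string."""
    return "Spam" if model.predict([raw_text])[0] == 1 else "Not spam"


print(label_email("free prize waiting, claim your cash"))
```

With so few training examples the prediction itself is not meaningful; the point is only that the fitted vocabulary travels with the model.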


## TL;DR ("too long; didn't read") - Part II
This part of the notebook uses the Scikit-Learn library to build a spam detection model. The `CountVectorizer` class tokenizes the text of the emails and converts it into a sparse matrix of word counts, which is the input for the machine learning model. This is the bag of words method: it ignores the order of the words in the text and treats them as if they were tiles in a game of Scrabble. The `train_test_split()` function splits the data into training and testing sets. The `GradientBoostingClassifier()` model from the Scikit-Learn Ensemble collection is trained on the training data and then used to make predictions on the testing data. The predictions are evaluated using the `classification_report()` function, which generates a report with metrics such as accuracy, precision, recall, and F1 score. The model achieved **97% accuracy** in this example.