<a href="https://colab.research.google.com/github/vinal-2/Email-Spam-Detection---THM/blob/main/SPAM_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SPAM Detector**

# Step 0: Importing the required libraries
Before starting with data collection, we import the required libraries. The notebook environment used here (Google Colab) already has the libraries we need for Machine Learning installed. Here, we import two key libraries: NumPy and pandas. These libraries are already explained in detail in the previous task.

**NumPy:** NumPy (Numerical Python) is the fundamental package for numerical computations in Python.

**pandas:** pandas provides high-level data structures and methods designed to make data analysis fast and easy in Python. It's built on top of NumPy.

In [4]:
import numpy as np
import pandas as pd

# Step 1: Data Collection

**Data collection** is the process of gathering raw data from various sources to be used for Machine Learning. This data can originate from numerous sources, such as databases, text files, APIs, online repositories, sensors, surveys, web scraping, and many others.

Here, we use the pandas library to load the collected data from a CSV file. The dataset contains spam and ham (non-spam) emails.

In [9]:
from google.colab import files
uploaded = files.upload()  # opens Colab's file picker; select emails_dataset.csv
data = pd.read_csv("emails_dataset.csv")  # read the uploaded CSV into a DataFrame

Saving emails_dataset.csv to emails_dataset.csv


### Test/Check Dataset ##

Let's review the dataset we just imported. The `Classification` column contains the email label (spam or ham), and the `Message` column contains the email body, as shown below:

In [20]:
print(data.head(1))

  Classification                                          Message
0           spam  Congratulations !! You have won the Free ticket


In [11]:
df = pd.DataFrame(data)
print(df)

     Classification                                            Message
0              spam    Congratulations !! You have won the Free ticket
1              spam  Nah I don't think he goes to usf, he lives aro...
2               ham  As per your request 'Melle Melle (Oru Minnamin...
3              spam  WINNER!! As a valued network customer you have...
4               ham  Had your mobile 11 months or more? U R entitle...
...             ...                                                ...
4446            ham                                                NaN
4447            ham                                                NaN
4448           spam                                                NaN
4449            ham                                                NaN
4450            ham                                                NaN

[4451 rows x 2 columns]
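The printed frame above shows `NaN` in the `Message` column for the last rows, so it is worth counting missing values and checking the class balance before going further. The following is a minimal sketch on a hypothetical five-row frame named `toy` (its values are illustrative and not taken from `emails_dataset.csv`); the same two calls could be run on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "Classification": ["spam", "ham", "ham", "spam", "ham"],
    "Message": ["You have won a prize", "See you at lunch", np.nan, np.nan, "Call me later"],
})

missing_per_column = toy.isna().sum()                 # NaN count for each column
class_counts = toy["Classification"].value_counts()   # number of rows per label

print(missing_per_column)
print(class_counts)
```

A large number of missing messages or a strong imbalance between spam and ham are both worth knowing about before training, because they affect how the evaluation metrics should be read.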


# Step 2: Data Preprocessing

Data preprocessing refers to the techniques used to convert raw data into a clean, organised, understandable, and structured format suitable for Machine Learning. Given that raw data is often messy, inconsistent, and incomplete, preprocessing is an essential step to ensure that the data feeding into the ML models is relevant and of high quality.

There are several data preprocessing techniques, and each one processes the data in its own way.

### Utilizing CountVectorizer()
Machine Learning models understand numbers, not text. This means the text needs to be transformed into a numerical format. CountVectorizer, a class provided by the scikit-learn library in Python, achieves this by converting text into a token (word) count matrix. This matrix is what the Machine Learning model trains on and makes predictions from.

Here we use CountVectorizer to extract features from the text.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

We will now drop the rows with missing values and then use the CountVectorizer class to transform the `Message` column into a numeric token-count matrix, as shown below:

In [21]:
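# Drop rows with missing values; NaN messages cannot be vectorized.
# A fillna() call after this would be a no-op, since no NaN values remain.
df.dropna(inplace=True)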
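To make the effect of dropping missing rows concrete, here is a small sketch on a hypothetical three-row frame. The names `toy` and `cleaned` are illustrative, and the `reset_index` call is an optional extra that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing message.
toy = pd.DataFrame({
    "Classification": ["spam", "ham", "ham"],
    "Message": ["Free ticket", np.nan, "See you soon"],
})

# Keep only rows that have a message, then renumber the index from 0.
cleaned = toy.dropna(subset=["Message"]).reset_index(drop=True)

print(len(toy), "->", len(cleaned))
```

Dropping the row removes its label as well, so the features and labels stay aligned row by row.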

In [22]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Message'])
print(X)

  (0, 1364)	1
  (0, 5440)	1
  (0, 2356)	1
  (0, 5357)	1
  (0, 4794)	1
  (0, 2109)	1
  (0, 4850)	1
  (1, 3279)	1
  (1, 1687)	1
  (1, 4818)	1
  (1, 2363)	2
  (1, 2228)	1
  (1, 4885)	1
  (1, 5104)	1
  (1, 2920)	1
  (1, 719)	1
  (1, 2391)	1
  (1, 4831)	1
  (2, 4885)	1
  (2, 733)	2
  (2, 3627)	1
  (2, 5441)	3
  (2, 4024)	1
  (2, 3111)	2
  (2, 3512)	1
  :	:
  (2672, 2980)	1
  (2672, 2190)	1
  (2672, 3937)	1
  (2673, 4885)	1
  (2673, 4888)	1
  (2673, 3268)	1
  (2673, 2627)	1
  (2673, 3092)	1
  (2673, 2390)	1
  (2673, 1662)	1
  (2673, 3701)	3
  (2673, 5331)	1
  (2673, 923)	2
  (2673, 1385)	1
  (2673, 1973)	1
  (2673, 3348)	1
  (2673, 3905)	1
  (2674, 4419)	1
  (2674, 5465)	2
  (2674, 737)	1
  (2674, 4170)	1
  (2674, 1372)	1
  (2674, 1404)	1
  (2674, 3026)	1
  (2674, 1520)	1


# Step 3: Train/Test split dataset
It's important to test the model's performance on unseen data. By splitting the data, we can train our model on one subset and test its performance on another.

Here, the variable `X` contains the feature matrix produced by `CountVectorizer`, and `y` will hold the labels. We use `train_test_split` from scikit-learn to split them into training data and testing data, as shown below:

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
y = df['Classification']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

- **X**: The first argument to `train_test_split` is the feature matrix `X` which you obtained from the `CountVectorizer`. This matrix contains the token counts for each message in the dataset.
    
- **y**: The second argument is the labels for each instance in your dataset, which indicates whether a message is spam or ham.
    
- **test_size=0.2**: This argument specifies that 20% of the dataset should be kept as the test set and the rest (80%) should be used for training. It's a common practice to hold out a portion of the dataset for testing to evaluate the performance of the model on unseen data.

The call to `train_test_split` is where the actual splitting of data into training and test sets happens.

The function then returns four values:

- **X_train**: The subset of the features to be used for training.
- **X_test**: The subset of the features to be used for testing.
- **y_train**: The corresponding labels for the `X_train` set.
- **y_test**: The corresponding labels for the `X_test` set.
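The split above is random, so each run can give different subsets and different scores. As a hedged sketch, the example below uses two optional `train_test_split` parameters that the notebook does not set: `random_state` to make the split repeatable and `stratify` to keep the spam/ham ratio the same in both subsets. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with a 16:4 ham/spam imbalance.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array(["ham"] * 16 + ["spam"] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,     # 5 of the 20 samples go to the test set
    stratify=toy_y,     # keep the 4:1 ham/spam ratio in both subsets
    random_state=42,    # fixed seed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
print((y_tr == "spam").sum(), (y_te == "spam").sum())
```

Stratifying matters most when one class is rare, since a purely random split can leave very few examples of that class in the test set.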

# Step 4: Model Training using Naive Bayes

Naive Bayes is a statistical method that uses the probability of certain words appearing in spam and non-spam emails to determine whether a new email is spam or not.
### How Naive Bayes Classification Works

- Let's say we have a bunch of emails, some labelled as "spam" and others as "ham".
- The Naive Bayes algorithm learns from these emails. It looks at the words in each email and calculates how frequently each word appears in spam or ham emails. For instance, words like "free", "win", "offer", and "lottery" might appear more in spam emails.
- The Naive Bayes algorithm calculates the probability of the email being spam based on the words it contains.
- When the model is trained with Naive Bayes and gets a new email that says (for example) "Win a free toy now!", then it thinks:
    - "Win" often appears in spam, so this increases the chance of the email being spam.
    - "Free" is also common in spam, further increasing the spam probability.
    - "Toy" might be neutral, often appearing in both spam and ham.
    - After considering all the words, it calculates the overall probability of the email being spam and ham.

If the calculated probability of spam is higher than that of ham, the algorithm classifies the email as spam. Otherwise, it's classified as ham. Let's use Naive Bayes to train the model, as shown and explained below:

In [25]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train, y_train)


- **X_train**: This is the training data you want the model to learn from. It's the token counts for each message in the training dataset, obtained from the `CountVectorizer`.
- **y_train**: These are the correct labels (either "spam" or "ham") for each message in the `X_train` dataset.

This is where the actual training of the model happens. The fit method is used to train or "fit" the model on your training data.
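The word-by-word reasoning described in this step can be sketched by hand. The example below is a simplified illustration, not scikit-learn's implementation: it counts words per class on four hypothetical training messages, applies add-one (Laplace) smoothing, and compares the log scores of the two classes for "win a free toy now". All names and messages are made up for the illustration.

```python
import math
from collections import Counter

# Hypothetical labelled messages.
train = [
    ("spam", "win a free prize now"),
    ("spam", "free offer win cash"),
    ("ham", "lunch at noon"),
    ("ham", "new toy for the kids"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for label, text in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def log_score(text, label, alpha=1.0):
    """Log prior plus the sum of smoothed log word likelihoods for one class."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are skipped
        p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)
    return score

msg = "win a free toy now"
scores = {label: log_score(msg, label) for label in ("spam", "ham")}
predicted = max(scores, key=scores.get)

print(scores)
print(predicted)
```

"win", "free" and "now" push the spam score up, while "toy" only appears in ham; because the spam evidence outweighs it, the spam score ends up higher. Logarithms are used so that multiplying many small probabilities does not underflow.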


# Step 5: Model Evaluation

After training, it's essential to evaluate the model's performance on the test set to gauge its predictive power. This will give you metrics such as accuracy, precision, and recall.


In [26]:
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.89      0.95      0.92       474
        spam       0.16      0.08      0.11        61

    accuracy                           0.85       535
   macro avg       0.53      0.51      0.51       535
weighted avg       0.81      0.85      0.82       535



The classification_report function takes in the true labels (y_test) and the predicted labels (y_pred) and returns a text report showing the main classification metrics.

The report gives you insights into how well your model is performing for each class and overall, in terms of these metrics.

- **Precision**: It is the ratio of correctly predicted positive observations to the total predicted positives. The question it answers is: Of all the samples predicted as positive, how many were actually positive?
- **Recall (Sensitivity)**: It is the ratio of correctly predicted positive observations to all the actual positives. The question it answers is: Of all the actual positive samples, how many did we predict correctly?
- **F1-Score**: It's the harmonic mean of Precision and Recall. It is more informative than accuracy when there's an imbalance between classes, because it accounts for both false positives and false negatives.
- **Support**: It is the number of actual occurrences of the class in the specified dataset.
- **Accuracy**: It's the ratio of correctly predicted observations to the total observations.
- **Macro Avg**: The unweighted mean of the per-class scores, so each class counts equally regardless of its size.
- **Weighted Avg**: The mean of the per-class scores, weighted by each class's support.


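In the report above, the spam class has a precision of 0.16 and a recall of 0.08, so the model misses most spam even though the overall accuracy is 0.85. The accuracy figure is dominated by the much larger ham class (474 of the 535 test messages).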
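The metric definitions above can be checked by hand. The sketch below computes precision, recall and F1 for the spam class from raw counts on eight hypothetical labels, then compares them with scikit-learn's `precision_score`, `recall_score` and `f1_score`. The label lists are made up for the illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_hat = ["spam", "ham", "ham", "spam", "ham", "ham", "ham", "ham"]

pairs = list(zip(y_true, y_hat))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam correctly flagged
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham wrongly flagged as spam
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_hat, pos_label="spam"),
      recall_score(y_true, y_hat, pos_label="spam"),
      f1_score(y_true, y_hat, pos_label="spam"))
```

Here six of the eight predictions are correct (75% accuracy), yet only one of the three spam messages is caught, which is the kind of gap that per-class recall exposes.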

# Step 6: Testing the Model

Once satisfied with the model's performance, we can use it to classify new messages and determine if they are spam or ham.

In [27]:
message = vectorizer.transform(["Today's Offer! Claim ur £150 worth of discount vouchers! Text YES to 85023 now! SavaMob, member offers mobile! T Cs 08717898035. £3.00 Sub. 16 . Unsub reply X "])
prediction = clf.predict(message)
print("The email is :", prediction[0])

The email is : ham
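A single predicted label hides how confident the model was. As a hedged sketch, the example below trains a separate toy model on four hypothetical messages and uses `predict_proba` to print the probability of each class for a new message. It also wraps the vectorizer and classifier in a scikit-learn pipeline, which is a convenience the notebook does not use; `toy_model` and the messages are illustrative names and data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data, not taken from emails_dataset.csv.
toy_messages = [
    "win a free prize now",
    "free cash offer claim now",
    "see you at lunch",
    "are we meeting tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and then classifies it in one step.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_messages, toy_labels)

new_message = ["Claim your free prize now"]
probs = toy_model.predict_proba(new_message)[0]

for label, p in zip(toy_model.classes_, probs):
    print(f"{label}: {p:.3f}")
```

Looking at the probabilities for a misclassified message can help show whether the model was narrowly wrong or confidently wrong, which is useful when the prediction disagrees with what a human reader would expect.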
