<a href="https://colab.research.google.com/github/vishaljbind/CVIP-Projects-Phase-2/blob/main/Spam_Email_Filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CodersCave

# Spam Email Filter using NLP and Machine Learning Algorithm

### Task: Build a spam filter using NLP and machine learning to identify and filter out spam emails

- Develop a robust Spam Email Filter using Natural Language Processing (NLP) techniques and machine learning algorithms.
- The goal is to create an intelligent system capable of accurately classifying emails as either spam or legitimate (ham) based on their content and linguistic features.

## Importing necessary libraries

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

In [16]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import joblib

## Loading and Exploring the Dataset

In [17]:
df = pd.read_csv('emails.csv')
df.head()

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


In [18]:
df.shape

(5728, 2)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5728 entries, 0 to 5727
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5728 non-null   object
 1   spam    5728 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 89.6+ KB


In [20]:
df.groupby('spam').describe()

| spam | text count | text unique | text top | text freq |
|---|---|---|---|---|
| 0 | 4360 | 4327 | Subject: * special notification * aurora versi... | 2 |
| 1 | 1368 | 1368 | Subject: naturally irresistible your corporate... | 1 |
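The per-class counts in the output above show that the dataset is imbalanced. As a quick sanity check, the sketch below derives the class shares, the number of repeated ham texts, and the accuracy a trivial "always ham" classifier would reach. The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the df.groupby('spam').describe() output above.
ham_count, ham_unique = 4360, 4327
spam_count, spam_unique = 1368, 1368

total = ham_count + spam_count           # should match df.shape[0]
spam_share = spam_count / total          # fraction of emails labelled spam
majority_baseline = ham_count / total    # accuracy of always predicting ham
duplicate_ham = ham_count - ham_unique   # ham texts that repeat an earlier one

print(f"total emails: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"duplicate ham texts: {duplicate_ham}")
```

Any accuracy reported later is easier to interpret against that baseline of roughly 0.76.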


## Data Preprocessing

In [21]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [22]:
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

In [23]:
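def preprocess_text(text):
    # Tokenize, keep alphabetic tokens that are not stop words, and stem each one.
    words = word_tokenize(text)
    words = [ps.stem(word) for word in words if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(words)

# Store the cleaned version of each email next to the raw text.
df['processed_text'] = df['text'].apply(preprocess_text)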
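To make the filtering step easier to inspect in isolation, here is a dependency-free approximation of it. It is only a sketch: the regular expression is a rough stand-in for `word_tokenize`, `DEMO_STOP_WORDS` is a tiny made-up stop list rather than NLTK's English list, and stemming is left out entirely. `filter_tokens` is a hypothetical helper, not a function from the notebook.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "a", "to", "is", "your", "you", "in"}

def filter_tokens(text, stop_words=DEMO_STOP_WORDS):
    """Keep alphabetic tokens that are not stop words (no stemming here)."""
    # Split into runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercase explicitly; in the notebook the Porter stemmer does this.
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]

print(filter_tokens("Subject: claim your 2 FREE prizes in the next 24 hours !"))
# ['subject', 'claim', 'free', 'prizes', 'next', 'hours']
```

Numbers and punctuation are dropped by the `isalpha()` check, which is why `2`, `24`, `:` and `!` do not survive.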

In [24]:
df.sample(5)

| | text | spam | processed_text |
|---|---|---|---|
| 684 | Subject: selling travel in today ' s economy ... | 1 | subject sell travel today economi good morn si... |
| 2989 | Subject: re : weather and energy price data d... | 0 | subject weather energi price data dear dr kami... |
| 4080 | Subject: re : one more thing clayton , i agr... | 0 | subject one thing clayton agre would happen in... |
| 1157 | Subject: make your rivals envy lt is really h... | 1 | subject make rival envi lt realli hard recolle... |
| 3583 | Subject: re : spreadsheet for george posey vi... | 0 | subject spreadsheet georg posey vinc analysi f... |


## Training the Model

In [25]:
X = df['processed_text']
y = df['spam']

In [26]:
# Splitting the Dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [27]:
len(x_train), len(y_train)

(4582, 4582)

In [28]:
len(x_test), len(y_test)

(1146, 1146)
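The two sizes above follow directly from `test_size=0.2`: scikit-learn rounds the test share up and assigns the remaining rows to training. A small arithmetic check, using only numbers already shown in the notebook:

```python
import math

n_samples = 5728   # rows in the dataset (df.shape above)
test_size = 0.2    # fraction passed to train_test_split

# The test set size is rounded up; training gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4582 1146
```

The split here is purely random. If the spam ratio should be identical in both parts, `train_test_split` also accepts `stratify=y`, which the notebook does not use.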

In [29]:
# Building the Machine Learning Pipeline
model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])
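To show what the two pipeline stages compute, the sketch below reimplements the idea in plain Python: word counts as features, then multinomial Naive Bayes with Laplace smoothing. It is a simplified illustration, not scikit-learn's implementation. Tokenization is a bare `split()`, and `train_nb`, `predict_nb` and the toy corpus are all made up for this example.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit word-count multinomial Naive Bayes with Laplace smoothing."""
    vocab = {w for d in docs for w in d.lower().split()}
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    model = {}
    for c in class_counts:
        # Smoothed denominator: tokens seen in class c plus alpha per vocab word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_counts[c] / len(labels)),
            "log_lik": {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for c, params in model.items():
        scores[c] = params["log_prior"] + sum(
            params["log_lik"][w] for w in doc.lower().split() if w in params["log_lik"]
        )
    return max(scores, key=scores.get)

# Made-up toy corpus: 1 = spam, 0 = ham (same label convention as the dataset).
docs = ["win free prize now", "free money win", "meeting agenda attached", "lunch meeting tomorrow"]
labels = [1, 1, 0, 0]
toy_model = train_nb(docs, labels)

print(predict_nb(toy_model, "free prize"))          # 1
print(predict_nb(toy_model, "agenda for meeting"))  # 0
```

Working in log space avoids underflow when many small word probabilities are multiplied, and the `alpha` term keeps a word that never appeared in one class from forcing that class's probability to zero.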

In [30]:
model.fit(x_train, y_train)

In [31]:
model.score(x_test, y_test)

0.9860383944153578

## Evaluating the Model

In [32]:
y_test.head()

4445    0
4118    0
3893    0
4210    0
5603    0
Name: spam, dtype: int64

In [33]:
y_pred = model.predict(x_test)
y_pred

array([0, 0, 0, ..., 1, 0, 0])

In [34]:
print(f'Accuracy of the model: {accuracy_score(y_test, y_pred)}')

Accuracy of the model: 0.9860383944153578


In [35]:
print('Classification Report:\n')
print(classification_report(y_test, y_pred))

Classification Report:

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       856
           1       0.98      0.97      0.97       290

    accuracy                           0.99      1146
   macro avg       0.98      0.98      0.98      1146
weighted avg       0.99      0.99      0.99      1146
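The per-class columns in the report are all derived from counts of true positives, false positives and false negatives. The sketch below shows those formulas and also recovers the number of misclassified test emails from the accuracy printed earlier. The `tp`, `fp` and `fn` values are hypothetical and chosen only for illustration, since the notebook does not print its confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Derived from the outputs above: accuracy times test-set size gives the
# number of correct predictions, so the remainder are errors.
n_test = 1146
accuracy = 0.9860383944153578
n_errors = n_test - round(accuracy * n_test)
print(n_errors)  # 16

# Hypothetical counts for illustration only (not the notebook's confusion matrix).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 0.90 0.75 0.82
```

For a spam filter, precision on class 1 reflects how often a flagged email really is spam, while recall reflects how much of the spam is caught.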



In [36]:
# Using Cross-validation to assess generalizability
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(model, X, y, cv=5)

In [37]:
print(f'Cross-validation Scores: {cv_score}')

Cross-validation Scores: [0.9877836  0.9904014  0.9921466  0.99039301 0.99388646]


In [38]:
print('Mean CV Score:', cv_score.mean())

Mean CV Score: 0.9909222128230336
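Alongside the mean, the spread of the fold scores indicates how stable the estimate is. A minimal check using the five values printed above, with the standard library standing in for NumPy:

```python
import statistics

# Fold accuracies copied from the cross_val_score output above.
cv_scores = [0.9877836, 0.9904014, 0.9921466, 0.99039301, 0.99388646]

mean_score = statistics.mean(cv_scores)
spread = statistics.pstdev(cv_scores)  # population std, like NumPy's default .std()

print(f"mean = {mean_score:.4f}, std = {spread:.4f}")
```

With the fitted objects in the notebook, `cv_score.std()` would give the same spread directly. Because the estimator is a classifier and `cv` is an integer, `cross_val_score` uses stratified folds, so each fold keeps roughly the same spam ratio.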


### Example Mail

In [39]:
new_email = ["Congratulations! You've won a free vacation. Click here to claim your prize."]
prediction = model.predict(new_email)

In [40]:
# In this dataset the label 1 means spam and 0 means ham.
if prediction[0] == 1:
    print('The email is classified as "spam".')
elif prediction[0] == 0:
    print('The email is classified as "ham" (non-spam).')
else:
    print('Invalid prediction label.')

The email is classified as "spam".


In [42]:
# predict_proba columns follow the sorted class labels: index 0 is ham (0), index 1 is spam (1).
probabilities = model.predict_proba(new_email)[0]
probability_ham = probabilities[0]
probability_spam = probabilities[1]

print(f"Spam Probability: {probability_spam:.2f}")
print(f"Ham Probability: {probability_ham:.2f}")

Spam Probability: 1.00
Ham Probability: 0.00
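A safer way to read `predict_proba` output is to pair each column with its class label instead of relying on hard-coded indices. The sketch below uses stand-in values: `classes` mimics `model.classes_` for a model trained on labels 0 and 1, and `proba_row` mirrors the rounded probabilities above (0.00 at index 0, 1.00 at index 1). The `label_names` mapping is an assumption based on the dataset's `spam` column.

```python
# Stand-in values: scikit-learn sorts class labels, so a model trained on
# labels 0 and 1 exposes classes_ == [0, 1].
classes = [0, 1]
proba_row = [0.0, 1.0]

label_names = {0: "ham", 1: "spam"}  # dataset convention: spam == 1

# Pair every probability with the human-readable name of its class.
named = {label_names[c]: p for c, p in zip(classes, proba_row)}
print(named)  # {'ham': 0.0, 'spam': 1.0}

# With the fitted pipeline this would read (not executed here):
# named = dict(zip(model.classes_, model.predict_proba(new_email)[0]))
```

Read this way, the example email gets a spam probability of about 1.00, which agrees with the predicted label of 1.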
