## Project Description

This project focuses on building a model that classifies an email as spam or not spam, using a dataset obtained from Kaggle. The notebook compares Gaussian, Multinomial, and Bernoulli Naive Bayes classifiers; the Multinomial Naive Bayes model is the one saved for use in the application.

Dataset link: https://www.kaggle.com/datasets/mfaisalqureshi/spam-email?select=spam.csv

## Import Necessary Libraries

In [69]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import pandas as pd
import pickle
import string
import nltk

In [30]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## Load Dataset

In [4]:
df = pd.read_csv('dataset/spam.csv')

In [5]:
df.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   int32 
 1   Message   5572 non-null   object
dtypes: int32(1), object(1)
memory usage: 65.4+ KB


In [13]:
df.shape

(5572, 2)

## Apply LabelEncoder To The "Category" Column

In [6]:
label_encoder = LabelEncoder()

In [8]:
df['Category'] = label_encoder.fit_transform(df['Category'])

In [9]:
df.head()

| | Category | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


## Check And Handle Duplicated Values

In [14]:
df.duplicated().sum()

415

In [15]:
df = df.drop_duplicates(keep='first')

In [16]:
df.duplicated().sum()

0

In [17]:
df.shape

(5157, 2)

## Process Data To Be Used For Model

In [42]:
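# Requires the NLTK 'stopwords' corpus: nltk.download('stopwords')
# Build the stop-word set and the stemmer once, not on every token.
STOP_WORDS = set(stopwords.words('english'))
ps = PorterStemmer()

def transform_text(text):
    # Lowercase the message and split it into tokens
    tokens = nltk.word_tokenize(text.lower())

    # Keep alphanumeric tokens only; this also drops punctuation tokens
    tokens = [t for t in tokens if t.isalnum()]

    # Remove English stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]

    # Stem each remaining token and join back into one string
    return ' '.join(ps.stem(t) for t in tokens)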

In [45]:
df['transformed_message'] = df['Message'].apply(transform_text)

In [46]:
df.head()

| | Category | Message | transformed_message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazi avail bugi n great world... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joke wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entri 2 wkli comp win fa cup final tkt 21... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say earli hor u c alreadi say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah think goe usf live around though |


## Split Data Into Training And Testing To Fit Into Model

In [48]:
tfidf = TfidfVectorizer(max_features=3000)

In [49]:
X = tfidf.fit_transform(df['transformed_message']).toarray()
y = df['Category'].values

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=2)

## Fit Dataset Into Gaussian Naive Bayes Model

In [53]:
gnb = GaussianNB()

In [54]:
gnb.fit(X_train, y_train)

In [55]:
gnb_prediction = gnb.predict(X_test)

In [61]:
print(confusion_matrix(y_test, gnb_prediction))
print('Accuracy Score :', accuracy_score(y_test, gnb_prediction))
print('Precision Score :', precision_score(y_test, gnb_prediction))

[[784 121]
 [ 17 110]]
Accuracy Score : 0.8662790697674418
Precision Score : 0.47619047619047616


## Fit Dataset Into Multinomial Naive Bayes Model

In [58]:
mnb = MultinomialNB()

In [59]:
mnb.fit(X_train, y_train)

In [60]:
mnb_prediction = mnb.predict(X_test)

In [63]:
print(confusion_matrix(y_test, mnb_prediction))
print('Accuracy Score :', accuracy_score(y_test, mnb_prediction))
print('Precision Score :', precision_score(y_test, mnb_prediction))

[[905   0]
 [ 30  97]]
Accuracy Score : 0.9709302325581395
Precision Score : 1.0


## Fit Dataset Into Bernoulli Naive Bayes Model

In [64]:
bnb = BernoulliNB()

In [65]:
bnb.fit(X_train, y_train)

In [66]:
bnb_prediction = bnb.predict(X_test)

In [68]:
print(confusion_matrix(y_test, bnb_prediction))
print('Accuracy Score :', accuracy_score(y_test, bnb_prediction))
print('Precision Score :', precision_score(y_test, bnb_prediction))

[[903   2]
 [ 15 112]]
Accuracy Score : 0.9835271317829457
Precision Score : 0.9824561403508771
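
The precision figures above can be cross-checked by hand from the printed confusion matrices, and the same counts also give recall, which the notebook does not print. The sketch below is a minimal, standard-library illustration: the helper name `precision_recall` is hypothetical, and it assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, with class 1 meaning spam.

```python
def precision_recall(matrix):
    """Return (precision, recall) for the positive class (1 = spam).

    Assumes scikit-learn's confusion_matrix layout:
    [[TN, FP],
     [FN, TP]]
    """
    (tn, fp), (fn, tp) = matrix
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Confusion matrices copied from the outputs above.
results = {
    "GaussianNB": [[784, 121], [17, 110]],
    "MultinomialNB": [[905, 0], [30, 97]],
    "BernoulliNB": [[903, 2], [15, 112]],
}

for name, matrix in results.items():
    precision, recall = precision_recall(matrix)
    print(f"{name}: precision={precision:.4f}, recall={recall:.4f}")
```

Read this way, MultinomialNB flags no ham as spam on this split but misses 30 of the 127 spam messages, while BernoulliNB misses 15 at the cost of 2 false positives.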


## Save The Vectorizer And Model As Pickle Files For The Streamlit App

In [70]:
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)  # fitted TF-IDF vectorizer
with open('model.pkl', 'wb') as f:
    pickle.dump(mnb, f)  # trained Multinomial Naive Bayes model
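
On the application side, the two pickled objects would be loaded back and applied in the same order as in training: preprocess, vectorize, predict. The following is a minimal sketch under stated assumptions, not the project's actual Streamlit code. `load_artifacts` and `classify_message` are hypothetical helper names, and `preprocess` stands for the `transform_text` function defined earlier, which the app would need to import or redefine.

```python
import pickle


def load_artifacts(vectorizer_path="vectorizer.pkl", model_path="model.pkl"):
    """Load the fitted vectorizer and model saved by the notebook."""
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return vectorizer, model


def classify_message(message, vectorizer, model, preprocess):
    """Return 'spam' or 'ham' for a single raw message."""
    cleaned = preprocess(message)               # same cleaning as in training
    features = vectorizer.transform([cleaned])  # one row of TF-IDF features
    label = model.predict(features)[0]          # 1 = spam, 0 = ham (LabelEncoder order)
    return "spam" if label == 1 else "ham"


# Example usage (assumes the .pkl files exist and transform_text is available):
# tfidf, mnb = load_artifacts()
# print(classify_message("Free entry to win a prize", tfidf, mnb, transform_text))
```

Because unpickling can execute arbitrary code, only files produced by this notebook should be loaded this way.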