## Email Spam Classifier

A Naive Bayes model that classifies emails as spam or ham (non-spam).

### Loading the dataset and importing the necessary modules

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/rishu/ML_Codebasics/Email_SpamClassifier/emails.csv')
df.head()

,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


In [2]:
df.tail()

,text,spam
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0
5727,Subject: news : aurora 5 . 2 update aurora ve...,0


In [2]:
df.groupby('spam').describe()

,text,text,text,text
spam,count,unique,top,freq
0,4360,4327,Subject: re : weather and energy price data m...,2
1,1368,1368,Subject: i want to mentor you - no charge thi...,1
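
The per-class counts above show that the dataset is imbalanced. As a rough sketch, using only the two counts copied by hand from that output, the class shares and the accuracy of a trivial "always ham" classifier can be worked out as follows. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the describe() output above
ham_count = 4360
spam_count = 1368
total = ham_count + spam_count

# Fraction of emails labelled spam
spam_share = spam_count / total

# Accuracy of a trivial classifier that labels every email as ham
majority_baseline = ham_count / total

print(f"total emails: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported for a trained model is more meaningful when compared against this baseline.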


### Splitting the dataset into training and test sets

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
X_train,X_test,y_train,y_test = train_test_split(df.text,df.spam,test_size=0.3,random_state=40)

In [5]:
len(X_train)

4009

In [6]:
len(X_test)

1719
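
The two sizes above can be checked with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples and that the remainder goes to the training set. That rounding rule is an assumption here, though it is consistent with the sizes reported above.

```python
import math

# Total number of emails: train size + test size reported above
n_samples = 4009 + 1719
test_size = 0.3

# Assumed rule: round the test share up, give the rest to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```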

### Converting emails into count vectors with unique words as features

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
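# Learn the vocabulary from the training emails and count each word per email
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train.values)
# fit_transform returns a sparse matrix; densify it only for display
X_train_cv.toarray()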

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
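
To make the matrix above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea: build a sorted vocabulary from the training texts, then count how often each vocabulary word occurs in each text. It is not scikit-learn's implementation. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are made up for illustration, and the tokenization rule (lowercase, words of two or more characters) is a simplification.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # Map every distinct word to a column index, in alphabetical order
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}

def transform(docs, vocab):
    # One row per document, one column per vocabulary word
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocab)
        for word, n in counts.items():
            if word in vocab:  # words outside the vocabulary are dropped
                row[vocab[word]] = n
        rows.append(row)
    return rows

train_docs = ["Free money now", "Meeting at noon", "Free free offer"]
vocab = fit_vocabulary(train_docs)
matrix = transform(train_docs, vocab)
unseen = transform(["Free pizza"], vocab)

print(sorted(vocab, key=vocab.get))
for row in matrix:
    print(row)
```

Most entries are zero because each email uses only a small part of the vocabulary, which is why the real matrix is stored in sparse form.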

### Naive Bayes

In [9]:
from sklearn.naive_bayes import MultinomialNB

In [10]:
model = MultinomialNB()
model.fit(X_train_cv,y_train)

MultinomialNB()
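
For intuition about what the fitted model stores, the following is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing: a log prior per class plus a log likelihood per word and class, combined by summation at prediction time. It is a simplified illustration, not scikit-learn's code. The functions `train_nb` and `predict_nb`, the whitespace tokenization, and the toy emails are all assumptions made for this example.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    # Vocabulary over all training documents (whitespace tokens)
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # log P(class): share of training documents in this class
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # log P(word | class) with additive smoothing
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(model, doc):
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)

toy_docs = ["free money now", "free offer now", "meeting at noon", "lunch at noon"]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "free offer"))
print(predict_nb(toy_model, "meeting noon"))
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.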

In [11]:
emails = [
    'Hey Reeja,movie tonight?',
    "50% discount on sale,Don't miss this reward!"
]
# Reuse the fitted vocabulary; words unseen during training are ignored
emails_count = cv.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)

In [12]:
# Transform (do not refit) so the vocabulary comes from the training data only
X_test_cv = cv.transform(X_test)
y_predicted = model.predict(X_test_cv)
y_predicted

array([0, 0, 1, ..., 0, 1, 0], dtype=int64)

In [13]:
model.score(X_test_cv,y_test)

0.9924374636416521
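
For a classifier, this score is the fraction of test emails labelled correctly. The sketch below shows that computation on made-up labels, adds precision and recall for the spam class as a complementary view on an imbalanced dataset, and converts the score above back into a count of correct predictions. The helper names and toy labels are illustrative assumptions; only the score and the test-set size come from the notebook.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels (1 = spam, 0 = ham), not taken from the dataset
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_precision, toy_recall = precision_recall(toy_true, toy_pred)

# Reported score times the test-set size gives the number of correct predictions
implied_correct = round(0.9924374636416521 * 1719)

print(toy_accuracy, toy_precision, toy_recall)
print(implied_correct, 1719 - implied_correct)
```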

### Pipelining the whole procedure

In [14]:
from sklearn.pipeline import Pipeline

In [15]:
clf = Pipeline([
    ('vectorizer',CountVectorizer()),
    ('nb',MultinomialNB())
])

In [16]:
clf.fit(X_train,y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [17]:
clf.score(X_test,y_test)

0.9924374636416521

In [18]:
clf.predict(emails)

array([0, 1], dtype=int64)
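
The pipeline gives the same score and predictions as the manual steps because it chains them: fitting runs `fit_transform` on each step except the last and then `fit` on the final step, while predicting runs `transform` and then `predict`. The sketch below mimics that control flow with hypothetical stand-in classes (`WordSets`, `SpamWordRule`, `TinyPipeline`). None of them exist in scikit-learn, and the keyword rule is a placeholder rather than Naive Bayes.

```python
class WordSets:
    """Hypothetical transformer: turns each text into a set of lowercase words."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [set(text.lower().split()) for text in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class SpamWordRule:
    """Hypothetical estimator: flags a text containing a word seen only in spam."""
    def fit(self, X, y):
        spam_words = set().union(*(ws for ws, label in zip(X, y) if label == 1))
        ham_words = set().union(*(ws for ws, label in zip(X, y) if label == 0))
        self.spam_only_ = spam_words - ham_words
        return self

    def predict(self, X):
        return [1 if ws & self.spam_only_ else 0 for ws in X]

class TinyPipeline:
    """Minimal stand-in that chains transformers and a final estimator."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)   # fit and apply each transformer
        self.steps[-1][1].fit(X, y)        # fit the final estimator
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)          # reuse the fitted transformers
        return self.steps[-1][1].predict(X)

pipe = TinyPipeline([("words", WordSets()), ("rule", SpamWordRule())])
pipe.fit(["Free money now", "Lunch at noon"], [1, 0])
preds = pipe.predict(["free lunch", "See you at noon"])
print(preds)
```

Passing raw text straight to the pipeline removes the risk of forgetting to vectorize new emails with the training vocabulary.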

### Conclusion

The model classifies an email as spam or ham, reaching about 99.2% accuracy on the 1,719 held-out test emails.