<a href="https://colab.research.google.com/github/ramu11/PyTorch_ML_Models/blob/main/ML_NaiveBayes_Email_classification_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Email spam vs. not-spam classification using Naive Bayes with scikit-learn

Bernoulli Naive Bayes: assumes that every feature is binary, taking only two values. For text, 0 means "the word does not occur in the document" and 1 means "the word occurs in the document".
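
As a rough illustration of the binary representation described above, the sketch below marks each vocabulary word as present (1) or absent (0) in a message. The vocabulary, the message, and the helper name `binary_features` are hypothetical and chosen only for this example.

```python
# Hypothetical toy vocabulary and message, for illustration only.
vocabulary = ["call", "free", "later", "win"]
message = "win win win free"

def binary_features(text, vocab):
    """Return 1 for each vocabulary word that occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

features = binary_features(message, vocabulary)
print(features)  # [0, 1, 0, 1]
```

Note that "win" appears three times in the message but still contributes a single 1: only presence matters in this representation.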

Multinomial Naive Bayes: used for discrete count data (e.g. movie ratings from 1 to 5, where each rating occurs with a certain frequency). In text classification, the features are the counts of each word in a document, and these counts are used to predict the class label.
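
To make the idea of count features concrete, here is a minimal sketch that turns two toy messages into word-count vectors. The vocabulary, the messages, and the helper name `count_features` are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical toy vocabulary and messages, for illustration only.
vocabulary = ["call", "free", "later", "win"]

def count_features(text, vocab):
    """Return how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

many_wins = count_features("win win win free", vocabulary)
one_win = count_features("win free", vocabulary)
print(many_wins)  # [0, 1, 0, 3]
print(one_win)    # [0, 1, 0, 1]
```

The two messages contain the same words, so a binary representation would treat them identically, while the count representation keeps the information that "win" is repeated.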

Gaussian Naive Bayes: assumes that each feature follows a normal distribution within each class, so it is used when all features are continuous. For example, the Iris dataset has the features sepal width, petal width, sepal length, and petal length. Widths and lengths are real-valued measurements that cannot be expressed as occurrence counts, so the data is continuous and Gaussian Naive Bayes is the appropriate variant.
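
The sketch below shows the core of the Gaussian idea for a single continuous feature: estimate a mean and variance per class, then compare the normal densities of a new value. The petal lengths are made-up numbers (not taken from the real Iris dataset), equal class priors are assumed, and `gaussian_pdf` is a hypothetical helper.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Made-up petal lengths in cm for two classes (illustration only).
petal_length = {
    "setosa": [1.4, 1.5, 1.3, 1.6],
    "versicolor": [4.5, 4.7, 4.2, 4.0],
}

# Per-class mean and variance of the feature.
params = {label: (mean(v), pvariance(v)) for label, v in petal_length.items()}

# With equal priors, the class with the highest density wins.
x_new = 1.7
likelihoods = {label: gaussian_pdf(x_new, mu, var) for label, (mu, var) in params.items()}
best = max(likelihoods, key=likelihoods.get)
print(best)  # setosa
```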

Note: For spam vs. not-spam classification here, we use `Multinomial Naive Bayes`.

In [None]:
# Download the mail CSV from GitHub to Google Colab
import requests
from pathlib import Path

# Download csv data from git repo (if not already downloaded)
if Path("mail.csv").is_file():
  print("mail.csv already exists, skipping download")
else:
  print("Downloading mail.csv")
  # Note: you need the "raw" GitHub URL for this to work
  request = requests.get("https://github.com/ramu11/PyTorch_ML_Models/raw/main/data/mail.csv")
  request.raise_for_status()  # fail loudly instead of saving an error page as mail.csv
  with open("mail.csv", "wb") as f:
    f.write(request.content)

Downloading mail.csv


In [None]:
# Import mail data set using pandas
import pandas as pd
df = pd.read_csv("mail.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# group by category and describe
df.groupby('Category').describe()


Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [None]:
# Encode Category as a numeric label: 1 for spam, 0 for ham
df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
df.head()


Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [None]:
# Create train/test splits with scikit-learn (25% of the rows held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.25)
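
Because `train_test_split` shuffles the rows randomly, the split (and therefore the score reported later) changes between runs unless a seed is fixed, for example through its `random_state` parameter. The following is a minimal pure-Python sketch of a seeded 75/25 split over row indices. It is not scikit-learn's implementation; `split_indices` and the seed value are hypothetical, and the row count of 5572 is the sum of the ham and spam counts shown earlier.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5572)
print(len(train_idx), len(test_idx))  # 4179 1393

# The same seed reproduces exactly the same split.
assert split_indices(5572) == (train_idx, test_idx)
```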


In [None]:
# Convert the Message column text to numeric word counts with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:3]



array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
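
The rows above are almost entirely zeros because each message uses only a handful of the words in the full vocabulary. The sketch below roughly mimics the default behaviour of `CountVectorizer` (lowercasing, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically) in plain Python. It is an approximation for illustration, not the library's code, and the two messages are shortened versions of rows from the preview above.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, so "u", "2" and "a" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

docs = ["Ok lar... Joking wif u oni...", "Free entry in 2 a wkly comp"]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document, one column per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocab])

print(vocab)
# ['comp', 'entry', 'free', 'in', 'joking', 'lar', 'ok', 'oni', 'wif', 'wkly']
print(matrix[0])  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
print(matrix[1])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
```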

In [None]:
# Fit the model
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count, y_train)

model.get_params()

{'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'force_alpha': 'warn'}
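
The `alpha` value of 1.0 in the parameters above is the additive (Laplace) smoothing constant: it is added to every word count so that a word never seen in one class does not force that class's probability to zero. The sketch below is a minimal from-scratch multinomial Naive Bayes on made-up, pre-tokenized messages. It is an illustration of the idea under those assumptions, not scikit-learn's implementation, and `log_prob` and `predict` are hypothetical helper names.

```python
import math
from collections import Counter

# Made-up, pre-tokenized training messages; label 1 = spam, 0 = ham.
train = [
    (["free", "win", "prize"], 1),
    (["win", "free", "free"], 1),
    (["call", "later"], 0),
    (["see", "you", "later"], 0),
]
alpha = 1.0  # additive smoothing constant

vocab = sorted({word for tokens, _ in train for word in tokens})
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def log_prob(word, label):
    """Smoothed log P(word | label)."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))

def predict(tokens):
    """Pick the label with the highest log prior plus summed word log-likelihoods."""
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        score += sum(log_prob(word, label) for word in tokens if word in vocab)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["free", "prize", "later"]))  # 1
print(predict(["call", "later"]))           # 0
```

In this toy data "later" never occurs in a spam message, yet `log_prob("later", 1)` stays finite thanks to the smoothing term.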

In [None]:
# Predict labels for new emails
emails = [
    'Hey Ramu, can we get together to watch footbal game tomorrow?',
    'Upto 30% discount on parking, exclusive offer just for you. Dont miss this reward!'
]

emails_count = v.transform(emails)
model.predict(emails_count) # 1 means it is spam


array([0, 1])

In [None]:
# Score the model (mean accuracy) on the test set
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)



0.9856424982053122
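
Accuracy is easier to interpret next to the class balance. Using the ham and spam counts from the `describe()` output earlier, the sketch below computes the accuracy of a trivial "always ham" classifier on the full dataset (the proportion in a random test split may differ slightly). It also shows precision and recall for the spam class on a small set of made-up labels, since these say more about how well spam is caught. `precision_recall` is a hypothetical helper.

```python
# Class counts taken from the describe() output earlier in the notebook.
ham_count, spam_count = 4825, 747
baseline_accuracy = ham_count / (ham_count + spam_count)
print(f"{baseline_accuracy:.3f}")  # 0.866

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for illustration: one spam message is missed.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, round(recall, 3))  # 1.0 0.667
```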

In [None]:
# use sklearn pipeline to simplify
from sklearn.pipeline import Pipeline
clf = Pipeline([
    ('vectorizer', CountVectorizer()), # convert the text into vector
    ('nb', MultinomialNB())
])

In [None]:
# Now train the model
clf.fit(X_train, y_train)

In [None]:
# Now check the model score
clf.score(X_test, y_test)

0.9856424982053122

In [None]:
# Now test the model on the sample emails
clf.predict(emails)

array([0, 1])