# Naive Bayes

##### Bayes’s Theorem

According to Wikipedia, in probability theory and statistics, **Bayes’s theorem** (alternatively *Bayes’s law* or *Bayes’s rule*) describes the probability of an event based on prior knowledge of conditions that might be related to the event.
Mathematically, it can be written as:

![formula.jpeg](attachment:formula.jpeg)

Where A and B are events and P(B) ≠ 0:
* P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
* P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
* P(A) and P(B) are the probabilities of observing A and B respectively; they are known as the marginal probabilities.
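
For reference, the relationship these terms describe can be written out in LaTeX. This is the standard statement of the theorem, using the same A and B as above:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$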


Let’s understand it with the help of an example:

**The problem statement:**

You are planning a picnic today, but the morning is cloudy.

- Oh no! 50% of all rainy days start off cloudy!
- But cloudy mornings are common (about 40% of days start cloudy).
- And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%).

What is the chance of rain during the day?

We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.

The chance of Rain given Cloud is written P(Rain|Cloud)

So let's put that in the formula:

$P(\text{Rain} \mid \text{Cloud}) = \frac{P(\text{Rain}) \cdot P(\text{Cloud} \mid \text{Rain})}{P(\text{Cloud})}$

- P(Rain) is the probability of Rain = 10%
- P(Cloud|Rain) is the probability of Cloud, given that Rain happens = 50%
- P(Cloud) is the probability of Cloud = 40%

$P(\text{Rain} \mid \text{Cloud}) = \frac{0.1 \times 0.5}{0.4} = 0.125$

Or a 12.5% chance of rain. Not too bad, let's have a picnic!
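
The same arithmetic can be checked in a few lines of Python. This is a minimal sketch; the helper name `bayes_posterior` is an assumption made for illustration, and the numbers are the ones given in the problem statement.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return likelihood * prior / evidence

p_rain = 0.10              # P(Rain): 3 of 30 days are rainy
p_cloud_given_rain = 0.50  # P(Cloud|Rain): half of rainy days start cloudy
p_cloud = 0.40             # P(Cloud): 40% of days start cloudy

p_rain_given_cloud = bayes_posterior(p_rain, p_cloud_given_rain, p_cloud)
print(f"P(Rain|Cloud) = {p_rain_given_cloud:.3f}")
```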

**Naïve:** It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features (given the class). For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature contributes individually to identifying it as an apple, without depending on the others.<br>
**Bayes:** It is called Bayes because it depends on the principle of Bayes' Theorem.
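
To make the independence assumption concrete, here is a minimal sketch of how the two ideas combine. The priors and likelihoods are hypothetical numbers invented for illustration (they do not come from any dataset); the point is only that each class score is the prior multiplied by one likelihood per feature.

```python
# Hypothetical priors and per-feature likelihoods, made up for illustration.
priors = {"apple": 0.5, "orange": 0.5}
likelihoods = {
    "apple":  {"red": 0.8, "spherical": 0.9, "sweet": 0.7},
    "orange": {"red": 0.1, "spherical": 0.9, "sweet": 0.6},
}

def naive_score(label, features):
    """Unnormalised score: P(label) times the product of P(feature | label)."""
    score = priors[label]
    for feature in features:
        score *= likelihoods[label][feature]
    return score

observed = ["red", "spherical", "sweet"]
scores = {label: naive_score(label, observed) for label in priors}

# Dividing by the total turns the scores into posterior probabilities.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```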

# Problem statement

The task is spam filtering with a naive Bayes classifier: predict, based on its content, whether a new mail should be categorized as spam or not spam.

### Data processing using the pandas library

In [1]:
# Import the required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import string
import matplotlib.pyplot as plt

In [2]:
data=pd.read_excel('spam.xlsx',names=['Class','Message'])

In [3]:
data.head()

Unnamed: 0,Class,Message
0,spam,Free entry in 2 a wkly comp to win FA Cup fina...
1,ham,"Nah I don't think he goes to usf, he lives aro..."
2,ham,Even my brother is not like to speak with me. ...
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!
4,ham,As per your request 'Melle Melle (Oru Minnamin...


In [None]:
# Load the dataset

data = pd.read_csv("spam.tsv",sep='\t',names=['Class','Message'])
data.head(8) # View the first 8 records of our dataset

Unnamed: 0,Class,Message
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!
5,ham,As per your request 'Melle Melle (Oru Minnamin...
6,spam,WINNER!! As a valued network customer you have...
7,spam,Had your mobile 11 months or more? U R entitle...


In [None]:
# to view the first record
data.loc[:0]

Unnamed: 0,Class,Message
0,ham,I've been searching for the right words to tha...


In [None]:
# Summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5567 non-null   object
 1   Message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB


In [None]:
# create a column to keep the count of the characters present in each record
data['Length'] = data['Message'].apply(len)

In [None]:
data['Length']

0       196
1       155
2        61
3        77
4        36
       ... 
5562    160
5563     36
5564     57
5565    125
5566     26
Name: Length, Length: 5567, dtype: int64

In [None]:
# view the dataset with the column 'Length' which contains the number of characters present in each mail
data.head(10)

Unnamed: 0,Class,Message,Length
0,ham,I've been searching for the right words to tha...,196
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
2,ham,"Nah I don't think he goes to usf, he lives aro...",61
3,ham,Even my brother is not like to speak with me. ...,77
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!,36
5,ham,As per your request 'Melle Melle (Oru Minnamin...,160
6,spam,WINNER!! As a valued network customer you have...,157
7,spam,Had your mobile 11 months or more? U R entitle...,154
8,ham,I'm gonna be home soon and i don't want to tal...,109
9,spam,"SIX chances to win CASH! From 100 to 20,000 po...",136


In [None]:
## The mails are categorised into 2 classes, i.e., spam and ham.
# Let's see the count of each class
data.groupby('Class').count()

Class,Message,Length
ham,4821,4821
spam,746,746
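
The classes are clearly imbalanced, which matters when reading accuracy figures later. As a rough sketch, using only the counts shown above, the accuracy of a trivial model that always predicts the majority class can be computed like this (the variable names are illustrative):

```python
# Class counts taken from the groupby output above.
class_counts = {"ham": 4821, "spam": 746}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = class_counts[majority_class] / total
print(f"Always predicting '{majority_class}' is right {baseline_accuracy:.1%} of the time")
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.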


### Data Visualization

In [None]:
data['Length'].describe() # to find the max length of the message.

count    5567.000000
mean       80.450153
std        59.891023
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: Length, dtype: float64

Look what we found: a message 910 characters long. Let's use masking to find this message:

In [None]:
data['Length']==910

0       False
1       False
2       False
3       False
4       False
        ...  
5562    False
5563    False
5564    False
5565    False
5566    False
Name: Length, Length: 5567, dtype: bool

In [None]:
# the message that has the max characters
data[data['Length']==910]['Message']

1080    For me the love should start with attraction.i...
Name: Message, dtype: object

In [None]:
# view the message that has 910 characters in it
data[data['Length']==910]['Message'].iloc[0]

"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

In [None]:
# View the message that has min characters
data[data['Length']==2]['Message'].iloc[0]

'Ok'

### Text Pre-Processing

In [None]:
# creating an object for the target values
dObject = data['Class'].values
dObject

array(['ham', 'spam', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [4]:
data['Target']=data['Class'].map({"spam":0,'ham':1})

In [None]:
# Let's assign ham as 1
data.loc[data['Class']=="ham","Class"] = 1

In [None]:
# Let's assign spam as 0
data.loc[data['Class']=="spam","Class"] = 0

In [None]:
dObject2=data['Class'].values
dObject2

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [None]:
data.head(8)

Unnamed: 0,Class,Message,Length
0,1,I've been searching for the right words to tha...,196
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155
2,1,"Nah I don't think he goes to usf, he lives aro...",61
3,1,Even my brother is not like to speak with me. ...,77
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36
5,1,As per your request 'Melle Melle (Oru Minnamin...,160
6,0,WINNER!! As a valued network customer you have...,157
7,0,Had your mobile 11 months or more? U R entitle...,154


First, we remove punctuation. We can take advantage of Python's built-in string library to get a quick list of all the possible punctuation:

In [None]:
# the default list of punctuations
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
# Why is it important to remove punctuation?

"This message is spam" == "This message is spam."

False

In [6]:
l1=['a','b','c']
"*".join(l1)

'a*b*c'

In [8]:
str1=data['Message'][0]

In [9]:
import string

In [None]:
[i for i in str1 if i not in string.punctuation]

In [16]:
def remove(text):
  out=''
  for i in text:
    if i not in string.punctuation:
      out=out+i
  return out

In [10]:
l1=[]
for i in str1:
  if i not in string.punctuation:
    l1.append(i)

In [12]:
"".join(l1)

'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s'

In [19]:
# One record stores its message as the number 645 rather than as text;
# an integer cannot be iterated character by character during cleaning
data.loc[data['Message']==645]

Unnamed: 0,Class,Message,Target
1606,ham,645,1


In [21]:
# Replace it with the string '645'; using .loc avoids pandas' chained-assignment warning
data.loc[1606,'Message']='645'


In [32]:
"hello"=="Hello"

False

In [None]:
# Let's remove the punctuation and convert the text to lower case

def remove_punct(text):
    # keep every character that is not a punctuation mark
    text = "".join([char for char in text if char not in string.punctuation])
    return text.lower()

data['text_clean'] = data['Message'].apply(remove_punct)

data.head()
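
As an aside, the same cleaning can be done with `str.translate`, which avoids the character-by-character loop. This is a sketch of an alternative, not the notebook's method; the name `remove_punct_fast` is hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Strip punctuation and lower-case the text in one pass."""
    return text.translate(_PUNCT_TABLE).lower()

# After cleaning, the two strings from the earlier comparison are equal.
print(remove_punct_fast("This message is spam") == remove_punct_fast("This message is spam."))
```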

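__Tokenization__ is the process of converting a normal text string into a list of tokens (typically words). Tokens are not the same as lemmas, which are the dictionary base forms of words. `CountVectorizer` performs tokenization internally.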
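
A minimal sketch of tokenization using only the standard library is shown below. The regular expression mirrors `CountVectorizer`'s default token pattern, which keeps runs of two or more word characters; the function name `tokenize` is an assumption for illustration.

```python
import re

# Runs of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Free entry in 2 a wkly comp to win FA Cup")
print(tokens)
```

Note that single-character tokens such as "2" and "a" are dropped by this pattern.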

In [None]:
# original text and cleaned text
data.head(8)

Now we need to convert each of those messages into a numeric vector that scikit-learn's models can work with.

In [33]:
# Countvectorizer is a method to convert text to numerical data.

# Initialize the object for countvectorizer
CV = CountVectorizer(stop_words="english")

Stop words are words that do not add much meaning to a sentence. They are very common in text documents, for example a, an, the, you, your. Because they appear so often yet are rarely helpful for text analysis, it is usually better to remove them so that the model can focus on the more informative words.
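
The idea can be sketched with a tiny hand-written stop list. The list below is hypothetical and far shorter than the built-in English list that `stop_words="english"` selects.

```python
# A tiny, hypothetical stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "you", "your", "is", "to", "in"}

def drop_stop_words(tokens):
    """Return the tokens that are not in the stop list, preserving order."""
    return [token for token in tokens if token not in STOP_WORDS]

filtered = drop_stop_words(["you", "have", "won", "the", "prize"])
print(filtered)
```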

In [35]:
CV.fit(data['text_clean'])

In [39]:
x=CV.transform(data['text_clean']).toarray()

In [None]:
data.head()

In [29]:
x=CV.transform(data['Message']).toarray()

In [None]:
# Splitting x and y

xSet = data['text_clean'].values
ySet = data['Class'].values
ySet

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [None]:
# The datatype for y is object. Let's convert it into int
ySet = ySet.astype('int')
ySet

array([1, 0, 1, ..., 1, 1, 1])

In [None]:
xSet

array(['Ive been searching for the right words to thank you for this breather I promise i wont take your help for granted and will fulfil my promise You have been wonderful and a blessing at all times',
       'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s',
       'Nah I dont think he goes to usf he lives around here though', ...,
       'Pity  was in mood for that Soany other suggestions',
       'The guy did some bitching but I acted like id be interested in buying something else next week and he gave it to us for free',
       'Rofl Its true to its name'], dtype=object)

### Splitting Train and Test Data

In [None]:
x

In [42]:
y=data['Target']

In [None]:
xSet_train,xSet_test,ySet_train,ySet_test = train_test_split(xSet,ySet,test_size=0.2, random_state=10)

In [43]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25)

In [None]:

xSet_train_CV = CV.fit_transform(xSet_train)
xSet_train_CV

<4453x8159 sparse matrix of type '<class 'numpy.int64'>'
	with 34532 stored elements in Compressed Sparse Row format>

### Training a model

With messages represented as vectors, we can finally train our spam/ham classifier. Almost any classification algorithm could be used at this point. Naive Bayes is a good choice here because it is fast to train and works well with word-count features.
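
To show roughly what such a classifier does internally, here is a from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It is a simplified illustration on a tiny made-up corpus, not scikit-learn's implementation; the function names and the data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Add-alpha smoothing keeps unseen (word, class) pairs from scoring zero.
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total_words + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words outside the training vocabulary are skipped.
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood
        )
    return max(scores, key=scores.get)

# Tiny hypothetical corpus.
docs = ["win cash prize now", "free prize entry", "are you coming home", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(docs, labels)

print(predict(model, "free cash prize"))
print(predict(model, "coming home for lunch"))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long messages.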

In [44]:
# Initialising the model
NB = MultinomialNB()

In [45]:
NB.fit(x_train,y_train)

In [None]:
# feed data to the model
NB.fit(xSet_train_CV,ySet_train)

MultinomialNB()

In [None]:
# Let's test CV on our test data
xSet_test_CV = CV.transform(xSet_test)

In [46]:
y_pred=NB.predict(x_test)

In [47]:
y_pred

array([1, 1, 1, ..., 0, 0, 1])

In [None]:
# prediction for xSet_test_CV

ySet_predict = NB.predict(xSet_test_CV)
ySet_predict

array([1, 1, 1, ..., 1, 1, 1])

In [None]:
# Checking accuracy

accuracyScore = accuracy_score(ySet_test,ySet_predict)*100

print("Prediction Accuracy :",accuracyScore)

Prediction Accuracy : 98.29443447037703


### Spam Classification Application

In [51]:
msg = input("Enter Message: ")  # get the input message
# apply the same cleaning and vectorization that were used on the training text
msgInput = CV.transform([remove_punct(msg)])
predict = NB.predict(msgInput)
if predict[0] == 0:  # 0 = spam, 1 = ham
    print("------------------------MESSAGE-SENT-[CHECK-SPAM-FOLDER]---------------------------")
else:
    print("---------------------------MESSAGE-SENT-[CHECK-INBOX]------------------------------")

Enter Message: dude i won lottery
------------------------MESSAGE-SENT-[CHECK-SPAM-FOLDER]---------------------------
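
One way to make sure a new message always goes through exactly the same transformation as the training data is to bundle the vectorizer and the classifier into a scikit-learn pipeline. The sketch below assumes scikit-learn is installed and uses a tiny made-up corpus instead of the notebook's dataset; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set.
train_messages = [
    "win cash prize now",
    "free entry win prize",
    "are you coming home",
    "see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier together,
# and reuses the fitted vocabulary whenever predict() is called.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(train_messages, train_labels)

predictions = spam_filter.predict(["win free cash", "coming home for lunch"])
print(list(predictions))
```

With a pipeline, the vectorizer is fitted only on whatever data is passed to `fit`, so the test split never influences the vocabulary.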


## BAG OF WORDS

We cannot pass text directly to our models in Natural Language Processing; we first need to convert it into numbers that a machine can understand and model.

In [None]:
# Let's understand it with a simple example

In [None]:
# creating a list of sentences
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

# Changing the text to lower case and remove the full stop from text
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs[3]

'man eats food'

In [None]:
# corpus is the collection of text
#look at the documents list
print("Our corpus: ", processed_docs)


# Initialise the object for CountVectorizer
count_vect = CountVectorizer()

#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for 'dog bites man': ", bow_rep[0].toarray())
print("BoW representation for 'man bites dog: ",bow_rep[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
BoW representation for 'dog bites man':  [[1 1 0 0 1 0]]
BoW representation for 'man bites dog:  [[1 1 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]
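
The output above shows that 'dog bites man' and 'man bites dog' receive the same vector: a bag of words keeps the counts but discards word order. The sketch below reproduces those vectors with the standard library only; the function name `bag_of_words` is hypothetical and the vocabulary is the alphabetically ordered one printed above.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Same column order as the vocabulary mapping printed above.
vocabulary = ["bites", "dog", "eats", "food", "man", "meat"]

v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)
v3 = bag_of_words("dog and dog are friends", vocabulary)
print(v1, v2, v1 == v2)
print(v3)
```

Words that are not in the vocabulary, such as "and" or "friends", are simply ignored.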


## TF-IDF

In the **BOW approach** we have seen so far, all the words in the text are treated as equally important. There is no notion of some words in the document being more important than others. TF-IDF addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus. It was a commonly used representation scheme for information retrieval systems, for extracting relevant documents from a corpus for a given text query.


<font color=darkviolet>  **Term Frequency (tf)** </font>
Term frequency measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).



<font color=darkviolet>  **Inverse Document Frequency (idf)** </font>
Inverse document frequency measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore weigh down the frequent terms and scale up the rare ones by computing:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).



__Let's see an example:__

Consider a document containing 100 words wherein the word cat appears 3 times.

The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03.

Now, assume we have 10 million documents and the word cat appears in one thousand of these.

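Then, the inverse document frequency (i.e., idf), using a base-10 logarithm to keep the arithmetic simple, is calculated as log10(10,000,000 / 1,000) = 4. (With the natural logarithm it would be about 9.21; the choice of base only rescales the weights.)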

Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12
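
The worked example can be reproduced directly. This is a small sketch using the numbers from the text and a base-10 logarithm, which is what gives the value 4.

```python
import math

term_count = 3             # occurrences of "cat" in the document
doc_length = 100           # total words in the document
n_documents = 10_000_000   # documents in the corpus
n_docs_with_term = 1_000   # documents that contain "cat"

tf = term_count / doc_length
idf = math.log10(n_documents / n_docs_with_term)
tfidf = tf * idf
print(f"tf = {tf}, idf = {idf}, tf-idf = {tfidf:.2f}")
```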

In [None]:
# Splitting x and y

X = data['text_clean'].values
y = data['Class'].values
y

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [None]:
# The datatype for y is object. Let's convert it into int
y = y.astype('int')
y

array([1, 0, 1, ..., 1, 1, 1])

In [None]:
type(X)

numpy.ndarray

In [None]:
## text preprocessing and feature vectorizer
# To extract features from a document of words, we import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


tf=TfidfVectorizer() ## object creation
X=tf.fit_transform(X) ## fitting and transforming the data into vectors


In [None]:
X.shape

(5567, 9537)

In [None]:
## print feature names selected from the raw documents
## (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf.get_feature_names_out()

In [None]:
## number of features created
len(tf.get_feature_names_out())

9537

In [None]:
X

<5567x9537 sparse matrix of type '<class 'numpy.float64'>'
	with 72701 stored elements in Compressed Sparse Row format>
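
Note that `TfidfVectorizer` does not use the textbook formula exactly. With its default settings (`smooth_idf=True`, `norm='l2'`) it computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies by the raw term count, and then scales each document vector to unit Euclidean length. The sketch below reproduces that calculation by hand on a hypothetical two-document corpus; the function names are assumptions for illustration.

```python
import math
from collections import Counter

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf as used by TfidfVectorizer's default settings."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_vector(doc_tokens, vocabulary, doc_freqs, n_docs):
    """Raw count times smoothed idf, then L2-normalised."""
    counts = Counter(doc_tokens)
    weights = [counts[t] * smoothed_idf(n_docs, doc_freqs[t]) for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

# Hypothetical, already tokenised corpus.
corpus = [["free", "prize", "free"], ["see", "you", "free"]]
vocabulary = sorted({token for doc in corpus for token in doc})
doc_freqs = {t: sum(t in doc for doc in corpus) for t in vocabulary}

vec = tfidf_vector(corpus[0], vocabulary, doc_freqs, len(corpus))
print(dict(zip(vocabulary, vec)))
```

A term that appears in every document still gets an idf of 1 under this scheme, rather than the 0 that the unsmoothed formula would give.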

In [None]:
## getting the feature vectors
X=X.toarray()

In [None]:
## Creating training and testing
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=6)

In [None]:
## Model creation
from sklearn.naive_bayes import BernoulliNB

## model object creation
nb=BernoulliNB(alpha=0.01)

## fitting the model
nb.fit(X_train,y_train)

## getting the prediction
y_hat=nb.predict(X_test)

In [None]:
y_hat

array([1, 1, 1, ..., 1, 1, 1])

In [None]:
## Evaluating the model
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,y_hat))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       186
           1       0.99      0.99      0.99      1206

    accuracy                           0.99      1392
   macro avg       0.98      0.98      0.98      1392
weighted avg       0.99      0.99      0.99      1392



In [None]:
## confusion matrix
pd.crosstab(y_test,y_hat)

Actual (rows) / Predicted (columns),0,1
0,178,8
1,7,1199
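
The figures in the classification report can be derived from this confusion matrix. The sketch below treats spam (label 0) as the positive class and uses only the four counts shown above; the variable names are illustrative.

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted).
tp, fn = 178, 8     # actual spam: predicted spam, predicted ham
fp, tn = 7, 1199    # actual ham:  predicted spam, predicted ham

precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```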


### Pros of Naive Bayes

- Naive Bayes is a fast, highly scalable algorithm.
- Naive Bayes can be used for both binary and multi-class classification. Scikit-learn provides several variants, such as GaussianNB, MultinomialNB, and BernoulliNB.
- It is a simple algorithm that depends mostly on counting.
- It is a great choice for text classification problems and a popular choice for spam email classification.
- It can be easily trained on small datasets.
- Naive Bayes can handle missing data in principle, since a missing feature can simply be ignored when the probability for a class value is calculated.


### Cons of Naive Bayes

- It considers all the features to be unrelated, so it cannot learn the relationship between features. This limits the applicability of this algorithm in real-world use cases.
- Naive Bayes can learn individual feature importance but can't determine the relationship among features.

## Application of Naive Bayes

##### Text classification / spam filtering / Sentiment analysis:
 - Naive Bayes classifiers are mostly used in text classification.
 - News article classification (sports, technology, etc.).
 - Spam or ham: Naive Bayes is a popular method for mail filtering.
 - Sentiment analysis focuses on identifying whether customers think positively or negatively about a certain topic (product or service).


##### Recommendation System:
- A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.



### 3 Types of Naive Bayes in Scikit Learn

__Gaussian__

- It is used in classification and it assumes that features follow a normal distribution.

__Multinomial__
- It is used for discrete counts. For example, in a text classification problem, instead of recording only whether a word occurs in the document (a Bernoulli trial), we count how often the word occurs in the document. You can think of it as the "number of times outcome number_x is observed over n trials".

__Bernoulli__
- The Bernoulli model is useful if your feature vectors are binary (i.e., zeros and ones). One application is text classification with a 'bag of words' model where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
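
The difference between the multinomial and Bernoulli views of the same document can be sketched as follows. The helper name `binarize` is an assumption for illustration; scikit-learn's `BernoulliNB` applies a comparable thresholding step itself through its `binarize` parameter, which defaults to 0.0.

```python
def binarize(counts, threshold=0):
    """Map counts to 1 (word occurs) or 0 (word does not occur)."""
    return [1 if count > threshold else 0 for count in counts]

# Count vector for 'dog and dog are friends' from the bag-of-words example.
multinomial_features = [0, 2, 0, 0, 0, 0]
bernoulli_features = binarize(multinomial_features)

print("multinomial view:", multinomial_features)
print("bernoulli view:  ", bernoulli_features)
```

The multinomial view records that "dog" appears twice, while the Bernoulli view records only that it appears.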