# Understanding Naive Bayes 

Naive Bayes is a classification algorithm based on Bayes' theorem, which gives the probability of a hypothesis given some observed evidence.

***P(A|B) = P(B|A) * P(A) / P(B)***

A is the hypothesis and B is the evidence.

P(B|A) is the probability of B given that A is True.

P(A) and P(B) are the marginal probabilities of A and B, that is, the probability of each event on its own.

***Mutually Exclusive and Exhaustive events***

When two events are mutually exclusive, it means they cannot both occur at the same time. But it doesn’t necessarily imply that one of the two events has to happen.

When two events are exhaustive, it means that one of them must occur.

Think again of a coin toss. The results are mutually exclusive (it will be either heads or tails; it can’t be both on the same flip). And it will be one of the two options — heads and tails are the only possible options (thus they are exhaustive).
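The two definitions can be checked mechanically by treating events as sets of outcomes. The following is a minimal sketch; the helper names and the die example are illustrative assumptions, not part of the original text.

```python
def mutually_exclusive(a, b):
    """True if the two events share no outcome."""
    return not (a & b)

def exhaustive(a, b, sample_space):
    """True if the two events together cover every possible outcome."""
    return (a | b) == sample_space

# Coin toss: heads and tails are both mutually exclusive and exhaustive.
coin = {"H", "T"}
heads, tails = {"H"}, {"T"}
print(mutually_exclusive(heads, tails), exhaustive(heads, tails, coin))  # True True

# Die roll (illustrative): "1 or 2" and "5 or 6" cannot both happen,
# but a 3 or 4 is also possible, so the two events are not exhaustive.
die = {1, 2, 3, 4, 5, 6}
low, high = {1, 2}, {5, 6}
print(mutually_exclusive(low, high), exhaustive(low, high, die))  # True False
```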

**Understanding Deck of Cards**

A standard deck of playing cards contains 52 cards, divided equally into two colors, "Red" and "Black". The deck has four suits: "Spades (♠)", "Hearts (♥)", "Diamonds (♦)" and "Clubs (♣)". Hearts and Diamonds are red; Spades and Clubs are black.

Each of the 4 suits contains 13 cards: Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King.

***Probability Concepts***

1. Marginal Probability [P(A) or P(B)]: If A is an event, then the marginal probability is the probability of that event occurring, P(A). 

Example: The probability that a card drawn from a pack is red:
P(red) = (13 Hearts + 13 Diamonds) / (52 cards in total) = 26/52 = 0.5

2. Joint Probability [P(A∩B) or P(B∩A)]: If A and B are two events then the joint probability of the two events is written as P(A ∩ B).

Example: The probability that a card drawn from a pack is red and has the value 4 is P(red and 4) = (1 four of Hearts + 1 four of Diamonds) / (52 cards in total) = 2/52 = 1/26

3. Conditional Probability [P(A|B) or P(B|A)]: If A and B are two events then the conditional probability of A occurring given that B has occurred is written as P(A|B).

Example: The probability that a card is a four given that we have drawn a red card is P(4|red) = (1 four of Hearts + 1 four of Diamonds) / (26 red cards in total) = 2/26 = 1/13.
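The three card examples can be reproduced by enumerating the deck and counting. This is a small sketch using only the standard library; the helper names (`prob`, `is_red`, `is_four`) are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "Jack", "Queen", "King"]
SUIT_COLOR = {"Spades": "Black", "Hearts": "Red", "Diamonds": "Red", "Clubs": "Black"}

# Every card is a (rank, suit) pair: 13 ranks x 4 suits = 52 cards.
deck = list(product(RANKS, SUIT_COLOR))

def is_red(card):
    return SUIT_COLOR[card[1]] == "Red"

def is_four(card):
    return card[0] == "4"

def prob(event, given=None):
    """P(event | given) by counting cards; without a condition this is P(event)."""
    pool = [card for card in deck if given is None or given(card)]
    return Fraction(sum(1 for card in pool if event(card)), len(pool))

p_red = prob(is_red)                                       # marginal
p_red_and_four = prob(lambda c: is_red(c) and is_four(c))  # joint
p_four_given_red = prob(is_four, given=is_red)             # conditional

print(p_red, p_red_and_four, p_four_given_red)  # 1/2 1/26 1/13
```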

The general multiplication rule links all three probabilities:
**P(A ∩ B) = P(A|B) ✕ P(B)**

Example: Let A be the event that the card is a 4 and B the event that the card is red. Then

P(A) = 4/52 = 1/13
P(B) = 26/52 = 1/2

The probability that a card is a 4 given that we have drawn a red card is P(4|red) = P(A|B) = 2/26 = 1/13.

Substituting into the formula gives P(A ∩ B) = (1/13) * (1/2) = 1/26, which matches the joint probability that a card drawn from the pack is red and has the value 4.
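The same substitution can be done with exact fractions. The sketch below also runs the numbers through Bayes' theorem to get P(red|4); the variable names are illustrative.

```python
from fractions import Fraction

# A = "the card is a 4", B = "the card is red" (values from the deck example).
p_a = Fraction(4, 52)
p_b = Fraction(26, 52)
p_a_given_b = Fraction(2, 26)

# Multiplication rule: P(A ∩ B) = P(A|B) * P(B)
p_a_and_b = p_a_given_b * p_b
print(p_a_and_b)  # 1/26

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # 1/2, i.e. two of the four 4s are red
```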

#### Working of Naive Bayes on Text Classification

Suppose we are building a classifier that says whether a text is about sports or not. Our training data has 5 sentences as follows:

In [1]:
import pandas as pd
import numpy as np

data = pd.DataFrame([])
data['Text'] = ["A great game", "The election was over", "Very clean match", "A clean but forgettable game", "It was a close election"]
data['Tag'] = ['Sports', 'Not Sports', 'Sports', 'Sports', 'Not Sports']

data.head()

| | Text | Tag |
|---|---|---|
| 0 | A great game | Sports |
| 1 | The election was over | Not Sports |
| 2 | Very clean match | Sports |
| 3 | A clean but forgettable game | Sports |
| 4 | It was a close election | Not Sports |


Now we need to identify the tag for the given **test_sentence = 'A very close game'**. Since Naive Bayes is a probabilistic classifier, we have to calculate the probability that the sentence "A very close game" is Sports and the probability that it’s Not Sports. Then, we take the largest probability as our end result. 

Let's express this mathematically. We want to compare:

P(Sports | 'A very close game') and P(Not Sports | 'A very close game')

Let's generate the features for the given text data using **Word Frequencies**. That is, we ignore word order and sentence construction, treating every document as a bag of the words it contains. Our features will be the counts of each of these words.

In [13]:
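### Sports: per-word counts, summed over all Sports sentences
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
sports_vocab = data[data['Tag'] == 'Sports']['Text'].apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0)
print('Distinct words in Sports: ', len(sports_vocab))
print(sports_vocab)

Distinct words in Sports:  8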
great          1.0
game           2.0
A              2.0
clean          2.0
Very           1.0
match          1.0
but            1.0
forgettable    1.0
dtype: float64


In [15]:
### Not Sports: per-word counts, summed over all Not Sports sentences
not_sports_vocab = data[data['Tag'] == 'Not Sports']['Text'].apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0)
print('Distinct words in Not Sports: ', len(not_sports_vocab))
print(not_sports_vocab)

Distinct words in Not Sports:  7
The         1.0
over        1.0
was         2.0
election    2.0
a           1.0
It          1.0
close       1.0
dtype: float64


In [18]:
### Combined: per-word counts over all sentences, regardless of tag
combined_vocab = data['Text'].apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0)
print('Distinct words overall: ', len(combined_vocab))
print(combined_vocab)

Distinct words overall:  15
great          1.0
game           2.0
A              2.0
The            1.0
over           1.0
was            2.0
election       2.0
clean          2.0
Very           1.0
match          1.0
but            1.0
forgettable    1.0
a              1.0
It             1.0
close          1.0
dtype: float64
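The cells above count words case-sensitively, so "A" and "a" are separate entries. As a cross-check, here is a minimal standard-library sketch that assumes lowercasing and whitespace tokenization (both are assumptions of this sketch, not steps taken in the cells above) and reports the total number of word occurrences per class alongside the number of distinct words.

```python
from collections import Counter

# Same five training sentences as the DataFrame above.
training = [
    ("A great game", "Sports"),
    ("The election was over", "Not Sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not Sports"),
]

def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()

word_counts = {}  # tag -> Counter of word occurrences
for text, tag in training:
    word_counts.setdefault(tag, Counter()).update(tokenize(text))

vocabulary = set()
for counts in word_counts.values():
    vocabulary.update(counts)

for tag, counts in word_counts.items():
    print(tag, "distinct:", len(counts), "total:", sum(counts.values()))
print("vocabulary size:", len(vocabulary))
```

With lowercasing, Sports has 8 distinct words and 11 word occurrences, Not Sports has 7 distinct words and 9 occurrences, and the combined vocabulary has 14 words because "A" and "a" collapse into one entry.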


##### Apply Bayes' Theorem

P(A|B) = P(B|A) * P(A) / P(B)

In our case, we have 

P(Sports | a very close game) = P(a very close game | Sports) * P(Sports) / P(a very close game) 

and

P(Not Sports | a very close game) = P(a very close game | Not Sports) * P(Not Sports) / P(a very close game)

Both expressions share the denominator P(a very close game), so it does not change which one is larger; we only need to compare the numerators. To estimate P(a very close game | Sports) directly, we would count how many Sports sentences are exactly "a very close game" and divide by the number of Sports sentences.

There's a problem though: "A very close game" doesn't appear in our training data, so this probability is zero. Unless every sentence that we want to classify appears in our training data, the model won't be very useful.

This is where the ***Naive*** part of the algorithm comes in: we assume that every word in a sentence is independent of the others, so that

P(a very close game) = P(a) * P(very) * P(close) * P(game)

Applying the same assumption to the conditional probability gives:

    P(a very close game | Sports) = P(a | Sports) * P(very | Sports) * P(close | Sports) * P(game | Sports)
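The factorization can be sketched in a few lines of Python. This sketch assumes lowercased, whitespace-split words and estimates each P(word | Sports) as the word's share of all word occurrences in the Sports sentences; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import prod

sports_sentences = ["A great game", "Very clean match", "A clean but forgettable game"]

# Word counts for the Sports class (lowercased, whitespace-split: an assumption).
counts = Counter(word for s in sports_sentences for word in s.lower().split())
total = sum(counts.values())

def word_likelihood(word):
    """Unsmoothed estimate of P(word | Sports) as a relative frequency."""
    return Fraction(counts[word], total)

def sentence_likelihood(sentence):
    """Naive assumption: multiply the per-word likelihoods."""
    return prod(word_likelihood(w) for w in sentence.lower().split())

print(sentence_likelihood("A clean game"))       # 8/1331
print(sentence_likelihood("A very close game"))  # 0, because "close" never occurs in Sports
```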

##### Calculation

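The counts above are case-sensitive, so "A" and "a" appear as separate entries and "Very" would not match "very". For the calculation we ignore case, which leaves 14 distinct words in the combined vocabulary. The denominators are the total number of word occurrences in each class, obtained by summing the per-word counts: 11 for Sports and 9 for Not Sports (the 8 and 7 printed above are the numbers of distinct words per class).

P(Sports) = 3/5
P(Not Sports) = 2/5

P(a | Sports) = 2/11
P(very | Sports) = 1/11
P(close | Sports) = 0/11 = 0
P(game | Sports) = 2/11

So P(a very close game | Sports) = 2/11 * 1/11 * 0 * 2/11 = 0

A single unseen word wipes out the whole product, so this estimate gives us no information at all and we have to find a way around it.

To handle scenarios like this we use **Laplace smoothing**: we add 1 to every count so it's never zero. To balance this, we add the number of possible words to the divisor, so the probabilities for a class still sum to 1. In our case the vocabulary has 14 words, and smoothing is applied as below.

P(game | Sports) = (2+1)/(11+14) = 3/25
P(close | Sports) = (0+1)/(11+14) = 1/25
P(very | Sports) = (1+1)/(11+14) = 2/25
P(a | Sports) = (2+1)/(11+14) = 3/25

P(a very close game | Sports) = 3/25 * 1/25 * 2/25 * 3/25 = 18/390625 = **0.00004608**

P(game | Not Sports) = (0+1)/(9+14) = 1/23
P(close | Not Sports) = (1+1)/(9+14) = 2/23
P(very | Not Sports) = (0+1)/(9+14) = 1/23
P(a | Not Sports) = (1+1)/(9+14) = 2/23

P(a very close game | Not Sports) = 1/23 * 2/23 * 1/23 * 2/23 = 4/279841 = **0.00001429**

Finally, we multiply each likelihood by its class prior:

P(Sports) * P(a very close game | Sports) = 3/5 * 0.00004608 = **0.00002765**
P(Not Sports) * P(a very close game | Not Sports) = 2/5 * 0.00001429 = **0.00000572**

**0.00002765** > **0.00000572**

So our classifier assigns the Sports tag to this sentence.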
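The whole procedure fits in a short script. The following is a minimal sketch, not a production implementation. It assumes lowercased whitespace tokens, add-one (Laplace) smoothing, and per-class denominators equal to the total number of word occurrences in the class plus the vocabulary size. It also multiplies each likelihood by the class prior. All function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import prod

training = [
    ("A great game", "Sports"),
    ("The election was over", "Not Sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not Sports"),
]

def tokenize(text):
    """Lowercase and split on whitespace (assumed tokenizer)."""
    return text.lower().split()

# Number of training sentences per tag, and word counts per tag.
class_docs = Counter(tag for _, tag in training)
word_counts = {tag: Counter() for tag in class_docs}
for text, tag in training:
    word_counts[tag].update(tokenize(text))
vocabulary = {word for counts in word_counts.values() for word in counts}

def likelihood(sentence, tag, alpha=1):
    """P(sentence | tag) under the naive assumption with add-alpha smoothing."""
    counts = word_counts[tag]
    denominator = sum(counts.values()) + alpha * len(vocabulary)
    return prod(Fraction(counts[w] + alpha, denominator) for w in tokenize(sentence))

def score(sentence, tag):
    """Prior times likelihood; proportional to P(tag | sentence)."""
    prior = Fraction(class_docs[tag], len(training))
    return prior * likelihood(sentence, tag)

test_sentence = "A very close game"
scores = {tag: score(test_sentence, tag) for tag in class_docs}
for tag, value in scores.items():
    print(tag, value, float(value))
print("Predicted tag:", max(scores, key=scores.get))  # Predicted tag: Sports
```

Exact fractions keep the arithmetic easy to check by hand. Under these assumptions the likelihoods are 18/390625 for Sports and 4/279841 for Not Sports.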

There are many things that can be done to improve this basic model. With these techniques, Naive Bayes can often be competitive with more advanced methods on text classification tasks. Some of these techniques are:

1. Removing stopwords.
2. Lemmatizing words. 
3. Using n-grams.
4. Using TF-IDF.
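Two of these preprocessing steps, stopword removal and n-grams, are simple enough to sketch without any library. The stopword list below is a tiny hypothetical one chosen for this example, and the function names are illustrative; in practice an established NLP library would typically supply both.

```python
# Hypothetical, tiny stopword list for illustration; real lists are much longer.
STOPWORDS = {"a", "the", "it", "was", "but"}

def remove_stopwords(tokens):
    """Drop very common words that carry little information about the topic."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a clean but forgettable game".split()
print(remove_stopwords(tokens))  # ['clean', 'forgettable', 'game']
print(ngrams(tokens, 2))         # ['a clean', 'clean but', 'but forgettable', 'forgettable game']
```

The resulting tokens or n-grams can then be counted as features in place of the raw words.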