## Naive Bayes

#### What is Naive Bayes? 

Naive Bayes is one of the simplest yet most effective classification algorithms. It is based on Bayes’ Theorem and assumes that the predictors are independent of one another. A Naive Bayes model is easy to build and particularly useful for very large data sets. The name has two parts:

    Naive
    Bayes

The Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to the presence of any other feature. For example, a fruit might be classified as an apple, an orange, or a banana based on features such as its color, shape, and size. Even if these features depend on each other, the classifier treats each of them as contributing independently to the probability that the fruit belongs to a given class, which is why it is called “Naive”.
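In symbols, the assumption can be written as follows. The notation is introduced here for illustration only: $c$ stands for a class and $x_1, \dots, x_n$ for the feature values. Under the naive assumption the features are conditionally independent given the class, so their joint likelihood factorizes into a product:

$$
P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)
$$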

#### What is Bayes Theorem?

In statistics and probability theory, Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. It provides a way to compute conditional probabilities.

Given a hypothesis H and evidence E, Bayes’ Theorem states that the relationship between the probability of the hypothesis before getting the evidence, **P(H)**, and the probability of the hypothesis after getting the evidence, **P(H|E)**, is:

![](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/07/formula-528x90.png)

Here **P(H)** is called the prior probability, while **P(H|E)** is called the posterior probability. The factor that relates the two, **P(E|H) / P(E)**, is called the likelihood ratio. Using these terms, Bayes’ theorem can be rephrased as:

**“The posterior probability equals the prior probability times the likelihood ratio.”**
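Written out with the symbols H and E used above, the statement corresponds to the following equation, where $P(E \mid H)$ is the probability of seeing the evidence if the hypothesis is true and $P(E)$ is the overall probability of the evidence:

$$
P(H \mid E) = P(H) \cdot \frac{P(E \mid H)}{P(E)}
$$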

### Process of Table Creation

![](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/07/table-259x300.png)

![](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1543836883/image_4_lyi0ob.png)

### Steps for the Algorithm

#### Step-1: First, we will create a frequency table using each attribute of the dataset.

![](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/07/table1-227x300.png)

#### Step-2: For each frequency table, we will generate a likelihood table.

![](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/07/table2-528x158.png)
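The two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure behind the tables in the images: the `(outlook, play)` pairs are rebuilt from the counts printed by the "Check Outlook" cell in the implementation section, and the names `rows`, `frequency`, `class_totals`, and `likelihood` are made up for this example.

```python
from collections import Counter
from fractions import Fraction

# (outlook, play) pairs rebuilt from the "Check Outlook" counts:
# overcast/yes 4, rainy/no 2, rainy/yes 3, sunny/no 3, sunny/yes 2
rows = (
    [("overcast", "yes")] * 4
    + [("rainy", "no")] * 2
    + [("rainy", "yes")] * 3
    + [("sunny", "no")] * 3
    + [("sunny", "yes")] * 2
)

# Step 1: frequency table, how often each (outlook, play) pair occurs
frequency = Counter(rows)

# Number of days in each play class
class_totals = Counter(play for _, play in rows)

# Step 2: likelihood table, P(outlook | play) = count(outlook, play) / count(play)
likelihood = {
    (outlook, play): Fraction(count, class_totals[play])
    for (outlook, play), count in frequency.items()
}

for key in sorted(likelihood):
    print(key, likelihood[key])
```

Using `Fraction` keeps the table exact, so values such as 3/9 are stored as 1/3 instead of a rounded decimal.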

#### Solution

The posterior probability of **‘Yes’** given **‘Sunny’** is:

**P(c|x) = P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny) = (0.3 x 0.71) / 0.36 = 0.591**

Similarly, the posterior probability of **‘No’** given **‘Sunny’** is:

**P(c|x) = P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny) = (0.5 x 0.29) / 0.36 = 0.40**
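The same calculation can be written as a small helper. This is only a sketch: the function name `posterior` and its parameter names are chosen for illustration, and the inputs are the rounded values quoted for the ‘Yes’ case.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Values quoted in the text: P(Sunny|Yes) = 0.3, P(Yes) = 0.71, P(Sunny) = 0.36
p_yes_given_sunny = posterior(0.3, 0.71, 0.36)
print(round(p_yes_given_sunny, 2))  # 0.59
```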

#### Now, in the same way, we need to create the Likelihood Table for other attributes as well.

![](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/07/table3.png)

### Problem Statement:

#### Suppose we have a day with the following values:

    Outlook   =  Rain
    Humidity   =  High
    Wind  =  Weak
    Play =?

So, with this data, we have to predict whether **“we can play on that day or not”**.

Likelihood of **‘Yes’** on that day = **P(Outlook = Rain|Yes) * P(Humidity = High|Yes) * P(Wind = Weak|Yes) * P(Yes)**

= 3/9 * 3/9 * 6/9 * 9/14 = 0.0476

Likelihood of **‘No’** on that day = **P(Outlook = Rain|No) * P(Humidity = High|No) * P(Wind = Weak|No) * P(No)**

= 2/5 * 4/5 * 2/5 * 5/14 = 0.0457

These two products are only proportional to the posterior probabilities, because the common denominator P(Outlook = Rain, Humidity = High, Wind = Weak) has been left out.

**Now we normalize the values:**

**P(Yes)** = 0.0476 / (0.0476 + 0.0457) = 0.51

**P(No)** = 0.0457 / (0.0476 + 0.0457) = 0.49

**Our model predicts that there is a 51% chance there will be a game on that day.**
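The whole calculation can also be sketched in plain Python. The counts below are an assumption of this sketch: they are read off the `groupby` outputs in the implementation section (9 ‘yes’ days and 5 ‘no’ days, with `rainy` standing for Rain and `windy == False` for Weak wind), and the names `class_counts`, `feature_counts`, `class_scores`, and `normalize` are illustrative only.

```python
from fractions import Fraction

# Days per class and, within each class, days showing each observed feature value.
# Counts are taken from the groupby outputs in the implementation section.
class_counts = {"yes": 9, "no": 5}
feature_counts = {
    "yes": {"outlook=rain": 3, "humidity=high": 3, "wind=weak": 6},
    "no": {"outlook=rain": 2, "humidity=high": 4, "wind=weak": 2},
}

def class_scores(observed):
    """Return prior * product of per-feature likelihoods for each class."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = Fraction(n, total)  # prior P(class)
        for feature in observed:
            score *= Fraction(feature_counts[label][feature], n)  # P(feature | class)
        scores[label] = score
    return scores

def normalize(scores):
    """Rescale the scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: float(score / total) for label, score in scores.items()}

scores = class_scores(["outlook=rain", "humidity=high", "wind=weak"])
probabilities = normalize(scores)
for label in scores:
    print(label, round(float(scores[label]), 4), round(probabilities[label], 2))
```

Working with exact fractions until the final step avoids compounding rounding errors when several small probabilities are multiplied.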

## Implementation in Python

#### Import Libraries

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

#### Import Data

In [4]:
# Folder containing tennis.csv; adjust this path for your machine
link = "C:/Users/comp/Desktop/Summer/ML/"
df = pd.read_csv(link + "tennis.csv")
df.head()

,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
outlook     14 non-null object
temp        14 non-null object
humidity    14 non-null object
windy       14 non-null bool
play        14 non-null object
dtypes: bool(1), object(4)
memory usage: 542.0+ bytes


### Check Variables

#### Check Outlook

In [16]:
outlook_count = df.groupby(['outlook', 'play']).size()
print("Checking Outlook Variable for Play Segregation: \n",outlook_count)
outlook_total = df.groupby(['outlook']).size()
print("Checking Total Outlook: ",outlook_total)

Checking Outlook Variable for Play Segregation: 
 outlook   play
overcast  yes     4
rainy     no      2
          yes     3
sunny     no      3
          yes     2
dtype: int64
Checking Total Outlook:  outlook
overcast    4
rainy       5
sunny       5
dtype: int64


#### Check Temperature

In [17]:
temp_count = df.groupby(['temp', 'play']).size()
temp_total = df.groupby(['temp']).size()
print("Checking Temperature Variable for Play Segregation: \n",temp_count)
print("Checking Total Temperature",temp_total)

Checking Temperature Variable for Play Segregation: 
 temp  play
cool  no      1
      yes     3
hot   no      2
      yes     2
mild  no      2
      yes     4
dtype: int64
Checking Total Temperature temp
cool    4
hot     4
mild    6
dtype: int64


#### Check Humidity

In [19]:
humidity_count = df.groupby(['humidity', 'play']).size()
humidity_total = df.groupby(['humidity']).size()
print("Checking Humidity Variable for Play Segregation: \n",humidity_count)
print("Checking Total Humidity",humidity_total)

Checking Humidity Variable for Play Segregation: 
 humidity  play
high      no      4
          yes     3
normal    no      1
          yes     6
dtype: int64
Checking Total Humidity humidity
high      7
normal    7
dtype: int64


#### Check Windy

In [20]:
windy_count = df.groupby(['windy', 'play']).size()
windy_total = df.groupby(['windy']).size()
print("Checking Windy Variable for Play Segregation: \n",windy_count)
print("Checking Total Windy",windy_total)

Checking Windy Variable for Play Segregation: 
 windy  play
False  no      2
       yes     6
True   no      3
       yes     3
dtype: int64
Checking Total Windy windy
False    8
True     6
dtype: int64


In [29]:
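# Number of overcast days on which play == 'yes' (a count, not a probability)
p_over_yes = outlook_count['overcast','yes']
# outlook_count['overcast','no'] would raise a KeyError: no overcast day has play == 'no'
print("Total OVERCAST+YES: ",p_over_yes)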

Total OVERCAST+YES:  4


In [32]:
p_rainy_yes = outlook_count['rainy','yes']
print("Total RAINY+YES: ",p_rainy_yes)
p_rainy_no = outlook_count['rainy','no']
print("Total RAINY+NO: ",p_rainy_no)

Total RAINY+YES:  3
Total RAINY+NO:  2


#### Creating Data Subset for training

In [33]:
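# One-hot encode the categorical predictors; 'windy' is already boolean and is kept as is
X_train = pd.get_dummies(df[['outlook', 'temp', 'humidity', 'windy']])
# Pass the target as a 1-D Series so that scikit-learn does not warn about a column vector
y_train = df['play']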

In [34]:
X_train.head()

,windy,outlook_overcast,outlook_rainy,outlook_sunny,temp_cool,temp_hot,temp_mild,humidity_high,humidity_normal
0,False,0,0,1,0,1,0,1,0
1,True,0,0,1,0,1,0,1,0
2,False,1,0,0,0,1,0,1,0
3,False,0,1,0,0,0,1,1,0
4,False,0,1,0,1,0,0,0,1


#### Creating Model

In [36]:
model = GaussianNB()
model.fit(X_train, y_train)


GaussianNB(priors=None)

#### Predicting Model

In [40]:
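# Feature order matches the X_train columns: windy, outlook_overcast, outlook_rainy,
# outlook_sunny, temp_cool, temp_hot, temp_mild, humidity_high, humidity_normal
predicted = model.predict([[False,1,0,0,0,1,0,1,0]])  # not windy, overcast, hot, high humidity
print(predicted)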

['yes']


In [41]:
if predicted[0] == 'yes':
    print("I will play Tennis")
else:
    print("I will not play Tennis")

I will play Tennis
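As a sanity check, the prediction for this day (overcast, hot, high humidity, not windy) can be compared with a hand count based on the `groupby` outputs above. This is a sketch under stated assumptions, not what `GaussianNB` computes internally: `GaussianNB` fits a normal distribution to each feature per class, whereas the code below uses raw frequency ratios. The names `counts`, `class_totals`, and `score` are illustrative.

```python
from fractions import Fraction
from math import prod

# For each observed feature value: number of 'no' days and 'yes' days showing it,
# read off the groupby outputs printed above
counts = {
    "outlook=overcast": {"no": 0, "yes": 4},
    "temp=hot": {"no": 2, "yes": 2},
    "humidity=high": {"no": 4, "yes": 3},
    "windy=False": {"no": 2, "yes": 6},
}
class_totals = {"no": 5, "yes": 9}

def score(label):
    """Prior times the product of the per-feature frequency ratios for one class."""
    n = class_totals[label]
    prior = Fraction(n, sum(class_totals.values()))
    return prior * prod(Fraction(c[label], n) for c in counts.values())

scores = {label: score(label) for label in class_totals}
best = max(scores, key=scores.get)
print(scores, best)
```

Because no overcast day in the data has `play == 'no'`, the score for ‘no’ is exactly zero, so the hand count agrees with the model's answer of ‘yes’.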
