### Naive Bayes Algorithm
 

* Naive Bayes is a supervised machine learning algorithm based on Bayes' theorem and conditional probability. It handles both binary and multi-class classification. To make a prediction, it combines the probability of each attribute value given each class with the prior probability of that class, under the "naive" assumption that the attributes are conditionally independent given the class.

* Example
    * What is the probability of playing tennis when it is sunny, hot, highly humid and windy? Using the tennis dataset, we apply Naive Bayes to estimate the probability that someone plays tennis under these weather conditions.
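
For reference, the rule behind this kind of prediction can be written as follows. This is the standard Naive Bayes formulation; the symbols $y$ (class, here `play`) and $x_1, \dots, x_n$ (feature values such as outlook or humidity) are notation introduced for this note, not names from the dataset.

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
$$

The predicted class is the $y$ with the largest numerator, since the denominator is the same for every class:

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$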

#### Types of Naive Bayes Algorithm
 
* Python's scikit-learn provides, among others, the following three Naive Bayes models.

1. Gaussian
    * Gaussian NB assumes that all features (predictors) are continuous and that, within each class, each feature follows a Gaussian (normal) distribution.
2. Multinomial
    * Multinomial NB is suited for discrete data that have frequencies and counts. Spam Filtering and Text/Document Classification are two very well-known use cases.
3. Bernoulli
    * Bernoulli NB is similar to Multinomial NB, except that it is designed for boolean/binary features. Like the multinomial method, it can be used for spam filtering and document classification when terms are binary (i.e. word occurrence in a document is represented as True or False).
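
To make the Gaussian assumption concrete, here is a minimal sketch of how a per-class likelihood for one continuous feature could be computed. The temperature values are made up for illustration and are not part of the tennis data; the helper name `gaussian_pdf` is also hypothetical. It uses the sample standard deviation, which may differ slightly from what a library implementation estimates.

```python
import math
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


# Hypothetical temperatures (degrees Celsius) grouped by class label.
temps_by_class = {
    "yes": [18.0, 20.0, 21.0, 22.0, 24.0],
    "no": [27.0, 29.0, 30.0, 31.0, 33.0],
}

# Estimate one (mean, sample standard deviation) pair per class.
params = {
    label: (statistics.mean(values), statistics.stdev(values))
    for label, values in temps_by_class.items()
}

# Likelihood of observing 23 degrees under each class.
x = 23.0
likelihoods = {label: gaussian_pdf(x, m, s) for label, (m, s) in params.items()}
print(likelihoods)
```

In a full classifier, each of these likelihoods would be multiplied by the class prior and by the likelihoods of the other features.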
    
    
* The implementation below assumes the data is categorical in nature
* Numerical data has to be binned into categories first
* For categorical data, we calculate a frequency table and then a likelihood table for each feature
* For test data, we calculate probabilities using the likelihood tables
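
As an illustration of the binning step, the sketch below maps numeric temperatures to the categories `cool`, `mild` and `hot`. The bin edges (15 and 25 degrees Celsius) and the helper name `to_bin` are assumptions chosen for this example; the tennis dataset itself already ships with categorical values.

```python
import bisect

# Hypothetical bin edges in degrees Celsius (an assumption for this sketch).
BIN_EDGES = [15.0, 25.0]
BIN_LABELS = ["cool", "mild", "hot"]


def to_bin(value):
    """Return the label of the bin that contains value."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, value)]


temps = [10.5, 15.0, 22.0, 30.1]
print([to_bin(t) for t in temps])  # ['cool', 'mild', 'mild', 'hot']
```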

### Objective
* Implement Naive Bayes using Python/pandas

### Loading Tennis data

In [1]:
import pandas as pd

In [2]:
tennis_data = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/tennis.csv.txt')

In [4]:
tennis_data

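| | outlook | temp | humidity | windy | play |
|---|---|---|---|---|---|
| 0 | sunny | hot | high | False | no |
| 1 | sunny | hot | high | True | no |
| 2 | overcast | hot | high | False | yes |
| 3 | rainy | mild | high | False | yes |
| 4 | rainy | cool | normal | False | yes |
| 5 | rainy | cool | normal | True | no |
| 6 | overcast | cool | normal | True | yes |
| 7 | sunny | mild | high | False | no |
| 8 | sunny | cool | normal | False | yes |
| 9 | rainy | mild | normal | False | yes |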


### Calculate Frequency Table

In [17]:
outlook_frequency_tab = pd.crosstab(tennis_data.outlook, tennis_data.play)

In [18]:
temp_frequency_tab = pd.crosstab(tennis_data.temp, tennis_data.play)

In [19]:
humidity_frequency_tab = pd.crosstab(tennis_data.humidity, tennis_data.play)

In [20]:
windy_frequency_tab = pd.crosstab(tennis_data.windy, tennis_data.play)

In [28]:
outlook_frequency_tab

| outlook | play=no | play=yes |
|---|---|---|
| overcast | 0 | 4 |
| rainy | 2 | 3 |
| sunny | 3 | 2 |


In [29]:
temp_frequency_tab

| temp | play=no | play=yes |
|---|---|---|
| cool | 1 | 3 |
| hot | 2 | 2 |
| mild | 2 | 4 |


In [30]:
humidity_frequency_tab

| humidity | play=no | play=yes |
|---|---|---|
| high | 4 | 3 |
| normal | 1 | 6 |


In [31]:
windy_frequency_tab

| windy | play=no | play=yes |
|---|---|---|
| False | 2 | 6 |
| True | 3 | 3 |


### What is the probability of playing tennis given it is rainy?

* P(outlook=rainy|play=yes)
    * frequency of (outlook=rainy) when (play=yes) / frequency of (play=yes) = 3/9
* P(play=yes)
    * frequency of (play=yes) / total(play) = 9/14
* P(outlook=rainy)
    * frequency of (outlook=rainy) / total(outlook) = 5/14

Applying Bayes' theorem:

<img src='https://latex.codecogs.com/gif.latex?\boldsymbol{\mathbf{P(play=yes|outlook=rainy)%20=%20\frac{P(outlook=rainy|play=yes)%20*%20P(play=yes)}{P(outlook=rainy)}}}'>

In [26]:
(3/9)*(9/14)/(5/14)

0.6


The probability of playing tennis when it is rainy is 60%. The process is very simple once you obtain the frequencies for each category.
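
The same calculation can be checked with exact fractions. The sketch below uses only the counts from the outlook frequency table (3 rainy days with play=yes, 2 with play=no, and 9 yes / 5 no overall); the variable names are chosen for this example.

```python
from fractions import Fraction

# Counts taken from the outlook frequency table.
rainy_yes, rainy_no = 3, 2
total_yes, total_no = 9, 5
total = total_yes + total_no

likelihood = Fraction(rainy_yes, total_yes)        # P(outlook=rainy | play=yes)
prior = Fraction(total_yes, total)                 # P(play=yes)
evidence = Fraction(rainy_yes + rainy_no, total)   # P(outlook=rainy)

posterior = likelihood * prior / evidence          # P(play=yes | outlook=rainy)
print(posterior, float(posterior))  # 3/5 0.6
```

With a single feature, the result matches the direct ratio 3 / (3 + 2) read off the `rainy` row, which is a useful sanity check.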

#### Now Generate Likelihood Table

In [32]:
outlook_frequency_tab.no.sum()

5

In [34]:
outlook_frequency_tab.yes.sum()

9

In [36]:
# Copy first so the frequency table keeps its raw counts
outlook_likelihood_tab = outlook_frequency_tab.copy()
outlook_likelihood_tab['no'] = outlook_frequency_tab.no/outlook_frequency_tab.no.sum()

In [37]:
outlook_likelihood_tab['yes'] = outlook_frequency_tab.yes/outlook_frequency_tab.yes.sum()

In [38]:
outlook_likelihood_tab

| outlook | play=no | play=yes |
|---|---|---|
| overcast | 0.0 | 0.444444 |
| rainy | 0.4 | 0.333333 |
| sunny | 0.6 | 0.222222 |
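
As a compact alternative, the likelihood table can be produced in one step by dividing every column by its own sum. The sketch below rebuilds the outlook counts by hand so that it runs without the CSV file; the names `outlook_counts` and `outlook_likelihood` are chosen for this example.

```python
import pandas as pd

# Outlook frequency table, re-entered from the counts shown earlier.
outlook_counts = pd.DataFrame(
    {"no": [0, 2, 3], "yes": [4, 3, 2]},
    index=pd.Index(["overcast", "rainy", "sunny"], name="outlook"),
)

# Dividing by the column sums yields P(outlook | play) for each class.
# A new DataFrame is returned, so the counts are left untouched.
outlook_likelihood = outlook_counts / outlook_counts.sum()
print(outlook_likelihood.round(6))
```

Each column of the result sums to 1, because it is a probability distribution over outlook values for one class.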


In [49]:
class MyNaiveBayes:
    def __init__(self):
        self.likelihood_tables = {}
        self.class_prior_probability = None
    # Generate Frequency tables
    def get_frequency_tables(self, feature_data, target_data):
        
        freq_dict = {}
        
        for col in feature_data.columns:
            freq_tab = pd.crosstab(feature_data[col], target_data)
            freq_dict[col] = freq_tab
            
        return freq_dict
    
    # Generate Likelihood Tables
    def get_likelihood_tables(self, frequency_tables):
        likelihood_dict = {}
        
        for col, freq_table in frequency_tables.items():
            # Work on a copy so the frequency table keeps its raw counts
            likelihood_tab = freq_table.copy()
            
            for tgt_name in freq_table.columns:
                # P(feature value | class) = count / total count for that class
                total_count = freq_table[tgt_name].sum()
                likelihood_tab[tgt_name] = freq_table[tgt_name]/total_count
            
            likelihood_dict[col] = likelihood_tab
        return likelihood_dict
    
    def myfit(self,feature_data, target_data):
        frequency_tables = self.get_frequency_tables(feature_data, target_data)
        likelihood_tables = self.get_likelihood_tables(frequency_tables)
        self.likelihood_tables = likelihood_tables
        
        target_freq = target_data.value_counts()
        target_events  = target_data.value_counts().sum()
        self.class_prior_probability = target_freq/target_events
        
    def mypredict(self, feature_data):
        # Note: the class labels 'yes'/'no' are hard-coded for the tennis data
        tests = feature_data.to_dict(orient='records')
        for test in tests:
            # Unnormalized score for 'yes': product of likelihoods times the prior
            p_yes = 1
            for col,val in test.items():
                p_yes *=self.likelihood_tables[col].loc[val]['yes']
            p_yes = p_yes*self.class_prior_probability['yes']
            
            # Unnormalized score for 'no'
            p_no = 1
            for col,val in test.items():
                p_no *= self.likelihood_tables[col].loc[val]['no']
            p_no = p_no * self.class_prior_probability['no']
            
            # Normalize so that the two posteriors sum to 1
            yes = p_yes/(p_yes+p_no)
            no = p_no/(p_yes+p_no)
            if yes > no:
                print('Yes')
            else:
                print('No')
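
The prediction step can also be written so that it returns a label instead of printing it and does not hard-code the class names. The following is a minimal, pandas-free sketch under that assumption, not a replacement for the class above. The counts are copied from the four frequency tables; the names `COUNTS`, `CLASS_TOTALS`, `posteriors` and `predict` are hypothetical.

```python
from fractions import Fraction

# Frequency tables from above, as nested dicts: feature -> value -> class -> count.
COUNTS = {
    "outlook": {"overcast": {"no": 0, "yes": 4},
                "rainy": {"no": 2, "yes": 3},
                "sunny": {"no": 3, "yes": 2}},
    "temp": {"cool": {"no": 1, "yes": 3},
             "hot": {"no": 2, "yes": 2},
             "mild": {"no": 2, "yes": 4}},
    "humidity": {"high": {"no": 4, "yes": 3},
                 "normal": {"no": 1, "yes": 6}},
    "windy": {False: {"no": 2, "yes": 6},
              True: {"no": 3, "yes": 3}},
}
CLASS_TOTALS = {"no": 5, "yes": 9}


def posteriors(sample):
    """Return the normalized posterior P(class | sample) for every class."""
    total = sum(CLASS_TOTALS.values())
    scores = {}
    for label, n_label in CLASS_TOTALS.items():
        score = Fraction(n_label, total)  # class prior
        for feature, value in sample.items():
            # Multiply by the likelihood P(feature=value | class)
            score *= Fraction(COUNTS[feature][value][label], n_label)
        scores[label] = score
    evidence = sum(scores.values())  # raises ZeroDivisionError below if all scores are 0
    return {label: score / evidence for label, score in scores.items()}


def predict(sample):
    """Return the class label with the highest posterior."""
    post = posteriors(sample)
    return max(post, key=post.get)


first_row = {"outlook": "sunny", "temp": "hot", "humidity": "high", "windy": False}
print(predict(first_row))  # no
```

Note that a zero count, such as overcast with play=no, forces the whole product for that class to zero regardless of the other features.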

In [50]:
feature_data = tennis_data.drop(columns='play')

In [51]:
target_data = tennis_data.play

In [52]:
mynb = MyNaiveBayes()
mynb.myfit(feature_data, target_data)

In [53]:
mynb.likelihood_tables['outlook'].loc['rainy']['yes']

0.3333333333333333

In [54]:
mynb.likelihood_tables['outlook'].loc['sunny']['yes']

0.2222222222222222

In [55]:
mynb.likelihood_tables['temp']

| temp | play=no | play=yes |
|---|---|---|
| cool | 0.2 | 0.333333 |
| hot | 0.4 | 0.222222 |
| mild | 0.4 | 0.444444 |


In [56]:
test = feature_data[:5]


In [57]:
mynb.mypredict(test)

No
No
Yes
Yes
Yes


In [58]:
s = target_data.value_counts()

In [59]:
s/s.sum()

yes    0.642857
no     0.357143
Name: play, dtype: float64

In [60]:
mynb.class_prior_probability

yes    0.642857
no     0.357143
Name: play, dtype: float64
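
The class priors can also be obtained in a single call with `value_counts(normalize=True)`. The sketch below rebuilds the target column from its class counts (9 yes, 5 no) so that it runs on its own; the row order is therefore not the order in the original file.

```python
import pandas as pd

# Stand-in for tennis_data.play, rebuilt from the class counts shown above.
target = pd.Series(["yes"] * 9 + ["no"] * 5, name="play")

# normalize=True returns relative frequencies instead of raw counts.
priors = target.value_counts(normalize=True)
print(priors)
```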

In [61]:
target_data[:5]

0     no
1     no
2    yes
3    yes
4    yes
Name: play, dtype: object

In [62]:
mynb.mypredict(feature_data)

No
No
Yes
Yes
Yes
Yes
Yes
No
Yes
Yes
Yes
Yes
Yes
No
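
To judge these predictions, they can be compared with the true labels. The sketch below does this for the first ten rows only, since those are the rows whose labels appear in the data listing above; the lists are copied by hand from the printed output and the table.

```python
# First ten printed predictions and the corresponding labels from the listing.
predicted = ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes"]
actual = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

# Compare case-insensitively because mypredict prints capitalized labels.
matches = sum(p.lower() == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
print(matches, accuracy)  # 9 0.9
```

The single mismatch is row 5 (rainy, cool, normal, windy), which is labeled `no` but predicted `Yes`. Because the model is evaluated on the data it was fitted to, this figure says little about performance on unseen data.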
