# Naive Bayes algorithm

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. Bayes’ theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$. 

$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}
                                 {P(x_1, \dots, x_n)}$
                                 
Under the naive conditional independence assumption, for every $i$:

$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$

so the relationship simplifies to:

$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}
                                 {P(x_1, \dots, x_n)}$

Since $P(x_1, \dots, x_n)$ is constant for a given input, the classification rule is:

$\begin{aligned}P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)\\\Downarrow\\\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)\end{aligned}$
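The decision rule can be sketched in a few lines of plain Python. This is only an illustration: the function name `naive_bayes_decision`, the second feature (wind) and all probability values are hypothetical and are not taken from the data used later.

```python
from math import prod

def naive_bayes_decision(priors, likelihoods, x):
    """Return the class maximising P(y) * prod_i P(x_i | y), plus all scores.

    priors:      {class: P(y)}
    likelihoods: {class: [ {value: P(x_i = value | y)} for each feature i ]}
    x:           list of observed feature values, one per feature
    """
    scores = {
        c: priors[c] * prod(likelihoods[c][i][v] for i, v in enumerate(x))
        for c in priors
    }
    return max(scores, key=scores.get), scores

# Hypothetical numbers, chosen only for illustration.
priors = {"no": 0.4, "yes": 0.6}
likelihoods = {
    "no":  [{"sunny": 0.8, "rainy": 0.2}, {"windy": 0.7, "calm": 0.3}],
    "yes": [{"sunny": 0.1, "rainy": 0.9}, {"windy": 0.4, "calm": 0.6}],
}

label, scores = naive_bayes_decision(priors, likelihoods, ["sunny", "windy"])
print(label, scores)
```

The scores are unnormalised: the denominator $P(x_1, \dots, x_n)$ is skipped because it is the same for every class and does not affect the argmax.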



Naive Bayes is relatively resistant to overfitting. Its hypothesis is simple, so it cannot accurately represent many complex situations; because the bias is high, the model exhibits low variance.


The algorithm works in three steps:

 * First, find the prior probabilities of the classes and of the feature categories.
 * Then, find the likelihood of each feature category given each class.
 * Finally, use Bayes' theorem to compute the final (posterior) probability.
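A minimal sketch of these three steps for a single categorical feature, using only the standard library. The eight `(weather, play)` rows are made up for illustration and are not the dataset used later in this notebook.

```python
from collections import Counter

# Made-up (weather, play) rows, for illustration only.
rows = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("overcast", "yes"),
]
n = len(rows)
class_counts = Counter(label for _, label in rows)
value_counts = Counter(value for value, _ in rows)
joint_counts = Counter(rows)

# Step 1: prior probabilities of the classes and of the feature categories.
class_prior = {c: k / n for c, k in class_counts.items()}
value_prior = {v: k / n for v, k in value_counts.items()}

# Step 2: likelihoods P(value | class).
likelihood = {
    (v, c): joint_counts[(v, c)] / class_counts[c]
    for v in value_counts
    for c in class_counts
}

# Step 3: Bayes' theorem gives the posterior P(class | value).
def posterior(value):
    return {
        c: likelihood[(value, c)] * class_prior[c] / value_prior[value]
        for c in class_counts
    }

post = posterior("sunny")
print(post)
```

With one feature the posteriors sum to 1, which is a quick sanity check on the counts.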

### Advantages of Naive Bayes
 * Since there are no gradients or iterative parameter updates to compute, training and prediction are quick.
 * Naive Bayes requires only a small amount of training data to estimate its parameters.
 * Naive Bayes handles missing values easily.
 * The model handles both continuous and discrete data.
### Disadvantages of Naive Bayes
 * The model cannot incorporate feature interactions.
 * Model performance suffers if the training data is skewed.
 * It is subject to the zero-frequency problem.
 
### Improving the model performance
 * Zero-frequency problem: a categorical attribute may take a value that was not observed in training. In this case the model assigns it zero probability and is unable to make a prediction. The usual fix is to add 1 to every count in the frequency table. This is also known as additive (Laplace) smoothing.
 * Naive Bayes performs better when the features are uncorrelated, so we can remove features that are highly correlated, identified using pairwise correlation.
 * Naive Bayes handles missing data: if a data instance has a missing value, that value is ignored while computing the probabilities.
 * It is good to use log probabilities to avoid floating-point underflow when many small probabilities are multiplied.
 * Instead of the usual normal or binomial distributions, we can try different distributions for the features when computing the posterior probabilities.
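Two of these points, additive smoothing and log probabilities, can be sketched with the standard library alone. The helper name `smoothed_log_likelihoods`, the parameter `alpha` and the example rows are assumptions made for this illustration.

```python
import math
from collections import Counter

def smoothed_log_likelihoods(values, categories, alpha=1.0):
    """Return {category: log P(category | class)} for one feature within one class.

    `alpha` is added to every count (alpha=1 is the "add 1" rule), so a
    category that never occurs still gets a small non-zero probability.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {c: math.log((counts[c] + alpha) / total) for c in categories}

categories = ["sunny", "overcast", "rainy"]
# Hypothetical feature values of the training rows of one class;
# "overcast" never occurs, so its unsmoothed likelihood would be 0.
rows_of_class = ["sunny", "sunny", "rainy"]

log_lik = smoothed_log_likelihoods(rows_of_class, categories)
p_overcast = math.exp(log_lik["overcast"])  # (0 + 1) / (3 + 3)

# Multiplying many small probabilities underflows to 0.0,
# while the equivalent sum of logs stays finite.
tiny = [1e-5] * 100
product = math.prod(tiny)
log_sum = sum(math.log(p) for p in tiny)
print(p_overcast, product, log_sum)
```

In log space the class score becomes the log prior plus the sum of the log likelihoods, and the argmax is unchanged because the logarithm is monotonic.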

<img src="https://namesorts.files.wordpress.com/2019/12/playnotplay_excerpt.png">

## Implementing the Naive Bayes algorithm from scratch

In [7]:
import pandas as pd 
import numpy as np 

class NaiveBayes():
    def __init__(self, X, y):
        '''X should be a DataFrame of categorical features and y a Series with two classes.'''
        self.X = X
        self.y = y
        self.features = list(self.X.columns)

        # making a list of unique classes
        self.classes = list(self.y.unique())
        # counting the number of samples in each of the two classes
        self.class_0count = list(self.y).count(self.classes[0])
        self.class_1count = list(self.y).count(self.classes[1])

        # finding the prior probability P(y) of each class
        self.class_proba = {c: list(self.y).count(c) / len(self.y) for c in self.classes}

        # finding the marginal probability P(x_i) of each category in each feature
        self.feature_priors = {}
        for f in self.features:
            values = list(self.X[f])
            self.feature_priors[f] = {v: values.count(v) / len(values) for v in set(values)}
    
        
    def fit(self):
        # creating a dictionary with features as keys
        final_dict = dict.fromkeys(self.features)

        # looping over the features
        for feature in self.features:

            # unique values (groups) of this feature
            gps = list(self.X[feature].unique())

            # one entry per group
            gp_counts = dict.fromkeys(gps)

            for gp in gps:
                # class counts among the rows where the feature equals this group
                dict_count = self.y[self.X[feature] == gp].value_counts()
                # dividing each count by the class size gives the likelihood P(x_i | y);
                # .get(..., 0) avoids a KeyError when a class never occurs with this group
                dc = dict.fromkeys(self.classes)
                dc[self.classes[0]] = dict_count.get(self.classes[0], 0) / self.class_0count
                dc[self.classes[1]] = dict_count.get(self.classes[1], 0) / self.class_1count

                # storing the likelihoods for this group
                gp_counts[gp] = dc

            # adding all the group likelihoods to the respective feature
            final_dict[feature] = gp_counts

        # returning the final dictionary, which holds the likelihoods P(x_i | y)
        return final_dict
    
    def predict(self):
        final_dict = self.fit()

        def pred(feature_vec):
            # running scores for classes[0] and classes[1]
            prob_0 = 1
            prob_1 = 1
            for i in range(len(feature_vec)):
                feature, value = self.features[i], feature_vec[i]
                # P(x_i) is the same for both classes, so dividing by it
                # rescales the scores without changing which one is larger
                evidence = self.feature_priors[feature][value]

                # multiplying in the likelihood P(x_i | y) for each class
                prob_0 = prob_0 * final_dict[feature][value][self.classes[0]]
                prob_0 = prob_0 / evidence

                prob_1 = prob_1 * final_dict[feature][value][self.classes[1]]
                prob_1 = prob_1 / evidence

            # multiplying with the class priors P(y)
            fin_class0 = prob_0 * self.class_proba[self.classes[0]]
            fin_class1 = prob_1 * self.class_proba[self.classes[1]]

            print(fin_class0, fin_class1)

            if fin_class0 > fin_class1:
                return self.classes[0]
            else:
                return self.classes[1]

        # predicting every row of the training data
        preds = [pred(list(self.X.iloc[row])) for row in range(len(self.X))]
        return pd.DataFrame(preds, columns=['predictions'])

In [8]:
df = pd.read_csv('play.csv')
x = df[['weather']]
y = df.play
df

| | weather | play |
|---|---|---|
| 0 | sunny | no |
| 1 | sunny | no |
| 2 | overcast | yes |
| 3 | rainy | yes |
| 4 | rainy | yes |
| 5 | rainy | no |
| 6 | overcast | no |
| 7 | sunny | no |
| 8 | sunny | yes |
| 9 | rainy | yes |


In [9]:
model = NaiveBayes(x,y)
model.fit()

{'weather': {'sunny': {'no': 0.5, 'yes': 0.2222222222222222},
  'overcast': {'no': 0.16666666666666666, 'yes': 0.3333333333333333},
  'rainy': {'no': 0.3333333333333333, 'yes': 0.4444444444444444}}}

In [10]:
print('features = ', model.features)
print('no count = ',model.class_0count)# no 
print('yes count = ',model.class_1count) # yes 
print('classes = ', model.classes)
print('class_prios = ', model.class_proba) 
print('feature_priors = ', model.feature_priors)

features =  ['weather']
no count =  6
yes count =  9
classes =  ['no', 'yes']
class_prios =  {'no': 0.4, 'yes': 0.6}
feature_priors =  {'weather': {'rainy': 0.4, 'overcast': 0.26666666666666666, 'sunny': 0.3333333333333333}}


In [11]:
model.predict()

0.6000000000000001 0.39999999999999997
0.6000000000000001 0.39999999999999997
0.25 0.75
0.3333333333333333 0.6666666666666665
0.3333333333333333 0.6666666666666665
0.3333333333333333 0.6666666666666665
0.25 0.75
0.6000000000000001 0.39999999999999997
0.6000000000000001 0.39999999999999997
0.3333333333333333 0.6666666666666665
0.3333333333333333 0.6666666666666665
0.6000000000000001 0.39999999999999997
0.25 0.75
0.25 0.75
0.3333333333333333 0.6666666666666665


| | predictions |
|---|---|
| 0 | no |
| 1 | no |
| 2 | yes |
| 3 | yes |
| 4 | yes |
| 5 | yes |
| 6 | yes |
| 7 | no |
| 8 | no |
| 9 | yes |
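
As a check, the printed pairs can be reproduced by hand from the values shown earlier. For a sunny day, the likelihoods are $P(\text{sunny} \mid \text{no}) = 0.5$ and $P(\text{sunny} \mid \text{yes}) = 2/9$, the class priors are $0.4$ and $0.6$, and $P(\text{sunny}) = 1/3$, so:

$\begin{aligned}
P(\text{no} \mid \text{sunny}) &= \frac{0.5 \times 0.4}{1/3} = 0.6\\
P(\text{yes} \mid \text{sunny}) &= \frac{(2/9) \times 0.6}{1/3} = 0.4
\end{aligned}$

The larger value belongs to "no", which matches the prediction for the sunny rows. The same calculation for an overcast day, with $P(\text{overcast}) = 4/15$, gives $0.25$ and $0.75$. Because there is only one feature, each pair sums to 1.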
