# Unfolding Naïve Bayes from Scratch! Take-2 🎬


![](https://cdn-images-1.medium.com/max/1000/1*sjet9qSO4O8fX2-FXvxflw.jpeg)

### Implementation of Naïve Bayes from Scratch using Python ONLY - No Fancy Frameworks

![](using.png)

# Outcome of this Tutorial - A Hands-On Pythonic Implementation of NB
In my previous blog post, [Unfolding Naïve Bayes from Scratch! Take-1 🎬](https://towardsdatascience.com/unfolding-na%C3%AFve-bayes-from-scratch-2e86dcae4b01), I tried to decode the rocket science behind the working of the Naïve Bayes (NB) ML algorithm. After going through its algorithmic insights, you too must have realized that it's quite a painless algorithm. Once we walk through its complete step-by-step Pythonic implementation (no API, just basic Python), it will be quite evident how easy it is to code NB from scratch, and that NB is not that naïve at classifying!
Oh, and by the way, once you reach the end of this blog post, you will be done with a complete 80% of understanding NB, and only 20% will remain to master it! Only 20% left to go from zero to hero 🎓💯🥇

Before we begin writing code for Naive Bayes, I assume you are familiar with:
1.  Python Lists
2.  NumPy & a lil bit of writing vectorized code
3.  Dictionaries
4.  Regex

Let's begin with a few imports that we would need while implementing Naive Bayes

In [1]:
import pandas as pd 
import numpy as np 
from collections import defaultdict
import re 

It is much easier to organize and reuse the code if we define a NaiveBayes class rather than writing a loose collection of standalone functions. So we will define a NaiveBayes class and write all the relevant functions inside it.
However, functions that are not relevant to the NaiveBayes class will be defined separately.

Oh, by the way - we will be writing fully generic code for the NaiveBayes classifier! <br> No matter how many classes appear in the training dataset, it will be able to train a fully working model 👏

#### Let's first write a handy text preprocessing function which is not part of the NaiveBayes class

In [2]:
def preprocess_string(str_arg):
    
    """
        Parameters:
        ----------
        str_arg: example string to be preprocessed
        
        What the function does?
        -----------------------
        Preprocess the string argument - str_arg - such that :
        1. everything apart from letters and whitespace is replaced by a space
        2. multiple spaces are replaced by single space
        3. str_arg is converted to lower case 
        
        Example:
        --------
        Input :  Menu is absolutely perfect,loved it!
        Output:  'menu is absolutely perfect loved it '
                 (calling .split() on it later gives ['menu', 'is', 'absolutely', 'perfect', 'loved', 'it'])

        Returns:
        ---------
        Preprocessed string (not yet tokenized)
        
    """
    
    cleaned_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE) #every char except letters and whitespace is replaced by a space
    cleaned_str=re.sub(r'(\s+)',' ',cleaned_str) #multiple spaces are replaced by single space
    cleaned_str=cleaned_str.lower() #converting the cleaned string to lower case
    
    return cleaned_str # returning the preprocessed string (tokenization happens later via split())
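
To see what this helper actually hands back, here is a tiny standalone run. It is only a sketch: it repeats the same three steps (with raw-string regex patterns) on the sample sentence from the docstring, and the variable names `sample`, `cleaned` and `tokens` are made up for illustration.

```python
import re

def preprocess_string(str_arg):
    # Same three steps as the helper above
    cleaned_str = re.sub(r'[^a-z\s]+', ' ', str_arg, flags=re.IGNORECASE)  # non-letters -> space
    cleaned_str = re.sub(r'\s+', ' ', cleaned_str)                         # collapse whitespace runs
    return cleaned_str.lower()                                             # lower-case everything

sample = "Menu is absolutely perfect,loved it!"  # illustrative input only
cleaned = preprocess_string(sample)
tokens = cleaned.split()  # the class tokenizes later, simply by splitting on spaces

print(repr(cleaned))  # 'menu is absolutely perfect loved it '
print(tokens)         # ['menu', 'is', 'absolutely', 'perfect', 'loved', 'it']
```

Notice that the function returns a plain string (a trailing space can survive where punctuation used to be); the list of words only appears once `split()` is called.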

In [3]:
class NaiveBayes:
    
    def __init__(self,unique_classes):
        
        self.classes=unique_classes # Constructor is simply passed the array of unique class labels of the training set
        

    def addToBow(self,example,dict_index):
        
        '''
            Parameters:
            1. example 
            2. dict_index - index of the BoW category (class) this example belongs to

            What the function does?
            -----------------------
            It simply splits the example on the basis of space as a tokenizer and adds every tokenized word to
            its corresponding dictionary/BoW

            Returns:
            ---------
            Nothing
        
       '''
        
        if isinstance(example,np.ndarray): example=example[0]
     
        for token_word in example.split(): #for every word in preprocessed example
          
            self.bow_dicts[dict_index][token_word]+=1 #increment in its count
            
    def train(self,dataset,labels):
        
        '''
            Parameters:
            1. dataset - shape = (m X d)
            2. labels - shape = (m,)

            What the function does?
            -----------------------
            This is the training function which will train the Naive Bayes Model i.e compute a BoW for each
            category/class. 

            Returns:
            ---------
            Nothing
        
        '''
    
        self.examples=dataset
        self.labels=labels
        self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
        
        #only convert to numpy arrays if not already passed as numpy arrays - otherwise it's a useless recomputation
        
        if not isinstance(self.examples,np.ndarray): self.examples=np.array(self.examples)
        if not isinstance(self.labels,np.ndarray): self.labels=np.array(self.labels)
            
        #constructing BoW for each category
        for cat_index,cat in enumerate(self.classes):
          
            all_cat_examples=self.examples[self.labels==cat] #filter all examples of category == cat
            
            #get examples preprocessed
            
            cleaned_examples=[preprocess_string(cat_example) for cat_example in all_cat_examples]
            
            cleaned_examples=pd.DataFrame(data=cleaned_examples)
            
            #now construct BoW of this particular category
            np.apply_along_axis(self.addToBow,1,cleaned_examples,cat_index)
            
                
        ###################################################################################################
        
        '''
            Although we are done with the training of the Naive Bayes Model, BUT!!!!!!
            ------------------------------------------------------------------------------------
            Remember the Test Time Formula? : {for each word w [ count(w|c)+1 ] / [ count(c) + |V| + 1 ] } * p(c)
            ------------------------------------------------------------------------------------
            
            We are done with constructing the BoW for each category. But we need to precompute a few 
            other quantities at training time too:
            1. prior probability of each class - p(c)
            2. vocabulary |V| 
            3. denominator value of each class - [ count(c) + |V| + 1 ] 
            
            Reason for precomputing these ???
            ---------------------
            We could do all 3 calculations at test time too, BUT doing so means re-computing them 
            again and again every time the test function is called - this would significantly
            increase the computation time, especially when we have a lot of test examples to classify!!!  
            And moreover, it does not make sense to repeatedly compute the same thing - 
            why do extra computations ???
            So we will precompute all of them & use them during test time to speed up predictions.
            
        '''
        
        ###################################################################################################
      
        prob_classes=np.empty(self.classes.shape[0])
        all_words=[]
        cat_word_counts=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
           
            #Calculating prior probability p(c) for each class
            prob_classes[cat_index]=np.sum(self.labels==cat)/float(self.labels.shape[0]) 
            
            #Calculating total counts of all the words of each class 
            count=list(self.bow_dicts[cat_index].values())
            cat_word_counts[cat_index]=np.sum(np.array(count))+1 # |V| is still to be added (done below in denoms)
            
            #get all words of this category                                
            all_words+=self.bow_dicts[cat_index].keys()
                                                     
        
        #combine all words of every category & make them unique to get vocabulary -V- of entire training set
        
        self.vocab=np.unique(np.array(all_words))
        self.vocab_length=self.vocab.shape[0]
                                  
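        #computing denominator value
        #note: cat_word_counts already carries a +1 from the loop above, so this evaluates to count(c) + |V| + 2
        denoms=np.array([cat_word_counts[cat_index]+self.vocab_length+1 for cat_index,cat in enumerate(self.classes)])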
      
        '''
            Now that we have everything precomputed as well, it's better to organize everything in a tuple 
            rather than to have a separate list for everything.
            
            Every element of self.cats_info is a tuple of values:
            each tuple has a dict at index 0, prior probability at index 1, denominator value at index 2
        '''
        
        self.cats_info=[(self.bow_dicts[cat_index],prob_classes[cat_index],denoms[cat_index]) for cat_index,cat in enumerate(self.classes)]                               
        self.cats_info=np.array(self.cats_info)                                 
                                              
                                              
    def getExampleProb(self,test_example):                                
        
        '''
            Parameters:
            -----------
            1. a single test example 

            What the function does?
            -----------------------
            Function that estimates the log posterior score of the given test example for every class

            Returns:
            ---------
            log posterior score (unnormalized) of the test example for ALL CLASSES
        '''                                      
                                              
        likelihood_prob=np.zeros(self.classes.shape[0]) #to store probability w.r.t each class
        
        #finding probability w.r.t each class of the given test example
        for cat_index,cat in enumerate(self.classes): 
                             
            for test_token in test_example.split(): #split the test example and get p of each test word
                
                ####################################################################################
                                              
                #This loop computes : for each word w [ count(w|c)+1 ] / [ count(c) + |V| + 1 ]                               
                                              
                ####################################################################################                              
                
                #get total count of this test token from its respective training dict to get numerator value
                test_token_counts=self.cats_info[cat_index][0].get(test_token,0)+1
                
                #now get likelihood of this test_token word                              
                test_token_prob=test_token_counts/float(self.cats_info[cat_index][2])                              
                
                #remember why taking log? To prevent underflow!
                likelihood_prob[cat_index]+=np.log(test_token_prob)
                                              
        # we have the log-likelihood of the given example against every class, but we need the (log) posterior
        post_prob=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
            post_prob[cat_index]=likelihood_prob[cat_index]+np.log(self.cats_info[cat_index][1])                                  
      
        return post_prob
    
   
    def test(self,test_set):
      
        '''
            Parameters:
            -----------
            1. A complete test set of shape (m,)
            

            What the function does?
            -----------------------
            Determines probability of each test example against all classes and predicts the label
            against which the class probability is maximum

            Returns:
            ---------
            Predictions of test examples - A single prediction against every test example
        '''       
       
        predictions=[] #to store prediction of each test example
        for example in test_set: 
                                              
            #preprocess the test example the same way we did for training set examples
            cleaned_example=preprocess_string(example) 
             
            #simply get the posterior probability of every example
            post_prob=self.getExampleProb(cleaned_example) #get prob of this example for all classes
            
            #simply pick the max value and map against self.classes!
            predictions.append(self.classes[np.argmax(post_prob)])
                
        return np.array(predictions) 
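
If the class above feels like a lot to take in, the whole idea fits in a few lines. Below is a minimal sketch on a made-up three-sentence corpus. The sentences, the `pos`/`neg` labels, and the helper names `log_posterior` and `predict` are all invented for illustration and are not part of the class. It follows the test-time formula exactly as written in the comments, [ count(w|c)+1 ] / [ count(c) + |V| + 1 ], and it also shows why we sum logs instead of multiplying probabilities.

```python
from collections import Counter
import math

# Hypothetical toy corpus (already preprocessed), invented for illustration
train = [
    ("loved the food", "pos"),
    ("loved the menu", "pos"),
    ("hated the food", "neg"),
]

# One bag of words (word -> count) per class, plus how often each class occurs
bows = {}
class_counts = Counter()
for text, label in train:
    bows.setdefault(label, Counter()).update(text.split())
    class_counts[label] += 1

# Vocabulary V = unique words across all classes
vocab = set(word for bow in bows.values() for word in bow)

def log_posterior(text, label):
    # log p(c) + sum over words of log( [count(w|c) + 1] / [count(c) + |V| + 1] )
    bow = bows[label]
    denom = sum(bow.values()) + len(vocab) + 1
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        score += math.log((bow.get(word, 0) + 1) / denom)
    return score

def predict(text):
    # Pick the class with the highest log posterior score
    return max(bows, key=lambda label: log_posterior(text, label))

print(predict("loved the menu"))  # pos
print(predict("hated it"))        # neg ("it" was never seen; smoothing keeps its probability above zero)

# Why logs? A product of many small probabilities underflows to 0.0,
# while the sum of their logs stays a perfectly ordinary number.
print(0.001 ** 200)               # 0.0
print(200 * math.log(0.001))      # roughly -1381.55
```

The scores are not probabilities that sum to 1, they are only comparable with each other, which is all that picking the best class needs.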

# That's it!!! Let's Move to Training! ⛸⛸⛸

We will just load a dataset from sklearn - but we are still coding NB from scratch!

In [4]:
from sklearn.datasets import fetch_20newsgroups
""" 
just so you know - fetch_20newsgroups is a dataset that has 20 categories but we will restrict the categories
to 4 for the time being 
"""
categories=['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med'] 
newsgroups_train=fetch_20newsgroups(subset='train',categories=categories)

"""
    It's not a problem at all if you didn't understand this block of code - You should just know that some
    training data is being loaded where training examples are saved in train_data and train labels are 
    saved in train_labels

"""

train_data=newsgroups_train.data #getting all training examples
train_labels=newsgroups_train.target #getting training labels
print ("Total Number of Training Examples: ",len(train_data))
print ("Total Number of Training Labels: ",len(train_labels))

Total Number of Training Examples:  2257
Total Number of Training Labels:  2257


In [5]:
from pprint import pprint
print ("------------------- Dataset Categories -------------- ") 
pprint(list(newsgroups_train.target_names))

------------------- Dataset Categories -------------- 
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


# If you are curious to know what training data actually looks like .....  🤔 
<br> Training Examples : <br>
    The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics (we are using only 4 of them here)
    
<br> Training Labels : <br>
    The category names ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] are encoded as
    integers 0 to 3, in that order. If you don't get this - just accept that labels are in integer form
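
In case the integer labels still feel mysterious, here is a small sketch of the idea: each integer is just a position in the list of category names. The `labels` list below is a handful of hypothetical values, not the real training labels.

```python
# Category names in the order printed above
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# Hypothetical integer labels for five posts, invented for illustration
labels = [1, 1, 3, 3, 0]

# Each integer label is simply an index into target_names
named = [target_names[i] for i in labels]
print(named)
# ['comp.graphics', 'comp.graphics', 'soc.religion.christian', 'soc.religion.christian', 'alt.atheism']
```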

In [6]:
pd.options.display.max_colwidth=250
pd.DataFrame(data=np.column_stack([train_data,train_labels]),columns=["Training Examples","Training Labels"]).head()

| | Training Examples | Training Labels |
|---|---|---|
| 0 | `From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert ...` | 1 |
| 1 | `From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can he...` | 1 |
| 2 | `From: djohnson@cs.ucsd.edu (Darin Johnson)\nSubject: Re: harrassed at work, could use some prayers\nOrganization: =CSE Dept., U.C. San Diego\nLines: 63\n\n(Well, I'll email also, but this may apply to other people, so\nI'll post also.)\n\n>I've b...` | 3 |
| 3 | `From: s0612596@let.rug.nl (M.M. Zwart)\nSubject: catholic church poland\nOrganization: Faculteit der Letteren, Rijksuniversiteit Groningen, NL\nLines: 10\n\nHello,\n\nI'm writing a paper on the role of the catholic church in Poland after 1989. \n...` | 3 |
| 4 | `From: stanly@grok11.columbiasc.ncr.com (stanly)\nSubject: Re: Elder Brother\nOrganization: NCR Corp., Columbia SC\nLines: 15\n\nIn article <Apr.8.00.57.41.1993.28246@athos.rutgers.edu> REXLEX@fnal.gov writes:\n>In article <Apr.7.01.56.56.1993.228...` | 3 |


# Woohoo! Let's actually begin Training .... 🏋🏋🏋

In [7]:
nb=NaiveBayes(np.unique(train_labels)) #instantiate a NB class object

print ("---------------- Training In Progress --------------------")
 
nb.train(train_data,train_labels) #start training by calling the train function

print ('----------------- Training Completed ---------------------')

---------------- Training In Progress --------------------
----------------- Training Completed ---------------------


# So Now That We Have Trained NB Model - Let's Move to Testing! 🏄🏽🏄🏽🏄🏽

In [8]:
"""
    Again - it's not a problem at all if you didn't understand this block of code - You should just know that some
    test data is being loaded where test examples are saved in test_data and test labels are saved in test_labels

"""
newsgroups_test=fetch_20newsgroups(subset='test',categories=categories) #loading test data
test_data=newsgroups_test.data #get test set examples
test_labels=newsgroups_test.target #get test set labels
print ("Number of Test Examples: ",len(test_data))
print ("Number of Test Labels: ",len(test_labels))

Number of Test Examples:  1502
Number of Test Labels:  1502


# Let's Test on Above Loaded Test Examples Using the Trained NB Model

In [9]:
pclasses=nb.test(test_data) #get predictions for test set

#check how many predictions actually match original test labels
test_acc=np.sum(pclasses==test_labels)/float(test_labels.shape[0]) 

print ("Test Set Examples: ",test_labels.shape[0])
print ("Test Set Accuracy: ",test_acc*100,"%")

Test Set Examples:  1502
Test Set Accuracy:  93.8748335553 %


### Wow! Pretty Good Accuracy of ~94% 
### See, now you realise NB is not soooooo Naïve 👍👍👍
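
A single accuracy number does not tell you which classes the mistakes come from. As a rough sketch, here is how accuracy and a per-class breakdown can be computed in plain Python. The `true_labels` and `predictions` lists are made-up toy values, not the results from the run above.

```python
from collections import Counter

# Hypothetical true labels and predictions, invented for illustration
true_labels = [0, 0, 1, 1, 2, 2, 3, 3]
predictions = [0, 1, 1, 1, 2, 2, 3, 0]

# Overall accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)
print("Accuracy:", accuracy)  # Accuracy: 0.75

# Per-class accuracy: of the examples that truly belong to a class, how many did we get right?
totals = Counter(true_labels)
hits = Counter(t for t, p in zip(true_labels, predictions) if t == p)
per_class = {c: hits[c] / totals[c] for c in sorted(totals)}
print(per_class)  # {0: 0.5, 1: 1.0, 2: 1.0, 3: 0.5}
```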

## Plus, as I said, the code we have written is generic! <br> So let's use the same code on a different dataset and with different class labels

I have taken this dataset from Kaggle : https://www.kaggle.com/c/word2vec-nlp-tutorial

In [10]:
training_set=pd.read_csv('./data/labeledTrainData.tsv',sep='\t') # reading the training data-set

### Let's see what the dataset looks like? 🤔 🤔 🤔 

It has movie reviews and their corresponding sentiment labels....

In [11]:
training_set.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | `With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i th...` |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with...` |
| 2 | 7759_3 | 0 | `The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most f...` |
| 3 | 3630_4 | 0 | `It must be assumed that those who praised this film (\the greatest filmed opera ever,\" didn't I read somewhere?) either don't care for opera, don't care for Wagner, or don't care about anything except their desire to appear Cultured. Either as a...` |
| 4 | 9495_8 | 1 | `Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that we're dealing with a serious and harrowing drama, but you need not fear because barely ten minutes l...` |


In [12]:
#getting training set examples and labels
y_train=training_set['sentiment'].values
x_train=training_set['review'].values
print ("Unique Classes: ",np.unique(y_train))
print ("Total Number of Training Examples: ",x_train.shape)

Unique Classes:  [0 1]
Total Number of Training Examples:  (25000,)


In [13]:

"""
    Again - it's not a problem at all if you didn't understand this block of code - You should just know that some
    train & test data is being loaded and saved in their corresponding variables

"""

from sklearn.model_selection import train_test_split
train_data,test_data,train_labels,test_labels=train_test_split(x_train,y_train,shuffle=True,test_size=0.25,random_state=42,stratify=y_train)
classes=np.unique(train_labels)
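
Curious what `stratify=y_train` is doing in the call above? It keeps the share of each class (roughly) the same in the train and test parts. The sketch below illustrates that idea in plain Python. It is not how sklearn implements it; `stratified_split` is a hypothetical helper and the labels are toy values.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    # Group example indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # take the same fraction from every class
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return train_idx, test_idx

# Hypothetical labels: 8 negative and 8 positive reviews
labels = [0] * 8 + [1] * 8
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))          # 12 4
print(sorted(labels[i] for i in test_idx))    # [0, 0, 1, 1]
```

Without stratification, a random 25% slice of a small or imbalanced dataset could end up with very few examples of one class, which makes the test accuracy less trustworthy.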

# Let's train a NB classifier on this dataset

In [14]:
# Training phase....

nb=NaiveBayes(classes)
print ("------------------Training In Progress------------------------")
print ("Training Examples: ",train_data.shape)
nb.train(train_data,train_labels)
print ('------------------------Training Completed!')

# Testing phase 

pclasses=nb.test(test_data)
test_acc=np.sum(pclasses==test_labels)/float(test_labels.shape[0])
print ("Test Set Examples: ",test_labels.shape[0])
print ("Test Set Accuracy: ",test_acc)

------------------Training In Progress------------------------
Training Examples:  (18750,)
------------------------Training Completed!
Test Set Examples:  6250
Test Set Accuracy:  0.84192


### Let's test on the Kaggle test set and upload our predictions to Kaggle

In [15]:
# Loading the kaggle test dataset
test=pd.read_csv('./data/testData.tsv',sep='\t')
Xtest=test.review.values

#generating predictions....
pclasses=nb.test(Xtest) 

#writing results to csv for uploading to kaggle!
kaggle_df=pd.DataFrame(data=np.column_stack([test["id"].values,pclasses]),columns=["id","sentiment"])
kaggle_df.to_csv("./naive_bayes_model_take1.csv",index=False)
print ('Predictions Generated and saved to naive_bayes_model_take1.csv')

Predictions Generated and saved to naive_bayes_model_take1.csv


## A screenshot of the Kaggle results 

![](take1.png)