First, I loaded the given data into a dataframe, df_original, by separating the comments/messages from their corresponding labels. For each fold, I then created two dataframes: df_raw, which contains all the training data, and df_test, which contains all the testing data. df_positive contains the training data for the positive reviews and df_negative contains the training data for the negative reviews. How each function works is described in its docstring. I create two dictionaries, dict_positive and dict_negative, holding the number of occurrences of each word in the positive and negative comments respectively. That is, each key in, say, dict_positive is a unique word from the training comments, and its value is the number of times it appeared in the positive comments.
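
The counting step described above can be sketched on a toy input. This is only an illustration: `count_words`, `STOP_WORDS` and the two sample comments are hypothetical and not part of the notebook, and `collections.Counter` stands in for the plain dictionaries used in the actual code.

```python
import re
from collections import Counter

STOP_WORDS = {'a', 'the', 'an', 'be'}  # same stop words as the notebook

def count_words(comments, stop_words=STOP_WORDS):
    """Hypothetical helper: count non-stop-word tokens across comments."""
    counts = Counter()
    for line in comments:
        for match in re.finditer(r"[a-zA-Z']+", line):
            word = match.group().lower()
            if word not in stop_words:
                # Strip apostrophes so "don't" is counted as "dont".
                counts[word.replace("'", "")] += 1
    return counts

# Invented sample comments, for illustration only.
positive_comments = ["The food was great", "Great service, great food"]
dict_positive_demo = count_words(positive_comments)
print(dict_positive_demo["great"])  # 3
print(dict_positive_demo["the"])    # 0 (stop word, never counted)
```

A `Counter` returns 0 for missing keys, which plays the same role as initialising every vocabulary word to 0.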
Then, I use Laplace smoothing to estimate the probability of observing a word given that it belongs to a particular category. To find the probability of a list of words given a category, I multiply the individual word probabilities, assuming they are independent of each other. Finally, I use Bayes' theorem to compute the probability that the list of words belongs to each category (up to a common normalising factor), and classify the example as positive or negative according to which probability is higher.
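
As a rough summary of the quantities involved, the formulas below restate what prob_positive and prob_negative compute. The symbols are introduced here for illustration: $c$ is a category, $\mathrm{count}(w, c)$ is the number of times word $w$ appears in the comments of category $c$, $|V|$ is total_num_unique_words, $N_c$ is num_positive_words or num_negative_words, and $n_c$ is num_positive or num_negative.

```latex
\begin{align*}
P(w \mid c) &= \frac{\mathrm{count}(w, c) + \alpha}{\alpha\,|V| + N_c} \\
P(c) &= \frac{n_c}{n_{\text{pos}} + n_{\text{neg}}} \\
P(c \mid w_1, \dots, w_k) &\propto P(c) \prod_{i=1}^{k} P(w_i \mid c)
\end{align*}
```

The denominator of Bayes' theorem, $P(w_1, \dots, w_k)$, is the same for both categories, so it can be dropped when only comparing the two scores.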

In [1]:
import numpy as np
import re
import pandas as pd

def find_words(stop_words,df_raw,df_positive,df_negative):
    '''
    Takes in the list of stop words, the raw dataframe, the positive
    dataframe and the negative dataframe, and returns three items as a
    list - the set of all words, the set of all words in the positive
    reviews and the set of all words in the negative reviews.
    '''
    word_pattern = '[a-zA-Z\']+'

    def unique_words(comments):
        # Lowercase each token, skip stop words, then strip apostrophes.
        words = set()
        for line in comments:
            for i in re.finditer(word_pattern,line):
                word = i.group().lower()
                if word not in stop_words:
                    words.add(word.replace("'",""))
        return words

    total_words = unique_words(df_raw['comments'])
    positive_words = unique_words(df_positive['comments'])
    negative_words = unique_words(df_negative['comments'])
    return [total_words,positive_words,negative_words]

def segregate(label):
    '''
    Will be used in the groupby function to segregate the dataframe into df_positive and df_negative.
    '''
    if (label == '1'):
        return 'positive'
    else:
        return 'negative'
    
def word_counts(stop_words,df_raw,df_positive,df_negative):
    '''
    Takes in the stop_words and three dataframes - the raw df, a positive df and
    a negative df - and returns a list of 2 dictionaries, one for the positive
    and one for the negative reviews, each mapping every word in the training
    vocabulary to its number of occurrences in the respective comments.
    '''
    total_words,positive_words,negative_words = find_words(stop_words,df_raw,df_positive,df_negative)
    dict_positive = {word:0 for word in total_words}
    dict_negative = {word:0 for word in total_words}
    for line in df_positive['comments']:
        for i in re.finditer('[a-zA-Z\']+',line):
            word = i.group().lower()
            if (re.search('\'',word)):
                word = word.replace("'","")
            if (word in dict_positive):
                dict_positive[word] += 1
               
    for line in df_negative['comments']:
        for i in re.finditer('[a-zA-Z\']+',line):
            word = i.group().lower()
            if (re.search('\'',word)):
                word = word.replace("'","")
            if (word in dict_negative):
                dict_negative[word] += 1
               
    return [dict_positive,dict_negative]

def prob_positive(words,dict_positive,alpha):
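    '''
    Takes in a list of words, a dict containing the number of occurrences of each word in the positive comments and
    the value of alpha used for Laplace smoothing. Returns the prior probability of a positive comment multiplied by
    the smoothed probability of each word given the positive class. This is proportional to the probability that the
    words form a positive comment. Uses the module-level num_positive, num_negative, total_num_unique_words and
    num_positive_words.
    '''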
    prob = 1
    prob *= num_positive/(num_negative + num_positive)
    for i in range(len(words)):
        if words[i] in dict_positive:
            num = dict_positive[words[i]]
        else:
            num = 0
        prob *= (num + alpha)/(alpha*total_num_unique_words + num_positive_words)
    return prob
            
        

        
def prob_negative(words,dict_negative,alpha):
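    '''
    Takes in a list of words, a dict containing the number of occurrences of each word in the negative comments and
    the value of alpha used for Laplace smoothing. Returns the prior probability of a negative comment multiplied by
    the smoothed probability of each word given the negative class. This is proportional to the probability that the
    words form a negative comment. Uses the module-level num_positive, num_negative, total_num_unique_words and
    num_negative_words.
    '''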
    prob = 1
    prob *= num_negative/(num_negative + num_positive)
    for i in range(len(words)):
        if (words[i] in dict_negative):
            num = dict_negative[words[i]]
        else:
            num = 0
        prob *= (num + alpha)/(alpha*total_num_unique_words + num_negative_words)
    return prob

        
    

def predict(string,dict_positive,dict_negative,alpha):
    '''
    Takes in a string, dict_positive, dict_negative, alpha and returns 1 if the comment is positive and 0 if the comment is 
    negative.
    '''
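    words = []
    pattern = '[a-zA-Z\']+'
    for i in re.finditer(pattern,string):
        word = i.group().lower()
        # Strip apostrophes from non-stop words, matching the training tokenization.
        if (word not in stop_words):
            word = word.replace("'","")
        # Note: stop words are appended too, so they are looked up like any other word (usually with a zero count).
        words.append(word)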
    pos_prob = prob_positive(words,dict_positive,alpha)
    neg_prob = prob_negative(words,dict_negative,alpha)
    if (pos_prob >= neg_prob):
        return 1
    else:
        return 0

    
def accuracy(df_test):
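    '''
    Returns the fraction of comments in df_test that the model classifies
    correctly, using alpha = 1. Uses the module-level dict_positive and dict_negative.
    '''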
    a = []
    for string in df_test['comments']:
        a.append(predict(string,dict_positive,dict_negative,1))
    count = 0
    for y_pred,y in zip(a,df_test['labels']):
        if y_pred == int(y):
            count += 1
    total = len(df_test)
    accuracy = count/total
    return accuracy




stop_words = ['a','the','an','be']
with open('dataset_NB.txt','r') as dataset:
    file_reader = dataset.read()
# Each line is matched as a comment, a period, and a trailing single-digit label.
# A raw string keeps the regex escapes (\. and \d) intact.
pattern = r'(?P<comments>.*)\..*(?P<label>\d)\n'
comment = []
labels = []
for item in re.finditer(pattern,file_reader):
    comment.append(item.groupdict()['comments'])
    labels.append(item.groupdict()['label'])

df_original = pd.DataFrame({'comments':comment,'labels':labels}, index = np.arange(1,len(labels)+1))

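# 7-fold cross-validation: each fold holds out `length` consecutive rows for testing.
accuracies_list = []
length = len(df_original)//7

# Row bounds [a, b) of the current test fold.
a = 0
b = length

for i in range(7):
    # Train on everything outside [a, b), test on the rows inside it.
    df_one = df_original[:a].copy()
    df_two = df_original[b:].copy()
    df_raw = pd.concat([df_one,df_two])
    df_test = df_original[a:b]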
    df_positive = None
    df_negative = None
    df_raw.set_index('labels',inplace = True)
    for group,frame in df_raw.groupby(segregate):
        if (group == 'positive'):
            df_positive = frame
        else:
            df_negative = frame
    df_positive.reset_index(inplace = True)
    df_negative.reset_index(inplace = True)



    num_positive = len(df_positive)

    num_negative = len(df_negative)

    pos_words = []
    neg_words = []
    for line in df_positive['comments']:
        for i in re.finditer('[a-zA-Z\']+',line):
            word = i.group().lower()
            if (word not in stop_words and re.search('\'',word)):
                word = word.replace("'","")
                pos_words.append(word)
            elif (word not in stop_words):
                pos_words.append(word)
    for line in df_negative['comments']:
        for i in re.finditer('[a-zA-Z\']+',line):
            word = i.group().lower()
            if (word not in stop_words and re.search('\'',word)):
                word = word.replace("'","")
                neg_words.append(word)
            elif (word not in stop_words):
                neg_words.append(word)
    num_positive_words = len(pos_words)
    num_negative_words = len(neg_words)


        
    total_words,positive_words,negative_words = find_words(stop_words,df_raw,df_positive,df_negative)

    total_num_unique_words = len(total_words)


    dict_positive,dict_negative = word_counts(stop_words,df_raw,df_positive,df_negative)
    
    accuracies_list.append(accuracy(df_test))
    
    a += length
    b += length


Accuracy of the model on each fold and the average accuracy across folds:

In [2]:
for i in range(7):
    print('Accuracy in fold ' + str(i+1) + ' is ' + str(accuracies_list[i]*100))
print()
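# The average is reported as a fraction; accuracies_list*100 would only repeat the list 100 times, not scale it.
print('Average accuracy across all folds is ' + str(np.mean(accuracies_list)))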

Accuracy in fold 1 is 82.44274809160305
Accuracy in fold 2 is 86.25954198473282
Accuracy in fold 3 is 75.57251908396947
Accuracy in fold 4 is 83.96946564885496
Accuracy in fold 5 is 77.09923664122137
Accuracy in fold 6 is 81.67938931297711
Accuracy in fold 7 is 80.1526717557252

Average accuracy across all folds is 0.8102508178844057


Major limitations of the Naive Bayes classifier:

1) It assumes that the features (here, the words) are conditionally independent of one another given the class.
2) The order of the words is ignored.
3) Without Laplace smoothing, a single word that never appeared in a class's training comments makes the probability of the whole sentence 0 for that class.
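
Limitations 2 and 3 can be illustrated with a small sketch. The helper `class_score` and the toy counts below are hypothetical and invented for this example; they mirror the structure of prob_positive but are not the notebook's code.

```python
import math
from collections import Counter

def class_score(words, counts, prior, alpha, vocab_size):
    """Prior times the product of smoothed per-word likelihoods."""
    total = sum(counts.values())
    score = prior
    for word in words:
        score *= (counts[word] + alpha) / (alpha * vocab_size + total)
    return score

# Toy counts for the positive class, invented for illustration only.
pos_counts = Counter({'good': 3, 'not': 1, 'bad': 1})
vocab_size = 4  # 'good', 'not', 'bad', plus 'awful' seen only in negatives

# Limitation 2: reordering the words does not change the score.
s1 = class_score(['not', 'good'], pos_counts, 0.5, 1, vocab_size)
s2 = class_score(['good', 'not'], pos_counts, 0.5, 1, vocab_size)
print(math.isclose(s1, s2))  # True

# Limitation 3: with alpha = 0, one unseen word zeroes the whole product.
print(class_score(['good', 'awful'], pos_counts, 0.5, 0, vocab_size))      # 0.0
print(class_score(['good', 'awful'], pos_counts, 0.5, 1, vocab_size) > 0)  # True
```

With alpha = 1 the unseen word only lowers the score, so the remaining words can still influence the prediction.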