# PART THREE 
### Naive Bayes Sentiment Polarity Classifier

Write a program that uses the Naive Bayes algorithm to train a sentiment polarity
classifier, which assigns a sentiment polarity of positive or negative to a review.

Your program should accept as input a training file and a test file. The training file contains a
list of reviews and their actual sentiment labels (positive or negative). The test file
contains either a list of reviews with their actual sentiment labels or a list of reviews on their
own. Your program should output the predictions of the NB classifier (positive or
negative) for each of the reviews in the test file. If the actual labels (sometimes referred to
as gold labels or ground truth) are also available for the test reviews, your program should
also print the accuracy of the classifier.

You should use the following training data:
https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

described in the following paper:

Pang and Lee 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts. Proceedings of the 42nd ACL.
https://www.aclweb.org/anthology/P04-1035/

There are 1000 positive reviews and 1000 negative reviews. Reserve the last 100 of each
type for testing (files starting with cv9) and the first 900 for training (files starting with
cv[0-8]).
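
The split rule above can be sketched as a small helper. This is only an illustration: `split_by_fold` is a hypothetical name, and the example file names are invented to follow the dataset's naming pattern.

```python
def split_by_fold(file_names):
    """Split review file names into (train, test) using the cv9 prefix rule."""
    ordered = sorted(file_names)
    # Files whose name starts with "cv9" are the last 100 of each class.
    test = [name for name in ordered if name.lower().startswith("cv9")]
    train = [name for name in ordered if not name.lower().startswith("cv9")]
    return train, test

# Hypothetical file names, for illustration only.
example_names = ["cv900_10331.txt", "cv000_29590.txt",
                 "cv899_00001.txt", "cv999_00002.txt"]
train_names, test_names = split_by_fold(example_names)
print(train_names)
print(test_names)
```

Selecting by prefix rather than by position keeps the split correct even if a directory listing is incomplete or unsorted.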

Update!! Your program should be case-insensitive and should ignore punctuation.
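
One possible reading of this requirement is sketched below, assuming that "ignore punctuation" means deleting every ASCII punctuation character, including those inside a token. The `normalise` helper is a hypothetical name. The notebook's own extraction code is less aggressive and only drops tokens that are themselves punctuation marks.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalise(text):
    """Lower-case the text, delete punctuation and return the remaining tokens."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return cleaned.split()

tokens = normalise("You've got MAIL works a lot better than it deserves to !")
print(tokens)
```

With this variant `you've` becomes `youve`, so the resulting vocabulary and word counts would differ from the ones reported further down.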

Analyse the output of your classifier on 5 correct and 5 incorrect samples chosen at random
from the test set. For each example, say why you think your classifier made the correct or
incorrect decision.

In [108]:
#import required libraries
import os
import string 
import pandas as pd
import numpy as np
import random
import warnings
warnings.filterwarnings("ignore")
pos_path='review_polarity/txt_sentoken/pos/' # path of the positive review files
neg_path='review_polarity/txt_sentoken/neg/' # path of the negative review files

In [109]:
# store the positive and negative filenames in lists
pos_file_names=os.listdir(pos_path)
neg_file_names=os.listdir(neg_path)

In [110]:
def labels(fi,label): # build a dataframe of filenames that all carry the given label
    
    fi=sorted(fi)     # sort the filenames so they run from cv000 to cv999
    file=pd.DataFrame(fi,columns=['file_name']) # create a dataframe with a file_name column
    file['label']=label  # assign the label (positive/negative) to every row
    
    return file

def words_Extraction(files):  # extract the list of words from every file listed in the dataframe
    
    all_words=[]  # one list of words per file, in row order
    for i in range(0,files.shape[0]): # process the files one by one
        
        if files['label'].iloc[i] == 'positive':  # positive label: read from the positive folder
            fil=pos_path+files['file_name'].iloc[i] # file path
        else:                                     # otherwise read from the negative folder
            fil=neg_path+files['file_name'].iloc[i]
        
        with open(fil, "r", encoding='utf-8') as file: # the with block closes the file automatically
            a=file.read()                              # read the file content
        a=a.replace('\n',' ')                          # replace newline characters with spaces
        a=a.replace('\t',' ')                          # replace tab characters with spaces
        a=a.lower()                                    # convert the text to lowercase
        b=a.split(' ')                                 # split the text into words
        c=[w for w in b if w not in string.punctuation] # drop tokens that are a punctuation mark (substring test)
        c=[w for w in c if not w.isdigit()]            # drop purely numeric tokens (optional, can be commented out)
        c=list(filter(None, c))                        # remove empty strings from the list
        all_words.append(c)                            # keep the words of this file
    
    files['words']=all_words  # assign the whole column at once instead of chained .iloc assignment
    return files   # return the dataframe

def frequency(train,label):  # collect all words of the given class and count how often each one occurs
    
    docs=train.words[train['label']==label].to_list() # word lists of the documents in the given class
    
    total_words=[i for x in docs for i in x] # flatten the list of word lists into a single word list
    
    word_Dict={}  # dictionary mapping each word to its frequency
    
    for i in total_words: # loop over all words to count them
        word_Dict[i]=word_Dict.get(i,0)+1 # start at 0 for a new word, then add one
    
    return total_words,word_Dict  # return the full word list and the frequency dictionary
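
The same flatten-and-count step can also be written with the standard library's `collections.Counter`. The sketch below uses a hypothetical `count_words` helper and toy documents, not the review data.

```python
from collections import Counter

def count_words(documents):
    """Flatten a list of token lists and count each token."""
    all_words = [word for doc in documents for word in doc]
    return all_words, Counter(all_words)

# Toy documents, invented for illustration.
toy_docs = [["good", "film", "good"], ["bad", "film"]]
toy_words, toy_counts = count_words(toy_docs)
print(len(toy_words), toy_counts["good"], toy_counts["missing"])
```

A `Counter` returns 0 for a word it has never seen, which removes the need for an explicit membership check before looking up a count.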
    

In [143]:
# label the filenames and create an empty 'words' column
positive_file=labels(pos_file_names,'positive') 
positive_file['words']=''
negative_file=labels(neg_file_names,'negative')
negative_file['words']=''

In [144]:
#extract words for each class files
positive_file=words_Extraction(positive_file)
negative_file=words_Extraction(negative_file)

In [145]:
#create train and test set as defined in assignment
train_files=pd.concat([positive_file.iloc[0:900],negative_file.iloc[0:900]])
test_files=pd.concat([positive_file.iloc[900:1000],negative_file.iloc[900:1000]])

In [146]:
# reset the train dataframe index so that rows are numbered 0..n-1
train_files.reset_index(inplace=True,drop=True)
train_files

| | file_name | label | words |
|---|---|---|---|
| 0 | cv000_29590.txt | positive | [films, adapted, from, comic, books, have, had... |
| 1 | cv001_18431.txt | positive | [every, now, and, then, a, movie, comes, along... |
| 2 | cv002_15918.txt | positive | [you've, got, mail, works, alot, better, than,... |
| 3 | cv003_11664.txt | positive | [jaws, is, a, rare, film, that, grabs, your, a... |
| 4 | cv004_11636.txt | positive | [moviemaking, is, a, lot, like, being, the, ge... |
| ... | ... | ... | ... |
| 1795 | cv895_22200.txt | negative | [days, in, the, valley, is, more, or, less, a,... |
| 1796 | cv896_17819.txt | negative | [what, would, inspire, someone, who, cannot, w... |
| 1797 | cv897_11703.txt | negative | [synopsis, a, novelist, struggling, with, his,... |
| 1798 | cv898_1576.txt | negative | [okay, okay, maybe, i, wasn't, in, the, mood, ... |


In [147]:
# reset the test dataframe index so that rows are numbered 0..n-1
test_files.reset_index(inplace=True,drop=True)
test_files

| | file_name | label | words |
|---|---|---|---|
| 0 | cv900_10331.txt | positive | [in, a, ship, set, sail, on, her, maiden, voya... |
| 1 | cv901_11017.txt | positive | [the, start, of, this, movie, reminded, me, of... |
| 2 | cv902_12256.txt | positive | [note, some, may, consider, portions, of, the,... |
| 3 | cv903_17822.txt | positive | [robert, altman's, cookie's, fortune, is, that... |
| 4 | cv904_24353.txt | positive | [well, i'll, be, damned, the, canadians, can, ... |
| ... | ... | ... | ... |
| 195 | cv995_23113.txt | negative | [if, anything, stigmata, should, be, taken, as... |
| 196 | cv996_12447.txt | negative | [john, boorman's, zardoz, is, a, goofy, cinema... |
| 197 | cv997_5152.txt | negative | [the, kids, in, the, hall, are, an, acquired, ... |
| 198 | cv998_15691.txt | negative | [there, was, a, time, when, john, carpenter, w... |


In [148]:
# prior probability of each class (positive and negative)
p_positive_class=len(train_files[train_files['label']=='positive'])/train_files.shape[0] # total positive class documents divided by total documents
p_negative_class=len(train_files[train_files['label']=='negative'])/train_files.shape[0] # total negative class documents divided by total documents
print('Positive Class Probability:',p_positive_class)
print('Negative Class Probability:',p_negative_class)

Positive Class Probability: 0.5
Negative Class Probability: 0.5


In [151]:
#Collect all positive and negative class words and their frequency
positive_words,positive_words_frequency=frequency(train_files,'positive')
negative_words,negative_words_frequency=frequency(train_files,'negative')
print('Length of Positive words',len(positive_words))
print('Length of Negative words',len(negative_words))

Length of Positive words 612203
Length of Negative words 546472


In [152]:
# build the vocabulary: the unique words across all training files
Vocabulary=positive_words+negative_words
print(len(Vocabulary))
Vocabulary=set(Vocabulary)  # a set gives the unique words and makes membership tests fast
print(len(Vocabulary))

1158675
47941


In [119]:
def test_document_positive(words,class_prob): # log score of a document under the positive class
    
    positive_class_calculations_for_single_doc=[] # log-likelihood of each word in the document
    c=len(positive_words) # total number of words in the positive class
    d=len(Vocabulary)     # vocabulary size V
    for i in words:  # loop over the words of the document
        if i in Vocabulary: # words never seen in training are skipped
            a=positive_words_frequency.get(i,0) # count of the word in the positive class, 0 if absent
            b=1  # add-one (Laplace) smoothing
            total=np.log((a+b)/(c+d)) # log of the smoothed likelihood
            positive_class_calculations_for_single_doc.append(total) # store the value for this word
    
    # Note: logs are summed because the product of many small probabilities underflows to zero.
    # The class score is the log of the class prior plus the sum of the word log-likelihoods.
    p_class_output=np.log(class_prob)+sum(positive_class_calculations_for_single_doc) 
    
    return p_class_output # log score (not a probability) for the positive class


def test_document_negative(words,class_prob): # log score of a document under the negative class
    
    negative_class_calculations_for_single_doc=[] # log-likelihood of each word in the document
    c=len(negative_words) # total number of words in the negative class
    d=len(Vocabulary)     # vocabulary size V
    for i in words:  # loop over the words of the document
        if i in Vocabulary: # words never seen in training are skipped
            a=negative_words_frequency.get(i,0) # count of the word in the negative class, 0 if absent
            b=1  # add-one (Laplace) smoothing
            total=np.log((a+b)/(c+d)) # log of the smoothed likelihood
            negative_class_calculations_for_single_doc.append(total) # store the value for this word
    
    # The class score is the log of the class prior plus the sum of the word log-likelihoods.
    n_class_output=np.log(class_prob)+sum(negative_class_calculations_for_single_doc) 
    
    return n_class_output # log score (not a probability) for the negative class
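
The reason for working in log space can be checked on toy numbers. The sketch below is illustrative only: the counts, the vocabulary size and the `log_likelihood` helper are assumptions made for the example, not values from the review data.

```python
import math

# Toy counts for one class, invented for illustration only.
class_counts = {"good": 3, "film": 2}
total_class_tokens = 5   # sum of the counts above
vocabulary_size = 4      # assumed number of distinct training words

def log_likelihood(word):
    """Add-one smoothed log P(word | class)."""
    count = class_counts.get(word, 0)
    return math.log((count + 1) / (total_class_tokens + vocabulary_size))

# Multiplying many small probabilities underflows to exactly 0.0 ...
product = 1.0
for _ in range(1000):
    product *= 1e-5
# ... whereas the equivalent sum of logs stays finite.
log_sum = 1000 * math.log(1e-5)

print(product, log_sum)
# "boring" stands for a vocabulary word that never occurs in this class.
print(log_likelihood("good"), log_likelihood("boring"))
```

Because the logarithm is monotonic, comparing the two summed log scores picks the same class as comparing the two products would, without the underflow.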

In [153]:
# randomly select test documents (random.sample draws without replacement, so no document is picked twice)
test_pos_5Sentences = random.sample(range(0, 100), k=5) # indexes of five positive test documents
test_neg_5Sentences = random.sample(range(100, 200), k=5) # indexes of five negative test documents
print(test_pos_5Sentences,test_neg_5Sentences) # print the random indexes of the positive and negative class

[36, 51, 94, 73, 65] [174, 118, 113, 121, 135]
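
The sampled documents should be distinct, so it matters which standard-library function draws the indexes. As a small illustration (the seed is an arbitrary choice made for repeatability):

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# random.choices draws WITH replacement, so an index can come back twice.
with_replacement = random.choices(range(0, 100), k=5)

# random.sample draws WITHOUT replacement, so all five indexes are distinct.
without_replacement = random.sample(range(0, 100), k=5)

print(with_replacement, without_replacement)
```

With only five draws out of 100 a repeat is unlikely but possible under `random.choices`. `random.sample` rules it out.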


In [154]:
# select the rows at those random indexes from the test set
test=test_pos_5Sentences+test_neg_5Sentences
test_reports=test_files.iloc[test].copy()  # copy so that new columns can be added safely

In [155]:
# calculate the log score of each test document under the positive and negative class
test_reports['positive_output']=test_reports.words.apply(lambda x: test_document_positive(x,p_positive_class))
test_reports['negative_output']=test_reports.words.apply(lambda x: test_document_negative(x,p_negative_class))

# predict the class with the higher log score
test_reports['predicted']=np.where(test_reports['positive_output'] > test_reports['negative_output'], 'positive', 'negative')
test_reports

| | file_name | label | words | positive_output | negative_output | predicted |
|---|---|---|---|---|---|---|
| 36 | cv936_15954.txt | positive | [it's, tough, to, really, say, something, nice... | -10600.279682 | -10677.580949 | positive |
| 51 | cv951_10926.txt | positive | [how, many, of, us, would, become, strippers, ... | -3701.688835 | -3708.098399 | positive |
| 94 | cv994_12270.txt | positive | [a, thriller, set, in, modern, day, seattle, t... | -3449.427752 | -3426.898144 | negative |
| 73 | cv973_10066.txt | positive | [i, like, movies, with, albert, brooks, and, i... | -8024.854132 | -8053.779688 | positive |
| 65 | cv965_26071.txt | positive | [in, many, ways, twotg, does, for, tough-guy, ... | -3863.698754 | -3867.490087 | positive |
| 174 | cv974_24303.txt | negative | [long, ago, films, were, constructed, of, stro... | -2971.072201 | -2973.26562 | positive |
| 118 | cv918_27080.txt | negative | [you, know, something, christmas, is, not, abo... | -3933.426644 | -3912.955298 | negative |
| 113 | cv913_29127.txt | negative | [frank, detorri's, bill, murray, a, single, da... | -3019.20163 | -3018.419033 | negative |
| 121 | cv921_13988.txt | negative | [note, some, may, consider, portions, of, the,... | -6676.631035 | -6621.737839 | negative |
| 135 | cv935_24977.txt | negative | [the, plot, of, big, momma's, house, is, marti... | -6076.020486 | -6033.741107 | negative |


In [157]:
# check the accuracy (percentage of the 10 selected documents classified correctly)
print('10 selected Test documents Accuracy: ',(sum(test_reports['label']==test_reports['predicted'])/test_reports.shape[0])*100)

10 selected Test documents Accuracy:  80.0


In [163]:
# For further analysis, draw 10 documents from the whole test set instead of five positive and then five negative ones.
# The selection is now fully random, so the split between the two classes is not known in advance.
test_r=test_files.iloc[random.sample(range(0, 200), k=10)].copy()  # sample without replacement; copy so columns can be added
test_r

| | file_name | label | words |
|---|---|---|---|
| 158 | cv958_13020.txt | negative | [in, times, of, crisis, people, are, driven, t... |
| 11 | cv911_20260.txt | positive | [usually, when, a, blockbuster, comes, out, it... |
| 181 | cv981_16679.txt | negative | [director, luis, mandoki's, last, film, was, t... |
| 70 | cv970_18450.txt | positive | [synopsis, in, phantom, menace, the, galaxy, i... |
| 82 | cv982_21103.txt | positive | [i, rented, this, movie, with, very, high, hop... |
| 184 | cv984_14006.txt | negative | [while, i, am, not, fond, of, any, writer's, u... |
| 106 | cv906_12332.txt | negative | [writing, a, screenplay, for, a, thriller, is,... |
| 173 | cv973_10171.txt | negative | [in, the, continuation, of, warner, brother's,... |
| 78 | cv978_20929.txt | positive | [when, you, get, out, of, jail, you, can, kill... |
| 42 | cv942_17082.txt | positive | [it, is, always, refreshing, to, see, a, super... |


In [164]:
test_r['positive_output']=test_r.words.apply(lambda x: test_document_positive(x,p_positive_class))
test_r['negative_output']=test_r.words.apply(lambda x: test_document_negative(x,p_negative_class))
test_r['predicted']=np.where(test_r['positive_output'] > test_r['negative_output'], 'positive', 'negative')
test_r

| | file_name | label | words | positive_output | negative_output | predicted |
|---|---|---|---|---|---|---|
| 158 | cv958_13020.txt | negative | [in, times, of, crisis, people, are, driven, t... | -3686.359729 | -3674.877493 | negative |
| 11 | cv911_20260.txt | positive | [usually, when, a, blockbuster, comes, out, it... | -3937.696119 | -3915.195787 | negative |
| 181 | cv981_16679.txt | negative | [director, luis, mandoki's, last, film, was, t... | -5742.692264 | -5721.467338 | negative |
| 70 | cv970_18450.txt | positive | [synopsis, in, phantom, menace, the, galaxy, i... | -4878.96302 | -4952.983901 | positive |
| 82 | cv982_21103.txt | positive | [i, rented, this, movie, with, very, high, hop... | -4055.067925 | -4091.684047 | positive |
| 184 | cv984_14006.txt | negative | [while, i, am, not, fond, of, any, writer's, u... | -3212.380432 | -3189.310444 | negative |
| 106 | cv906_12332.txt | negative | [writing, a, screenplay, for, a, thriller, is,... | -5486.153469 | -5424.904004 | negative |
| 173 | cv973_10171.txt | negative | [in, the, continuation, of, warner, brother's,... | -8111.276285 | -8017.294311 | negative |
| 78 | cv978_20929.txt | positive | [when, you, get, out, of, jail, you, can, kill... | -5543.689664 | -5526.003348 | negative |
| 42 | cv942_17082.txt | positive | [it, is, always, refreshing, to, see, a, super... | -4101.123371 | -4105.794971 | positive |


In [165]:
print('10 randomly selected Test documents Accuracy: ',(sum(test_r['label']==test_r['predicted'])/test_r.shape[0])*100)

10 randomly selected Test documents Accuracy:  80.0
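
The accuracy figure can be reproduced from the label and prediction columns alone. The sketch below uses a hypothetical `accuracy` helper; the two lists are copied from the ten rows of the table above.

```python
def accuracy(gold_labels, predicted_labels):
    """Percentage of predictions that match the gold labels."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return 100.0 * correct / len(gold_labels)

# Gold labels and predictions for rows 158, 11, 181, 70, 82, 184, 106, 173, 78, 42.
gold = ["negative", "positive", "negative", "positive", "positive",
        "negative", "negative", "negative", "positive", "positive"]
pred = ["negative", "negative", "negative", "positive", "positive",
        "negative", "negative", "negative", "negative", "positive"]

score = accuracy(gold, pred)
print(score)
```

The two mismatches are rows 11 and 78, both positive reviews predicted as negative. The same helper could be applied to all 200 test documents to report accuracy on the full test set.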
