# Thumbs Up? Sentiment Classification using Machine Learning Techniques
originally by: Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan

re-created by: Yel Choo and Robee Te

## Dataset

To start, we first downloaded the dataset found in http://www.cs.cornell.edu/people/pabo/movie-review-data/. We used the polarity version 0.9 which was the dataset used by the orginal authors to derived at the same results later on.

## Reading the Files

To perform the actual classification, we must first read the data. The ReadFile.py file, contains 3 methods, namely, readFile(), getCorpusNeg(), and getCorpusPos().

The readFile() method just iterates through the whole dataset separating the negative documents from the positive documents. The getCorpusNeg() and getCorpusPos() returns the positive and negative documents, respectively.

In [16]:
import glob

class ReadFile(object):
    def __init__(self):
        self.corpus = []
        self.corpus_neg = []
        self.corpus_pos = []
        
    def readFile(self):
            path = "C:/Users/Robee Khyra Te/Documents/GitHub/machlrn/Final Project/tokens/*"

            for folder in glob.glob(path):
                if "neg" in folder:
                    for textfile in glob.glob(folder.replace("\\neg", "/neg/*")):
                        with open(textfile.replace("\\", "/")) as f:
                            text = f.read()
                        self.corpus_neg.append((text, 0))
                elif "pos" in folder:
                    for textfile in glob.glob(folder.replace("\\pos", "/pos/*")):
                        with open(textfile.replace("\\", "/")) as f:
                            text = f.read()
                        self.corpus_pos.append((text, 1))

            print(len(self.corpus_neg))
            print(len(self.corpus_pos))
    
    def getCorpusNeg(self):
        return self.corpus_neg

    def getCorpusPos(self):
        return self.corpus_pos

In [17]:
rf = ReadFile()
rf.readFile()

700
700


## Pre-Processing

After reading the file, we must pre-process each of them. The original authors adapted the technique used by Das and Chen (2001) in which they added a string "NOT_" to every word after the negation words (no, isn't, wasn't, among others) and the first punctuation mark.

In our case, the negation words used were no, not, and every word ending in n't. We can easily find the negation words because nltk's word tokenizer automatically splits the word didn't to [did and n't]. While, for the punctuation marks we used [, . ? ! ;].

To be able to do this, we had a class named Features, the method getListWithNegation() will be the one appendning "NOT_" to the given documents.

In [61]:
def getListWithNegation(self, corpus):
    all_document = []

    for document, category in corpus:
        all_words = []
        tokens = nltk.word_tokenize(document)
        i = 0
        while i < len(tokens):
            if tokens[i] in self.negation_list:
                all_words.append(tokens[i])
                i += 1
                while i < len(tokens):
                    if tokens[i] not in self.punctuation_list:
                        a = tokens[i] + self.negate
                        all_words.append(a)
                    else:
                        break
                    i += 1
            else:
                all_words.append(tokens[i])
            i += 1

        all_document.append((all_words, category))

    #print(all_document)

    #print(all_words)

    return all_document

## Classification

After the pre-processing of the dataset, we are now ready to perform the actual classification of sentiments to the dataset. There are 8 different combination that must be performed.

### (1) Unigrams using Frequency
First, we must chose the features to be used. In the original document, they only choose the word that appears 4 or more times. A method was created to count the number of occurrences of each word in the documents and filter the words which appears 4 or more times.

In [63]:
def getUnigram(self, all_document):
        unigrams = {}

        for words_in_document, category in all_document:
            for word in words_in_document:
                unigrams[word] = unigrams.get(word, 0) + 1

        for word, count in unigrams.items():
            if count >= 4: #SABI SA DOCU AT LEAST 4
                self.features['unigram'].append(word)

        #print(len(self.features['unigram']))

        return self.features['unigram']

After getting the words, the next step is to get the frequency of the chosen unigrams which will be performed by the getChosenFeatures.

In [64]:
def getChosenFeatures(self, all_document, type = 'None'):
        chosen_features = np.zeros((len(all_document), len(self.features['unigram'])))

        i = 0
        for words_in_document, category in all_document:
            if type == 'frequency':
                frequencies = Counter(words_in_document)
                for word, count in frequencies.items():
                    try:
                        chosen_features[i][self.features['unigram'].index(word)] = count
                    except: #PAG WALA
                        pass
            elif type == 'presence':
                for word in set(words_in_document):
                    try:
                        chosen_features[i][self.features['unigram'].index(word)] = 1
                    except: #PAG WALA
                        pass
            i += 1

        #print(chosen_features)

        return chosen_features

The last step is to split the features into n folds, which will then be used to get the training dataset and the testing dataset will finally be passed to the classifiers. 

The Classify class contains 3 methods: splitToMiniBatches, naiveBayes Classifier, and SVMClassifier. The splitToMiniBatches method will just split the whole datset into 3 batches, then for each batch it will split the features into the train and test array given the indexOfTrain. 

The paper stated that we must maintain a balanced distribution between the 2 classes (negative and positive) that's why the number of negative features present in each batch['x'] must be equal to the number of positive features. For the batch['y'], we just assign 0 - if the feature is from the negative documents and 1 - if it is from the positive documents.

To split the dataset equally, we used scikit-learn's KFold function (ref: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).

In [73]:
def splitToMiniBatches(self, negative, positive, indexOfTest):
    batch = {'fold' : [], 'X' : [], 'Y' : []}

    split = len(negative) // self.number_fold

    #Create mini batches
    for i in range(self.number_fold):
        mergedlist = []

        batch['fold'].append(i)

        #Split to equal sized batch, maintaining balance from neg and pos
        neg = negative[i * split : i * split + split]
        #print(len(neg))
        pos = positive[i * split : i * split + split]
        #print(len(pos))

        mergedlist.extend(neg)
        mergedlist.extend(pos)
        batch['X'].append(mergedlist)
        #print(len(self.batch['X'][i]))

        batch['Y'].append(np.append(np.zeros(split), np.ones(split)))
        #print(len(self.batch['Y'][i]))

    batch['fold'] = np.array(batch['fold'])
    batch['X'] = np.array(batch['X'])
    batch['Y'] = np.array(batch['Y'])

    x_train = batch['X'][batch['fold'] != i]
    y_train = batch['Y'][batch['fold'] != i]

    x_test = batch['X'][i]
    y_test = batch['Y'][i]

    #print(x_test.shape)
    #print(y_test.shape)

    return x_train, x_test, y_train, y_test

For the actual testing, we used scikit-learns's Naive Bayes classifier for multivariate Bernoulli models (ref: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html). 
And scikit-learn's Linear Support Vector Classifier (ref: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

In [None]:
def naiveBayesClassifier(self, x_train, x_test, y_train, y_test):
    bernoulliNB = BernoulliNB()
    bernoulliNB.fit(x_train.reshape(x_train.shape[0] * x_train.shape[1], -1), np.ravel(y_train.reshape(y_train.shape[0] * y_train.shape[1], -1)))
    accuracy = accuracy_score(bernoulliNB.predict(x_test), y_test)

    #print("Accuracy for 1 fold in Naive Bayes: ", accuracy)

    return accuracy

def SVMClassifier(self, x_train, x_test, y_train, y_test):
    linearSVM = LinearSVC()
    linearSVM.fit(x_train.reshape(x_train.shape[0] * x_train.shape[1], -1), np.ravel(y_train.reshape(y_train.shape[0] * y_train.shape[1], -1)))
    accuracy = accuracy_score(linearSVM.predict(x_test), y_test)

    #print("Accuracy for 1 fold in SVM: ", accuracy)

    return accuracy

Finally, we will test this using 2 classifiers: Naive Bayes and SVM. As stated in the paper, we will run this 3 times and get the average of the accuracies.

In [74]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeUnigrams = f.getUnigram(negDoc)
    positiveUnigrams = f.getUnigram(posDoc)

    negativeFeatures = f.getChosenFeatures(negDoc, type = 'frequency')
    positiveFeatures = f.getChosenFeatures(posDoc, type = 'frequency')

    f.features['unigram'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ", aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ", aveAccuracySVM / 3 * 100)

7472
15690
Accuracy for 1 fold in Naive Bayes:  0.787096774194
Accuracy for 1 fold in SVM:  0.783870967742
7289
15386
Accuracy for 1 fold in Naive Bayes:  0.796774193548
Accuracy for 1 fold in SVM:  0.777419354839
7416
15575
Accuracy for 1 fold in Naive Bayes:  0.78064516129
Accuracy for 1 fold in SVM:  0.767741935484
Average Accuracy of the Naive Bayes Classifier:  78.8172043011
Average Accuracy of the SVM Classifier:  77.6344086022


### (2) Unigrams using Presence

The process of getting the features was the same. The only difference was that instead of frequency, we will just check if the feature is present in the document or not. 1 if it is present and 0 otherwise. The testing process was also similar to the Unigrams using frequency.

In [86]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeUnigrams = f.getUnigram(negDoc)
    positiveUnigrams = f.getUnigram(posDoc)

    negativeFeatures = f.getChosenFeatures(negDoc, type = 'presence')
    positiveFeatures = f.getChosenFeatures(posDoc, type = 'presence')

    f.features['unigram'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)

7472
15690
Accuracy for 1 fold in Naive Bayes:  0.787096774194
Accuracy for 1 fold in SVM:  0.758064516129
7289
15386
Accuracy for 1 fold in Naive Bayes:  0.796774193548
Accuracy for 1 fold in SVM:  0.81935483871
7416
15575
Accuracy for 1 fold in Naive Bayes:  0.78064516129
Accuracy for 1 fold in SVM:  0.803225806452
Average Accuracy of the Naive Bayes Classifier: 
78.8172043011
Average Accuracy of the SVM Classifier: 
79.3548387097


### (3) Bigrams using Presence

For this combination instead of just getting the unigram, we will now use bigram. For us to get the bigrams, we used nltk's ngrams function and passed the tokenized words in the document and it returns all the bigrams from the documents (e.g. (word1, word2)). In the paper, it is stated that they choose the bigrams which appeared at least 7 times. 

In [None]:
def getBigram(self, all_document):
    bigrams = {}
    words = []

    for document, category in all_document:
        tokens = nltk.word_tokenize(document)
        words += ngrams(tokens, 2)

    for word in words:
        bigrams[word] = bigrams.get(word, 0) + 1

    for word, count in bigrams.items():
        if count >= 7: #SABI SA DOCU AT LEAST 7
            self.features['bigram'].append(word)

    #print(self.features['bigram'][:100])
    return self.features['bigram'][:16165]

In [117]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeUnigrams = f.getBigram(negDoc)
    positiveUnigrams = f.getBigram(posDoc)

    negativeFeatures = f.getChosenFeaturesBigram(negDoc)
    positiveFeatures = f.getChosenFeaturesBigram(posDoc)

    f.features['bigram'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)

Accuracy for 1 fold in Naive Bayes:  0.793548387097
Accuracy for 1 fold in SVM:  0.777419354839
Accuracy for 1 fold in Naive Bayes:  0.764516129032
Accuracy for 1 fold in SVM:  0.764516129032
Accuracy for 1 fold in Naive Bayes:  0.741935483871
Accuracy for 1 fold in SVM:  0.758064516129
Average Accuracy of the Naive Bayes Classifier: 
76.6666666667
Average Accuracy of the SVM Classifier: 
76.6666666667


### (4) Unigrams + Bigrams using Presence

This just combines the features from the Unigram and Bigram chosen before.

In [None]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])
    
    mergedlistNegative = []
    mergedlistNegative.extend(f.getUnigram(negDoc))
    mergedlistNegative.extend(f.getBigram(negDoc))
    
    mergedlistPositive = []
    mergedlistPositive.extend(f.getUnigram(negDoc))
    mergedlistPositive.extend(f.getBigram(negDoc))

    negativeFeatures = f.getChosenFeaturesUniBigram(negDoc, mergedlistNegative)
    positiveFeatures = f.getChosenFeaturesUniBigram(posDoc, mergedlistPositive)

    f.features['unigram_bigram'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)



### (5) Top Unigrams using Presence

The process of getting the features was the same. The only difference of this from the original unigrams was that instead of all the unigrams, we will just get the top 2,633 unigrams.

In [None]:
def getTopUnigrams(self, all_document):
        frequencies = Counter(self.getUnigram(all_document))
        self.features['topunigrams'] = sorted(frequencies, key=frequencies.get, reverse=True)[:2633]

        chosen_features = np.zeros((len(all_document), len(self.features['topunigrams'])))

        i = 0
        for words_in_document, category in all_document:
            for word in set(words_in_document):
                try:
                    chosen_features[i][self.features['topunigrams'].index(word)] = 1
                except: #PAG WALA
                    pass
            i += 1

        #print(chosen_features)
        return chosen_features

In [109]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeFeatures = f.getTopUnigrams(negDoc)
    positiveFeatures = f.getTopUnigrams(posDoc)

    f.features['topunigrams'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)

Accuracy for 1 fold in Naive Bayes:  1.0
Accuracy for 1 fold in SVM:  1.0
Accuracy for 1 fold in Naive Bayes:  1.0
Accuracy for 1 fold in SVM:  1.0
Accuracy for 1 fold in Naive Bayes:  1.0
Accuracy for 1 fold in SVM:  1.0
Average Accuracy of the Naive Bayes Classifier: 
100.0
Average Accuracy of the SVM Classifier: 
100.0


### (6) Unigrams + POS tags using Presence

This also has the process as the original unigram but we first need to determine the POS tag of the words. To be able to do this, we utilized nltk's POS tagger. Each POS tag was append at the end of each word (e.g. Hard to Hard-ADJ).

In [102]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeUnigrams = f.getUnigramPos(negDoc)
    positiveUnigrams = f.getUnigramPos(posDoc)
    
    negativeFeatures = f.getChosenFeaturesPos(negDoc)
    positiveFeatures = f.getChosenFeaturesPos(posDoc)

    f.features['unigrams_pos'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)

Accuracy for 1 fold in Naive Bayes:  0.777419354839
Accuracy for 1 fold in SVM:  0.761290322581
Accuracy for 1 fold in Naive Bayes:  0.777419354839
Accuracy for 1 fold in SVM:  0.790322580645
Accuracy for 1 fold in Naive Bayes:  0.770967741935
Accuracy for 1 fold in SVM:  0.806451612903
Average Accuracy of the Naive Bayes Classifier: 
77.5268817204
Average Accuracy of the SVM Classifier: 
78.6021505376


### (7) Adjective Unigrams using Presence

This feature is similar to the process in getting the POS tags of the unigrams. But instead of getting all the words, only the adjectives found in the documents will be chosen.

In [101]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeUnigrams = f.getUnigramAdjective(negDoc)
    positiveUnigrams = f.getUnigramAdjective(posDoc)
    
    negativeFeatures = f.getChosenFeaturesAdjective(negDoc)
    positiveFeatures = f.getChosenFeaturesAdjective(posDoc)

    f.features['adjectives'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)

Accuracy for 1 fold in Naive Bayes:  0.754838709677
Accuracy for 1 fold in SVM:  0.741935483871
Accuracy for 1 fold in Naive Bayes:  0.764516129032
Accuracy for 1 fold in SVM:  0.774193548387
Accuracy for 1 fold in Naive Bayes:  0.767741935484
Accuracy for 1 fold in SVM:  0.729032258065
Average Accuracy of the Naive Bayes Classifier: 
76.2365591398
Average Accuracy of the SVM Classifier: 
74.8387096774


### (8) Unigrams + Position of word using Presence

This also has the process as the unigram but an addition process of determining the word's position was done.

First, get the split per quarter of each document. To be able to achieve this, get the total word count of each document and divide this by 4. Words in the 1st quater will be appended with "-0". Words falling in the 2nd and 3rd quarter, will be appended with "-1". And finally, words in the last quarter will be appended "-2".

In [None]:
def getUnigramPosition(self, all_document):
        unigramsposition = {}
        
        for words_in_document, category in all_document:
            splitPerQtr = len(words_in_document) // 4
            i = 0
            position = 0
            for word in words_in_document: #START
                if i < splitPerQtr:
                    unigramsposition[word + "-" + str(position)] = unigramsposition.get(word + "-" + str(position), 0) + 1
                if i >= splitPerQtr and i < len(words_in_document) - splitPerQtr: #MIDDLE
                    position = 1
                elif i <= len(words_in_document) - splitPerQtr: #END
                    position = 2
                i += 1 

        for word, count in unigramsposition.items():
            if count >= 4:
                self.features['unigram_position'].append(word)

        #print(len(self.features['unigram_position']))
        
        return self.features['unigram_position']

In [108]:
from sklearn.cross_validation import KFold

f = Features()
negativeDocuments = f.getListWithNegation(rf.getCorpusNeg())
positiveDocuments = f.getListWithNegation(rf.getCorpusPos())

kf = KFold(len(rf.getCorpusNeg()))

aveAccuracyNB = 0
aveAccuracySVM = 0
indexOfTest = 0

for train, test in kf: #Get training indices and testing indices
    negDoc = []
    posDoc = []

    for index in train:
        negDoc.append(negativeDocuments[index])
        posDoc.append(positiveDocuments[index])

    negativeUnigrams = f.getUnigramPosition(negDoc)
    positiveUnigrams = f.getUnigramPosition(posDoc)
    
    negativeFeatures = f.getChosenFeaturesPosition(negDoc)
    positiveFeatures = f.getChosenFeaturesPosition(posDoc)

    f.features['unigram_position'] = []

    c = Classify()

    x_train, x_test, y_train, y_test = c.splitToMiniBatches(negativeFeatures, positiveFeatures, indexOfTest)

    aveAccuracyNB += c.naiveBayesClassifier(x_train, x_test, y_train, y_test)

    aveAccuracySVM += c.SVMClassifier(x_train, x_test, y_train, y_test)

    indexOfTest += 1

print("Average Accuracy of the Naive Bayes Classifier: ")
print(aveAccuracyNB / 3 * 100)
print("Average Accuracy of the SVM Classifier: ")
print(aveAccuracySVM / 3 * 100)

Accuracy for 1 fold in Naive Bayes:  0.612903225806
Accuracy for 1 fold in SVM:  0.716129032258
Accuracy for 1 fold in Naive Bayes:  0.606451612903
Accuracy for 1 fold in SVM:  0.751612903226
Accuracy for 1 fold in Naive Bayes:  0.61935483871
Accuracy for 1 fold in SVM:  0.722580645161
Average Accuracy of the Naive Bayes Classifier: 
61.2903225806
Average Accuracy of the SVM Classifier: 
73.0107526882


## Results

Our results are slightly different from the original. There are several reasons in why this may happen. First, the original authors did not state which negation words and punctuation marks were used. Also, the POS tagger may also affect the result of the features. Some of the pre-process made may also be different 

In [128]:
import numpy as np
import nltk
from nltk.util import ngrams
from collections import Counter

class Features(object):
    def __init__(self):
        self.negate = "NOT_"

        self.negation_list = ["no", "not", "n\'t"]
        self.punctuation_list = [".", ",", "?", "!", ";"]

        self.features = {}

        self.features['unigram'] = []
        self.features['bigram'] = []
        self.features['unigram_bigram'] = []
        self.features['topunigrams'] = []
        self.features['unigram_pos'] = []
        self.features['adjectives'] = []
        self.features['unigram_position'] = []

    def getListWithNegation(self, corpus):
        all_document = []

        for document, category in corpus:
            all_words = []
            tokens = nltk.word_tokenize(document)
            i = 0
            while i < len(tokens):
                if tokens[i] in self.negation_list:
                    all_words.append(tokens[i])
                    i += 1
                    while i < len(tokens):
                        if tokens[i] not in self.punctuation_list:
                            a = tokens[i] + self.negate
                            all_words.append(a)
                        else:
                            break
                        i += 1
                else:
                    all_words.append(tokens[i])
                i += 1

            all_document.append((all_words, category))

        #print(all_document)

        #print(all_words)

        return all_document

    def getUnigram(self, all_document):
        unigrams = {}

        for words_in_document, category in all_document:
            for word in words_in_document:
                unigrams[word] = unigrams.get(word, 0) + 1

        for word, count in unigrams.items():
            if count >= 4: #SABI SA DOCU AT LEAST 4
                self.features['unigram'].append(word)

        #print(len(self.features['unigram']))

        return self.features['unigram']

    def getChosenFeatures(self, all_document, type = 'None'):
        chosen_features = np.zeros((len(all_document), len(self.features['unigram'])))

        i = 0
        for words_in_document, category in all_document:
            if type == 'frequency':
                frequencies = Counter(words_in_document)
                for word, count in frequencies.items():
                    try:
                        chosen_features[i][self.features['unigram'].index(word)] = count
                    except: #PAG WALA
                        pass
            elif type == 'presence':
                for word in set(words_in_document):
                    try:
                        chosen_features[i][self.features['unigram'].index(word)] = 1
                    except: #PAG WALA
                        pass
            i += 1

        #print(chosen_features)

        return chosen_features

    def getBigram(self, all_document):
        bigrams = {}
        words = []

        for words_in_document, category in all_document:
            #tokens = nltk.word_tokenize(words_in_document)
            words += ngrams(words_in_document, 2)
        
        for word in words:
            bigrams[word] = bigrams.get(word, 0) + 1

        for word, count in bigrams.items():
            if count >= 7: #SABI SA DOCU AT LEAST 7
                self.features['bigram'].append(word)

        #print(self.features['bigram'][:100])
        return self.features['bigram'][:16165]

    def getChosenFeaturesBigram(self, all_document):
        chosen_features = np.zeros((len(all_document), len(self.features['bigram'])))

        i = 0
        for words_in_document, category in all_document:
            words = []
            #tokens = nltk.word_tokenize(words_in_document)
            words += ngrams(words_in_document, 2)
            for word in set(words):
                try:
                    chosen_features[i][self.features['bigram'].index(word)] = 1
                except: #PAG WALA
                    pass
            i += 1

        #print(chosen_features)
        return chosen_features

    def getUnigramBigram(self):
        self.features[unigram_bigram] = self.features['unigram'] + self.features['bigram']
    
    def getChosenFeaturesUniBigram(self, all_document, features):
        chosen_features = np.zeros((len(all_document), len(features)))

        i = 0
        for words_in_document, category in all_document:
            words = []
            #tokens = nltk.word_tokenize(words_in_document)
            words += ngrams(words_in_document, 2)
            for word in set(words):
                try:
                    chosen_features[i][features.index(word)] = 1
                except: #PAG WALA
                    pass
            i += 1

        #print(chosen_features)
        return chosen_features

    def getTopUnigrams(self, all_document):
        frequencies = Counter(self.features['unigram'])
        self.features['topunigrams'] = sorted(frequencies, key=frequencies.get, reverse=True)[:2633]

        chosen_features = np.zeros((len(all_document), len(self.features['topunigrams'])))

        i = 0
        for words_in_document, category in all_document:
            for word in set(words_in_document):
                try:
                    chosen_features[i][self.features['topunigrams'].index(word)] = 1
                except: #PAG WALA
                    pass
            i += 1

        #print(chosen_features)
        return chosen_features

    def getUnigramPos(self, all_document):
        unigramspos = {}

        for words_in_document, category in all_document:
            for word, pos in nltk.pos_tag(words_in_document):
                unigramspos[word + "-" + pos] = unigramspos.get(word + "-" + pos, 0) + 1

        for word, count in unigramspos.items():
            if count >= 4:
                self.features['unigram_pos'].append(word)

        #print(len(self.features['unigram_pos']))
        
        return self.features['unigram_pos']

    def getChosenFeaturesPos(self, all_document):
        chosen_features = np.zeros((len(all_document), len(self.features['unigram_pos'])))

        i = 0
        for words_in_document, category in all_document:
            for word, pos in nltk.pos_tag(words_in_document):
                try:
                    chosen_features[i][self.features['unigram_pos'].index(word + "-" + pos)] = 1
                except: #PAG WALA
                    pass
            i += 1

        #print(chosen_features)
        return chosen_features

    def getUnigramAdjective(self, all_document):
        for words_in_document, category in all_document:
            for word, pos in nltk.pos_tag(words_in_document):
                if pos in ['JJ', 'JJR', 'JJS']:
                    self.features['adjectives'].append(word)
        
        return self.features['adjectives']
    
    def getChosenFeaturesAdjective(self, all_documents):
        chosen_features = np.zeros((len(all_document), len(self.features['adjectives'])))
        
        i = 0
        for words_in_document, category in all_document:
            for word in set(words_in_document):
                try:
                    chosen_features[i][self.features['adjectives'].index(word)] = 1
                except: #PAG WALA
                    pass
            i += 1

        #print(chosen_features)
        return chosen_features

    def getUnigramPosition(self, all_document):
        unigramsposition = {}
        
        for words_in_document, category in all_document:
            splitPerQtr = len(words_in_document) // 4
            i = 0
            position = 0
            for word in words_in_document: #START
                if i < splitPerQtr:
                    unigramsposition[word + "-" + str(position)] = unigramsposition.get(word + "-" + str(position), 0) + 1
                if i >= splitPerQtr and i < len(words_in_document) - splitPerQtr: #MIDDLE
                    position = 1
                elif i <= len(words_in_document) - splitPerQtr: #END
                    position = 2
                i += 1 

        for word, count in unigramsposition.items():
            if count >= 4:
                self.features['unigram_position'].append(word)

        #print(len(self.features['unigram_position']))
        
        return self.features['unigram_position']

    def getChosenFeaturesPosition(self, all_document):
        chosen_features = np.zeros((len(all_document), len(self.features['unigram_position'])))

        i = 0
        for words_in_document, category in all_document:
            splitPerQtr = len(words_in_document) // 4
            j = 0
            position = 0
            for word in set(words_in_document):
                try:
                    if j < splitPerQtr:
                        chosen_features[i][self.features['unigram_position'].index(word + "-" + str(position))] = 1
                    if j >= splitPerQtr and j < len(words_in_document) - splitPerQtr: #MIDDLE
                        position = 1
                    elif j <= len(words_in_document) - splitPerQtr: #END
                        position = 2 
                    j += 1
                except: #PAG WALA
                    j += 1
                    pass
            i += 1

        #print(chosen_features)
        
        return chosen_features

In [None]:
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

import numpy as np

class Classify(object):
    def __init__(self):
        self.number_fold = 3

    def splitToMiniBatches(self, negative, positive, indexOfTest):
        batch = {'fold' : [], 'X' : [], 'Y' : []}

        split = len(negative) // self.number_fold

        #Create mini batches
        for i in range(self.number_fold):
            mergedlist = []

            batch['fold'].append(i)
            
            #Split to equal sized batch, maintaining balance from neg and pos
            neg = negative[i * split : i * split + split]
            #print(len(neg))
            pos = positive[i * split : i * split + split]
            #print(len(pos))
            
            mergedlist.extend(neg)
            mergedlist.extend(pos)
            batch['X'].append(mergedlist)
            #print(len(self.batch['X'][i]))

            batch['Y'].append(np.append(np.zeros(split), np.ones(split)))
            #print(len(self.batch['Y'][i]))

        batch['fold'] = np.array(batch['fold'])
        batch['X'] = np.array(batch['X'])
        batch['Y'] = np.array(batch['Y'])

        x_train = batch['X'][batch['fold'] != i]
        y_train = batch['Y'][batch['fold'] != i]

        x_test = batch['X'][i]
        y_test = batch['Y'][i]

        #print(x_test.shape)
        #print(y_test.shape)

        return x_train, x_test, y_train, y_test
    
    def naiveBayesClassifier(self, x_train, x_test, y_train, y_test):
        bernoulliNB = BernoulliNB()
        bernoulliNB.fit(x_train.reshape(x_train.shape[0] * x_train.shape[1], -1), np.ravel(y_train.reshape(y_train.shape[0] * y_train.shape[1], -1)))
        accuracy = accuracy_score(bernoulliNB.predict(x_test), y_test)
        
        #print("Accuracy for 1 fold in Naive Bayes: ", accuracy)

        return accuracy

    def SVMClassifier(self, x_train, x_test, y_train, y_test):
        linearSVM = LinearSVC()
        linearSVM.fit(x_train.reshape(x_train.shape[0] * x_train.shape[1], -1), np.ravel(y_train.reshape(y_train.shape[0] * y_train.shape[1], -1)))
        accuracy = accuracy_score(linearSVM.predict(x_test), y_test)
    
        #print("Accuracy for 1 fold in SVM: ", accuracy)

        return accuracy