# SENTIMENT ANALYSIS OF MOVIE REVIEWS USING THE NLTK PACKAGE

### INTRODUCTION

This tutorial will introduce to you some basic methods for sentiment analysis of documents. Sentiment analysis is the process of determining whether a peice of text is positive, negative or neutral. It is often also called opinion mining. It can be particularly useful to find out the opinion of people about certain topics. 

### NATURAL LANGUAGE TOOLKIT

Natural Language Toolkit is an extremely useful library for dealing with human language data. This is most useful for text processing and text analysis and would  form the basis of my program.

### EXAMINING THE DATA  

The dataset called the sentence polarity dataset v1.0, has been taken from the website of Professor Bo Pang of Cornell University. It essentially consists of 5331 positive movie reviews and 5331 negative movie reviews. He has obtained the dataset from the movie review website Rotten Tomatoes. 



### THE SENTIMENT ANALYSIS PROCESS

The primary package that will be used for this process will be the nltk Package.

The end goal of this process is to assign a positive or negative sentiment to each movie review while evaulating multiple feature selection mechanisms and methods.

In [336]:
import pandas as pd 
from nltk.corpus import stopwords
import numpy as np
import re, math, collections, itertools
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.corpus import subjectivity
from nltk.sentiment.util import * 
import math
import string
from nltk import precision
from nltk import recall

In [337]:
posReviews = open('rt-polarity-pos.txt', 'rU')
negReviews = open('rt-polarity-neg.txt', 'rU')

For a better understanding of what the dataset looks like, I'll print out the first 5 reviews for each of the positive
reviews file and the negative reviews file.


In [338]:
with open("rt-polarity-pos.txt") as myfile:
    head = [next(myfile) for x in xrange(5)]
print ("Positive Reviews\n")
print head,'\n'
with open("rt-polarity-neg.txt") as myfile:
    head = [next(myfile) for x in xrange(5)]
print ("Negative Reviews \n")
print head


Positive Reviews

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \n', 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n', 'effective but too-tepid biopic\n', 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start . \n', "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . \n"] 

Negative Reviews 

['simplistic , silly and tedious . \n', "it's so laddish and juvenile , only teenage boys could possibly find it funny . \n", 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . \n', '[garbus] discards 

Processing the reviews to make them suitable for sentiment analysis. I'll  pass the reviews through the process  function, in which multiple operations would be applied to the text passed to it. 
    
    


In the process function, several operaitons are applied to the text fed to the function. Apart from removing the stopwords, all the punctuations are removed from the reviews. All the text is converted to the lower case. Also, the text is first lemmatized and then tokenized.  Lemmatization is the process of bringing the words to their root form. 


In [339]:
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    doc = text
    doc = doc.lower()
    doc = doc.replace("\'s",'')
    doc = doc.replace("\'",'')
    processed =''
    tokens1 = []
    
    exclude = set(string.punctuation)
    for ch in doc:
        if ch in exclude:
            processed = processed+ ' '
        else:
            processed = processed + ch

    tokens = nltk.word_tokenize(processed)
    new = stopwords.words('english')
    
    tokens = list(set(tokens) - set(new))
    
    for i in tokens:
        try:
            tokens1.append(str(lemmatizer.lemmatize(i)))
        except:
            pass
    return tokens1       

### FEATURE EVALUATION

Now we can start the process of feature evaluation.'Features' in the sentiment analysis context refer to the text that you are analyzing that would correlate to the labels. Features in this case are the words of the movie reviews. The first task after processing the reviews is to evaluate different feature selection mechanisms using different methods to get different subset of the features from the reviews and then evaluating those features. I will build up two variables with positive and negative features each in the next part. Those would contain the output of the feature selection mechanism depending on whether the review it would draw is positive or negative.

Below is a function written for feature evaluation. It essentially first builds up two lists, each with postive features of the positive reviews and negative features of the negative reviews while appending positive and negative to each review according to the dataset they belong to. Then training and test data sets are
built with 75 percent and 25 percent of the data respectively and a Naive Bayes classifier is trained. The classifier is then used to predict labels for the test dataset.  

In the latter part of the function, I iterate through the test dataset with i being an arbitrary identifier, feat being the features and labels(positive or negative), building up two sets, the labelset(actual labels) and testSet(predicted labels). This is done to assist in calculating various metrics for the classification which would help me in measuring how well the trained model did. Various metrics that are published are accuracy, precision and recall. 
Accuracy gives the percentage ratio of the number of reviews correctly labeled.  Precision gives the measure of positive predictions that are correct whereas recall gives a measure of how good a classifier is at detecting the positives.

In [340]:
def feature_evaluation(features):
    
    posReviews = open('rt-polarity-pos.txt', 'rU')
    negReviews = open('rt-polarity-neg.txt', 'rU')
    
    positiveFeatures = []
    negativeFeatures = []
    
    for i in posReviews:
        pWord = process(i)
        positiveWords = [features(pWord), 'positive']
        positiveFeatures.append(positiveWords)
    for i in negReviews:
        nWord = process(i)
        negativeWords = [features(nWord), 'negative']
        negativeFeatures.append(negativeWords)

    pCut = int(math.floor(len(positiveFeatures)*3/4))
    nCut = int(math.floor(len(negativeFeatures)*3/4))
    
    trainData = positiveFeatures[:pCut] + negativeFeatures[:nCut]
    testData = positiveFeatures[pCut:] + negativeFeatures[nCut:]

    #training a Naive Bayes Classifier
    nBClassifier = NaiveBayesClassifier.train(trainData)

    #initiating the label set and the test Set
    labelSet = collections.defaultdict(set)
    testSet = collections.defaultdict(set)

    #puts sentences with correct labels in the labeled Set and the predicted values in the testset
    for i, (feat, lab) in enumerate(testData):
        labelSet[lab].add(i)
        predictedValues = nBClassifier.classify(feat)
        testSet[predictedValues].add(i)	

    print 'training done on %d instances, testing done on %d instances' % (len(trainData), len(testData))
    print 'accuracy = ', nltk.classify.util.accuracy(nBClassifier, testData)
    print 'positive precision = ', precision(labelSet['positive'], testSet['positive'])
    print 'positive recall = ', recall(labelSet['positive'], testSet['positive'])
    print 'negative precision =', precision(labelSet['negative'], testSet['negative'])
    print 'negative recall = ', recall(labelSet['negative'], testSet['negative'])
    nBClassifier.show_most_informative_features(10)

The add_check function below is a basic feature selection mechanism which just builds up a dictionary adds a true boolean check as the value and words passed to it as keys. 

In [341]:
def add_check(words):
    return dict([(word, True) for word in words])

We will now check how effective this feature selection mechanism is.

In [342]:
feature_evaluation(add_check)


training done on 7996 instances, testing done on 2666 instances
accuracy =  0.763315828957
positive precision =  0.765105740181
positive recall =  0.759939984996
negative precision = 0.761549925484
negative recall =  0.766691672918
Most Informative Features
              engrossing = True           positi : negati =     17.0 : 1.0
                   quiet = True           positi : negati =     15.7 : 1.0
                mediocre = True           negati : positi =     13.7 : 1.0
               absorbing = True           positi : negati =     13.0 : 1.0
              refreshing = True           positi : negati =     13.0 : 1.0
                 triumph = True           positi : negati =     13.0 : 1.0
                portrait = True           positi : negati =     12.4 : 1.0
               inventive = True           positi : negati =     12.3 : 1.0
            refreshingly = True           positi : negati =     11.7 : 1.0
                    flat = True           negati : positi =     11.

The add_check mechanism gives us an accuracy of 76.33% which is a good measure. The precision and recall values are close to each other which means the model is predicting everything pretty evenly. 

The most informative features gives us what are the chances of a review being positive or negative with a particular word being in it. For example, if there is the word engrossing in a reviews, there is a 17 to 1 chance of the review being postive.

Next, another way of feature generation is attempted. In this attempted only N most informative features are generated. The N features that convey the most information. Hence, we need to calculate the information gain of each word. The scores_words function will do precisely just that. 

Two frequency distributions are used in the below function. One is a Frequency distribution which is used to store the frequency of all the words and the other is conditional frequency distribution which is used to separately store the frequencies of the positive and negative words.
Chi Squared tests are used to calculate the information gain. More can be read about Chi Squared tests here (http://www2.lv.psu.edu/jxm57/irp/chisquar.html). The postive information gain and negative information gain for  is calculated, added up and stored in a dictionary linked to the word which is then returned by this function. 

In [343]:
def scores_words():
    posReviews = open('rt-polarity-pos.txt', 'rU')
    negReviews = open('rt-polarity-neg.txt', 'rU')
    
    positiveWords = []
    negativeWords = []

    for i in posReviews:

        pWord = process(i)
        positiveWords.append(pWord)

    for i in negReviews:
        nWord = process(i)
        negativeWords.append(nWord)

    #making the positive word list and the negative word list iterable
    positiveWords = list(itertools.chain(*positiveWords))
    negativeWords = list(itertools.chain(*negativeWords))

    wordsFD = FreqDist()
    pnWordsFD = ConditionalFreqDist()

    for k in positiveWords:
        wordsFD[k] += 1
        pnWordsFD['positive'][k] += 1
    for word in negativeWords:
        wordsFD[k] += 1
        pnWordsFD['negative'][k] += 1

    #finding the count of negative and positive words and total word count 
    pWCount = pnWordsFD['positive'].N()
    nWCount = pnWordsFD['negative'].N()
    tWCount = pWCount + nWCount

    scores = {}
    for wd, frequency in wordsFD.iteritems():

        positiveScore = BigramAssocMeasures.chi_sq(pnWordsFD['positive'][wd], (frequency, pWCount), tWCount)
        negativeScore = BigramAssocMeasures.chi_sq(pnWordsFD['negative'][wd], (frequency, nWCount), tWCount)
        scores[wd] = positiveScore + negativeScore

    return scores



The find best words function below finds the best N features, given the set of all the words and scores and the value of N

In [344]:
scores = scores_words()

def findBestWords(scores, no):
    bestValues = sorted(scores.iteritems(), key=lambda (m, n): n, reverse=True)[:no]
    bestWords = set([m for m, n in bestValues])
    return bestWords
    pass

The best features function below will return true only for words that are in the best words list. 

In [345]:
#creates mechanism for feature selection that will only use the best words
def bestFeatures(wds):
    return dict([(wd, True) for wd in wds if wd in bestWords])
    pass

In the next block of code, I will attempt different number of features ranging from 2000 to 15000 to notice how does the accuracy vary with the number of top N informative words.

In [346]:


#no. of features to select

numbers = [2000, 5000, 10000, 15000]

for no in numbers:
  
    print 'evaluating the top %d word features' % (no)
    
    bestWords = findBestWords(scores, no)
    feature_evaluation(bestFeatures)
    

evaluating the top 2000 word features
training done on 7996 instances, testing done on 2666 instances
accuracy =  0.708552138035
positive precision =  0.664887307236
positive recall =  0.84096024006
negative precision = 0.783673469388
negative recall =  0.576144036009
Most Informative Features
              engrossing = True           positi : negati =     17.0 : 1.0
                   quiet = True           positi : negati =     15.7 : 1.0
               absorbing = True           positi : negati =     13.0 : 1.0
              refreshing = True           positi : negati =     13.0 : 1.0
                 triumph = True           positi : negati =     13.0 : 1.0
                portrait = True           positi : negati =     12.4 : 1.0
               inventive = True           positi : negati =     12.3 : 1.0
            refreshingly = True           positi : negati =     11.7 : 1.0
                 culture = True           positi : negati =     11.0 : 1.0
               affecting = Tru

### CONCLUSION

It is observed that accuracy for all the iterations of number of features tested is less than the initial accuracy that was achieved previously in our basic feature generation method where I used all the words. Though unexpected but this is possibly due to the fact that in the process function, the processing that is done removes the stop-words and punctuations in the cleaning up of the data and thus most un-important words have already been removed.

There are other possible ways of increasing the accuracy by better feature generation. There are various operations 
that can be done through within the NLTK. Other feature selection mechanisms can be used and other classification algorithms may be attempted.  There are other operations such as chunking, chinking, n-grams etc which might prove to be useful depending upon the use case. 

One such powerful feature of the NLTK library is the part of speech tagging of words. Part of speech tagging refers to assigning tags to the words whether they are nouns, verbs, adjectives etc. In this next part, part of speech tags will be assigned to the words in the reviews. Assuming that only verbs, nouns, adjectives and adverbs are significant for analyzing sentiment, the remaining parts of speech would be removed from the text to make 
sentiment score more accurate.

### RESOURCES

Various such NLTK operations can be referred to from StreamHacker which is a great source of information and has lots 
of relevant information especially related to Sentiment analysis. Another great resource is https://pythonprogramming.net/sentiment-analysis-module-nltk-tutorial/ which has mutiple useful tutorials and examples for using different features of the NLTK.