<H1>Analyzing Movie Review Sentiment for Classification Purposes</H1>

<H2>Summary</H2>
The project analyzes the sentiment of movie review text and classifies each phrase on a zero to four scale (0,1,2,3,4), from most negative to most positive.  Sentiment classification is useful for more than parsing movie reviews and aggregating them into scores à la Metacritic or Rotten Tomatoes.  It can support political analysis, both of statements by politicians and of commentary by the public (e.g., Twitter responses).  Companies can use it to analyze what people are saying about their products.  Historians could use it to help judge what people thought during a given time period.  The potential uses are endless.

The original plan for this research was to use the NLTK libraries for the classification problem.  Unfortunately the NLTK libraries are not set up to utilize multi-core processors and were exceptionally slow at classifying.  Run time also grew much faster than the size of the training set: 3000 lines took about a minute, but 10000 took around <b>11</b> minutes.  What follows is the classification work that was tried before giving up and moving mostly to the SciKit libraries, which take advantage of multiprocessing and allowed the analysis to proceed much more quickly.  That research is in the file labeled 'V2'.

The data comes from a Kaggle competition (now closed) in which phrases from movie reviews were labeled with a sentiment for use as a training set.  The source is located at <A>https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/</a>

The training file for the work was structured like the following:

    PhraseId	SentenceId	Phrase	Sentiment
<ul style="list-style-type:none">
<li>1	1	A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .	1 </li>
<li>2	1	A series of escapades demonstrating the adage that what is good for the goose	2</li>
<li>3	1	A series	2</li>
<li>4	1	A	2</li>
<li>5	1	series	2</li>
<li>6	1	of escapades demonstrating the adage that what is good for the goose	2</li>
</ul>

The test file provided has the same structure, except that the Sentiment column is not included.
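As a small illustration of this layout, the sketch below parses a few of the sample rows shown above from an in-memory string, using only the standard library rather than the real file. The names sample_tsv, rows, label_counts and phrases_per_sentence are made up for the example; only the column names come from the data.

```python
import csv
import io
from collections import Counter

# Four rows in the same tab-separated layout as train.tsv (copied from the sample above)
sample_tsv = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA series of escapades demonstrating the adage that what is good for "
    "the goose is also good for the gander , some of which occasionally amuses "
    "but none of which amounts to much of a story .\t1\n"
    "3\t1\tA series\t2\n"
    "4\t1\tA\t2\n"
    "5\t1\tseries\t2\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# How many phrases carry each sentiment label (0 = most negative ... 4 = most positive)
label_counts = Counter(int(row["Sentiment"]) for row in rows)

# Sub-phrases of one sentence share its SentenceId
phrases_per_sentence = Counter(row["SentenceId"] for row in rows)

print(label_counts)
print(phrases_per_sentence)
```

Note that the full sentence and its sub-phrases are labeled separately, so the same words appear many times in the training data with different labels.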

<H2>Set-up</H2>
Import libraries and load the file.  Shuffle for training.

In [1]:
import time

import nltk
import pandas as pd
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_bigram_feats, extract_unigram_feats, mark_negation
from sklearn.utils import shuffle

# load the tab-separated training file
dfTxt = pd.read_csv('moviesentiment/train.tsv', sep = '\t')

# split each phrase into a list of word tokens and pair it with its sentiment label
tokens = [phrase.split() for phrase in dfTxt.Phrase]
lines = list(zip(tokens, dfTxt.Sentiment))

# we don't know the state of the imported data...shuffle to make sure it is random
lines = shuffle(lines, random_state=42)
    

<H2>Analytical Pipeline</H2>
Mark the words that follow a negation word as negated.  For example:

    "He was not a happy person" 
    
would be transformed into:

    "He was not a_NEG happy_NEG person_NEG"
    
Then build a unigram feature set from the words and their frequencies.  While working on this, min_freq was adjusted to speed up processing, but overall it remained very slow.
    

In [12]:
def sentimentPipeLine(lines):
    sentimAnalyzer = SentimentAnalyzer()
    # collect every word in the training lines, with negated words marked
    all_words_neg = sentimAnalyzer.all_words([mark_negation(line) for line in lines])
    # drop rare words using the min_freq threshold
    unigramFeats = sentimAnalyzer.unigram_word_feats(all_words_neg, min_freq=4)
    sentimAnalyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigramFeats)
    return sentimAnalyzer
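
To make the two steps of the pipeline concrete, here is a minimal plain-Python sketch of negation marking followed by frequency-filtered unigram features. It is a simplified illustration and not NLTK's implementation: the negation word list, the punctuation that ends a negation scope, the at-least-min_freq cutoff, and all function names are assumptions made for the example.

```python
from collections import Counter

# Simplified stand-ins chosen for illustration; not NLTK's lists or code
NEGATION_WORDS = {"not", "no", "never", "n't", "cannot"}
SCOPE_ENDERS = {".", ":", ";", "!", "?"}


def mark_negation_simple(tokens):
    """Append _NEG to every token after a negation word, until punctuation."""
    marked, in_scope = [], False
    for tok in tokens:
        if tok in SCOPE_ENDERS:
            in_scope = False
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            if tok.lower() in NEGATION_WORDS:
                in_scope = True
    return marked


def frequent_unigrams(token_lists, min_freq):
    """Words seen at least min_freq times across all token lists."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return sorted(w for w, c in counts.items() if c >= min_freq)


def unigram_features(tokens, unigrams):
    """One boolean feature per vocabulary word: is it present in this phrase?"""
    present = set(tokens)
    return {"contains({0})".format(w): (w in present) for w in unigrams}


docs = [
    "He was not a happy person".split(),
    "He was a happy person".split(),
    "He was not happy".split(),
]
marked = [mark_negation_simple(d) for d in docs]
vocab = frequent_unigrams(marked, min_freq=2)

print(marked[0])
print(vocab)
print(unigram_features(marked[1], vocab))
```

Raising min_freq shrinks the vocabulary, and with it the number of features computed for every phrase, which is why it was the natural knob to turn when trying to speed things up.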


<H2>Create Method to Classify</H2>
This function runs the pipeline above, applies the feature extractors to a training set and a test set, trains a Naive Bayes classifier on the training set, and then evaluates it on both sets to see how well it performs.

In [4]:
def runClassification(lines, size):
    # train on the first `size` lines and hold out the next 200 as a test set
    trainLines = lines[:size]
    testLines = lines[size:size+200]
    startTime = time.time()
    sentimAnalyzer = sentimentPipeLine(trainLines)
    trainSet = sentimAnalyzer.apply_features(trainLines)
    testSet = sentimAnalyzer.apply_features(testLines)
    trainer = NaiveBayesClassifier.train
    classifier = sentimAnalyzer.train(trainer, trainSet)

    evaluated = sentimAnalyzer.evaluate(trainSet)
    endTime = time.time()
    # print() with a single argument behaves the same under Python 2 and 3
    print('Time to train and evaluate: ' + str(endTime - startTime))
    print('TrainSetResults')
    for key,value in sorted(evaluated.items()):
        print('{0}: {1}'.format(key, value))

    print('TestSet Results')

    evaluated = sentimAnalyzer.evaluate(testSet)
    for key,value in sorted(evaluated.items()):
        print('{0}: {1}'.format(key, value))
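
The evaluation prints accuracy plus a precision, recall and F-measure for each of the five labels. As a reading aid, the sketch below computes the same kinds of scores by hand from a list of gold labels and a list of predicted labels. The function name, the toy label lists, and the choice to report 0.0 when a score is undefined are assumptions made for this example, not NLTK behavior.

```python
def per_label_scores(gold, predicted):
    """Return accuracy and a {label: (precision, recall, f_measure)} dict."""
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / float(len(gold))
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        n_predicted = sum(1 for p in predicted if p == label)
        n_gold = sum(1 for g in gold if g == label)
        # precision: of the phrases predicted as this label, how many were right
        precision = true_pos / float(n_predicted) if n_predicted else 0.0
        # recall: of the phrases that truly have this label, how many were found
        recall = true_pos / float(n_gold) if n_gold else 0.0
        total = precision + recall
        # F-measure: harmonic mean of precision and recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_measure)
    return accuracy, scores


# Toy labels, invented for the example
gold = [2, 2, 2, 1, 3, 4, 0, 2]
predicted = [2, 2, 1, 1, 2, 4, 2, 2]

accuracy, scores = per_label_scores(gold, predicted)
print("Accuracy: {0}".format(accuracy))
for label, (precision, recall, f_measure) in sorted(scores.items()):
    print("[{0}] precision={1:.3f} recall={2:.3f} f={3:.3f}".format(
        label, precision, recall, f_measure))
```

In the recorded outputs in this notebook, label 2 consistently shows recall well above precision, which is the pattern a classifier produces when it predicts the most common label too often.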

<H2>Run Classifier</H2>
Initial results are poor, but only 3000 lines were used for training.  More lines would likely improve them, but because of the slow performance we leave it at 3000 here and run a larger sample later.

In [5]:
runClassification(lines,3000)

Training classifier
Evaluating NaiveBayesClassifier results...
Time to train and evaluate: 53.3659999371
TrainSetResults
Accuracy: 0.617
F-measure [0]: 0.404255319149
F-measure [1]: 0.45549132948
F-measure [2]: 0.738652382295
F-measure [3]: 0.427010923535
F-measure [4]: 0.481605351171
Precision [0]: 0.398601398601
Precision [1]: 0.567723342939
Precision [2]: 0.646913580247
Precision [3]: 0.576407506702
Precision [4]: 0.642857142857
Recall [0]: 0.410071942446
Recall [1]: 0.380308880309
Recall [2]: 0.860709592641
Recall [3]: 0.339116719243
Recall [4]: 0.385026737968
TestSet Results
Evaluating NaiveBayesClassifier results...
Accuracy: 0.485
F-measure [0]: 0.315789473684
F-measure [1]: 0.204081632653
F-measure [2]: 0.618644067797
F-measure [3]: 0.268656716418
F-measure [4]: 0.48275862069
Precision [0]: 0.375
Precision [1]: 0.357142857143
Precision [2]: 0.493243243243
Precision [3]: 0.45
Precision [4]: 0.7
Recall [0]: 0.272727272727
Recall [1]: 0.142857142857
Recall [2]: 0.829545454545
Reca

<H2>Standardizing Case</H2>
We lowercased all words to normalize them, hoping to improve performance by reducing the number of distinct words in the feature sets.  It helped only minimally with speed (about 52 seconds versus 53), and test accuracy dipped slightly, from 0.485 to 0.47.
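
The effect lowercasing has on the feature set can be seen on a handful of tokens. The sketch below is only an illustration: the phrases are invented from words in the sample data, and the helper name vocabulary is made up.

```python
def vocabulary(token_lists):
    """Set of distinct tokens across all phrases."""
    return {tok for toks in token_lists for tok in toks}


phrases = [
    ["A", "series", "of", "escapades"],
    ["a", "Series"],
    ["The", "series"],
    ["the", "goose"],
]
lowered = [[tok.lower() for tok in toks] for toks in phrases]

print("{0} distinct tokens before, {1} after lowercasing".format(
    len(vocabulary(phrases)), len(vocabulary(lowered))))
```

The saving depends on how many words appear in more than one casing, which in running text is mostly limited to sentence-initial words and names.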

In [6]:
#lowercase string in lines
lines = [([s.lower() for s in line[0]],line[1]) for line in lines]
runClassification(lines,3000)

Training classifier
Evaluating NaiveBayesClassifier results...
Time to train and evaluate: 52.3480000496
TrainSetResults
Accuracy: 0.614333333333
F-measure [0]: 0.404181184669
F-measure [1]: 0.456546929316
F-measure [2]: 0.73556744579
F-measure [3]: 0.419678714859
F-measure [4]: 0.481848184818
Precision [0]: 0.391891891892
Precision [1]: 0.571014492754
Precision [2]: 0.643666830951
Precision [3]: 0.577348066298
Precision [4]: 0.629310344828
Recall [0]: 0.41726618705
Recall [1]: 0.380308880309
Recall [2]: 0.858081471748
Recall [3]: 0.329652996845
Recall [4]: 0.390374331551
TestSet Results
Evaluating NaiveBayesClassifier results...
Accuracy: 0.47
F-measure [0]: 0.3
F-measure [1]: 0.196078431373
F-measure [2]: 0.610169491525
F-measure [3]: 0.25
F-measure [4]: 0.413793103448
Precision [0]: 0.333333333333
Precision [1]: 0.3125
Precision [2]: 0.486486486486
Precision [3]: 0.470588235294
Precision [4]: 0.6
Recall [0]: 0.272727272727
Recall [1]: 0.142857142857
Recall [2]: 0.818181818182
Recall

[(["'s", 'as', 'sorry'], 2),
 (['romantic',
   'comedy',
   'and',
   'dogme',
   '95',
   'filmmaking',
   'may',
   'seem',
   'odd',
   'bedfellows',
   ',',
   'but',
   'they',
   'turn',
   'out',
   'to',
   'be',
   'delightfully',
   'compatible',
   'here'],
  4),
 (['of', 'these', 'days'], 2),
 (['flinch', 'from', 'its', 'unsettling', 'prognosis'], 2),
 (['are', 'clinically', 'depressed'], 1),
 (['just',
   'too',
   'silly',
   'and',
   'sophomoric',
   'to',
   'ensnare',
   'its',
   'target',
   'audience',
   '.'],
  1),
 (['thoughtful', 'without', 'having', 'much', 'dramatic', 'impact'], 1),
 (['manage', 'to', 'be', 'spectacularly', 'outrageous', '.'], 3),
 (['hunk'], 4),
 (['forgotten', '10', 'minutes'], 1),
 (['the',
   'slapstick',
   'is',
   'labored',
   ',',
   'and',
   'the',
   'bigger',
   'setpieces',
   'flat',
   '.'],
  0),
 (['fusty'], 2),
 (['disguise',
   'the',
   'fact',
   'that',
   'the',
   'new',
   'film',
   'is',
   'a',
   'lame',
   'kidd

<H2>Stemming</H2>
Stemming is run below.  Stemming normalizes words by stripping affixes (mostly suffixes) so as to improve training and classification.  The idea is that if we use only the stems of words, training will work better, as there will likely be more instances of a given stem to train on than of each of its variations with different affixes.

Test accuracy rose from 0.47 to 0.49 after stemming, so stemming appears to help a little.  With only 200 test lines, though, that difference amounts to four phrases.
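
To show why stemming gives the classifier more examples per feature, here is a deliberately crude suffix stripper. It is a toy written for this illustration, far simpler than the Porter algorithm used in the cell below, and the suffix list and minimum stem length are arbitrary assumptions.

```python
def toy_stem(word):
    """Strip one common English suffix; a crude stand-in, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["amuses", "amused", "amusing", "series", "goose"]
stems = [toy_stem(w) for w in words]

print(stems)
print("{0} distinct words -> {1} distinct stems".format(len(set(words)), len(set(stems))))
```

The three forms of "amuse" collapse into a single feature, so evidence about their sentiment is pooled. The same example also shows the cost: "series" is cut down to a stem that is not a word, which is harmless for counting but makes the features harder to read.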

In [15]:
from nltk.stem import PorterStemmer

# create the stemmer and keep a reference to it
stemmer = PorterStemmer()
stemmedLines = [([stemmer.stem(s) for s in line[0]],line[1]) for line in lines]
runClassification(stemmedLines,3000)

Training classifier
Evaluating NaiveBayesClassifier results...
Time to train and evaluate: 65.3899998665
TrainSetResults
Accuracy: 0.631333333333
F-measure [0]: 0.43321299639
F-measure [1]: 0.484918793503
F-measure [2]: 0.743357829282
F-measure [3]: 0.457425742574
F-measure [4]: 0.504792332268
Precision [0]: 0.434782608696
Precision [1]: 0.607558139535
Precision [2]: 0.652281746032
Precision [3]: 0.614361702128
Precision [4]: 0.626984126984
Recall [0]: 0.431654676259
Recall [1]: 0.403474903475
Recall [2]: 0.863994743758
Recall [3]: 0.364353312303
Recall [4]: 0.422459893048
TestSet Results
Evaluating NaiveBayesClassifier results...
Accuracy: 0.49
F-measure [0]: 0.117647058824
F-measure [1]: 0.222222222222
F-measure [2]: 0.626086956522
F-measure [3]: 0.314285714286
F-measure [4]: 0.551724137931
Precision [0]: 0.166666666667
Precision [1]: 0.315789473684
Precision [2]: 0.507042253521
Precision [3]: 0.478260869565
Precision [4]: 0.8
Recall [0]: 0.0909090909091
Recall [1]: 0.171428571429
Re

<H2>Different Stemmer Algorithm</H2>
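Here we try a different stemmer to see if it returns better results on our test data.  The output recorded below matches the Porter run digit for digit, but that is an artifact of how the cell was first run: the Lancaster stems were assigned to stemmedlines while the classifier was given stemmedLines, so the Porter-stemmed data was simply classified a second time.  These numbers therefore say nothing about whether the choice of stemmer matters.  The cell below stores the Lancaster stems under their own name, which also keeps the Porter-stemmed stemmedLines intact for the later cells; it needs to be rerun to get a real comparison.

In [16]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
lancasterLines = [([stemmer.stem(s) for s in line[0]],line[1]) for line in lines]
runClassification(lancasterLines, 3000)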

Training classifier
Evaluating NaiveBayesClassifier results...
Time to train and evaluate: 64.6100001335
TrainSetResults
Accuracy: 0.631333333333
F-measure [0]: 0.43321299639
F-measure [1]: 0.484918793503
F-measure [2]: 0.743357829282
F-measure [3]: 0.457425742574
F-measure [4]: 0.504792332268
Precision [0]: 0.434782608696
Precision [1]: 0.607558139535
Precision [2]: 0.652281746032
Precision [3]: 0.614361702128
Precision [4]: 0.626984126984
Recall [0]: 0.431654676259
Recall [1]: 0.403474903475
Recall [2]: 0.863994743758
Recall [3]: 0.364353312303
Recall [4]: 0.422459893048
TestSet Results
Evaluating NaiveBayesClassifier results...
Accuracy: 0.49
F-measure [0]: 0.117647058824
F-measure [1]: 0.222222222222
F-measure [2]: 0.626086956522
F-measure [3]: 0.314285714286
F-measure [4]: 0.551724137931
Precision [0]: 0.166666666667
Precision [1]: 0.315789473684
Precision [2]: 0.507042253521
Precision [3]: 0.478260869565
Precision [4]: 0.8
Recall [0]: 0.0909090909091
Recall [1]: 0.171428571429
Re

<H2>Toooo Slow</H2>
Now that we have trained with some very small samples of the dataset (156k lines in total), let's try a somewhat bigger training size: 10k.  Unfortunately this took roughly ten times longer (about 673 seconds versus 65) with only slightly more than 3 times as much training data.  Test accuracy did improve slightly (0.49 to 0.51), and no doubt it would improve more with more data, if we could spend the computation time.  But with the full training set about 15 times bigger still, and the growth in run time faster than linear, this method of classification looks impractical.  Therefore we try the SciKit libraries (together with some NLTK libraries) in the other file.
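
A rough way to quantify the slowdown is to fit a power law through the two recorded timings and extrapolate. The sketch below does that arithmetic. It assumes run time behaves like c * n**k, which two data points cannot confirm, and it uses 156000 as an approximate size for the full training set, so the result is an order-of-magnitude guess at best.

```python
import math

# Timings copied from the recorded runs on the stemmed data
small_n, small_secs = 3000, 65.39
large_n, large_secs = 10000, 672.63
full_n = 156000  # approximate size of the full training set

# If time ~ c * n**k, then k = log(t2 / t1) / log(n2 / n1)
exponent = math.log(large_secs / small_secs) / math.log(float(large_n) / small_n)

# Extrapolate the same power law from the 10k run to the full training set
full_secs = large_secs * (float(full_n) / large_n) ** exponent

print("estimated exponent: {0:.2f}".format(exponent))
print("estimated full-set time: {0:.0f} hours".format(full_secs / 3600.0))
```

The exponent comes out close to 2, so the growth looks roughly quadratic rather than exponential, and the extrapolated time for the full set is on the order of tens of hours. One plausible reason is that both the number of lines and the number of unigram features grow with the training set.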

In [17]:
runClassification(stemmedLines, 10000)

Training classifier
Evaluating NaiveBayesClassifier results...
Time to train and evaluate: 672.63499999
TrainSetResults
Accuracy: 0.6226
F-measure [0]: 0.43254376931
F-measure [1]: 0.474363215349
F-measure [2]: 0.736106256554
F-measure [3]: 0.478085351788
F-measure [4]: 0.471663619744
Precision [0]: 0.426829268293
Precision [1]: 0.58674304419
Precision [2]: 0.651004636785
Precision [3]: 0.617734724292
Precision [4]: 0.544303797468
Recall [0]: 0.438413361169
Recall [1]: 0.398112159911
Recall [2]: 0.846803377563
Recall [3]: 0.389934148636
Recall [4]: 0.416129032258
TestSet Results
Evaluating NaiveBayesClassifier results...
Accuracy: 0.51
F-measure [0]: 0.235294117647
F-measure [1]: 0.328358208955
F-measure [2]: 0.676056338028
F-measure [3]: 0.4
F-measure [4]: 0.0869565217391
Precision [0]: 0.222222222222
Precision [1]: 0.34375
Precision [2]: 0.595041322314
Precision [3]: 0.551724137931
Precision [4]: 0.111111111111
Recall [0]: 0.25
Recall [1]: 0.314285714286
Recall [2]: 0.782608695652
Re

In [16]:
# rawLines = [line[0] for line in lines]
# bigrams = sentim_analyzer.bigram_collocation_feats(rawLines, min_freq=40)

In [18]:
def sentimentPipeLineBigram(lines):
    sentimAnalyzer = SentimentAnalyzer()
    all_words_neg = sentimAnalyzer.all_words([mark_negation(line) for line in lines])
    unigramFeats = sentimAnalyzer.unigram_word_feats(all_words_neg, min_freq=4)
    sentimAnalyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigramFeats)
    rawLines = [line[0] for line in lines]
    bigrams = sentimAnalyzer.bigram_collocation_feats(rawLines, min_freq=10)
    sentimAnalyzer.add_feat_extractor(extract_bigram_feats, bigrams=bigrams)
    return(sentimAnalyzer)
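
The pipeline above adds bigram features on top of the unigrams. As a rough picture of what a bigram feature is, the sketch below counts adjacent word pairs and keeps the frequent ones. This is a simplification that uses plain frequency, whereas the NLTK helper selects collocations; the function names and example phrases are made up for the illustration.

```python
from collections import Counter


def adjacent_pairs(tokens):
    """All pairs of neighbouring tokens, in order."""
    return list(zip(tokens, tokens[1:]))


def frequent_bigrams(token_lists, min_freq):
    """Adjacent pairs seen at least min_freq times across all phrases."""
    counts = Counter(pair for toks in token_lists for pair in adjacent_pairs(toks))
    return sorted(pair for pair, c in counts.items() if c >= min_freq)


def bigram_features(tokens, bigrams):
    """One boolean feature per kept bigram: does it occur in this phrase?"""
    present = set(adjacent_pairs(tokens))
    return {"contains({0} - {1})".format(a, b): ((a, b) in present) for a, b in bigrams}


docs = [
    "good for the goose".split(),
    "good for the gander".split(),
    "not good".split(),
]
bigrams = frequent_bigrams(docs, min_freq=2)

print(bigrams)
print(bigram_features(docs[0], bigrams))
print(bigram_features(docs[2], bigrams))
```

Bigrams let the classifier see short word sequences that unigrams split apart, at the price of more features to compute for every phrase.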

In [19]:
def runClassificationBigram(lines, size):
    trainLines = lines[:size]
    testLines = lines[size:size+200]
    startTime = time.time()
    
    sentimAnalyzer = sentimentPipeLineBigram(trainLines)     
    trainSet = sentimAnalyzer.apply_features(trainLines)
    testSet = sentimAnalyzer.apply_features(testLines)    
    trainer = NaiveBayesClassifier.train
    classifier = sentimAnalyzer.train(trainer, trainSet)

    evaluated = sentimAnalyzer.evaluate(trainSet)
    endTime = time.time()
    # print() with a single argument behaves the same under Python 2 and 3
    print('Time to train and evaluate: ' + str(endTime - startTime))
    print('TrainSetResults')
    for key,value in sorted(evaluated.items()):
        print('{0}: {1}'.format(key, value))

    print('TestSet Results')
    evaluated = sentimAnalyzer.evaluate(testSet)
    for key,value in sorted(evaluated.items()):
        print('{0}: {1}'.format(key, value))

In [21]:

runClassificationBigram(stemmedLines, 3000)


Training classifier
Evaluating NaiveBayesClassifier results...
Time to train and evaluate: 82.756000042
TrainSetResults
Accuracy: 0.631
F-measure [0]: 0.450909090909
F-measure [1]: 0.487917146145
F-measure [2]: 0.739290780142
F-measure [3]: 0.46368159204
F-measure [4]: 0.509202453988
Precision [0]: 0.455882352941
Precision [1]: 0.603988603989
Precision [2]: 0.650524213679
Precision [3]: 0.628032345013
Precision [4]: 0.597122302158
Recall [0]: 0.446043165468
Recall [1]: 0.409266409266
Recall [2]: 0.856110381078
Recall [3]: 0.367507886435
Recall [4]: 0.44385026738
TestSet Results
Evaluating NaiveBayesClassifier results...
Accuracy: 0.475
F-measure [0]: 0.210526315789
F-measure [1]: 0.22641509434
F-measure [2]: 0.606060606061
F-measure [3]: 0.268656716418
F-measure [4]: 0.533333333333
Precision [0]: 0.25
Precision [1]: 0.333333333333
Precision [2]: 0.48951048951
Precision [3]: 0.45
Precision [4]: 0.727272727273
Recall [0]: 0.181818181818
Recall [1]: 0.171428571429
Recall [2]: 0.7954545454

In [29]:
import nltk

# fetch the lexicon that the VADER analyzer needs
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

sentences = dfTxt.Phrase
sentiments = dfTxt.Sentiment


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\todddesktop\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
