# Movie Review Sentiment Prediction

This report was written in a Jupyter notebook and converted to PDF. If you have Jupyter installed, you can run the file report.ipynb directly.

NLTK is used to tokenize the reviews and to count how often each word occurs. For installation instructions, see the README.

In [1]:
import os
from collections import Counter
from nltk.tokenize import RegexpTokenizer

### Preprocessing

Use os.listdir to find all file names, then iterate through them and read each file into a string, omitting line breaks. For the positive reviews, apostrophes are removed as well.

In [2]:
reviews_dir_pos = 'review_polarity/txt_sentoken/pos'
reviews_dir_neg = 'review_polarity/txt_sentoken/neg'
pos_reviews = os.listdir(reviews_dir_pos)
neg_reviews = os.listdir(reviews_dir_neg)

positive_str = []
negative_str = []
# read in positive reviews
for review in pos_reviews:
    with open(os.path.join(reviews_dir_pos, review), 'r') as file:
        review_str = file.read().replace('\n', '').replace("'", '')
        positive_str.append(review_str)
        
# read in negative reviews (only line breaks are removed here)
for review in neg_reviews:
    with open(os.path.join(reviews_dir_neg, review), 'r') as file:
        review_str = file.read().replace('\n', '')
        negative_str.append(review_str)
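
The two loops above differ only in the directory they read and in which characters they strip. A minimal sketch of a shared helper that applies the same cleaning to both classes is shown below. The names `load_reviews` and `strip_chars` are hypothetical and not part of the original notebook, and the counts reported later in this report were produced by the loops above, not by this helper.

```python
import os
import tempfile


def load_reviews(directory, strip_chars=("\n", "'")):
    """Read every file in `directory` into one string, dropping `strip_chars`."""
    texts = []
    for name in sorted(os.listdir(directory)):  # sorted for a reproducible order
        with open(os.path.join(directory, name), "r") as handle:
            text = handle.read()
        for ch in strip_chars:
            text = text.replace(ch, "")
        texts.append(text)
    return texts


# Tiny self-contained demo on a temporary directory (not the real dataset).
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "cv000.txt"), "w") as handle:
        handle.write("it's a good film \nreally \n")
    demo_reviews = load_reviews(demo_dir)

print(demo_reviews)
```

Called once per class, for example `load_reviews(reviews_dir_pos)` and `load_reviews(reviews_dir_neg)`, this would give both classes identical preprocessing.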

Get a list of all words, drop the 30 most common ones, and keep only the words that occur more than 500 times as features.


In [3]:
import pprint

mega_str = ''.join(positive_str + negative_str)
tokenizer = RegexpTokenizer(r'\w+')
all_count = Counter(tokenizer.tokenize(mega_str))

pp = pprint.PrettyPrinter(width=80, compact=True)
print('The most common {} words in all reviews are: '.format(30))
top_common = all_count.most_common(30)
pp.pprint(top_common)

# delete the 30 most common words
for key in list(zip(*top_common))[0]:
    del all_count[key]

# still too many features: keep only words that occur more than 500 times
feature_count = {key: value for key, value in all_count.items() if value > 500}

print('============================================')
print('Total number of unique words left in the word set: {}'.format(len(feature_count)))

feature_keys = list(feature_count.keys())

The most common 30 words in all reviews are: 
[('the', 76528), ('a', 38100), ('and', 35576), ('of', 34123), ('to', 31937),
 ('is', 25195), ('in', 21822), ('that', 15566), ('it', 14200), ('as', 11378),
 ('with', 10792), ('for', 9961), ('his', 9587), ('this', 9577), ('film', 9196),
 ('s', 9077), ('but', 8634), ('he', 8267), ('i', 8259), ('on', 7382),
 ('are', 6949), ('by', 6261), ('be', 6173), ('one', 5816), ('an', 5744),
 ('movie', 5665), ('not', 5577), ('who', 5548), ('from', 4999), ('at', 4986)]
Total number of unique words left in the word set: 252
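
To make the selection rule concrete, here is a small sketch on a toy sentence. It assumes that `re.findall(r"\w+", text)` is an acceptable stand-in for `RegexpTokenizer(r'\w+')`, so that the example needs only the standard library. The function name `select_features` and the toy thresholds are illustrative.

```python
import re
from collections import Counter


def select_features(text, drop_top, min_count):
    """Drop the `drop_top` most common words, keep words seen more than `min_count` times."""
    counts = Counter(re.findall(r"\w+", text))
    # most_common returns a list, so deleting from `counts` while looping is safe
    for word, _ in counts.most_common(drop_top):
        del counts[word]
    return sorted(word for word, n in counts.items() if n > min_count)


toy_text = "the plot is good the cast is good the end is bad the the"
toy_features = select_features(toy_text, drop_top=1, min_count=1)
print(toy_features)  # ['good', 'is']
```

In the notebook the same rule is applied with 30 dropped words and a threshold of 500, which leaves 252 feature words.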


At this stage, we have our feature list. Next, count how often each feature word occurs in every review.


In [4]:
pos_tokens = []
neg_tokens = []

for review in positive_str:
    count = dict(Counter(tokenizer.tokenize(review)))
    feature_dict = {}
    for key in feature_keys:
        if key in count:
            feature_dict[key] = count[key]
        else:
            feature_dict[key] = 0
    pos_tokens.append(feature_dict)
    
for review in negative_str:
    count = dict(Counter(tokenizer.tokenize(review)))
    feature_dict = {}
    for key in feature_keys:
        if key in count:
            feature_dict[key] = count[key]
        else:
            feature_dict[key] = 0
    neg_tokens.append(feature_dict)
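
The two loops above build the same kind of dictionary. A more compact sketch is shown below. It relies on the fact that a `Counter` returns 0 for a missing key, and it again uses `re.findall` as an assumed stand-in for the NLTK tokenizer. The helper name `to_feature_dict` is hypothetical.

```python
import re
from collections import Counter


def to_feature_dict(review, feature_keys):
    """Count each feature word in one review; absent words get a count of 0."""
    counts = Counter(re.findall(r"\w+", review))
    # Counter lookups return 0 for missing keys, so no if/else is needed
    return {key: counts[key] for key in feature_keys}


demo_keys = ["good", "bad", "plot"]
demo_vector = to_feature_dict("good plot , good cast", demo_keys)
print(demo_vector)  # {'good': 2, 'bad': 0, 'plot': 1}
```

With such a helper, `pos_tokens` could be written as a list comprehension over `positive_str`, and likewise for `neg_tokens`.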

The final step: assign the target label to each review (1 for positive, -1 for negative) and shuffle the data.

In [5]:
# '@' is a temporary key for the label; the \w+ tokenizer never produces it as a token
for entry in pos_tokens:
    entry['@'] = 1

for entry in neg_tokens:
    entry['@'] = -1

In [6]:
from random import shuffle

features = pos_tokens + neg_tokens
print('Total number of reviews: {}, each has {} features'.format(len(features),
                                                                 len(features[0])-1))

shuffle(features)

Total number of reviews: 2000, each has 252 features


In [7]:
# pop the temporary '@' label out of every entry, keeping the shuffled order
labels = [entry.pop('@') for entry in features]
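
The cells above keep each label inside its feature dictionary under the key '@' so that shuffling cannot separate a review from its label. An alternative sketch that avoids the temporary key is to shuffle (sample, label) pairs. The function name `shuffle_together` and the `seed` parameter are assumptions made for this illustration.

```python
import random


def shuffle_together(samples, targets, seed=None):
    """Shuffle two equally long, non-empty lists with the same permutation."""
    rng = random.Random(seed)  # a local generator leaves the global state untouched
    paired = list(zip(samples, targets))
    rng.shuffle(paired)
    shuffled_samples, shuffled_targets = zip(*paired)
    return list(shuffled_samples), list(shuffled_targets)


demo_samples = [{"good": 2}, {"good": 0}, {"good": 1}, {"good": 3}]
demo_targets = [1, -1, 1, -1]
xs, ys = shuffle_together(demo_samples, demo_targets, seed=0)
print(list(zip(xs, ys)))
```

Whatever order comes out, every dictionary is still paired with its original label.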

### Training the perceptron

The perceptron classifier takes arrays as input, so we need to turn each dictionary into a Python list. Training is set to stop after 5 epochs.

In [8]:
from learner.perceptron import Perceptron

features_list = [list(entry.values()) for entry in features]
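
`list(entry.values())` gives the same column order for every review only because each dictionary was built by iterating over `feature_keys` in the same order, and because dictionaries preserve insertion order (guaranteed from Python 3.7). A sketch that makes the column order explicit is shown below. The name `to_matrix` is hypothetical.

```python
def to_matrix(feature_dicts, feature_keys):
    """Turn feature dictionaries into rows whose columns follow `feature_keys`."""
    return [[entry[key] for key in feature_keys] for entry in feature_dicts]


demo_keys = ["good", "bad"]
# The second dictionary deliberately uses a different key order.
demo_dicts = [{"good": 2, "bad": 0}, {"bad": 1, "good": 0}]
demo_matrix = to_matrix(demo_dicts, demo_keys)
print(demo_matrix)  # [[2, 0], [0, 1]]
```

Indexing by key keeps the columns aligned even if a dictionary is later built in a different order.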

In [17]:
clf = Perceptron()
score_dict = clf.score(features_list, labels)
pp.pprint('Average results from 5 fold CV: {}'.format(score_dict))

("Average results from 5 fold CV: {'precision': 0.89983335676495457, "
 "'accuracy': 0.91649999999999987, 'f-beta': 0.91814742994313003, 'recall': "
 '0.93833168000005429}')
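
The `Perceptron` class from `learner.perceptron` is not shown in this report. As a rough illustration of what training for a fixed number of epochs involves, here is a minimal textbook perceptron with labels 1 and -1. The function names, the zero initialization, and the learning rate are assumptions and may differ from the author's implementation.

```python
def train_perceptron(rows, targets, epochs=5, learning_rate=1.0):
    """Train a perceptron by passing over the data `epochs` times."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, target in zip(rows, targets):
            activation = bias + sum(w * x for w, x in zip(weights, row))
            predicted = 1 if activation > 0 else -1
            if predicted != target:
                # Update only on mistakes: move the boundary towards the target.
                weights = [w + learning_rate * target * x for w, x in zip(weights, row)]
                bias += learning_rate * target
    return weights, bias


def predict(weights, bias, row):
    """Classify one row as 1 or -1."""
    return 1 if bias + sum(w * x for w, x in zip(weights, row)) > 0 else -1


# Toy data: the first column is high for positive rows, the second for negative rows.
demo_rows = [[2, 0], [3, 1], [0, 2], [1, 3]]
demo_labels = [1, 1, -1, -1]
w, b = train_perceptron(demo_rows, demo_labels)
demo_predictions = [predict(w, b, row) for row in demo_rows]
print(demo_predictions)  # [1, 1, -1, -1]
```

On this linearly separable toy set the weights stop changing after the first epoch. Real review data is unlikely to be separable, which is one reason to cap the number of epochs.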


The 5-fold CV yielded an average test accuracy of 0.916 and an F-beta score of 0.918 with beta equal to 1.
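
For reference, the four reported metrics can be computed from true and predicted labels as sketched below, using the standard definitions. With beta equal to 1, F-beta is the harmonic mean of precision and recall. How the `learner` module computes and averages them across folds is not shown in this report, so `binary_metrics` is an illustrative stand-in.

```python
def binary_metrics(y_true, y_pred, beta=1.0, positive=1):
    """Return accuracy, precision, recall and F-beta for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denominator = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f-beta": f_beta,
    }


demo_true = [1, 1, 1, -1, -1]
demo_pred = [1, 1, -1, 1, -1]
demo_scores = binary_metrics(demo_true, demo_pred)
print(demo_scores)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F-beta all equal 2/3 while accuracy is 3/5.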

### Training Naive Bayes

Unlike the perceptron classifier, the naive Bayes classifier takes dictionaries as input.

In [10]:
from learner.bayesian_learner import NaiveBayesClassifier

clf = NaiveBayesClassifier()
clf.train(features[:1500], labels[:1500])
print(clf.score(features, labels))

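{'precision': 0.5, 'accuracy': 0.5, 'f-beta': 0.66617541035360561, 'recall': 1.0}

A recall of 1.0 combined with a precision and accuracy of 0.5 on this balanced dataset indicates that the classifier labels every review as positive.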
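
The cell above trains on the first 1500 reviews but passes all 2000 reviews to `score`. If `score` evaluates the already trained model on exactly the data it receives (an assumption, since `NaiveBayesClassifier` is not shown in this report), the evaluation includes the training reviews. A minimal sketch of a hold-out split is shown below. The name `holdout_split` is hypothetical.

```python
def holdout_split(samples, targets, n_train):
    """Split into a training part (first `n_train` items) and a test part (the rest)."""
    train = (samples[:n_train], targets[:n_train])
    test = (samples[n_train:], targets[n_train:])
    return train, test


demo_samples = [{"good": i} for i in range(8)]
demo_targets = [1, -1] * 4
(train_x, train_y), (test_x, test_y) = holdout_split(demo_samples, demo_targets, n_train=6)
print(len(train_x), len(test_x))  # 6 2
```

Applied to this report, that would mean training on `features[:1500]` and scoring on `features[1500:]`. Because the data was shuffled earlier, both parts contain a mix of positive and negative reviews.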
