# Movie Review Sentiment Prediction

This report was written as a Jupyter notebook and converted to PDF, so if you have Jupyter installed, you can run the file report.ipynb directly.

We use nltk to tokenize the reviews and then count the occurrences of each word. For installation instructions, see the README.

In [39]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import os
from nltk.tokenize import RegexpTokenizer

### Preprocessing

Use os.listdir to list all file names, then iterate through them and read each file into a string, omitting line breaks and apostrophes.

In [None]:
reviews_dir_pos = 'review_polarity/txt_sentoken/pos'
reviews_dir_neg = 'review_polarity/txt_sentoken/neg'
pos_reviews = os.listdir(reviews_dir_pos)
neg_reviews = os.listdir(reviews_dir_neg)

positive_str = []
negative_str = []
# read in positive reviews
for review in pos_reviews:
    with open(os.path.join(reviews_dir_pos, review), 'r') as file:
        review_str = file.read().replace('\n', '').replace("'", '')
        positive_str.append(review_str)
        
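# read in negative reviews
# NOTE: apostrophes are not stripped here, unlike in the positive loop above,
# so contractions later split into tokens such as 's' and 't'
for review in neg_reviews:
    with open(os.path.join(reviews_dir_neg, review), 'r') as file:
        review_str = file.read().replace('\n', '')
        negative_str.append(review_str)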
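
The two loops above differ only in the directory they read and in whether apostrophes are stripped. As a minimal sketch, assuming the same cleaning is wanted for both classes, the reading step could be wrapped in one helper. The name read_reviews and the throwaway demo file are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def read_reviews(directory):
    """Read every file in `directory` into one cleaned string per review."""
    texts = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), 'r') as handle:
            # Same cleaning for every class: drop line breaks and apostrophes.
            texts.append(handle.read().replace('\n', '').replace("'", ''))
    return texts


# Tiny demonstration on a temporary directory (not the real data set).
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'cv000.txt'), 'w') as handle:
        handle.write("it's a fine film .\n")
    demo_reviews = read_reviews(demo_dir)

print(demo_reviews)
```

With such a helper, the positive and negative lists would each come from a single call on the corresponding directory.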

In [None]:
pos_mega_str = ''.join(positive_str)
neg_mega_str = ''.join(negative_str)

Tokenize the strings, removing punctuation at the same time.

In [40]:
tokenizer = RegexpTokenizer(r'\w+')
pos_tokens = tokenizer.tokenize(pos_mega_str)
neg_tokens = tokenizer.tokenize(neg_mega_str)
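
To see what the pattern \w+ keeps and discards, here is a small standard-library sketch on a made-up sentence. It assumes that re.findall with the same pattern behaves like RegexpTokenizer for this simple case. The sample sentence is illustrative only.

```python
import re

sample = "it's a great film, isn't it?"

# Every run of word characters becomes a token; punctuation is dropped.
tokens = re.findall(r'\w+', sample)
print(tokens)

# If apostrophes are stripped first, contractions stay in one piece.
stripped_tokens = re.findall(r'\w+', sample.replace("'", ''))
print(stripped_tokens)
```

An apostrophe is not a word character, so text that still contains apostrophes yields stray tokens such as 's' and 't'.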

The next step is to count the occurrences of each word and remove the 35 most frequent words, since they are unlikely to express sentiment.
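
As a minimal sketch of this step on a toy token list (the words and the cutoff of 2 are made up for illustration), note that Counter.most_common returns (word, count) pairs, so the word has to be unpacked before it can be used as a key for deletion.

```python
from collections import Counter

toy_tokens = ['the', 'film', 'the', 'plot', 'the', 'film', 'great']
toy_count = Counter(toy_tokens)

# most_common(n) yields (word, count) pairs; unpack the word before deleting.
for word, _ in toy_count.most_common(2):
    del toy_count[word]

print(sorted(toy_count.items()))
```

Deleting with the whole pair as the key would leave the counter unchanged, because Counter does not raise an error when asked to delete a missing key.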

In [73]:
import pprint
from collections import Counter

pos_count = Counter(pos_tokens)
neg_count = Counter(neg_tokens)

pp = pprint.PrettyPrinter(compact=True, width=50)

print('Top {} most common words in positive reviews: '.format(35))
pp.pprint(pos_count.most_common(35))
print('==============================')
print('Top {} most common words in negative reviews: '.format(35))
pp.pprint(neg_count.most_common(35))

# most_common returns (word, count) pairs, so delete by the word itself
for word, _ in pos_count.most_common(35):
    del pos_count[word]

for word, _ in neg_count.most_common(35):
    del neg_count[word]

print('==============================')
print('Total unique words in positive reviews: {}'.format(len(pos_count)))
print('Total unique words in negative reviews: {}'.format(len(neg_count)))

Top 35 most common words in positive reviews: 
[('the', 41470), ('a', 20190), ('and', 19896),
 ('of', 18636), ('to', 16517), ('is', 14059),
 ('in', 11725), ('that', 7763), ('as', 6478),
 ('it', 6444), ('with', 5851), ('his', 5588),
 ('for', 5260), ('film', 4909), ('this', 4647),
 ('but', 4492), ('he', 4339), ('on', 3724),
 ('are', 3713), ('i', 3472), ('by', 3466),
 ('its', 3169), ('an', 3052), ('be', 3028),
 ('one', 3016), ('not', 2926), ('who', 2913),
 ('from', 2731), ('has', 2564), ('at', 2495),
 ('was', 2477), ('her', 2456), ('movie', 2419),
 ('have', 2240), ('you', 2221)]
Top 35 most common words in negative reviews: 
[('the', 35058), ('a', 17910), ('and', 15680),
 ('of', 15487), ('to', 15420), ('is', 11136),
 ('in', 10097), ('s', 8854), ('that', 7803),
 ('it', 7756), ('with', 4941), ('this', 4930),
 ('as', 4900), ('i', 4787), ('for', 4701),
 ('film', 4287), ('but', 4142), ('his', 3999),
 ('he', 3928), ('on', 3658), ('t', 3555),
 ('movie', 3246), ('are', 3236), ('be', 3145),
 ('one