# Rotten Tomatoes Review Analysis

In this analysis, we use a Naive Bayes classifier to categorize Rotten Tomatoes reviews as fresh or rotten. We then look for the top 10 words that predict fresh and rotten reviews. Finally, we add a smoothing parameter to improve the accuracy of the model.

In [1]:
import numpy as np
import pandas as pd
import math
import warnings
from sklearn.metrics import confusion_matrix
np.seterr(divide = 'ignore') 
warnings.filterwarnings('ignore')

## 1 Explore and clean the data
First, we load the data, understand the structure and variables, summarize it, and perform data wrangling. We will mainly focus on two variables, `fresh` and `quote`, throughout the analysis.

In [2]:
data = pd.read_csv('rotten-tomatoes.csv.bz2')

In [3]:
data.sample(10)

Unnamed: 0,critic,fresh,imdb,link,publication,quote,review_date,rtid,title
13091,Deborah Young,fresh,151568,http://www.variety.com/review/VE1117752084.htm...,Variety,[A] beautifully crafted and lively romp around...,2008-06-17 00:00:00,13407,Topsy-Turvy
5114,Amy Biancolli,rotten,427968,http://www.chron.com/disp/story.mpl/ent/movies...,Houston Chronicle,I wish the film were true to itself and its qu...,2006-09-01 00:00:00,283051883,Trust the Man
3826,Dave Kehr,fresh,31725,http://onfilm.chicagoreader.com/movies/capsule...,Chicago Reader,The satire may be mostly a matter of easy cont...,2009-02-03 00:00:00,18487,Ninotchka
2187,Janet Maslin,rotten,107096,http://movies.nytimes.com/movie/review?res=9F0...,New York Times,"Mr. Stone tells this tale vigorously, but he h...",2004-06-05 00:00:00,14946,Heaven & Earth
8307,Roger Ebert,rotten,131857,http://www.rogerebert.com/reviews/baseketball-...,Chicago Sun-Times,It's not very funny and tries to buy laughs wi...,2000-01-01 00:00:00,13128,BASEketball
356,James Berardinelli,rotten,113321,http://www.reelviews.net/movies/h/home_holiday...,ReelViews,"Aside from a few effective, low-key scenes, th...",2000-01-01 00:00:00,10161,Home for the Holidays
8479,Janet Maslin,fresh,88161,http://movies.nytimes.com/movie/review?res=940...,New York Times,"Splash may feature a heroine with fins, but it...",2003-05-20 00:00:00,12345,Splash
7492,Keith Simanton,fresh,119094,http://community.seattletimes.nwsource.com/arc...,Seattle Times,"Face/Off is a full-blooded, movie-going experi...",2013-08-02 00:00:00,13172,Face/Off
7893,Jonathan Rosenbaum,rotten,79417,http://onfilm.chicagoreader.com/movies/capsule...,Chicago Reader,Misogynistic claptrap.,2006-12-13 00:00:00,11122,Kramer vs. Kramer
9834,Geoff Andrew,fresh,84602,http://www.timeout.com/film/reviews/76875/rock...,Time Out,"Learning, especially from Scorsese, in his app...",2006-02-09 00:00:00,12172,Rocky III


In [4]:
data.columns

Index(['critic', 'fresh', 'imdb', 'link', 'publication', 'quote',
       'review_date', 'rtid', 'title'],
      dtype='object')

In [5]:
fresh_evaluations = pd.DataFrame(data.fresh.value_counts()).rename(columns={'fresh':'counts'})
fresh_evaluations['percentages']=fresh_evaluations['counts']/sum(fresh_evaluations['counts'])

print('------Summary of the data-------')
print('1. There are %i missing values in fresh.' %data[data['fresh']=='none'].fresh.count())
print('2. There are three types of value in fresh/rotten evaluations. The corresponding counts and percentages are:')
print(fresh_evaluations, '\n')
print('3. There are %i zero-length in quote,' %data[data.quote.str.len()==0].quote.count(), 'and %i quotes with only whitespace.' %data[data.quote==' '].quote.count())
print('4. Length of quotes: minimum= %i' %min(data.quote.str.len()), ',maximum= %i' %max(data.quote.str.len()), ',and average= %i.' %data.quote.str.len().mean())
print('5. There are %i duplicate reviews.' %len(data[data.duplicated()]))

------Summary of the data-------
1. There are 23 missing values in fresh.
2. There are three types of value in fresh/rotten evaluations. The corresponding counts and percentages are:
        counts  percentages
fresh     8389     0.624089
rotten    5030     0.374200
none        23     0.001711 

3. There are 0 zero-length in quote, and 0 quotes with only whitespace.
4. Length of quotes: minimum= 4 ,maximum= 256 ,and average= 121.
5. There are 596 duplicate reviews.


In [6]:
#remove duplicate data
data = data[~data.duplicated()]
#remove none in fresh variable
data = data[data['fresh']!='none']

## 2 Naive Bayes
In the second part, we implement the Naive Bayes classifier and convert the data into bag-of-words.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
# define vectorizer
vectorizer = CountVectorizer(binary=True)

# vectorize the data. Note: this creates a sparse matrix;
# use .toarray() if you want a dense matrix.
X = vectorizer.fit_transform(data.quote.values)
X_a = X.toarray()

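# actual words
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead)
words = vectorizer.get_feature_names_out().tolist()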

# rating
y = data.fresh.values

# quote
quote = data['quote']

To fit and evaluate the model, we split the data into three sets: training, validation, and test. We keep 20% of the data as test data and split the remaining 80% into 80% training and 20% validation data.  

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, q_train, q_test = train_test_split(X_a, y, quote, test_size=0.2)
X_train, X_valid, y_train, y_valid, q_train, q_valid = train_test_split(X_train, y_train, q_train, test_size=0.2)
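
Applying a 20% hold-out twice leaves roughly 64% of the reviews for training, 16% for validation, and 20% for testing (the exact sizes depend on how `train_test_split` rounds). The arithmetic can be checked with a minimal sketch; `split_fractions` is a hypothetical helper, not part of the notebook.

```python
def split_fractions(test_frac=0.2, valid_frac=0.2):
    """Return (train, valid, test) fractions after two successive hold-out splits."""
    remaining = 1.0 - test_frac          # share left after removing the test set
    valid = remaining * valid_frac       # validation share of the whole dataset
    train = remaining - valid            # what is left for training
    return train, valid, test_frac

train, valid, test = split_fractions()
print(round(train, 2), round(valid, 2), round(test, 2))
```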

Now we turn to the training data.  
In the following part, we compute the unconditional (log) probability that a review is fresh or rotten: log Pr(F), stored in `lpr_F`, and log Pr(R), stored in `lpr_R`.  
These probabilities are based on the values of `fresh` alone, not on the words the quotes contain.

In [9]:
# calculate total numbers of fresh and rotten reviews
fresh = np.unique(y_train, return_counts=True)[1][0]
rotten = np.unique(y_train, return_counts=True)[1][1]

# log probability of Fresh review
lpr_F = math.log(fresh/len(y_train))
# log probability of rotten review
lpr_R = math.log(rotten/len(y_train))

print('The log probability for Fresh is', lpr_F, ', for Rotten is', lpr_R)

The log probability for Fresh is -0.4766059273453062 , for Rotten is -0.9699213761044944
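
The cell above relies on `np.unique` returning labels in sorted order, so that index 0 is `fresh` and index 1 is `rotten`. A slightly more defensive sketch looks the counts up by label instead; `log_priors` and the toy labels are hypothetical and only illustrate the idea. The last lines check that the two log priors printed above correspond to probabilities that sum to 1.

```python
import math
import numpy as np

def log_priors(labels):
    """Return {label: log P(label)} computed from an array of class labels."""
    values, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {str(v): math.log(c / total) for v, c in zip(values, counts)}

# Hypothetical toy labels, not taken from the dataset
toy_y = np.array(['fresh', 'rotten', 'fresh', 'fresh'])
priors = log_priors(toy_y)

# Sanity check on the values printed by the notebook
total_prob = math.exp(-0.4766059273453062) + math.exp(-0.9699213761044944)
```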


For each word w, we compute log Pr(w|F), stored in `lpr_wF`, and log Pr(w|R), stored in `lpr_wR`: the (log) probability that the word is present in a fresh or rotten review. Because the vectorizer is binary, each count is the number of reviews in that class that contain the word, and the probability is that count divided by the number of reviews in the class.

In [10]:
# create zero arrays to accumulate per-word counts for fresh and rotten reviews
fresh_wcount = np.zeros(len(X_train[0]))
rotten_wcount = np.zeros(len(X_train[0]))

# calculate frequency of individual word in fresh and rotten reviews
for i in range(len(y_train)):
    if y_train[i]=='fresh':
        fresh_wcount = fresh_wcount + X_train[i]
    else:
        rotten_wcount = rotten_wcount + X_train[i]

# log probability of word given Fresh review
lpr_wF = np.log(fresh_wcount/fresh)
# log probability of word given rotten review
lpr_wR = np.log(rotten_wcount/rotten)
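
The row-by-row loop above can also be written with a boolean mask, which is usually much faster on a dense matrix. This is a minimal sketch on a hypothetical toy matrix; `class_word_counts` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def class_word_counts(X, y, label):
    """Sum the rows of X whose label equals `label` (one count per vocabulary word)."""
    return X[y == label].sum(axis=0)

# Hypothetical 3-review, 4-word binary matrix
toy_X = np.array([[1, 0, 1, 0],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0]])
toy_y = np.array(['fresh', 'rotten', 'fresh'])

fresh_counts = class_word_counts(toy_X, toy_y, 'fresh')
rotten_counts = class_word_counts(toy_X, toy_y, 'rotten')
```

With the notebook's variables this would read `class_word_counts(X_train, y_train, 'fresh')`.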

After calculating the four probabilities, we've fitted our Naive Bayes model. For the next step, we use the model to predict the reviews using validation data. We compute the log-likelihood of being a fresh, `l_F`, or rotten, `l_R`, review for each quote in the validation dataset.  
We create two functions to make the calculation easier.

In [11]:
def log_calF(X_valid, lpr_wF, lpr_F):
    """Take in an array and output the log-likelihood of Fresh"""
    temp_pr_wF = X_valid*lpr_wF
    # change nan to 0
    temp_pr_wF = np.nan_to_num(temp_pr_wF)
    # change -0 to 0
    temp_pr_wF = np.where(np.nan_to_num(temp_pr_wF)==0, 0, temp_pr_wF)
    total = lpr_F + sum(temp_pr_wF)
    return total

def log_calR(X_valid, lpr_wR, lpr_R):
    """Take in an array and output the log-likelihood of Rotten"""
    temp_pr_wR = X_valid*lpr_wR
    # change nan to 0
    temp_pr_wR = np.nan_to_num(temp_pr_wR)
    # change -0 to 0
    temp_pr_wR = np.where(np.nan_to_num(temp_pr_wR)==0, 0, temp_pr_wR)
    total = lpr_R + sum(temp_pr_wR)
    return total

In [12]:
# apply the functions to the data and calculate the log-likelihood of each quote being a Fresh or a Rotten review
l_F = np.apply_along_axis(log_calF, 1, X_valid, lpr_wF, lpr_F)
l_R = np.apply_along_axis(log_calR, 1, X_valid, lpr_wR, lpr_R)
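
Without smoothing, a word that never occurs in a class during training has probability 0 and log probability `-inf`, so any quote containing such a word gets a log-likelihood of `-inf` for that class (the notebook's `nan_to_num` call turns this into a very large negative number instead). The sketch below shows the effect on hypothetical toy counts; `log_likelihood` is an illustrative helper that sums only over the words present in a quote.

```python
import numpy as np

# Hypothetical counts for a 3-word vocabulary; word 2 never appears in fresh reviews
fresh_counts = np.array([3.0, 1.0, 0.0])
n_fresh = 4

with np.errstate(divide='ignore'):
    lpr_wF_toy = np.log(fresh_counts / n_fresh)   # last entry is -inf

def log_likelihood(x, lpr_w, lpr_class):
    """Add the class log prior to the log probabilities of the words present in x."""
    return lpr_class + lpr_w[x == 1].sum()

prior = np.log(0.5)                               # hypothetical class prior
with_unseen = log_likelihood(np.array([1, 0, 1]), lpr_wF_toy, prior)
without_unseen = log_likelihood(np.array([1, 1, 0]), lpr_wF_toy, prior)
```

If a quote contains a word unseen in both classes, both log-likelihoods collapse in the same way and the comparison between them becomes a tie.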

Next, we compare the two log-likelihoods to decide whether each quote is predicted as a fresh or a rotten review: if `l_F` is larger than `l_R`, the quote is predicted as fresh; otherwise it is predicted as rotten.  
*For now we do not treat ties (`l_F` = `l_R`) separately, so they fall into the rotten class.*

In [13]:
def F_or_R(array):
    """Take in a boolean array (True where l_F > l_R) and return 'fresh' for True and 'rotten' for False.
    Ties (l_F == l_R) are False and are therefore labelled 'rotten'."""
    array = np.where(array, 'fresh', 'rotten')
    return array

In [14]:
# apply function to the data and return a prediction array
predicted = np.apply_along_axis(F_or_R, 0, l_F > l_R)

# print out how many cases with the same probabilities
print('There are %i quotes with the same probabilities of Fresh and Rotten. We\'ll ignore that first.' %sum(l_F==l_R))

# use confusion matrix and accuracy to show the performance of the model
from sklearn.metrics import confusion_matrix
print("Confusion matrix:\n",confusion_matrix(y_valid, predicted))
tn, fp, fn, tp = confusion_matrix(y_valid, predicted).ravel()
accuracy = (tn+tp)/(tn+tp+fn+fp)
print("Accuracy =",accuracy)

There are 912 quotes with the same probabilities of Fresh and Rotten. We'll ignore that first.
Confusion matrix:
 [[614 660]
 [203 575]]
Accuracy = 0.5794346978557505
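
To read the matrix: scikit-learn puts actual classes in rows and predicted classes in columns, with labels in sorted order, so row and column 0 are `fresh` and row and column 1 are `rotten`. The names `tn, fp, fn, tp` in the cell above therefore treat `rotten` as the positive class. A short sketch recomputes the accuracy and the per-class recall from the printed matrix.

```python
import numpy as np

# Confusion matrix printed above: rows = actual, columns = predicted,
# both ordered ('fresh', 'rotten')
cm = np.array([[614, 660],
               [203, 575]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions over all predictions
fresh_recall = cm[0, 0] / cm[0].sum()     # share of actual fresh reviews predicted fresh
rotten_recall = cm[1, 1] / cm[1].sum()    # share of actual rotten reviews predicted rotten

print(round(accuracy, 4), round(fresh_recall, 4), round(rotten_recall, 4))
```

Only about 48% of the fresh reviews are recovered, against about 74% of the rotten ones, which is consistent with tied quotes defaulting to rotten.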


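The accuracy is low. This is largely because, without smoothing, a word that never appears in a class during training gets zero probability, so the 912 validation quotes with tied log-likelihoods all default to rotten. We can improve the performance of the model in part 4 by adding a smoothing parameter.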

## 3 Interpretation
In this part, we'll interpret our prediction and try to find the top 10 words that best predict fresh and rotten reviews. To get a more informative result, we focus on words that are reasonably frequent: those that appear in more than 30 training reviews of the class in question.
### Top 10 words in Fresh and Rotten reviews

In [15]:
# zero out the counts of words that appear 30 times or fewer in each class
frequent_F = fresh_wcount * np.where((fresh_wcount >30)==True,1,0)
frequent_R = rotten_wcount * np.where((rotten_wcount >30)==True,1,0)

# log probability of Fresh review of frequent word
lpr_fwF = np.log(frequent_F/fresh)
# log probability of rotten review of frequent word
lpr_fwR = np.log(frequent_R/rotten)

In [16]:
# find top 10 words of fresh and rotten reviews
F_top10 = list(words[i] for i in np.argsort(lpr_fwF))[::-1][:10]
R_top10 = list(words[i] for i in np.argsort(lpr_fwR))[::-1][:10]
print('Top 10 words to predict Fresh reviews:',F_top10)
print('Top 10 words to predict Rotten reviews:',R_top10)

Top 10 words to predict Fresh reviews: ['the', 'and', 'of', 'is', 'to', 'it', 'in', 'that', 'with', 'film']
Top 10 words to predict Rotten reviews: ['the', 'and', 'of', 'to', 'is', 'it', 'in', 'that', 'but', 'this']
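
Ranking by log Pr(w|class) alone surfaces the most common words in each class, which is why the two lists nearly coincide. One alternative, sketched here on hypothetical toy counts, is to rank words by the log ratio log Pr(w|F) - log Pr(w|R), which is large only for words that are much more typical of one class than the other. `top_by_log_ratio` and its `min_count` filter are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def top_by_log_ratio(words, counts_a, n_a, counts_b, n_b, k=10, min_count=30):
    """Return the k words with the largest log P(w|a) - log P(w|b).

    Only words seen in both classes, and more than min_count times in total, are ranked.
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    keep = (counts_a + counts_b > min_count) & (counts_a > 0) & (counts_b > 0)
    idx = np.flatnonzero(keep)
    ratio = np.log(counts_a[idx] / n_a) - np.log(counts_b[idx] / n_b)
    order = idx[np.argsort(ratio)[::-1][:k]]
    return [words[i] for i in order]

# Hypothetical vocabulary and counts (100 reviews per class)
toy_words = ['the', 'great', 'dull']
toy_fresh = [90, 40, 5]
toy_rotten = [85, 8, 45]

fresh_ranking = top_by_log_ratio(toy_words, toy_fresh, 100, toy_rotten, 100, k=3)
rotten_ranking = top_by_log_ratio(toy_words, toy_rotten, 100, toy_fresh, 100, k=1)
```

In the toy data `the` is the most frequent word in both classes, yet it ranks in the middle because it is almost equally likely in either.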


The top 10 words consist mostly of stop words and are not informative, so we use `nltk` to remove stop words and find the top 10 remaining words for Fresh and Rotten reviews.

In [17]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = stopwords.words('english')

def remove_stopwords(array): 
    nostop = []
    j = 0
    sorted_fw = list(words[i] for i in np.argsort(array))[::-1]
    while len(nostop) < 10:
        temp = sorted_fw[j]
        if temp not in stop_words:
            nostop.append(temp)
        j += 1
    return nostop

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/serenalin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
F_top10_nostop = remove_stopwords(lpr_fwF)
R_top10_nostop = remove_stopwords(lpr_fwR)
print('Top 10 words to predict Fresh reviews (without stop words):\n',F_top10_nostop)
print('Top 10 words to predict Rotten reviews (without stop words):\n',R_top10_nostop)

Top 10 words to predict Fresh reviews (without stop words):
 ['film', 'movie', 'one', 'best', 'good', 'story', 'like', 'time', 'well', 'comedy']
Top 10 words to predict Rotten reviews (without stop words):
 ['movie', 'film', 'like', 'one', 'much', 'story', 'even', 'comedy', 'director', 'good']


Even after removing stop words, the predictive words for Fresh and Rotten reviews still look very similar. Therefore, we use part-of-speech tagging to identify the top 10 adjectives for Fresh and Rotten reviews.

In [19]:
from nltk.corpus import wordnet as wn

def filter_adj(array):
    adj = []
    j = 0
    sorted_list = list(words[i] for i in np.argsort(array))[::-1]
    while len(adj) < 10:
        word = sorted_list[j]
        tag = nltk.pos_tag([word])[0][1]
        if tag in ('JJ','JJR','JJS') and word not in stop_words:
            adj.append(word)
        j += 1
    return adj

In [20]:
F_top10_adj = filter_adj(lpr_fwF)
R_top10_adj = filter_adj(lpr_fwR)
print('Top 10 adjective words to predict Fresh reviews (without stop words):\n',F_top10_adj)
print('Top 10 adjective words to predict Rotten reviews (without stop words):\n',R_top10_adj)

Top 10 adjective words to predict Fresh reviews (without stop words):
 ['best', 'good', 'much', 'great', 'new', 'american', 'little', 'old', 'many', 'big']
Top 10 adjective words to predict Rotten reviews (without stop words):
 ['much', 'good', 'little', 'bad', 'many', 'best', 'hard', 'new', 'real', 'comic']


### Misclassified quotes
Next, we'll look into some quotes we misclassified.

In [21]:
count = 1
mis_q = q_valid[~(predicted == y_valid)]
mis_p = predicted[~(predicted == y_valid)]
mis_y = y_valid[~(predicted == y_valid)]
mis_len = len(mis_q)
# set seed and randomly pick 5 out of the misclassified quotes
np.random.seed(20)
index = np.random.choice(mis_len,5,replace=False)

for i in index:
    print('%i.' %count, 'Predicted:', mis_p[i], '\tActual:', mis_y[i])
    print(mis_q.iloc[i],'\n')
    count += 1

1. Predicted: rotten 	Actual: fresh
If [The Whole Nine Yards] should not have worked, then the sequel definitely shouldn't work, either. But, once again, it kind of does. 

2. Predicted: rotten 	Actual: fresh
It's the rare kind of movie that makes too much seem like a good idea. 

3. Predicted: rotten 	Actual: fresh
A reasonably enjoyable (for those captivated by this sort of thing) black comedy/noir thriller. 

4. Predicted: fresh 	Actual: rotten
Ostensibly about the banality of youthful evil, Kids is simply about its own banality. 

5. Predicted: rotten 	Actual: fresh
What gives the film its jolt of urgency is its New Orleans setting. Deja Vu is the first major movie to be shot there since the city's devastation. 



## 4 NB with smoothing

In [22]:
def fit(X_train, y_train, alpha):
    fresh = np.unique(y_train, return_counts=True)[1][0]
    rotten = np.unique(y_train, return_counts=True)[1][1]

    # log probability of Fresh review
    lpr_F = math.log((fresh+alpha)/(len(y_train)+alpha*2))
    # log probability of rotten review
    lpr_R = math.log((rotten+alpha)/(len(y_train)+alpha*2))
    
    # create blank array to store the value
    fresh_wcount = np.zeros(len(X_train[0]))
    rotten_wcount = np.zeros(len(X_train[0]))

    # calculate frequency of individual word in fresh and rotten reviews
    for i in range(len(y_train)):
        if y_train[i]=='fresh':
            fresh_wcount = fresh_wcount + X_train[i]
        else:
            rotten_wcount = rotten_wcount + X_train[i]
    
    # smoothed log probability of each word given a Fresh review
    lpr_wF = np.log((fresh_wcount+alpha)/(fresh+alpha))
    # smoothed log probability of each word given a rotten review
    lpr_wR = np.log((rotten_wcount+alpha)/(rotten+alpha))
    return lpr_F, lpr_R, lpr_wF, lpr_wR
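
The effect of `alpha` is easiest to see on a word that never occurs in a class: its count becomes `alpha` instead of 0, so its log probability is finite rather than `-inf`. The sketch below mirrors the estimate used in `fit` on hypothetical toy counts. Note that a common textbook form for binary features divides by `n_class + 2*alpha`; the sketch keeps the notebook's `n_class + alpha` so that it matches the code above.

```python
import numpy as np

def smoothed_log_prob(word_counts, n_class, alpha):
    """Mirror the notebook's estimate: log((count + alpha) / (n_class + alpha))."""
    counts = np.asarray(word_counts, dtype=float)
    return np.log((counts + alpha) / (n_class + alpha))

toy_counts = [3, 1, 0]   # hypothetical counts; the last word is unseen in this class
n_class = 4              # hypothetical number of reviews in the class

with np.errstate(divide='ignore'):
    unsmoothed = smoothed_log_prob(toy_counts, n_class, alpha=0)
smoothed = smoothed_log_prob(toy_counts, n_class, alpha=1)
```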

In [23]:
def predict(X_valid, y_valid, fit_list):
    lpr_F = fit_list[0]
    lpr_R = fit_list[1]
    lpr_wF = fit_list[2]
    lpr_wR = fit_list[3]
    l_F = np.apply_along_axis(log_calF, 1, X_valid, lpr_wF, lpr_F)
    l_R = np.apply_along_axis(log_calR, 1, X_valid, lpr_wR, lpr_R)
    predicted = np.apply_along_axis(F_or_R, 0, l_F > l_R)
    matrix = confusion_matrix(y_valid, predicted)
    tn, fp, fn, tp = matrix.ravel()
    accuracy = (tn+tp)/(tn+tp+fn+fp)
    return matrix, accuracy

In [24]:
from sklearn.model_selection import KFold
def cv(k, fit, predict, X, y, alpha):
    X_len = len(X)
    # create a list of indices to shuffle the data
    i = np.random.choice(X_len,X_len, replace=False)
    XX = X[i]
    yy = y[i]
    # create a k-fold function
    kf = KFold(n_splits=k)
    # create an empty list to store the accuracy of each fold
    scores = []
    # split the data into k sections, and hold out one section at a time for validation
    for train_index, validate_index in kf.split(XX):
        X_train, X_validate = XX[train_index], XX[validate_index]
        y_train, y_validate = yy[train_index], yy[validate_index]
        # fit the model
        after_fit = fit(X_train, y_train, alpha)
        # calculate accuracy to measure the performance
        scores.append(predict(X_validate, y_validate, after_fit)[1])
    return round(np.mean(scores),4)
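
For intuition about what `KFold` does with the shuffled data, here is a minimal sketch of contiguous k-fold index splitting written with NumPy only. `kfold_indices` is a hypothetical helper for illustration; the notebook itself uses scikit-learn's `KFold`.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs over k contiguous folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i, valid_idx in enumerate(folds):
        # every fold except the i-th one is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

splits = list(kfold_indices(10, 5))
for train_idx, valid_idx in splits:
    print(valid_idx, train_idx)
```

Each sample is held out exactly once, so the mean of the k accuracies uses every review for validation.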

In [25]:
best_alpha = 0
best_accuracy = 0
k=5
power = 0
stop = 1
while stop >= best_accuracy:
    alpha = pow(10,power)
    result = cv(k, fit, predict, X_a, y, alpha)
    stop = result
    print('With alpha = %f,' %alpha, 'the accuracy of the model after %i' %k, 'fold cross validation is %f' %result)
    if result >= best_accuracy:
        best_alpha = alpha
        best_accuracy = result
    power -=1
print('\nWhen alpha = %f,' %best_alpha, 'the model has the best performance with accuracy = %f' %best_accuracy)

With alpha = 1.000000, the accuracy of the model after 5 fold cross validation is 0.738000
With alpha = 0.100000, the accuracy of the model after 5 fold cross validation is 0.742700
With alpha = 0.010000, the accuracy of the model after 5 fold cross validation is 0.726200

When alpha = 0.100000, the model has the best performance with accuracy = 0.742700
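
The loop above stops at the first drop in accuracy, so it never tries values of alpha above 1 or below 0.01. An alternative is to score a fixed grid and take the maximum. The sketch below shows only the selection step, using the three accuracies printed above as a stand-in for real calls to `cv`; `pick_best` is a hypothetical helper. Because the shuffle inside `cv` is not seeded, the accuracies will vary slightly from run to run.

```python
def pick_best(scores_by_alpha):
    """Return the (alpha, accuracy) pair with the highest accuracy."""
    return max(scores_by_alpha.items(), key=lambda item: item[1])

# Cross-validated accuracies printed above; in practice each value would come from cv(...)
recorded = {1.0: 0.7380, 0.1: 0.7427, 0.01: 0.7262}

best_alpha, best_accuracy = pick_best(recorded)
print(best_alpha, best_accuracy)
```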
