#  Rotten Tomatoes Reviews prediction : Naive-Bayes Classifier

This dataset is a compilation of movie reviews that were obtained from the well-known movie review website Rotten Tomatoes. The dataset consists of the reviews' text and a corresponding label that specifies whether the review was classified as "fresh" or "rotten", based on Rotten Tomatoes' proprietary review aggregation system. 

This dataset is a highly valuable resource for individuals interested in conducting sentiment analysis and natural language processing, including researchers, data analysts, and machine learning practitioners. It contains reviews from a diverse group of critics and publications, encompassing a wide range of movies across various genres and languages.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import classification_report

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ravirajpurohit/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ravirajpurohit/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/ravirajpurohit/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## -------------------------------------- Load the data --------------------------------------

In [2]:
def read_data(filename='./data/rt_reviews.csv'):
    """
    Load the dataset
    
    Parameters
    ----------
    filename - string
    
    Returns
    -------
    reviews - numpy array of strings : feedback written by viewers
    labels - numpy array of strings : fresh / rotten
    """
    reviews = []
    labels = []
    with open(filename, 'r', encoding='latin-1') as f:
        for line in f:
            line = line.split(',')
            label, review = line[0], ''.join(line[1:])

            labels.append(label)
            reviews.append(review)

    ## returning from 1st index; 1st line is just column names in the dataset
    return np.array(labels[1:]), np.array(reviews[1:])

In [3]:
labels, reviews = read_data()

In [4]:
labels.shape, reviews.shape

((480000,), (480000,))

## -------------------------------------- Split the data --------------------------------------

In [5]:
def train_test_val_split(labels, reviews):
    """
    splits the dataset into training, testing and validation datasets
    
    Parameters
    ----------
    labels - numpy array of strings : fresh / rotten
    reviews - numpy array of strings : feedback written by viewers

    
    Returns
    -------
    train_data - numpy array of 70% of reviews chosen randomly
    train_labels - numpy array of 70% of labels chosen randomly
    test_data - numpy array of 20% of reviews chosen randomly
    test_labels - numpy array of 20% of labels chosen randomly
    val_data - numpy array of 10% of reviews chosen randomly
    val_labels - numpy array of 10% of labels chosen randomly
    """
    ## assume that labels and reviews are numpy arrays with the same length
    data_size = len(labels)

    ## shuffle the indices of the data
    shuffled_indices = np.random.RandomState(seed=21).permutation(data_size)

    ## split the indices into train, validation, and test sets
    train_indices = shuffled_indices[:int(0.7 * data_size)]
    val_indices = shuffled_indices[int(0.7 * data_size):int(0.8 * data_size)]
    test_indices = shuffled_indices[int(0.8 * data_size):]

    ## use the indices to extract the corresponding data
    train_data = reviews[train_indices]
    train_labels = labels[train_indices]

    val_data = reviews[val_indices]
    val_labels = labels[val_indices]

    test_data = reviews[test_indices]
    test_labels = labels[test_indices]

    return train_data, train_labels, test_data, test_labels, val_data, val_labels

In [6]:
train_data, train_labels, test_data, test_labels, val_data, val_labels = train_test_val_split(labels, reviews)

## -------------------------------------- Data Exploration --------------------------------------

In [7]:
def get_data_dist(labels):
    """
    prints the distribution of labels in a dataset
    
    Parameters
    ----------
    labels - numpy array of labels
    """
    # assume that train_labels is a numpy array
    unique_labels, label_counts = np.unique(labels, return_counts=True)

    for label, count in zip(unique_labels, label_counts):
        print(f"- {label} : {round(100*count/len(labels),2)}%")
        
    return None

In [8]:
for name, labs in zip(['Train','Test','Validation'],[train_labels, test_labels, val_labels]):
    print(f'\nClass distribution in {name} dataset')
    get_data_dist(labs)


Class distribution in Train dataset
- fresh : 50.04%
- rotten : 49.96%

Class distribution in Test dataset
- fresh : 49.75%
- rotten : 50.25%

Class distribution in Validation dataset
- fresh : 50.2%
- rotten : 49.8%


## -------------------------------------- Data Preprocessing --------------------------------------

In [9]:
def clean_data(data):
    """
    clean data from new line character, empty spaces on the ends, and inverted commas
    
    Parameters
    ----------
    data - numpy array of strings
    
    Returns
    -------
    data - numpy array of strings
    """
    data = np.array([i.replace('"','').replace('\n','').strip() for i in data])
    return data

In [10]:
print(f'----------------------------- Training data before cleaning ----------------------------- \n{train_data}')

----------------------------- Training data before cleaning ----------------------------- 
['" Gloriously daft but with a good deal of heart Fanged Up\'s Hammer in the slammer shtick has a surprising amount of bite. It\'s great entertainment for a night in with good friends and a couple of crates of beer -- unless of course you only drink wine."\n'
 '" The Back-Up Plan represents a major comeback for Jennifer Lopez. Unfortunately she\'s come back to making crap. "\n'
 '" The acting can be so-so the story implausible the camerawork stolid -- none of that really matters if you care about what happens to the characters."\n'
 ...
 '" The film is indeed a bit pat. Sweet and funny - largely thanks to James Corden in the lead role - it\'s never particularly surprising."\n'
 ' More concerned with recruiting the testosterone troubled boys of today than it is rewarding fans of yesteryear.\n'
 '" It strives awfully hard for depth but more often than not comes off too shallow."\n']


In [11]:
train_data = clean_data(train_data)
test_data = clean_data(test_data)
val_data = clean_data(val_data)

In [12]:
print(f'----------------------------- Training data after cleaning ----------------------------- \n{train_data}')

----------------------------- Training data after cleaning ----------------------------- 
["Gloriously daft but with a good deal of heart Fanged Up's Hammer in the slammer shtick has a surprising amount of bite. It's great entertainment for a night in with good friends and a couple of crates of beer -- unless of course you only drink wine."
 "The Back-Up Plan represents a major comeback for Jennifer Lopez. Unfortunately she's come back to making crap."
 'The acting can be so-so the story implausible the camerawork stolid -- none of that really matters if you care about what happens to the characters.'
 ...
 "The film is indeed a bit pat. Sweet and funny - largely thanks to James Corden in the lead role - it's never particularly surprising."
 'More concerned with recruiting the testosterone troubled boys of today than it is rewarding fans of yesteryear.'
 'It strives awfully hard for depth but more often than not comes off too shallow.']


In [13]:
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
              "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 
              'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 
              'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 
              'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 
              'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 
              'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 
              'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 
              'about', 'against', 'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 
              'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 
              'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 
              'each', 'few', 'more', 'most', 'other', 'some', 'such', 'nor', 
              'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 
              'can', 'will', 'just', 'don', 'now', 
              'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 
              'didn', "didn't", 'doesn', "doesn't", 'hadn', 
              "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 
              'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', 
              'wasn', "wasn't", 'weren', "weren't"]

In [14]:
def preprocess(text):
    """
    steps like -
    converts all characters to lower
    splits sentence to words
    removes non-alphabetical words
    
    Parameters
    ----------
    text - string

    Returns
    -------
    text - string (preprocessed)
    """
    text = text.lower()
    ## word_tokenize splits a sentence into words linguistically
    text = word_tokenize(text)
    text = [word for word in text if (word.isalpha() and word not in stop_words)]
    text = ' '.join(text)
    
    return text

In [15]:
## preprocess all the datasets
train_data = np.array([preprocess(i) for i in train_data])
test_data = np.array([preprocess(i) for i in test_data])
val_data = np.array([preprocess(i) for i in val_data])

## -------------------------------------- Create Vocabulary --------------------------------------

In [16]:
## lemmatizer instance to keep words with same meaning as same signature
## for example - converts 'went' to 'go'
lemm = WordNetLemmatizer()

In [17]:
## initialize dictionaries for vocabulary
vocab_all = defaultdict(int)
vocab_fresh = defaultdict(int)
vocab_rotten = defaultdict(int)

for ind, item in enumerate(train_data):
    words_unq = []
    
    ## split sentence by words linguistically
    words = nltk.word_tokenize(item)

    for word in words:
        ## lemmatize words to create a good vocabulary
        words_unq.append(lemm.lemmatize(word, pos='v'))
        words_unq.append(word)
    
    ## only keep unique set of words, remove duplicates
    words_unq = set(words_unq)
        
    if train_labels[ind] == 'fresh':
        for word in words:
            vocab_all[word] += 1
            vocab_fresh[word] += 1
    else:
        for word in words:
            vocab_all[word] += 1
            vocab_rotten[word] += 1
            
vocab_all = {key:value for key, value in vocab_all.items() if value > 10}
vocab_fresh = {key:value for key, value in vocab_fresh.items() if value > 5}
vocab_fresh = {key:value for key, value in vocab_fresh.items() if value > 5}

In [18]:
len(vocab_all), len(vocab_fresh), len(vocab_rotten)

(22941, 57532, 54929)

## -------------------------------------- Model Training-Testing --------------------------------------

In [36]:
## find the counts of fresh and rotten labels in the dataset
counts_fresh, counts_rotten = np.unique(train_labels, return_counts=True)[1]
print(np.unique(train_labels, return_counts=True))

(array(['fresh', 'rotten'], dtype='<U6'), array([168145, 167855]))


In [37]:
def predict(data):
    """
    predicts the label for a review using naive bayes theroem
    
    Parameters
    ----------
    data - string : movie review

    Returns
    -------
    label - string : fresh/rotten
    """
    ## define the local constants
    prob_fresh, prob_rotten, label = 1, 1, None
    
    base1 = (len(vocab_all) + counts_fresh)
    base2 = (len(vocab_all) + counts_rotten)
    
    for word in data.split():
        if word in vocab_fresh:
            prob_fresh *= (vocab_fresh[word] / base1)
        else:
            prob_fresh /= base1

        if word in vocab_rotten:
            prob_rotten *= (vocab_rotten[word] / base2)
        else:
            prob_rotten /= base2
    
    if prob_fresh > prob_rotten:
        label = 'fresh'
    else:
        label = 'rotten'

    return label

In [59]:
def predict_with_smoothing(data):
    """
    predicts the label for a review using naive bayes theroem
    
    Parameters
    ----------
    data - string : movie review

    Returns
    -------
    label - string : fresh/rotten
    """
    ## define the local constants
    prob_fresh, prob_rotten, label = 1, 1, None
    
    base1 = (len(vocab_all) + counts_fresh)
    base2 = (len(vocab_all) + counts_rotten)
    
    for word in data.split():
        if word in vocab_fresh:
            prob_fresh *= ((vocab_fresh[word] + 1) / (base1 + 1))
        else:
            prob_fresh /= (base1 + 1)

        if word in vocab_rotten:
            prob_rotten *= ((vocab_rotten[word] + 1) / (base2 + 1))
        else:
            prob_rotten /= (base2 + 1)
    
    if prob_fresh > prob_rotten:
        label = 'fresh'
    else:
        label = 'rotten'

    return label

In [39]:
pred_train = []
pred_test = []
pred_val = []

for data in train_data:
    pred_train.append(predict(data))

for data in test_data:
    pred_test.append(predict(data))
    
for data in val_data:
    pred_val.append(predict(data))

In [60]:
pred_train_smooth = []
pred_test_smooth = []
pred_val_smooth = []

for data in train_data:
    pred_train_smooth.append(predict(data))

for data in test_data:
    pred_test_smooth.append(predict(data))
    
for data in val_data:
    pred_val_smooth.append(predict(data))

## -------------------------------------- Performance Analysis --------------------------------------

In [50]:
def calc_accuracy(pred, actual):
    """
    return prediction accuracy
    
    pred - list of predicted labels
    actual - list of actual labels
    """
    
    accuracy = round(100*np.mean(pred == actual),2)
    return accuracy

In [51]:
print(f'Accuracy for Training Dataset - {calc_accuracy(pred_train, train_labels)}%')
print(f'Accuracy for Testing Dataset - {calc_accuracy(pred_test, test_labels)}%')
print(f'Accuracy for Validation Dataset - {calc_accuracy(pred_val, val_labels)}%')

Accuracy for Training Dataset - 81.54%
Accuracy for Testing Dataset - 79.25%
Accuracy for Validation Dataset - 79.21%


In [62]:
print(f'Accuracy for Training Dataset - {calc_accuracy(pred_train_smooth, train_labels)}%')
print(f'Accuracy for Testing Dataset - {calc_accuracy(pred_test_smooth, test_labels)}%')
print(f'Accuracy for Validation Dataset - {calc_accuracy(pred_val_smooth, val_labels)}%')

Accuracy for Training Dataset - 82.59%

Accuracy for Testing Dataset - 79.8%

Accuracy for Validation Dataset - 79.93%

## -------------------------------------- Probability of THE occurance ---------------------------------

P[“the”] = num of documents containing ‘the’ / num of all documents

In [68]:
labels, reviews = read_data()

In [78]:
round(100*sum([True for i in reviews if 'the' in i.lower().split()]) / len(reviews))

63

In [80]:
"""
The probability for the word "the" as asked in the assignment is 63%
"""

'\nThe probability for the word "the" as asked in the assignment is 63%\n'

## -------------------------------------- Conditional Probability --------------------------------------
Conditional Probability based on the sentiment

Calculate P[“the”|Positive] = Number of positive documents containing “the” / num of all positive review documents

In [82]:
pos_doc = reviews[labels=='fresh']

In [83]:
round(100*sum([True for i in pos_doc if 'the' in i.lower().split()]) / len(pos_doc),2)

63.56

In [84]:
"""
The conditional probability for the word "the" as asked in the assignment is 63.56%
"""

'\nThe conditional probability for the word "the" as asked in the assignment is 63.56%\n'