# Sentiment Analysis of Movie Review Data with Logistic Regression and Naive Bayes Algorithms

## Objective

The goal of this project is to perform a sentiment analysis of text data from movie reviews using two different classification algorithms - Logistic Regression and Naive Bayes - and compare their performance.

## Data Collection and Preparation

This dataset was organized and labeled by Bo Pang and Lillian Lee, and used in their paper "Seeing Stars: Exploiting class relationships for sentiment categorization with respect to rating scales" (2005) from Proceedings of the ACL. It contains 5331 positive and 5331 negative review snippets culled from the Rotten Tomatoes movie review website. Positive reviews were those designated as "fresh" and negative reviews were those designated "rotten" on the website. The full data set can be accessed [here](http://www.cs.cornell.edu/people/pabo/movie-review-data).

### Import and split the data into train, validation, and test sets

In [1]:
# initialize empty lists to store positive and negative reviews
positive_reviews = []
negative_reviews = []

# read the positive and negative reviews into their respective lists from the data files
with open("rt-polaritydata/rt-polaritydata/positive_reviews.txt", "r") as positive_file:
    for line in positive_file:
        # remove leading/trailing whitespaces and add to positive_reviews
        positive_reviews.append(line.strip())

with open("rt-polaritydata/rt-polaritydata/rt-polarity.neg", "r") as negative_file:
    for line in negative_file:
        # remove leading/trailing whitespaces and add to negative_reviews
        negative_reviews.append(line.strip())
        
# print the first few lines from each list as a sample
print("Sample Positive Reviews: ")
print(positive_reviews[:3])
print("Sample Negative Reviews: ")
print(negative_reviews[:3])

Sample Positive Reviews: 
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'effective but too-tepid biopic']
Sample Negative Reviews: 
['simplistic , silly and tedious .', "it's so laddish and juvenile , only teenage boys could possibly find it funny .", 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .']


In [2]:
# create the x-variable
x = positive_reviews + negative_reviews
print(len(x))
# create the y-variable
y = [1] * len(positive_reviews) + [0] * len(negative_reviews)
print(len(y))

10662
10662


In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size = 0.3, random_state = 0)
x_valid, x_test, y_valid, y_test = train_test_split(x_temp, y_temp, test_size = 0.5, random_state = 0)
print("length of x_train = " + str(len(x_train)))
print("length of y_train = " + str(len(y_train)))
print("length of x_valid = " + str(len(x_valid)))
print("length of y_valid = " + str(len(y_valid)))
print("length of x_test = " + str(len(x_test)))
print("length of y_test = " + str(len(y_test)))

y_train = np.array(y_train)
y_valid = np.array(y_valid)
y_test = np.array(y_test)

length of x_train = 7463
length of y_train = 7463
length of x_valid = 1599
length of y_valid = 1599
length of x_test = 1600
length of y_test = 1600
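
The split sizes printed above can be reproduced by hand. The sketch below assumes that the held-out portion is rounded up and the remainder goes to the other set; that rounding rule is our assumption about how `train_test_split` sizes the sets, and it happens to match the printed lengths.

```python
import math

n_total = 10662  # 5331 positive + 5331 negative snippets

# Assumption: the held-out size is rounded up and the rest stays in the other set
n_temp = math.ceil(0.3 * n_total)   # reviews set aside for validation + test
n_train = n_total - n_temp
n_test = math.ceil(0.5 * n_temp)
n_valid = n_temp - n_test

print(n_train, n_valid, n_test)
```

This gives roughly a 70/15/15 split between the training, validation, and test sets.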


### Import necessary functions for text processing

In [4]:
pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [5]:
import nltk
import re                                   # library for regular expression operations
import string                               # for string operations
from nltk.corpus import stopwords           # module for stop words that comes with NLTK
from nltk.stem import PorterStemmer         # module for stemming
from nltk.tokenize import word_tokenize    # module for tokenizing

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vscerra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vscerra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Create a function to preprocess the data using the nltk library to remove unnecessary punctuation and hyperlinks, tokenize, stem words, and remove stopwords

In [6]:
def process_text(text):
    ''' 
    Input: 
        text: a string of words or characters 
    Output:
        processed_text: a list of stemmed tokens from the original text, after removing hyperlinks, punctuation, and stopwords and converting to lowercase
    
    '''
    # remove hyperlinks (both http and https) using regex
    text = re.sub(r'https?://\S+','', text)
    
    # remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]','', text).lower()
    
    # tokenize the text
    words = word_tokenize(text)
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    
    # stem the words
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    processed_text = stemmed_words
    return processed_text
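
One side effect of the punctuation step is worth seeing in isolation: because punctuation is deleted rather than replaced by a space, hyphenated words are fused into a single token. The sketch below uses only the punctuation regex from `process_text` (no NLTK), and `strip_punctuation` is a helper name introduced here for illustration.

```python
import re

def strip_punctuation(text):
    # same pattern as in process_text: drop every character that is
    # neither a word character nor whitespace, then lowercase
    return re.sub(r'[^\w\s]', '', text).lower()

print(strip_punctuation("effective but too-tepid biopic"))
print(strip_punctuation("it's so laddish and juvenile ,"))
```

This is why tokens such as `hatinhand` and `wellintent` show up in the processed reviews later in the notebook.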

## Feature Extraction

### Build a frequency dictionary

Both the logistic regression and the Naive Bayes algorithms make use of a frequency dictionary, `freqs`, built below. Each (word, sentiment) pair is counted from the labeled training texts, so we know how many times each word appears in positive and in negative reviews. 

In [7]:
# initialize a plain dict to store word-sentiment frequency counts
freqs = {}

# iterate through the reviews and labels in the training set
for review, label in zip(x_train, y_train):
    words = process_text(review)
    
    for word in words:
        pair = (word, label)
        if pair in freqs:
            freqs[pair] +=1
        else:
            freqs[pair] = 1

In [8]:
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 16805
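
To make the structure of `freqs` concrete, here is a toy version of the same counting, assuming three hypothetical, already-tokenized reviews. It uses `collections.Counter` instead of the explicit `if/else` above, which is a stylistic swap rather than the notebook's own code.

```python
from collections import Counter

# hypothetical tokenized reviews with labels (1 = positive, 0 = negative)
toy_reviews = [
    (["great", "movi"], 1),
    (["great", "cast"], 1),
    (["dull", "movi"], 0),
]

toy_freqs = Counter()
for words, label in toy_reviews:
    for word in words:
        # key is the (word, label) pair, value is how often it was seen
        toy_freqs[(word, label)] += 1

print(toy_freqs[("great", 1)])  # seen twice in positive reviews
print(toy_freqs[("movi", 0)])   # seen once in negative reviews
print(toy_freqs[("dull", 1)])   # never seen in a positive review
```

A `Counter` returns 0 for pairs it has never seen, which plays the same role as the `freqs.get(pair, 0)` lookups used later.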


## Model Building

### Logistic Regression 

Logistic regression takes a regular linear regression, and applies a sigmoid to the output. This conversion lets us use our regression as a classifier - returning values above or below a threshold, rather than a continuous output.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$

Logistic regression:
$$ h(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$

#### Sigmoid
The sigmoid function is defined as: 

$$ h(z) = \frac{1}{1+e^{-z}} $$

It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability. 


In [9]:
# Sigmoid function
def sigmoid(z):
    '''
    Input: 
        z: input, can be scalar or array
    Output: 
        h: the sigmoid of z
    '''
    import numpy as np
    h = 1 / (1 + np.exp(-z))
    
    return h
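
For very negative inputs, the exponential in the formula can overflow. The sketch below is a scalar, pure-Python variant that avoids this by never passing a large positive argument to `exp`. The name `stable_sigmoid` is introduced here for illustration; it is not used elsewhere in the notebook.

```python
import math

def stable_sigmoid(z):
    """Sigmoid of a scalar z, arranged so exp() only sees non-positive arguments."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # for negative z, use the algebraically equivalent form e^z / (1 + e^z)
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(stable_sigmoid(0))      # midpoint of the curve
print(stable_sigmoid(1000))   # saturates towards 1
print(stable_sigmoid(-1000))  # saturates towards 0
```

With the small feature values and learning rates used in this project the simple form is likely sufficient, so this is only a precaution.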

#### Cost function and gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of training example 'i'.
* $h(z^{(i)})$ is the model's prediction for the training example 'i'.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
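
As a small worked example of the single-example loss, the sketch below compares a confident correct prediction with a confident wrong one for a positive review. `log_loss_single` is a helper name introduced here for illustration.

```python
import math

def log_loss_single(y, h):
    """Log loss for one example with true label y (0 or 1) and prediction h in (0, 1)."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

print(round(log_loss_single(1, 0.9), 3))  # confident and correct: small loss
print(round(log_loss_single(1, 0.1), 3))  # confident and wrong: large loss
```

The loss grows quickly as the prediction moves away from the true label, which is what pushes gradient descent to correct confident mistakes first.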


#### Update the weights

To update the weight vector $\theta$, we will apply gradient descent to iteratively improve the model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x^{(i)}_j $$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x^{(i)}_j$ is the feature associated with weight $\theta_j$

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
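
One way to sanity-check the gradient formula is to compare it against a finite-difference estimate of the cost. The sketch below does this in pure Python on a hypothetical three-example dataset; all names (`sigmoid_scalar`, `cost`, `analytic_gradient`) and the toy values are assumptions made for illustration.

```python
import math

def sigmoid_scalar(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    # average log loss over all examples
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid_scalar(sum(t * x for t, x in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / len(X)

def analytic_gradient(theta, X, y):
    # (1/m) * sum_i (h_i - y_i) * x_ij for each weight j
    m = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        h = sigmoid_scalar(sum(t * x for t, x in zip(theta, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += (h - y_i) * x_ij / m
    return grad

# hypothetical toy data: each row is [bias, feature_1, feature_2]
X = [[1.0, 2.0, 0.5], [1.0, 0.3, 1.5], [1.0, 1.0, 1.0]]
y = [1, 0, 1]
theta = [0.1, -0.2, 0.3]

# central finite differences, one weight at a time
eps = 1e-6
numeric = []
for j in range(len(theta)):
    up, down = theta.copy(), theta.copy()
    up[j] += eps
    down[j] -= eps
    numeric.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))

analytic = analytic_gradient(theta, X, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

If the two gradients agree to within a small tolerance, the update rule is moving the weights in the direction that the cost function actually decreases.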


#### Gradient descent

* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1) 
    * $\mathbf{\theta}$: has dimensions (n+1, 1)
    * $\mathbf{z}$: has dimensions (m, 1)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)'.  Since both 'y' and 'h' are column vectors (m,1), transpose the vector to the left, so that matrix multiplication of a row vector with column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$

In [10]:
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector

    '''
   
    m = np.shape(x)[0]

    for i in range(0, num_iters):
    
        # get z, the dot product of x and theta
        z = np.dot(x,theta)
        
        # get the sigmoid of z
        h = sigmoid(z)
        
        # calculate the cost function
        J = (-1/m) * (np.dot(np.transpose(y), np.log(h)) + np.dot(np.transpose(1-y), np.log(1-h)))

        # update the weights theta
        theta = theta - (alpha / m) * np.dot(np.transpose(x), (h-y))
        
    # J is a (1,1) array here; squeeze it before converting to a Python float
    J = float(np.squeeze(J))
    return J, theta

#### Extract features from the reviews

In order to apply logistic regression to the review text, we go through the corpus and, for each review, create a three-element vector containing:  

+ a bias term = 1

+ the positive count: the sum, over the words in the review, of how often each word appears in positive training reviews (from `freqs`)

+ the negative count: the sum, over the words in the review, of how often each word appears in negative training reviews (from `freqs`)

So for each review, we will have a vector that looks like this:  

<center>    
x = [1, positive count, negative count]
</center>

By cycling through the corpus of reviews and stacking the above vectors, we can build the feature matrix for the logistic regression model. 
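
The three-element vector described above can be traced by hand on a small example. The sketch below assumes a hypothetical frequency dictionary and already-processed word lists; `toy_features` is an illustrative stand-in, not the notebook's `extract_features`.

```python
# hypothetical frequency dictionary keyed by (stemmed word, label)
toy_freqs = {
    ("great", 1): 40, ("great", 0): 5,
    ("dull", 0): 30,
    ("movi", 1): 100, ("movi", 0): 90,
}

def toy_features(words, freqs):
    # sum each word's positive-class and negative-class counts
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]  # [bias, positive count, negative count]

print(toy_features(["great", "movi"], toy_freqs))
print(toy_features(["dull", "movi"], toy_freqs))
```

Note that a frequent, neutral word such as `movi` contributes heavily to both counts, so the difference between the two counts carries most of the signal.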

In [11]:
def extract_features(review, freqs, process_text = process_text):
    '''
    Input: 
        review: a string containing one review
        freqs: a dictionary corresponding to the frequencies of each (word, label) tuple
    Output: 
        x: a feature vector of dimensions (1,3) (bias, positive, negative)
    '''
    # process the review
    word_l = process_text(review)
    
    # preallocate the 3-element counts
    x = np.zeros(3)
    
    # set the bias term to 1
    x[0] = 1
    
    # loop through each word on the list
    for word in word_l: 
        pair_p = (word, 1)
        if pair_p in freqs: 
            # increment the word count for the positive label
            x[1] += freqs[pair_p]
        pair_n = (word, 0)
        if pair_n in freqs: 
            x[2] += freqs[pair_n]
            
    x = x[None, :] # adding batch dimension
    assert(x.shape == (1,3))
    return x

#### Predict review sentiment

Using the extracted features of each review and the sigmoid function, we can get a prediction of the review sentiment. In this case, a prediction between 0 and 0.5 (0 < y<sub>pred</sub> <= 0.5) is coded as a negative review, and a prediction between 0.5 and 1 (0.5 < y<sub>pred</sub> < 1) is coded as a positive review.

In [12]:
def predict_review_LR(review, freqs, theta): 
    '''
    Input: 
        review: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the predicted probability that the review is positive
    '''

    # extract the features of the review and store it into x
    x = extract_features(review, freqs)

    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, theta))
    
    return y_pred

#### Train the logistic regression model

Training the model is simple once you have the above functions. All that remains is to extract all feature vectors from the processed text, then use the review labels and gradient descent to find the weights ($\theta$) that correspond with the lowest cost (${J}$). 

In [13]:
def train_LR_model(x_train, y_train, alpha, num_iters, extract_features = extract_features, gradientDescent = gradientDescent):
    '''
    Input: 
        x_train: a list of reviews
        y_train: (m,1) vector array with the corresponding labels for the list of reviews
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output: 
        J: the final cost
        theta: your final weight vector
    '''
    # collect the features of "x" and stack them into a matrix "X"
    X = np.zeros((len(x_train), 3))
    for i in range(len(x_train)):
        X[i,:] = extract_features(x_train[i], freqs)


    # training labels that correspond to X
    Y = y_train.reshape(-1,1)

    # apply gradient descent
    J, theta = gradientDescent(X, Y, np.zeros((3,1)), alpha, num_iters)
    return J, theta

### Naive Bayes

Naive Bayes is another algorithm commonly used for sentiment analysis. It is fast to train and fast at prediction time.

To implement a Naive Bayes model, we first estimate the probability that a document falls into each of the classes (in this case two classes, positive and negative).  

$P(D_{pos})$ is the probability that the document is positive.
$P(D_{neg})$ is the probability that the document is negative.
They are estimated from the training set as follows:

$$P(D_{pos}) = \frac{D_{pos}}{D}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}$$

Where $D$ is the total number of documents, or reviews in this case, $D_{pos}$ is the total number of positive reviews and $D_{neg}$ is the total number of negative reviews.

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a review is positive versus negative.  In other words, if we had no specific information and blindly picked a review out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Since $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$, the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})$$
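
As a quick numeric check that both forms of the logprior agree, the sketch below uses the class counts that the training cell in this notebook reports (3762 positive and 3701 negative training reviews). The variable names are introduced here for illustration.

```python
import math

d_pos, d_neg = 3762, 3701   # class counts reported for the training set
d = d_pos + d_neg

# form 1: difference of log probabilities
logprior_from_probs = math.log(d_pos / d) - math.log(d_neg / d)
# form 2: difference of log counts (the total D cancels out)
logprior_from_counts = math.log(d_pos) - math.log(d_neg)

print(logprior_from_probs)
print(logprior_from_counts)
```

Both values come out at about 0.016, so with a nearly balanced training set the prior barely shifts the final score.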

#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total word counts (summed over all words in `freqs`) in the positive and negative reviews, respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} $$

Note that in the above formulas, I include a +1 in the numerator; this is called Laplacian smoothing (or additive smoothing). It is a technique used to address the issue of zero probabilities for certain features, and improves the model's robustness. With Laplacian smoothing, even features that are not found in the training data can have a greater than zero probability in the model, ensuring that the Naive Bayes classifier can make predictions for features not seen during training. 

#### Log likelihood
To compute the loglikelihood of that same word, we can use the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)$$ 
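
To see how smoothing and the log ratio interact, the sketch below computes the loglikelihood for a few words under hypothetical values of $N_{pos}$, $N_{neg}$, and $V$. All numbers and the helper name `word_loglikelihood` are assumptions for illustration only.

```python
import math

# hypothetical totals, for illustration only
N_pos, N_neg, V = 1000, 1200, 500

def word_loglikelihood(freq_pos, freq_neg):
    # Laplacian smoothing keeps both probabilities strictly positive
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)
    return math.log(p_w_pos / p_w_neg)

print(word_loglikelihood(9, 0))   # only seen in positive reviews: positive score
print(word_loglikelihood(0, 9))   # only seen in negative reviews: negative score
```

Without the +1 in the numerator, the first call would divide by zero and the second would take the log of zero.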


The above formulas and calculations are included in the `train_naive_bayes` function below, returning the logprior of the corpus and the loglikelihood dictionary of the words and labeled sentiments in the training set.

In [14]:
def train_naive_bayes(freqs, x_train, y_train): 
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        x_train: a list of reviews
        y_train: an array of labels corresponding to the reviews (0,1)
    Output:
        logprior: the log prior, log(D_pos) - log(D_neg)
        loglikelihood: a dictionary mapping each word to its log likelihood, log(P(W_pos) / P(W_neg))
    '''
    
    loglikelihood = {}
    logprior = 0
    
    # calculate V, the number of unique words in the vocabulary
    vocab = set(d[0] for d in freqs)
    V = len(vocab)
    
    # calculate N_pos and N_neg, the total word counts in positive and negative reviews
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive
        if pair[1] > 0: 
            # increment the number of positive words by the count for this (word,label) pair
            N_pos += freqs.get(pair, 0)
        else:
            N_neg += freqs.get(pair, 0)
    
    # calculate D, the number of documents/reviews in the set
    D = len(y_train)
    
    # calculate D_pos, the number of positive reviews
    Y = np.asarray(y_train)  # ensures the elementwise comparison below also works for a plain list
    D_pos = np.count_nonzero(Y == 1)
    print("D_pos:", D_pos)
    # calculate D_neg, the number of negative reviews
    D_neg = D - D_pos
    print("D_neg:", D_neg)
    # calculate the logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    
    # for each word in the vocabulary
    for word in vocab:
        # get positive and negative frequency of each word
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        
        # calculate the probability that each word is positive and negative, using Laplacian smoothing
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        
        # calculate the loglikelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    
    return logprior, loglikelihood
            

In [15]:
logprior, loglikelihood = train_naive_bayes(freqs, x_train, y_train)
print("logprior: ", logprior)
print("loglikelihood length", len(loglikelihood))

D_pos: 3762
D_neg: 3701
logprior:  0.0163476774748208
loglikelihood length 12240


In [16]:
def predict_review_NB(review, logprior, loglikelihood):
    '''
    Input:
        review: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of the loglikelihoods of each word in the review (if found in the dictionary) + logprior (a number)

    '''
    # process the review to get a list of words
    word_l = process_text(review)
    
    # initialize probability to zero
    p = 0
    
    # add the logprior
    p += logprior
    
    for word in word_l:
        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the loglikelihood of that word to the probability
            p+= loglikelihood.get(word)
    
    return(p)
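
The scoring rule can be traced on a toy dictionary. The sketch below assumes hypothetical loglikelihood values and already-processed word lists; `nb_score` is an illustrative stand-in for the summation inside `predict_review_NB`.

```python
def nb_score(words, logprior, loglikelihood):
    # words missing from the dictionary contribute nothing to the score
    return logprior + sum(loglikelihood.get(w, 0.0) for w in words)

# hypothetical loglikelihoods, for illustration only
toy_loglikelihood = {"great": 1.2, "dull": -1.5, "movi": 0.05}

pos_score = nb_score(["great", "movi", "unseenword"], 0.0, toy_loglikelihood)
neg_score = nb_score(["dull", "movi"], 0.0, toy_loglikelihood)
print(pos_score, neg_score)
```

A review made up entirely of unseen words would score exactly the logprior, so its predicted class would depend only on the class balance of the training set.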

## Model Testing

### Test the logistic regression model 

With the trained logistic regression model, you can then input your test set (or validation set) to evaluate the model's performance with new reviews.

The `test_LR_model` function returns two measures of model performance:  
+ **accuracy**: the number of correct predictions on the test set divided by the size of the test set
+ **F1 score**: a common evaluation metric in binary classification that combines precision and recall
    + Precision (P): the ratio of correct positive predictions to all positive predictions (true positives plus false positives). Precision represents how many of the predicted positives are truly positive. P = TP / (TP + FP)
    + Recall (R): the ratio of correct positive predictions to all actual positive cases (true positives plus false negatives). Recall represents the model's ability to correctly identify positive cases. R = TP / (TP + FN)
    + F1 is the harmonic mean of P and R and is useful when you need a single metric to evaluate the performance of a binary classifier, particularly when data sets are unbalanced. 
    $$F1 = \frac{2 * (P * R)}{P + R}$$

    + F1 scores range between 0 and 1, with 1 representing perfect precision and recall. A higher F1 score indicates a better precision/recall balance in the model
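
As a worked example of these definitions, the sketch below computes precision, recall, and F1 from hypothetical confusion counts. The helper name and the counts are assumptions for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts: 60 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Here F1 (about 0.667) sits between precision (0.75) and recall (0.6) but closer to the lower of the two, which is the characteristic behavior of a harmonic mean.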
    

In [17]:
def test_LR_model(x, y, freqs, theta, predict_review_LR = predict_review_LR):
    '''
    Input: 
        x: a list of reviews
        y: an array with the corresponding labels for the list of reviews
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of reviews classified correctly) / (total # of reviews)
        f1: 2*(precision * recall) / (precision + recall) 
    '''

    # preallocate the list for storing predictions
    y_hat = []

    for review in x: 
        # get the predicted label
        y_pred = predict_review_LR(review, freqs, theta)
    
        if y_pred > 0.5:
            y_hat.append(1.0)
    
        else: 
            y_hat.append(0.0)
        

    accuracy = sum(y.flatten() == np.asarray(y_hat))/len(y)
    
    from sklearn.metrics import f1_score
    f1 = f1_score(y, y_hat)
    
    return accuracy, f1

#### Tune the learning rate hyperparameter ($\mathbf\alpha$)

Now that we can train and test the model, we use the validation set to find the best learning rate for gradient descent. The learning rate is a hyperparameter that sets how large a step to take down the gradient at each iteration. If &alpha; is too large, the steps can overshoot the minimum of the cost function and the algorithm may fail to converge. If &alpha; is too small, convergence can be so slow that the algorithm does not reach the minimum within the allotted iterations. Because the choice affects both training efficiency and final accuracy, it is worth testing a range of values for &alpha;.    

Below, I use a regular grid search to select the &alpha; parameter that returns the best accuracy with the model trained on the training set, and tested on the validation set. Then I use the model trained on the training set, with hyperparameter &alpha; tuned with the validation set, to test accuracy and F1 score with the test set. 

In [18]:
# designate some learning rates to choose from
learning_rates = [3e-5, 1e-5, 6e-6, 3e-6, 1e-6, 6e-7, 3e-7, 1e-7, 6e-8, 3e-8, 1e-8, 6e-9, 3e-9, 1e-9]
best_learning_rate = None
best_accuracy = 0
num_iters = 2000

for lr in learning_rates: 
    # fit the model with the training set
    
    J_temp, theta_temp = train_LR_model(x_train, y_train, lr, num_iters, extract_features = extract_features, gradientDescent = gradientDescent)
    accuracy, f1 = test_LR_model(x_valid, y_valid, freqs, theta_temp, predict_review_LR = predict_review_LR)
    print(f"Learning Rate: {lr}, Validation Accuracy: {accuracy}")
    
    # update the best learning rate if needed
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_learning_rate = lr
    
print(f"Best Learning Rate determined with validation set: {best_learning_rate}")

# Train the model on the training set with the best learning rate
J, theta = train_LR_model(x_train, y_train, best_learning_rate, num_iters, extract_features = extract_features, gradientDescent = gradientDescent)



Learning Rate: 3e-05, Validation Accuracy: 0.6529080675422139
Learning Rate: 1e-05, Validation Accuracy: 0.6791744840525328
Learning Rate: 6e-06, Validation Accuracy: 0.6791744840525328
Learning Rate: 3e-06, Validation Accuracy: 0.6791744840525328
Learning Rate: 1e-06, Validation Accuracy: 0.6804252657911195
Learning Rate: 6e-07, Validation Accuracy: 0.6804252657911195
Learning Rate: 3e-07, Validation Accuracy: 0.6797998749218261
Learning Rate: 1e-07, Validation Accuracy: 0.6754221388367729
Learning Rate: 6e-08, Validation Accuracy: 0.6722951844903065
Learning Rate: 3e-08, Validation Accuracy: 0.6666666666666666
Learning Rate: 1e-08, Validation Accuracy: 0.6404002501563477
Learning Rate: 6e-09, Validation Accuracy: 0.6260162601626016
Learning Rate: 3e-09, Validation Accuracy: 0.6047529706066291
Learning Rate: 1e-09, Validation Accuracy: 0.5616010006253909
Best Learning Rate determined with validation set: 1e-06
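
Two learning rates tie for the best validation accuracy in the output above. The sketch below replays the selection rule on those printed accuracies (rounded to four places here) to show that the strict `>` comparison keeps whichever tied rate comes first in the list.

```python
# (learning rate, validation accuracy) pairs copied from the printed output, rounded
results = [
    (3e-5, 0.6529), (1e-5, 0.6792), (6e-6, 0.6792), (3e-6, 0.6792),
    (1e-6, 0.6804), (6e-7, 0.6804), (3e-7, 0.6798), (1e-7, 0.6754),
    (6e-8, 0.6723), (3e-8, 0.6667), (1e-8, 0.6404), (6e-9, 0.6260),
    (3e-9, 0.6048), (1e-9, 0.5616),
]

best_lr, best_acc = None, 0.0
for lr, acc in results:
    if acc > best_acc:   # strict inequality: a later tie does not replace the earlier rate
        best_lr, best_acc = lr, acc

tied = [lr for lr, acc in results if acc == best_acc]
print(best_lr)
print(tied)
```

Since the accuracy curve is fairly flat between 1e-5 and 3e-7, the exact choice within that range probably matters little for this feature set.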


### Test Naive Bayes

Now that we have the `logprior` and `loglikelihood`, we can test the naive bayes function by making predictions on the test set.

In the logistic regression model, we classified outcomes as greater or less than 0.5. The Naive Bayes score `p` is a sum of log ratios rather than a probability, so the decision threshold is 0: a review with p > 0 is classified as positive, while a review with p <= 0 is classified as negative. Just as with the logistic regression model above, the test function for Naive Bayes returns accuracy and F1 score. 
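
The two thresholds are connected: since the Naive Bayes score is a log of the ratio of positive to negative probability, passing it through a sigmoid gives the probability of the positive class under the model's independence assumption. The sketch below illustrates this; `nb_score_to_probability` is a name introduced here, and the resulting probabilities should be read as rough, often overconfident, estimates.

```python
import math

def nb_score_to_probability(p):
    """Map a log-odds score p to P(positive) under the Naive Bayes model."""
    if p >= 0:
        return 1.0 / (1.0 + math.exp(-p))
    # equivalent form that avoids overflow for very negative scores
    ep = math.exp(p)
    return ep / (1.0 + ep)

print(nb_score_to_probability(0.0))   # score 0 sits exactly at the 0.5 threshold
print(nb_score_to_probability(2.0))   # positive score: probability above 0.5
print(nb_score_to_probability(-2.0))  # negative score: probability below 0.5
```

So thresholding the score at 0 is the same decision as thresholding the implied probability at 0.5.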

##### Note
Note that we calculate the prior from the training data, and that the training data is almost evenly split between positive and negative labels (3762 positive and 3701 negative reviews).  This means that the ratio of positive to negative reviews is close to 1, and the logprior is nearly 0 (about 0.016).

A logprior this close to 0 means that adding it to the log likelihood changes the score very little.  However, the logprior matters whenever the data is not balanced, because it then takes a value further from zero.



In [19]:
def test_naive_bayes(x_test, y_test, logprior, loglikelihood, predict_review_NB = predict_review_NB):
    '''
    Input:
        x_test: a list of reviews
        y_test: the corresponding labels for the list of reviews
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of reviews classified correctly)/(total # of reviews)
        f1: 2*(precision * recall) / (precision + recall) 
    '''
    accuracy = 0
    
    y_hats = []
    for review in x_test: 
        # if prediction is > 0
        if predict_review_NB(review, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else: 
            y_hat_i = 0
            
        # append the predicted class to the list of y_hats
        y_hats.append(y_hat_i)
        
    # error is the average of the absolute differences between y_hats and y_test
    Y = np.asarray(y_test)
    error = np.mean(np.abs(np.asarray(y_hats) - Y))
    accuracy = 1 - error
    
    from sklearn.metrics import f1_score
    f1 = f1_score(Y, y_hats)
    
    return accuracy, f1

### Evaluating model performance on the test set

In [20]:
# Evaluate the final model on the test set
test_accuracy_LR, f1_score_LR = test_LR_model(x_test, y_test, freqs, theta, predict_review_LR = predict_review_LR)
print(f"Test accuracy of LR model with best learning rate: {test_accuracy_LR:.3f}, and F1 score: {f1_score_LR:.3f}")
        

# Evaluate the Naive Bayes model on the test set
accuracy_NB, f1_NB = test_naive_bayes(x_test, y_test, logprior, loglikelihood)
print(f"Test accuracy of NB model with Laplacian smoothing: {accuracy_NB:.3f}, and F1 score: {f1_NB:.3f}")
     

Test accuracy of LR model with best learning rate: 0.673, and F1 score: 0.672
Test accuracy of NB model with Laplacian smoothing: 0.760, and F1 score: 0.758


## Error Analysis

While the Naive Bayes model clearly outperformed the Logistic Regression model, both left many reviews misclassified. Let's look at the misclassified reviews to see if there's anything obviously driving the misclassification. 
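
Before reading individual reviews, it can help to check whether the errors lean towards false positives or false negatives. The sketch below counts both on hypothetical label lists; `count_errors` and the toy labels are assumptions for illustration, not results from this notebook.

```python
def count_errors(y_true, y_pred):
    # false positive: truly negative (0) but predicted positive (1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # false negative: truly positive (1) but predicted negative (0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

# hypothetical labels and predictions
toy_truth = [0, 0, 1, 1, 0, 1]
toy_preds = [1, 0, 0, 1, 1, 1]
fp, fn = count_errors(toy_truth, toy_preds)
print(fp, fn)
```

A strong imbalance between the two counts would suggest that the decision threshold or the features favor one class.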

In [21]:
misclassified_LR = []
print('Truth Predicted Review')
for x, y in zip(x_test, y_test):
    y_hat = predict_review_LR(x, freqs, theta)
    if y_hat > 0.5:
        pred = 1.0
    else:
        pred = 0.0
    if y != pred:
        print('%d\t%0.2f\t%s' % (y, pred, ' '.join(process_text(x)).encode('ascii', 'ignore')))
        misclassified_LR.append(process_text(x))

Truth Predicted Review
0	1.00	b'director chri columbu take hatinhand approach rowl stifl creativ allow film drag nearli three hour'
0	1.00	b'effort sincer result honest film bleak hardli watchabl'
0	1.00	b'niftiest trick perpetr import earnest alchem transmogrif wild austenand hollywood austen'
0	1.00	b'motion pictur portray ultim passion other creat ultim thrill men black ii achiev ultim insignific scifi comedi spectacl whifflebal epic'
0	1.00	b'blanket statement dimestor rumin vaniti worri rich sudden wisdom film becom sermon run time'
1	0.00	b'walk away new version e hope would moist eye'
0	1.00	b'bravo reveal true intent film care select interview subject construct portrait castro predominantli charit seen propaganda'
0	1.00	b'cheer enough immin forgett ripoff besson earlier work'
0	1.00	b'incess loung music play film background may mistak love liza adam sandler chanukah song'
0	1.00	b'though film wellintent one could rent origin get love stori parabl'
1	0.00	b'go movi littl like c

1	0.00	b'may captiv mood subtli transform star still wonder paul thoma anderson ever inclin make sincer art movi adam sandler probabl ever appear'
0	1.00	b'look like high school film project complet day due'
0	1.00	b'present good case fail provid reason us care beyond basic dictum human decenc'
1	0.00	b'fast funni highli enjoy movi'
1	0.00	b'safe conduct long movi 163 minut fill time drama romanc tragedi braveri polit intrigu partisan sabotag viva le resist'
1	0.00	b'without de niro citi sea would slip wave drag back singlehand'
0	1.00	b'believ make real deal leftov enron stock doubl valu week friday'
1	0.00	b'delight minor pastri movi'
1	0.00	b'scorses bold imag gener smart cast ensur gang never letharg movi hinder central plot that pepper fals start popul charact nearli imposs care'
1	0.00	b'fill alexandr desplat haunt sublim music movi complet transfix audienc'
1	0.00	b'pure propaganda work unabash hero worship nonetheless like inadvert time invalu implicit remind role u foreign pol

1	0.00	b'great cast wonder sometim confus flashback movi grow dysfunct famili'
0	1.00	b'incoher jumbl film that rare entertain could'
1	0.00	b'pretti good littl movi'
0	1.00	b'tone shift abruptli tens celebratori soppi'
0	1.00	b'human natur short isnt nearli funni think neither smart'
0	1.00	b'except act exercis except dark joke wonder anyon saw film allow get made'
0	1.00	b'pivot narr point ripe film cant help go soft stinki'
1	0.00	b'arm game support cast pitchperfect forster alway hilari meara levi like mike shoot score namesak proud'
0	1.00	b'joli perform vanish somewher hair lip'
1	0.00	b'addit score high origin plot put togeth familiar theme famili forgiv love new way lilo stitch number asset commend movi audienc innoc jade'
1	0.00	b'larg mr kilmer movi strongest perform sinc door'
0	1.00	b'put primit murder insid hightech space station unleash pandora box special effect run gamut cheesi cheesier cheesiest'
0	1.00	b'dumb cheesi may cartoon look almost shakespearean depth breadth 

1	0.00	b'feelgood movi feel movi feel good feel sad feel piss end feel aliv'
0	1.00	b'perri fist bull moor farm matter time get upper hand matter heart'
0	1.00	b'hit andmiss affair consist amus outrag funni cho may intend imagin one might hope'
1	0.00	b'like best godard movi visual ravish penetr impenetr'
1	0.00	b'movi thesi eleg technolog mass surprisingli refresh'
0	1.00	b'certain base level blue crush deliv promis well enough recommend'
0	1.00	b'long repetit take way mani year resolv total winner'
0	1.00	b'deliber devotedli construct far heaven pictur postcard perfect neat new pinlik obvious recreat reson'
1	0.00	b'audienc advis sit near back squint avoid notic truli egregi lipnonsynch otherwis product suitabl eleg'
0	1.00	b'textbook live quiet desper'
1	0.00	b'grossout gag color set piec cours stultifyingli contriv styliz half still get job done sleepi afternoon rental'
1	0.00	b'like less dizzili gorgeou companion mr wong mood love much hong kong movi despit mainland set'
1	0.00	b'

In [22]:
misclassified_NB = []
print('Truth Predicted Review')

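for x, y in zip(x_test, y_test):
    # predict_review_NB returns a summed log-likelihood score; a score above 0 means "positive"
    y_hat = predict_review_NB(x, logprior, loglikelihood)
    pred = np.sign(y_hat) > 0
    if y != pred:
        # process the review once and reuse the tokens for storage and printing
        tokens = process_text(x)
        misclassified_NB.append(tokens)
        print('%d\t%0.2f\t%s' % (y, pred, ' '.join(tokens).encode('ascii', 'ignore')))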

Truth Predicted Review
0	1.00	b'effort sincer result honest film bleak hardli watchabl'
0	1.00	b'movi lie cheat love friend betray'
0	1.00	b'even accept right frame mind provid much lenienc'
0	1.00	b'motion pictur portray ultim passion other creat ultim thrill men black ii achiev ultim insignific scifi comedi spectacl whifflebal epic'
0	1.00	b'blanket statement dimestor rumin vaniti worri rich sudden wisdom film becom sermon run time'
1	0.00	b'total success someth true film addict want check bet'
0	1.00	b'bravo reveal true intent film care select interview subject construct portrait castro predominantli charit seen propaganda'
1	0.00	b'birthday girl doesnt tri surpris us plot twist rather seem enjoy transpar'
0	1.00	b'group selfabsorb women mother daughter featur film dont think noth wrong perform whiney charact bug'
0	1.00	b'predict bland comfort food appeal film pleasant enough dish'
0	1.00	b'mindless junk like make appreci origin romant comedi like punchdrunk love'
0	1.00	b'american

0	1.00	b'wellmad mushheart'
0	1.00	b'bartlett hero remain reactiv cipher open man head heart imagin reason film made'
0	1.00	b'despit strong perform never rise level telanovela'
0	1.00	b'perplex watch unfold astonish lack passion uniqu'
1	0.00	b'realli happen question philosoph filmmak filmmak need engag audienc'
0	1.00	b'juli davi kathi lee gifford film director sadli prove ego doesnt alway go hand hand talent'
0	1.00	b'definit crowdpleas roman colosseum'
0	1.00	b'film center hold'
1	0.00	b'trap wont score point polit correct may caus parent sleepless hour sign effect'
0	1.00	b'cut hollywood satir instead fresh last week issu varieti'
0	1.00	b'movi that overbear overthetop famili depict'
1	0.00	b'legendari irish writer brendan behan memoir borstal boy given love screen transferr'
1	0.00	b'harsh work piec storytel intellectu exercis unpleas debat that given drive narr that act believ noth less provoc piec work'
0	1.00	b'influenc chiefli human greatest shame realiti show realiti show go

0	1.00	b'much movi joint promot nation basketbal associ teenag rap adolesc posterboy lil bow wow'
1	0.00	b'remain solid somewhat heavyhand account neardisast done howard steadi imagin hand'
1	0.00	b'great parti'
0	1.00	b'see peopl thought hard mothman propheci'
0	1.00	b'would total loss two support perform take place movi edg'
1	0.00	b'despit titl punchdrunk love never heavyhand jab employ short care place deadcent'
0	1.00	b'rashomonfordipstick tale'
1	0.00	b'one left even aw act commit overwhelm sad feel made way bloodstream'
1	0.00	b'rude black comedi catalyt effect holi fool upon around cutthroat world children televis'
0	1.00	b'malon gift gener nightmarish imag hard burn brain movi narr hook way muddl effect chill guilti pleasur'
1	0.00	b'even climact hourlong cricket match boredom never take hold'
0	1.00	b'glaze tawdri bmovi scum'
1	0.00	b'may last tango pari'
0	1.00	b'name say jackass vulgar cheaplook version candid camera stage marqui de sade set'
1	0.00	b'beauti anim epic never

0	1.00	b'even diehard fan japanes anim find one challeng'
1	0.00	b'thought tom hank ordinari bigscreen star wait youv seen eight stori tall'
0	1.00	b'film sort cinemat high crime one bring militari courtroom drama low'
1	0.00	b'quit divert nonsens'
0	1.00	b'unspool like highbrow lowkey 102minut infomerci blend entrepreneuri zeal testimoni satisfi custom'
1	0.00	b'move quickli adroitli without fuss doesnt give time reflect inan cold war dated premis'
0	1.00	b'director byler may yet great movi charlott sometim half one'
1	0.00	b'georg clooney first directori effort present utterli ridicul shaggi dog stori one creativ energet origin comedi hit screen year'
1	0.00	b'depend upon reaction movi may never abl look red felt sharpi pen without disgust thrill giggl'
0	1.00	b'jaglom offer nonetooorigin premis everyon involv moviemak con artist liar'
1	0.00	b'ouv got love disney pic littl cleavag one heroin feisti principl jane'
1	0.00	b'downward spiral come pass auto focu bear typic junki opera'
1

Both LR and NB are straightforward machine learning algorithms: each bases its prediction on how often a word appears in positive versus negative reviews in the training data. As a result, the reviews most likely to be misclassified are those in which words typically found in positive reviews are used in a negative review, and vice versa. To test this hypothesis, I wrote a negative review that uses positive-sounding words and ran it through both models to see how it was classified.

In [23]:
# test on a new review
test_review_1 = 'It is amazing the way that Tarantino can take a beautiful idea and make it unpalatable and hateful'
test_review_NB = predict_review_NB(test_review_1, logprior, loglikelihood)
test_review_LR = predict_review_LR(test_review_1, freqs, theta)
print(f"Test review probability, Logistic Regression: {test_review_LR}")
print(f"Test review probability, Naive Bayes: {test_review_NB:.3f}")

Test review probability, Logistic Regression: [[0.68460643]]
Test review probability, Naive Bayes: 0.990


Both models classified the negative review above as positive (LR sigmoid value > 0.5; NB score > 0, where the NB value is a summed log-likelihood score rather than a probability). Let's try it again with a positive review that includes words most often seen in negative reviews:

In [24]:
# test on a new review
test_review_2 = 'A movie with this level of grime, grit, unrestrained emotion, and toxic masculinity deserves our praise'
test_review_NB = predict_review_NB(test_review_2, logprior, loglikelihood)
test_review_LR = predict_review_LR(test_review_2, freqs, theta)
print(f"Test review probability, Logistic Regression: {test_review_LR}")
print(f"Test review probability, Naive Bayes: {test_review_NB:.3f}")

Test review probability, Logistic Regression: [[0.27031279]]
Test review probability, Naive Bayes: -1.161


Just as above, both models misclassify a review that is positive but contains negative-sounding words, this time as negative (LR sigmoid value < 0.5, NB log-likelihood score < 0).
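
To make the mechanism concrete, here is a minimal sketch of how an additive per-word score can be pulled in the wrong direction. The counts in `toy_counts` are hypothetical and are not taken from the `freqs` dictionary built earlier; `word_score` and `review_score` are illustrative helpers, and the scoring is a simplified log-ratio that ignores the class prior and class totals, so it is not the notebook's `predict_review_NB`.

```python
import math

# Hypothetical (positive_count, negative_count) pairs for stemmed words.
# These numbers are made up for illustration only.
toy_counts = {
    "amaz": (30, 5),
    "beauti": (40, 8),
    "idea": (10, 12),
    "hate": (2, 9),
}

def word_score(word, counts, smoothing=1.0):
    """Log ratio of smoothed positive to negative counts; unseen words score 0."""
    pos, neg = counts.get(word, (0, 0))
    return math.log((pos + smoothing) / (neg + smoothing))

def review_score(tokens, counts):
    """Sum the per-word scores; each word contributes independently of its neighbors."""
    return sum(word_score(token, counts) for token in tokens)

# A negative review whose vocabulary leans positive.
tokens = ["amaz", "beauti", "idea", "hate"]
score = review_score(tokens, toy_counts)
predicted_label = 1 if score > 0 else 0
print(f"score = {score:.3f}, predicted label = {predicted_label}")
```

With these made-up counts, the two strongly positive words outweigh the single strongly negative one, so the summed score is positive even though the review as a whole is negative.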

Both errors are a consequence of taking each word out of context: LR works from per-word frequency counts and NB from per-word likelihoods, with no regard for the neighboring words in the review.
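
The loss of context can be shown with a small sketch. The `bag_of_words` helper below is a simplified, hypothetical stand-in for `process_text` (it only lowercases and splits on whitespace, with no stemming or stop-word removal), and the two example reviews are invented. Because only word counts are kept, two reviews with opposite meanings can map to identical features.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(text.lower().split())

# Two invented reviews that use the same words in a different order.
review_a = "not good , it was bad"
review_b = "not bad , it was good"

features_a = bag_of_words(review_a)
features_b = bag_of_words(review_b)

# Any model that sees only these counts must give both reviews the same prediction.
same_features = features_a == features_b
print(same_features)
```

Any classifier that consumes only these counts cannot tell the two reviews apart, which is why models that account for word order or surrounding context tend to handle negation and irony better.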

## Conclusions

While logistic regression and Naive Bayes are excellent models to know and understand, they lack some of the nuance and holistic reading of a text corpus that more advanced approaches, such as neural networks or graph-based models, can provide.

That being said, Naive Bayes, which scores each word by its class-conditional likelihood, handily outperformed the logistic regression model, which was trained on word-frequency features.