# Lab 3: Sentiment Analysis on Movie Reviews 🤩

Working on this lab should be a **collaborative effort**. We encourage you to work together with your group. If you do not work on your own notebook, make sure you demo to the TA/instructor as a group and share your work across the group after the lab.

> Remember to indicate the names of your group members if you use some of the collectively developed code in a future homework.

## Learning Objectives
1. Experience the full data science workflow, from data acquisition and pre-processing to building a model and presenting the results. 
![DSworkflow](utility/pics/DSworkflow.png)
2. Work with free-form text data.
3. Learn and understand two approaches to sentiment analysis. 
4. Explore model evaluation techniques and analyze errors. 

## Outline

0. [Recap: From Business Problem to Sentiment Analysis](#Recap:-From-Business-Problem-to-Sentiment-Analysis)
1. [Rule-Based Sentiment Prediction](#1.-Rule-Based-Sentiment-Prediction)
    1. [Toy Example](#Toy-Example)
    2. [Movie Reviews: Test Yourself](#Movie-Reviews:-Test-Yourself)
    3. [Analyzing the Results](#Analyzing-the-Results)
2. [Limitations and Introduction to Machine Learning](#2.-Limitations-and-Introduction-to-Machine-Learning)
    1. [Quick Introduction to Sentiment Classification and Scikit-Learn](#Quick-Introduction-to-Sentiment-Classification-and-Scikit-Learn)
    2. [Coding Task: Evaluate the Sentiment Classifier](#Coding-Task:-Evaluate-the-Sentiment-Classifier)
    3. [Things to Try](#Things-to-Try)
3. [Comparison](#3.-Comparison) 

## Recap: From Business Problem to Sentiment Analysis

Today we want to look at ways to help businesses make the most of their customers' feedback, which often comes as textual reviews or comments. Sentiment analysis, or categorizing attitudes towards something, is quite relevant today. Amazon, for example, sells products of all sorts; those who purchase these items are able to leave reviews and comments. Besides the ratings that are given, how would a company be able to tell which products are well-liked and which ones should be removed?

An easy way is through sentiment analysis, where the goal is to predict the sentiment or positivity/negativity of a product or service solely based on the text provided as comments and reviews. Throughout this lab, we will explore two different ways to predict and understand the sentiment of text data. First, we will work through a simple rule-based algorithm, looking at positive and negative words to determine the classification of reviews. Following this, we will work through a more sophisticated machine learning based approach, allowing us to scale our classification to much larger datasets.

## 1. Rule-Based Sentiment Prediction

Rule-based sentiment prediction is the easier of the two algorithms to learn and implement. In short, we have a list of positive words and a list of negative words, both of which will be used to calculate a "sentiment score" for the review.

### Toy Example

For example, let's say we have two sets of words, positive_words and negative_words:

In [1]:
positive_words = ['great', 'awesome', 'happy', 'good', 'exciting', 'love']
negative_words = ['bad', 'dislike', 'sad', 'boring', 'awful', 'poor']


We can also have a set of reviews or text that we want to analyze:

In [2]:
reviews = ['I thought the movie was great! I was very happy I could see it.',
           'I did not like the movie; boring acting, poor attitudes, bad lighting.',
           'The movie was pretty exciting overall, but the sound quality was bad.']

We then go through each review and add or subtract to the sentiment score based on the number of positive or negative words. If there is a positive word, then we add one to the score; a negative word subtracts one.

In [3]:
sentiment_scores = []
for review in reviews:
    sentiment_score = 0
    for word in review.split(' '):
        if word in positive_words:
            sentiment_score += 1
        if word in negative_words:
            sentiment_score -= 1
    sentiment_scores.append(sentiment_score)

We can print out these results to see the overall scores in order of the reviews.

In [4]:
print(sentiment_scores)

[1, -3, 1]


If we do this by hand, we see that the scores don't add up correctly. Why is this? The words of the reviews are split by spaces. Take the first review for example. If we split it by spaces and look at the words, we see that the word great still has the exclamation point with it!

In [5]:
first_review = 'I thought the show was great! I was very happy I could see it.'
first_review_words = first_review.split(' ')
print(first_review_words)

['I', 'thought', 'the', 'show', 'was', 'great!', 'I', 'was', 'very', 'happy', 'I', 'could', 'see', 'it.']


Having the words split only by spaces causes some words to include punctuation, which is something we don't want. We won't touch on this too much, but preprocessing the data so that words appear in a consistent form can improve accuracy greatly. Removing punctuation and standardizing to lowercase gives much more control over the text data at hand.

In [6]:
import string

# .lower() changes first_review to all lowercase
# .translate(str.maketrans(input, output, delete)) will replace characters from input with respective 
#      characters in output and deletes what's in delete. 
#      --> for example: translate(str.maketrans("aeiou", "12345", "!")) will replace vowels with their respective 
#          numbers and deletes all exclamation marks
# .split(' ') splits the words into an array based on ' ', or a space
# other functions include:
# .replace(target, new), which will replace all matches of the target string with the new string

new_first_words = first_review.lower().translate(str.maketrans("", "", string.punctuation)).split(" ")
print(new_first_words)

['i', 'thought', 'the', 'show', 'was', 'great', 'i', 'was', 'very', 'happy', 'i', 'could', 'see', 'it']
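
To see how `str.maketrans` and `translate` behave on their own, here is a small stand-alone sketch that mirrors the example from the comments in the cell above. The sample strings are made up for illustration. It also shows a detail worth knowing: `split(' ')` keeps empty strings when a text contains repeated spaces, while `split()` without arguments does not.

```python
import string

# Replace vowels with digits and delete exclamation marks
table = str.maketrans("aeiou", "12345", "!")
print("Great movie!".translate(table))  # Gr21t m4v32

# Delete all punctuation, as done in the lab
no_punct = str.maketrans("", "", string.punctuation)
cleaned = "Wow...  so good!".lower().translate(no_punct)

print(cleaned.split(" "))  # ['wow', '', 'so', 'good']
print(cleaned.split())     # ['wow', 'so', 'good']
```

Empty strings never match a sentiment word, so they do not change the scores in this lab, but they do inflate word counts if you ever need the length of a review.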


We can now re-run this code on the reviews to see the appropriate scores that should be allocated.

In [8]:
sentiment_scores = []
for review in reviews:
    sentiment_score = 0
    for word in review.lower().translate(str.maketrans("", "", string.punctuation)).split(" "):
        if word in positive_words:
            sentiment_score += 1
        if word in negative_words:
            sentiment_score -= 1
    sentiment_scores.append(sentiment_score)

print(sentiment_scores)

[2, -3, 0]


Great! We now have working code to assign sentiment scores to reviews. The final step is simply to assign a sentiment to each review. There are several ways to approach this, depending on what the user is attempting to do. We could treat this as binary classification, where each review is either positive or negative and cannot be anything else. For this, we would assign "Negative" to any review with a score less than zero, and "Positive" to every other review.

In [9]:
review_sentiments = []

for score in sentiment_scores:
    if score >= 0:
        review_sentiments.append("Positive")
    if score < 0:
        review_sentiments.append("Negative")
        
print(review_sentiments)

['Positive', 'Negative', 'Positive']


However, we could also use multi-class classification, adding a "Neutral" class for the reviews that have a score of zero.

In [10]:
review_sentiments = []

for score in sentiment_scores:
    if score > 0:
        review_sentiments.append("Positive")
    if score < 0:
        review_sentiments.append("Negative")
    if score == 0:
        review_sentiments.append("Neutral")
        
print(review_sentiments)

['Positive', 'Negative', 'Neutral']


With all of this in mind, there are no limits to the number of classes or splits that could be made for text data. We could adjust the range for neutral to be any reviews between -1 and 1, or perhaps add in more classes ("Slightly Positive", "Slightly Negative", "Very Positive", "Very Negative", etc...). As long as the data is preprocessed correctly and you have a good set of positive and negative words, you will be able to run sentiment analysis on the majority of text files.
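
As a rough sketch of how the neutral range could be made adjustable, the helper below maps a score to a label. The function name `label_from_score` and the parameter `neutral_band` are illustrative assumptions and not part of the lab code; the last two scores in the list are added borderline values, not results from the toy reviews.

```python
def label_from_score(score, neutral_band=0):
    """Map a sentiment score to a label.

    Scores whose absolute value is at most neutral_band count as "Neutral".
    """
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"


# Toy example scores plus two borderline values (1 and -1) for illustration
scores = [2, -3, 0, 1, -1]

# A band of 0 reproduces the three-class rule: only a score of exactly 0 is neutral
print([label_from_score(s) for s in scores])

# A band of 1 treats every score between -1 and 1 as neutral
print([label_from_score(s, neutral_band=1) for s in scores])
```

Adding further classes such as "Very Positive" would follow the same pattern with additional thresholds.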

#### Caution!

Though Rule-Based Sentiment Analysis is quick and on the easier side to implement, there are several drawbacks that can make this method unreliable. It does not take misspellings into account, nor does it take context into account. Take the following two reviews, for example.

"The movie was not good, it was bad" and "The movie was not bad, it was good". 

Both of these reviews would end up with the same sentiment score, but are clearly different reviews. This is partly due to the nature of the method; we are only looking at one word at a time, and not pairs of words. We will not look at this specifically, but looking at pairs of words or groups of three words (called bi-grams and tri-grams, or in general n-grams) can help alleviate mistakes in our analysis.
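
The following stand-alone sketch makes this concrete. It reuses the toy word lists from above; the helper names `tokenize`, `word_score`, and `bigrams` are assumptions introduced only for this illustration.

```python
import string

positive_words = ['great', 'awesome', 'happy', 'good', 'exciting', 'love']
negative_words = ['bad', 'dislike', 'sad', 'boring', 'awful', 'poor']


def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def word_score(words):
    """Count positive words minus negative words, one word at a time."""
    return sum((w in positive_words) - (w in negative_words) for w in words)


def bigrams(words):
    """Return all pairs of adjacent words."""
    return list(zip(words, words[1:]))


words_a = tokenize("The movie was not good, it was bad")
words_b = tokenize("The movie was not bad, it was good")

# Each review contains one positive and one negative word, so both score 0
print(word_score(words_a), word_score(words_b))

# The bi-grams differ: only the first review contains ('not', 'good')
print(('not', 'good') in bigrams(words_a), ('not', 'good') in bigrams(words_b))
```

Single-word counts cannot tell the two reviews apart, while the word pairs carry exactly the information that distinguishes them.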

Rule-Based Sentiment Analysis also does not take into account the length of the review. If we have a very long review that uses a mix of positive and negative words, it may end up being classified as something it is not. Likewise, a short but very strongly opinionated review may not receive the same sentiment as a longer, equally opinionated review.
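
One possible way to make scores comparable across reviews of different lengths, which this lab does not pursue further, is to divide the raw score by the number of words. The sketch below assumes the toy word lists from above; `normalized_score` and the two sample reviews are made up for illustration.

```python
import string

positive_words = {'great', 'awesome', 'happy', 'good', 'exciting', 'love'}
negative_words = {'bad', 'dislike', 'sad', 'boring', 'awful', 'poor'}


def normalized_score(text):
    """Return (positive - negative word count) divided by the number of words."""
    words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    if not words:
        return 0.0  # avoid dividing by zero for empty reviews
    raw = sum((w in positive_words) - (w in negative_words) for w in words)
    return raw / len(words)


short_review = "Awful, boring, bad."
# Same three negative words, followed by 45 words without any sentiment words
long_review = "Awful, boring, bad. " + "The plot follows a detective in a small town. " * 5

# Both reviews have a raw score of -3, but very different normalized scores
print(normalized_score(short_review))
print(normalized_score(long_review))
```

Whether normalizing is appropriate depends on the application; it is shown here only to illustrate how review length interacts with the raw count.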

### Movie Reviews: Test Yourself

In the following code blocks, work through to analyze **real-life movie reviews**! Some of the code is written for you, some will have to be filled in yourself.


In [11]:
# Setup - This cell block is needed to set up everything for this testing section
# No need to edit this cell

import os
import string
import zipfile
import shutil

# Unzip folder with negative reviews
if not os.path.exists('utility/data/neg'):
    zip_ref = zipfile.ZipFile('utility/data/neg.zip', 'r')
    zip_ref.extractall('utility/data/')
    zip_ref.close()
    print('Unzipped Negative')

# Unzip folder with positive reviews
if not os.path.exists('utility/data/pos'):
    zip_ref = zipfile.ZipFile('utility/data/pos.zip', 'r')
    zip_ref.extractall('utility/data/')
    zip_ref.close()
    print('Unzipped Positive')
    
# Create folder for testing
pos_test = ['357_10p.txt', '347_10p.txt', '1697_10p.txt']  
neg_test = ['1919_1n.txt', '54_1n.txt', '1819_1n.txt']  
if not os.path.exists('utility/data/test'):
    os.mkdir('utility/data/test')
    shutil.copy('utility/data/pos/'+pos_test[0],'utility/data/test')
    shutil.copy('utility/data/pos/'+pos_test[1],'utility/data/test')
    shutil.copy('utility/data/pos/'+pos_test[2],'utility/data/test')

    shutil.copy('utility/data/neg/'+neg_test[0],'utility/data/test')
    shutil.copy('utility/data/neg/'+neg_test[1],'utility/data/test')
    shutil.copy('utility/data/neg/'+neg_test[2],'utility/data/test')
    print('Created test folder.')
    
# Create list of negative words from given file
with open('utility/data/negative-words.txt') as f:
    negative_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    print('Created list of negative words: negative_words')

# Create list of positive words from given file
with open('utility/data/positive-words.txt') as f:
    positive_words = [word.strip() for word in f.readlines() if word[0] not in [';', '\n']]
    print('Created list of positive words: positive_words')

Unzipped Negative
Unzipped Positive
Created test folder.
Created list of negative words: negative_words
Created list of positive words: positive_words


The bulk of the code will be executed in the following function. Fill in what needs to be filled in to perform rule-based sentiment prediction and test the function on a small number of reviews. 

In [16]:
def get_sentiment_scores(path2folder):

    # Create a blank sentiment_scores list
    sentiment_scores = []

    print('...computing sentiment scores on '+path2folder +'...')

    for file in os.listdir(path2folder):
        path_start = path2folder + '/'
    
        # Create the sentiment_score variable for this review, and set it to zero
        
        # your code here
        sentiment_score = 0
    
        with open(path_start + file, encoding = "utf-8") as f:

            # Pull the words into a words array
            
            # The reviews include the string "<br />" quite a few times; the data looks cleaner if replaced
            # with a space!

            # Hint: Remember to read, lower, replace, translate, and split!

            # your code here
            text = f.read()
            words = text.lower().replace("<br />", " ").translate(str.maketrans("", "", string.punctuation)).split(" ")

        # Loop through the words to generate the sentiment score

        # your code here
        for word in words:
            if word in positive_words:
                sentiment_score += 1
            if word in negative_words:
                sentiment_score -= 1

        # Append the sentiment_score to the sentiment_scores array!

        # your code here
        sentiment_scores.append(sentiment_score)

    # The `with` statement closes each file automatically, so no f.close() is needed
    print('Done Running \n')
    return sentiment_scores

test_scores = get_sentiment_scores('utility/data/test')
print(test_scores)

...computing sentiment scores on utility/data/test...
Done Running 

[-2, 5, 6, -6, 2, -2]


> The results you should get for those test reviews are `[-2, 5, 6, -6, 2, -2]`.

Once the code is running correctly, perform rule-based sentiment prediction on the **_positive reviews_**! Running this function will take a little while, as it needs to go through all the reviews and count the positive and negative words in order to get the sentiment score.  

In [38]:
# Create a variable called file_path for the folder of positive reviews
# Hint: Data is stored in the folder 'data' under 'utility', with two subfolders being 'neg' or 'pos'
scores_pos_reviews = None

# your code here
file_path = 'utility/data/pos'
scores_pos_reviews = get_sentiment_scores(file_path)

...computing sentiment scores on utility/data/pos...
Done Running 



### Analyzing the Results
Now, we can see how our approach predicts the sentiment for those reviews. This phase is a crucial part of the data science workflow and is called **model evaluation**.  

In [41]:
# Positive Predicted Reviews:
percent_pos = sum([1 for score in scores_pos_reviews if score >= 0]) / len(scores_pos_reviews)*100
print("%.2f%% true positive reviews (those are predicted correctly)" % (percent_pos))

# Negative Predicted Reviews:
percent_neg = sum([1 for score in scores_pos_reviews if score < 0]) / len(scores_pos_reviews)*100
print("%.2f%% false negative reviews (those are actually positive reviews)" % (percent_neg))
print(sum([1 for score in scores_pos_reviews if score >= 0]))


84.40% true positive reviews (those are predicted correctly)
15.60% false negative reviews (those are actually positive reviews)
1688


What do these numbers mean? Explain whether our approach works well or not.

Repeat the above computations for the **_negative reviews_** and compare the results with the ones above. Is our approach better at predicting positive reviews correctly or negative ones? 


In [36]:
scores_neg_reviews = None

# your code here
file_path = 'utility/data/neg'
scores_neg_reviews = get_sentiment_scores(file_path)

# Positive Predicted Reviews:
percent_pos = sum([1 for score in scores_neg_reviews if score >= 0]) / len(scores_neg_reviews)*100
print("%.2f%% false positive reviews (those are actually negative reviews)" % (percent_pos))

# Negative Predicted Reviews:
percent_neg = sum([1 for score in scores_neg_reviews if score < 0]) / len(scores_neg_reviews)*100
print("%.2f%% true negative reviews (those are predicted correctly)" % (percent_neg))
print(sum([1 for score in scores_neg_reviews if score < 0]))

...computing sentiment scores on utility/data/neg...
Done Running 

27.30% false positive reviews (those are actually negative reviews)
72.70% true negative reviews (those are predicted correctly)
1454


What is the overall performance of our rule-based sentiment predictor? Compute the percentage of correctly predicted reviews over *all* reviews (this measure is also called _accuracy_), and the percentage of incorrectly predicted reviews over *all* reviews (this measure is also called _error rate_). As a sanity check, make sure both measures add up to 100%.  

In [50]:
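# your code here
# Correct predictions: positive reviews with score >= 0 plus negative reviews with score < 0
num_correct = sum([1 for score in scores_pos_reviews if score >= 0]) + sum([1 for score in scores_neg_reviews if score < 0])
num_reviews = len(scores_pos_reviews) + len(scores_neg_reviews)
accuracy = num_correct / num_reviews * 100
error_rate = (num_reviews - num_correct) / num_reviews * 100
print("Accuracy: %.2f%%" % accuracy)
print("Error rate: %.2f%%" % error_rate)
print(len(scores_neg_reviews))
print(len(scores_pos_reviews))

Accuracy: 78.55%
Error rate: 21.45%
2000
2000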


## 2. Limitations and Introduction to Machine Learning
The rule-based sentiment predictor has many advantages, including that it is so simple to implement. With just a couple of extensions to our version (such as negation handling) we could actually make this production-ready. However, the main drawback of this approach is that we need **hand-engineered** lists of positive and negative expressions, which are non-trivial to create and also static. That means they don't adapt automatically to the domain they are being used for. For example, expressions might have a different meaning in formal language than in a colloquial context. 
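
As a minimal sketch of what negation handling could look like, the function below flips the polarity of a sentiment word when the word directly before it is a negation. The `negations` set and the function name `score_with_negation` are assumptions made for this illustration, and real negation handling would need to cover many more cases.

```python
import string

positive_words = {'great', 'awesome', 'happy', 'good', 'exciting', 'love'}
negative_words = {'bad', 'dislike', 'sad', 'boring', 'awful', 'poor'}
negations = {'not', 'never', 'no'}  # assumed, deliberately tiny list


def score_with_negation(text):
    """Rule-based score where a preceding negation flips a word's polarity."""
    words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    score = 0
    for i, word in enumerate(words):
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (word in positive_words) - (word in negative_words)
        # Flip the polarity when the previous word is a negation
        if i > 0 and words[i - 1] in negations:
            polarity = -polarity
        score += polarity
    return score


print(score_with_negation("The movie was not good, it was bad"))  # -2
print(score_with_negation("The movie was not bad, it was good"))  # 2
```

With this extension the two reviews from the caution section no longer receive the same score.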

![rule-based](utility/pics/rule-based1.png)

How can we overcome this problem? Can we maybe learn what expressions are used in a positive versus a negative review? The answer is '_yes - we can!_'

### Quick Introduction to Sentiment Classification and Scikit-Learn

Instead of working with lists of positive and negative expressions, we will now look at reviews with known ratings and use those to learn what distinguishes positive from negative reviews. With this known set of positive and negative reviews, we can build a model like so: 
![machine-learning](utility/pics/ml_train1.png)

We can then use the model on new comments and reviews to determine a customer's attitude. This is a **machine learning** approach commonly known as _classification_: 
![predict](utility/pics/ml_predict1.png)

Okay, let's do it. We will be using the Scikit-Learn Python package (https://scikit-learn.org/stable/). Check [**PDSH**] Ch5 for a quick introduction to Scikit-Learn (p343-359).  

In [44]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Remove test folder 
if os.path.exists('utility/data/test'):
    shutil.rmtree('utility/data/test')  
    
# Load data (each subfolder is treated as one class; classes are numbered 0, 1, ... in
# alphabetical order of the folder names, so here neg = 0 and pos = 1)
data_folder = "utility/data/"
dataset = load_files(data_folder, shuffle=False)
docs_raw = dataset.data

## Text preprocessing
docs_all = []
for doc in docs_raw:
    docs_all.append(doc.decode('utf-8', errors='replace')) # prevent UnicodeDecodeError
y_all = dataset.target

# Tokenize the text and keep only words that occur in at least 5 documents (min_df=5)
count_vect = CountVectorizer(min_df=5)  
X_all_counts = count_vect.fit_transform(docs_all)

# Number of docs and number of words
print("Number of documents: " + str(X_all_counts.shape[0])) 
print("Number of words: " + str(X_all_counts.shape[1])) 
    # X_all_counts data representation (* = occurrence count):
    #    - - - - -
    #  |
    #  |  *        <- document
    #  |
    #  |
    #     ^
    #    word index

Number of documents: 4000
Number of words: 8870
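
To get a feel for the document-by-word count matrix, here is a small sketch on three made-up sentences. The names `toy_docs`, `toy_vect`, and `toy_counts` are hypothetical; the column order is read from the fitted `vocabulary_` attribute, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the movie was great",
            "the movie was bad",
            "great acting great plot"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Words in column order (vocabulary_ maps each word to its column index)
columns = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(columns)

# One row per document, one column per word, entries are occurrence counts
print(toy_counts.toarray())
print(toy_counts.shape)  # (3, 7)
```

The third row contains a 2 in the column for "great", because that word occurs twice in the third document.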


After **preprocessing** the text documents, we **split** our data into two parts: one for building the model (_training set_) and one for testing/evaluating it (_test/evaluation set_). Then we will **build the model** using the _training set_ and use the model to **predict** the sentiment of the documents in the _test set_. 

In [45]:
# Split the data into two parts 
X_train, X_test, y_train, y_test = train_test_split(X_all_counts, y_all, train_size = .8, test_size = .2, random_state = 16)

print("Size of the training set: " + str(X_train.shape[0]))
print("Size of the test/evaluation set: " + str(X_test.shape[0]))

# Build the model using a linear classification model
model = LogisticRegression(max_iter=1000).fit(X_train,y_train)

# Use the classification model for predictions
predicted_target = model.predict(X_test)

Size of the training set: 3200
Size of the test/evaluation set: 800
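
The same fit-then-predict pattern can be tried on a tiny made-up dataset, which also makes it possible to look at what the model has learned. All names prefixed with `toy_` are hypothetical, and four sentences are of course far too few for a real model; this is only a sketch of the mechanics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toy_docs = ["great movie loved it", "wonderful acting great plot",
            "terrible movie hated it", "awful acting terrible plot"]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

toy_vect = CountVectorizer()
X_toy = toy_vect.fit_transform(toy_docs)
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, toy_labels)

# New text must be transformed with the SAME fitted vectorizer (transform, not fit_transform)
X_new = toy_vect.transform(["great plot", "terrible acting"])
print(toy_model.predict(X_new))

# Learned weight per word: positive weights push towards class 1, negative towards class 0
columns = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
weights = dict(zip(columns, toy_model.coef_[0]))
print(round(weights['great'], 2), round(weights['terrible'], 2))
```

In contrast to the rule-based approach, nobody told the model that "great" is positive; the weight is learned from the labeled examples.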


### Coding Task: Evaluate the Sentiment Classifier
Write a function that goes through all the test data and compares the predicted class with the actual class. If the model puts an entry into the wrong class, the function adds one to the respective counter: `fneg_error_count` for a _false negative_, `fpos_error_count` for a _false positive_. From these values you can compute  
* the total _number of mistakes made_, 
* the _error rate_, and 
* the _accuracy_ 

of the machine learning approach. 


Then, this function will print out the total number of errors, the number of _false negatives_ and _false positives_, and the corresponding rates (which are important because they give a relative measure based on the number of positive/negative test examples). 

The inputs for this function are the **predicted classification** for each review generated by the model and the **actual classification** from the dataset.

In [None]:
def test_predictions(predictions, actual):
    
    fneg_error_count = 0
    fpos_error_count = 0
    
    # Count positive (1) and negative (0) examples in the ground truth passed in as `actual`
    pos = actual[actual==1]
    num_pos = pos.size
    neg = actual[actual==0]
    num_neg = neg.size

    mistakes = 0
    error_rate = 0
    accuracy = 0
     
    
    # your code here
    
        
    
    print("There were a total of " + str(mistakes) + " errors out of " + str(len(predictions)) + " testpoints.")
    print("There were " + str(fneg_error_count) + " false negative errors")
    print("There were " + str(fpos_error_count) + " false positive errors")
    print("The algorithm was wrong in " + str(error_rate) + "% of the test cases." )
    print("The algorithm was correct in " + str(accuracy) + "% of the test cases.\n" )
    
    #false negative and false positive rates
    fnr = fneg_error_count/num_pos *100
    fpr = fpos_error_count/num_neg *100
    print("There was a %.2f%% false negative error rate." % fnr)
    print("There was a %.2f%% false positive error rate." % fpr)

Now, we can call this function using our predicted sentiments and the ground truth sentiments as input: 

In [None]:
test_predictions(predicted_target, y_test)
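
One way to sanity-check your own counts is to compare them against scikit-learn's built-in metrics. The sketch below uses two small made-up arrays (`actual_toy` and `predicted_toy` are hypothetical) rather than the lab data; the same calls should work with `y_test` and `predicted_target`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

actual_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
predicted_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual_toy, predicted_toy, labels=[0, 1]).ravel()
print("false negatives:", fn)
print("false positives:", fp)

# Accuracy is the share of correct predictions; the error rate is its complement
accuracy = accuracy_score(actual_toy, predicted_toy)
print("accuracy:", accuracy, "error rate:", 1 - accuracy)
```

Here one positive example is predicted as negative and one negative example is predicted as positive, so 6 of the 8 predictions are correct.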

### Things to Try
* Play with the train/test split sizes. We used an 80/20 split, but you can change this and see if it has an effect on the results. 
* Play with the random seed to create different train/test splits. How does this affect the results? 
* Use a different classifier, for example, Naive Bayes or a Support Vector Machine (SVM). Code examples are below - **replace** the model computation in the cell above with the respective lines to train these different models. Do these models produce different results?

In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB().fit(X_train, y_train)

from sklearn.svm import LinearSVC
model = LinearSVC().fit(X_train,y_train)

So, it turns out that this performs quite well. Of course, we can do more fancy things with the text data, instead of only counting word occurrences. 

[**Challenge**] In practice, people also use the counts of _pairs of words_ (so-called _bi-grams_) or even _n-grams_ (counts of tuples of n words), or a feature called _TF-IDF_, which is very powerful in practice. If you still have time, check out this tutorial explaining how to compute those: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html. Adapt the features used, create a new train/test split, train the model again, and evaluate the performance using your new features. 
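
As a starting point for the challenge, the sketch below shows both feature types on two made-up sentences. The variable names are hypothetical; `ngram_range` is a parameter of `CountVectorizer`, and `TfidfVectorizer` is scikit-learn's vectorizer for TF-IDF features.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["the movie was not good", "the movie was not bad"]

# ngram_range=(1, 2) keeps the single words and adds pairs of adjacent words
bigram_vect = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vect.fit_transform(toy_docs)
print(sorted(bigram_vect.vocabulary_))
print(X_bigrams.shape)  # 6 single words + 5 word pairs = 11 columns

# TfidfVectorizer replaces raw counts with weights that downplay words
# occurring in many documents
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(toy_docs)
print(X_tfidf.shape)
```

Either matrix can be passed to `train_test_split` and a classifier in the same way as `X_all_counts`.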

## 3. Comparison 
**Group Discussion:** Compare the two approaches, **rule-based sentiment prediction** versus **sentiment classification**. What are the main differences in terms of... 
* required data?
* quality of the results? 
* efficiency of the computation?
* possibilities to extend the basic algorithms? 

Make a list of _pros_ and _cons_ for both approaches and also think of use cases/applications for either technique.

**Answer:**

### Clean-up
Please run the following cell in order to clean up some of the files on your computer. While not mandatory, it will certainly save some space (over 4000 files are already unzipped, this will clear space).

In [None]:
# Run this to clean folders (unless you want to keep several thousand text files on your computer!)

if os.path.exists('utility/data/neg'):
    shutil.rmtree('utility/data/neg')
if os.path.exists('utility/data/pos'):
    shutil.rmtree('utility/data/pos')  

