# Notebook 4 - Supervised Learning and the Naive Bayes Classifier

CSI4106 Artificial Intelligence  
Fall 2021  
Version 1 (2020) prepared by Julian Templeton.  Version 2 (2021) prepared by Caroline Barrière.

***PROBLEM DEFINITION***:

The supervised classification task tackled in this notebook is **polarity detection**, which is one possible activity within the quite popular trend of *Opinion Mining* in AI.  Many companies want to know whether there are positive or negative reviews about them.  Reviews can be on hotels, restaurants, movies, customer service of any kind, etc.  

In polarity detection, we use two classes: ***positive*** and ***negative***.  This is different from sentiment analysis, for example, in which the classes might be (sad, happy, anxious, angry, etc).  It's also more restricted than *rating*, in which we would like to assign a value (0..5) to evaluate a particular service. 

In this notebook we will use the Naive Bayes Machine Learning algorithm for polarity detection of a large movie review dataset. 

An important goal of this notebook is to help you better understand how to put in place an ***experimental set-up*** for supervised machine learning, including the notions of training set, test set, evaluation, etc.  The notebook also introduces the notion of comparative evaluation.  To say if a method is good or not, we often compare it to a *baseline* approach.  


***PROGRAMMING ENVIRONMENT***:

This notebook makes use of a really nice and popular machine learning package, called **scikit-learn** (http://scikit-learn.org/stable/).  It contains many pre-coded machine learning algorithms which you can call. 

The Naive Bayes algorithm is part of this scikit-learn package that we will be using. And in future notebooks, we'll explore other learning algorithms included in scikit-learn.

To use this package, you must install it. You will also need to install **Pandas**, which is a great tool for manipulating data to use in Machine Learning algorithms, and **Numpy**, which is often used by machine learning libraries.  At the command prompt, type ***pip install numpy***, ***pip install scikit-learn*** and ***pip install pandas*** to install the packages.  

ATTENTION: If you run into any installation problems for these packages on your own computer, I strongly suggest that you move to the online environment provided by Google, called ***Colab***.  Just go to the Colab site (https://colab.research.google.com/) and you will be able to upload your notebook and run it.  All packages required in the notebook are already installed. If a package is missing, you can install it inline by including "!pip install package_name".

***DATASET and RESOURCES***:

You will need to download the movie review dataset from the following shared Google Drive:
https://drive.google.com/file/d/1w1TsJB-gmIkZ28d1j7sf1sqcPmHXw352/view

This is a dataset of reviews from Rotten Tomatoes along with the Freshness of the review (Fresh or Rotten). We will be using this dataset throughout the notebook so be sure to place it in the same directory as this notebook. It contains 480000 reviews with half of them being rotten and the other half being fresh. We will only use a subset of these due to the large computation time of the Baseline learner.

If you work on your own machine, you will be able to access the file locally once you have downloaded it.  If you decide to work on ***Colab***, you can upload the dataset to colab (click the files icon on the left pane to access the upload option). 

We also use two resource files containing positive and negative words.  These will also need to be uploaded to colab or put in your local directory.


***HOMEWORK***:  

Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, write your name at the end of the notebook, rename the notebook StudentNumber-LastName-Notebook4.ipynb, and submit it.  

*The notebook will be marked on 25.  
Each **(TO DO)** has a number of points associated with it.*
***

**1. Application domain:  Movie reviews**  

Polarity detection could be used on reviews of anything.  In this notebook, we wish to apply polarity detection within the domain of movies.  Movie reviewers give a review accompanied by a score for movies that they review. Rotten Tomatoes is a website that collects movie reviews and the accompanying ratings, where a rating can be classified as "Rotten" for a low review score or "Fresh" for a higher review score. We will perform polarity detection on the dataset *rt_reviews.csv* that you downloaded earlier.

The first thing to do is to set up the training and testing sets for our models. We will build these sets by importing the data from the dataset using pandas, then passing that dataframe to scikit-learn's train_test_split function, which will separate the data into a training set and a test set. These will be used later on by the models that will be created/used.

We **SHOULD NOT** use this test set to build our model later on. The test set (unseen data) is used to test the model after we train it with the training set.  In the lecture videos (part 4 - Evaluation, of Supervised Learning series), I mention that this is called a "Validation set", even though scikit-learn calls it a test set.

In [1]:
# Import the libraries that we will use to help create the train and test sets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
# Put the dataset in a dataframe
# Make sure you put rt_reviews.csv in your colab space, or locally
df = pd.read_csv("rt_reviews.csv", encoding="ISO-8859-1")

The first step after loading the data is to look at it to get an idea of the task at hand and the quality of the annotation provided. The annotated set is our reference ("gold standard"). This is what we will try to learn with our algorithms. Remember that with some tasks there is a dose of subjectivity.

Pandas offers the two useful functions df.head() and df.tail() which allow you to visualize the top and the bottom of your data frame.

In [3]:
pd.options.display.max_colwidth = 100
df.head(10) # Show the first ten reviews of the dataset to understand the dataframe's structure

Unnamed: 0,Freshness,Review
0,fresh,"Manakamana doesn't answer any questions, yet makes its point: Nepal, like the rest of our plane..."
1,fresh,"Wilfully offensive and powered by a chest-thumping machismo, but it's good clean fun."
2,rotten,It would be difficult to imagine material more wrong for Spade than Lost & Found.
3,rotten,"Despite the gusto its star brings to the role, it's hard to ride shotgun on Hector's voyage of ..."
4,rotten,"If there was a good idea at the core of this film, it's been buried in an unsightly pile of fla..."
5,rotten,"Gleeson goes the Hallmark Channel route, damaging an intermittently curious entry in the time t..."
6,fresh,"It was the height of satire in 1976: dark as hell, but patently absurd and surely nowhere close..."
7,rotten,"Everyone in ""The Comedian"" deserves a better movie than ""The Comedian."""
8,rotten,Actor encourages grumpy Christians to embrace the season.
9,fresh,"Slight, contained, but ineffably soulful."


**(TO DO) Q1 - 2 marks** \
(a) Output the last 20 reviews. \
(b) Find some adjectives used in the reviews that would indicate (in your view) that the movie was rotten or fresh. Give 3-4 adjectives with their association (fresh or rotten).  For example, above, "good" (review 1) is associated with fresh, or "wrong" (review 2) is associated with rotten.

In [4]:
# ANSWER Q1 (part a)
#
df.iloc[-20:,]

Unnamed: 0,Freshness,Review
479980,fresh,"This is a very postmodern sci-fi, with its downbeat approach to the monsters themselves, but wi..."
479981,rotten,"The Saw franchise has become a weekend-before-Halloween tradition; ironically, though, it has c..."
479982,rotten,"Even worse than Kangaroo Jack, and that's a statement I was hoping not to make for at least a f..."
479983,fresh,In the midst of all the ridiculousness of a movie where three mutant monsters destroy everythin...
479984,fresh,"RPO is a movie with a tsunami of these moments from about 40 years of film, T.V and gaming. If ..."
479985,rotten,"It's a mixed bag, and it's mostly mixed toward the not-good."
479986,rotten,The most disappointing aspect of The Women is how few laughs it delivers.
479987,rotten,"For the most part, [Hart] is more subdued than usual and aiming to hit deeper notes - call it H..."
479988,rotten,"Reaching for Terrence Malick territory, the voice-over narration (""Ma says it was the devil tha..."
479989,fresh,"The members of the generally excellent ensemble cast were almost entirely fresh faces for me, w..."


**ANSWER Q1 - Part b** \
...
good -> fresh
fun -> fresh
interesting -> fresh
charming -> fresh
joyless -> rotten
disappointing -> rotten
offensive -> rotten

The second step is to take a random sample of the reviews.  The dataset is quite large (480000 reviews), so the processing time will be quite long if we keep it all.  We'll select 50000 reviews of each class at random (100000 in total).  If you work in colab, 50K per class should be reasonable; if you work locally and that is too large, you can reduce the sample to 10K per class.

In [5]:
# Randomly select 50000 fresh examples from the dataframe (use n=10000 if that is too slow locally)
dfFresh = df[df["Freshness"] == "fresh"].sample(n=50000, random_state=8)
# Randomly select 50000 rotten examples from the dataframe (use n=10000 if that is too slow locally)
dfRotten = df[df["Freshness"] == "rotten"].sample(n=50000, random_state=5)
# Combine the results to make a smaller random subset of reviews to use
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
dfPartial = pd.concat([dfFresh, dfRotten])

The third step is to create a training set and a test set (validation) with our data.

In [6]:
# Split the data such that 90% is used for training and 10% is used for testing (separating the review
# from the freshness labels)
# Recall that we do not use this test set when building the model, only the training set
# We use the parameter stratify so that the training and test sets both keep the same
# proportion of fresh and rotten reviews as dfPartial (here 50/50, i.e. balanced)
train_reviews, test_reviews, train_tags, test_tags = train_test_split(dfPartial["Review"],
                                                                      dfPartial["Freshness"],
                                                                      test_size=0.1, 
                                                                      random_state=3,
                                                                      stratify=dfPartial["Freshness"])
train_tags = train_tags.to_numpy()
train_reviews = train_reviews.to_numpy()
# Testing set (what we will use to test the trained model)
test_tags = test_tags.to_numpy()
test_reviews = test_reviews.to_numpy()
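
To see what `stratify` does in the split above, here is a minimal sketch on toy data. The names `toy_reviews` and `toy_tags` are hypothetical stand-ins for the real columns; only `train_test_split` itself comes from the notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 fresh and 10 rotten placeholder reviews
toy_reviews = [f"review {i}" for i in range(20)]
toy_tags = ["fresh"] * 10 + ["rotten"] * 10

# Same call pattern as in the notebook, on the toy data
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_reviews, toy_tags, test_size=0.1, random_state=3, stratify=toy_tags)

# With stratify, both splits keep the 50/50 class proportion
print(Counter(toy_train_y))
print(Counter(toy_test_y))
```

Running the same check with `Counter(train_tags)` and `Counter(test_tags)` on the real split is a quick way to confirm that both sets are balanced.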

**2. Using available resources**  

In Question 1, you were asked to highlight a few adjectives leading to fresh or rotten.  We can think of these words as positive or negative words.  It would take a long time to find all possible positive or negative words ourselves, so it is good to search for such resources.

In fact, the field of polarity detection is so popular that some researchers have established lists of positive and negative words for their research.  The ones used in this notebook have been downloaded from [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (a website on Opinion Mining by renowned researcher Bing Liu) and stored locally.  The files *positive-words.txt* and *negative-words.txt* are in the Notebook item of the Module 4 checklist in Brightspace.  Make sure you place these files in the same directory as your notebook if you work locally, or that they have been uploaded to colab if you work in colab.

In [7]:
# Read the positive words

with open("positive-words.txt", encoding = "ISO-8859-1") as f:
    posWords = f.readlines()
posWords = [p[0:len(p)-1] for p in posWords if p[0].isalpha()] 

# print the first 100 words
print(posWords[:100])

['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous', 'advantageously', 'advantages', 'adventuresome', 'adventurous', 'advocate', 'advocated', 'advocates', 'affability', 'affable', 'affably', 'affectation', 'affection', 'affectionate', 'affinity', 'affirm', 'affirmation', 'affirmative', 'affluence', 'affluent', 'afford', 'affordable', 'affordably', 'afordable', 'agile', 'agilely', 'agility', 'agreeable', 'ag

In [8]:
# Read the negative words

with open("negative-words.txt", encoding = "ISO-8859-1") as f:
    negWords = f.readlines()
negWords = [p[0:len(p)-1] for p in negWords if p[0].isalpha()] 

# Print the first 150 negative words
print(negWords[:150])

['abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted', 'aborts', 'abrade', 'abrasive', 'abrupt', 'abruptly', 'abscond', 'absence', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'absurdly', 'absurdness', 'abuse', 'abused', 'abuses', 'abusive', 'abysmal', 'abysmally', 'abyss', 'accidental', 'accost', 'accursed', 'accusation', 'accusations', 'accuse', 'accuses', 'accusing', 'accusingly', 'acerbate', 'acerbic', 'acerbically', 'ache', 'ached', 'aches', 'achey', 'aching', 'acrid', 'acridly', 'acridness', 'acrimonious', 'acrimoniously', 'acrimony', 'adamant', 'adamantly', 'addict', 'addicted', 'addicting', 'addicts', 'admonish', 'admonisher', 'admonishingly', 'admonishment', 'admonition', 'adulterate', 'adulterated', 'adulteration', 'adulterier', 'adversarial', 'adversary', 'adverse', 'adversity', 'afflict', 'affliction', 'afflictive', 'affront', 'afraid', 'aggravate', 'aggravating', 'aggravation', 'aggression', 'aggressive', 'aggressiveness'

**(TO DO) Q2 - 3 marks** \
What do you think of these positive and negative words?  If there were a third category "neutral", would you move some of these words to the "neutral" category?  Mention a few and explain why.  You can change the code above to print more words if you don't see any you would change among the ones printed.

**ANSWER Q2** \
...
adjustable
ameliorate
adamant
ambush
Because these words are neither clearly positive nor clearly negative, they fit better in a category between positive and negative. A review that uses them suggests that the movie is not very good but also not very bad.

**3. Baseline approach**  

Before we evaluate the performances of a supervised learning approach, we can start by establishing a very simple baseline approach.  It's always good to start simple.  A baseline allows us to measure whether the additional complexity of the various models we develop is worth it or not.

***3.1 Algorithm***

The *baseline algorithm* we will use simply counts the number of positive and negative words in the review and outputs the category corresponding to the maximum.  This approach DOES NOT LEARN anything.  It just uses a particular *reasoning* (strategy at test time).  You might be surprised to find out how many *AI start-ups* within the area of Opinion Mining, do use this kind of simple approach.  

In [9]:
# First let's define methods to count positive and negative words

def countPos(text):
    count = 0
    for t in text.split():
        if t in posWords:
            count += 1
    return count

def countNeg(text):
    count = 0
    for t in text.split():
        if t in negWords:
            count += 1
    return count

In [10]:
# Simple counting algorithm as baseline approach to polarity detection
def baselinePolarity(review):
    numPos = countPos(review)
    numNeg = countNeg(review)
    if numPos > numNeg:
        return "fresh"   
    else:
        return "rotten"   
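
The baseline above tests membership in Python lists, which scans the list on every lookup, and it splits on whitespace only, so a token such as `fun.` never matches `fun`. The sketch below is one possible variant, not the notebook's method. It assumes two tiny hypothetical lexicons (`pos_set`, `neg_set`) in place of the full word lists, uses sets for fast lookup, and lowercases tokens and strips surrounding punctuation before matching. It keeps the same decision rule, which makes one property visible: a tie, including a review with no lexicon words at all, is labelled "rotten".

```python
import string

# Hypothetical mini-lexicons standing in for posWords / negWords
pos_set = {"good", "fun", "charming"}
neg_set = {"wrong", "joyless", "disappointing"}

def count_matches(text, lexicon):
    """Count tokens of text found in lexicon, after lowercasing
    and stripping leading/trailing punctuation."""
    count = 0
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word in lexicon:
            count += 1
    return count

def baseline_polarity_sets(review):
    """Same rule as baselinePolarity: 'fresh' only if there are
    strictly more positive than negative words."""
    num_pos = count_matches(review, pos_set)
    num_neg = count_matches(review, neg_set)
    return "fresh" if num_pos > num_neg else "rotten"

print(baseline_polarity_sets("Good, clean fun."))          # 2 positive, 0 negative
print(baseline_polarity_sets("A joyless mess."))           # 0 positive, 1 negative
print(baseline_polarity_sets("A film about a train."))     # 0 vs 0: tie goes to rotten
```

With the real lexicons, `set(posWords)` and `set(negWords)` could play the role of the two sets. Note that lowercasing and punctuation stripping change which words match, so results would differ from the original baseline.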

In [11]:
# Test the baseline method
print("Testing baselinePolarity with the review:", train_reviews[0])
print("baselinePolarity result:", baselinePolarity(train_reviews[0]))
print("Expected result:", train_tags[0])
print(" ")
print("Testing baselinePolarity with the review:", train_reviews[1])
print("baselinePolarity result:", baselinePolarity(train_reviews[1]))
print("Expected result:", train_tags[1])

Testing baselinePolarity with the review:  t's a coming-of-age-slash-first-love story, with beguiling leads, a frank, bittersweet script, a lived-in, rustic setting (the outskirts of Athens, Georgia), an intense awareness of class difference and a mood that's all its own. 
baselinePolarity result: rotten
Expected result: fresh
 
Testing baselinePolarity with the review:  A remarkably assured debut for Coen.
baselinePolarity result: fresh
Expected result: fresh


***3.2 Qualitative evaluation***

In a qualitative evaluation, we take a small number of results and investigate them.  To do so, we take a few results (test_reviews 0 to 4) and, for each one, look at what made the algorithm say "fresh" or "rotten" (what it based its decision on).

**(TO DO) Q3 - 2 marks** \
(a) Define the methods printPos and printNeg which will print all the positive words and the negative words found in a text.\
(b) After testing with 5 reviews (code is already provided), discuss the results. Do you think the found words are wrong? Do you think the predicted result is correct and that it is rather the expected one that is not? Remember in the video lecture on Evaluation, we did talk of subjectivity... sometimes we don't agree with the "gold standard".

In [12]:
# ANSWER Q3 - Part a
def printPos(text):
    for t in text.split():
        if t in posWords:
            print(t)

def printNeg(text):
    for t in text.split():
        if t in negWords:
            print(t)

In [13]:
# Showing the review, the result and the positive and negative words contained
for i in range(5):
  print('-----------------------')
  print(test_reviews[i])
  print("baselinePolarity result:", baselinePolarity(test_reviews[i]))
  print("Expected result:", test_tags[i])
  print("Positive words found: ")
  printPos(test_reviews[i])
  print("Negative words found: ")
  printNeg(test_reviews[i])
  print('-----------------------')

-----------------------
 The film intelligently portrays how, even years later, the memories of trauma can be enough to inhibit a person from finding happiness - and consequently, how essential love is.
baselinePolarity result: fresh
Expected result: fresh
Positive words found: 
enough
happiness
love
Negative words found: 
trauma
inhibit
-----------------------
-----------------------
 Fight Club confirms Edward Norton's ascent into the pantheon of actors like Robert DeNiro and Robert Duvall.
baselinePolarity result: fresh
Expected result: fresh
Positive words found: 
like
Negative words found: 
-----------------------
-----------------------
 Not that anything's inherently wrong with a patriotic war movie, but there's a distastefulness in the film's simpleton view of mass suffering.
baselinePolarity result: rotten
Expected result: rotten
Positive words found: 
patriotic
Negative words found: 
wrong
-----------------------
-----------------------
 The grim film is sure to make the

**ANSWER Q3 - Part b**\
Most of the words found are misleading, because they are matched without regard to how they are used in the sentence. For example, in the second review, "...actors like Robert DeNiro and...", the word "like" is found as a positive word, but here it is a preposition and says nothing about the movie. 

***3.3 Quantitative evaluation***

In a quantitative evaluation, we are interested in numbers, and for these numbers to be significant, we need a large enough test set.  Although qualitative evaluation is a good analysis tool, quantitative evaluation is a good comparative tool allowing the comparison of various algorithms.

To test our baseline algorithm we use the test set, defined earlier and calculate the number of correct assignments.

In [14]:
# Function takes a one dimensional array of reviews and a one dimensional array of
# tags as input and prints the number of correct assignments when running the baseline approach
# on the reviews.
# Let's establish the polarity for each review
def correctReviews(reviews, tags):
    nbCorrect = 0
    count = 0
    for i in range(len(reviews)):
        polarity = baselinePolarity(reviews[i])
        if (count < 10):
            print(reviews[i] + " -- Prediction: " + polarity + ". Actually: " + tags[i] + " \n")
            count += 1
        if (polarity == tags[i]):
            nbCorrect += 1

    print('There are %s correct predictions out of %s total predictions' %(nbCorrect, len(tags)))    

In [15]:
# This may take a minute to run (depending on the size of your test set, and whether you are on colab or not)
correctReviews(test_reviews, test_tags)

 The film intelligently portrays how, even years later, the memories of trauma can be enough to inhibit a person from finding happiness - and consequently, how essential love is. -- Prediction: fresh. Actually: fresh 

 Fight Club confirms Edward Norton's ascent into the pantheon of actors like Robert DeNiro and Robert Duvall. -- Prediction: fresh. Actually: fresh 

 Not that anything's inherently wrong with a patriotic war movie, but there's a distastefulness in the film's simpleton view of mass suffering. -- Prediction: rotten. Actually: rotten 

 The grim film is sure to make the audience uncomfortable. -- Prediction: rotten. Actually: fresh 

 Talking about class can be ugly. Yet as AWOL asserts, when you dare to comment, sometimes it frees up room for beauty to unfurl. -- Prediction: fresh. Actually: fresh 

 High energy fun with just the right tone, My Super Ex Girlfriend is fresh and frivolous escapism. -- Prediction: fresh. Actually: fresh 

 While the premise of vigilante just

**(TO DO) Q4 - 3 marks**  
The result calculated in the method "correctReviews" does not provide results for individual classes (fresh, rotten), so we don't really know if the algorithm does well or not on each class.  We can instead calculate the **Recall** per class.

(a) Modify the code above so it would provide recall results per class. \
(b) Discuss if the algorithm has differences between classes.

In [16]:
# ANSWER Q4 - Part (a)
# Write code (you can use the last few lines to guide you.. but you can change them if you want)
def recallPerClass(reviews, tags):
    nbCorrectRotten = 0
    nbCorrectFresh = 0
    nbTotalRotten = 0
    nbTotalFresh = 0
    recallForRotten = 0
    recallForFresh = 0
    for j in range(len(tags)):
        if tags[j] == 'fresh':
            nbTotalFresh += 1
        else:
            nbTotalRotten += 1
    for i in range(len(reviews)):
        polarity = baselinePolarity(reviews[i])
        if (polarity == tags[i]):
            if(polarity == 'fresh'):
                nbCorrectFresh += 1
            else:
                nbCorrectRotten += 1
    recallForRotten = nbCorrectRotten / nbTotalRotten
    recallForFresh = nbCorrectFresh / nbTotalFresh  # divide by the number of fresh reviews, not rotten
    #print('There are %s correct predictions out of %s total predictions' %(nbCorrect, len(tags))) 

    print('There are %s correct in Rotten class out of %s total predictions' %(nbCorrectRotten, nbTotalRotten))
    print('Recall for Rotten is %s' %(recallForRotten)  )    
    print('There are %s correct in Fresh class out of %s total predictions' %(nbCorrectFresh, nbTotalFresh))  
    print('Recall for Fresh is %s' %(recallForFresh) )      

In [17]:
# This may take a minute to run (depending on the size of your test set, and whether you are on colab or not)
recallPerClass(test_reviews, test_tags)

There are 3585 correct in Rotten class out of 5000 total predictions
Recall for Rotten is 0.717
There are 2523 correct in Fresh class out of 5000 total predictions
Recall for Fresh is 0.5046


**ANSWER Q4 - Part (b)**\
The algorithm has better recall for the Rotten class (0.717) than for the Fresh class (0.5046). Since the gold standard contains the same number of Rotten and Fresh reviews (5000 each), the algorithm is better at finding Rotten reviews. Combining the two classes, about 61% of the reviews are classified correctly, which is close to the earlier result.

**4. Supervised learning method**

We will now train a supervised learning model for polarity detection.

***4.1 Training data***  

In supervised learning, we need training data.  This training data must be *different* from but *representative* of the eventual test data. At the beginning of the notebook we defined the training data and the test data to be a subset of the entire dataset (20K (or 100K in colab) total rows from the 480K rows). We did this due to the large computation time of the Baseline Approach used within this notebook. In reality we would want to use the entire dataset and ensure that we have trained our models with a large enough training set. This would help ensure that, when predicting on unseen data, the model has already learned from examples similar to most of those we expect to encounter.

Usually a training set should be as large and varied as possible.  Training sets are very valuable, but they are costly to obtain, as they require tagging (human annotation) to generate them and the data itself may need to be cleaned. Once again, the training set is used to train the model and the testing set is used to test how well the trained model performs on unseen examples.

In [18]:
# Looking at the shapes of the train and test datasets that we will be using
print(train_reviews.shape)
print(test_reviews.shape)

(90000,)
(10000,)


***4.2 Pre-processing of input data*** 

This Machine Learning package, *scikit-learn*, is somewhat particular in the way the data must be formatted to be used by the training algorithms.  So, we must perform some preprocessing on the sentences above.  Luckily *scikit-learn* provides some pre-defined functions for doing text pre-processing.  

We easily transform each sentence into a list of indexes into a dictionary.  The dictionary is built from the words in the sentences.  The keys of the dictionary are the words, and the value is an index.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# The CountVectorizer builds a dictionary of all words (count_vect.vocabulary_), 
# and generates a matrix (train_counts), to represent each sentence
# as a set of indices into the dictionary. The words in the dictionary are the words found in train_reviews.

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_reviews)

To understand what the code above does, first let's print the vocabulary gathered from the sentences in train_reviews.  You will see a set of words and indexes.  For example ('with' : 4809) would mean the word 'with' is associated with index 4809.

In [20]:
# print the vocabulary (dictionary of words)
print(count_vect.vocabulary_)



Then, let's print the *train_counts*.  You will see a list of tuples with a count.  For example (0,8790) 1 would mean that the sentence 0, contains the word with index 8790 1 time.  Or (0, 31422) 3 would mean that sentence 0 contains the word with index 31422, 3 times.

In [21]:
# print the content of the training examples in terms of frequency of words (each word represented by its index)
print(train_counts)

  (0, 8790)	1
  (0, 31422)	3
  (0, 1203)	1
  (0, 41503)	1
  (0, 16967)	1
  (0, 26893)	1
  (0, 43528)	1
  (0, 50427)	1
  (0, 4165)	1
  (0, 25914)	1
  (0, 17818)	1
  (0, 4747)	1
  (0, 39821)	1
  (0, 26572)	1
  (0, 22611)	1
  (0, 38925)	1
  (0, 40363)	1
  (0, 45533)	1
  (0, 32086)	1
  (0, 2979)	1
  (0, 18739)	1
  (0, 1881)	1
  (0, 23416)	1
  (0, 3318)	1
  (0, 8141)	1
  :	:
  (89998, 12832)	1
  (89999, 50427)	1
  (89999, 4747)	1
  (89999, 45533)	2
  (89999, 1942)	1
  (89999, 45524)	1
  (89999, 23875)	1
  (89999, 6606)	1
  (89999, 2233)	1
  (89999, 45623)	1
  (89999, 4453)	1
  (89999, 30787)	1
  (89999, 4128)	1
  (89999, 21860)	1
  (89999, 14370)	1
  (89999, 12955)	1
  (89999, 15919)	1
  (89999, 26746)	1
  (89999, 12433)	1
  (89999, 25959)	1
  (89999, 2869)	1
  (89999, 5906)	1
  (89999, 5680)	1
  (89999, 46404)	1
  (89999, 41061)	1


So the train_counts contain for each sentence, the BOW (Bag-Of-Words) associated with that sentence, but in the form of a list of indexes (each index corresponding to a word).
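
As a small illustration of the vocabulary and count matrix described above, the sketch below runs `CountVectorizer` on two toy sentences. The names `toy_reviews`, `toy_vect` and `toy_counts` are hypothetical and exist only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical toy "reviews"
toy_reviews = ["good good movie", "bad movie"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)

# Dictionary word -> column index (indices follow alphabetical order of the words)
print(sorted(toy_vect.vocabulary_.items()))

# Sparse view: (sentence index, word index) followed by the count
print(toy_counts)

# Dense view: one row per sentence, one column per word
print(toy_counts.toarray())
```

In the dense view, the first row has a 2 in the column for "good", which is the same information as a `(0, index_of_good)  2` entry in the sparse printout. The dense form is only practical for toy data: with tens of thousands of reviews and words, the sparse form is what keeps memory use manageable.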

***4.3 Naive Bayes learning***

With the data preprocessed, we are ready to test the Naive Bayes algorithm provided by scikit-learn.  That algorithm requires the training data to be represented in terms of *train counts*, which is why we did the pre-processing above.

Since scikit-learn contains a Naive Bayes classifier, using it is quite easy: calling *fit*, as you see below, trains the model.  But you know what's underneath!  It estimates prior probabilities for the classes (fresh, rotten) and conditional probabilities (likelihoods) of words (features) given each class (e.g. P(awful|fresh) or P(awful|rotten)).  All these probabilities are combined using Bayes' Theorem to obtain the posterior probability of each class given a review.  
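
To make the probabilities concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes on toy data. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names `train_nb` and `predict_nb` are hypothetical, documents are assumed to be already tokenized, and add-one (Laplace) smoothing is used so that a word never seen with a class does not get probability zero.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, tags, alpha=1.0):
    """Estimate log priors log P(c) and smoothed log likelihoods log P(w|c)."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(tags)
    word_counts = defaultdict(Counter)
    for doc, tag in zip(docs, tags):
        word_counts[tag].update(doc)
    log_prior = {c: math.log(n / len(tags)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # Denominator: all word occurrences in class c, plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(doc, log_prior, log_like):
    """Return the class maximizing log P(c) + sum of log P(w|c).
    Words outside the training vocabulary are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

# Hypothetical toy training data (already tokenized)
toy_docs = [["awful", "plot"], ["awful", "acting"],
            ["charming", "plot"], ["charming", "cast"]]
toy_tags = ["rotten", "rotten", "fresh", "fresh"]

log_prior, log_like = train_nb(toy_docs, toy_tags)
print(math.exp(log_like["rotten"]["awful"]))   # P(awful|rotten) = (2+1)/(4+5)
print(math.exp(log_like["fresh"]["awful"]))    # P(awful|fresh)  = (0+1)/(4+5)
print(predict_nb(["awful", "plot"], log_prior, log_like))
print(predict_nb(["charming", "cast"], log_prior, log_like))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when a review contains many words.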

In [22]:
# Test of a naive bayes algorithm, the "fit" is the training
from sklearn.naive_bayes import MultinomialNB

# Training the model
clf = MultinomialNB().fit(train_counts, train_tags)

***4.4 Evaluation of Naive Bayes***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [23]:
# Testing on training set
predicted = clf.predict(train_counts)
# Print the first ten predictions
for doc, category in zip(train_reviews[:10], predicted[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
correct = 0
for tag, pred in zip(train_tags, predicted):   # zip allows to go through two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total training examples out of %s examples" %(correct, train_tags.size))

" t's a coming-of-age-slash-first-love story, with beguiling leads, a frank, bittersweet script, a lived-in, rustic setting (the outskirts of Athens, Georgia), an intense awareness of class difference and a mood that's all its own. " => fresh

' A remarkably assured debut for Coen.' => fresh

' All we get is sullen Kevin Dillon subbing for Steve and a bigger, nastier lump that bloodies up its victims in stomach-churning close-ups.' => rotten

' I wonder how Gibney operated the camera while on his back.' => rotten

' Jim Carrey, as a queer con artist motivated by undying passion for a cellmate, brings a euphoric energy to this insubordinate farce.' => fresh

" Like the soulful, saucer-eyed Precious Moments Terminatrix it frames as a protagonist, 'Alita' is a high-tech hodgepodge [...] that paradoxically feels triumphant in the moments it works because you can see how much of it shouldn't have worked at all. " => fresh

' The whole film needs cordoning off with safety rope and "Keep Out"

Usually, the results on the training set are pretty good, but they give an optimistic estimate because the model has already seen these examples.  We should therefore test on a real **test set**, namely test_reviews and test_tags.

**(TO DO) Q5 - 4 marks**  

Test the model trained with the training data on the test set. (attention, do not retrain on the test set) \
(a) Write the code below to do so.  Before testing, each test set must be transformed through the preprocessing steps (as we did for the training set before), so their format is compatible with the learner. \
(b) Discuss if the results are better on the training set or the test set.  What is the difference? 

In [25]:
# ANSWER Q5 - Part a 
# Reuse the vocabulary learned on the training set: call transform, not fit_transform,
# otherwise the test features would not line up with those the classifier was trained on.
test_counts = count_vect.transform(test_reviews)
predictedT = clf.predict(test_counts)
# Show the first 10 test reviews with their predicted class
for doc, category in zip(test_reviews[:10], predictedT[:10]):
    print('%r => %s\n' % (doc, category))
# Count the predictions that match the true tags
correct = 0
for tag, pred in zip(test_tags, predictedT):
    if tag == pred:
        correct += 1
print("Correctly classified %s total test examples out of %s examples" % (correct, test_tags.size))

' The film intelligently portrays how, even years later, the memories of trauma can be enough to inhibit a person from finding happiness - and consequently, how essential love is.' => fresh

" Fight Club confirms Edward Norton's ascent into the pantheon of actors like Robert DeNiro and Robert Duvall." => fresh

" Not that anything's inherently wrong with a patriotic war movie, but there's a distastefulness in the film's simpleton view of mass suffering." => rotten

' The grim film is sure to make the audience uncomfortable.' => fresh

' Talking about class can be ugly. Yet as AWOL asserts, when you dare to comment, sometimes it frees up room for beauty to unfurl.' => rotten

' High energy fun with just the right tone, My Super Ex Girlfriend is fresh and frivolous escapism.' => fresh

" While the premise of vigilante justice might be one of honest intentions, D'Silva sullies the cause by refusing to fill up the giant craters in his script." => rotten

" Beer and Niney do solid work, but

**ANSWER Q5 - Part b** \
Results are better on the training set: the rate of correct classification is about 20% higher on the training set than on the test set, which suggests some overfitting.
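The reason the test reviews go through `transform` rather than `fit_transform` can be illustrated with a tiny bag-of-words written from scratch. This is only a simplified sketch, not how scikit-learn's `CountVectorizer` is implemented; the helper names `fit_vocabulary` and `transform` and the three toy reviews are assumptions made for the illustration.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct lowercase token in docs to a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def transform(docs, vocab):
    """Turn each document into a count vector over a fixed vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # One column per vocabulary token; tokens outside the vocabulary are dropped
        rows.append([counts.get(token, 0) for token in vocab])
    return rows

# Hypothetical toy data, not taken from the review dataset
train_docs = ["a great film", "a dull film"]
test_docs = ["great fun"]

vocab = fit_vocabulary(train_docs)           # learned on the training data only
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)    # same columns; "fun" is unseen and ignored

print(vocab)
print(test_matrix)

# Refitting on the test data would give a different number of columns,
# which a classifier trained on the original columns could not use.
refit_vocab = fit_vocabulary(test_docs)
print(len(vocab), len(refit_vocab))
```

Because the vocabulary is fixed at training time, words that appear only in the test reviews contribute nothing to the prediction, which is one plausible reason for the lower score on the test set.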

***4.5 Evaluation: Recall, Precision and F-measure***\
We explore the two evaluation measures presented in the video lectures, recall and precision, and we introduce a new measure called the F-measure.


**(TO DO) Q6 - 2 marks**   
A common **Evaluation Measure** in Machine Learning is **Recall**. You already computed the recall of the baseline algorithm in Question 4.

Recall is the number of correct predictions for a class of interest (called the True Positives) divided by the total number of instances that are actually labelled as that class of interest (True Positives + False Negatives).   For example, if the test set contains 5 fresh examples and the algorithm only found 2, then the recall for the class fresh is 2/5.  

Write a small method below that will calculate a class' recall.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the recall of that class (e.g. 50%).

In [44]:
# ANSWER Q6 - 
# Can be implemented in any way as long as it works correctly
def recall(actualTags, predictions, classOfInterest):
    # Recall = TP / (TP + FN): the denominator counts the examples whose
    # actual tag is the class of interest.
    count = 0   # true positives
    total = 0   # examples actually labelled classOfInterest
    for actual, predicted in zip(actualTags, predictions):
        if actual == classOfInterest:
            total += 1
            if predicted == actual:
                count += 1
    return count / total if total else 0.0

You can obtain results of your method with the test below.

In [45]:
# Need to get the recall for the test set predictions from Naive Bayes
print(recall(test_tags, predictedT, "fresh"))
print(recall(test_tags, predictedT, "rotten"))

0.759
0.8148


**(TO DO) Q7 - 2 marks**   
Another common **Evaluation Measure** in Machine Learning is called **Precision**. Precision is the number of correct predictions for a class of interest (True Positives) divided by the total number of times that class of interest was predicted (True Positives + False Positives). For example, if the test set (ground truth) contains 3 fresh examples (1 to 3) and 1 rotten example (4) and the algorithm predicted examples (1, 2, 4) as *Fresh* and example (3) as *Rotten*, then the Precision for the class *Fresh* is 2/3.

Write a small method below that will calculate a class' precision.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the precision of that class (e.g. 50%).

In [46]:
# ANSWER Q7
def precision(actualTags, predictions, classOfInterest):
    # Precision = TP / (TP + FP): the denominator counts the examples that
    # were predicted as the class of interest.
    count = 0   # true positives
    total = 0   # examples predicted as classOfInterest
    for actual, predicted in zip(actualTags, predictions):
        if predicted == classOfInterest:
            total += 1
            if actual == predicted:
                count += 1
    return count / total if total else 0.0

You can obtain results of your method with the test below.

In [47]:
# Get the precision for the test set predictions from Naive Bayes
print(precision(test_tags, predictedT, "fresh"))
print(precision(test_tags, predictedT, "rotten"))

0.8038551154416437
0.7717370714150408
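The two definitions differ only in their denominator, which is easy to mix up. The sketch below counts true positives, false positives and false negatives once and derives both measures from them. The helper names `confusion_counts` and `precision_recall` are hypothetical, and the first toy example extends the "5 fresh examples, 2 found" case from the text with two rotten examples that are an assumption.

```python
def confusion_counts(actual_tags, predictions, class_of_interest):
    """Return (true positives, false positives, false negatives) for one class."""
    tp = fp = fn = 0
    for actual, predicted in zip(actual_tags, predictions):
        if predicted == class_of_interest and actual == class_of_interest:
            tp += 1   # predicted the class and it was right
        elif predicted == class_of_interest:
            fp += 1   # predicted the class but the actual tag differs
        elif actual == class_of_interest:
            fn += 1   # missed an example of the class
    return tp, fp, fn

def precision_recall(actual_tags, predictions, class_of_interest):
    """Return (precision, recall) for one class, using 0.0 for empty denominators."""
    tp, fp, fn = confusion_counts(actual_tags, predictions, class_of_interest)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

# 5 fresh examples of which only 2 are found; the 2 rotten examples are assumed
actual_a = ["fresh"] * 5 + ["rotten"] * 2
predicted_a = ["fresh", "fresh"] + ["rotten"] * 5
print(confusion_counts(actual_a, predicted_a, "fresh"))
print(precision_recall(actual_a, predicted_a, "fresh"))

# The precision example from the text: examples 1, 2, 4 predicted fresh, 3 predicted rotten
actual_b = ["fresh", "fresh", "fresh", "rotten"]
predicted_b = ["fresh", "fresh", "rotten", "fresh"]
print(precision_recall(actual_b, predicted_b, "fresh"))
```

In the first example every fresh prediction is correct, so precision is perfect while recall is only 2/5, which shows that the two measures can diverge sharply.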


**(TO DO) Q8 - 2 marks**

Another popular Evaluation Measure in Machine Learning is F-measure. 

F-Measure provides a way to combine both precision and recall into a single measure that captures both properties. Once precision and recall have been calculated for a binary or multiclass classification problem, the two scores can be combined into the F-Measure.

The traditional F measure is calculated as follows:
F-Measure = (2 * Precision * Recall) / (Precision + Recall) 
This is the harmonic mean of the two fractions. This is sometimes called the F-Score or the F1-Score and might be the most common metric used on imbalanced classification problems.
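To see why the harmonic mean is preferred over a plain average, here is a minimal sketch with made-up scores. The function name `f_measure` and the values 0.9 and 0.1 are assumptions chosen for illustration; they are not results from this notebook.

```python
def f_measure(precision_score, recall_score):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision_score + recall_score == 0:
        return 0.0
    return 2 * precision_score * recall_score / (precision_score + recall_score)

# Hypothetical classifier: very precise, but it misses most examples of the class
p, r = 0.9, 0.1

harmonic = f_measure(p, r)      # about 0.18, pulled toward the weaker score
arithmetic = (p + r) / 2        # 0.5, which hides the poor recall

print(harmonic, arithmetic)

# The formula is symmetric: swapping the two scores gives the same result
print(abs(f_measure(p, r) - f_measure(r, p)) < 1e-12)
```

The harmonic mean stays low unless both scores are high, so a classifier cannot obtain a good F-Measure by maximizing only one of the two.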

You have already implemented functions for precision and recall. Now reuse these two values to implement the F-measure function for both classes.


In [48]:
# ANSWER Q8
def fmeasure(precision_score, recall_score):
    result = (2 * precision_score * recall_score)/(precision_score + recall_score)
    return result

In [49]:
precision_score = precision(test_tags, predictedT, "fresh")
recall_score = recall(test_tags, predictedT, "fresh")
print (fmeasure(precision_score, recall_score))
precision_score = precision(test_tags, predictedT, "rotten")
recall_score = recall(test_tags, predictedT, "rotten")
print (fmeasure(precision_score, recall_score))


0.7807838699722252
0.7926841132405876


**(TO DO) Q9 - 2 marks**

Now that you have the results for Precision, Recall, and F-Measure, discuss those results. Here are three questions to help you:

*   Is there a big difference between the Fresh and Rotten class for the various results? What could explain this?
*   Is there a big difference between recall and precision? What could explain this?
*   Do you think the results are good enough to use this algorithm? Why or why not?

**ANSWER Q9** \
The difference between the Fresh class and the Rotten class is small, because the dataset contains fresh and rotten reviews in a 1:1 ratio.
The difference between recall and precision is also small.
I don't think the results are good enough to use this algorithm: its F-measure is about 79%, and it should be higher than 96% for daily use.

***4.6 Comparing to the baseline***

Now that we have tested the Naive Bayes algorithm, we can evaluate its value by comparing it to our baseline approach.

**(TO DO) Q10 - 3 marks**       
Is the Naive Bayes approach performing better than the baseline approach, if so by how much?  You have programmed only the Recall (not the precision) for the Baseline approach, so you can express your comparison using the Recall only. Indicate how much training and test data has been used. Say whether you find the increase (if any) significant or not.
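When reporting "how much better", it helps to say whether the gain is expressed in percentage points or relative to the baseline. The sketch below shows both; the function name `recall_gain` and the recall values 0.50 and 0.75 are hypothetical and are not the results of this notebook.

```python
def recall_gain(baseline_recall, model_recall):
    """Return the gain in percentage points and as a percentage of the baseline."""
    difference = model_recall - baseline_recall
    absolute_points = difference * 100                  # percentage points
    relative_percent = difference / baseline_recall * 100  # relative to the baseline
    return absolute_points, relative_percent

# Hypothetical recalls for one class: baseline 0.50, new model 0.75
points, relative = recall_gain(0.50, 0.75)
print("Gain: %.1f percentage points (%.1f%% relative to the baseline)" % (points, relative))
```

A gain of 25 percentage points over a baseline of 0.50 is a 50% relative improvement, so stating which of the two is meant avoids ambiguity.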

**ANSWER Q10**   

Using X samples in the training set and Y samples in the test set, we obtain the following recall values (true positives divided by the number of examples actually labelled with the class):

Recall for baseline on Fresh: 0.72  
Recall for Naive Bayes on Fresh: 0.76

Recall for baseline on Rotten: 0.5  
Recall for Naive Bayes on Rotten: 0.81

Gains are about 4 percentage points on Fresh and 31 percentage points on Rotten.

Are the gains important or negligible?  
The gains are important.

***SIGNATURE:***
My name is Tan Chen.
My student number is 300072995.
I certify being the author of this assignment.