# Notebook 4 - Supervised Learning

CSI4106 Artificial Intelligence  
Fall 2020  
Prepared by Julian Templeton and Caroline Barrière

***INTRODUCTION***:

The supervised classification task tackled in this notebook is **polarity detection**, which is one possible activity within the quite popular trend of *Opinion Mining* in AI.  Many companies want to know whether there are positive or negative reviews about them.  Reviews can be on hotels, restaurants, movies, customer service of any kind, etc.

This notebook will help you better understand an ***experimental set-up*** for supervised machine learning, including the notions of training set, test set, and evaluation.  The notebook also introduces the notion of comparative evaluation.  To say whether a method is good or not, we often compare it to a *baseline* approach.  

This notebook makes use of a really nice and popular machine learning package, called **scikit-learn** (http://scikit-learn.org/stable/).  It contains many pre-coded machine learning algorithms which you can call.  To use this package, you must install it. You will also need to install **Pandas**, which is a great tool for manipulating data to use in Machine Learning algorithms, and **Numpy**, which is often used by machine learning libraries.  At the command prompt, type ***pip install numpy***, ***pip install scikit-learn*** and ***pip install pandas*** to install the packages.  

In this notebook we will use the Naive Bayes Machine Learning algorithm for polarity detection of a large movie review dataset, but we will explore other ML algorithms included in scikit-learn in future notebooks.  

You will need to download the movie review dataset from the following shared Google Drive:
https://drive.google.com/file/d/1w1TsJB-gmIkZ28d1j7sf1sqcPmHXw352/view

This is a dataset of reviews from Rotten Tomatoes along with the Freshness of the review (Fresh or Rotten). We will be using this dataset throughout the notebook so be sure to place it in the same directory as this notebook. It contains 480000 reviews with half of them being rotten and the other half being fresh. We will only use a subset of these due to the large computation time of the Baseline learner.


***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, Sign the notebook (at the end of the notebook), and submit it.  

*The notebook will be marked out of 20.  
Each **(TO DO)** has a number of points associated with it.*
***

**1. Polarity detection**  

In polarity detection, we use two classes: positive and negative.  This is different from sentiment analysis, for example, in which the classes might be sad, happy, anxious, angry, etc.  It's also more restricted than *rating*, in which we would like to assign a value (0..5) to evaluate a particular service.  So, the polarity detection task aims to assign either *positive* or *negative* to a statement.

**2. Application domain:  Movie reviews**  

Polarity detection could be used on reviews of anything.  In this notebook, we wish to apply polarity detection within the domain of movies.  Movie reviewers give a review accompanied by a score for the movies that they review. Rotten Tomatoes is a website that collects movie reviews and the accompanying ratings, where a rating can be classified as "Rotten" for a low review score or "Fresh" for a higher review score. We will perform polarity detection on the dataset *rt_reviews.csv* that you downloaded earlier.

The first thing to do is to set up the training and testing sets for our models. We will build these sets by importing the data from the dataset using pandas, then passing that dataframe to scikit-learn's train_test_split function, which separates the data into a training set and a test set. These will be used later on by the models that will be created/used.

We **SHOULD NOT** use this test set to build our model later on. The test set (unseen data) is to test the model after we train it with the training set.

In [1]:
# Import the libraries that we will use to help create the train and test sets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Import the dataset, need to use the ISO-8859-1 encoding due to some invalid UTF-8 characters
df = pd.read_csv("rt_reviews.csv", encoding="ISO-8859-1")

The first step after loading the data is to take a quick look at it. Pandas offers the two useful functions df.head() and df.tail() which allow you to visualize the top and the bottom of your data frame.

In [2]:
df.head(5) # Show the first five reviews of the dataset to understand the dataframe's structure

Unnamed: 0,Freshness,Review
0,fresh,"Manakamana doesn't answer any questions, yet ..."
1,fresh,Wilfully offensive and powered by a chest-thu...
2,rotten,It would be difficult to imagine material mor...
3,rotten,Despite the gusto its star brings to the role...
4,rotten,If there was a good idea at the core of this ...


In [3]:
# Randomly select 10000 fresh examples from the dataframe
dfFresh = df[df["Freshness"] == "fresh"].sample(n=10000, random_state=8)
# Randomly select 10000 rotten examples from the dataframe
dfRotten = df[df["Freshness"] == "rotten"].sample(n=10000, random_state=5)
# Combine the results to make a small random subset of reviews to use
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
dfPartial = pd.concat([dfFresh, dfRotten])

In [4]:
# Split the data such that 90% is used for training and 10% is used for testing (separating the review
# from the freshness scores that we will use as the labels)
# Recall that we do not use this test set when building the model, only the training set
# We use the stratify parameter so that the training and test sets keep the same
# proportion of fresh and rotten reviews as dfPartial (here, balanced 50/50)
train_reviews, test_reviews, train_tags, test_tags = train_test_split(dfPartial["Review"],
                                                                      dfPartial["Freshness"],
                                                                      test_size=0.1, 
                                                                      random_state=3,
                                                                      stratify=dfPartial["Freshness"])
train_tags = train_tags.to_numpy()
train_reviews = train_reviews.to_numpy()
# Testing set (what we will use to test the trained model)
test_tags = test_tags.to_numpy()
test_reviews = test_reviews.to_numpy()

**3. Using available resources**  

For polarity detection, some researchers have established lists of positive and negative words.  The ones used in this notebook have been downloaded from [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (a website on Opinion Mining by renowned researcher Bing Liu) and stored locally.  The files *positive-words.txt* and *negative-words.txt* are in the Jupyter Notebook module in Brightspace.  Make sure you place these files in the same directory as your notebook.

In [5]:
# Read the positive words
# to fix encoding problems, you might need to replace the line below
# with open("positive-words.txt", encoding = "ISO-8859-1") as f: 

with open("positive-words.txt") as f:
    posWords = f.readlines()
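# Keep only the lines that start with a letter (skips the header and blank lines) and drop the newline
posWords = [p.rstrip("\n") for p in posWords if p[0].isalpha()]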

# print the first 50 words
print(posWords[:50])

['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accolades', 'accommodative', 'accomodative', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'accurate', 'accurately', 'achievable', 'achievement', 'achievements', 'achievible', 'acumen', 'adaptable', 'adaptive', 'adequate', 'adjustable', 'admirable', 'admirably', 'admiration', 'admire', 'admirer', 'admiring', 'admiringly', 'adorable', 'adore', 'adored', 'adorer', 'adoring', 'adoringly', 'adroit', 'adroitly', 'adulate', 'adulation', 'adulatory', 'advanced', 'advantage', 'advantageous']


In [6]:
# Read the negative words
# to fix encoding problems, you might need to replace the line below
# with open("negative-words.txt", encoding = "ISO-8859-1") as f: 

with open("negative-words.txt") as f:
    negWords = f.readlines()
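# Keep only the lines that start with a letter (skips the header and blank lines) and drop the newline
negWords = [p.rstrip("\n") for p in negWords if p[0].isalpha()]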

# Print the first 50 negative words
print(negWords[:50])

['abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted', 'aborts', 'abrade', 'abrasive', 'abrupt', 'abruptly', 'abscond', 'absence', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'absurdly', 'absurdness', 'abuse', 'abused', 'abuses', 'abusive', 'abysmal', 'abysmally', 'abyss', 'accidental', 'accost', 'accursed', 'accusation', 'accusations', 'accuse', 'accuses', 'accusing', 'accusingly', 'acerbate', 'acerbic', 'acerbically', 'ache', 'ached', 'aches', 'achey', 'aching', 'acrid', 'acridly', 'acridness', 'acrimonious', 'acrimoniously']


**4. Baseline approach**  

Before we evaluate the performance of a supervised learning approach, we can start by establishing a very simple baseline approach.  It's always good to start simple.  A baseline allows us to measure whether the additional complexity of the various models we develop is worth it or not.

The *baseline algorithm* we will use simply counts the number of positive and negative words in the review and outputs the category corresponding to the maximum.  This approach DOES NOT LEARN anything.  It just uses a particular *reasoning* (strategy at test time).  You might be surprised to find out how many *AI start-ups* within the area of Opinion Mining use this kind of simple approach.  

In [7]:
# First let's define methods to count positive and negative words

def countPos(text):
    count = 0
    for t in text.split():
        if t in posWords:
            count += 1
    return count

def countNeg(text):
    count = 0
    for t in text.split():
        if t in negWords:
            count += 1
    return count
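
One likely reason the baseline is slow is that `t in posWords` scans a Python list for every token. The sketch below shows the same counting rule with sets, which give constant-time membership tests on average. The toy word lists and the names `count_matches` and `baseline_polarity_fast` are illustrative assumptions, not part of the notebook.

```python
# Toy word lists standing in for positive-words.txt and negative-words.txt (assumption)
toy_pos_words = {"funny", "brilliant", "deft", "original"}
toy_neg_words = {"lame", "bad", "generic", "absurd"}

def count_matches(text, lexicon):
    """Count the whitespace-separated tokens of text that appear in lexicon."""
    return sum(1 for token in text.split() if token in lexicon)

def baseline_polarity_fast(review, pos_lexicon, neg_lexicon):
    """Same decision rule as the baseline: 'fresh' only if positives outnumber negatives."""
    num_pos = count_matches(review, pos_lexicon)
    num_neg = count_matches(review, neg_lexicon)
    return "fresh" if num_pos > num_neg else "rotten"

print(baseline_polarity_fast("a deft and original script", toy_pos_words, toy_neg_words))
print(baseline_polarity_fast("a lame script", toy_pos_words, toy_neg_words))
```

In the notebook itself, `set(posWords)` and `set(negWords)` could be passed as the two lexicons; the predictions should be unchanged because only the lookup structure differs.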

In [8]:
# Simple counting algorithm as baseline approach to polarity detection
def baselinePolarity(review):
    numPos = countPos(review)
    numNeg = countNeg(review)
    if numPos > numNeg:
        return "fresh"   
    else:
        return "rotten"   

In [9]:
# Test the baseline method
print("Testing baselinePolarity with the review:", train_reviews[0])
print("baselinePolarity result:", baselinePolarity(train_reviews[0]))
print("Actual result:", train_tags[0])
print(" ")
print("Testing baselinePolarity with the review:", train_reviews[1])
print("baselinePolarity result:", baselinePolarity(train_reviews[1]))
print("Actual result:", train_tags[1])

Testing baselinePolarity with the review:  "Captain Marvel" ultimately feels more obligatory than inspired, a movie that basically gets the job done and little more.
baselinePolarity result: rotten
Actual result: fresh
 
Testing baselinePolarity with the review:  As the coffee-fueled Phil lurches further and further over the edge of soccer coach impropriety, you can sense Ferrell struggling to break through the lame script
baselinePolarity result: rotten
Actual result: rotten


**5. Evaluation of the Baseline Approach**  
We saw in class that there are different ways of evaluating an algorithm. In the case of classification, a common evaluation method is simply to count the *number of wrong choices*. Similarly, it is possible to compute the *accuracy* of a model as the total number of correct predictions divided by the total number of predictions made. We will explore *accuracy* in later notebooks.    

To test our *baseline algorithm* we use the test set defined earlier and calculate the number of wrong assignments.
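
As a small illustration of the two measures described above, the sketch below counts wrong predictions and computes accuracy on a made-up list of tags. The function name `accuracy` and the example data are assumptions for illustration only.

```python
def accuracy(actual_tags, predictions):
    """Fraction of predictions that match the actual tags."""
    if len(actual_tags) != len(predictions):
        raise ValueError("actual_tags and predictions must have the same length")
    if len(actual_tags) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for tag, pred in zip(actual_tags, predictions) if tag == pred)
    return correct / len(actual_tags)

# Made-up example: one of the four predictions is wrong
example_tags = ["fresh", "rotten", "rotten", "fresh"]
example_predictions = ["rotten", "rotten", "rotten", "fresh"]

wrong = sum(1 for tag, pred in zip(example_tags, example_predictions) if tag != pred)
print(wrong)                                         # 1
print(accuracy(example_tags, example_predictions))   # 0.75
```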

In [88]:
# Function takes a one dimensional array of reviews and a one dimensional array of
# tags as input and prints the number of incorrect assignments when running the baseline approach
# on the reviews.
# Let's establish the polarity for each review
def incorrectReviews(reviews, tags):
    nbWrong = 0
    count = 0
    for i in range(len(reviews)):
        polarity = baselinePolarity(reviews[i])
        if (count < 10):
            print(reviews[i] + " -- Prediction: " + polarity + ". Actually: " + tags[i] + " \n")
            count += 1
        if (polarity != tags[i]):
            nbWrong += 1

    print('There are %s wrong predictions out of %s total predictions' %(nbWrong, len(tags)))    

In [12]:
# This may take a minute to run
incorrectReviews(test_reviews, test_tags)

 tries to be both a comedy and an action film, but only succeeds as a comedy. It's extremely funny, with Hill and Tatum displaying a disturbing amount of chemistry. -- Prediction: rotten. Actually: fresh 

 The result is like a two-stage booster rocket, propelled by a tricky structure, deft style and original scenario before settling into the sort of orbit any fanboy could recognize and appreciate. -- Prediction: fresh. Actually: fresh 

 Cursed feels like a series of curveballs designed to confuse the audience into thinking they're watching anything other than bad dialogue. -- Prediction: rotten. Actually: rotten 

 The Wrestler transcends professional wrestling as Requiem for a Heavyweight transcends boxing ... [it] raises the broken figure of the Ram to the level of a tragic hero... -- Prediction: rotten. Actually: fresh 

 Where the brilliance of Strange Weather lies - in great part through Hunter's stunningly human and subtly searing performance - is in its constant grappling with

In [89]:
incorrectReviews(train_reviews, train_tags)

 "Captain Marvel" ultimately feels more obligatory than inspired, a movie that basically gets the job done and little more. -- Prediction: rotten. Actually: fresh 

 As the coffee-fueled Phil lurches further and further over the edge of soccer coach impropriety, you can sense Ferrell struggling to break through the lame script -- Prediction: rotten. Actually: rotten 

 Come for the poster, stay for the end credits. -- Prediction: rotten. Actually: rotten 

 "2 Days in New York" leaves you feeling drained. and its conclusion is very cliché. Not only could the movie have ended nearly 45-minutes beforehand, but it's like shaking a Magic-8 Ball and getting"Ask again later" response. -- Prediction: fresh. Actually: rotten 

 Amenabar, at 25, has created a film that deserves comparison to Hitchcock's best work. -- Prediction: fresh. Actually: fresh 

 The HIGHly-anticipated sequel to 2004's Harold and Kumar Go to White Castle comes just how you'd expect it: raunchy, wild, disgusting and

**(TO DO) Q1 - 1 mark**  
Look at the ten outputs above which provide predictions from the Baseline approach for specified reviews along with their actual review class.
From the output, give the prior probabilities (no code needed) for each class based on the output given by the Baseline approach and based on the actual review class.

***Answer here***  
For the Baseline predictions:     
P(fresh) = 0.3     
P(rotten) = 0.7    

For the actual outputs:     
P(fresh) = 0.7    
P(rotten) = 0.3   

#### 6. Supervised learning method

We will now train a supervised learning model for polarity detection.

***6.1 Training data***  

In supervised learning, we need training data.  This training data must be *different* from but *representative* of the eventual test data. At the beginning of the notebook we defined the training data and the test data to be a subset of the entire dataset (20000 total rows from the 480000). We did this due to the large computation time of the Baseline Approach used within this notebook. In practice we would want to use the entire dataset so that our models are trained on a large enough training set. This makes it more likely that the unseen examples we later predict resemble examples seen during training.

Usually a training set should be as large and varied as possible.  Training sets are very valuable, but they are costly to obtain, as they require tagging (human annotation) to generate them and the data itself may need to be cleaned. Once again, the training set is used to train the model and the testing set is used to test how well the trained model performs on unseen examples.

In [13]:
# Looking at the shapes of the train and test datasets that we will be using
print(train_reviews.shape)
print(test_reviews.shape)

(18000,)
(2000,)


***6.2 Pre-processing of input data*** 

This Machine Learning package, *scikit-learn*, is somewhat particular in the way the data must be formatted to be used by the training algorithms.  So, we must perform some preprocessing on the sentences above.  Luckily *scikit-learn* provides some pre-defined functions for doing text pre-processing.  

We easily transform each sentence into a list of indexes into a dictionary.  The dictionary is built from the words in the sentences.  The keys of the dictionary are the words, and the value is an index.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# The CountVectorizer builds a dictionary of all words (count_vect.vocabulary_), 
# and generates a matrix (train_counts), to represent each sentence
# as a set of indices into the dictionary. The words in the dictionary are the words found in train_reviews.

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_reviews)

To understand what the code above does, first let's print the vocabulary gathered from the sentences in train_reviews.  

In [15]:
# print the vocabulary (dictionary of words)
print(count_vect.vocabulary_)



For example, you can interpret the output above as: 

'coffee':4558  to mean that the word 'coffee' has been assigned index 4558  
'highly':11255 to mean that the word 'highly' has been assigned index 11255

Then, let's print the *train_counts*.  

In [16]:
# print the content of the training examples in terms of frequency of words (each word represented by its index)
print(train_counts)

  (0, 3623)	1
  (0, 14707)	1
  (0, 25034)	1
  (0, 8860)	1
  (0, 15601)	2
  (0, 16476)	1
  (0, 24064)	1
  (0, 12399)	1
  (0, 15713)	1
  (0, 24072)	1
  (0, 2106)	1
  (0, 10032)	1
  (0, 24075)	1
  (0, 12952)	1
  (0, 7046)	1
  (0, 1102)	1
  (0, 14044)	1
  (1, 24075)	3
  (1, 1102)	1
  (1, 1528)	1
  (1, 4558)	1
  (1, 9696)	1
  (1, 17725)	1
  (1, 14323)	1
  (1, 9761)	2
  :	:
  (17999, 12027)	1
  (17999, 12725)	1
  (17999, 3399)	1
  (17999, 10033)	1
  (17999, 13119)	2
  (17999, 24150)	1
  (17999, 2172)	1
  (17999, 26653)	1
  (17999, 1067)	1
  (17999, 21544)	1
  (17999, 26404)	1
  (17999, 12990)	1
  (17999, 15873)	1
  (17999, 14484)	1
  (17999, 3979)	1
  (17999, 26183)	1
  (17999, 12651)	1
  (17999, 24036)	1
  (17999, 8232)	1
  (17999, 16520)	1
  (17999, 22763)	1
  (17999, 5538)	1
  (17999, 10334)	1
  (17999, 16625)	1
  (17999, 7857)	1


You can interpret each line above as:  

(0, 14044) 1  -- sentence 0 (in train_reviews) has 1 instance(s) of word 14044 (index of the word in count_vect.vocabulary_, that is the word 'little')  
(17999, 13119) 2  -- sentence 17999 (in train_reviews) has 2 instance(s) of word 13119 (index of the word in count_vect.vocabulary_, that is the word 'just')  

So train_counts contains, for each sentence, the bag of words (BOW) associated with that sentence, in the form of a list of indexes (each index corresponding to a word) together with their counts.
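
To make the vocabulary and count representation concrete, here is a minimal pure-Python sketch of the same idea. It is a simplified stand-in for CountVectorizer, not its implementation: like the default CountVectorizer it lowercases the text, keeps tokens of two or more word characters, and numbers the vocabulary alphabetically. The function names and toy sentences are assumptions.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", sentence.lower())

def build_vocabulary(sentences):
    """Map each distinct token to an index, assigned in alphabetical order."""
    tokens = set()
    for sentence in sentences:
        tokens.update(tokenize(sentence))
    return {word: index for index, word in enumerate(sorted(tokens))}

def count_rows(sentences, vocabulary):
    """Return (sentence id, word index, count) triples, ignoring unknown words."""
    rows = []
    for sentence_id, sentence in enumerate(sentences):
        counts = Counter(tokenize(sentence))
        for word, count in sorted(counts.items()):
            if word in vocabulary:
                rows.append((sentence_id, vocabulary[word], count))
    return rows

toy_sentences = ["A lame, lame script", "Funny script"]
vocabulary = build_vocabulary(toy_sentences)
rows = count_rows(toy_sentences, vocabulary)

print(vocabulary)  # {'funny': 0, 'lame': 1, 'script': 2}
print(rows)        # [(0, 1, 2), (0, 2, 1), (1, 0, 1), (1, 2, 1)]
```

The triple (0, 1, 2) reads the same way as the output above: sentence 0 has 2 instances of the word with index 1 ('lame'). Words outside the vocabulary are skipped, which mirrors what happens to unseen words when new text is transformed with an already fitted vectorizer.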

***6.3 Naive Bayes learning***

With the data preprocessed, we are ready to test the Naive Bayes algorithm provided by scikit-learn.  That algorithm requires the training data to be represented in terms of *train counts*, which is why we did the pre-processing above.

Training the model is as easy as calling *fit*, as you see below.  But you know what's underneath!!!  It estimates prior probabilities for the classes (fresh, rotten) and class-conditional probabilities (likelihoods) of words (features) per class, e.g. P(awful|fresh) or P(awful|rotten).  All these probabilities are used in Bayes' Theorem.  
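
As a sketch of how the probabilities mentioned above combine in a multinomial Naive Bayes model, assume additive (Laplace) smoothing with parameter $\alpha$ (scikit-learn's MultinomialNB uses $\alpha = 1$ by default). The symbols below are notation introduced here: $n_w(d)$ is the count of word $w$ in review $d$, $N_{wc}$ is the total count of word $w$ in the training reviews of class $c$, $N_c = \sum_{w} N_{wc}$, and $|V|$ is the vocabulary size.

$$
P(c \mid d) \;\propto\; P(c) \prod_{w \in V} P(w \mid c)^{\,n_w(d)},
\qquad
\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha\,|V|}
$$

The predicted class is the one with the larger value of $P(c \mid d)$, for example comparing $c = \text{fresh}$ against $c = \text{rotten}$. The smoothing term keeps a word that was never seen with a class from forcing the whole product to zero.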

**(TO DO) Q2 - 2 marks**  
Before training the model, what are the prior probabilities of the fresh and rotten classes using the training set above?

In [44]:
# Find the prior probabilities for the fresh and rotten classes in the train set (train_tags) and the test set (test_tags)
# that we will be using.
# You must calculate it from the train and test sets, then print the calculated result
# Print the prior probabilities as: <TRAIN_OR_TEST>: P(class) = value
train_count_fresh = 0
train_count_rotten = 0
for i in range (len(train_tags)):
    if train_tags[i] == 'fresh':
        train_count_fresh += 1
    elif train_tags[i] == 'rotten':
        train_count_rotten += 1
print('<TRAIN>: P(fresh) =',train_count_fresh/len(train_tags))
print('<TRAIN>: P(rotten) =',train_count_rotten/len(train_tags))

test_count_fresh = 0
test_count_rotten = 0
for j in range (len(test_tags)):
    if test_tags[j] == 'fresh':
        test_count_fresh += 1
    elif test_tags[j] == 'rotten':
        test_count_rotten += 1
print('<TEST>: P(fresh) =',test_count_fresh/len(test_tags))
print('<TEST>: P(rotten) =',test_count_rotten/len(test_tags))

<TRAIN>: P(fresh) = 0.5
<TRAIN>: P(rotten) = 0.5
<TEST>: P(fresh) = 0.5
<TEST>: P(rotten) = 0.5


In [41]:
# Test of a naive bayes algorithm, the "fit" is the training
from sklearn.naive_bayes import MultinomialNB

# Training the model
clf = MultinomialNB().fit(train_counts, train_tags)   

***6.4 Evaluation of Naive Bayes***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [42]:
# Testing on training set
predicted = clf.predict(train_counts)
# Print the first ten predictions
for doc, category in zip(train_reviews[:10], predicted[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
correct = 0
for tag, pred in zip(train_tags, predicted):   # zip allows to go through two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total training examples out of %s examples" %(correct, train_tags.size))

' "Captain Marvel" ultimately feels more obligatory than inspired, a movie that basically gets the job done and little more.' => fresh

' As the coffee-fueled Phil lurches further and further over the edge of soccer coach impropriety, you can sense Ferrell struggling to break through the lame script' => rotten

' Come for the poster, stay for the end credits.' => rotten

' "2 Days in New York" leaves you feeling drained. and its conclusion is very cliché. Not only could the movie have ended nearly 45-minutes beforehand, but it\'s like shaking a Magic-8 Ball and getting"Ask again later" response.' => rotten

" Amenabar, at 25, has created a film that deserves comparison to Hitchcock's best work." => fresh

" The HIGHly-anticipated sequel to 2004's Harold and Kumar Go to White Castle comes just how you'd expect it: raunchy, wild, disgusting and completely absurd." => rotten

' While it builds palpable tension early on, No One Lives de-evolves into a generic torture porn gorefest in 

Unsurprisingly, on the training set we get most of the examples correct.  But we should test on a real **test set**, namely test_reviews and test_tags.

**(TO DO) Q3 - 4 marks**  
Test the trained model on the test set.  Write the code below to do so.  Before testing, the test set must be transformed through the same preprocessing steps, so that its format is compatible with the learner.

In [49]:
# Pre-process test set test_reviews
# Note, we use transform and NOT fit_transform since we do not want to re-fit the vectorizer
# that we used to train the model
test_counts = count_vect.transform(test_reviews)
# Predict the results
test_predicted = clf.predict(test_counts)
# Print the first ten predictions
for doc, category in zip(test_reviews[:10], test_predicted[:10]):  
    print('%r => %s\n' % (doc, category))
# Print the total correctly classified instances out of the total instances
correct = 0
for tag, pred in zip(test_tags, test_predicted):
    if (tag == pred):
        correct += 1
print("Correctly classified %s testing instances out of %s total instances" %(correct, test_tags.size))

" tries to be both a comedy and an action film, but only succeeds as a comedy. It's extremely funny, with Hill and Tatum displaying a disturbing amount of chemistry." => rotten

' The result is like a two-stage booster rocket, propelled by a tricky structure, deft style and original scenario before settling into the sort of orbit any fanboy could recognize and appreciate.' => fresh

" Cursed feels like a series of curveballs designed to confuse the audience into thinking they're watching anything other than bad dialogue." => rotten

' The Wrestler transcends professional wrestling as Requiem for a Heavyweight transcends boxing ... [it] raises the broken figure of the Ram to the level of a tragic hero...' => fresh

" Where the brilliance of Strange Weather lies - in great part through Hunter's stunningly human and subtly searing performance - is in its constant grappling with the nonlinear process of grief. " => fresh

' Hosking removed the provocation [of The Greasy Strangler] but kept

***6.5 More Evaluation!***


**(TO DO) Q4 - 2 marks**   
A common **Evaluation Measure** in Machine Learning is **Recall**. Recall is the number of correct predictions for a class of interest (called the True Positives) divided by the total number of instances that are actually labelled as that class of interest (True Positives + False Negatives).   For example, if the test set contains 5 fresh examples and the algorithm only found 2, then the recall for the class fresh is 2/5.  Write a small method below that will calculate a class' recall.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the recall of that class (e.g. 50%).

In [84]:
# Recall = TP / (TP + FN) for the class of interest
# CANNOT USE ANY FUNCTIONS FROM LIBRARIES TO DIRECTLY GET THE RECALL
def recall(actualTags, predictions, classOfInterest):
    if len(actualTags) != len(predictions):
        raise ValueError('actualTags and predictions must have the same length')
    T_p = 0  # actually the class of interest and predicted as that class
    F_n = 0  # actually the class of interest but predicted as another class
    for i in range(len(actualTags)):
        if (actualTags[i] == classOfInterest) and (predictions[i] == classOfInterest):
            T_p += 1
        elif (actualTags[i] == classOfInterest) and (predictions[i] != classOfInterest):
            F_n += 1
    # Avoid a division by zero when the class never appears in actualTags
    recallValue = T_p / (T_p + F_n) if (T_p + F_n) > 0 else 0.0
    print('For the class', classOfInterest, 'the recall value is:', recallValue)
    return recallValue

**(TO DO) Q5 - 2 marks**   
Use the recall method to calculate the recall on the test set (both classes) for the Naive Bayes learner.  Print those recalls.   
Hint: You can test if recall() works correctly by testing with the provided example above

In [85]:
# Recall
fresh_recall = recall(test_tags, test_predicted, 'fresh')
rotten_recall = recall(test_tags, test_predicted, 'rotten')

For the class fresh the recall value is: 0.76
For the class rotten the recall value is: 0.753


**(TO DO) Q6 - 2 marks**   
Another common **Evaluation Measure** in Machine Learning is called **Precision**. Precision is the number of correct predictions for a class of interest (True Positives) divided by the total number of times that class of interest was predicted (True Positives + False Positives). For example, if the test set (ground truth) contains 3 fresh examples and 1 rotten example and the algorithm correctly labelled two of these as fresh, incorrectly labelled one of these as rotten, and incorrectly labelled one of these as fresh, then the Precision for the class fresh is 2/3.  Write a small method below that will calculate a class' precision.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the precision of that class (e.g. 50%).

In [81]:
# Precision = TP / (TP + FP) for the class of interest
# CANNOT USE ANY FUNCTIONS FROM LIBRARIES TO DIRECTLY GET THE PRECISION
def precision(actualTags, predictions, classOfInterest):
    if len(actualTags) != len(predictions):
        raise ValueError('actualTags and predictions must have the same length')
    T_p = 0  # predicted as the class of interest and actually that class
    F_p = 0  # predicted as the class of interest but actually another class
    for i in range(len(actualTags)):
        if (actualTags[i] == classOfInterest) and (predictions[i] == classOfInterest):
            T_p += 1
        elif (actualTags[i] != classOfInterest) and (predictions[i] == classOfInterest):
            F_p += 1
    # Avoid a division by zero when the class is never predicted
    precisionValue = T_p / (T_p + F_p) if (T_p + F_p) > 0 else 0.0
    print('For the class', classOfInterest, 'the precision value is:', precisionValue)
    return precisionValue

**(TO DO) Q7 - 2 marks**   
Use the precision method to calculate the precision on the test set (both classes) for the Naive Bayes learner.  Print those precision values.     
Hint: You can test if precision() works correctly by testing with the provided example above
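
The three-item example from the questions above, tags (fresh, rotten, fresh) against predictions (fresh, fresh, rotten), can be worked through by hand. The sketch below counts true positives, false positives and false negatives for the class fresh and derives both measures from them. The helper name `confusion_counts` is an assumption used for illustration only.

```python
def confusion_counts(actual_tags, predictions, class_of_interest):
    """Return (true positives, false positives, false negatives) for one class."""
    tp = fp = fn = 0
    for tag, pred in zip(actual_tags, predictions):
        if tag == class_of_interest and pred == class_of_interest:
            tp += 1   # correctly predicted as the class of interest
        elif tag != class_of_interest and pred == class_of_interest:
            fp += 1   # predicted as the class of interest, actually another class
        elif tag == class_of_interest and pred != class_of_interest:
            fn += 1   # actually the class of interest, predicted as another class
    return tp, fp, fn

example_tags = ("fresh", "rotten", "fresh")
example_predictions = ("fresh", "fresh", "rotten")

tp, fp, fn = confusion_counts(example_tags, example_predictions, "fresh")
example_recall = tp / (tp + fn)
example_precision = tp / (tp + fp)

print(tp, fp, fn)                         # 1 1 1
print(example_recall, example_precision)  # 0.5 0.5
```

Both values come out at 50%, matching the expected results stated in the question text.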

In [86]:
# Precision
fresh_precision = precision(test_tags, test_predicted, 'fresh')
rotten_precision = precision(test_tags, test_predicted, 'rotten')

For the class fresh the precision value is: 0.7547169811320755
For the class rotten the precision value is: 0.7583081570996979


#### 7. Discussion     
An important step in the Machine Learning experiment process is to analyze and discuss the results after an experiment. In this section you will be answering four different questions to provide insight into the results obtained from the above experiments.    

**(TO DO) Q8 (a) - 1 mark**      
Is the Naive Bayes approach performing better than the baseline approach, if so by how much?  

ANSWER HERE FOR Q8 (a)    
Baseline approach: 779 wrong predictions out of 2000 total predictions (1221 correct predictions), i.e. 61.05% correct.
Naive Bayes approach: 1513 testing instances correctly classified out of 2000 total instances, i.e. 75.65% correct.

Naive Bayes classifies 75.65% of the test reviews correctly, against 61.05% for the Baseline approach.

So yes: on the test set, the Naive Bayes approach outperforms the Baseline approach by 14.60 percentage points.

**(TO DO) Q8 (b) - 2 marks**      
If we used the training data on the Baseline approach, how would you theorize those results would compare to those from the test data (better, worse, maybe both)? Explain how the comparison of the train and test data predictions from the Baseline model may or may not (depending on your previous answer) compare to the results obtained from the Naive Bayes algorithm. 

ANSWER HERE FOR Q8 (b)    
incorrectReviews(train_reviews, train_tags): There are 7030 wrong predictions out of 18000 total predictions. The Baseline approach is therefore wrong on 39.06% of the training data (60.94% correct).

incorrectReviews(test_reviews, test_tags): There are 779 wrong predictions out of 2000 total predictions. The Baseline approach is therefore wrong on 38.95% of the test data (61.05% correct).

The Baseline error rate is about 0.1 percentage points higher on the training data than on the test data, so the Baseline approach does marginally worse on the training data than on the test data.

********************************************************************************
predicted = clf.predict(train_counts): Correctly classified 16184 total training examples out of 18000 examples. Naive Bayes is therefore correct on 89.91% of the training data.

predicted = clf.predict(test_counts): Correctly classified 1513 testing instances out of 2000 total instances. Naive Bayes is therefore correct on 75.65% of the test data.

Naive Bayes is 14.26 percentage points more accurate on the training data than on the test data, so it does better on the data it was trained on.

********************************************************************************
In conclusion, Naive Bayes performs much better (14.26 percentage points) on the training data than on the test data, because its probabilities were estimated from the training data. In contrast, the Baseline approach learns nothing from either set, so its results on the two sets are almost identical (about 0.1 percentage points better on the test data). On both sets, Naive Bayes remains more accurate than the Baseline approach.
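
The percentages quoted in this answer can be rechecked with a few lines of arithmetic. The sketch below uses only the counts quoted in the answer above; the variable names are illustrative assumptions.

```python
# (count, total) pairs quoted in the answer above
baseline_wrong = {"train": (7030, 18000), "test": (779, 2000)}
naive_bayes_correct = {"train": (16184, 18000), "test": (1513, 2000)}

# The baseline counts are errors, so accuracy is 1 minus the error rate
baseline_accuracy = {split: 1 - wrong / total
                     for split, (wrong, total) in baseline_wrong.items()}
nb_accuracy = {split: correct / total
               for split, (correct, total) in naive_bayes_correct.items()}

for split in ("train", "test"):
    print(f"{split}: baseline {baseline_accuracy[split]:.2%}, "
          f"Naive Bayes {nb_accuracy[split]:.2%}")

# Gap between training and test accuracy for each approach
nb_gap = nb_accuracy["train"] - nb_accuracy["test"]
baseline_gap = baseline_accuracy["train"] - baseline_accuracy["test"]
print(f"Naive Bayes train-test gap: {nb_gap:.2%}")
print(f"Baseline train-test gap: {baseline_gap:.2%}")
```

A large positive gap for Naive Bayes and a gap near zero for the baseline is what one would expect when only the first approach is fitted to the training data.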

**(TO DO) Q8 (c) - 1 mark**      
Present and discuss the overall results obtained through this notebook below (such as the precision and recall comparisons between movie tags).  

ANSWER HERE FOR Q8 (c)    
For the class fresh the recall value is: 0.76
For the class rotten the recall value is: 0.753
This means the recall for the fresh tag is 0.007 higher than for the rotten tag.

For the class fresh the precision value is: 0.7547169811320755
For the class rotten the precision value is: 0.7583081570996979
This means the precision for the fresh tag is 0.0036 lower than for the rotten tag.

*********************************************************************************
In conclusion, the fresh tag has a slightly higher recall (~0.007) and a slightly lower precision (~0.004) than the rotten tag. 

In general:
recall means: When the actual class is the class of interest, how often does the method predict it correctly?
precision means: When the method predicts the class of interest, how often is it correct?

So:
In this notebook, a review that is actually fresh is slightly more likely to be classified correctly than a review that is actually rotten, while a rotten prediction is slightly more likely to be correct than a fresh prediction.

**(TO DO) Q8 (d) - 1 mark**      
Give two suggestions to help the Naive Bayes approach within the context of our experiment of polarity detection for movie reviews.

ANSWER HERE FOR Q8 (d)    
1- Instead of only the two classes fresh and rotten, define an in-between "neutral" class to allow better judgement of borderline reviews
2- Use a scalar rating alongside the class label to rank reviews and refine the judgement
3- Use more features for classification rather than just positive and negative words 
4- Add a step that removes redundant features/data from the dataset 

**Optional - No marks** 
For your own interest, create a local copy of the notebook and redo the questions using the entire dataset with the same train, test split. Also, you can play around with different algorithms to get a better understanding of the sklearn library. We will be exploring some other algorithms within the next two notebooks.

***SIGNATURE:***
My name is Fatemeh Soltani.
I certify being the author of this assignment.