# Notebook 3 - Supervised Learning

CSI4106 Artificial Intelligence  
Fall 2019  
Prepared by Caroline Barrière and Julian Templeton

***INTRODUCTION***:

The supervised classification task tackled in this notebook is **polarity detection**, which is one possible activity within the quite popular trend of *Opinion Mining* in AI.  Many companies want to know whether there are positive or negative reviews about them.  Reviews can be on hotels, restaurants, movies, customer service of any kind, etc.

This notebook will help you better understand an ***experimental set-up*** for supervised machine learning, including the notions of training set, test set, evaluation, and bias.  The notebook also introduces the notion of comparative evaluation.  To say whether a method is good or not, we often compare it to a *baseline* approach.  

This notebook makes use of a popular machine learning package called **scikit-learn** (http://scikit-learn.org/stable/).  It contains many pre-coded machine learning algorithms which you can call.  To use this package, you must install it. You will also need to install **Pandas**, which is a great tool for manipulating data to use in Machine Learning algorithms.  At the command prompt, type ***pip install scikit-learn*** and ***pip install pandas*** to install the packages.  

In this notebook we will use the Naive Bayes implementation and the Support Vector Machine (SVM) implementation for polarity detection of a large movie review dataset, but we will explore other ML algorithms included in scikit-learn in future notebooks.  

You will need to download the movie review dataset from the following shared Google Drive:
https://drive.google.com/file/d/1w1TsJB-gmIkZ28d1j7sf1sqcPmHXw352/view

This is a dataset of reviews from Rotten Tomatoes along with the Freshness of the review (Fresh or Rotten). We will be using this dataset throughout the notebook so be sure to place it in the same directory as this notebook. It contains 480000 reviews with half of them being rotten and the other half being fresh. We will only use a subset of these due to the large computation time of the Baseline and SVM learners.


***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, Sign the notebook (at the end of the notebook), and submit it.  

*The notebook will be marked out of 20.  
Each **(TO DO)** has a number of points associated with it.*
***

**1. Polarity detection**  

In polarity detection, we use two classes: positive and negative.  This is different from sentiment analysis, for example, in which the classes might be (sad, happy, anxious, angry, etc).  It's also more restricted than *rating*, in which we would like to assign a value (0..5) to evaluate a particular service.  So, the polarity detection task aims to assign either *positive* or *negative* to a statement.

**2. Application domain:  Movie reviews**  

Polarity detection could be used on reviews of anything.  In this notebook, we wish to apply polarity detection within the domain of movies.  Movie reviewers accompany each review with a score for the movie. Rotten Tomatoes is a website that collects movie reviews and the accompanying ratings, where a rating can be classified as "Rotten" for a low review score or "Fresh" for a higher review score. We will perform polarity detection on the dataset *rt_reviews.csv* that you downloaded earlier.

The first thing to do is to set up the training and testing sets for our models. We will build these sets by importing the data from the dataset using pandas, then passing that dataframe to scikit-learn's train_test_split function, which separates the data into a training set and a test set. These will be used later on by the models that will be created/used.

We **SHOULD NOT** use this test set to build our model later on. The test set (unseen data) is to test the model after we train it with the training set.

In [None]:
# Import the libraries that we will use to help create the train and test sets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Import the dataset, need to use the ISO-8859-1 encoding due to some invalid UTF-8 characters
df = pd.read_csv("rt_reviews.csv", encoding="ISO-8859-1")

The first step after loading the data is to take a quick look at it. Pandas offers the two useful functions df.head() and df.tail() which allow you to visualize the top and the bottom of your data frame.

In [None]:
df.head(5) # Show the first five reviews of the dataset to understand the dataframe's structure
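
Beyond looking at the first rows, it can help to check how many reviews each class has. The sketch below does this on a tiny hand-made dataframe: the column names `Freshness` and `Review` match the ones used in this notebook, but the rows and the name `toy_df` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for rt_reviews.csv (rows are made up for illustration)
toy_df = pd.DataFrame({
    "Freshness": ["fresh", "rotten", "fresh", "rotten"],
    "Review": ["a delight", "a mess", "smart and funny", "dull and overlong"],
})

# Number of reviews per class
class_counts = toy_df["Freshness"].value_counts()
print(class_counts)

# Proportion of reviews per class (the class priors of the toy data)
class_proportions = toy_df["Freshness"].value_counts(normalize=True)
print(class_proportions)
```

The same two calls could be applied to `df["Freshness"]` to confirm the class balance of the full dataset.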

In [None]:
# Randomly select 10000 fresh examples from the dataframe
dfFresh = df[df["Freshness"] == "fresh"].sample(n=10000, random_state=5)
# Randomly select 10000 rotten examples from the dataframe
dfRotten = df[df["Freshness"] == "rotten"].sample(n=10000, random_state=3)
# Combine the results to make a small random subset of reviews to use
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
dfPartial = pd.concat([dfFresh, dfRotten])

In [None]:
# Split the data such that 90% is used for training and 10% is used for testing (separating the review
# from the freshness scores that we will use as the labels)
# Recall that we do not use this test set when building the model, only the training set
# We use the parameter stratify so that the training and testing sets keep the same
# class proportions as dfPartial (half fresh, half rotten), giving balanced sets
train_reviews, test_reviews, train_tags, test_tags = train_test_split(dfPartial["Review"],
                                                                      dfPartial["Freshness"],
                                                                      test_size=0.1, 
                                                                      random_state=10,
                                                                      stratify=dfPartial["Freshness"])
train_tags = train_tags.to_numpy()
train_reviews = train_reviews.to_numpy()
# Testing set (what we will use to test the trained model)
test_tags = test_tags.to_numpy()
test_reviews = test_reviews.to_numpy()
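
To see what `stratify` does, here is a minimal sketch on toy data with the same 90/10 split. The review texts are placeholders and the `toy_` names are hypothetical; only `train_test_split` and its parameters come from the cell above.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 10 fresh and 10 rotten reviews (texts are placeholders)
toy_reviews = ["review %d" % i for i in range(20)]
toy_tags = ["fresh"] * 10 + ["rotten"] * 10

toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_reviews, toy_tags, test_size=0.1, random_state=10, stratify=toy_tags)

# With stratify, both splits keep the 50/50 class proportion of the input
print(Counter(toy_train_y))
print(Counter(toy_test_y))
```

Without `stratify`, the two test examples could both come from the same class, which would make the test set unrepresentative.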

**3. Bias:  Available resources**  

For polarity detection, some researchers have established lists of positive and negative words.  The ones used in this notebook have been downloaded from [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (a website on Opinion Mining by renowned researcher Bing Liu) and stored locally.  The files *positive-words.txt* and *negative-words.txt* are in the Jupyter Notebook module in Brightspace.  Make sure you place these files in the same directory as your notebook.

As discussed in class, using any external resource introduces a form of *bias* into the study of a problem, although in this particular case the lists themselves were compiled from data by other researchers.

In [None]:
# Read the positive words
# to fix encoding problems, you might need to replace the line below
# with open("positive-words.txt", encoding = "ISO-8859-1") as f: 

with open("positive-words.txt") as f:
    posWords = f.readlines()
posWords = [p[0:len(p)-1] for p in posWords if p[0].isalpha()] 

# print the first 50 words
print(posWords[:50])

In [None]:
# Read the negative words
# the ISO-8859-1 encoding is specified explicitly here to avoid encoding problems

with open("negative-words.txt", encoding = "ISO-8859-1") as f:
    negWords = f.readlines()
negWords = [p[0:len(p)-1] for p in negWords if p[0].isalpha()] 

print(negWords[:50])

**4. Baseline approach**  

Before we evaluate the performances of a supervised learning approach, we can start by establishing a very simple baseline approach.  It's always good to start simple.  A baseline allows us to measure whether the additional complexity of the various models we develop is worth it or not.

The *baseline algorithm* we will use simply counts the number of positive and negative words in the review and outputs the category corresponding to the maximum.  This approach DOES NOT LEARN anything.  It just uses a particular *reasoning* (strategy at test time).  You might be surprised to find out how many *AI start-ups* within the area of Opinion Mining, do use this kind of simple approach.  

In [None]:
# First let's define methods to count positive and negative words

def countPos(text):
    count = 0
    for t in text.split():
        if t in posWords:
            count += 1
    return count

def countNeg(text):
    count = 0
    for t in text.split():
        if t in negWords:
            count += 1
    return count

In [None]:
# Simple counting algorithm as baseline approach to polarity detection
def baselinePolarity(review):
    numPos = countPos(review)
    numNeg = countNeg(review)
    if numPos > numNeg:
        return "fresh"   
    else:
        return "rotten"   
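
The counting functions above test membership in Python lists, which scans the whole list for every token, and they compare raw tokens, so a token such as "Great," (capitalized, with attached punctuation) would not match "great". A possible variant is sketched here using sets and a light normalization. The toy lexicons and the names `normalize_tokens` and `baseline_polarity_sets` are hypothetical, and this variant would not reproduce the exact numbers obtained with `baselinePolarity`.

```python
import string

# Hypothetical toy lexicons (the notebook uses posWords and negWords instead)
toy_pos_words = {"great", "fun", "charming"}
toy_neg_words = {"boring", "awful", "mess"}

def normalize_tokens(text):
    # Lowercase each token and strip surrounding punctuation
    return [t.lower().strip(string.punctuation) for t in text.split()]

def baseline_polarity_sets(review, pos_words=toy_pos_words, neg_words=toy_neg_words):
    tokens = normalize_tokens(review)
    num_pos = sum(1 for t in tokens if t in pos_words)   # set lookup, constant time on average
    num_neg = sum(1 for t in tokens if t in neg_words)
    # Same decision rule as baselinePolarity: ties go to "rotten"
    return "fresh" if num_pos > num_neg else "rotten"

first_result = baseline_polarity_sets("Great fun, if a little boring.")
second_result = baseline_polarity_sets("An awful mess.")
print(first_result, second_result)
```

With the real word lists, `set(posWords)` and `set(negWords)` could be built once and passed in as `pos_words` and `neg_words`.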

In [None]:
# Test the baseline method
print("Testing baselinePolarity with the review:", train_reviews[0])
print("baselinePolarity result:", baselinePolarity(train_reviews[0]))
print("Actual result:", train_tags[0])
print(" ")
print("Testing baselinePolarity with the review:", train_reviews[1])
print("baselinePolarity result:", baselinePolarity(train_reviews[1]))
print("Actual result:", train_tags[1])

**5. Evaluation of the Baseline Approach**  
We saw in class that there could be multiple ways of evaluating an algorithm.  In the case of classification, a common evaluation method is simply to count the *number of wrong choices*.

To test our *baseline algorithm* we use the test set, defined earlier and calculate the number of wrong assignments.

In [None]:
# Function takes a one dimensional array of reviews and a one dimensional array of
# tags as input and prints the number of incorrect assignments when running the baseline approach
# on the reviews.
# Let's establish the polarity for each review
def incorrectReviews(reviews, tags):
    nbWrong = 0
    count = 0
    for i in range(len(reviews)):
        polarity = baselinePolarity(reviews[i])
        if (count < 10):
            print(reviews[i] + " -- Prediction: " + polarity + ". Actually: " + tags[i] + " \n")
            count += 1
        if (polarity != tags[i]):
            nbWrong += 1

    print('There are %s wrong predictions out of %s total predictions' %(nbWrong, len(tags)))    

In [None]:
# This may take a minute to run
incorrectReviews(test_reviews, test_tags)

**(TO DO) Q1 - 1 marks**  
Look at the ten outputs above which provide predictions from the Baseline approach for specified reviews along with their actual review class.
From the output, give the prior probabilities (no code needed) for each class based on the output given by the Baseline approach and based on the actual review class.

***Answer here***  
For the Baseline predictions:
P(fresh) = 4/10
P(rotten) = 6/10

For the actual outputs:
P(fresh) = 4/10
P(rotten) = 6/10

#### 6. Supervised learning method

We will now train a supervised learning model for polarity detection.

***6.1 Training data***  

In supervised learning, we need training data.  This training data must be *different* from but *representative* of the eventual test data. At the beginning of the notebook we defined the training data and the test data to be a subset of the entire dataset (20000 total rows from the 480000). We did this due to the large computation time of the Baseline Approach and the SVM approach that we will use later in this notebook. In reality we would want to use the entire dataset and ensure that we have trained our models with a large enough training set. This would make it more likely that, when predicting on unseen data, the model has already seen examples similar to most of those it will ever be asked to classify.

Usually a training set should be as large and varied as possible.  Training sets are very valuable, but they are costly to obtain, as they require tagging (human annotation) to generate them. Once again, the training set is used to train the model and the testing set is used to test how well the trained model performs on unseen examples.

In [None]:
# Looking at the shapes of the train and test datasets that we will be using
print(train_reviews.shape) # 90% train
print(test_reviews.shape) #10% test

***6.2 Pre-processing of input data*** 

This Machine Learning package, *scikit-learn*, is somewhat particular in the way the data must be formatted to be used by the training algorithms.  So, we must perform some preprocessing on the sentences above.  Luckily *scikit-learn* provides some pre-defined functions for doing text pre-processing.  

We easily transform each sentence into a list of indexes into a dictionary.  The dictionary is built from the words in the sentences.  The keys of the dictionary are the words, and the value is an index.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# The CountVectorizer builds a dictionary of all words (count_vect.vocabulary_), 
# and generates a matrix (train_counts), to represent each sentence
# as a set of indices into the dictionary. The words in the dictionary are the words found in train_reviews.

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_reviews)

To understand what the code above does, first let's print the vocabulary gathered from the sentences in train_reviews.  

In [None]:
# print the vocabulary (dictionary of words)
print(count_vect.vocabulary_)

For example, you can interpret the output above as: 

'again':690  to mean that the word 'again' has been assigned index 690  
'poses':18445 to mean that the word 'poses' has been assigned index 18445

Then, let's print the *train_counts*.  

In [None]:
# print the content of the training examples in terms of frequency of words (each word represented by its index)
print(train_counts)

You can interpret each line above as:  

(0, 10829) 1  -- sentence 0 (in train_reviews) has 1 instance(s) of word 10829 (index of the word in count_vect.vocabulary_, that is the word 'gunk')  
(17999, 24372) 2  -- sentence 17999 (in train_reviews) has 2 instance(s) of word 24372 (index of the word in count_vect.vocabulary_, that is the word 'the')  

So train_counts contains, for each sentence, the bag of words (BOW) associated with that sentence, in the form of a list of indexes (each index corresponding to a word) with their counts.
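
The same steps can be followed on two toy sentences, where the whole vocabulary and matrix fit on screen. The sentences and the `toy_` names are invented for illustration; the `CountVectorizer` calls are the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_sentences = ["the movie was great", "the plot was the problem"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_sentences)

# Words are mapped to column indices (assigned in alphabetical order)
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))

# Sparse view: (sentence index, word index) count
print(toy_counts)

# Dense view: one row per sentence, one column per vocabulary word
toy_dense = toy_counts.toarray()
print(toy_dense)
```

In the dense view, the second row has a 2 in the column for 'the', since that word appears twice in the second sentence.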

***6.3 Naive Bayes learning***

With the data preprocessed, we are ready to test the Naive Bayes algorithm provided by scikit-learn.  That algorithm requires the training data to be represented in terms of *train counts*, which is why we did the pre-processing above.

It's as easy as performing *fit*, as you see below, to train the model.  But you know what's underneath!  It estimates prior probabilities for the classes (fresh, rotten) and conditional probabilities (likelihoods) of words (features) given each class, e.g. P(awful|fresh) or P(awful|rotten).  All these probabilities are combined using Bayes' Theorem to obtain the posterior probability of each class given a review.  

**(TO DO) Q2 - 2 marks**  
Before training the model, what are the prior probabilities of the fresh and rotten classes using the training set above?

In [None]:
# Find the prior probabilities for the fresh and rotten classes in the train set (train_tags) and the test set (test_tags)
# that we will be using.
# You must calculate it from the train and test sets, then print the calculated result
train_fresh = np.count_nonzero(train_tags == "fresh")
train_rotten = np.count_nonzero(train_tags == "rotten")
test_fresh = np.count_nonzero(test_tags == "fresh")
test_rotten = np.count_nonzero(test_tags == "rotten")
# Print the prior probabilities as: <TRAIN_OR_TEST>: P(class) = value
print("Train: P(rotten) = %s/%s = %s" % (train_rotten, len(train_tags), train_rotten / len(train_tags)))
print("Train: P(fresh) = %s/%s = %s" % (train_fresh, len(train_tags), train_fresh / len(train_tags)))
print("Test: P(rotten) = %s/%s = %s" % (test_rotten, len(test_tags), test_rotten / len(test_tags)))
print("Test: P(fresh) = %s/%s = %s" % (test_fresh, len(test_tags), test_fresh / len(test_tags)))

In [None]:
# Test of a naive bayes algorithm, the "fit" is the training
from sklearn.naive_bayes import MultinomialNB

# Training the model
clf = MultinomialNB().fit(train_counts, train_tags)   
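
To make "what's underneath" concrete, here is a small hand-computed multinomial Naive Bayes on toy data. It is only a sketch of the idea, not scikit-learn's implementation (which works with log-probabilities); the training sentences and all names are invented, and the add-one smoothing mirrors the default `alpha=1.0` of `MultinomialNB`.

```python
from collections import Counter

# Toy training data (invented for illustration)
toy_train = [
    ("great fun", "fresh"),
    ("great great film", "fresh"),
    ("awful film", "rotten"),
    ("awful boring", "rotten"),
]

classes = sorted({tag for _, tag in toy_train})
vocabulary = sorted({w for text, _ in toy_train for w in text.split()})

# Prior P(class): fraction of training reviews in each class
priors = {c: sum(1 for _, tag in toy_train if tag == c) / len(toy_train) for c in classes}

# Word counts per class
word_counts = {c: Counter() for c in classes}
for text, tag in toy_train:
    word_counts[tag].update(text.split())

def likelihood(word, c, alpha=1.0):
    # P(word | class) with add-alpha (Laplace) smoothing
    total = sum(word_counts[c].values())
    return (word_counts[c][word] + alpha) / (total + alpha * len(vocabulary))

def score(text, c):
    # Unnormalized posterior: P(class) * product of P(word | class)
    result = priors[c]
    for w in text.split():
        if w in vocabulary:          # words never seen in training are ignored
            result *= likelihood(w, c)
    return result

scores = {c: score("great film", c) for c in classes}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Here "great" appears 3 times among the 5 fresh words and the vocabulary has 5 words, so P(great|fresh) = (3+1)/(5+5) = 0.4, whereas P(great|rotten) = (0+1)/(4+5) = 1/9. The review "great film" is therefore scored higher for the fresh class.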

***6.4 Evaluation of Naive Bayes***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [None]:
# Testing on training set
predicted = clf.predict(train_counts)
# Print the first ten predictions
for doc, category in zip(train_reviews[:10], predicted[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
correct = 0
for tag, pred in zip(train_tags, predicted):   # zip allows to go through two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total training examples out of %s examples" %(correct, train_tags.size))

Unsurprisingly, on the training set we get most of the examples correct....  But we should test on a real **test set**, namely test_reviews and test_tags.

**(TO DO) Q3 - 2 marks**  
Test the trained model on the test set.  Write the code below to do so.  Before testing, each test set must be transformed through the preprocessing steps, so their format is compatible with the learner.

In [None]:
# Pre-process test set test_reviews
# Note, we use transform and NOT fit_transform since we do not want to re-fit the vectorizer
# that we used to train the model
test_reviews_counts = count_vect.transform(test_reviews)
# Predict the results
predict_test = clf.predict(test_reviews_counts)
# Print the first ten predictions
for doc, category in zip(test_reviews[:10], predict_test[:10]):   
    print('%r => %s\n' % (doc, category))
# Print the total correctly classified instances out of the total instances
correct = 0   # reset the counter, which still holds the count from the training set
for tag, pred in zip(test_tags, predict_test):   
    if (tag == pred):
        correct += 1
print("Correctly classified %s total testing examples out of %s examples" %(correct, test_tags.size))
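
The difference between `transform` and `fit_transform` can be seen on a toy example. The sentences and `toy_` names below are invented for illustration; the point is that the test matrix must have the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_reviews = ["a great film", "an awful film"]
toy_test_reviews = ["a great surprise"]

toy_vect = CountVectorizer()
toy_train_counts = toy_vect.fit_transform(toy_train_reviews)   # learns the vocabulary
toy_test_counts = toy_vect.transform(toy_test_reviews)         # reuses that vocabulary

# Both matrices have one column per word of the TRAINING vocabulary
print(toy_vect.vocabulary_)
print(toy_train_counts.shape, toy_test_counts.shape)

# "surprise" was never seen in training, so it is silently dropped
toy_test_dense = toy_test_counts.toarray()
print(toy_test_dense)
```

Calling `fit_transform` on the test reviews instead would build a new vocabulary with different column indices, and the trained classifier would then interpret the columns incorrectly (or reject the input because the number of features differs).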

***6.5 Support Vector Machine (SVM) learning***

Now that we have tested the Naive Bayes Classifier, we are ready to test the SVM algorithm provided by scikit-learn.  This notebook will not explain all of the parameters and inner workings of an SVM classifier. However, below are links to the official documentation of scikit-learn's implementations, along with a good article explaining SVMs.

https://scikit-learn.org/stable/modules/svm.html
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72

In [None]:
# Test of a SVM, the "fit" is the training
from sklearn import svm

# Training the model with an SVM using a linear kernel (less computationally intensive)
# For the purpose of this Notebook we will stick to this simple model and stop after 3500
# iterations to save time (would take much longer otherwise, feel free to duplicate the notebook
# and test different parameters for yourself to see how much better it does!)
# This will take several minutes to run on less powerful machines, so be patient!
clf_svm = svm.SVC(kernel="linear", random_state=0, max_iter=3500).fit(train_counts, train_tags)   
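
If the training time of the cell above is a concern, one option worth knowing about is scikit-learn's `LinearSVC`, which is restricted to linear decision functions and generally scales better to many samples than `SVC(kernel="linear")`. The sketch below uses invented toy data and is not a drop-in replacement: `LinearSVC` uses a different solver and default loss, so its results may differ from those of `clf_svm`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration
toy_reviews = ["a great film", "great fun", "an awful film", "awful and boring"]
toy_tags = ["fresh", "fresh", "rotten", "rotten"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_reviews)

# LinearSVC only supports a linear decision function
toy_svm = LinearSVC(random_state=0).fit(toy_counts, toy_tags)

toy_pred = toy_svm.predict(toy_vect.transform(["great great fun"]))
print(toy_pred)
```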

***6.6 Evaluation of SVM***

Let's first look at how the model performs on the training set, on which it learned.  To apply the model for classification (prediction), we use the *predict* method below.

In [None]:
# May take a few minutes to run on weaker machines
# Testing on training set
predicted_svm = clf_svm.predict(train_counts)
# Print the first ten predictions
for doc, category in zip(train_reviews[:10], predicted_svm[:10]):   # zip allows to go through two lists simultaneously
    print('%r => %s\n' % (doc, category))
correct = 0
for tag, pred in zip(train_tags, predicted_svm):   # zip allows to go through two lists simultaneously
    if (tag == pred):
        correct += 1
print("Correctly classified %s total training examples out of %s examples" %(correct, train_tags.size))

We can see that this model does not perform amazingly well on the training data. This is due to the parameter choice, notably the limit of 3500 iterations, which stops training early.  But we should test on a real **test set**, namely test_reviews and test_tags.

**(TO DO) Q4 - 2 marks**  
Test the trained SVM model on the test set.  Write the code below to do so.  Before testing, each test set must be transformed through the preprocessing steps, so their format is compatible with the learner (can just repeat what you did above for this).

In [None]:
# Pre-process test set test_reviews
# Note, we use transform and NOT fit_transform since we do not want to re-fit the vectorizer
# that we used to train the model
test_reviews_counts = count_vect.transform(test_reviews)
# Predict the results with the SVM
test_predicted_svm = clf_svm.predict(test_reviews_counts)
# Print the first ten predictions
for doc, category in zip(test_reviews[:10], test_predicted_svm[:10]):   
    print('%r => %s\n' % (doc, category))
# Print the total correctly classified instances out of the total instances
correct = 0
for tag, pred in zip(test_tags, test_predicted_svm):   
    if (tag == pred):
        correct += 1
print("Correctly classified %s total testing examples out of %s examples" %(correct, test_tags.size))

***6.7 More Evaluation!***


**(TO DO) Q5 - 2 marks**   
A common **Evaluation Measure** in Machine Learning is **Recall**. Recall is the number of correct predictions for a class of interest (called the True Positives) divided by the total number of instances that are actually labelled as that class of interest (True Positives + False Negatives).   For example, if the test set contains 5 fresh examples and the algorithm only found 2, then the recall for the class fresh is 2/5.  Write a small method below that will calculate a class' recall.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the recall of that class (e.g. 50%).

In [None]:
# Count the true/false positives and negatives for one class of interest
# CANNOT USE ANY FUNCTIONS FROM LIBRARIES TO DIRECTLY GET THE RECALL
def tp_fp_fn_fp(actualTags, predictions, classOfInterest):
    truePositive = trueNegative = falsePositive = falseNegative = 0
    confusionMatrix = {}
    for actual, predicted in zip(actualTags, predictions):
        if actual == classOfInterest and predicted == classOfInterest:
            truePositive += 1    # class of interest, correctly predicted
        elif actual == classOfInterest:
            falseNegative += 1   # class of interest, but predicted as another class
        elif predicted == classOfInterest:
            falsePositive += 1   # another class, but predicted as the class of interest
        else:
            trueNegative += 1    # neither the actual tag nor the prediction is the class of interest
    confusionMatrix["tp"] = truePositive
    confusionMatrix["tn"] = trueNegative
    confusionMatrix["fp"] = falsePositive
    confusionMatrix["fn"] = falseNegative
    return confusionMatrix
def recall(actualTags, predictions, classOfInterest):
    # recall = tp / (tp + fn) = 1/(1+1) = 1/2 = 50% for the example above
    confusion_matrix = tp_fp_fn_fp(actualTags, predictions, classOfInterest)
    recall = confusion_matrix["tp"] / (confusion_matrix["tp"] + confusion_matrix["fn"]) 
    return ("{0:.1%}".format(recall))
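
As a check of the definition, the example from the question can be worked through step by step. The sketch below is a separate, self-contained helper (the name `recall_value` is hypothetical) that returns a number rather than a formatted string, which makes the result easier to compare or reuse in further calculations.

```python
def recall_value(actual_tags, predictions, class_of_interest):
    # True positives: instances of the class that were predicted as that class
    tp = sum(1 for a, p in zip(actual_tags, predictions)
             if a == class_of_interest and p == class_of_interest)
    # False negatives: instances of the class that were predicted as something else
    fn = sum(1 for a, p in zip(actual_tags, predictions)
             if a == class_of_interest and p != class_of_interest)
    if tp + fn == 0:
        return None   # the class never occurs in actual_tags, so recall is undefined
    return tp / (tp + fn)

example_tags = ("fresh", "rotten", "fresh")
example_predictions = ("fresh", "fresh", "rotten")

fresh_recall = recall_value(example_tags, example_predictions, "fresh")
rotten_recall = recall_value(example_tags, example_predictions, "rotten")
print(fresh_recall, rotten_recall)
```

For the class fresh there are two actual fresh reviews, of which one was found (tp = 1) and one was missed (fn = 1), giving 1/2. The single rotten review was predicted as fresh, so the recall for rotten is 0.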
    
    
            
    

**(TO DO) Q6 - 2 marks**   
Use the recall method to calculate the recall on the test set (both classes) for the Naive Bayes and SVM learners.  Print those recalls.   
Hint: You can test if recall() works correctly by testing with the provided example above

In [None]:
# Recall
print("Test set for Naive Bayes(fresh): \nActual Tags: {},\nPredictions: {},\nRecall: {}".format(test_tags, predict_test, recall(test_tags, predict_test, "fresh")))
print("Test set for Naive Bayes(rotten): \nActual Tags: {},\nPredictions: {},\nRecall: {}".format(test_tags, predict_test, recall(test_tags, predict_test, "rotten")))
print("Test set for SVM Learners(fresh): \nActual Tags: {}, \nPredictions: {}, \nRecall: {}".format(test_tags, test_predicted_svm, recall(test_tags, test_predicted_svm, "fresh")))
print("Test set for SVM Learners(rotten): \nActual Tags: {}, \nPredictions: {}, \nRecall: {}".format(test_tags, test_predicted_svm, recall(test_tags, test_predicted_svm, "rotten")))

**(TO DO) Q7 - 2 marks**   
Another common **Evaluation Measure** in Machine Learning is called **Precision**. Precision is the number of correct predictions for a class of interest (True Positives) divided by the total number of times that class of interest was predicted (True Positives + False Positives). For example, if the test set (ground truth) contains 3 fresh examples and 1 rotten example and the algorithm correctly labelled two of these as fresh, incorrectly labelled one of these as rotten, and incorrectly labelled one of these as fresh, then the Precision for the class fresh is 2/3.  Write a small method below that will calculate a class' precision.  It will receive three parameters: 
1. The set of correct tags (e.g. (fresh, rotten, fresh)), 
2. The predictions (e.g (fresh, fresh, rotten)), and
3. The class of interest (e.g. fresh).  It will return the precision of that class (e.g. 50%).

In [None]:
# CANNOT USE ANY FUNCTIONS FROM LIBRARIES TO DIRECTLY GET THE PRECISION
def precision(actualTags, predictions, classOfInterest):
    # precision = tp / (tp + fp) = 1/ (1+1) = 1/2 = 50% for the example above
    confusion_matrix = tp_fp_fn_fp(actualTags, predictions, classOfInterest)
    precision = confusion_matrix["tp"] / (confusion_matrix["tp"] + confusion_matrix["fp"]) 
    return ("{0:.1%}".format(precision))
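
Both examples given in the question can be verified with a small self-contained helper. The name `precision_value` is hypothetical, and the tuples simply encode the two examples from the text.

```python
def precision_value(actual_tags, predictions, class_of_interest):
    # True positives: predicted as the class and actually of that class
    tp = sum(1 for a, p in zip(actual_tags, predictions)
             if p == class_of_interest and a == class_of_interest)
    # False positives: predicted as the class but actually of another class
    fp = sum(1 for a, p in zip(actual_tags, predictions)
             if p == class_of_interest and a != class_of_interest)
    if tp + fp == 0:
        return None   # the class was never predicted, so precision is undefined
    return tp / (tp + fp)

# First example: 3 fresh and 1 rotten in the ground truth; two fresh found,
# one fresh labelled rotten, and the rotten one labelled fresh
truth_a = ("fresh", "fresh", "fresh", "rotten")
preds_a = ("fresh", "fresh", "rotten", "fresh")
precision_a = precision_value(truth_a, preds_a, "fresh")

# Second example: the three-element lists from the parameter description
truth_b = ("fresh", "rotten", "fresh")
preds_b = ("fresh", "fresh", "rotten")
precision_b = precision_value(truth_b, preds_b, "fresh")

print(precision_a, precision_b)
```

In the first example fresh was predicted three times and two of those predictions were right, giving 2/3; in the second it was predicted twice with one correct, giving 1/2.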


**(TO DO) Q8 - 2 marks**   
Use the precision method to calculate the precision on the test set (both classes) for the Naive Bayes and SVM learners.  Print those precision values.

In [None]:
# Precision
print("Test set for Naive Bayes(fresh): \nActual Tags: {},\nPredictions: {},\nPrecision: {}".format(test_tags, predict_test, precision(test_tags, predict_test, "fresh")))
print("Test set for Naive Bayes(rotten): \nActual Tags: {},\nPredictions: {},\nPrecision: {}".format(test_tags, predict_test, precision(test_tags, predict_test, "rotten")))
print("Test set for SVM Learners(fresh): \nActual Tags: {}, \nPredictions: {}, \nPrecision: {}".format(test_tags, test_predicted_svm, precision(test_tags, test_predicted_svm, "fresh")))
print("Test set for SVM Learners(rotten): \nActual Tags: {}, \nPredictions: {}, \nPrecision: {}".format(test_tags, test_predicted_svm, precision(test_tags, test_predicted_svm, "rotten")))

#### 7. Discussion

**(TO DO) Q9 - 5 marks**  
<b>Are the Naive Bayes and SVM approaches performing better than the baseline approach, if so by how much? </b>
Yes, both are performing better than the baseline approach on the test set. The baseline approach classified 1240 out of 2000 correctly (62.0%). The Naive Bayes approach classified 1525 out of 2000 correctly (76.25%), which is 14.25 percentage points better than the baseline. The SVM approach classified 1285 out of 2000 correctly (64.25%), which is only 2.25 percentage points better. 

<b>How do the precision and recall values for the "rotten" and "fresh" classes from the Naive Bayes approach compare to those from the SVM approach?  </b>

As seen from the results of the questions above, the Naive Bayes approach achieves better precision and recall than the SVM approach. 

<b>If we used the training data on the Baseline approach, how would you theorize those results would compare to those from the test data (better, worse, maybe both)? Explain why the comparison of the train and test data predictions from the Baseline model may or may not (depending on your previous answer) resemble the comparison of the train and test predictions from the Naive Bayes and SVM learners.  </b>

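It would perform more or less the same due to the way that the Baseline approach is set up. The Baseline approach does not learn anything from the training data, so the training reviews are just as unseen to it as the test reviews. Naive Bayes and SVM, in contrast, are fitted on the training set and therefore tend to score higher on it than on the test set. 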

<b>Present and discuss the overall results below (including the precision and recall comparisons). </b>

On the training data, the baseline made 6989 wrong predictions out of 18000, i.e. 11011 correctly classified (about 61.2%). Naive Bayes correctly classified 16250 of the 18000 training examples (about 90.3%). The SVM model correctly classified 13012 of the 18000 training examples (about 72.3%).
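
The counts reported in the answers above can be turned into accuracies and train/test gaps with a few lines of arithmetic. The sketch below only reuses those reported counts; the dictionary layout and names are illustrative.

```python
# (correct, total) counts as reported in the answers above
reported = {
    ("Baseline", "train"): (11011, 18000),
    ("Baseline", "test"): (1240, 2000),
    ("Naive Bayes", "train"): (16250, 18000),
    ("Naive Bayes", "test"): (1525, 2000),
    ("SVM", "train"): (13012, 18000),
    ("SVM", "test"): (1285, 2000),
}

accuracy = {key: correct / total for key, (correct, total) in reported.items()}

for (model, split), value in accuracy.items():
    print("%-12s %-5s accuracy = %.1f%%" % (model, split, 100 * value))

# Gap between training and test accuracy for each model
gaps = {}
for model in ("Baseline", "Naive Bayes", "SVM"):
    gaps[model] = accuracy[(model, "train")] - accuracy[(model, "test")]
    print("%-12s train - test = %+.1f points" % (model, 100 * gaps[model]))
```

The baseline's gap is close to zero, whereas the two learners score noticeably higher on the training set than on the test set.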


<b>Give two suggestions (each) to help the Naive Bayes approach and the SVM approach within the context of our experiment of polarity detection for movie reviews.</b>

Naive Bayes: Both negative and positive words can appear in the same review, and looking only at word frequencies loses context information. In <b>binarized (boolean) multinomial naive bayes</b>, the occurrence of a word is more important than its frequency. For example, the fact that a word like "fantastic" occurs at all is more important (more telling) than the frequency of that word. 
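
One way to approximate the binarized variant with scikit-learn, assuming the rest of the pipeline stays unchanged, is to pass `binary=True` to `CountVectorizer` so that every non-zero count becomes 1. The toy reviews below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["fantastic fantastic fantastic film", "a fantastic film"]

# Default behaviour: word frequencies
count_features = CountVectorizer().fit_transform(toy_reviews).toarray()
# binary=True: presence / absence only
binary_features = CountVectorizer(binary=True).fit_transform(toy_reviews).toarray()

print(count_features)
print(binary_features)
```

A `MultinomialNB` model could then be fitted on the binary features in the same way as on the counts; whether this helps on the movie reviews would have to be checked on the test set.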

Additionally, a bigger data set works better for Naive Bayes, as the estimated probabilities for each class (in this case only two classes) become more reliable. More features with a larger training set should lead to better results for Naive Bayes. 

SVM: Choosing a different kernel (as this affects the outcome) may yield better results. 
If the training set is unbalanced towards a certain class, then reweighting the instances of each class may yield better results. 
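
Both SVM suggestions correspond to parameters of `svm.SVC`. The following is a sketch on deliberately unbalanced toy data (invented for illustration); the choice of the `rbf` kernel here is just an example, not a recommendation for this dataset.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Deliberately unbalanced toy data (invented for illustration)
toy_reviews = ["great film", "great fun", "fun and charming", "awful film"]
toy_tags = ["fresh", "fresh", "fresh", "rotten"]

toy_counts = CountVectorizer().fit_transform(toy_reviews)

# kernel selects the type of decision function; class_weight="balanced" weights
# each class inversely to its frequency in the training tags
toy_clf = svm.SVC(kernel="rbf", class_weight="balanced", random_state=0)
toy_clf.fit(toy_counts, toy_tags)

toy_predictions = toy_clf.predict(toy_counts)
print(toy_clf.classes_)
print(toy_predictions)
```

In the notebook's own experiment the training set is already balanced by construction, so `class_weight` would mainly matter when working with a different, unbalanced sample.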


**Optional - No marks** 
For your own interest, create a local copy of the notebook and redo the questions using the entire dataset with the same train, test split. Also try some different kernels for the SVM. How do these tests compare with the ones that you have done on the partial dataset in the

#### Signature

I, Rupsi Kaushik, declare that the answers provided in this notebook are my own.