### DS102 | In Class Practice Week 4B - Text Mining II - Gaining Insights from Text
<hr>
## Learning Objectives
At the end of the lesson, you will be able to:

- use **Jaccard Similarity** to find similar texts

- perform **sentiment analysis** using the `SentimentIntensityAnalyzer`

- train a **Naïve Bayes classifier** to classify a new piece of text into one of 2 classes

### Datasets Required for this Self-Study
1. `billboard-lyrics.csv`

2. `popcorn-reviews-5k.csv`

**import libraries**

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re

# If you are running this for the first time, run the next cell to download the
# VADER lexicon first
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Download the VADER list of words / lexicon
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/victorwongyd/nltk_data...


True

### Jaccard Similarity

Jaccard Similarity measures how similar two documents are. Treating each document as a set of its **unique words**, $A$ and $B$, the Jaccard Similarity Score is calculated as:

$$
\text{Jaccard Similarity Score} = \frac{|A \cap B|}{|A \cup B|}
$$

Simply put, the numerator is the number of unique words that **are common to both documents** and the denominator is the **number of unique words that appear in at least one of the two documents**. A score of $0$ means the documents share no words, while a score of $1$ means they use exactly the same set of words.

The function below, `calculate_jaccard_score`, returns the similarity score of two documents, `d1` and `d2`. It converts each document to a Python set; the documentation for sets and other data structures can be found [here](https://docs.python.org/3/tutorial/datastructures.html). Also, find out more about how multiple variables can be declared in the same line [here](https://docs.python.org/3.6/tutorial/datastructures.html#tuples-and-sequences).

In [3]:
def calculate_jaccard_score(d1, d2):
    set_a, set_b = set(d1), set(d2)
    return len(set_a & set_b) / len(set_a | set_b)

In [4]:
s1 = 'I would like to consolidate a few of my higher interest rate credit cards.'.split()
s2 = 'this loan is to consolidate credit card debt and pay of debt'.split()
# Your turn: what is the Jaccard score of the 2 lists? Call the function for this
# to validate your answer.
print(s1)
calculate_jaccard_score(s1, s2)

['I', 'would', 'like', 'to', 'consolidate', 'a', 'few', 'of', 'my', 'higher', 'interest', 'rate', 'credit', 'cards.']


0.19047619047619047
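
To see where this score comes from, the sketch below breaks the same calculation into its intersection and union. It reuses the two sentences from the cell above; the names `common` and `combined` are introduced here only for illustration.

```python
# Tokenise exactly as in the cell above: a plain whitespace split.
s1 = 'I would like to consolidate a few of my higher interest rate credit cards.'.split()
s2 = 'this loan is to consolidate credit card debt and pay of debt'.split()

set_a, set_b = set(s1), set(s2)

common = set_a & set_b      # unique words found in both documents
combined = set_a | set_b    # unique words found in at least one document

print(sorted(common))               # ['consolidate', 'credit', 'of', 'to']
print(len(common), len(combined))   # 4 21
print(len(common) / len(combined))  # 0.19047619047619047
```

Note that `cards.` and `card` are counted as different words because no normalisation (lowercasing, punctuation removal, stemming) has been applied.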

### Sentiment Analysis with `nltk`

The `nltk` library has a sentiment analyser. It uses the VADER method, or **Valence Aware Dictionary and sEntiment Reasoner**. It relies on a lexicon (vocabulary) of words and their relative sentiment strength. For example:
    
- `Good` has a positive but weak score, while `Excellent` scores more strongly positive
- `Bad` has a negative but weak score, while `Tragedy` scores more strongly negative

Use `sid.polarity_scores(t)` to find the sentiment of a text `t`. Then, use the `compound` value to determine the overall score. Note that `compound` gives a (normalised) value from $-1$ to $1$: a positive number indicates positive sentiment, while a negative number indicates negative sentiment.
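
As a minimal sketch of how a `compound` value could be turned into a label, the helper below maps a score to `positive`, `negative` or `neutral`. The function name `label_from_compound` and the cut-off of $0.05$ are assumptions made for illustration (a cut-off around $\pm 0.05$ is a commonly used convention for VADER, not something defined in this notebook), and the scores passed in are made up.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    `threshold` is an assumed cut-off; scores close to zero are treated as neutral.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Made-up scores, for illustration only
for score in (0.7, 0.0, -0.4):
    print(score, label_from_compound(score))
```

In practice the argument would be `ss['compound']`, taken from the dictionary returned by `sid.polarity_scores`.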

In [None]:
sid = SentimentIntensityAnalyzer()

Observe how the sentiment scores change based on the sentiment of a movie review.

In [None]:
# Dataset 3, Credits at the end of the notebook
# This is an example of a positive review, showing positive sentiment.
review_1 = """I thoroughly enjoyed this movie because there was a genuine sincerity in the acting."""
ss = sid.polarity_scores(review_1)
print(ss)
print(ss['compound'])

In [None]:
# Dataset 3, Credits at the end of the notebook
review_2 = "I found it really boring and silly."
# Exercise: What is the compound score of the above review?
#

# This is an example of a negative review.

In [None]:
#Dataset 3, Credits at the end of the notebook
review_3 = "My personal favorite horror film."
# Your turn: What is the compound score of the above review?
#

# Your turn: Is this a positive or negative movie review? 
# What does the VADER polarity score say about this review?

### Naïve Bayes Classification

We now extend sentiment analysis to texts that we have never seen before, and use a machine learning algorithm - Naïve Bayes Classification - to help us predict if a newly received review has positive or negative sentiment. Naïve Bayes Classification is the first machine learning algorithm we will learn in DS102.

#### PROBLEM SETUP

Suppose you have the following documents and their tagged sentiment **class**. $1$ represents a positive sentiment while $0$ represents a negative sentiment. There are only 2 possible classes in this problem.

|ID | Text | Sentiment
|---|---|--
|1|`enjoy like`|$1$
|2|`enjoy funny happy`|$1$
|3|`hate boring like`|$0$
|4|`like happy`|$1$
|5|`boring dull`|$0$

We now have a new document which is `like happy`. What is the predicted class of this new document?

#### TRAINING
 
The model mostly revolves around counting words and multiplying these proportions / probabilities. The following calculations are performed:

1. Find the number of unique words and store them in a variable $V$, called the vocabulary. $|V|$ is the length of the vocabulary.
2. Calculate the probability of each class, $1$ and $0$.
3. Calculate the **conditional probability** of each word given a class. For this calculation, add $1$ to the numerator and add $|V|$ to the denominator.

In this case, 
- $V$ = `{'boring', 'dull', 'enjoy', 'funny', 'happy', 'hate', 'like'}` and hence $|V|= 7$

- $P(1) = \frac{3}{5}$ and  $P(0) = \frac{2}{5}$

In [7]:
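# Scratch check: scores (x100) for the three-word document `like happy joy`.
# `joy` never appears in the training data, so it takes the smoothed values 1/14 and 1/12.
# First line: positive class (1). Second line: negative class (0).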
print(3/5 *3/14 *3/14 *1/14*100)
print(2/5 *2/12 *1/12 *1/12*100)

0.19679300291545188
0.0462962962962963


- The conditional probability $P($ `enjoy` $| 1)$ is the number of times `enjoy` appears in class $1$ divided by the total number of words in class $1$. `enjoy` appears $2$ times, and class $1$ contains $7$ words in total. Remember to "smooth" the fraction by adding $1$ to the numerator and $|V| = 7$ to the denominator. Hence, $P($ `enjoy` $| 1) = \frac{2+1}{7+7} = \frac{3}{14}$. For class $0$, which contains $5$ words in total, the denominator is $5 + 7 = 12$. Repeat this for the rest of the words in both classes:

|Word | Class $1$ calculation or $P($ `(word)` $| 1)$| Class $0$ calculation or $P($ `(word)` $| 0)$
|---|---|--
|`boring`|$\frac{1}{14}$|$\frac{3}{12}$
|`dull`|$\frac{1}{14}$ |$\frac{2}{12}$
|`enjoy`|$\frac{3}{14}$ |$\frac{1}{12}$
|`funny`|$\frac{2}{14}$|$\frac{1}{12}$
|`happy`|$\frac{3}{14}$|$\frac{1}{12}$
|`hate`|$\frac{1}{14}$|$\frac{2}{12}$
|`like`|$\frac{3}{14}$|$\frac{2}{12}$
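
The table can be reproduced with a short sketch that counts words per class and applies the same add-one smoothing. It uses only the five training documents from the problem setup; the names `docs`, `word_counts` and `cond_prob` are introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction

# Toy training set from the problem setup: (text, sentiment class)
docs = [
    ('enjoy like', 1),
    ('enjoy funny happy', 1),
    ('hate boring like', 0),
    ('like happy', 1),
    ('boring dull', 0),
]

# Vocabulary V: every unique word across all documents
vocab = sorted({word for text, _ in docs for word in text.split()})

# Count how often each word occurs within each class
word_counts = {0: Counter(), 1: Counter()}
for text, label in docs:
    word_counts[label].update(text.split())


def cond_prob(word, label):
    """Smoothed P(word | label): add 1 to the count and |V| to the class total."""
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, total + len(vocab))


for word in vocab:
    print(word, cond_prob(word, 1), cond_prob(word, 0))
```

`Fraction` prints values in lowest terms, so $\frac{3}{12}$ appears as `1/4` and $\frac{2}{14}$ as `1/7`; the values are the same as in the table.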

#### TEST

For the document `like happy`, calculate the probability score of each class. Do so by multiplying the probability of the class by the conditional probability of each term given that class. For $1$, the calculation is:

$$
\text{Class 1 Prediction} \propto P(1) \times P(\text{ like } | 1) \times P(\text{ happy } | 1) = \frac{3}{5} \times \frac{3}{14} \times \frac{3}{14} = 0.02755
$$

In [None]:
(3/5) * (3/14) * (3/14)

and for $0$, the calculation is:
$$
\text{Class 0 Prediction} \propto P(0) \times P(\text{ like } | 0) \times P(\text{ happy } | 0) = \frac{2}{5} \times \frac{2}{12} \times \frac{1}{12} = 0.00556
$$

In [None]:
(2/5) * (2/12) * (1/12)

Since the score for class $1$ is higher, using the model, the document is classified as class $1$ or positive sentiment. 
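
For longer documents, multiplying many small probabilities can underflow to zero, so a common workaround is to add logarithms instead of multiplying. The sketch below applies that idea to `like happy`, with the probabilities copied from the worked example; the names `prior`, `cond` and `log_score` are assumptions made for illustration, and this is not necessarily how any particular library implements it.

```python
import math

# Values copied from the worked example
prior = {1: 3/5, 0: 2/5}
cond = {
    1: {'like': 3/14, 'happy': 3/14},
    0: {'like': 2/12, 'happy': 1/12},
}


def log_score(words, label):
    """Sum of log-probabilities, equivalent to the log of the product."""
    return math.log(prior[label]) + sum(math.log(cond[label][w]) for w in words)


doc = ['like', 'happy']
scores = {label: log_score(doc, label) for label in (1, 0)}

# The class with the highest log-score also has the highest product
predicted = max(scores, key=scores.get)
print(predicted)

# Converting back recovers the scores from the hand calculation
print(math.exp(scores[1]), math.exp(scores[0]))
```

Because the logarithm is an increasing function, comparing log-scores always picks the same class as comparing the original products.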

#### YOUR TURN
There is a new document `hate dull`. What is the predicted class of this document?

In [8]:
# Exercise: What is the calculation for Class 1?
3/5 * 1/14 * 1/14 

0.003061224489795918

In [11]:
# Exercise: What is the calculation for Class 0?
2/5 * 2/12 * 2/12

0.011111111111111112

Since the score for class $0$ is higher, the model classifies `hate dull` as class $0$, or negative sentiment.

### Applied Naïve Bayes Classification using `sklearn`

In [17]:
# Dataset 3, Credits at the end of the notebook
# Read the reviews data from the CSV
popcorn_df = pd.read_csv('popcorn-reviews-5k.csv', sep="#") 

# Split the dataset into the training and test set. 
# The first 4500 records will be the training set while the last 500 records will be 
# the test set.
d_train = popcorn_df[:4500]
display(d_train.head(3))
d_test = popcorn_df[4500:]
display(d_test.head(3))
#

Unnamed: 0,id,review,sentiment
0,5196_9,Human Tornado (1976) is in many ways a better ...,1
1,2668_9,"Chilling, majestic piece of cinematic fright, ...",1
2,9565_3,I cant say that Wargames The Dead Code is the ...,0


Unnamed: 0,id,review,sentiment
4500,2910_7,This is a great film for McCartneys and Beatle...,1
4501,11707_10,"I remember seeing this movie a long time ago, ...",1
4502,5461_7,Escaping the life of being pimped by her fathe...,1


In [None]:
# Your turn: print the first 5 records of the training set. 
# How many columns are there in the training set?
#

In [None]:
# Your turn: print the first 5 records of the test set. 
# How many columns are there in the test set?
#

#### TRAIN

Train the model given the reviews and the given sentiment. Recall that for `sentiment`, 1 represents a positive review and 0 represents a negative review.

In [None]:
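# Work on a copy of the training set, mirroring how test_df is created later.
train_df = d_train.copy()

# Using fit_transform, transform the corpus to a matrix of word counts.
count_vect = CountVectorizer()
train_df_counts = count_vect.fit_transform(train_df['review'])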
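
To get a feel for what a matrix of word counts looks like, here is a rough pure-Python sketch of the bag-of-words idea. It is not `CountVectorizer` itself: the two reviews are made up, and the tokenisation rule (lowercase, then keep runs of two or more word characters) is an assumption chosen to resemble the vectoriser's default behaviour.

```python
import re
from collections import Counter

# Two made-up reviews, for illustration only
corpus = ['I enjoy funny films', 'Boring film, boring plot']


def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters (so 'I' is dropped)
    return re.findall(r'\b\w\w+\b', text.lower())


# One Counter of token frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Columns of the matrix: the sorted vocabulary across all documents
vocab = sorted(set().union(*counts))

# Rows of the matrix: one row of counts per document
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)   # ['boring', 'enjoy', 'film', 'films', 'funny', 'plot']
for row in matrix:
    print(row)  # [0, 1, 0, 1, 1, 0] then [2, 0, 1, 0, 0, 1]
```

Each row is one review and each column is one vocabulary word, which is the shape of input the classifier is trained on. `film` and `films` occupy separate columns because no text normalisation has been applied.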

In [None]:
# Train a multinomial classifier using the training set using the features 
# and the training set labels
clf = MultinomialNB().fit(train_df_counts, train_df['sentiment'])

#### TEST

Now that we have trained our classifier, let's test it using the test set. We will look at the actual sentiment of a test example and compare it with the class that the trained model predicts.

In [18]:
test_df = d_test.copy()

In [19]:
# Now, randomly sample an example from the test set.
t_sample = test_df.sample()

# Let's see the review in the test set and the actual sentiment.
s = t_sample.iloc[0]['review']
print(s)
t = t_sample.iloc[0]['sentiment']
print(t)

# Let's now see what class the model predicts for this test example.
print()
print(clf.predict(count_vect.transform([s])))

Jeremy Brett is simply the best Holmes ever, narrowly edging out the great Basil Rathbone of course, and this is probably the best adaptation of a Conon-Doyle short story. Excellent performances all round, especially from Robert Hardy, and both Brett and Hardwick fully rounded and comfortable in their roles makes this a superb piece of drama.
1




**Credits**
- [sebleier](https://gist.github.com/sebleier/554280) for Dataset 1
- [Kaggle (Billboard 1964-2015 Songs + Lyrics)](https://www.kaggle.com/rakannimer/billboard-lyrics) for Dataset 2
- [Kaggle (Bag of Words Meets Bags of Popcorn)](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) for Dataset 3

**Footnote**

(1): The reviews are only partially processed. Special characters have been removed, but text normalisation has not yet been performed.

**Further Reading**
Naïve Bayes can also be used for the following classification problems:

1. Spam vs. Non-Spam in E-Mail Filtering
2. Product classification using product titles