### DS102 | In Class Practice Week 4B - Text Mining II - Gaining Insights from Text
<hr>
## Learning Objectives
At the end of the lesson, you will be able to:

- use **Jaccard Similarity** to find similar texts

- perform **sentiment analysis** using the `SentimentIntensityAnalyzer`

- train a **Naïve Bayes classifier** to classify a new piece of text into one of 2 classes

### Datasets Required for this Self-Study
1. `billboard-lyrics.csv`

2. `popcorn-reviews-5k.csv`

**import libraries**

In [1]:
import pandas as pd
import nltk
import re

#If you are running this for the first time, use the next cell to download all the corpora first
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

In [2]:
# Download the VADER list of words / lexicon
# nltk.download('vader_lexicon')

**define `ENGLISH_STOP_WORDS` as a list**

In [3]:
#Dataset 1, Credits at the end of the notebook
ENGLISH_STOP_WORDS = ['i', 'me', 'my', 'myself', 'we', 'our', 
                      'ours', 'ourselves', 'you', 'your', 'yours', 
                      'yourself', 'yourselves', 'he', 'him', 'his', 
                      'himself', 'she', 'her', 'hers', 'herself', 'it', 
                      'its', 'itself', 'they', 'them', 'their', 'theirs', 
                      'themselves', 'what', 'which', 'who', 'whom', 'this', 
                      'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 
                      'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 
                      'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 
                      'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 
                      'with', 'about', 'against', 'between', 'into', 'through', 'during', 
                      'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 
                      'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 
                      'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 
                      'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 
                      'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

### Jaccard Similarity

Jaccard Similarity measures how similar two documents are. Given two documents, treated as sets of unique words $A$ and $B$, the Jaccard Similarity Score is calculated as:

$$
\text{Jaccard Similarity Score} = \frac{|A\cap B|}{|A \cup B|}
$$

Simply put, the numerator is the number of words that **are common across both documents** and the denominator is the **number of words that appear in at least one of the documents**. Keep in mind that the words here refer to **unique words**.

The function below, `calculate_jaccard_score`, returns the similarity score of two documents, `d1` and `d2`. It uses a list comprehension; the documentation for list comprehensions can be found [here](https://docs.python.org/3/tutorial/datastructures.html).

In [4]:
def calculate_jaccard_score(d1, d2):
    intersect = set([l for l in d1 if l in d2])
    union = set(d1 + d2)
    return len(intersect)/len(union)

In [5]:
s1 = ['one', 'dem', 'days', 'dont', 'take', 'personal',]
s2 = ['dont', 'take', 'personal', 'wanna', 'alone',]
# Your turn: what is the Jaccard score of the 2 lists? Call the function for this
print(calculate_jaccard_score(s1, s2))

0.375
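
The same score can also be computed directly with Python's set operators, which mirrors the formula more closely. The sketch below is an alternative for illustration, not the notebook's own function: the name `jaccard_with_sets` and the choice to return `0.0` for two empty documents are assumptions.

```python
def jaccard_with_sets(d1, d2):
    """Jaccard similarity of two token lists, computed on their unique words."""
    a, b = set(d1), set(d2)
    union = a | b
    if not union:
        # Assumption: two empty documents are given a score of 0.0
        return 0.0
    return len(a & b) / len(union)

s1 = ['one', 'dem', 'days', 'dont', 'take', 'personal']
s2 = ['dont', 'take', 'personal', 'wanna', 'alone']
print(jaccard_with_sets(s1, s2))  # 3 shared words / 8 unique words = 0.375
```

Converting both lists to sets first also makes repeated words irrelevant, which is what "unique words" means in the definition.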


One simple application of Jaccard Similarity is to find similar lyrics in songs. The following is a collection of song lyrics of some songs from 1965 to 2015.

In [6]:
#Dataset 2, Credits at the end of the notebook
songs_df = pd.read_csv('billboard-lyrics.csv', index_col=0)
# Copy the index of the df into a new column named 'ID'
songs_df['ID'] = songs_df.index

In [7]:
# Your turn: Inspect the last 10 rows of the dataset to see songs in 2015 using tail()
songs_df.tail(10)

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,ID
966,11,cheerleader,omi,2015,when i need motivation my one solution is my ...,966
967,12,cant feel my face,the weeknd,2015,and i know shell be the death of me at least ...,967
968,13,love me like you do,ellie goulding,2015,youre the light youre the night youre the col...,968
969,14,take me to church,hozier,2015,my lovers got humour shes the giggle at a fun...,969
970,15,bad blood,taylor swift featuring kendrick lamar,2015,cause baby now we got bad blood you know it u...,970
971,16,lean on,major lazer and dj snake featuring mo,2015,do you recall not long ago we would walk on t...,971
972,17,want to want me,jason derulo,2015,its too hard to sleep i got the sheets on the...,972
973,18,shake it off,taylor swift,2015,i stay up too late got nothing in my brain th...,973
974,19,"where are ""u now",skrillex and diplo featuring justin bieber,2015,i need you i need you i need you i need you i...,974
975,20,fight song,rachel platten,2015,like a small boat on the ocean sending big wa...,975


Use `df.shape` to get the number of records

In [8]:
# Your turn: Use .shape to find the number of rows and columns
print(songs_df.shape)
# Your turn: Assign the number of rows to the variable num_songs
num_songs = songs_df.shape[0]
print(num_songs)

(976, 6)
976


To extract a value from the `df`, first use `iloc` to select a row by its position in the `df`, then index the result by column name.

In [9]:
print(songs_df.iloc[4]['ID'])
print()
# Your turn: Extract the column 'Lyrics' from the row
print(songs_df.iloc[4]['Lyrics'])

4

 when youre alone and life is making you lonely you can always go downtown when youve got worries all the noise and the hurry seems to help i know downtownjust listen to the music of the traffic in the city linger on the sidewalk where the neon signs are pretty how can you lose the lights are much brighter there you can forget all your troubles forget all your caresso go downtown things will be great when youre downtown no finer place for sure downtown every things waiting for youdont hang around and let your problems surround you there are movie shows downtown maybe you know some little places to go to where they never close downtownjust listen to the rhythm of a gentle bossa nova youll be dancing with em too before the night is over happy again the lights are much brighter there you can forget all your troubles forget all your caresso go downtown where all the lights are bright downtown waiting for you tonight downtown youre gonna be alright nowdowntownand you may find somebody ki

Before performing analysis, we can store the values in a dictionary. While doing so, remove all stop words from the lyrics.

In [10]:
lyrics = {}

for i in range(0, num_songs):
    # Find the song_id from the df row
    song_id = songs_df.iloc[i]['ID']
    # Your turn: find the Lyrics from the df row
    song_lyrics = songs_df.iloc[i]['Lyrics']    
    # Use split() to split the words into a list of tokens    
    list_of_lyrics = song_lyrics.split()
    # Keep only the words that are not in the list of stop words
    list_of_lyrics_without_sw = [w for w in list_of_lyrics if w not in ENGLISH_STOP_WORDS]
    # Your turn: Use the song_id as the key and the lyrics as the value in the dictionary    
    lyrics[song_id] = list_of_lyrics_without_sw


Use this to validate the dictionary `lyrics`.

In [11]:
# Validate the dictionary
print(lyrics[4])

['youre', 'alone', 'life', 'making', 'lonely', 'always', 'go', 'downtown', 'youve', 'got', 'worries', 'noise', 'hurry', 'seems', 'help', 'know', 'downtownjust', 'listen', 'music', 'traffic', 'city', 'linger', 'sidewalk', 'neon', 'signs', 'pretty', 'lose', 'lights', 'much', 'brighter', 'forget', 'troubles', 'forget', 'caresso', 'go', 'downtown', 'things', 'great', 'youre', 'downtown', 'finer', 'place', 'sure', 'downtown', 'every', 'things', 'waiting', 'youdont', 'hang', 'around', 'let', 'problems', 'surround', 'movie', 'shows', 'downtown', 'maybe', 'know', 'little', 'places', 'go', 'never', 'close', 'downtownjust', 'listen', 'rhythm', 'gentle', 'bossa', 'nova', 'youll', 'dancing', 'em', 'night', 'happy', 'lights', 'much', 'brighter', 'forget', 'troubles', 'forget', 'caresso', 'go', 'downtown', 'lights', 'bright', 'downtown', 'waiting', 'tonight', 'downtown', 'youre', 'gonna', 'alright', 'nowdowntownand', 'may', 'find', 'somebody', 'kind', 'help', 'understand', 'someone', 'like', 'needs'

**Exercise** - Given a `song_id` and the `lyrics` dataset, return the `song_id` and `score` of the song with the highest similarity score. Do not compare the test song with itself.

In [12]:
# Your turn: Complete this function to return the ID of the most similar song and its Jaccard score
def get_most_similar_song(lyrics, test_song_id):
    highest_song_id = 0
    highest_score = 0.0
    print(highest_score)
#     for song_id, song_lyrics_list in lyrics.items():
#         if song_id != test_song_id:
#             score = calculate_jaccard_score(lyrics[test_song_id], lyrics[song_id])
#             if score > highest_score:
#                 highest_song_id, highest_score = song_id, score
    
    return test_song_id, highest_song_id, highest_score
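
One possible way to complete the exercise is sketched below on a toy dictionary, so that it runs without the dataset. The names `most_similar` and `toy_lyrics` are hypothetical. Returning `None` when no other song shares a word, and keeping the first song found when scores tie, are assumptions of this sketch.

```python
def calculate_jaccard_score(d1, d2):
    intersect = set(d1) & set(d2)
    union = set(d1) | set(d2)
    return len(intersect) / len(union)

def most_similar(lyrics, test_song_id):
    """Return (test_song_id, best_id, best_score) over all other songs."""
    best_id, best_score = None, 0.0
    for song_id, tokens in lyrics.items():
        if song_id == test_song_id:
            continue  # never compare a song with itself
        score = calculate_jaccard_score(lyrics[test_song_id], tokens)
        if score > best_score:
            best_id, best_score = song_id, score
    return test_song_id, best_id, best_score

# Hypothetical toy data: song ID -> list of tokens without stop words
toy_lyrics = {
    0: ['love', 'baby', 'night'],
    1: ['love', 'baby', 'dance'],
    2: ['rain', 'road', 'home'],
}
print(most_similar(toy_lyrics, 0))  # (0, 1, 0.5)
```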

Use the results to find interesting song pairs.

In [13]:
i, j, h_score = get_most_similar_song(lyrics, 960)
# i, j, score = get_most_similar_song(lyrics, songs_df['ID'].sample().iloc[0])
#Interesting results: 960, 962, 877, 613, 332, 966

#Remove the index to show the full list of lyrics
print(str(i) + " (Source) - " + str(lyrics[i])) 
print()
#Remove the index to show the full list of lyrics
print(str(j) + " (Target) - " + str(lyrics[j])) 
print()
print(h_score)
songs_df[songs_df['ID'].isin([i,j])]

0.0
960 (Source) - ['im', 'hurting', 'baby', 'im', 'broken', 'need', 'loving', 'loving', 'need', 'im', 'without', 'im', 'something', 'weak', 'got', 'begging', 'begging', 'im', 'kneesi', 'dont', 'wanna', 'needing', 'love', 'wanna', 'deep', 'love', 'killing', 'youre', 'away', 'ooh', 'baby', 'cause', 'really', 'dont', 'care', 'wanna', 'gotta', 'get', 'one', 'little', 'tasteyour', 'sugar', 'yes', 'please', 'wont', 'come', 'put', 'im', 'right', 'cause', 'need', 'little', 'love', 'little', 'sympathy', 'yeah', 'show', 'good', 'loving', 'make', 'alright', 'need', 'little', 'sweetness', 'life', 'sugar', 'yes', 'please', 'wont', 'come', 'put', 'memy', 'broken', 'pieces', 'pick', 'dont', 'leave', 'hanging', 'hanging', 'come', 'give', 'im', 'without', 'ya', 'im', 'insecure', 'one', 'thing', 'one', 'thing', 'im', 'living', 'fori', 'dont', 'wanna', 'needing', 'love', 'wanna', 'deep', 'love', 'killing', 'youre', 'away', 'ooh', 'baby', 'cause', 'really', 'dont', 'care', 'wanna', 'gotta', 'get', 'one',

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,ID
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,0
960,5,sugar,maroon 5,2015,im hurting baby im broken down i need your lo...,960


### Sentiment Analysis with `nltk`

The `nltk` library has a sentiment analyser. It uses the VADER method, or **Valence Aware Dictionary and sEntiment Reasoner**. VADER is built on a lexicon (vocabulary) of words and their relative sentiment strength. For example:
    
- `Good` has a positive but weak score, while `Excellent` scores more
- `Bad` has a negative but weak score, while `Tragedy` scores more

Use `sid.polarity_scores(t)` to find the sentiment of a text. Then, use the `compound` value to determine the overall score. Note that `compound` gives a (normalised) value from $-1$ to $1$: a positive number indicates positive sentiment, while a negative number indicates negative sentiment.

In [14]:
sid = SentimentIntensityAnalyzer()

Observe how the sentiment scores change based on the sentiment of a movie review.

In [15]:
#Dataset 3, Credits at the end of the notebook
# This is an example of a positive review, showing positive sentiment.
review_1 = """I thoroughly enjoyed this movie because there was a genuine sincerity in the acting."""
ss = sid.polarity_scores(review_1)
print(ss)
print(ss['compound'])

{'neg': 0.0, 'neu': 0.754, 'pos': 0.246, 'compound': 0.5563}
0.5563


In [16]:
# Dataset 3, Credits at the end of the notebook
review_2 = "I found it really boring and silly."
# Your turn: Print the VADER polarity scores. What is the compound score of the above review?
ss2 = sid.polarity_scores(review_2)
print(ss2)
print(ss2['compound'])
# This is an example of a negative review.

{'neg': 0.326, 'neu': 0.503, 'pos': 0.171, 'compound': -0.3025}
-0.3025


In [17]:
#Dataset 3, Credits at the end of the notebook
review_3 = "My personal favorite horror film."
# Your turn: Print the VADER polarity scores. What is the compound score of the above review?
ss3 = sid.polarity_scores(review_3)
print(ss3)
print(ss3['compound'])
# Your turn: Is this a positive or negative movie review? What does the VADER polarity score
# say about this review?
# A review can express positive sentiment overall and still receive a negative
# sentiment score because of the individual words it contains

{'neg': 0.381, 'neu': 0.309, 'pos': 0.309, 'compound': -0.1779}
-0.1779
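
To turn a `compound` score into a label, a cut-off is needed. The sketch below uses $\pm 0.05$, a commonly cited convention for VADER. Treat both the threshold and the helper name `label_from_compound` as assumptions; neither is part of `nltk`.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores printed for review_1, review_2 and review_3 above
for score in (0.5563, -0.3025, -0.1779):
    print(score, label_from_compound(score))
```

With this rule, the third review is labelled negative even though it reads as positive, presumably because a word such as `horror` carries negative valence in the lexicon.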


### Naïve Bayes Classification

We now extend sentiment analysis to texts that we have never seen before, and use a machine learning algorithm - **Naïve Bayes Classification** - to help us predict if a newly received review has positive or negative sentiment. Naïve Bayes Classification is the first machine learning algorithm we will learn in DS102.

#### PROBLEM SETUP

Suppose you have the following documents, each tagged with a sentiment **class**. $1$ represents a positive sentiment while $0$ represents a negative sentiment. There are only 2 possible classes in this problem. The documents are this short because stop words have already been removed.

|ID | Text | Sentiment
|---|---|--
|1|`enjoy like`|$1$
|2|`enjoy funny happy`|$1$
|3|`hate boring like`|$0$
|4|`like happy`|$1$
|5|`boring dull`|$0$

We now have a new document which is `like happy`. What is the predicted class of this new document?

#### TRAINING
 
The model mostly revolves around counting words and multiplying the resulting proportions / probabilities. The following calculations are performed:

1. Find the number of unique words and store them in a variable $V$, called the vocabulary. $|V|$ is the length of the vocabulary.
2. Calculate the probability of each class, $1$ and $0$.
3. Calculate the **conditional probability** of each word given a class. For this calculation, add $1$ to the numerator and add $|V|$ to the denominator.

In this case, 
- $V$ = `{'boring', 'dull', 'enjoy', 'funny', 'happy', 'hate', 'like'}` and hence $|V|= 7$

- $P(1) = \frac{3}{5}$ and  $P(0) = \frac{2}{5}$

- The conditional probability $P($ `enjoy` $| 1)$ is the number of times `enjoy` appears in class $1$ divided by the total number of words in class $1$. The number of times `enjoy` appears is $2$. The total number of words is $7$. Remember to "smooth" the fraction. Hence, $P($ `enjoy` $| 1) = \frac{2+1}{7+7} = \frac{3}{14}$. Repeat this for the rest of the words in both classes:

|Word | Class $1$ calculation or $P($ `(word)` $| 1)$| Class $0$ calculation or $P($ `(word)` $| 0)$
|---|---|--
|`boring`|$\frac{1}{14}$|$\frac{3}{12}$
|`dull`|$\frac{1}{14}$ |$\frac{2}{12}$
|`enjoy`|$\frac{3}{14}$ |$\frac{1}{12}$
|`funny`|$\frac{2}{14}$|$\frac{1}{12}$
|`happy`|$\frac{3}{14}$|$\frac{1}{12}$
|`hate`|$\frac{1}{14}$|$\frac{2}{12}$
|`like`|$\frac{3}{14}$|$\frac{2}{12}$
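
The table can be reproduced with a few lines of Python. This is a minimal sketch using exact fractions, assuming the five training documents from the problem setup; `conditional_probs` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

# Training documents from the problem setup: (text, sentiment)
docs = [
    ('enjoy like', 1),
    ('enjoy funny happy', 1),
    ('hate boring like', 0),
    ('like happy', 1),
    ('boring dull', 0),
]

# The vocabulary V: every unique word in the training documents
vocab = sorted({w for text, _ in docs for w in text.split()})

def conditional_probs(label):
    """Return {word: P(word | label)} with add-one smoothing."""
    counts = Counter(w for text, y in docs if y == label for w in text.split())
    total = sum(counts.values())
    return {w: Fraction(counts[w] + 1, total + len(vocab)) for w in vocab}

for word in vocab:
    print(word, conditional_probs(1)[word], conditional_probs(0)[word])
```

`Fraction` prints values in lowest terms, so $\frac{2}{14}$ appears as `1/7` and $\frac{3}{12}$ as `1/4`.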

#### TEST

For the document `like happy`, calculate the probability score of each class. Do so by multiplying the probability of the class by the conditional probability of each term given that class. For $1$, the calculation is:

$$
\text{Class 1 Prediction} \propto P(1) \times P(\text{ like } | 1) \times P(\text{ happy } | 1) = \frac{3}{5} \times \frac{3}{14} \times \frac{3}{14} = 0.02755
$$

In [18]:
(3/5) * (3/14) * (3/14)

0.027551020408163263

and for $0$, the calculation is:
$$
\text{Class 0 Prediction} \propto P(0) \times P(\text{ like } | 0) \times P(\text{ happy } | 0) = \frac{2}{5} \times \frac{2}{12} \times \frac{1}{12} \approx 0.00556
$$

In [19]:
(2/5) * (1/12) * (2/12)

0.005555555555555555

Since the score for class $1$ is higher, using the model, the document is classified as class $1$ or positive sentiment. 

#### BAYES THEOREM EXPLAINED

$$P(A,B) = P(A)\times P(B|A)$$

$$\begin{align}
P (A|B) &= \frac{P(A,B)}{P(B)}\\
&= \frac{P(A)\times P(B|A)}{P(B)}
\end{align}$$



#### GENERAL FORM OF THE ALGORITHM

Given a document with $n$ terms, $w_1, w_2, \cdots, w_n$, calculate the prediction of a class, $C_k$, using Bayes' theorem together with the "naïve" assumption that the terms are independent of one another given the class:

$$\begin{align}
\text{Class } C_k \text{ Prediction } = P(C_k| w_1,w_2,\cdots,w_n)&= \frac{P(\text{Class } C_k) \times P(w_1|C_k) \times P(w_2|C_k) \times \cdots \times P(w_n|C_k)}{P(w_1, w_2, \cdots, w_n)}
\end{align}$$

and find the class $C_k$ with the **largest probability**. In this case, $k=0$ or $k=1$.

Notice that for all classes $C_0, C_1$, the denominator term 
$$P(w_1, w_2, \cdots, w_n)$$ 

is the same, because it does not depend on the class. Hence, it can be treated as a constant, let's call it $T$. The above equation simplifies to:

$$\begin{align}
\text{Class } C_k \text{ Prediction } = \frac 1T \times \begin{bmatrix}P(\text{Class } C_k) \times P(w_1|C_k) \times P(w_2|C_k) \times \cdots \times P(w_n|C_k)\end{bmatrix}
\end{align}$$

Since $\frac 1T$ is the same for every class, it can be treated simply as a proportionality constant, yielding the equation we used above:

<div class="alert alert-success">
$$\begin{align}
\text{Class } C_k \text{ Prediction } \propto P(\text{Class } C_k) \times P(w_1|C_k) \times P(w_2|C_k) \times \cdots \times P(w_n|C_k)
\end{align}$$</div>
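
The proportional scores can be turned back into probabilities by dividing each one by the sum of the scores over all classes, so that the results add up to 1. The sketch below hard-codes the priors and smoothed conditional probabilities from the worked example for the document `like happy`; `class_scores` is a hypothetical helper name.

```python
from fractions import Fraction
from math import prod

priors = {1: Fraction(3, 5), 0: Fraction(2, 5)}
# Smoothed P(word | class) for the words needed here, from the training table
cond = {
    1: {'like': Fraction(3, 14), 'happy': Fraction(3, 14)},
    0: {'like': Fraction(2, 12), 'happy': Fraction(1, 12)},
}

def class_scores(words):
    """Unnormalised score P(C) * prod P(w | C) for each class."""
    return {c: priors[c] * prod(cond[c][w] for w in words) for c in priors}

scores = class_scores(['like', 'happy'])
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
predicted = max(scores, key=scores.get)

print({c: float(s) for c, s in scores.items()})      # proportional scores
print({c: float(p) for c, p in posteriors.items()})  # normalised, sums to 1
print(predicted)
```

Normalising does not change which class wins, which is why comparing the proportional scores is enough for prediction.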


#### YOUR TURN
There is a new document `hate dull`. What is the predicted class of this document?

In [20]:
# Your turn: What is the calculation for Class 1?
(3/5) * (1/14) * (1/14)

0.0030612244897959178

In [21]:
# Your turn: What is the calculation for Class 0?
(2/5) * (2/12) * (2/12)

0.01111111111111111

### Applied Naïve Bayes Classification using `sklearn`

What is the predicted sentiment of the following partially processed (Footnote 1) reviews? 

- `Ive seen it over and over throughout the years and Im always spellbound by it`

- `This film failed to explore the humanity of the animals which left me with an empty feeling inside.`

- `As an adult I really did enjoy this one. `

- `The lead characters  lone wolf  bravado is uninspiring and lame, and the script was apparently written by a monkey with an eight grade education.`

You can judge the sentiment of these reviews because you have identified words (or groups of words) that reflect it. `spellbound`, `failed`, `enjoy`, `uninspiring`, `lame` are words that give you an indication of the sentiment of the text. Let's apply this idea.

In [22]:
#Dataset 3, Credits at the end of the notebook
# Read the reviews data from the CSV
popcorn_df = pd.read_csv('popcorn-reviews-5k.csv', sep="#") 

# Split the dataset into the training and test set. The first 4500 records will be the training set
# while the last 500 records will be the test set.
train_df = popcorn_df[:4500]
test_df = popcorn_df[4500:]

In [23]:
# Your turn: print the first 5 records of the training set. How many columns are there in the training set?
print(train_df.head())
print(train_df.shape)

         id                                             review  sentiment
0    5196_9  Human Tornado (1976) is in many ways a better ...          1
1    2668_9  Chilling, majestic piece of cinematic fright, ...          1
2    9565_3  I cant say that Wargames The Dead Code is the ...          0
3  10271_10  This movie should not be compared to  The Stin...          1
4    5639_7  Ive read the other reviews and found some to b...          1
(4500, 3)


In [24]:
# Your turn: print the first 5 records of the test set. How many columns are there in the test set?
print(test_df.head())
print(test_df.shape)

            id                                             review  sentiment
4500    2910_7  This is a great film for McCartneys and Beatle...          1
4501  11707_10  I remember seeing this movie a long time ago, ...          1
4502    5461_7  Escaping the life of being pimped by her fathe...          1
4503    6029_7  There arent too many times when I see a film a...          1
4504    9462_1  Inappropriate. The PG rating that this movie g...          0
(500, 3)


#### TRAIN

Train the model given the reviews and the given sentiment. Recall that for `sentiment`, 1 represents a positive review and 0 represents a negative review.

In [25]:
# Using fit_transform, transform the corpus to a matrix.
count_vect = CountVectorizer()
train_df_counts = count_vect.fit_transform(train_df['review'])

In [26]:
# Train a multinomial Naïve Bayes classifier on the training set features and labels
clf = MultinomialNB().fit(train_df_counts, train_df['sentiment'])
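
To see how these two calls relate to the hand calculation, the same pipeline can be run on the five toy documents from the problem setup. This is a sketch under stated assumptions: the `toy_` names are hypothetical, and `alpha=1.0` (the `MultinomialNB` default) is passed explicitly because it corresponds to the add-one smoothing used in the worked example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_texts = ['enjoy like', 'enjoy funny happy', 'hate boring like',
             'like happy', 'boring dull']
toy_labels = [1, 1, 0, 1, 0]

# Build the document-term count matrix: 5 documents x 7 vocabulary words
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_texts)

# Fit the classifier with add-one smoothing
toy_clf = MultinomialNB(alpha=1.0).fit(toy_counts, toy_labels)

new_docs = ['like happy', 'hate dull']
new_counts = toy_vect.transform(new_docs)  # reuse the fitted vocabulary
print(toy_clf.predict(new_counts))         # predicted class per document
print(toy_clf.predict_proba(new_counts))   # columns follow toy_clf.classes_
```

New text must go through `transform`, not `fit_transform`, so that it is encoded with the vocabulary learned from the training set.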

#### TEST

Now that we have trained our classifier, let's test it using the test set. We will look at the actual sentiment of a test example and compare it with what the model predicts.

In [27]:
# Now, randomly sample an example from the test set.
t_sample = test_df.sample()

# Let's see the review in the test set and the actual sentiment.
s = t_sample.iloc[0]['review']
print(s)
t = t_sample.iloc[0]['sentiment']
print(t)

# Let's now see what class the model predicts for this test example.
print()
print(clf.predict(count_vect.transform([s])))

For my humanities quarter project for school, i chose to do human trafficking. After some research on the internet, i found this DVD and ordered it. I just finished watching it and I am still thinking about it. All I can say is  Wow . It is such a compelling story of a 12 year old Vietnamese girl named Holly and an American man named Patric who tries to save her. The ending leaves you breathless, and although its not a happily-ever-after ending, it is very realistic. It is amazing and I recommend it to anyone! You really connect with Holly and Patric and your heart breaks for her and because of what happens to her. I loved it so much and now I want to know what happens next 
1

[1]


**Credits**
- [sebleier](https://gist.github.com/sebleier/554280) for Dataset 1
- [Kaggle (Billboard 1964-2015 Songs + Lyrics)](https://www.kaggle.com/rakannimer/billboard-lyrics) for Dataset 2
- [Kaggle (Bag of Words Meets Bags of Popcorn)](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) for Dataset 3

**Footnote**

(1) : The reviews are partially processed. Only removal of special characters was performed. The remaining steps to be performed are stemming and removal of stop words.

**Further Reading**
Naïve Bayes can also be used for the following classification problems:

1. Spam vs. Non-Spam in E-Mail Filtering
2. Product classification using product titles