### DS102 | Self-Study Week 4B - Text Mining IV - Numerically Representing Texts

## Learning Objectives
At the end of the self-study, you will be able to:

- extend the existing implementation of Jaccard Similarity to find similar texts

- find additional corpora in the `nltk` library

- deepen your understanding of `CountVectorizer`

### Datasets Required for this Self-Study
1. `billboard-lyrics.csv`

#### Import Libraries

In [None]:
import pandas as pd
import nltk
import re

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer

Let's first read `billboard-lyrics.csv` into a DataFrame. Then, inspect the DataFrame for the Billboard songs from $2015$.

In [None]:
# Dataset 2, Credits at the end of the notebook
songs_df = pd.read_csv('billboard-lyrics.csv')
# Store the indices of the df as a new column 'ID'
songs_df['ID'] = songs_df.index

In [None]:
# Use tail() to show the last 20 rows of this dataset. They are the songs from 2015
# Write your code here. Solution in Fill in the Blanks 1
#

Then, store all English stop words in `ENGLISH_STOP_WORDS`.

In [None]:
# Use stopwords.words('english') to get all English stopwords and store them in 
# ENGLISH_STOP_WORDS
# This list can also be found in Dataset 1. Credits at the end of the notebook

ENGLISH_STOP_WORDS = [] # Solution in Fill in the Blanks 2

Specify the `ID` of a song to get its song lyrics.

In [None]:
song_id = 970 # Change the index accordingly.

# Use iloc[song_id][<column_name>] to get the values in songs_df. 'ID' is already
# done for you.
print(songs_df.iloc[song_id]['ID'])
# Extract the column 'Lyrics' from the row.  Solution in Fill in the Blanks 3
#

Before performing the analysis, store the lyrics in a dictionary keyed by song `ID`. Remove all stop words from the lyrics before storing them.

In [None]:
NUM_SONGS = songs_df.shape[0]
lyrics = {}

for i in range(0, NUM_SONGS):
    # Find the song_id from the df row
    song_id = songs_df.iloc[i]['ID']
    # Your turn: find the Lyrics from the df row
    song_lyrics = songs_df.iloc[i]['Lyrics']    
    # Use split() to split the words into a list of tokens    
    list_of_lyrics = song_lyrics.split()
    # Keep only the words that are not in the list of stopwords
    list_of_lyrics_without_sw = [w for w in list_of_lyrics if w not in ENGLISH_STOP_WORDS]
    # Use the song_id as the key and the lyrics as the value in the dictionary    
    lyrics[song_id] = list_of_lyrics_without_sw


In [None]:
# Validate the dictionary
print(lyrics[song_id])
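To see what the loop does to a single song, here is a minimal sketch on a made-up lyric line. The names `SAMPLE_STOP_WORDS` and `sample_lyrics` are hypothetical, and the tiny stop-word collection only stands in for `stopwords.words('english')`. Storing the stop words in a `set` is an optional choice that makes each membership check faster.

```python
# Hypothetical stand-in for stopwords.words('english'), kept tiny for illustration
SAMPLE_STOP_WORDS = {"i", "the", "a", "to", "you", "and"}

# Made-up lyric line, not taken from the dataset
sample_lyrics = "i want to hold the line and dance with you"

# Split on whitespace, then keep only the tokens that are not stop words
tokens = sample_lyrics.split()
tokens_without_sw = [w for w in tokens if w not in SAMPLE_STOP_WORDS]

print(tokens_without_sw)
```

Note that the comparison is case-sensitive, so this approach assumes the lyrics are already in lower case, like the stop words.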

Given a `song_id` and the `lyrics` dictionary, return the given `song_id`, the `song_id` of the song with the highest similarity score, and that `score`. Note that the selected song must not be compared with itself.

In [None]:
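def calculate_jaccard_score(d1, d2):
    # Jaccard similarity: size of the intersection divided by size of the union
    set_a, set_b = set(d1), set(d2)
    union = set_a | set_b
    # Avoid dividing by zero when both token lists are empty
    return len(set_a & set_b) / len(union) if union else 0.0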

def get_most_similar_song(lyrics, test_song_id):
    highest_song_id = 0
    highest_score = 0.0
    
    # Iterate through every song.
    for song_id, song_lyrics_list in lyrics.items():
        if song_id != test_song_id: # Do not compare the selected song with itself.
            score = calculate_jaccard_score(lyrics[test_song_id], song_lyrics_list)
            # Update highest_song_id and highest_score if this song scores
            # higher than the best match found so far.
            if score > highest_score:
                highest_song_id, highest_score = song_id, score
    
    return test_song_id, highest_song_id, highest_score
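As a quick check of the formula $J(A, B) = |A \cap B| / |A \cup B|$, the sketch below scores two made-up token lists. The function `jaccard` mirrors `calculate_jaccard_score` above; `song_a` and `song_b` are hypothetical examples rather than rows from the dataset, and returning `0.0` for two empty lists is an assumption made here to avoid a division by zero.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token lists: |A & B| / |A | B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    # Assumption: two empty documents get a score of 0.0
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Made-up token lists for illustration
song_a = ["love", "night", "dance", "heart"]
song_b = ["love", "heart", "rain"]

# Shared tokens: {love, heart} -> 2
# All tokens:    {love, night, dance, heart, rain} -> 5
print(jaccard(song_a, song_b))  # 0.4
```

Because the lists are converted to sets, repeated words count only once, so a chorus repeated many times does not raise the score.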

Use the results to find interesting song pairs.

In [None]:
i, j, h_score = get_most_similar_song(lyrics, song_id)
# i, j, score = get_most_similar_song(lyrics, songs_df['ID'].sample().iloc[0])
# Interesting results: 960, 962, 877, 613, 332, 966. Store one of them in the song_id variable

# Add a slice, e.g. lyrics[i][:20], to show only the first few tokens
print(str(i) + " (Source) - " + str(lyrics[i])) 
print()
# Add a slice, e.g. lyrics[j][:20], to show only the first few tokens
print(str(j) + " (Target) - " + str(lyrics[j])) 
print()
print(h_score)
songs_df[songs_df['ID'].isin([i,j])]
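One possible extension is to return the few most similar songs instead of only the best match. The sketch below is one way to do that under stated assumptions: `get_top_similar_songs`, `toy_lyrics` and the parameter `top_n` are hypothetical names, and the toy dictionary only stands in for the `lyrics` dictionary built earlier.

```python
import heapq

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token lists; 0.0 if both are empty."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def get_top_similar_songs(lyrics, test_song_id, top_n=3):
    """Return up to top_n (song_id, score) pairs, highest score first."""
    scores = [
        (song_id, jaccard(lyrics[test_song_id], tokens))
        for song_id, tokens in lyrics.items()
        if song_id != test_song_id  # never compare a song with itself
    ]
    return heapq.nlargest(top_n, scores, key=lambda pair: pair[1])

# Made-up data: song ID -> list of tokens
toy_lyrics = {
    0: ["love", "night", "dance"],
    1: ["love", "night", "rain"],
    2: ["car", "road", "drive"],
    3: ["love", "dance", "night", "heart"],
}

print(get_top_similar_songs(toy_lyrics, 0, top_n=2))  # [(3, 0.75), (1, 0.5)]
```

With the real data, the call would look like `get_top_similar_songs(lyrics, song_id)`, and the returned IDs could be passed to `songs_df['ID'].isin(...)` to look up the song titles.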

### Self-Study Solutions

In [None]:
#################################
# Solution: Fill in the Blanks 1
#################################
# songs_df.tail(20)

In [None]:
#################################
# Solution: Fill in the Blanks 2
#################################
# ENGLISH_STOP_WORDS = stopwords.words('english')

In [None]:
#################################
# Solution: Fill in the Blanks 3
#################################
# print(songs_df.iloc[song_id]['Lyrics'])

### Convert a Document to a Vector using `CountVectorizer`

The following are the words in the corpus in the Naïve Bayes Classification example:

In [None]:
docs_as_s = ['enjoy like', 
             'enjoy funny happy', 
             'hate boring like', 
             'like happy', 
             'boring dull']

`fit_transform()` first learns the vocabulary from the documents, then transforms each document into a row of term counts.

In [None]:
count_vectorizer = CountVectorizer()
ft = count_vectorizer.fit_transform(docs_as_s)

First, let's see the vocabulary. Every unique term in the corpus is assigned a column position, which is stored as the value for that term in the `vocabulary_` dictionary.

In [None]:
count_vectorizer.vocabulary_

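The positions are useful for checking whether a word occurs in a particular document. For example, if the column with index `2` has a value greater than `0`, then the document contains the word `enjoy`.

**NOTE:** `CountVectorizer` assigns positions in alphabetical order of the terms, so `enjoy` is at index `2` here. The dictionary itself may be printed in a different order, so read the position from each value rather than from the order of the printout.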

In [None]:
# Convert the sparse document-term matrix to a dense array for display
ft.toarray()
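To make the link between the vocabulary and the matrix columns concrete, the following is a pure-Python sketch of the counting step. It is only an approximation of `CountVectorizer` with its default settings (it splits on whitespace rather than using the library's token pattern), and `build_vocabulary` and `to_count_vector` are hypothetical helper names. It assumes column indices follow the alphabetical order of the terms, which is how `CountVectorizer` assigns them when it learns the vocabulary itself.

```python
from collections import Counter

docs_as_s = ['enjoy like',
             'enjoy funny happy',
             'hate boring like',
             'like happy',
             'boring dull']

def build_vocabulary(docs):
    """Map each unique term to a column index, in alphabetical order."""
    terms = sorted({term for doc in docs for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}

def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        vector[index] = counts[term]
    return vector

vocabulary = build_vocabulary(docs_as_s)
matrix = [to_count_vector(doc, vocabulary) for doc in docs_as_s]

print(vocabulary)
for row in matrix:
    print(row)

# Documents whose 'enjoy' column is greater than 0 contain the word 'enjoy'
enjoy_column = vocabulary['enjoy']
docs_with_enjoy = [i for i, row in enumerate(matrix) if row[enjoy_column] > 0]
print(docs_with_enjoy)  # [0, 1]
```

Each row of `matrix` corresponds to one document and each column to one vocabulary term, which is the same layout as the dense array produced from `ft`.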

### Further Exploration - `nltk.corpus`

`nltk.corpus` provides many corpora (the plural form of corpus) that you can download and experiment with. The following are two more contemporary corpora. Before you access a corpus, use `nltk.download()` to download it to your local machine.

<div class="alert alert-info">The Brown corpus is curated by Brown University. It has 1 million words in English and contains text from 500 sources. Each source is categorised into a genre.</div>

In [None]:
from nltk.corpus import brown
try:
    # Use corpus.categories() to show all category tags of the documents.
    print(brown.categories())
except LookupError:
    # The corpus is not on this machine yet: download it, then try again.
    print("Downloading brown...")
    nltk.download('brown')
    print("Download brown complete")
    print(brown.categories())

<div class="alert alert-info">The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics.</div>

In [None]:
from nltk.corpus import reuters
try:
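    # Use corpus.words() to access the words in the corpus; show the first 20.
    print(reuters.words()[:20])
except LookupError:
    # The corpus is not on this machine yet: download it, then try again.
    print("Downloading reuters...")
    nltk.download('reuters')
    print("Download reuters complete")
    print(reuters.words()[:20])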

For more information on the corpora mentioned, head over to [Chapter 2 of Natural Language Processing with Python](http://www.nltk.org/book/ch02.html).

**Credits**
- [sebleier](https://gist.github.com/sebleier/554280) for Dataset 1
- [Kaggle (Billboard 1964-2015 Songs + Lyrics)](https://www.kaggle.com/rakannimer/billboard-lyrics) for Dataset 2