# Understanding TF-IDF

In this section we will work with texts and derive weighted metrics based on word (or term) frequencies within these texts. More precisely, we will use the _TF-IDF_ metric, which stands for _Term Frequency-Inverse Document Frequency_, to measure how important certain words are in the documents of our IMDb corpus. The "texts" or "documents" we will look at are the 'Plot' descriptions in the IMDb dataset.

## Loading the IMDb dataset

Load the IMDb dataset and look closely at the 'Plot' column.

In [1]:
# code goes here

## Create a data structure

We need a custom data structure to carry out our TF-IDF calculations. Create a Python dictionary whose keys are the indices of the dataframe above and whose values are dictionaries holding a 'plot' entry for the corresponding row.
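
As a rough illustration of the target shape, the sketch below builds such a nested dictionary from a small stand-in list of plots. The `toy_plots` list, its texts and the name `toy_plot_dict` are invented for illustration; with the real dataframe you would iterate over its index and its plot column instead.

```python
# Hypothetical stand-in for the dataframe's plot column (invented texts).
toy_plots = [
    "A ship hits an iceberg and sinks.",
    "Two strangers fall in love on a ship.",
]

# Outer keys mimic the dataframe indices; each value is a dictionary
# that starts with a single 'plot' entry and can be extended later
# (tokens, term counts, and so on).
toy_plot_dict = {index: {"plot": text} for index, text in enumerate(toy_plots)}

print(toy_plot_dict[0]["plot"])
```

Keeping one inner dictionary per document makes it easy to attach further entries to the same row as the section progresses.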

In [23]:
plot_dict = {}

# code goes here

## Tokenize and filter

Now that we have the plot of each IMDb entry in our dictionary, it is time to tokenize each plot's text and clean it up. Do we need punctuation as part of our tokens? Are there "stop words" we could get rid of? Please complete the following tokenizer function using the spaCy library (which you used in Data Mining; remember to uncomment the first line if you are using spaCy for the first time). When this is done, augment your custom dictionary with the plot's tokens for each entry.

In [24]:
# !python3 -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

def split_and_stop(text):
    
    # tokenize the text with spacy
    tokens = nlp(text.lower())
    
    return # code goes here

In [34]:
# code goes here

## Understanding Term Frequency (TF)

$$
tf(t, d) = \frac{n_{t}} {\sum_{k} n_{k}}
$$

_Term Frequency_ is a normalised metric that measures how frequent a certain term $t$ is in a given document $d$. In the formula above, $n_{t}$ stands for the number of times the term $t$ occurs in document $d$, while $\sum_{k} n_{k}$ is the total count of all terms in the document (in other words, its length). Because a term $t$ can occur many times in $d$, and longer documents naturally produce higher raw counts, the count is normalised by the document length. Below is a function `calculate_tf` which takes as input the `tokens` of a document $d$, counts the number of occurrences of each term in the document and calculates its normalised frequency.

In [29]:
def calculate_tf(tokens):
    unique_tokens = set(tokens)
    term_count = dict.fromkeys(unique_tokens, 0)
    term_frequency = dict.fromkeys(unique_tokens, 0)
    N = float(len(tokens))
    for term in tokens:
        term_count[term] += 1
        term_frequency[term] += 1 / N
    return term_count, term_frequency        
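
As a quick sanity check, here is a minimal sketch that runs `calculate_tf` on a toy token list. The function is repeated unchanged so the block runs on its own; the tokens and the `toy_` names are invented for illustration.

```python
def calculate_tf(tokens):
    # Same definition as above, repeated so this sketch is standalone.
    unique_tokens = set(tokens)
    term_count = dict.fromkeys(unique_tokens, 0)
    term_frequency = dict.fromkeys(unique_tokens, 0)
    N = float(len(tokens))
    for term in tokens:
        term_count[term] += 1
        term_frequency[term] += 1 / N
    return term_count, term_frequency

# Invented tokens: "ship" accounts for 2 of the 4 tokens.
toy_tokens = ["ship", "sinks", "ship", "iceberg"]
toy_count, toy_tf = calculate_tf(toy_tokens)

print(toy_count["ship"], toy_tf["ship"])  # prints: 2 0.5
```

Since every token contributes $1/N$, the frequencies of a document always sum to 1 (up to floating-point rounding).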

Considering the function `calculate_tf` above, augment your custom dictionary with both the `term_count` and normalised `term_frequency` given the respective plot's `tokens` you previously computed.

In [32]:
# code goes here

## Understanding Inverse Document Frequency (IDF)

$$
idf(t, D) = \log\frac{|D|}{|\{d_{i} \in D : t \in d_{i}\}|}
$$

_Inverse Document Frequency_ is a metric that measures how important a term $t$ is in a given corpus (or collection) $D$ of documents $d_{i}$. While _Term Frequency_ measures the frequency of a term $t$ in a single document $d$, _IDF_ considers how widespread the term $t$ is over the whole corpus $D$ in order to derive a weight for the statistical significance of $t$ overall. The idea is that common words which occur in many documents ("man", or a stop word like "it", for example) hold little importance overall because they do not help tell documents apart. _IDF_ gives more weight to words that are uncommon overall yet possibly significant for certain documents. This is why the metric takes the $\log$ of the fraction $\frac{|D|}{|\{d_{i} \in D : t \in d_{i}\}|}$, where $|D|$ is the number of documents in corpus $D$ and $|\{d_{i} \in D : t \in d_{i}\}|$ is the number of documents in the corpus that contain the term $t$: the rarer the term, the larger the fraction and the larger the weight.
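
To make the effect concrete, take a hypothetical corpus of $|D| = 1000$ documents (the numbers are invented for illustration, and $\log$ is taken to be the natural logarithm, as in Python's `math.log`). A term found in only 10 documents gets a large weight, while a term found in 900 documents gets a weight close to zero:

$$
\log\frac{1000}{10} = \log 100 \approx 4.61, \qquad \log\frac{1000}{900} \approx 0.105
$$

A term that appears in every document gets $\log 1 = 0$, so it contributes nothing.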

The first thing we need to do to calculate _IDF_ is to establish the vocabulary of the entire corpus. What are all the unique words (or terms) across all of our plots? How many unique words do we have? Consider the following `bag_of_words` Python set and fill it with all the unique terms present in our plots.
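
One possible way to collect a vocabulary, sketched on toy data: the token lists and the `toy_` names below are invented for illustration and stand in for the tokens stored in your custom dictionary.

```python
# Hypothetical token lists for three documents (invented for illustration).
toy_tokens_per_doc = [
    ["ship", "iceberg", "ship"],
    ["ship", "love"],
    ["love", "love", "letter"],
]

toy_bag_of_words = set()
for tokens in toy_tokens_per_doc:
    # update() adds every token; the set silently ignores duplicates.
    toy_bag_of_words.update(tokens)

print(len(toy_bag_of_words))  # prints: 4
print(sorted(toy_bag_of_words))
```

A set is a natural fit here because membership is all that matters: how often a term occurs is already recorded per document by `term_count`.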

In [1]:
# Vocabulary -> bag of words

bag_of_words = set()

# code goes here

Now, remember that we calculated a `term_count` for each document when we computed the _TF_ with `calculate_tf` above? We need this pre-calculated information here to derive $|\{d_{i} \in D : t \in d_{i}\}|$, the number of documents in the corpus that contain the term $t$. Make a list of every `term_count` you recorded in your custom dictionary so that it can be used to compute the _IDF_ below.

In [44]:
list_all_documents_count = None  # code goes here: replace None with your list

Here is a function `calculate_idf` that computes the _IDF_ of all the terms in our corpus. It takes a list of `term_count` dictionaries as `documents_count_list` and the overall vocabulary as `bag_of_words`. Can you make sense of the function in light of the $idf(t, D)$ formula above?

In [46]:
import math

def calculate_idf(documents_count_list, bag_of_words):
    
    idf = dict.fromkeys(bag_of_words, 0)
    D = len(documents_count_list)
    
    for d in documents_count_list:
        for term, count in d.items():
            if count > 0:
                idf[term] += 1
                
    for term, document_count in idf.items():
        idf[term] = math.log(D / float(document_count))
        
    return idf
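
To see the function at work, here is a minimal sketch on a three-document toy corpus. The function is repeated unchanged so the block runs on its own; the term counts, the vocabulary and the `toy_` names are invented for illustration.

```python
import math

def calculate_idf(documents_count_list, bag_of_words):
    # Same definition as above, repeated so this sketch is standalone.
    idf = dict.fromkeys(bag_of_words, 0)
    D = len(documents_count_list)
    for d in documents_count_list:
        for term, count in d.items():
            if count > 0:
                idf[term] += 1
    for term, document_count in idf.items():
        idf[term] = math.log(D / float(document_count))
    return idf

# Invented term counts: one dictionary per document.
toy_counts = [
    {"ship": 2, "iceberg": 1},
    {"ship": 1, "love": 1},
    {"love": 2, "letter": 1},
]
toy_vocab = {"ship", "iceberg", "love", "letter"}

toy_idf = calculate_idf(toy_counts, toy_vocab)

# List the terms from highest to lowest weight.
for term in sorted(toy_idf, key=toy_idf.get, reverse=True):
    print(term, round(toy_idf[term], 3))
```

Here "iceberg" and "letter" each appear in one of three documents and get $\log 3$, while "ship" and "love" appear in two and get $\log 1.5$. Note that the division raises a `ZeroDivisionError` if `bag_of_words` contains a term that occurs in none of the documents, so the vocabulary should be built from the same documents.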

Let's calculate the _IDF_ using the function above. Which terms have the highest weights? Which have the lowest?

In [None]:
# code goes here

## Putting it together: TF-IDF

$$
\text{tf-idf}(t, d, D) = tf(t, d) \cdot idf(t, D)
$$

Putting _TF_ and _IDF_ together is quite simple. Since _IDF_ is a weight for each term in the corpus, simply multiply each term's _TF_ value, which we have already calculated, by that term's _IDF_ weight. Here is a function `calculate_tf_idf` that does just that!

In [48]:
def calculate_tf_idf(tf, idf):
    tf_idf = dict.fromkeys(tf.keys(), 0)
    for term, frequency in tf.items():
        tf_idf[term] = frequency * idf[term]
    return tf_idf    
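
The following sketch applies `calculate_tf_idf` to a single toy document. The function is repeated unchanged so the block runs on its own; the frequencies, the weights and the `toy_` names are invented for illustration (the weights assume a three-document corpus in which "ship" occurs in two documents and the other terms in one).

```python
import math

def calculate_tf_idf(tf, idf):
    # Same definition as above, repeated so this sketch is standalone.
    tf_idf = dict.fromkeys(tf.keys(), 0)
    for term, frequency in tf.items():
        tf_idf[term] = frequency * idf[term]
    return tf_idf

# Invented term frequencies for one document.
toy_tf = {"ship": 0.5, "sinks": 0.25, "iceberg": 0.25}

# Invented corpus-wide weights; the corpus vocabulary may contain
# terms ("love") that this document does not use.
toy_idf = {
    "ship": math.log(3 / 2),
    "sinks": math.log(3 / 1),
    "iceberg": math.log(3 / 1),
    "love": math.log(3 / 1),
}

toy_tf_idf = calculate_tf_idf(toy_tf, toy_idf)

for term, score in sorted(toy_tf_idf.items(), key=lambda item: item[1], reverse=True):
    print(term, round(score, 3))
```

Although "ship" is the most frequent term in this document, it ends up with the lowest score because it is also common across the toy corpus, which is exactly the reweighting _TF-IDF_ is meant to provide.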

With the function above, calculate the _TF-IDF_ of all plots in your custom dictionary and record the results in the dictionary itself. 

In [51]:
# code goes here

What is the difference between _TF_ and _IDF_ for a given plot?

In [1]:
# code goes here

## Save the data

Save the custom dictionary you constructed above to a JSON file.
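
A minimal sketch of a JSON round trip, assuming a toy dictionary in place of the real one. The `toy_plot_dict` contents and the file name are invented for illustration, and a temporary directory is used only so that the sketch leaves no files behind.

```python
import json
import os
import tempfile

# Invented stand-in for the custom dictionary built in this section.
toy_plot_dict = {
    0: {
        "plot": "A ship hits an iceberg.",
        "tokens": ["ship", "hits", "iceberg"],
        "term_count": {"ship": 1, "hits": 1, "iceberg": 1},
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_plot_dict.json")  # arbitrary file name

    # Write the dictionary to disk as JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_plot_dict, f, indent=2)

    # Read it back to check what was stored.
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(list(reloaded.keys()))  # prints: ['0']
```

Two things are worth keeping in mind: JSON object keys are always strings, so integer indices come back as `'0'`, `'1'`, and so on, and Python sets are not JSON-serialisable, so tokens should be stored as lists.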

In [None]:
import json

# code goes here