# Walkthrough: indexing bag-of-words into a vector table

This walkthrough leads from bag-of-words representations of documents to **vector signatures** (indexes) using the **TF-IDF** formula.

The ultimate goal of **indexing** is to create a **vector representation** (signature) for each document. This vector representation can then be used to:

- mine the features that characterize classes of documents (supervised learning using **labels**)
- mine the documents that have similar features to establish trends (unsupervised learning).

To do that, we need:
- a fixed number of features
- a quantitative value for each feature.

The number of features is given by the vocabulary over the corpus: the set of all possible words (tokens) found in all documents.

The quantitative value is given, for each document, by counting the occurrences of each of these words in the document and applying a TF-IDF formula.
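As a minimal sketch of these two ingredients, the toy corpus below (three hypothetical bags of words, not taken from the Amazon Reviews data) is turned into a vocabulary and into one fixed-length count vector per document. The names `toy_bows`, `toy_vocabulary` and `toy_counts` are illustrative only and are not used elsewhere in the walkthrough.

```python
from collections import Counter

# hypothetical bags of words, one per document (illustration only)
toy_bows = [
    ["great", "phone", "great", "battery"],
    ["bad", "battery"],
    ["great", "price"],
]

# vocabulary = sorted set of all words found in all documents
toy_vocabulary = sorted({word for bow in toy_bows for word in bow})

# one count vector per document, aligned with the vocabulary order
toy_counts = []
for bow in toy_bows:
    occ = Counter(bow)
    toy_counts.append([occ[word] for word in toy_vocabulary])

print(toy_vocabulary)
for row in toy_counts:
    print(row)
```

Every vector has the same length as the vocabulary, whatever the length of the document, which is what makes the documents comparable feature by feature.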

<img src="img/pipeline-walkthrough2.png" width="70%"/>

## 0. Loading some input data from the Amazon Reviews

To try this indexing walkthrough, we will get 5 reviews from the Amazon Reviews dataset. We will apply a function for extracting bag-of-words representations from these 5 documents.

In [None]:
import os               # for environ variables in Part 3
from nlp_pipeline import extract_bow_from_raw_text
import json

docs = []
with open('./reviews.json', 'r') as data_file:    
    for line in data_file:
        docs.append(json.loads(line))

# extracting bows
bows = list(map(lambda row: extract_bow_from_raw_text(row['reviewText']), docs))

# displaying bows
for i in range(len(docs)):
    print("\n--- review: {}".format(docs[i]['reviewText']))
    print("--- bow: {}".format(bows[i]))

## 1. Obtaining vocabulary and term frequencies

In [None]:
from collections import Counter

# term occurrence = number of times each distinct word appears in a bag
term_occ = list(map(lambda bow : Counter(bow), bows))

# term frequency = occurrences over length of bag
term_freq = list()
for i in range(len(docs)):
    term_freq.append( {k: (v / float(len(bows[i])))
                       for k, v in term_occ[i].items()} )

# displaying occurrences and frequencies
for i in range(len(docs)):
    print("\n--- review: {}".format(docs[i]['reviewText']))
    print("--- bow: {}".format(bows[i]))
    print("--- term_occ: {}".format(term_occ[i]))
    print("--- term_freq: {}".format(term_freq[i]))

## 2. Obtaining document frequencies

In [None]:
# document occurrence = number of documents containing this word
# (each bag is turned into a set so a word counts once per document)

doc_occ = Counter( [word for bow in bows for word in set(bow)] )

# document frequency = document occurrences over number of documents in the corpus
doc_freq = {k: (v / float(len(docs)))
            for k, v in doc_occ.items()}

# displaying vocabulary
print("\n--- full vocabulary: {}".format(doc_occ))
print("\n--- doc freq: {}".format(doc_freq))

## 3. Transforming frequencies into a TF-IDF vector

TF-IDF is the product of two parts: the term frequency tf and the inverse document frequency idf. In its simplest form, the term frequency is the raw count of a term in a document:

$tf(term,document) = \text{number of times the term appears in the document}$

The code in Section 1 divides this count by the length of the bag, which gives a normalized term frequency.

The idf part is defined in terms of the document frequency. The document frequency is 

$df(term,corpus) = \frac{\text{number of documents that contain the term}}{\text{number of documents in the corpus}}$

The inverse document frequency is defined in terms of the document frequency as

$idf(term,corpus) = \log{\frac{1}{df(term,corpus)}}$.
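To get a feel for this formula, the sketch below evaluates it for a few document frequencies, next to the smoothed variant $\log(1 + \frac{1}{df})$ that the last code cell of this walkthrough uses. The function names `plain_idf` and `smooth_idf` are hypothetical helpers introduced for this illustration.

```python
import math

def plain_idf(df_value):
    """Plain inverse document frequency: log(1 / df), for 0 < df <= 1."""
    return math.log(1.0 / df_value)

def smooth_idf(df_value):
    """Smoothed variant: log(1 + 1 / df), strictly positive for 0 < df <= 1."""
    return math.log(1.0 + 1.0 / df_value)

# a term present in all, half, or a fifth of the documents
for df_value in (1.0, 0.5, 0.2):
    print("df={:.1f}  idf={:.3f}  smooth_idf={:.3f}".format(
        df_value, plain_idf(df_value), smooth_idf(df_value)))
```

With the plain formula, a term found in every document gets an idf of zero and so vanishes from every tf-idf vector; the smoothed variant keeps a small positive weight for it. In both cases, rarer terms receive larger weights.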

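Despite its name, the idf is really the logarithm of the inverse document frequency. The last code cell of this section uses a smoothed variant, $\log(1 + \frac{1}{df(term,corpus)})$, which stays strictly positive even for a term present in every document. Finally, tf-idf is just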

$\text{tf-idf}(term,document,corpus) = tf(term,document) \times idf(term,corpus)$
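The formulas above can be sketched end to end on a toy corpus. This example assumes raw counts for tf and the plain $\log(1/df)$ for idf, exactly as written in the formulas; the bags of words and all `toy_*` names are hypothetical and independent of the review data.

```python
import math
from collections import Counter

# hypothetical bags of words, one per document (illustration only)
toy_bows = [
    ["great", "phone", "great", "battery"],
    ["bad", "battery"],
    ["great", "price"],
]
toy_n_docs = len(toy_bows)

# df(term, corpus): share of documents that contain the term
toy_doc_occ = Counter(word for bow in toy_bows for word in set(bow))
toy_df = {term: occ / toy_n_docs for term, occ in toy_doc_occ.items()}

# idf(term, corpus) = log(1 / df)
toy_idf = {term: math.log(1.0 / value) for term, value in toy_df.items()}

# tf(term, document) = raw count; tf-idf = tf * idf
toy_tfidf = []
for bow in toy_bows:
    toy_tf = Counter(bow)
    toy_tfidf.append({term: count * toy_idf[term]
                      for term, count in toy_tf.items()})

for doc in toy_tfidf:
    print({term: round(value, 3) for term, value in doc.items()})
```

In the first document, "phone" appears once but only in that document, so it ends up with a higher weight than "great", which appears twice but is shared with another document.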

In [None]:
# the minimum document frequency (as a proportion of the number of documents in the corpus)
min_df = 0.3

# filtering items to obtain the vocabulary
vocabulary = [ k for k,v in doc_freq.items() if v >= min_df ]

# print vocabulary
print("-- vocabulary (len={}): {}".format(len(vocabulary), vocabulary))

In [None]:
import numpy as np

# create a dense matrix of vectors for each document
# each vector has the length of the vocabulary
vectors = np.zeros((len(docs),len(vocabulary)))

# fill these vectors with tf-idf values
for i in range(len(docs)):
    for j in range(len(vocabulary)):
        term     = vocabulary[j]
        term_tf  = term_freq[i].get(term, 0.0)   # 0.0 if term not found in doc
        term_idf = np.log(1 + 1 / doc_freq[term]) # smoothed idf: log(1 + 1/df)
        vectors[i,j] = term_tf * term_idf

# displaying results
for i in range(len(docs)):
    print("\n--- review: {}".format(docs[i]['reviewText']))
    print("--- bow: {}".format(bows[i]))
    print("--- tfidf vector: {}".format( vectors[i] ) )
    print("--- tfidf sorted: {}".format( 
            sorted( zip(vocabulary,vectors[i]), key=lambda x:-x[1] )
         ))
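Once every document has a vector of the same length, documents can be compared numerically, which is what the unsupervised use case mentioned in the introduction relies on. As one possible illustration (cosine similarity is an assumption here, not a step of this walkthrough), the sketch below compares hypothetical tf-idf rows that stand in for rows of `vectors`; `cosine_similarity` and `toy_vectors` are illustrative names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# hypothetical tf-idf rows, standing in for rows of `vectors`
toy_vectors = [
    [0.2, 0.0, 0.4],
    [0.1, 0.0, 0.2],
    [0.0, 0.5, 0.0],
]

# documents 0 and 1 weight the same terms in the same proportions
print(cosine_similarity(toy_vectors[0], toy_vectors[1]))
# documents 0 and 2 share no term
print(cosine_similarity(toy_vectors[0], toy_vectors[2]))
```

The first pair scores close to 1 because the two vectors point in the same direction, while the second pair scores 0 because the documents have no term in common.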