# Natural Language Processing

J.F. Omhover
Thanks to Mari Pierce-Quinonez for some great enhancements that I reused.

### Requirements

You need to install the `nltk` module:

```
conda install nltk
```

This module needs corpora that you have to download. Downloading all of them can take a very long time, so for simplicity here are the minimal corpora for this lecture. In a terminal, open `ipython` and type:

```
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_treebank_pos_tagger')
```

### General Introduction

Natural Language Processing is a subfield of machine learning focused on making sense of text. Text is inherently unstructured, and converting (vectorizing) it into a format that a machine learning algorithm can interpret requires all sorts of tricks. It is called Processing for a reason: most of what we'll be covering during this morning session are data processing operations that make it possible to plug text into other ML algorithms.

### Overview of NLP

Natural language processing is concerned with understanding text using computation. People working within the field are often concerned with:

- Information retrieval. How do you find a document or a particular fact within a document?
- Document classification. Which of several mutually exclusive categories does the document belong to?
- Machine translation. How do you write an English phrase in Chinese? Think of Google Translate.
- Sentiment analysis. Was a product review positive or negative?

Natural language processing is a huge field and we will just touch on some of the concepts.

### Objectives

- Name and describe the steps necessary for processing text in machine learning.
- Implement a Natural Language Processing pipeline.
- Explain the cosine similarity measure and why it is used in NLP.

# Text Featurization part 1 : Bags of Words

This Walkthrough will lead us from raw documents to bag-of-words representations using **Natural Language Processing** functions.

In our case, this walkthrough is a preliminary step of a pipeline for **indexing** documents.

The ultimate goal of **indexing** is to create a **signature** (vector) for each document.

This **signature** will be used for relating documents to one another (and finding clusters of similar documents), or for mining underlying relations between concepts.

<img src="img/pipeline-walkthrough1.png" width="70%"/>

## 0. Text sources and possible text mining inputs

In [1]:
paragraph = u"My mother drove me to the airport with the windows rolled down. It was seventy-five degrees in Phoenix, the sky a perfect, cloudless blue. I was wearing my favorite shirt – sleeveless, white eyelet lace; I was wearing it as a farewell gesture. My carry-on item was a parka. In the Olympic Peninsula of northwest Washington State, a small town named Forks exists under a near-constant cover of clouds. It rains on this inconsequential town more than any other place in the United States of America. It was from this town and its gloomy, omnipresent shade that my mother escaped with me when I was only a few months old. It was in this town that I’d been compelled to spend a month every summer until I was fourteen. That was the year I finally put my foot down; these past three summers, my dad, Charlie, vacationed with me in California for two weeks instead."

print(paragraph)

My mother drove me to the airport with the windows rolled down. It was seventy-five degrees in Phoenix, the sky a perfect, cloudless blue. I was wearing my favorite shirt – sleeveless, white eyelet lace; I was wearing it as a farewell gesture. My carry-on item was a parka. In the Olympic Peninsula of northwest Washington State, a small town named Forks exists under a near-constant cover of clouds. It rains on this inconsequential town more than any other place in the United States of America. It was from this town and its gloomy, omnipresent shade that my mother escaped with me when I was only a few months old. It was in this town that I’d been compelled to spend a month every summer until I was fourteen. That was the year I finally put my foot down; these past three summers, my dad, Charlie, vacationed with me in California for two weeks instead.


### Encode

In [2]:
input_string = paragraph.encode('utf-8')

print(input_string)

My mother drove me to the airport with the windows rolled down. It was seventy-five degrees in Phoenix, the sky a perfect, cloudless blue. I was wearing my favorite shirt – sleeveless, white eyelet lace; I was wearing it as a farewell gesture. My carry-on item was a parka. In the Olympic Peninsula of northwest Washington State, a small town named Forks exists under a near-constant cover of clouds. It rains on this inconsequential town more than any other place in the United States of America. It was from this town and its gloomy, omnipresent shade that my mother escaped with me when I was only a few months old. It was in this town that I’d been compelled to spend a month every summer until I was fourteen. That was the year I finally put my foot down; these past three summers, my dad, Charlie, vacationed with me in California for two weeks instead.


In [3]:
import unicodedata

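def remove_accents(input_str):
    # decompose accented characters into base character + combining mark
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # drop every character that has no ASCII representation
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    # decode the bytes back to text (str(only_ascii) would give "b'...'" in Python 3)
    return only_ascii.decode('ascii')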

input_string = remove_accents(paragraph)

print(input_string)

My mother drove me to the airport with the windows rolled down. It was seventy-five degrees in Phoenix, the sky a perfect, cloudless blue. I was wearing my favorite shirt  sleeveless, white eyelet lace; I was wearing it as a farewell gesture. My carry-on item was a parka. In the Olympic Peninsula of northwest Washington State, a small town named Forks exists under a near-constant cover of clouds. It rains on this inconsequential town more than any other place in the United States of America. It was from this town and its gloomy, omnipresent shade that my mother escaped with me when I was only a few months old. It was in this town that Id been compelled to spend a month every summer until I was fourteen. That was the year I finally put my foot down; these past three summers, my dad, Charlie, vacationed with me in California for two weeks instead.
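The output above lost the dash after "shirt" and the apostrophe in "I’d". Here is a minimal sketch of why, assuming the same NFKD-then-ASCII approach as `remove_accents` (the helper name `to_ascii` and the sample strings are only illustrative). Accented letters decompose into a base letter plus a combining mark, so the base letter survives. Typographic punctuation such as the en dash or the curly apostrophe has no ASCII decomposition and is dropped entirely.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented characters into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # characters without an ASCII representation are silently dropped
    return decomposed.encode('ascii', 'ignore').decode('ascii')

accented = to_ascii(u"caf\u00e9 na\u00efve")          # accents are stripped
apostrophe = to_ascii(u"I\u2019d")                    # curly apostrophe is dropped
dash = to_ascii(u"shirt \u2013 sleeveless")           # en dash is dropped

print(repr(accented))    # 'cafe naive'
print(repr(apostrophe))  # 'Id'
print(repr(dash))        # 'shirt  sleeveless' (two spaces remain)
```

If the apostrophe or the dash matters downstream, one option is to replace such characters with ASCII equivalents before normalizing.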


# 1. Creating bag-of-words for each document

## 1.1. Tokenize document

**"Tokenize"** means creating "tokens", which are atomic units of the text. These tokens are usually words we extract from the document by splitting it (using whitespace and punctuation as separators). We can also consider sentences as tokens (and words as sub-tokens of sentences).
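To make the idea concrete, here is a rough sketch of sentence and word splitting with regular expressions only. It is a naive stand-in, not how NLTK's tokenizers work internally: the function names and the patterns are assumptions chosen for illustration, and they will mis-split abbreviations such as "Dr." or "U.S.".

```python
import re

def naive_sent_tokenize(text):
    # split after '.', '!' or '?' when followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_word_tokenize(sentence):
    # a token is either a (possibly hyphenated) word or a single punctuation mark
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

text = "My carry-on item was a parka. It rains on this town."
sentences = naive_sent_tokenize(text)
words = [naive_word_tokenize(s) for s in sentences]

for sent, toks in zip(sentences, words):
    print("--- sentence: {}".format(sent))
    print("--- tokens:   {}".format(toks))
```

In practice the NLTK functions used in this walkthrough are preferable, since they handle many edge cases that these two patterns ignore.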

### nltk.tokenize.sent_tokenize

In [4]:
from nltk.tokenize import sent_tokenize

sent_tokens = sent_tokenize(input_string)

for sent in sent_tokens:
    print("--- sentence: {}".format(sent))

--- sentence: My mother drove me to the airport with the windows rolled down.
--- sentence: It was seventy-five degrees in Phoenix, the sky a perfect, cloudless blue.
--- sentence: I was wearing my favorite shirt  sleeveless, white eyelet lace; I was wearing it as a farewell gesture.
--- sentence: My carry-on item was a parka.
--- sentence: In the Olympic Peninsula of northwest Washington State, a small town named Forks exists under a near-constant cover of clouds.
--- sentence: It rains on this inconsequential town more than any other place in the United States of America.
--- sentence: It was from this town and its gloomy, omnipresent shade that my mother escaped with me when I was only a few months old.
--- sentence: It was in this town that Id been compelled to spend a month every summer until I was fourteen.
--- sentence: That was the year I finally put my foot down; these past three summers, my dad, Charlie, vacationed with me in California for two weeks instead.


### nltk.tokenize.word_tokenize

In [5]:
from nltk.tokenize import word_tokenize

tokens = list(map(word_tokenize, sent_tokens))

#tokens = word_tokenize(input_string)
for sent in tokens:
    print("--- sentence tokens: {}".format(sent))
#print("--- nltk tokens from paragraph:\n{}".format(tokens))

--- sentence tokens: ['My', 'mother', 'drove', 'me', 'to', 'the', 'airport', 'with', 'the', 'windows', 'rolled', 'down', '.']
--- sentence tokens: ['It', 'was', 'seventy-five', 'degrees', 'in', 'Phoenix', ',', 'the', 'sky', 'a', 'perfect', ',', 'cloudless', 'blue', '.']
--- sentence tokens: ['I', 'was', 'wearing', 'my', 'favorite', 'shirt', 'sleeveless', ',', 'white', 'eyelet', 'lace', ';', 'I', 'was', 'wearing', 'it', 'as', 'a', 'farewell', 'gesture', '.']
--- sentence tokens: ['My', 'carry-on', 'item', 'was', 'a', 'parka', '.']
--- sentence tokens: ['In', 'the', 'Olympic', 'Peninsula', 'of', 'northwest', 'Washington', 'State', ',', 'a', 'small', 'town', 'named', 'Forks', 'exists', 'under', 'a', 'near-constant', 'cover', 'of', 'clouds', '.']
--- sentence tokens: ['It', 'rains', 'on', 'this', 'inconsequential', 'town', 'more', 'than', 'any', 'other', 'place', 'in', 'the', 'United', 'States', 'of', 'America', '.']
--- sentence tokens: ['It', 'was', 'from', 'this', 'town', 'and', 'its', 

### lower

In [6]:
import string

tokens_lower = [[w.lower() for w in sent] for sent in tokens]

for sent in tokens_lower:
    print("--- sentence tokens: {}".format(sent))

--- sentence tokens: ['my', 'mother', 'drove', 'me', 'to', 'the', 'airport', 'with', 'the', 'windows', 'rolled', 'down', '.']
--- sentence tokens: ['it', 'was', 'seventy-five', 'degrees', 'in', 'phoenix', ',', 'the', 'sky', 'a', 'perfect', ',', 'cloudless', 'blue', '.']
--- sentence tokens: ['i', 'was', 'wearing', 'my', 'favorite', 'shirt', 'sleeveless', ',', 'white', 'eyelet', 'lace', ';', 'i', 'was', 'wearing', 'it', 'as', 'a', 'farewell', 'gesture', '.']
--- sentence tokens: ['my', 'carry-on', 'item', 'was', 'a', 'parka', '.']
--- sentence tokens: ['in', 'the', 'olympic', 'peninsula', 'of', 'northwest', 'washington', 'state', ',', 'a', 'small', 'town', 'named', 'forks', 'exists', 'under', 'a', 'near-constant', 'cover', 'of', 'clouds', '.']
--- sentence tokens: ['it', 'rains', 'on', 'this', 'inconsequential', 'town', 'more', 'than', 'any', 'other', 'place', 'in', 'the', 'united', 'states', 'of', 'america', '.']
--- sentence tokens: ['it', 'was', 'from', 'this', 'town', 'and', 'its', 

## 1.2. Filtering stopwords (and punctuation)

**Stopwords** are words that should be filtered out at this step because they carry little information about the actual meaning of the document. They are usually very common words (articles, pronouns, prepositions). You can find lists of such **stopwords** online, or use the one embedded within the NLTK library.

### Using NLTK's stop list or your own

In [7]:
from nltk.corpus import stopwords

stopwords_ = set(stopwords.words('english'))

print("--- stopwords in english: {}".format(stopwords_))

--- stopwords in english: set([u'all', u'just', u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'o', u'hadn', u'herself', u'll', u'had', u'should', u'to', u'only', u'won', u'under', u'ours', u'has', u'do', u'them', u'his', u'very', u'they', u'not', u'during', u'now', u'him', u'nor', u'd', u'did', u'didn', u'this', u'she', u'each', u'further', u'where', u'few', u'because', u'doing', u'some', u'hasn', u'are', u'our', u'ourselves', u'out', u'what', u'for', u'while', u're', u'does', u'above', u'between', u'mustn', u't', u'be', u'we', u'who', u'were', u'here', u'shouldn', u'hers', u'by', u'on', u'about', u'couldn', u'of', u'against', u's', u'isn', u'or', u'own', u'into', u'yourself', u'down', u'mightn', u'wasn', u'your', u'from', u'her', u'their', u'aren', u'there', u'been', u'whom', u'too', u'wouldn', u'themselves', u'weren', u'was', u'until', u'more', u'himself', u'that', u'but', u'don', u'with', u'than', u'those', u'he', u'me', u'myself', u'ma', u'these', u'up'

In [8]:
# list found at http://www.textfixer.com/resources/common-english-words.txt
# 'not' has been removed (do you know why?)

stopwords_ = "a,able,about,across,after,all,almost,also,am,among,an,and,any,\
are,as,at,be,because,been,but,by,can,could,dear,did,do,does,either,\
else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,\
how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,\
me,might,most,must,my,neither,no,of,off,often,on,only,or,other,our,\
own,rather,said,say,says,she,should,since,so,some,than,that,the,their,\
them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,\
what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".split(',')

print("--- stopwords in english: {}".format(stopwords_))

--- stopwords in english: ['a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'but', 'by', 'can', 'could', 'dear', 'did', 'do', 'does', 'either', 'else', 'ever', 'every', 'for', 'from', 'get', 'got', 'had', 'has', 'have', 'he', 'her', 'hers', 'him', 'his', 'how', 'however', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'least', 'let', 'like', 'likely', 'may', 'me', 'might', 'most', 'must', 'my', 'neither', 'no', 'of', 'off', 'often', 'on', 'only', 'or', 'other', 'our', 'own', 'rather', 'said', 'say', 'says', 'she', 'should', 'since', 'so', 'some', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'this', 'tis', 'to', 'too', 'twas', 'us', 'wants', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'would', 'yet', 'you', 'your']


We also need to filter punctuation tokens: tokens made of punctuation marks. We can find a list of those punctuation marks in `string.punctuation`.

In [9]:
import string

punctuation_ = set(string.punctuation)
print("--- punctuation: {}".format(string.punctuation))

def filter_tokens(sent):
    # keep only tokens that are neither stopwords nor punctuation marks
    return [w for w in sent if w not in stopwords_ and w not in punctuation_]

tokens_filtered = list(map(filter_tokens, tokens_lower))

for sent in tokens_filtered:
    print("--- sentence tokens: {}".format(sent))

--- punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
--- sentence tokens: ['mother', 'drove', 'airport', 'windows', 'rolled', 'down']
--- sentence tokens: ['seventy-five', 'degrees', 'phoenix', 'sky', 'perfect', 'cloudless', 'blue']
--- sentence tokens: ['wearing', 'favorite', 'shirt', 'sleeveless', 'white', 'eyelet', 'lace', 'wearing', 'farewell', 'gesture']
--- sentence tokens: ['carry-on', 'item', 'parka']
--- sentence tokens: ['olympic', 'peninsula', 'northwest', 'washington', 'state', 'small', 'town', 'named', 'forks', 'exists', 'under', 'near-constant', 'cover', 'clouds']
--- sentence tokens: ['rains', 'inconsequential', 'town', 'more', 'place', 'united', 'states', 'america']
--- sentence tokens: ['town', 'gloomy', 'omnipresent', 'shade', 'mother', 'escaped', 'few', 'months', 'old']
--- sentence tokens: ['town', 'id', 'compelled', 'spend', 'month', 'summer', 'until', 'fourteen']
--- sentence tokens: ['year', 'finally', 'put', 'foot', 'down', 'past', 'three', 'summers', 'dad', 'charl

## 1.3. Stemming and lemmatization

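**Stemming** means reducing each word to a **stem**. That is, reducing each word in all its diversity to a root common to all its variants. **Lemmatization** pursues the same goal, but maps each word to its dictionary form (its lemma) instead of chopping off suffixes. Only stemming is demonstrated here.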

In [10]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmer_porter = PorterStemmer()
tokens_stemporter = [list(map(stemmer_porter.stem, sent)) for sent in tokens_filtered]
print("--- sentence tokens (porter): {}".format(tokens_stemporter[0]))

stemmer_snowball = SnowballStemmer('english')
tokens_stemsnowball = [list(map(stemmer_snowball.stem, sent)) for sent in tokens_filtered]
print("--- sentence tokens (snowball): {}".format(tokens_stemsnowball[0]))

--- sentence tokens (porter): ['mother', 'drove', 'airport', u'window', u'roll', 'down']
--- sentence tokens (snowball): [u'mother', u'drove', u'airport', u'window', u'roll', u'down']


## 1.4. N-Grams

<span style="color:red">To capture sequences of tokens</span>

In [11]:
from nltk.util import ngrams

list(ngrams(tokens_stemsnowball[0],2))

[(u'mother', u'drove'),
 (u'drove', u'airport'),
 (u'airport', u'window'),
 (u'window', u'roll'),
 (u'roll', u'down')]
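For intuition, the sliding window behind these bigrams can be sketched in plain Python. This is only an illustration of the idea, not NLTK's implementation: the helper name `simple_ngrams` is hypothetical, and the token list is copied from the stemmed first sentence above.

```python
def simple_ngrams(tokens, n):
    # zip n staggered copies of the token list to form sliding windows of size n
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens = ['mother', 'drove', 'airport', 'window', 'roll', 'down']

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of 6 tokens yields 5 bigrams and 4 trigrams: in general there are `len(tokens) - n + 1` n-grams.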

In [12]:
from nltk.util import ngrams

def join_sent_ngrams(input_tokens, n):
    # first add the 1-gram tokens
    ret_list = list(input_tokens)
    
    #then for each n
    for i in range(2,n+1):
        # add each n-grams to the list
        ret_list.extend(['-'.join(tgram) for tgram in ngrams(input_tokens, i)])
    
    return(ret_list)

tokens_ngrams = list(map(lambda x : join_sent_ngrams(x, 3), tokens_stemporter))

print("--- sentence tokens: {}".format(tokens_ngrams[0]))

--- sentence tokens: ['mother', 'drove', 'airport', u'window', u'roll', 'down', 'mother-drove', 'drove-airport', u'airport-window', u'window-roll', u'roll-down', 'mother-drove-airport', u'drove-airport-window', u'airport-window-roll', u'window-roll-down']


## 1.5. Part-of-Speech tagging

This is an alternative process that relies on machine learning to tag each word in a sentence with its grammatical function (noun, verb, adjective, etc.). Libraries such as NLTK embed tools to do that. The tags detected depend on the corpus used for training. In NLTK, the function `nltk.pos_tag()` uses the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag set.

### nltk.pos_tag

In [13]:
from nltk import pos_tag

sent_tags = list(map(pos_tag, tokens))

for sent in sent_tags:
    print("--- sentence tags: {}".format(sent))

--- sentence tags: [('My', 'PRP$'), ('mother', 'NN'), ('drove', 'VBD'), ('me', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('airport', 'NN'), ('with', 'IN'), ('the', 'DT'), ('windows', 'NNS'), ('rolled', 'VBD'), ('down', 'RB'), ('.', '.')]
--- sentence tags: [('It', 'PRP'), ('was', 'VBD'), ('seventy-five', 'JJ'), ('degrees', 'NNS'), ('in', 'IN'), ('Phoenix', 'NNP'), (',', ','), ('the', 'DT'), ('sky', 'NN'), ('a', 'DT'), ('perfect', 'JJ'), (',', ','), ('cloudless', 'JJ'), ('blue', 'NN'), ('.', '.')]
--- sentence tags: [('I', 'PRP'), ('was', 'VBD'), ('wearing', 'VBG'), ('my', 'PRP$'), ('favorite', 'JJ'), ('shirt', 'NN'), ('sleeveless', 'NN'), (',', ','), ('white', 'JJ'), ('eyelet', 'NN'), ('lace', 'NN'), (';', ':'), ('I', 'PRP'), ('was', 'VBD'), ('wearing', 'VBG'), ('it', 'PRP'), ('as', 'IN'), ('a', 'DT'), ('farewell', 'NN'), ('gesture', 'NN'), ('.', '.')]
--- sentence tags: [('My', 'PRP$'), ('carry-on', 'JJ'), ('item', 'NN'), ('was', 'VBD'), ('a', 'DT'), ('parka', 'NN'), ('.', '.')]
--- senten

Let's filter verbs !

In [14]:
for sent in sent_tags:
    tags_filtered = [t for t in sent if t[1].startswith('VB')]
    print("--- verbs:\n{}".format(tags_filtered))

--- verbs:
[('drove', 'VBD'), ('rolled', 'VBD')]
--- verbs:
[('was', 'VBD')]
--- verbs:
[('was', 'VBD'), ('wearing', 'VBG'), ('was', 'VBD'), ('wearing', 'VBG')]
--- verbs:
[('was', 'VBD')]
--- verbs:
[('named', 'VBN'), ('exists', 'VBZ')]
--- verbs:
[('rains', 'VBZ')]
--- verbs:
[('was', 'VBD'), ('escaped', 'VBD'), ('was', 'VBD')]
--- verbs:
[('was', 'VBD'), ('been', 'VBN'), ('compelled', 'VBN'), ('spend', 'VB'), ('was', 'VBD')]
--- verbs:
[('was', 'VBD'), ('put', 'VBD'), ('vacationed', 'VBD')]


In [15]:
from nltk import RegexpParser

grammar = r"""
  NPB: {<DT|PP\$>?<JJ|NN|,>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
  V2V: {<V.*> <TO> <V.*>}
"""

cp = RegexpParser(grammar)
result = cp.parse(sent_tags[1])

#print result

for sent in sent_tags:
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'NPB': print(subtree)
        if subtree.label() == 'V2V': print(subtree)

(NPB mother/NN)
(NPB the/DT airport/NN)
(NPB Phoenix/NNP)
(NPB the/DT sky/NN)
(NPB a/DT perfect/JJ ,/, cloudless/JJ blue/NN)
(NPB
  favorite/JJ
  shirt/NN
  sleeveless/NN
  ,/,
  white/JJ
  eyelet/NN
  lace/NN)
(NPB a/DT farewell/NN gesture/NN)
(NPB carry-on/JJ item/NN)
(NPB a/DT parka/NN)
(NPB Olympic/NNP Peninsula/NNP)
(NPB Washington/NNP State/NNP)
(NPB a/DT small/JJ town/NN)
(NPB Forks/NNP)
(NPB a/DT near-constant/JJ cover/NN)
(NPB this/DT inconsequential/JJ town/NN)
(NPB any/DT other/JJ place/NN)
(NPB United/NNP)
(NPB America/NNP)
(NPB this/DT town/NN)
(NPB gloomy/NN ,/, omnipresent/NN shade/NN)
(NPB mother/NN)
(NPB this/DT town/NN)
(NPB Id/NNP)
(V2V compelled/VBN to/TO spend/VB)
(NPB a/DT month/NN)
(NPB every/DT summer/NN)
(NPB the/DT year/NN)
(NPB foot/NN)
(NPB dad/NN)
(NPB Charlie/NNP)
(NPB California/NNP)


# Text Featurization part 2 : Indexing Bag-of-Words into a vector table

This Walkthrough will lead us from bag-of-words representations of documents to **vector signatures** (indexes) using the **TF-IDF** formula.

The ultimate goal of **indexing** is to create a **vector representation** (signature) for each document. This vector representation will be used to:

- mine the features that can characterize classes of documents (supervised learning using **labels**)
- mine the documents that have similar features to establish trends (unsupervised learning).

To do that, we need:
- a fixed number of features
- a quantitative value for each feature.

The number of features is given by the vocabulary over the corpus: the set of all possible words (tokens) found in all documents.

The quantitative value is given, for each doc, by counting the occurrences of each of these words in the doc and by using a TF-IDF formula.
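As a minimal sketch of these two requirements, the snippet below builds a vocabulary from a toy corpus and turns each bag of words into a fixed-length count vector. The two bags and the helper name `count_vector` are made up for illustration; the TF-IDF weighting is left out here.

```python
from collections import Counter

# toy bags of words (hypothetical, not taken from the dataset)
bows = [
    ['pop', 'filter', 'price', 'filter'],
    ['pop', 'screen', 'clamp'],
]

# fixed number of features: one per distinct token found in the corpus
vocabulary = sorted({word for bow in bows for word in bow})

def count_vector(bow, vocabulary):
    # quantitative value per feature: here, the raw count of each term
    counts = Counter(bow)
    return [counts[term] for term in vocabulary]

vectors = [count_vector(bow, vocabulary) for bow in bows]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every document ends up with a vector of the same length, whatever its number of tokens, which is what downstream learning algorithms expect.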

<img src="img/pipeline-walkthrough2.png" width="70%"/>

## 0. Loading some input data from the Amazon Reviews

To try this indexing walkthrough, we will get 5 reviews from the Amazon Reviews dataset. We will apply a function for extracting bag-of-words representations from these 5 documents.

In [16]:
import os               # for environ variables in Part 3
from nlp_pipeline import extract_bow_from_raw_text
import json

docs = []
with open('./reviews.json', 'r') as data_file:    
    for line in data_file:
        docs.append(json.loads(line))

# extracting bows
bows = list(map(lambda row: extract_bow_from_raw_text(row['reviewText']), docs))

# displaying bows
for i in range(len(docs)):
    print("\n--- review: {}".format(docs[i]['reviewText']))
    print("--- bow: {}".format(bows[i]))


--- review: Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,
--- bow: [u'much', u'filter', u'pop', u'record', u'more', u'crisp', u'lowest', u'price', u'filter', u'amazon', u'same', u'price']

--- review: The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]
--- bow: [u'product', u'doubl'

# 1. Indexing Bag of Words into a Vector Matrix using Term Frequency / Inverse Document Frequency

## 1.1 Term Frequency

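The term frequency is based on the number of times a term occurs in a specific document (the code below divides this count by the length of the bag, which gives a relative frequency):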

$tf(term,document) = \# \ of \ times \ a \ term \ appears \ in \ a \ document$

In [17]:
from collections import Counter

# term occurrence = counting distinct words in each bag
term_occ = [Counter(bow) for bow in bows]

# term frequency = occurrences over length of bag
term_freq = list()
for i in range(len(docs)):
    term_freq.append( {k: (v / float(len(bows[i])))
                       for k, v in term_occ[i].items()} )

# displaying occurrences
for i in range(len(docs)):
    print("\n--- review: {}".format(docs[i]['reviewText']))
    print("--- bow: {}".format(bows[i]))
    print("--- term_occ: {}".format(term_occ[i]))
    print("--- term_freq: {}".format(term_freq[i]))


--- review: Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,
--- bow: [u'much', u'filter', u'pop', u'record', u'more', u'crisp', u'lowest', u'price', u'filter', u'amazon', u'same', u'price']
--- term_occ: Counter({u'price': 2, u'filter': 2, u'lowest': 1, u'crisp': 1, u'pop': 1, u'record': 1, u'amazon': 1, u'much': 1, u'same': 1, u'more': 1})
--- term_freq: {u'lowest': 0.08333333333333333, u'crisp': 0.08333333333333333, u'price': 0.16666666666666666, u'pop': 0.08333333333333333, u'filter': 0.16666666666666666, u'record': 0.08333333333333333, u'amazon': 0.08333333333333333, u'much': 0.08333333333333333, u'same': 0.08333333333333333, u'more': 0.08333333333333333}

--- review: The product does exactly as it should and is quite affordable.I did not realized it was double sc

## 1.2. Obtaining document frequencies

$df(term,corpus) = \frac{ \# \ of \ documents \ that \ contain \ a \ term}{ \# \ of \ documents \ in \ the \ corpus}$


In [18]:
# document occurrence = number of documents having this word

doc_occ = Counter( [word for bow in bows for word in set(bow)] )

# document frequency = occurrences over length of corpus
doc_freq = {k: (v / float(len(docs)))
            for k, v in doc_occ.items()}

# displaying vocabulary
print("\n--- full vocabulary: {}".format(doc_occ))
print("\n--- doc freq: {}".format(doc_freq))


--- full vocabulary: Counter({u'pop': 5, u'filter': 4, u'clamp': 2, u'doubl': 2, u'screen': 2, u'prevent': 1, u'old': 1, u'ad': 1, u'color': 1, u'abl': 1, u'sag': 1, u'mine': 1, u'one': 1, u'high': 1, u'littl': 1, u'thing': 1, u'reduct': 1, u'breath': 1, u'grape': 1, u'hint': 1, u'sake': 1, u'windscreen': 1, u'mic': 1, u'devic': 1, u'primari': 1, u'same': 1, u'nose': 1, u'better': 1, u'attach': 1, u'enough': 1, u'much': 1, u'mxl': 1, u'care': 1, u'smell': 1, u'more': 1, u'lowest': 1, u'gooseneck': 1, u'product': 1, u'bonus': 1, u'great': 1, u'price': 1, u'mike': 1, u'voic': 1, u'notic': 1, u'posit': 1, u'cloth': 1, u'amazon': 1, u'volum': 1, u'studio': 1, u'aroma': 1, ']': 1, u'neck': 1, u'sound': 1, u'protect': 1, u'perform': 1, u'crisp': 1, u'coax': 1, u'frequenc': 1, u'dif': 1, u'secur': 1, u'metal': 1, u'job': 1, u'vocal': 1, u'record': 1, u'expens': 1, u'small': 1, u'goos': 1, u'nice': 1, u'reminisc': 1, u'mount': 1, u'candi': 1})

--- doc freq: {u'clamp': 0.4, u'prevent': 0.2, u

## 1.3 Creating the vocabulary for indexing

In [19]:
# the minimum document frequency (in proportion of the length of the corpus)
min_df = 0.3

# filtering items to obtain the vocabulary
vocabulary = [ k for k,v in doc_freq.items() if v >= min_df ]

# print vocabulary
print ("-- vocabulary (len={}): {}".format(len(vocabulary),vocabulary))

-- vocabulary (len=5): [u'clamp', u'pop', u'doubl', u'screen', u'filter']


## 1.4 the TFIDF vector

Words might show up a lot in individual documents, but their relevance is lower if they're in every document! We need to take into account words that show up everywhere and reduce their relative importance. The document frequency measures how widespread a term is across the corpus:

$df(term,corpus) = \frac{ \# \ of \ documents \ that \ contain \ a \ term}{ \# \ of \ documents \ in \ the \ corpus}$

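The inverse document frequency is defined in terms of the document frequency as

$idf(term,corpus) = \log{\frac{1}{df(term,corpus)}}$

With this definition, a term that appears in every document gets an idf of 0. The code below uses a smoothed variant, $\log{(1 + \frac{1}{df(term,corpus)})}$, so that such terms keep a small positive weight.

TF-IDF is the product of two parts: the term frequency tf and the inverse document frequency idf. The term frequency is the (length-normalized) count taken from the term frequency vector.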

tf-idf $ = tf(term,document) * idf(term,corpus)$
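As a worked check of this product, the sketch below recomputes two tf-idf values for the first review by hand. The counts come from the outputs shown earlier ('filter' appears 2 times and 'pop' once in a bag of 12 tokens; 'filter' occurs in 4 of the 5 documents and 'pop' in all 5). The helper name `smoothed_tfidf` is hypothetical, and it assumes the smoothed idf $\log(1 + 1/df)$ used in the walkthrough code rather than the plain $\log(1/df)$.

```python
import math

def smoothed_tfidf(count, bow_len, doc_count, n_docs):
    tf = count / float(bow_len)        # relative term frequency in the document
    df = doc_count / float(n_docs)     # share of documents containing the term
    idf = math.log(1 + 1 / df)         # smoothed inverse document frequency
    return tf * idf

tfidf_filter = smoothed_tfidf(count=2, bow_len=12, doc_count=4, n_docs=5)
tfidf_pop = smoothed_tfidf(count=1, bow_len=12, doc_count=5, n_docs=5)

# plain idf for a term present in every document: log(1/1) = 0
plain_idf_pop = math.log(1 / (5 / 5.0))

print("filter: {:.8f}".format(tfidf_filter))   # filter: 0.13515504
print("pop:    {:.8f}".format(tfidf_pop))      # pop:    0.05776227
print("plain idf of 'pop': {}".format(plain_idf_pop))
```

With the plain formula, 'pop' would get a weight of exactly 0 because it occurs in every review. The smoothed variant keeps it in the vector with a small weight.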

In [20]:
import numpy as np

# create a dense matrix of vectors for each document
# each vector has the length of the vocabulary
vectors = np.zeros((len(docs),len(vocabulary)))

# fill these vectors with tf-idf values
for i in range(len(docs)):
    for j in range(len(vocabulary)):
        term     = vocabulary[j]
        term_tf  = term_freq[i].get(term, 0.0)   # 0.0 if term not found in doc
        term_idf = np.log(1 + 1 / doc_freq[term]) # smooth formula
        vectors[i,j] = term_tf * term_idf

# displaying results
for i in range(len(docs)):
    print("\n--- review: {}".format(docs[i]['reviewText']))
    print("--- bow: {}".format(bows[i]))
    print("--- tfidf vector: {}".format( vectors[i] ) )
    print("--- tfidf sorted: {}".format( 
            sorted( zip(vocabulary,vectors[i]), key=lambda x:-x[1] )
         ))


--- review: Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,
--- bow: [u'much', u'filter', u'pop', u'record', u'more', u'crisp', u'lowest', u'price', u'filter', u'amazon', u'same', u'price']
--- tfidf vector: [ 0.          0.05776227  0.          0.          0.13515504]
--- tfidf sorted: [(u'filter', 0.13515503603605478), (u'pop', 0.057762265046662105), (u'clamp', 0.0), (u'doubl', 0.0), (u'screen', 0.0)]

--- review: The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and s

## 1.5 Sklearn pipeline

In [21]:
corpus = [row['reviewText'] for row in docs]

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

tf = CountVectorizer()

document_tf_matrix = tf.fit_transform(corpus).todense()

print(sorted(tf.vocabulary_))
print(document_tf_matrix)

[u'able', u'about', u'added', u'affordable', u'after', u'allowing', u'amazon', u'an', u'and', u'are', u'aroma', u'arrived', u'as', u'attached', u'attaches', u'avoid', u'better', u'block', u'blocks', u'bonus', u'breath', u'but', u'buy', u'candy', u'cannot', u'careful', u'carries', u'clamp', u'cloth', u'coaxing', u'coloration', u'come', u'crisp', u'despite', u'device', u'did', u'dif', u'does', u'double', u'eliminate', u'enough', u'even', u'exactly', u'expected', u'expensive', u'filter', u'filters', u'for', u'frequencies', u'gets', u'goose', u'gooseneck', u'grape', u'great', u'had', u'here', u'high', u'hint', u'hold', u'honestly', u'if', u'in', u'is', u'it', u'job', u'just', u'keep', u'lets', u'like', u'little', u'looks', u'lowest', u'marginally', u'may', u'metal', u'mic', u'might', u'mike', u'mine', u'more', u'mount', u'much', u'mxl', u'my', u'neck', u'needed', u'needs', u'next', u'nice', u'no', u'nose', u'not', u'noticeable', u'now', u'of', u'old', u'on', u'one', u'ones', u'only', u'or'

In [23]:
from math import log

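def idf(frequency_matrix):
    # number of documents in which each term appears (column-wise count)
    doc_counts = (frequency_matrix > 0).sum(axis=0)
    # ratio N / (number of documents containing the term), i.e. 1 / df
    inv_df = float(len(frequency_matrix)) / doc_counts
    return [log(i) for i in inv_df.getA()[0]]

print(sorted(tf.vocabulary_))
print(idf(document_tf_matrix))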

[u'able', u'about', u'added', u'affordable', u'after', u'allowing', u'amazon', u'an', u'and', u'are', u'aroma', u'arrived', u'as', u'attached', u'attaches', u'avoid', u'better', u'block', u'blocks', u'bonus', u'breath', u'but', u'buy', u'candy', u'cannot', u'careful', u'carries', u'clamp', u'cloth', u'coaxing', u'coloration', u'come', u'crisp', u'despite', u'device', u'did', u'dif', u'does', u'double', u'eliminate', u'enough', u'even', u'exactly', u'expected', u'expensive', u'filter', u'filters', u'for', u'frequencies', u'gets', u'goose', u'gooseneck', u'grape', u'great', u'had', u'here', u'high', u'hint', u'hold', u'honestly', u'if', u'in', u'is', u'it', u'job', u'just', u'keep', u'lets', u'like', u'little', u'looks', u'lowest', u'marginally', u'may', u'metal', u'mic', u'might', u'mike', u'mine', u'more', u'mount', u'much', u'mxl', u'my', u'neck', u'needed', u'needs', u'next', u'nice', u'no', u'nose', u'not', u'noticeable', u'now', u'of', u'old', u'on', u'one', u'ones', u'only', u'or'

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
document_tfidf_matrix = tfidf.fit_transform(corpus)
print(sorted(tfidf.vocabulary_))
print(document_tfidf_matrix.todense())

[u'able', u'about', u'added', u'affordable', u'after', u'allowing', u'amazon', u'an', u'and', u'are', u'aroma', u'arrived', u'as', u'attached', u'attaches', u'avoid', u'better', u'block', u'blocks', u'bonus', u'breath', u'but', u'buy', u'candy', u'cannot', u'careful', u'carries', u'clamp', u'cloth', u'coaxing', u'coloration', u'come', u'crisp', u'despite', u'device', u'did', u'dif', u'does', u'double', u'eliminate', u'enough', u'even', u'exactly', u'expected', u'expensive', u'filter', u'filters', u'for', u'frequencies', u'gets', u'goose', u'gooseneck', u'grape', u'great', u'had', u'here', u'high', u'hint', u'hold', u'honestly', u'if', u'in', u'is', u'it', u'job', u'just', u'keep', u'lets', u'like', u'little', u'looks', u'lowest', u'marginally', u'may', u'metal', u'mic', u'might', u'mike', u'mine', u'more', u'mount', u'much', u'mxl', u'my', u'neck', u'needed', u'needs', u'next', u'nice', u'no', u'nose', u'not', u'noticeable', u'now', u'of', u'old', u'on', u'one', u'ones', u'only', u'or'
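The values produced by `TfidfVectorizer` will generally not match the hand-computed vectors from section 1.4. The sketch below is a plain-Python approximation of what the vectorizer does with its default settings, to the best of my understanding of the scikit-learn documentation: raw counts as tf, a smoothed idf of $\ln\frac{1+n}{1+df} + 1$ (where $df$ is the number of documents containing the term), and L2 normalization of each row. The function name `sklearn_style_tfidf` and the toy count matrix are assumptions for illustration only.

```python
import math

def sklearn_style_tfidf(count_rows):
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # number of documents containing each term
    doc_counts = [sum(1 for row in count_rows if row[j] > 0)
                  for j in range(n_terms)]
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1.0 + n_docs) / (1.0 + dc)) + 1.0 for dc in doc_counts]
    rows = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # L2-normalize each document vector (all-zero rows are left at zero)
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return idf, rows

# toy term-count matrix: 2 documents, 2 terms (hypothetical)
counts = [[1, 1],
          [1, 0]]
idf, tfidf_rows = sklearn_style_tfidf(counts)

print(idf)
print(tfidf_rows)
```

Because every row has unit length, a dot product between two such rows is directly their cosine similarity.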

# Part 3 : Comparing two documents / Similarity Measures

## 3.1 Euclidean distance

We could try the Euclidean distance $||\vec{x}-\vec{y}||$.  
What problems would we encounter with this?

The Euclidean distance goes up with the length of a document. Intuitively, duplicating each word in our bag of words generates a vector that points in exactly the same direction, yet its Euclidean distance to the original vector is no longer zero. One solution is to normalize vectors before calculating the Euclidean distance. After normalization, increasing the length of a document does not change the Euclidean distance unless the direction of the term frequency vector changes.
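The effect can be checked on a tiny example. The sketch below uses a made-up count vector and hypothetical helper names (`euclidean`, `normalize`): it doubles every count, which mimics duplicating each word of the document, and compares the distances before and after normalization.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def normalize(x):
    # scale the vector to unit length (assumes a non-zero vector)
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

doc = [1.0, 2.0, 0.0]              # toy term counts
doubled = [2 * a for a in doc]     # same document with every word duplicated

raw_distance = euclidean(doc, doubled)
normalized_distance = euclidean(normalize(doc), normalize(doubled))

print("raw distance:        {:.4f}".format(raw_distance))         # 2.2361
print("normalized distance: {:.4f}".format(normalized_distance))  # 0.0000
```

The two vectors describe the same content, but only the normalized comparison reports them as identical.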

## 3.2 Cosine Similarity
Recall that for two vectors $\vec{x}$ and $\vec{y}$, $\vec{x} \cdot \vec{y} = ||\vec{x}|| \ ||\vec{y}|| \cos{\theta}$. And so,

$\frac{\vec{x} \cdot \vec{y} }{||\vec{x}|| ||\vec{y}||} = \cos{\theta}$

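θ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative. Therefore cos θ ranges from 0 to 1. Documents that are exactly identical will have cos θ = 1, and documents that share no vocabulary term will have cos θ = 0. Because it depends only on the direction of the vectors, the cosine similarity is not affected by the length of the documents.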
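A minimal sketch of this measure in plain Python follows. The function name `cosine_similarity` and the toy vectors are illustrative, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero, not something the formula defines.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # assumption: treat an empty document as similar to nothing
        return 0.0
    return dot / (norm_x * norm_y)

doc = [1.0, 2.0, 0.0]
doubled = [2.0, 4.0, 0.0]      # same direction, twice the length
unrelated = [0.0, 0.0, 3.0]    # no term in common with doc

same_direction = cosine_similarity(doc, doubled)
no_overlap = cosine_similarity(doc, unrelated)

print("same direction: {:.4f}".format(same_direction))  # 1.0000
print("no overlap:     {:.4f}".format(no_overlap))      # 0.0000
```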
