<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/3-math-with-words/1_bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Math with words (TF-IDF vectors)

Having collected and counted words (tokens), and bucketed them into stems or lemmas, it’s time to do something interesting with them. Detecting words is useful for simple tasks, like getting statistics about word usage or doing keyword search. 

But you’d like to know which words are more important to a particular document and across the corpus as a whole. Then you can use that “importance” value to find relevant documents in a corpus based on keyword importance within each document.

With representations of words in a continuous space, you can operate on those representations with more exciting math. Your goal is to find numerical representations of words that capture the importance or information content of the words they represent.

There are three increasingly powerful ways to represent words and
their importance in a document:

1. **Bags of words**—Vectors of word counts or frequencies
2. **Bags of n-grams**—Counts of word pairs (bigrams), triplets (trigrams), and so on
3. **TF-IDF vectors**—Word scores that better represent their importance

**These are all statistical models in that they are frequency based.**
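As a minimal sketch of the first two representations, the snippet below counts single tokens and adjacent token pairs. The toy sentence and the plain whitespace split are assumptions made for illustration; TF-IDF scores are left out here.

```python
from collections import Counter

# Toy sentence, split on whitespace purely for illustration.
tokens = "the cat sat on the mat the cat slept".split()

# 1. Bag of words: how often each token occurs.
bag_of_words = Counter(tokens)

# 2. Bag of bigrams: how often each adjacent pair of tokens occurs.
bag_of_bigrams = Counter(zip(tokens, tokens[1:]))

print(bag_of_words.most_common(2))
print(bag_of_bigrams.most_common(1))
```

Both objects are frequency tables; the only difference is whether the unit being counted is a single token or a pair of neighbors.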

But these “shallow” NLP machines are powerful and useful for many practical
applications such as spam filtering and sentiment analysis.

## Setup

In [3]:
import pandas as pd
pd.set_option('display.width', 75)

from nltk.tokenize import TreebankWordTokenizer

from collections import Counter

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [4]:
!git clone https://github.com/totalgood/nlpia

!mkdir test
!cp -r nlpia/src/ test/
!rm -rf nlpia
!cp -r test/src/nlpia/ .
!rm -rf test

Cloning into 'nlpia'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 6295 (delta 54), reused 79 (delta 31), pack-reused 6176
Receiving objects: 100% (6295/6295), 124.11 MiB | 13.02 MiB/s, done.
Resolving deltas: 100% (4001/4001), done.


## Bag of Words

As a first approximation, you assume that the more times a word occurs, the more meaning it must contribute to that document. 

A document that refers to “wings” and “rudder” frequently may be more relevant to a problem involving jet airplanes or air travel than, say, a document that refers frequently to “cats” and “gravity.” Or if you have classified some words as expressing positive emotions (words like “good,” “best,” “joy,” and “fantastic”), then the more of those words a document contains, the more likely it is to have positive “sentiment.” You can imagine, though, how an algorithm that relied on these simple rules might be mistaken or led astray.
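To see how such a rule can be led astray, here is a rough sketch of a positive-word counter. The word list, the helper name `naive_positive_score`, and the two review sentences are all made up for illustration.

```python
# Hypothetical hand-picked list of "positive" words, for illustration only.
positive_words = {"good", "best", "joy", "fantastic"}


def naive_positive_score(text):
    """Count how many tokens in the text appear in the positive word list."""
    # Crude tokenization: lowercase, drop commas and periods, split on spaces.
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(1 for token in tokens if token in positive_words)


praise = "The food was good, the service fantastic."
complaint = "The food was not good, and far from the best."

print(naive_positive_score(praise))
print(naive_positive_score(complaint))
```

Both sentences receive the same score, because counting words ignores the negation in the second one.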


Let’s look at an example where counting occurrences of words is useful:

In [9]:
sentence = "The faster Harry got to the store, the faster Harry, the faster, would get home."

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())
tokens

['the',
 'faster',
 'harry',
 'got',
 'to',
 'the',
 'store',
 ',',
 'the',
 'faster',
 'harry',
 ',',
 'the',
 'faster',
 ',',
 'would',
 'get',
 'home',
 '.']

With your simple list, you want to get the unique words from the document and their counts. A Python dictionary serves this purpose nicely, and because you want to count the words as well, you can use `Counter` from the `collections` module:

In [10]:
bag_of_words = Counter(tokens)
bag_of_words

Counter({',': 3,
         '.': 1,
         'faster': 3,
         'get': 1,
         'got': 1,
         'harry': 2,
         'home': 1,
         'store': 1,
         'the': 4,
         'to': 1,
         'would': 1})
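For comparison, the same counts can be built by hand with a plain dictionary. This is only a sketch on a shortened token list (an assumption, to keep it brief); `Counter` does the same bookkeeping for you and also returns 0 for tokens it has never seen.

```python
from collections import Counter

# Shortened token list, standing in for the tokenizer output.
tokens = ['the', 'faster', 'harry', 'got', 'to', 'the', 'store']

# Manual counting: start each unseen token at 0, then add 1.
counts = {}
for token in tokens:
    counts[token] = counts.get(token, 0) + 1

print(counts)
print(counts == dict(Counter(tokens)))  # same mapping either way
print(Counter(tokens)['kite'])          # missing keys count as 0
```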

So let’s get these words ranked—sorted in some order that’s easier to think about. The Counter object has a handy method, most_common, for just this purpose:

In [11]:
bag_of_words.most_common(4)

[('the', 4), ('faster', 3), (',', 3), ('harry', 2)]

Specifically, the number of times a word occurs in a given document is called the term frequency, commonly abbreviated TF.

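Let’s calculate the term frequency of “harry” from the Counter object. Note that this cell normalizes the raw count by the number of unique tokens; dividing by the total number of tokens in the document is the more common convention.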

In [12]:
times_harry_appears = bag_of_words['harry']

# The number of unique tokens from your original source
num_unique_words = len(bag_of_words)

tf = times_harry_appears / num_unique_words
round(tf, 4)

0.1818
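The cell above divides by the number of unique tokens (11). A commonly used alternative divides by the total number of tokens in the document instead. The sketch below repeats the calculation that way, with the token list copied from the tokenizer output shown earlier; the names `num_tokens` and `tf_harry` are introduced here for illustration.

```python
from collections import Counter

# Tokens of the example sentence, copied from the tokenizer output.
tokens = ['the', 'faster', 'harry', 'got', 'to', 'the', 'store', ',',
          'the', 'faster', 'harry', ',', 'the', 'faster', ',', 'would',
          'get', 'home', '.']
bag_of_words = Counter(tokens)

# Normalize by document length (all tokens, not just unique ones).
num_tokens = len(tokens)
tf_harry = bag_of_words['harry'] / num_tokens

print(num_tokens)           # 19
print(round(tf_harry, 4))   # 0.1053
```

Normalizing by document length makes term frequencies comparable between a short document and a long one.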

This is clearly a contrived example, but you can quickly see how meaningful results could come from this approach. Let’s look at a bigger piece of text.

In [14]:
kite_text = """A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. 
A kite consists of wings, tethers, and anchors. Kites often have a bridle to guide the face of the kite at the correct angle so the wind can
 lift it. A kite's wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at 
 a single point. A kite may have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets;
even in technical kiting, though, a wing in the system is still often called the kite. The lift that sustains the kite in flight is generated
when air flows around the kite's surface, producing low pressure above and high pressure below the wings. The interaction with the wind also
generates horizontal drag along the direction of the wind. The resultant force vector from the lift and drag force components is opposed by
the tension of one or more of the lines or tethers to which the kite is attached. The anchor point of the kite line may be static or moving
(e.g., the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle). 
The same principles of fluid flow apply in liquids and kites are also used under water. A hybrid tethered craft comprising both a 
lighter-than-air balloon as well as a kite lifting surface is called a kytoon. Kites have a long and varied history and many different types 
are flown individually and at festivals worldwide. Kites may be flown for recreation, art or other practical uses. Sport kites can be flown
in aerial ballet, sometimes as part of a competition. Power kites are multi-line steerable kites designed to generate large forces which can
be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. 
Even Man-lifting kites have been made."""

In [15]:
tokens = tokenizer.tokenize(kite_text.lower())
# Count occurrences of each token in the kite text
token_counts = Counter(tokens)
print(token_counts)

Counter({'the': 26, 'a': 20, 'kite': 16, ',': 15, 'and': 10, 'of': 10, 'kites': 8, 'is': 7, 'in': 7, 'or': 6, 'wing': 5, 'to': 5, 'be': 5, 'as': 5, 'lift': 4, 'have': 4, 'may': 4, 'at': 3, 'so': 3, 'can': 3, 'also': 3, 'kiting': 3, 'are': 3, 'flown': 3, 'tethered': 2, 'craft': 2, 'with': 2, 'that': 2, 'air': 2, 'consists': 2, 'tethers': 2, 'anchors.': 2, 'often': 2, 'bridle': 2, 'wind': 2, "'s": 2, 'designed': 2, ';': 2, 'when': 2, 'for': 2, 'moving': 2, 'technical': 2, 'even': 2, 'called': 2, 'surface': 2, 'pressure': 2, 'drag': 2, 'force': 2, 'by': 2, 'which': 2, '.': 2, 'used': 2, 'power': 2, 'traditionally': 1, 'heavier-than-air': 1, 'surfaces': 1, 'react': 1, 'against': 1, 'create': 1, 'drag.': 1, 'wings': 1, 'guide': 1, 'face': 1, 'correct': 1, 'angle': 1, 'it.': 1, 'not': 1, 'needed': 1, 'sailplane': 1, 'launch': 1, 'tether': 1, 'meets': 1, 'single': 1, 'point.': 1, 'fixed': 1, 'untraditionally': 1, 'tether-set-coupled': 1, 'sets': 1, 'though': 1, 'system': 1, 'still': 1, 'kite.

That is a lot of stop words. It’s unlikely that this Wikipedia article is about the articles “the” and “a,” the conjunction “and,” or the other stop words. So let’s ditch them for now:

In [16]:
import nltk
nltk.download('stopwords', quiet=True)

True

In [17]:
# A set gives constant-time membership tests while filtering
stopwords = set(nltk.corpus.stopwords.words('english'))
tokens = [x for x in tokens if x not in stopwords]
kite_count = Counter(tokens)
print(kite_count)

Counter({'kite': 16, ',': 15, 'kites': 8, 'wing': 5, 'lift': 4, 'may': 4, 'also': 3, 'kiting': 3, 'flown': 3, 'tethered': 2, 'craft': 2, 'air': 2, 'consists': 2, 'tethers': 2, 'anchors.': 2, 'often': 2, 'bridle': 2, 'wind': 2, "'s": 2, 'designed': 2, ';': 2, 'moving': 2, 'technical': 2, 'even': 2, 'called': 2, 'surface': 2, 'pressure': 2, 'drag': 2, 'force': 2, '.': 2, 'used': 2, 'power': 2, 'traditionally': 1, 'heavier-than-air': 1, 'surfaces': 1, 'react': 1, 'create': 1, 'drag.': 1, 'wings': 1, 'guide': 1, 'face': 1, 'correct': 1, 'angle': 1, 'it.': 1, 'needed': 1, 'sailplane': 1, 'launch': 1, 'tether': 1, 'meets': 1, 'single': 1, 'point.': 1, 'fixed': 1, 'untraditionally': 1, 'tether-set-coupled': 1, 'sets': 1, 'though': 1, 'system': 1, 'still': 1, 'kite.': 1, 'sustains': 1, 'flight': 1, 'generated': 1, 'flows': 1, 'around': 1, 'producing': 1, 'low': 1, 'high': 1, 'wings.': 1, 'interaction': 1, 'generates': 1, 'horizontal': 1, 'along': 1, 'direction': 1, 'wind.': 1, 'resultant': 1, 

By looking purely at the number of times words occur in this document, you’re learning something about it. The terms kite(s), wing, and lift are all important.
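The counts above still contain punctuation tokens such as `','` and `';'`, as well as tokens like `'kite.'` with a trailing period, since the Treebank tokenizer expects sentence-segmented input and only splits off the final period of the text. The following is a rough sketch of one way to clean that up, using a short hand-written token list as a stand-in for the real one. Stripping every trailing period is a simplification that would also alter abbreviations such as "e.g.".

```python
from collections import Counter
import string

# Hand-written stand-in for the filtered token list (an assumption).
tokens = ['kite', ',', 'kites', 'wing', ',', 'lift', 'kite.', '.',
          "'s", ';', 'wing']

punctuation = set(string.punctuation)

# Strip trailing periods so that 'kite.' is counted together with 'kite'.
stripped = [t.rstrip('.') for t in tokens]

# Drop empty strings and tokens made up entirely of punctuation.
word_tokens = [t for t in stripped
               if t and not all(ch in punctuation for ch in t)]

word_counts = Counter(word_tokens)
print(word_counts.most_common(3))
```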

Across multiple documents in a corpus, things get a little more interesting. A set of documents may all be about, say, kite flying. You would expect all of these documents to refer to string and wind quite often, so the term frequencies TF("string") and TF("wind") would rank highly in every one of them.
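A small sketch of that situation follows, with three made-up, pre-tokenized documents. The documents, the helper name `term_frequency`, and the choice to normalize by document length are assumptions for illustration.

```python
from collections import Counter

# Three invented documents about kite flying, split on whitespace.
docs = {
    "doc_a": "the wind pulled the string and the kite rose".split(),
    "doc_b": "hold the string tight when the wind gusts".split(),
    "doc_c": "no wind no flight so check the string and the wind".split(),
}


def term_frequency(term, tokens):
    """Raw count of term divided by the number of tokens in the document."""
    return Counter(tokens)[term] / len(tokens)


for name, tokens in docs.items():
    tf_string = term_frequency("string", tokens)
    tf_wind = term_frequency("wind", tokens)
    print(name, round(tf_string, 3), round(tf_wind, 3))
```

Both terms have a nonzero frequency in every document, so on their own they do little to tell one kite document apart from another.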