<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-in-action/blob/3-build-vocabulary-using-word-tokenization/1_building_vocabulary_with_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building vocabulary with a tokenizer

In NLP, tokenization is a particular kind of document segmentation. Segmentation breaks up text into smaller chunks, or segments, each with more focused information content. It can include breaking a document into paragraphs, paragraphs into sentences, sentences into phrases, or phrases into tokens (usually words) and punctuation. Here we focus on the last of these, segmenting text into tokens, which is called tokenization.

For the fundamental building blocks of NLP, there are equivalents in a computer language compiler:

- tokenizer—scanner, lexer, lexical analyzer
- vocabulary—lexicon
- parser—compiler
- token, term, word, or n-gram—token, symbol, or terminal symbol
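
To make the compiler analogy concrete, here is a minimal sketch that runs Python's own standard-library lexer, the `tokenize` module, over a line of Python source. The `source` string is an illustrative assumption, not part of the original example.

```python
import io
import tokenize

# Illustrative one-line program; any valid Python source would do.
source = "x = 1 + 2\n"

# generate_tokens() expects a readline callable and yields TokenInfo tuples.
# Keep only names, operators, and numbers; skip NEWLINE and ENDMARKER tokens.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER)
]
print(tokens)
```

A programming-language lexer can rely on a strict grammar, which is why it splits `1 + 2` cleanly. Natural language offers no such guarantees, so an NLP tokenizer has to make more judgment calls.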

Tokenization is the first step in an NLP pipeline, so it can have a big impact on the rest of your pipeline. A tokenizer breaks unstructured data (natural language text) into chunks of information that can be counted as discrete elements. These counts of token occurrences in a document can be used directly as a vector representing that document. This immediately turns an unstructured string (a text document) into a numerical data structure suitable for machine learning.

These counts can be used directly by a computer to trigger useful actions and responses. Or they might also be used in a machine learning pipeline as features that trigger more complex decisions or behavior. The most common use for bag-of-words vectors created this way is for document retrieval, or search.
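
As a rough sketch of the idea, token counts can be collected with `collections.Counter` and laid out in a fixed column order to form a bag-of-words vector. The `document` string below is a made-up example, chosen so that one token repeats.

```python
from collections import Counter

# Hypothetical document, chosen so that "the" occurs twice.
document = "the cat sat on the mat"

# Whitespace tokenization, then count how often each token occurs.
token_counts = Counter(document.split())

# Fix a column order (the sorted vocabulary) so the counts form a vector.
vocab = sorted(token_counts)
bow_vector = [token_counts[token] for token in vocab]

print(vocab)       # column labels
print(bow_vector)  # one count per vocabulary term
```

The vector records how often each word occurs but not where, which is usually enough for retrieval-style tasks such as search.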

The simplest way to tokenize a sentence is to use whitespace within a string as the “delimiter” of words.

In [1]:
sentence = "Thomas Jefferson began building Monticello at the age of 26."
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [2]:
str.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

With a bit more Python, you can create a numerical vector representation for each word. These vectors are called one-hot vectors, and soon you’ll see why. A sequence of these one-hot vectors fully captures the original document text in a sequence of vectors, a table of numbers. That will solve the first problem of NLP, turning words into numbers:

In [3]:
import numpy as np

In [4]:
# str.split() is your quick-and-dirty tokenizer.
token_sequence = str.split(sentence)
# Your vocabulary lists all the unique tokens (words) that you want to keep track of.
vocab = sorted(set(token_sequence))
# Sorted lexicographically (lexically), so numbers come before letters, and capital letters come before lowercase letters.
", ".join(vocab)

'26., Jefferson, Monticello, Thomas, age, at, began, building, of, the'
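
The ordering of the vocabulary follows from how Python compares strings: character by character, by Unicode code point. A small sketch with a few illustrative tokens (not the full vocabulary) shows why digits sort before capitals, and capitals before lowercase letters:

```python
# Illustrative tokens only; any mix of digits, capitals, and lowercase works.
tokens = ["the", "Thomas", "26.", "age"]

# sorted() compares strings by Unicode code point, character by character.
sorted_tokens = sorted(tokens)
print(sorted_tokens)

# Code points of a digit, an uppercase letter, and a lowercase letter.
code_points = {ch: ord(ch) for ch in "2Ta"}
print(code_points)
```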

In [5]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)

# The empty table is as wide as your count of unique vocabulary terms and as high as the length of your document, 10 rows by 10 columns.
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
# For each word in the sentence, mark the column for that word in your vocabulary with a 1.
for i, word in enumerate(token_sequence):
  onehot_vectors[i, vocab.index(word)] = 1
" ".join(vocab)
onehot_vectors

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

Pandas DataFrames can help make this a little easier on the eyes and more informative. Pandas wraps a 1D array with some helper functionality in an object called a Series. And Pandas is particularly handy with tables of numbers like lists of lists, 2D numpy arrays, 2D numpy matrices, arrays of arrays, dictionaries of dictionaries, and so on.

A DataFrame keeps track of labels for each column, allowing you to label each column in your table with the token or word it represents. A DataFrame can also keep track of labels for each row in the DataFrame.index, for speedy lookup.

In [6]:
import pandas as pd

In [7]:
pd.DataFrame(onehot_vectors, columns=vocab)

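| | 26. | Jefferson | Monticello | Thomas | age | at | began | building | of | the |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |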


In [8]:
pd.DataFrame(onehot_vectors, columns=vocab, index=token_sequence)

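| | 26. | Jefferson | Monticello | Thomas | age | at | began | building | of | the |
|---|---|---|---|---|---|---|---|---|---|---|
| Thomas | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Jefferson | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| began | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| building | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Monticello | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| at | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| the | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| age | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| of | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 26. | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |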
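
With tokens as row labels, a word's vector can be looked up by name rather than by position. The following is a minimal sketch that rebuilds the table so it runs on its own; the choice of `"Monticello"` as the lookup key is arbitrary.

```python
import numpy as np
import pandas as pd

sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# Rebuild the one-hot table: one row per token, one column per vocabulary term.
onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

df = pd.DataFrame(onehot_vectors, columns=vocab, index=token_sequence)

# .loc selects a row by its index label and returns it as a Series.
row = df.loc["Monticello"]
# idxmax() returns the column label holding the largest value, the "hot" one.
hot_column = row.idxmax()
print(hot_column)
```

Label lookup with `.loc` returns a single Series here only because every token in this sentence is unique. If a token repeated, the same call would return a DataFrame with one row per occurrence.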


One-hot vectors are super-sparse, containing only one nonzero value in each row vector. So we can make that table of one-hot row vectors even prettier by replacing zeros with blanks.

In [9]:
df = pd.DataFrame(onehot_vectors, columns=vocab, index=token_sequence)
df[df == 0] = ''
df

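| | 26. | Jefferson | Monticello | Thomas | age | at | began | building | of | the |
|---|---|---|---|---|---|---|---|---|---|---|
| Thomas | | | | 1 | | | | | | |
| Jefferson | | 1 | | | | | | | | |
| began | | | | | | | 1 | | | |
| building | | | | | | | | 1 | | |
| Monticello | | | 1 | | | | | | | |
| at | | | | | | 1 | | | | |
| the | | | | | | | | | | 1 |
| age | | | | | 1 | | | | | |
| of | | | | | | | | | 1 | |
| 26. | 1 | | | | | | | | | |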


In this representation of your one-sentence document, each row is a vector for a single word.

Each row of the table is a binary row vector, and you can see why it’s also called a one-hot vector: all but one of the positions (columns) in a row are 0 or blank. Only one column, or position in the vector, is “hot” (“1”). A one (1) means on, or hot. A zero (0) means off, or absent.

One nice feature of this vector representation of words and tabular representation of documents is that no information is lost. As long as you keep track of which words are indicated by which column, you can reconstruct the original document from this table of one-hot vectors. And this reconstruction process is 100% accurate, even though your tokenizer was only 90% accurate at generating the tokens you thought would be useful: it left the period attached to the last token, "26.".
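
The reconstruction can be sketched in a few lines: find the hot column in each row, map it back to its vocabulary term, and join the terms in order. This is a minimal sketch that assumes the tokens were separated by single spaces, as they are in the example sentence; whitespace splitting does not record the exact whitespace it removed.

```python
import numpy as np

sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# Build the one-hot table: one row per token, one column per vocabulary term.
onehot_vectors = np.zeros((len(token_sequence), len(vocab)), int)
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

# argmax along each row gives the index of the single hot column.
hot_columns = onehot_vectors.argmax(axis=1)

# Map each column index back to its token and rejoin with single spaces.
reconstructed = " ".join(vocab[j] for j in hot_columns)
print(reconstructed == sentence)
```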

As a result, one-hot word vectors like this are typically used in neural nets, sequence-to-sequence language models, and generative language models. They’re a good choice for any model or NLP pipeline that needs to retain all the meaning inherent in the original text.