# Tokenization

Tokenization is the process of **converting text documents into contextual vectors that contain numeric representations** of the words in the documents. Each word is represented by its index in a word dictionary built from those documents.



## Imports
Of course, we'll need to import various packages. They are either built into the notebook image you are running or were installed in the previous step.


In [1]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow import keras

from tensorflow.keras.preprocessing.text import hashing_trick, text_to_word_sequence




In the following example, we define 5 'text' documents. Each word in the documents is going to be placed into a dictionary and given a unique index.

In [2]:
#============================================================================
# Define 5 'text' documents
#============================================================================
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!']

#============================================================================
# Create the tokenizer
#============================================================================
t = Tokenizer()

#============================================================================
# Fit the tokenizer on the documents.  Creates a word index and other info
#============================================================================
t.fit_on_texts(docs)  

#============================================================================
# Summarize what was learned.
#
# Dict of words and their counts. In this example there are 8 unique words
#============================================================================
print("Word counts:  \n{}".format(t.word_counts)) 

#============================================================================
# Number of docs in the fit
#============================================================================
print("Document count:  \n{}".format(t.document_count)) 

#============================================================================
# Dictionary of words and their indexes. Index starts at 1 since 0 is reserved.
#============================================================================
print("Dictionary or Word index:  \n{}".format(t.word_index)) 

#============================================================================
# Dictionary of words and how many docs each appeared in
#============================================================================
print("Word docs:  \n{}".format(t.word_docs)) 

#============================================================================
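# Encode the documents as a matrix (default mode is 'binary').
# One row per doc, one column per index in t.word_index (column 0 is reserved).
# A 1 in a column means the word with that index occurs in the doc.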
#============================================================================
encoded_docs = t.texts_to_matrix(docs) 

#============================================================================
# This is the 'contextual vector form' of the original text documents in 'docs'
#============================================================================
print("Encoded docs matrix:  \n{}".format(encoded_docs)) 


Word counts:  
OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
Document count:  
5
Dictionary or Word index:  
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
Word docs:  
defaultdict(<class 'int'>, {'well': 1, 'done': 1, 'work': 2, 'good': 1, 'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1})
Encoded docs matrix:  
[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]




## Let's take a closer look at what happened in the above code!

For example, the first row in the encoded matrix is:
[0. 0. 1. 1. 0. 0. 0. 0. 0.]

Recall that our Word index is the following: 
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

The first row in the encoded matrix has 2 non-zero entries, so the first document contains two words. The non-zero entries sit in columns 2 and 3 (counting from 0), and those column numbers are the indexes of the words in the Word index. Therefore the first document contains the 2 words 'well' and 'done'.
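The same lookup can be done in code. The following is a minimal sketch in plain Python that copies the word index and the first row from the output above as literal values; the `decode_row` helper is a hypothetical name introduced here for illustration and is not part of Keras.

```python
# Word index and first matrix row, copied from the output above.
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
first_row = [0., 0., 1., 1., 0., 0., 0., 0., 0.]

# Invert the word index so a column number can be looked up as a word.
index_word = {index: word for word, index in word_index.items()}

def decode_row(row):
    """Return the words whose columns are non-zero (hypothetical helper)."""
    # Column 0 is reserved and has no word, so skip columns not in index_word.
    return [index_word[col] for col, value in enumerate(row)
            if value != 0 and col in index_word]

decoded = decode_row(first_row)
print(decoded)  # ['well', 'done']
```

Note that a row only records which words occur in a document, not the order in which they appeared in the text.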

**Note:  our matrix contains one more column than there are unique words in the dictionary** (9 columns for 8 words), because index 0 is reserved and is never assigned to a word.  The positions of the non-zero values in each row are the word indexes of all the words in the document.
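The relationship between the word index and the matrix columns can be checked with a rough plain-Python approximation of what `fit_on_texts` and `texts_to_matrix` do for this example. This is only a sketch under stated assumptions: the `simple_tokens` helper and its regular expression are hypothetical stand-ins for the Keras filtering rules, not the library's implementation.

```python
import re
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

def simple_tokens(text):
    """Lowercase the text and keep runs of letters (assumed simplification)."""
    return re.findall(r"[a-z]+", text.lower())

# Count every word across all documents, in order of first appearance.
counts = Counter()
for doc in docs:
    counts.update(simple_tokens(doc))

# Most frequent words get the lowest indexes; indexing starts at 1
# because 0 is reserved. The sort is stable, so ties keep their order.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}

# One column per index, plus the reserved column 0.
num_columns = len(word_index) + 1
matrix = []
for doc in docs:
    row = [0.0] * num_columns
    for word in simple_tokens(doc):
        row[word_index[word]] = 1.0  # binary: mark presence only
    matrix.append(row)

print(word_index)
print(len(word_index), num_columns)
for row in matrix:
    print(row)
```

For these five documents the sketch produces 8 unique words and 9 columns, and column 0 stays at zero in every row.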

Now that we understand Tokenization, let's train our job titles model in the next notebook: **03-TrainJobsModel.ipynb**