In [1]:
# Example of tokenization: convert text documents into fixed-length numeric vectors. Each vector marks which
# words from a learned word dictionary (the word index) occur in the corresponding document.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import hashing_trick, text_to_word_sequence

# Define 5 text documents. Here each string is treated as one document. Each distinct word across the documents
# is put into a dictionary and given a unique index.

docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!']

# create the tokenizer
t = Tokenizer()

# fit the tokenizer on the documents
t.fit_on_texts(docs)  # Builds the word index, word counts, and document counts

# summarize what was learned
print("Word counts:  \n{}".format(t.word_counts))  # Dict of words and their count
                                                    # In this example there are 8 unique words
print("Document count:  \n{}".format(t.document_count)) # Number of docs in the fit
print("Word index:  \n{}".format(t.word_index)) # Dict of words and their indexes. Indexes start at 1
                                                # since 0 is reserved.
print("Word docs:  \n{}".format(t.word_docs)) # Dict of words and how many docs each appeared in

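# encode documents as a binary bag-of-words matrix (the default mode of texts_to_matrix is 'binary')
encoded_docs = t.texts_to_matrix(docs) # One row per doc, one column per word index in t.word_index, plus the reserved column 0.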

print("Encoded docs matrix:  \n{}".format(encoded_docs))  # this is the binary vector form of the original text documents in 'docs'

Word counts:  
OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
Document count:  
5
Word index:  
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
Word docs:  
defaultdict(<class 'int'>, {'done': 1, 'well': 1, 'work': 2, 'good': 1, 'effort': 1, 'great': 1, 'nice': 1, 'excellent': 1})
Encoded docs matrix:  
[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]


For example, the first row in the encoded matrix is:
[0. 0. 1. 1. 0. 0. 0. 0. 0.]

Recall that our Word index is the following: 
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

The first row in the encoded matrix represents a document with two distinct words, because there are 2 non-zero entries in the vector.  The non-zero entries sit at positions 2 and 3 (counting from 0), and those positions are the word indexes.  Index 2 is 'well' and index 3 is 'done' in the Word index, so the first document contains the 2 words 'well' and 'done'.
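
The lookup described above can be sketched in plain Python. This is only an illustration: the word index and the row are copied by hand from the printed output, and the names index_word, present and decoded are made up for this example.

```python
# Word index as printed above (copied by hand, not recomputed here).
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

# First row of the encoded matrix, copied from the output above.
first_row = [0., 0., 1., 1., 0., 0., 0., 0., 0.]

# Invert the mapping so a column position can be looked up as a word.
index_word = {index: word for word, index in word_index.items()}

# Column positions that hold a non-zero value are the word indexes present in the document.
present = [col for col, value in enumerate(first_row) if value != 0]
decoded = [index_word[col] for col in present]

print(present)  # [2, 3]
print(decoded)  # ['well', 'done']
```

Note that the decoded words come back in word-index order, not in the order they appeared in the document; the binary matrix does not keep word order.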


Now that we understand tokenization, let's go back to our claims data and vectorize it.  Go to notebook 01-Create-Claims-Classification.ipynb

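Note:  our matrix contains one more column than there are unique words in the dictionary (9 columns for 8 words), because column 0 is reserved and is never set.  In each row, the positions of the non-zero values are the word indexes of all the words in the document; the values themselves are all 1 and only flag that the word is present.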
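
To make the shape of the matrix concrete, here is a rough sketch that rebuilds the same binary matrix by hand.  It is a simplified stand-in, not the Keras implementation: encode_binary is a hypothetical helper, the word index is copied from the output above, and the text cleaning (lowercasing, replacing punctuation with spaces) only approximates the Tokenizer defaults.

```python
import string

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

# Word index copied from the printed output above.
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def encode_binary(doc, word_index):
    """Return one row: 1.0 at each word's index, 0.0 elsewhere (hypothetical helper)."""
    # One column per word index, plus column 0, which is reserved and stays 0.
    row = [0.0] * (len(word_index) + 1)
    for word in doc.lower().translate(PUNCT_TO_SPACE).split():
        row[word_index[word]] = 1.0
    return row

matrix = [encode_binary(doc, word_index) for doc in docs]

print(len(matrix), len(matrix[0]))  # 5 9
for row in matrix:
    print(row)
```

The 5 rows match the 5 documents, and the 9 columns are the 8 word indexes plus the unused column 0.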