# Prepare Text Data With `tf.keras`

@ Sani Kamal, 2019

## Split Words with `text_to_word_sequence`

`tf.keras` provides the `text_to_word_sequence()` function, which splits text into a list of words. By default it also lowercases the text and strips punctuation.

In [13]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'Poetry is often separated into lines on a page, in a process known as lineation.'
# tokenize the document
result = text_to_word_sequence(text)
print(result)

['poetry', 'is', 'often', 'separated', 'into', 'lines', 'on', 'a', 'page', 'in', 'a', 'process', 'known', 'as', 'lineation']
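
The output above shows the three default steps: lowercase, drop punctuation, split on spaces. The following is a minimal pure-Python sketch of that behaviour, not the library's own code. The helper name `simple_word_sequence` is hypothetical, and the filter string is assumed to match the Keras default.

```python
# Assumed default filter characters: punctuation, tab and newline.
# The apostrophe is not in the list, so a word like "don't" stays whole.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=' '):
    """Hypothetical stand-in for text_to_word_sequence."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    text = text.translate(str.maketrans({c: split for c in filters}))
    # Split, then drop the empty strings left by consecutive separators.
    return [w for w in text.split(split) if w]

text = 'Poetry is often separated into lines on a page, in a process known as lineation.'
words = simple_word_sequence(text)
print(words)
```

Replacing punctuation with the split character, rather than deleting it, keeps words such as `page,in` from being glued together when the text has no space after a comma.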


## Encoding with `one_hot`

In [14]:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'Poetry is often separated into lines on a page, in a process known as lineation.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

14
[1, 11, 17, 5, 1, 4, 13, 9, 12, 17, 9, 8, 14, 8, 1]
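
Despite its name, `one_hot()` returns one integer per word rather than a one-hot vector, as the output above shows. A minimal sketch of the idea follows, assuming each index is computed as `hash(word) % (n - 1) + 1`; the helper `hash_encode` is hypothetical. Python salts string hashes per process, so the integers differ between runs, and distinct words can land on the same integer (in the output above, `poetry`, `into` and `lineation` all map to 1). Padding the vocabulary size by 30% reduces such collisions but does not rule them out.

```python
def hash_encode(words, n, hash_function=hash):
    """Map each word to an integer in [1, n - 1]; 0 is left unused."""
    return [hash_function(w) % (n - 1) + 1 for w in words]

words = ['poetry', 'is', 'often', 'separated', 'into', 'lines', 'on', 'a',
         'page', 'in', 'a', 'process', 'known', 'as', 'lineation']

# 14 unique words, padded by 30% and rounded: the hashing space has size 18.
n = round(len(set(words)) * 1.3)
encoded = hash_encode(words, n)
print(encoded)  # values vary from run to run

# Distinct words that share an index are collisions.
collisions = len(set(words)) - len(set(encoded))
print('collisions:', collisions)
```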


## Hash Encoding with `hashing_trick`

`tf.keras` provides the `hashing_trick()` function, which tokenizes and then integer encodes the document, just like the `one_hot()` function. It is more flexible: you can specify the hash function as `hash` (the default), the built-in `md5` option, or a custom function of your own.

In [15]:
from tensorflow.keras.preprocessing.text import hashing_trick
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# define the document
text = 'Poetry is often separated into lines on a page, in a process known as lineation.'

# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

14
[11, 6, 5, 8, 15, 1, 14, 12, 17, 1, 12, 16, 5, 3, 10]
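
The practical benefit of `md5` over the default `hash` is stability: the same word maps to the same integer in every run and on every machine. Below is a rough standard-library sketch of that idea. It assumes the MD5 hex digest is read as an integer and reduced with `% (n - 1) + 1`; the helper names `md5_hash` and `md5_encode` are hypothetical.

```python
import hashlib

def md5_hash(word):
    """Stable hash: interpret the MD5 hex digest of the word as an integer."""
    return int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16)

def md5_encode(words, n):
    """Map each word to an integer in [1, n - 1] using the stable hash."""
    return [md5_hash(w) % (n - 1) + 1 for w in words]

sample = ['on', 'a', 'page', 'in', 'a', 'process']
first = md5_encode(sample, 18)
second = md5_encode(sample, 18)
print(first)
print(first == second)  # repeated calls agree, unlike salted hash() across runs
```

A stable hash matters when the encoding must be reproduced later, for example when a saved model is fed new text.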


## `Tokenizer` API

`tf.keras` also provides a more sophisticated API for preparing text: the `Tokenizer` class, which can be fit once and then reused to prepare multiple text documents for deep learning. This may be the preferred approach for large projects. The `Tokenizer` must be constructed and then fit on either raw text documents or integer encoded text documents.

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer

# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']

# create the tokenizer
t = Tokenizer()

# fit the tokenizer on the documents
t.fit_on_texts(docs)

Once fit, the `Tokenizer` provides 4 attributes that you can use to query what has been learned about your documents:

- `word_counts`: A dictionary of words and their counts.
- `word_docs`: A dictionary of words and how many documents each appeared in.
- `word_index`: A dictionary of words and their uniquely assigned integers.
- `document_count`: An integer count of the total number of documents that were used to fit the `Tokenizer`.

In [17]:
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'well': 2, 'done': 3, 'good': 4, 'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
defaultdict(<class 'int'>, {'done': 1, 'well': 1, 'work': 2, 'good': 1, 'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1})
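
To make these four attributes concrete, here is a pure-Python sketch that rebuilds them for the same five documents. It is an illustration under stated assumptions, not the `Tokenizer` implementation: the `tokenize` helper is hypothetical, and the ranking rule (most frequent word first, ties kept in first-seen order) is inferred from the output above.

```python
from collections import OrderedDict, defaultdict

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # assumed default filters

def tokenize(text):
    """Hypothetical helper: lowercase, strip punctuation, split on spaces."""
    text = text.lower().translate(str.maketrans({c: ' ' for c in FILTERS}))
    return text.split()

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

word_counts = OrderedDict()    # word -> total occurrences
word_docs = defaultdict(int)   # word -> number of documents containing it
document_count = 0             # number of documents seen

for doc in docs:
    document_count += 1
    tokens = tokenize(doc)
    for w in tokens:
        word_counts[w] = word_counts.get(w, 0) + 1
    for w in set(tokens):      # count each word once per document
        word_docs[w] += 1

# Rank by count, highest first; sorted() is stable, so ties keep insertion order.
ranked = sorted(word_counts, key=word_counts.get, reverse=True)
word_index = {w: i for i, w in enumerate(ranked, start=1)}  # index 0 is unused

print(word_counts)
print(document_count)
print(word_index)
print(dict(word_docs))
```

`work` is the only word that occurs twice, so it receives index 1; the remaining words follow in the order they were first seen.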


In [18]:
# binary: Whether or not each word is present in the document. This is the default.
encoded_docs = t.texts_to_matrix(docs)
print(encoded_docs)

[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]


In [19]:
# tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF)
# score for each word in the document.
encoded_docs = t.texts_to_matrix(docs, mode='tfidf')
print(encoded_docs)

[[0.         0.         1.25276297 1.25276297 0.         0.
  0.         0.         0.        ]
 [0.         0.98082925 0.         0.         1.25276297 0.
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.         1.25276297
  1.25276297 0.         0.        ]
 [0.         0.98082925 0.         0.         0.         0.
  0.         1.25276297 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         1.25276297]]
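
The two distinct non-zero values in this matrix can be reproduced by hand. The sketch below assumes the score is `(1 + ln(count)) * ln(1 + N / (1 + df))`, where `count` is the word's count in the document, `N` is `document_count` and `df` is the word's entry in `word_docs`. This formula is inferred from the printed values rather than quoted from the library, and the function name `tfidf_score` is hypothetical.

```python
import math

def tfidf_score(count_in_doc, docs_with_word, n_docs):
    """TF-IDF under the assumed formula described in the text above."""
    tf = 1 + math.log(count_in_doc)
    idf = math.log(1 + n_docs / (1 + docs_with_word))
    return tf * idf

n_docs = 5  # document_count from the fitted Tokenizer

# 'well' occurs once, in one document.
rare = tfidf_score(1, 1, n_docs)
# 'work' occurs once per document, in two documents.
common = tfidf_score(1, 2, n_docs)

print(round(rare, 8))    # 1.25276297, as in the matrix above
print(round(common, 8))  # 0.98082925, as in the matrix above
```

`work` scores lower than the other words because it appears in two of the five documents, which makes it less distinctive.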


In [20]:
# freq: The frequency of each word as a ratio of words within each document.
encoded_docs = t.texts_to_matrix(docs, mode='freq')
print(encoded_docs)

[[0.  0.  0.5 0.5 0.  0.  0.  0.  0. ]
 [0.  0.5 0.  0.  0.5 0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.5 0.5 0.  0. ]
 [0.  0.5 0.  0.  0.  0.  0.  0.5 0. ]
 [0.  0.  0.  0.  0.  0.  0.  0.  1. ]]


In [21]:
# count: The count of each word in the document.
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
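
The `count` matrix here is identical to the `binary` one only because no word repeats within any of the five documents. The sketch below shows how the `binary`, `count` and `freq` modes diverge once a word repeats. It is a simplified illustration: `encode_row`, the two-word vocabulary and the sample document are all hypothetical, and `freq` is assumed to divide by the number of known words in the document.

```python
def encode_row(tokens, word_index, mode='binary'):
    """Encode one tokenized document as a row of len(word_index) + 1 values."""
    row = [0.0] * (len(word_index) + 1)  # column 0 is never assigned to a word
    counts = {}
    for w in tokens:
        if w in word_index:              # words outside the vocabulary are skipped
            j = word_index[w]
            counts[j] = counts.get(j, 0) + 1
    total = sum(counts.values())
    for j, c in counts.items():
        if mode == 'binary':
            row[j] = 1.0
        elif mode == 'count':
            row[j] = float(c)
        elif mode == 'freq':
            row[j] = c / total
        else:
            raise ValueError('unknown mode: ' + mode)
    return row

word_index = {'work': 1, 'good': 2}      # hypothetical vocabulary
tokens = ['good', 'good', 'work']        # 'good' repeats

binary_row = encode_row(tokens, word_index, 'binary')
count_row = encode_row(tokens, word_index, 'count')
freq_row = encode_row(tokens, word_index, 'freq')
print(binary_row)
print(count_row)
print(freq_row)
```

The reserved column 0 also explains why every matrix above has 9 columns for an 8-word vocabulary and why its first column is always zero.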
