# How to Prepare Text Data With Keras

Raw text cannot be fed directly into deep learning models. The text must first be encoded as numbers before it can be used as input or output for machine learning and deep learning models. As with any neural network, the data has to be converted into a numeric format; in Keras and TensorFlow that means tensors. The IMDB example data from the keras package, for instance, has already been preprocessed into lists of integers, where every integer corresponds to a word and words are indexed by descending frequency.

# 7.1 Split Words with text_to_word_sequence

Words are called tokens, and the process of splitting text into tokens is called tokenization. The first step when working with text is to split it into words. Keras provides the text_to_word_sequence() function, which splits text into a list of words. By default, this function automatically does three things:
 1. Splits words by space (split=' ').
 2. Filters out punctuation (the filters argument).
 3. Converts text to lowercase (lower=True).

In [0]:
from keras.preprocessing.text import text_to_word_sequence
#define the document
text = 'The quick brown fox jumped over the lazy dog.'
#tokenize the document
result = text_to_word_sequence(text)
print(result)

In [0]:
['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
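
The three steps can also be sketched in plain Python, which makes it easier to see what happens to the text. This is only an approximation for illustration: the helper name simple_word_sequence is hypothetical, and the FILTERS string is assumed to match the Keras default, which may differ between versions.

```python
# Assumed default filter characters; check your Keras version for the exact set.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=' '):
    """Hypothetical stand-in for text_to_word_sequence, for illustration only."""
    # Step 3 from the list above: convert to lowercase
    if lower:
        text = text.lower()
    # Step 2: replace every filtered character with the split character
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Step 1: split on the separator and drop empty strings
    return [token for token in text.split(split) if token]

tokens = simple_word_sequence('The quick brown fox jumped over the lazy dog.')
print(tokens)
```

Replacing punctuation with the separator, rather than deleting it, keeps words such as 'lazy,dog' from being glued together.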

# 7.2 Encoding with one_hot

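A one hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Then each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1. Despite its name, the Keras one_hot() function does not return binary vectors: it is a wrapper around hashing_trick() that maps each word to an integer using a hash, so two different words can receive the same integer. Sizing the hash space about 30% larger than the estimated vocabulary, as in the example below, reduces the chance of such collisions but does not eliminate it.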
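
To make the definition concrete, here is a minimal pure-Python sketch of a true one hot encoding: words are first mapped to integers, then each integer becomes a binary vector. The helper names build_vocab and one_hot_vector are hypothetical and are not part of Keras.

```python
def build_vocab(words):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def one_hot_vector(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
vocab = build_vocab(words)
encoded = [one_hot_vector(vocab[w], len(vocab)) for w in words]

# Both occurrences of 'the' share the same vector
print(vocab['the'], encoded[0])
```

Each vector is as long as the vocabulary, which is why this representation becomes expensive for large vocabularies.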

In [0]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
#define the document
text = 'The quick brown fox jumped over the lazy dog.'
#estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
#integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

8
[5, 9, 8, 7, 9, 1, 5, 3, 8]

# 7.3 Hash Encoding with hashing_trick

The limitation of integer and count based encodings is that they must maintain a vocabulary of words and their mapping to integers. An alternative is to use a one-way hash function to convert words to integers. This avoids the need to keep track of a vocabulary, which is faster and requires less memory. The choice of hash function determines the range of possible outputs, and that range is always fixed (e.g. numbers from 0 to 1024). Hash functions are one-way: given a hash, we cannot perform a reverse lookup to determine what the input was. Hash functions may also output the same value for different inputs (a collision).
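
The idea can be sketched without Keras using the standard hashlib module. The function name md5_index is hypothetical, and the choice to map into the range 1 to n-1 (leaving 0 unused) is an assumption made for this sketch; it is not claimed to reproduce the exact integers that hashing_trick() returns.

```python
import hashlib

def md5_index(word, n):
    """Map a word to an integer in [1, n-1] using a stable MD5 hash."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    # Interpret the hex digest as one large integer, then fold it into the range
    return int(digest, 16) % (n - 1) + 1

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
n = 10  # size of the hash space, chosen larger than the vocabulary
indices = [md5_index(w, n) for w in words]
print(indices)
```

No vocabulary is stored: the same word always hashes to the same integer, but there is no way to recover the word from the integer, and distinct words may share one.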

In [0]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
#define the document
text = 'The quick brown fox jumped over the lazy dog.'
#estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
#integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

In [0]:
8
[6, 4, 1, 2, 7, 5, 6, 2, 6]

# 7.4 Tokenizer API

Keras provides a more sophisticated API for preparing text that can be fit once and reused to prepare multiple text documents. This is the Tokenizer class for preparing text documents for deep learning, and it may be the preferred approach for large projects.

In [0]:
from keras.preprocessing.text import Tokenizer
#define 5 documents
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!']
#create the tokenizer
t = Tokenizer()
#fit the tokenizer on the documents
t.fit_on_texts(docs)

Once fit, the Tokenizer provides four attributes that describe what it has learned about the documents:

word_counts: A dictionary of words and their counts.

word_docs: A dictionary of words and how many documents each appeared in.

word_index: A dictionary of words and their uniquely assigned integers.

document_count: An integer count of the total number of documents that were used to fit the Tokenizer.
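
As a rough illustration of what these four attributes hold, the following sketch computes equivalent values with the standard library. The tokenize helper is a hypothetical simplification, and ranking words by descending count with ties kept in first-seen order is an assumption of this sketch rather than a statement about the Keras internals.

```python
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

def tokenize(text):
    # Simplified tokenization: lowercase, replace '!' and split on whitespace
    return text.lower().replace('!', ' ').split()

word_counts = Counter()  # word -> total number of occurrences
word_docs = Counter()    # word -> number of documents containing it
for doc in docs:
    tokens = tokenize(doc)
    word_counts.update(tokens)
    word_docs.update(set(tokens))  # count each word once per document

document_count = len(docs)

# Rank words by descending count; indices start at 1
ranked = sorted(word_counts, key=word_counts.get, reverse=True)
word_index = {word: i for i, word in enumerate(ranked, start=1)}

print(dict(word_counts))
print(document_count)
print(word_index)
print(dict(word_docs))
```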

In [0]:
#summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)

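Once the Tokenizer has been fit, the texts_to_matrix() function encodes each document as one fixed-length vector. The mode argument selects the scoring scheme used for each word:

binary: Whether or not each word is present in the document. This is the default.

count: The count of each word in the document.

tfidf: The Term Frequency-Inverse Document Frequency (TF-IDF) scoring for each word in the document.

freq: The frequency of each word as a ratio of words within each document.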
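
The difference between the modes is easier to see in a small hand-written version. The sketch below covers binary, count and freq only; tfidf is left out because its exact weighting formula is not given here. The names tokenize and texts_to_rows are hypothetical, and the layout (one column per word index, with column 0 unused) is an assumption of this sketch.

```python
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']

def tokenize(text):
    # Simplified tokenization: lowercase, replace '!' and split on whitespace
    return text.lower().replace('!', ' ').split()

tokenized = [tokenize(d) for d in docs]
counts = Counter(t for tokens in tokenized for t in tokens)
ranked = sorted(counts, key=counts.get, reverse=True)
word_index = {w: i for i, w in enumerate(ranked, start=1)}

def texts_to_rows(tokenized_docs, word_index, mode='binary'):
    size = len(word_index) + 1  # column 0 is left unused
    rows = []
    for tokens in tokenized_docs:
        row = [0.0] * size
        for word, c in Counter(tokens).items():
            j = word_index[word]
            if mode == 'binary':
                row[j] = 1.0              # word is present
            elif mode == 'count':
                row[j] = float(c)         # occurrences in this document
            elif mode == 'freq':
                row[j] = c / len(tokens)  # share of the document's words
            else:
                raise ValueError('unsupported mode: ' + mode)
        rows.append(row)
    return rows

for mode in ('binary', 'count', 'freq'):
    print(mode, texts_to_rows(tokenized, word_index, mode)[0])
```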

In [0]:
from keras.preprocessing.text import Tokenizer
#define 5 documents
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!']
#create the tokenizer
t = Tokenizer()
#fit the tokenizer on the documents
t.fit_on_texts(docs)
#summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
#integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)