## Data Preprocessing for Keras
Deep learning models cannot take raw text as input. The text must first be encoded as numbers before it can be fed to a model, for example one that learns word embeddings.

### Split words with text_to_word_sequence
We can use the text_to_word_sequence API from Keras. By default it does three things:

1.  Splits the text into words on spaces
2.  Filters out punctuation
3.  Converts text to lowercase (lower=True)

In [2]:
from keras.preprocessing.text import text_to_word_sequence
text='The quick brown fox jumped over the lazy dog'
result=text_to_word_sequence(text)
print(result)


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
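The three steps listed above can be approximated in plain Python. This is only a rough sketch to show the idea, not the Keras implementation: the function name simple_word_sequence is hypothetical, and using string.punctuation as the filter set is an assumption (Keras uses its own default filter string, which may differ).

```python
import string


def simple_word_sequence(text, lower=True, split=" "):
    """Rough stand-in for text_to_word_sequence (hypothetical helper)."""
    # Step 3 in the list: convert to lowercase.
    if lower:
        text = text.lower()
    # Step 2: replace every punctuation character with the separator.
    # Assumption: string.punctuation stands in for the Keras filter set.
    table = str.maketrans({ch: split for ch in string.punctuation})
    text = text.translate(table)
    # Step 1: split on the separator and drop empty tokens.
    return [token for token in text.split(split) if token]


print(simple_word_sequence("The quick brown fox, jumped over the lazy dog!"))
```

Punctuation is replaced by the separator rather than deleted so that text like "fox,jumped" still splits into two words.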


### Encoding with one_hot

In [3]:
# Count the unique words in the document
words=set(text_to_word_sequence(text))
vocab_size=len(words)
print(vocab_size)

8


In [5]:
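# Encode the words in the document with one_hot().
# Despite its name, one_hot() returns hashed integer indices rather than
# one-hot vectors, so different words can end up sharing an index.
# The hash space is made about 30% larger than the vocabulary to reduce collisions.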
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
text='The quick brown fox jumped over the lazy dog'
words=set(text_to_word_sequence(text))
vocab_size=len(words)
print(vocab_size)
result=one_hot(text, round(vocab_size*1.3))
print(result)

8
[6, 6, 2, 7, 6, 2, 6, 8, 3]
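The result above assigns the same index to several different words, because one_hot hashes words into a fixed range instead of looking them up in a vocabulary. A small sketch that groups the words by index makes these collisions visible. The tokens and indices are copied from the outputs shown above; the variable names are only illustrative.

```python
from collections import defaultdict

# Tokens and indices copied from the cell outputs above.
tokens = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
indices = [6, 6, 2, 7, 6, 2, 6, 8, 3]

# Collect the distinct words that landed on each index.
words_by_index = defaultdict(set)
for token, index in zip(tokens, indices):
    words_by_index[index].add(token)

# Keep only the indices shared by more than one distinct word.
collisions = {i: sorted(w) for i, w in words_by_index.items() if len(w) > 1}
print(collisions)
# {6: ['jumped', 'quick', 'the'], 2: ['brown', 'over']}
```

Here 8 unique words are squeezed into 5 distinct indices. A larger hash space lowers the chance of collisions but does not rule them out.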


### Hashing encoding with hashing_trick
A limitation of integer and count-based encodings is that they must maintain a vocabulary of words and their mapping to integers. An alternative is to use a one-way hash function to convert words to integers. This avoids the need to keep track of a vocabulary, which makes encoding faster and reduces memory use.

In [6]:
from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
text='The quick brown fox jumped over the lazy dog'
words=set(text_to_word_sequence(text))
vocab_size=len(words)

result=hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

[6, 4, 1, 2, 7, 5, 6, 2, 6]
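To make the trade-off concrete, the sketch below contrasts a vocabulary-based integer encoding with an MD5-based hash encoding in plain Python. Both helpers (vocab_encode and md5_encode) are hypothetical names written for illustration. The modulo-and-offset formula that keeps index 0 free is an assumption about how such a scheme can be built, so its output may not match hashing_trick exactly.

```python
import hashlib


def vocab_encode(tokens):
    """Integer-encode tokens using an explicit vocabulary that must be stored."""
    vocab = {}
    encoded = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # index 0 is left unused
        encoded.append(vocab[token])
    return encoded, vocab


def md5_encode(tokens, n):
    """Hash tokens into the range 1..n-1 without storing any vocabulary."""
    def stable_hash(word):
        # MD5 gives the same value on every run, unlike Python's salted hash().
        return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

    # Assumption: modulo (n - 1) plus 1 keeps index 0 free.
    return [stable_hash(token) % (n - 1) + 1 for token in tokens]


tokens = "the quick brown fox jumped over the lazy dog".split()

encoded, vocab = vocab_encode(tokens)
print(encoded)      # [1, 2, 3, 4, 5, 6, 1, 7, 8]
print(len(vocab))   # 8 entries that must be kept around for later use

hashed = md5_encode(tokens, 10)
print(hashed)       # no vocabulary needed, but collisions are possible
```

The vocabulary version never collides but has to keep the mapping in memory and reuse it at prediction time. The hashed version needs only the hash function and the range size, at the cost of possibly mapping different words to the same integer.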
