# Text encoding
## Encoding text word-by-word based on frequency

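* [Tokenizer](https://keras.io/preprocessing/text/#tokenizer) builds a word index ranked by word frequency, with some minimal lexical pre-processing (lowercasing, punctuation filtering); **texts_to_sequences()** then encodes each text as a list of word indices.
* [pad_sequences](https://keras.io/preprocessing/sequence/#pad_sequences) pads or truncates the encoded sequences to the expected NN input size.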


In [1]:
# Tokenizer example
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love you!',
    'do i owe you?'  # no comma here: Python joins this literal with the next one into a single sentence
    'i need a break',
    "your's jokes are the BEST.",
    'you are so butefull to me'
]

tknzr = Tokenizer(num_words=100)  # encode only the most frequent words (indices below 100)
tknzr.fit_on_texts(sentences)
idx = tknzr.word_index
print(idx)
# note: the text is lowercased and punctuation is stripped (the apostrophe in "your's" is kept)

{'i': 1, 'you': 2, 'are': 3, 'love': 4, 'do': 5, 'owe': 6, 'need': 7, 'a': 8, 'break': 9, "your's": 10, 'jokes': 11, 'the': 12, 'best': 13, 'so': 14, 'butefull': 15, 'to': 16, 'me': 17}
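
The ranking printed above can be approximated in plain Python, which may help to see where the indices come from. This is a simplified sketch, not the Keras implementation: `tokenize`, `build_word_index`, and `FILTERS` are hypothetical names, and the filter set is assumed to match the Tokenizer default.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to mirror the Keras default).
# The apostrophe is not in the set, so a word like "your's" stays a single token.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text):
    """Lowercase the text, blank out filtered characters, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Rank words by descending frequency; ties keep first-seen order. Indices start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so equally frequent words stay in the order they first appeared.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


print(build_word_index(['I love you!', 'you are so kind to me']))
# {'you': 1, 'i': 2, 'love': 3, 'are': 4, 'so': 5, 'kind': 6, 'to': 7, 'me': 8}
```

Index 0 is left unused by the word index, which is what allows it to serve as the padding value later on.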


In [2]:
sequences = tknzr.texts_to_sequences(sentences)
print(sequences)

[[1, 4, 2], [5, 1, 6, 2, 1, 7, 8, 9], [10, 11, 3, 12, 13], [2, 3, 14, 15, 16, 17]]


In [3]:
# Words not seen during fit_on_texts are dropped by default,
# so inputs at inference time contain only indices the trained NN knows
test_data = ['i love you!' , 'i REALLY love you!']
print(test_data)
print(tknzr.texts_to_sequences(test_data))

['i love you!', 'i REALLY love you!']
[[1, 4, 2], [1, 4, 2]]


In [4]:
# To capture the difference in meaning that unknown words carry, encode them with a special token
tknzr2 = Tokenizer(num_words=100, oov_token='<OOV>')  # OOV = out of vocabulary
tknzr2.fit_on_texts(sentences)
print(test_data)
print(tknzr2.texts_to_sequences(test_data))
print(tknzr2.word_index)

['i love you!', 'i REALLY love you!']
[[2, 5, 3], [2, 1, 5, 3]]
{'<OOV>': 1, 'i': 2, 'you': 3, 'are': 4, 'love': 5, 'do': 6, 'owe': 7, 'need': 8, 'a': 9, 'break': 10, "your's": 11, 'jokes': 12, 'the': 13, 'best': 14, 'so': 15, 'butefull': 16, 'to': 17, 'me': 18}
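
The difference between the two tokenizers comes down to what happens on a failed dictionary lookup. The sketch below illustrates that logic in plain Python; `encode` is a hypothetical helper written for this note, not a Keras function, and the two small dictionaries are excerpts of the word indices printed above.

```python
def encode(words, word_index, oov_token=None):
    """Map words to indices; unknown words are dropped, or replaced when oov_token is set."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    encoded = []
    for word in words:
        index = word_index.get(word)
        if index is not None:
            encoded.append(index)      # known word
        elif oov_index is not None:
            encoded.append(oov_index)  # unknown word, replaced by the OOV index
        # otherwise the unknown word is silently skipped
    return encoded


plain_index = {'i': 1, 'you': 2, 'love': 4}
index_with_oov = {'<OOV>': 1, 'i': 2, 'you': 3, 'love': 5}
words = ['i', 'really', 'love', 'you']

print(encode(words, plain_index))                        # [1, 4, 2]
print(encode(words, index_with_oov, oov_token='<OOV>'))  # [2, 1, 5, 3]
```

With the OOV token the encoded sequence keeps its original length, so the network can at least see that a word was present at that position.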


Every sentence should have the same length after encoding, so pad (or truncate) each sequence to the expected NN input size.
* [pad_sequences](https://keras.io/preprocessing/sequence/#pad_sequences) is available in tensorflow.keras.preprocessing.sequence.

In [5]:
# padding
from tensorflow.keras.preprocessing.sequence import pad_sequences

unpadded = tknzr2.texts_to_sequences(test_data)
padded = pad_sequences(unpadded)  # default padding='pre': zeros are added at the front
print(unpadded, type(unpadded), '\n----\n', padded, type(padded))


[[2, 5, 3], [2, 1, 5, 3]] <class 'list'> 
----
 [[0 2 5 3]
 [2 1 5 3]] <class 'numpy.ndarray'>


In [6]:
# padding up to expected NN input size
# padding and truncating each accept 'pre' or 'post'
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(sequences, '\n---\n', padded)

[[1, 4, 2], [5, 1, 6, 2, 1, 7, 8, 9], [10, 11, 3, 12, 13], [2, 3, 14, 15, 16, 17]] 
---
 [[ 1  4  2  0  0]
 [ 5  1  6  2  1]
 [10 11  3 12 13]
 [ 2  3 14 15 16]]
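
To make the 'pre' and 'post' options concrete, here is a minimal plain-Python sketch of the padding and truncating logic. The `pad` function is a hypothetical stand-in that returns nested lists rather than a NumPy array, and it assumes `maxlen` is at least 1; it is not how pad_sequences is implemented.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Return equal-length lists, truncated to maxlen and filled with value."""
    if maxlen is None:
        # No target length given: use the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        # 'pre' truncation keeps the last maxlen items, 'post' keeps the first maxlen.
        kept = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(kept))
        # 'pre' padding puts the fill in front, 'post' puts it at the end.
        result.append(fill + kept if padding == 'pre' else kept + fill)
    return result


print(pad([[2, 5, 3], [2, 1, 5, 3]]))
# [[0, 2, 5, 3], [2, 1, 5, 3]]
print(pad([[1, 4, 2], [5, 1, 6, 2, 1, 7, 8, 9]], maxlen=5, padding='post', truncating='post'))
# [[1, 4, 2, 0, 0], [5, 1, 6, 2, 1]]
```

Whether to pad and truncate at the front or at the end depends on the model: truncating at the end, as in the cell above, discards the tail of long sentences.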


## Embedding
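An embedding maps each encoded word (an integer index) to a dense vector that is learned during training. Words used in similar contexts end up with similar vectors, which lets the network pick up on sentiment so that you can classify texts and later predict on new ones.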
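
At its core, an embedding layer is a lookup table with one row of numbers per word index. The sketch below shows only that lookup in plain Python: `make_embedding_table` and `embed` are hypothetical names, and the random values stand in for weights that a real network would learn during training.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Create one small random vector per word index (stand-ins for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(sequence, table):
    """Replace every word index in a padded sequence with its vector."""
    return [table[index] for index in sequence]


table = make_embedding_table(vocab_size=100, dim=4)
vectors = embed([1, 4, 2, 0, 0], table)  # a padded sequence of length 5

print(len(vectors), len(vectors[0]))  # 5 4
print(vectors[3] == vectors[4])       # True: both padding positions map to row 0
```

A sequence of 5 indices becomes 5 vectors of 4 numbers each, and this is the shape the next layer of the network consumes.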