**WORD BASED ENCODINGS:**

TensorFlow and Keras give us a number of ways to encode words, but the one I'm going to focus on is the `Tokenizer`.
It generates the dictionary of word encodings and creates vectors out of the sentences.

In [None]:
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["I love my dog",
             "I love my cat",
             "You love my dog!"]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


The `fit_on_texts` method of the tokenizer takes in the data and builds the vocabulary from it. The tokenizer provides a `word_index` property, which returns a dictionary of key-value pairs where the key is the word and the value is the token for that word. You can inspect it by simply printing it out.

A few things to take note of. Number one, punctuation such as the exclamation mark in "dog!" has been removed, so "dog!" and "dog" share a single token. The tokenizer cleans up the text in that way to pull out just the words. Number two, "I" was capitalized in the sentences, but it is lower-cased in the word index. The tokenizer is case-insensitive, so "I" and "i" (or "My" and "my") get the same token.
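
To make the clean-up and numbering steps concrete, here is a minimal pure-Python sketch of what is described above. It is not the Keras implementation: `FILTERS`, `split_words`, and `build_word_index` are hypothetical names introduced for illustration, and the filter string is an assumption meant to approximate the tokenizer's default punctuation handling. It lower-cases the text, swaps punctuation for spaces, counts the words, and numbers them from 1 by descending frequency.

```python
from collections import Counter

# Assumption: characters treated as separators, chosen to approximate the
# tokenizer's default filtering (the apostrophe is deliberately kept).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lower-case the text, replace filtered characters with spaces, split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Hypothetical helper: map each word to a 1-based index by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate(ranked, start=1)}


sentences = ["I love my dog",
             "I love my cat",
             "You love my dog!"]

word_index = build_word_index(sentences)
print(word_index)
```

For these three sentences the sketch produces the same dictionary as the tokenizer output shown above, which is a useful sanity check on the mental model.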

***To note: corpus of words***

**TEXT TO SEQUENCE:**

We saw how to tokenize the words and sentences, building up a dictionary of all the words to make a corpus. 
The next step will be to turn your sentences into lists of values based on these tokens. 
Once you have them, you'll likely also need to manipulate these lists, not least to make every sentence the same length; otherwise, it may be hard to train a neural network with them.
Remember when we were doing images, we defined an input layer with the size of the image that we're feeding into the neural network. In the cases where images were differently sized, we would resize them to fit.
Well, you're going to face the same thing with text. Fortunately, TensorFlow includes APIs to handle these issues. 

One really handy thing about this, which you'll use later, is that `texts_to_sequences` can take any set of sentences, not only the ones that were passed into `fit_on_texts`. It encodes them using the word set it learned during fitting and leaves out any words it has not seen. This is very significant.


In [None]:
# Using the tokenizer fitted on the earlier corpus

test_sentences = ["I really love my dog",
             "My dog loves my mantee"]


test_sequences = tokenizer.texts_to_sequences(test_sentences)
print(test_sequences)

[[3, 1, 2, 4], [2, 4, 2]]


Missed: `really`, `loves`, and `mantee`. These words were not part of the word corpus, so they are silently dropped from the sequences.
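
The dropping behaviour can be pictured as a plain dictionary lookup that skips missing keys. The sketch below is an illustration under that assumption, not the library's code: `to_sequences` is a hypothetical helper, the word index is copied from the output shown earlier, and punctuation handling is left out to keep it short.

```python
# Word index copied from the tokenizer output shown earlier in this section.
word_index = {"love": 1, "my": 2, "i": 3, "dog": 4, "cat": 5, "you": 6}


def to_sequences(texts, word_index):
    """Hypothetical helper: look up each word and skip the unknown ones."""
    sequences = []
    for text in texts:
        words = text.lower().split()
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


test_sentences = ["I really love my dog",
                  "My dog loves my mantee"]

test_sequences = to_sequences(test_sentences, word_index)
print(test_sequences)
```

A five-word sentence comes back as four tokens and the second as only three, so the sequence no longer lines up with the original sentence.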

Instead of just ignoring unseen words, you can have the tokenizer put a special value in whenever an unseen word is encountered. This is the OOV (out-of-vocabulary) token, set through the `oov_token` parameter.

**ENCOUNTERING UNSEEN WORDS WITH OOV:**

In [None]:
sentences = ["I love my dog",
             "I love my cat",
             "You love my dog!"]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

test_sentences = ["I really love my dog",
             "My dog loves my mantee"]


test_sequences = tokenizer.texts_to_sequences(test_sentences)
print(test_sequences)

{'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}
[[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5]]
[[4, 1, 2, 3, 5], [3, 5, 1, 3, 1]]
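
Conceptually, the only change from the earlier lookup is the fallback: an unknown word maps to the OOV index instead of being skipped. Here is a minimal sketch under that assumption. `to_sequences_with_oov` is a hypothetical helper, not a Keras function, and the word index is copied from the output above.

```python
# Word index copied from the output above; the OOV token holds index 1.
word_index = {"<OOV>": 1, "love": 2, "my": 3, "i": 4,
              "dog": 5, "cat": 6, "you": 7}


def to_sequences_with_oov(texts, word_index, oov_token="<OOV>"):
    """Hypothetical helper: unknown words map to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]


test_sentences = ["I really love my dog",
                  "My dog loves my mantee"]

test_sequences = to_sequences_with_oov(test_sentences, word_index)
print(test_sequences)
```

Each sequence now has the same number of tokens as its sentence has words, so sentence length is preserved even though the meaning of the unseen words is lost.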


**PADDING:**

As we mentioned earlier, when we were building neural networks to handle pictures, we needed the images to be uniform in size before feeding them into the network for training. Often, we used the generators to resize the images to fit, for example. You'll face a similar requirement with text: before you can train on it, the sequences need some level of uniformity of size, so padding is your friend there.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog blackie who is a labraor is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

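# padding="post" appends zeros at the end; maxlen=7 caps the length, and
# longer sequences lose their first tokens (the default truncating="pre")
padded = pad_sequences(sequences, padding="post", maxlen=7)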
print("\nWord Index = " , word_index)
print("\nSequences = " , sequences)
print("\nPadded Sequences:")
print(padded)


# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my place as its an open field place'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)

# Defaults: zeros are added at the front, up to the longest sequence
padded = pad_sequences(test_seq)
print("\nPadded Test Sequence: ")
print(padded)


Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'is': 7, 'cat': 8, 'do': 9, 'think': 10, 'blackie': 11, 'who': 12, 'a': 13, 'labraor': 14, 'amazing': 15}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 8], [6, 3, 2, 4], [9, 6, 10, 2, 4, 11, 12, 7, 13, 14, 7, 15]]

Padded Sequences:
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  8  0  0  0]
 [ 6  3  2  4  0  0  0]
 [11 12  7 13 14  7 15]]

Test Sequence =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1, 1, 1, 1, 1, 1, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 0 5 1 3 2 4]
 [2 4 1 2 1 1 1 1 1 1 1]]
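
The two padded outputs above differ because of three settings: where the zeros go (`padding`), which end is cut when a sequence is too long (`truncating`), and the target length (`maxlen`). The pure-Python sketch below mimics that behaviour so the rules are easy to read. `pad` is a hypothetical helper written for illustration, not the `pad_sequences` source, and it returns plain lists rather than an array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper mimicking the padding rules described above."""
    if maxlen is None:
        # No explicit length: use the longest sequence in the batch.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops tokens from the start, "post" from the end.
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]
            else:
                seq = seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


sequences = [[5, 3, 2, 4], [5, 3, 2, 8], [6, 3, 2, 4],
             [9, 6, 10, 2, 4, 11, 12, 7, 13, 14, 7, 15]]
test_seq = [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1, 1, 1, 1, 1, 1, 1]]

post_padded = pad(sequences, padding="post", maxlen=7)
pre_padded = pad(test_seq)

for row in post_padded:
    print(row)
for row in pre_padded:
    print(row)
```

Note that with `maxlen=7` the long sentence keeps only its last seven tokens. If the start of a sentence matters more for your task, truncating at the end instead may be the better choice.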
