
In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love dog!',
    'Do you think my dog is amazing?']

In [3]:
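# num_words limits encoding to the most frequent words (indices below 100);
# oov_token is the placeholder used for words the tokenizer has not seen
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# fit_on_texts builds the vocabulary (word -> index) from the sentences
tokenizer.fit_on_texts(sentences)

# word_index maps each word to its integer index;
# notice how the tokenizer lowercased the words and ignored punctuation
word_index = tokenizer.word_index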

print(word_index)

{'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
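
To make the ordering in word_index less mysterious, here is a rough pure-Python sketch of how such an index can be built: lowercase the text, drop punctuation, count the words, and number them from most to least frequent, with index 1 reserved for the OOV token. build_word_index is a hypothetical helper written for illustration, not the Keras implementation, and its regular expression only approximates the Tokenizer's default filters.

```python
import re
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Hypothetical helper: approximate a frequency-ranked word index."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits and apostrophes,
        # which drops punctuation such as '!' and '?'
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))

    # Index 1 is reserved for the OOV token; index 0 is left free for padding
    word_index = {oov_token: 1}
    # most_common() sorts by count; words with equal counts keep first-seen order
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

sentences = [
    'I love my dog',
    'I love my cat',
    'You love dog!',
    'Do you think my dog is amazing?']

print(build_word_index(sentences))
```

For the four sentences used here, the sketch yields the same mapping as the printed word_index: 'love', 'my' and 'dog' each occur three times, so they receive the lowest indices after the OOV token.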


A sentence is represented as a sequence of numbers in the same order as its words, and these sequences are what the neural network is trained on.<br>
tokenizer.texts_to_sequences(sentences) does the job for us and turns each sentence into a sequence of tokens.

In [4]:
sequences = tokenizer.texts_to_sequences(sentences)

In [5]:
print(word_index)

{'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [6]:
print(sequences)

[[5, 2, 3, 4], [5, 2, 3, 7], [6, 2, 4], [8, 6, 9, 3, 4, 10, 11]]
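
A quick way to check the sequences is to map the numbers back to words. The sketch below does this with a plain dictionary, assuming the word_index printed above; decode is a hypothetical helper name, not part of Keras.

```python
# word_index copied from the printed output above
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

# Invert the mapping: index -> word
index_word = {index: word for word, index in word_index.items()}

def decode(sequence):
    """Hypothetical helper: turn a list of indices back into lowercase text."""
    return " ".join(index_word[i] for i in sequence)

sequences = [[5, 2, 3, 4], [5, 2, 3, 7], [6, 2, 4], [8, 6, 9, 3, 4, 10, 11]]
for seq in sequences:
    print(seq, "->", decode(seq))
```

The decoded text comes back lowercase and without punctuation, since that information is discarded during tokenization.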


#### pad_sequences pads or truncates the sequences so that they all have the same length

In [7]:
padded = pad_sequences(sequences, padding='post', truncating='post',
                      maxlen=5)

In [8]:
print(padded)

[[5 2 3 4 0]
 [5 2 3 7 0]
 [6 2 4 0 0]
 [8 6 9 3 4]]


##### what happens when we need to classify words which the neural network hasn't learned?
Let's try it below.

In [9]:
test_data = [
    'i really love my dog',
    'my dog loves my mantee'
]

In [10]:
test_seq = tokenizer.texts_to_sequences(test_data)

In [11]:
print(test_seq)

[[5, 1, 2, 3, 4], [3, 4, 1, 3, 1]]


Note: the tokenizer used the < OOV > token (index 1) for out-of-vocabulary words, i.e. words that weren't in the text the tokenizer was trained on.<br>
If oov_token hadn't been specified, the tokenizer would simply have ignored those words and left them out of the sequence.
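
The difference between the two cases can be sketched with a plain dictionary lookup. This assumes the word_index printed earlier and simple whitespace splitting; encode is a hypothetical helper, not the Keras method.

```python
# word_index copied from the printed output earlier in the notebook
word_index = {'<OOV>': 1, 'love': 2, 'my': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode(text, word_index, oov_token=None):
    """Hypothetical helper: look up each word, with or without an OOV token."""
    sequence = []
    for word in text.lower().split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            # Unknown word: substitute the OOV index so its position is kept
            sequence.append(word_index[oov_token])
        # With no OOV token, the unknown word is silently dropped
    return sequence

test_data = [
    'i really love my dog',
    'my dog loves my mantee'
]

with_oov = [encode(t, word_index, oov_token="<OOV>") for t in test_data]
without_oov = [encode(t, word_index) for t in test_data]

print(with_oov)     # [[5, 1, 2, 3, 4], [3, 4, 1, 3, 1]]
print(without_oov)  # [[5, 2, 3, 4], [3, 4, 3]]
```

With the OOV token the sequences keep the original sentence length; without it they shrink, and the positions of the unknown words are lost.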

#### set oov_token (< OOV >) to handle words that weren't used to train the tokenizer

**pad_sequences** can be used to pad the sequences so that all sentences have equal length, putting zeros where there is no word

**maxlen**, **truncating**: in pad_sequences, maxlen sets the length that every sentence is brought to, and truncating chooses which end is cut when a sentence is longer than maxlen; the tail is truncated if truncating is set to 'post'
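
The effect of these options can be sketched in plain Python. pad below is a hypothetical stand-in for pad_sequences that returns nested lists instead of a NumPy array and supports only the arguments discussed here; the 'pre' variants are included for contrast.

```python
def pad(sequences, maxlen, padding="post", truncating="post", value=0):
    """Hypothetical stand-in for pad_sequences, using plain Python lists."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'post' cuts the tail, 'pre' cuts the head
            seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
        fill = [value] * (maxlen - len(seq))
        # 'post' appends the zeros, 'pre' puts them in front
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded

sequences = [[5, 2, 3, 4], [5, 2, 3, 7], [6, 2, 4], [8, 6, 9, 3, 4, 10, 11]]

for row in pad(sequences, maxlen=5):
    print(row)

# Same data, but cutting the head of the long sentence instead of the tail
print(pad(sequences, maxlen=5, truncating="pre")[-1])  # [9, 3, 4, 10, 11]
```

pad_sequences itself defaults to 'pre' for both padding and truncating, so 'post' has to be requested explicitly, as in the cell above.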