## Tokenizer and padding

In NLP, the input data is text, but neural networks operate on numbers and cannot consume raw strings. A tokenizer bridges this gap by mapping each word to an integer.

Here is a [link to the documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
#some sentences
sentences = [
    "I am learning Tensorflow",
    "I am learning Keras"
]

In [5]:
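#initializing the tokenizer; num_words caps how many of the most frequent words are kept when encoding texts
tokenizer = Tokenizer(num_words = 100)
#fitting the tokenizer on the sentences builds the vocabulary
tokenizer.fit_on_texts(sentences)
#word_index maps each word to its integer index
word_index = tokenizer.word_index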

In [6]:
#Let's print it
word_index

{'i': 1, 'am': 2, 'learning': 3, 'tensorflow': 4, 'keras': 5}

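So, each word is mapped to an integer index. Words are lowercased first, indices start at 1, and more frequent words receive lower indices.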
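
To make the mapping less opaque, here is a rough pure-Python sketch of how such a word index can be built. It is an approximation, not the Keras implementation: the helper name `build_word_index` is hypothetical, and it assumes lowercasing, whitespace splitting, and frequency-ordered indices starting at 1, with ties broken by first appearance.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: approximate a frequency-ordered word index."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace (no punctuation handling).
        counts.update(text.lower().split())
    # most_common() sorts by count and keeps first-seen order for ties.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = ["I am learning Tensorflow", "I am learning Keras"]
print(build_word_index(sentences))
```

For the two example sentences, this sketch yields the same mapping as the `word_index` printed above.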

### Dealing with "Out of Vocabulary" words

Words that do not appear in the training data can be encoded as a single, dedicated out-of-vocabulary (OOV) token.

In [13]:
#defining a new tokenizer
new_tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
new_tokenizer.fit_on_texts(sentences)

In [24]:
#printing entire sequences
print("words mapped to :", new_tokenizer.word_index, end='\n\n')
print(sentences, "mapped to", new_tokenizer.texts_to_sequences(sentences))

words mapped to : {'<OOV>': 1, 'i': 2, 'am': 3, 'learning': 4, 'tensorflow': 5, 'keras': 6}

['I am learning Tensorflow', 'I am learning Keras'] mapped to [[2, 3, 4, 5], [2, 3, 4, 6]]


In [14]:
new_sentence = ["I am learning NLP"]

In [15]:
#texts_to_sequences(texts) transforms each text in "texts" to a sequence of integers
sequence = tokenizer.texts_to_sequences(new_sentence)
new_sequence = new_tokenizer.texts_to_sequences(new_sentence)

print(new_sentence, "transformed to", sequence, "by tokenizer")
print(new_sentence, "transformed to", new_sequence , "by new tokenizer")

['I am learning NLP'] transformed to [[1, 2, 3]] by tokenizer
['I am learning NLP'] transformed to [[2, 3, 4, 1]] by new tokenizer


Hence, a tokenizer created without an `oov_token` silently drops out-of-vocabulary words. When an `oov_token` is passed, every such word is mapped to index 1 instead (make sure the token is unique, so it cannot collide with a real word). In the example above, "NLP" does not appear in `sentences`, so `tokenizer` omits it, while `new_tokenizer` encodes it as `'<OOV>'`, that is, 1.
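
The two behaviours can be mimicked with a plain dictionary lookup. This is only a sketch: `to_sequence` is a hypothetical helper, and the word indices are copied from the outputs above rather than computed by Keras.

```python
def to_sequence(text, word_index, oov_token=None):
    """Hypothetical helper: map a text to integers, with or without an OOV token."""
    sequence = []
    for word in text.lower().split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            # Unknown word: fall back to the index reserved for the OOV token.
            sequence.append(word_index[oov_token])
        # Otherwise the unknown word is silently dropped.
    return sequence


plain_index = {'i': 1, 'am': 2, 'learning': 3, 'tensorflow': 4, 'keras': 5}
oov_index = {'<OOV>': 1, 'i': 2, 'am': 3, 'learning': 4, 'tensorflow': 5, 'keras': 6}

print(to_sequence("I am learning NLP", plain_index))                   # [1, 2, 3]
print(to_sequence("I am learning NLP", oov_index, oov_token="<OOV>"))  # [2, 3, 4, 1]
```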

### Dealing with variable-sized sentences

In [26]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [25]:
sentences = [
    "I am dancing", 
    "I am singing", 
    "Will you dance with me?"
]

In [29]:
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
sequences

[[1, 2, 4], [1, 2, 5], [6, 7, 8, 9, 10]]

In [35]:
print("Padded sequences", end = '\n\n')

#by default, zeros are added at the front, up to the length of the longest sequence
pad_sequence_1 = pad_sequences(sequences)
print('sequences with default max length')
print(pad_sequence_1, end='\n\n')

#maxlen pads every sequence to a fixed length of 8
pad_sequence_2 = pad_sequences(sequences, maxlen = 8)
print('sequences with max length = 8')
print(pad_sequence_2, end='\n\n')

#padding = 'post' adds the zeros at the end instead
pad_sequence_3 = pad_sequences(sequences, maxlen = 8, padding = 'post')
print('sequences with post padding')
print(pad_sequence_3, end='\n\n')

Padded sequences

sequences with default max length
[[ 0  0  1  2  4]
 [ 0  0  1  2  5]
 [ 6  7  8  9 10]]

sequences with max length = 8
[[ 0  0  0  0  0  1  2  4]
 [ 0  0  0  0  0  1  2  5]
 [ 0  0  0  6  7  8  9 10]]

sequences with post padding
[[ 1  2  4  0  0  0  0  0]
 [ 1  2  5  0  0  0  0  0]
 [ 6  7  8  9 10  0  0  0]]
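
As a rough mental model, the padding shown above can be written in a few lines of plain Python. The helper `pad` below is hypothetical and returns lists rather than a NumPy array. It assumes zeros as the fill value and, when a sequence is longer than `maxlen`, drops values from the front, which mirrors the default `truncating='pre'` of `pad_sequences`.

```python
def pad(sequences, maxlen=None, padding='pre', value=0):
    """Hypothetical helper: pad (or front-truncate) integer sequences to one length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Too long: keep only the last maxlen values.
            seq = seq[len(seq) - maxlen:]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


sequences = [[1, 2, 4], [1, 2, 5], [6, 7, 8, 9, 10]]
print(pad(sequences))                            # pre-padded to length 5
print(pad(sequences, maxlen=8, padding='post'))  # zeros appended at the end
print(pad(sequences, maxlen=2))                  # longer sequences lose leading values
```

The last call illustrates a case the outputs above do not cover: a `maxlen` shorter than a sequence shortens it rather than raising an error.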

