## Padding

In this notebook we add __padding__ to sequences of unequal length. A model expects every input to have the same shape, so we __pad__ the shorter sequences with a special value (here the integer 0) until they all have the same length. Later on we'll apply the same tokenizer and padding to __unseen__ data and analyse the results.
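
To make the idea concrete before using Keras, here is a minimal plain-Python sketch of front padding. The helper name `pad_pre` is hypothetical and this is not how Keras implements it; it only assumes the documented defaults of `pad_sequences`, which add zeros at the front and, when a sequence is longer than `maxlen`, drop items from the front.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) lists of integers to a common length."""
    if maxlen is None:
        # Fall back to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most the last `maxlen` items (truncate from the front).
        kept = list(seq[-maxlen:]) if maxlen > 0 else []
        # Fill the remaining positions at the front with `value`.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


example = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]
for row in pad_pre(example, maxlen=4):
    print(row)
# [0, 1, 2, 3]
# [0, 0, 4, 5]
# [8, 9, 10, 11]
```

Short sequences gain leading zeros, while the overly long one keeps only its last four items.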

Let's go ahead and do it!

In [2]:
import tensorflow as tf 
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences 
import warnings 
warnings.filterwarnings("ignore")

In [26]:
sentences = [
    "I love python",
    "I love machine learning",
    "I love natural language processing",
    "Smallest sentence"
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

# Pad the sequences with zeros so that they all have the same length.
# The "maxlen" argument sets the desired length of every sequence;
# by default the zeros are added at the front of each sequence.
padded = pad_sequences(sequences, maxlen = 5)
print("\nword_index = ", word_index)
print("\nsequences = ", sequences)
print("\nPadded Sequences: ")
print(padded)

# Try sentences containing words the tokenizer wasn't fit on
test_data = [
    "I really like python",
    "My computer is slow",
    "this is test sequence"
]

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequences = ", test_seq)

padded = pad_sequences(test_seq, maxlen = 10)
print("\nPadded Test Sequence: ")
print(padded)


word_index =  {'<OOV>': 1, 'i': 2, 'love': 3, 'python': 4, 'machine': 5, 'learning': 6, 'natural': 7, 'language': 8, 'processing': 9, 'smallest': 10, 'sentence': 11}

sequences =  [[2, 3, 4], [2, 3, 5, 6], [2, 3, 7, 8, 9], [10, 11]]

Padded Sequences: 
[[ 0  0  2  3  4]
 [ 0  2  3  5  6]
 [ 2  3  7  8  9]
 [ 0  0  0 10 11]]

Test Sequences =  [[2, 1, 1, 4], [1, 1, 1, 1], [1, 1, 1, 1]]

Padded Test Sequence: 
[[0 0 0 0 0 0 2 1 1 4]
 [0 0 0 0 0 0 1 1 1 1]
 [0 0 0 0 0 0 1 1 1 1]]
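
In the test output every word the tokenizer never saw is mapped to 1, the index of `<OOV>`, which is why the last two test sentences become identical sequences. The lookup can be sketched roughly as below; `to_sequence` is a hypothetical helper, the vocabulary is a shortened copy of the `word_index` printed above, and the real `Tokenizer` also filters punctuation, which this sketch skips.

```python
# Shortened copy of the vocabulary printed above (illustration only).
word_index = {"<OOV>": 1, "i": 2, "love": 3, "python": 4}


def to_sequence(text, word_index, oov_token="<OOV>"):
    """Map each lowercased word to its index, or to the OOV index if unknown."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]


print(to_sequence("I really like python", word_index))  # [2, 1, 1, 4]
print(to_sequence("My computer is slow", word_index))   # [1, 1, 1, 1]
```

Because all unknown words share a single index, the model cannot tell them apart; fitting the tokenizer on a larger corpus is the usual way to reduce how often this happens.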
