# Sequencing - Turning sentences into data
![](https://miro.medium.com/max/1218/1*zsIXWoN0_CE9PXzmY3tIjQ.png)

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
# Python list of strings

sentences=[
           'I love my Dog',
           'i love my Cat',
           'You love my Dog!',
           'Do you think my Dog is amazing!'            
]

In [3]:
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

## texts_to_sequences()
Creates a sequence of integer tokens for each sentence, replacing every word with its index from `word_index`.

In [4]:
sequences = tokenizer.texts_to_sequences(sentences)

In [5]:
print(word_index)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}


In [6]:
print(sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]
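To make the two outputs above less of a black box, here is a rough pure-Python sketch of the same idea: count the words, rank them by frequency (ties keep first-seen order), then look each word up. The helpers `simple_tokenize`, `build_word_index` and `to_sequences` are made-up names for illustration, and the regex is only an approximation of the Tokenizer's default filtering, not its actual implementation.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    # Assumption: this only approximates the Tokenizer's default filters.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # most_common() sorts by count, highest first; equal counts keep first-seen order
    ranked = [word for word, _ in counts.most_common()]
    # Indexes start at 1; 0 is left free (it is used for padding later on)
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    # Words missing from word_index are silently skipped
    return [[word_index[w] for w in simple_tokenize(t) if w in word_index]
            for t in texts]

sentences = [
    'I love my Dog',
    'i love my Cat',
    'You love my Dog!',
    'Do you think my Dog is amazing!',
]

sketch_index = build_word_index(sentences)
sketch_sequences = to_sequences(sentences, sketch_index)
print(sketch_index)
print(sketch_sequences)
```

On these four sentences the sketch reproduces the `word_index` and `sequences` printed above; on text with other punctuation or contractions it may differ from the real Tokenizer.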


This gets the data ready for a neural network.
### But how do we handle text that contains words the tokenizer has never seen before?

In [7]:
test_data = [
             'i really love my dog', # 5-word sentence
             'my dog loves my manatee'
]

# call texts_to_sequences() without calling fit_on_texts() again
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)  # words not present in word_index are skipped

[[4, 2, 1, 3], [1, 3, 1]]


Here we lost the length of the sequences, because the unseen words were dropped.<br>
To avoid this,<br>
pass the <b>oov_token</b> argument when creating the Tokenizer.

In [8]:
arr =[
           'I love my Dog',
           'i love my Cat',
           'You love my Dog!',
           'Do you think my Dog is amazing!'            
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(arr)
wi = tokenizer.word_index
seq = tokenizer.texts_to_sequences(arr)

In [9]:
print(wi)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [10]:
test_arr = [
             'i really love my dog', # 5-word sentence
             'my dog loves my manatee'
]

# call texts_to_sequences() without calling fit_on_texts() again
test_seq = tokenizer.texts_to_sequences(test_arr)
print(test_seq)  # unseen words now map to the <OOV> index (1)

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]


In [11]:
print(wi)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


In [12]:
print(wi['<OOV>'])

1


Here we still lose some meaning, because every unseen word collapses into the same token, but much less than before, and the length of each sequence is preserved.
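The OOV behaviour boils down to a dictionary lookup with a fallback. The snippet below is a minimal sketch of that idea, not the Tokenizer's real code: `encode` is a hypothetical helper, the index is copied from the output above, and splitting on whitespace is an assumption that only works because the test sentences contain no punctuation.

```python
# Index copied from the printed word_index above
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode(sentence, word_index, oov_token='<OOV>'):
    oov_id = word_index[oov_token]
    # Unknown words fall back to the OOV index instead of being dropped,
    # so the output has one entry per input word
    return [word_index.get(word, oov_id) for word in sentence.lower().split()]

encoded = [encode(s, word_index) for s in ['i really love my dog',
                                           'my dog loves my mantee']]
print(encoded)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

Each `1` marks a position where a word was unknown, which is why the sequence length survives even though the word itself does not.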

But how can we handle sentences of different lengths when training a NN?<br>The advanced solution is <b>RaggedTensor</b>,<br>but the simple solution is <b>padding</b>.

In [13]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
arr1 =[
           'I love my Dog',
           'i love my Cat',
           'You love my Dog!',
           'Do you think my Dog is amazing!'            
]
tokenizer1 = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer1.fit_on_texts(arr1)
wi1 = tokenizer1.word_index
seq1 = tokenizer1.texts_to_sequences(arr1)

In [15]:
# finds the length of the longest sequence and pads every shorter sequence
# with zeros (at the front by default) so that all sequences have equal length
padded = pad_sequences(seq1)

In [16]:
print(wi1)
print(seq1)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


In [17]:
print(padded)

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


In [18]:
padded = pad_sequences(seq1, padding='post') # if you want zeros after the sequence
padded

array([[ 5,  3,  2,  4,  0,  0,  0],
       [ 5,  3,  2,  7,  0,  0,  0],
       [ 6,  3,  2,  4,  0,  0,  0],
       [ 8,  6,  9,  2,  4, 10, 11]], dtype=int32)

In [19]:
padded = pad_sequences(seq1, maxlen=5, truncating='post') # keep at most 5 tokens per sequence and drop tokens from the end (post-truncation)
                                                          # the default is pre-truncation, which drops tokens from the start
padded

array([[0, 5, 3, 2, 4],
       [0, 5, 3, 2, 7],
       [0, 6, 3, 2, 4],
       [8, 6, 9, 2, 4]], dtype=int32)
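The interaction of `padding`, `truncating` and `maxlen` is easier to see in plain Python. Below is a simplified sketch of the logic, assuming integer sequences and `maxlen > 0`; the function `pad` is a hypothetical stand-in that returns nested lists rather than a NumPy array, and it is not the library's implementation.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    # Without maxlen, pad everything to the longest sequence
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    result = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end
            s = s[-maxlen:] if truncating == 'pre' else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        # 'pre' puts the zeros in front, 'post' puts them after the tokens
        result.append(fill + s if padding == 'pre' else s + fill)
    return result

seq1 = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad(seq1, maxlen=5)[-1])                     # [9, 2, 4, 10, 11]
print(pad(seq1, maxlen=5, truncating='post')[-1])  # [8, 6, 9, 2, 4]
print(pad(seq1, padding='post')[0])                # [5, 3, 2, 4, 0, 0, 0]
```

Note that padding and truncation are chosen independently: in the last notebook cell the long sentence is cut from the end while the short ones are still padded at the front.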