<a href="https://colab.research.google.com/github/sodiq-sulaimon/Preparations-for-TensorFlow-Developer-Certification/blob/NLP/Tokenizer_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenizer Basics

This notebook prepares text data for training a neural network: it extracts a vocabulary from a corpus and defines how each text is converted into a numerical representation. These representations are called tokens.

### Generating the vocabulary

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'I love my cat',
    'i love my dog', # using lowercase for i to show that the Tokenizer is case insensitive by default
    'You love my dog!'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=100)

# Map each word in the corpus to an index
tokenizer.fit_on_texts(sentences)

# Get the word index and print it
word_idx = tokenizer.word_index
print(word_idx)


{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
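
As a rough illustration of what `fit_on_texts` does, the following pure-Python sketch lowercases each sentence, replaces punctuation with spaces, counts the words, and ranks them by frequency (words with equal counts keep first-seen order). The names `build_word_index` and `FILTERS` are made up for this sketch, and `FILTERS` only approximates the Tokenizer's default filter characters; this is not the library's actual implementation.

```python
from collections import Counter

# Punctuation to strip; an approximation of the Tokenizer's default `filters` (assumption)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, lower=True, filters=FILTERS):
    """Hypothetical helper: map each word to a 1-based rank by descending frequency."""
    # Every filtered character becomes a space so it also acts as a word separator
    table = str.maketrans({ch: ' ' for ch in filters})
    counts = Counter()
    for text in texts:
        if lower:
            text = text.lower()
        counts.update(text.translate(table).split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ['I love my cat', 'i love my dog', 'You love my dog!']
word_index = build_word_index(sentences)
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Indices start at 1, so 0 never refers to a word and stays free to act as a padding value.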


In [14]:
# Playing around with the Tokenizer parameters
sentences = [
    'I love my cat',
    'i love /my dog',
    'You ?love my dog!'
]

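# Initialize the Tokenizer: keep the original case (lower=False) and strip only '?' (filters='?')
# Note: num_words does not shrink word_index; it only takes effect in texts_to_sequences
tokenizer = Tokenizer(num_words=1, lower=False, filters='?')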

# Map each word in the corpus to an index
tokenizer.fit_on_texts(sentences)

# Get the word index and print it
word_idx = tokenizer.word_index
print(word_idx)

{'love': 1, 'my': 2, 'I': 3, 'cat': 4, 'i': 5, '/my': 6, 'dog': 7, 'You': 8, 'dog!': 9}
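
Note that `word_index` still lists all nine words even though `num_words=1` was passed. In Keras, `num_words` is applied later, when texts are converted to sequences: only words whose index is below `num_words` are kept. The sketch below mimics that filtering in plain Python; `to_sequences` is a hypothetical helper, not a Keras function, and it assumes the input is already split by single spaces with nothing left to filter.

```python
def to_sequences(texts, word_index, num_words=None):
    """Hypothetical helper: look up each word and keep only indices below num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            idx = word_index.get(word)
            if idx is None:
                continue  # unknown word: dropped (no OOV token in this sketch)
            if num_words is not None and idx >= num_words:
                continue  # index is not below num_words: dropped
            seq.append(idx)
        sequences.append(seq)
    return sequences

# Word index printed by the cell above (lower=False, filters='?')
word_index = {'love': 1, 'my': 2, 'I': 3, 'cat': 4, 'i': 5,
              '/my': 6, 'dog': 7, 'You': 8, 'dog!': 9}
texts = ['I love my cat', 'i love /my dog']

full = to_sequences(texts, word_index)
top_two = to_sequences(texts, word_index, num_words=3)
none_kept = to_sequences(texts, word_index, num_words=1)
print(full)       # [[3, 1, 2, 4], [5, 1, 6, 7]]
print(top_two)    # [[1, 2], [1]]
print(none_kept)  # [[], []]
```

With `num_words=1` no index is below 1, so every sequence comes out empty. A value such as 100 keeps the whole vocabulary of a corpus this small.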


# Generating Sequences and Padding

Converting the input texts into sequences of tokens.

### Text to Sequences
Use the word index generated by the tokenizer to convert each text into a sequence of tokens.

In [34]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define input sentences
sentences = [
    'I love my cat',
    'I love my dog',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

# Map each word in the corpus to an index
tokenizer.fit_on_texts(sentences)

# Get the word index
word_idx = tokenizer.word_index

# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

print('\nWord Index:', word_idx)
print('\nSequences:', sequences)



Word Index: {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences: [[5, 3, 2, 7], [5, 3, 2, 4], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


### Padding
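The sequences have to be padded to a uniform length before they can be fed into a deep learning model. Here `padding='post'` appends zeros at the end of short sequences, and `truncating='post'` cuts sequences longer than `maxlen` from the end.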

In [35]:
padded = pad_sequences(sequences=sequences, padding='post', truncating='post', maxlen=5)

print('\nPadded sequences:')
print(padded)


Padded sequences:
[[5 3 2 7 0]
 [5 3 2 4 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
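
To make the `padding` and `truncating` options concrete, here is a minimal pure-Python sketch of the same idea. The helper `pad` is hypothetical and returns plain lists rather than a NumPy array; it mirrors the Keras defaults of `'pre'` for both options but is not the library's implementation, and it assumes `maxlen` is positive.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical helper: bring every sequence to exactly maxlen items."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)  # default: longest sequence
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'post' keeps the first maxlen items, 'pre' keeps the last maxlen items
            seq = seq[:maxlen] if truncating == 'post' else seq[-maxlen:]
        fill = [value] * (maxlen - len(seq))
        # 'post' appends the filler, 'pre' puts it in front
        padded.append(seq + fill if padding == 'post' else fill + seq)
    return padded

sequences = [[5, 3, 2, 7], [5, 3, 2, 4], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

post = pad(sequences, maxlen=5, padding='post', truncating='post')
pre = pad(sequences, maxlen=5)
print(post)  # [[5, 3, 2, 7, 0], [5, 3, 2, 4, 0], [6, 3, 2, 4, 0], [8, 6, 9, 2, 4]]
print(pre)   # [[0, 5, 3, 2, 7], [0, 5, 3, 2, 4], [0, 6, 3, 2, 4], [9, 2, 4, 10, 11]]
```

With the `'pre'` defaults, the last sentence would lose its first two tokens instead of its last two, so the choice of `truncating` decides which part of a long text the model gets to see.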


### Out of vocabulary tokens
The OOV token (index 1 in the word index above) stands in for any input word that is not found in the `word_index` dictionary.

In [41]:
# Try sentences containing words the tokenizer was not fitted on

test_data = [
    'I really love my dog',
    'My dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)

print('Test sequence:', test_seq)

padded_test = pad_sequences(test_seq, padding='post')
print('\nPadded test sequence:')
print(padded_test)

Test sequence: [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]

Padded test sequence:
[[5 1 3 2 4]
 [2 4 1 2 1]]
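
The lookup behind this result can be sketched in plain Python. The helper `texts_to_ids` is hypothetical and assumes lowercase, whitespace-separated words; it only illustrates the difference between having an OOV token and not having one, and is not how Keras is implemented.

```python
def texts_to_ids(texts, word_index, oov_token=None):
    """Hypothetical helper: map words to indices, using the OOV index for unknown words."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_index)
            if idx is not None:  # without an OOV token, unknown words are skipped
                seq.append(idx)
        sequences.append(seq)
    return sequences

# Word index produced by the tokenizer fitted earlier in this notebook
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
test_data = ['I really love my dog', 'My dog loves my manatee']

with_oov = texts_to_ids(test_data, word_index, oov_token='<OOV>')
without_oov = texts_to_ids(test_data, word_index)
print(with_oov)     # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
print(without_oov)  # [[5, 3, 2, 4], [2, 4, 2]]
```

Without an OOV token the unknown words vanish and the sequences get shorter, which hides the fact that information was lost. Note also that `loves` maps to the OOV index even though `love` is in the vocabulary, because words are matched exactly with no stemming.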
