<a href="https://colab.research.google.com/github/souvikmajumder26/DeepLearning-AI-TensorFlow-Developer-Professional-Certificate/blob/main/Course-3-Natural-Language-Processing-in-TensorFlow/c3_week1_lab_1_2_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [LAB 1] Tokenizer *WITHOUT* OOV (Out of Vocabulary) and Padding

**Tokenizer** ignores punctuation and capitalization by default, so 'dog' and 'Dog!' are treated as the same word.
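To see why 'dog' and 'Dog!' collapse into one entry, the cleanup step can be approximated in plain Python. This is only a sketch: `normalize` is a hypothetical helper, not part of Keras, and it assumes the Tokenizer's default settings (`lower=True`, splitting on spaces, and the default `filters` string).

```python
# Default value of the Tokenizer's `filters` argument (the apostrophe is not included).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def normalize(text):
    """Hypothetical helper: lowercase, turn filtered characters into spaces, then split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

print(normalize('Dog!'))            # ['dog']
print(normalize('I, love my cat'))  # ['i', 'love', 'my', 'cat']
```

Because both spellings reduce to the same lowercase word, they share a single token.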

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
             'i love my dog',
             'I, love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)

# word_index is a dict that maps each word to its integer token;
# more frequent words get smaller numbers
word_index = tokenizer.word_index

print(word_index)


{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
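The numbering in `word_index` follows word frequency: 'my' appears in all four sentences and gets 1, while words that appear once come last. A rough pure-Python sketch of that ranking is shown here. `simple_words` and `sketch_index` are hypothetical names used for illustration, and the cleanup handles only the punctuation that occurs in these sentences.

```python
from collections import Counter

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

def simple_words(text):
    """Hypothetical stand-in for the Tokenizer's cleanup: drop , ! ? then lowercase and split."""
    for ch in ',!?':
        text = text.replace(ch, ' ')
    return text.lower().split()

# Count every word across all sentences (Counter remembers first-seen order).
counts = Counter(word for s in sentences for word in simple_words(s))

# Rank by descending count; most_common() keeps first-seen order for ties.
# Numbering starts at 1, so 0 is never assigned to a word.
sketch_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

print(sketch_index)
```

For these four sentences the sketch yields the same mapping as the `word_index` printed by the Tokenizer.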


**Text to Sequence**: Convert the training sentences into lists of tokens (numbers) that represent the words, to visualize the training data.

In [None]:
sequences = tokenizer.texts_to_sequences(sentences)
# sentences converted into sequences of tokens that represent the words
print (sequences)

[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


**Testing** on new sentences: Convert new sentences into sequences of tokens using the word-to-token dictionary built from the training sentences.

In [None]:
test_data = [
             'i really love my dog',
             'My dog loves amazing food'
]

test_seq = tokenizer.texts_to_sequences(test_data)
# new sentences are converted into sequences of tokens, but words that were
# not in the training sentences (and so are not in word_index) are dropped
print (test_seq)

[[4, 2, 1, 3], [1, 3, 10]]


'really', 'loves' and 'food' are lost because they have no token in word_index.
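The difference between dropping unknown words and replacing them with an OOV token can be sketched as a dictionary lookup. `to_sequences` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes the input has no punctuation, which holds for these two test sentences.

```python
def to_sequences(texts, word_index, oov_token=None):
    """Hypothetical helper: look each word up; unknown words are dropped
    unless oov_token is given, in which case they map to its index."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    result = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            index = word_index.get(word, oov_index)
            if index is not None:
                seq.append(index)
        result.append(seq)
    return result

word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
test_data = ['i really love my dog', 'My dog loves amazing food']

# Without an OOV token the unknown words vanish and the sentences shrink.
print(to_sequences(test_data, word_index))
# [[4, 2, 1, 3], [1, 3, 10]]

# Same vocabulary shifted by one to make room for '<OOV>' at index 1.
word_index_oov = {'<OOV>': 1, **{w: i + 1 for w, i in word_index.items()}}
print(to_sequences(test_data, word_index_oov, oov_token='<OOV>'))
# [[5, 1, 3, 2, 4], [2, 4, 1, 11, 1]]
```

With the OOV token each sequence keeps one entry per word, so its length still matches the sentence length.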

# [LAB 2] Tokenizer *WITH* OOV (Out of Vocabulary) and Padding

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
             'i love my dog',
             'I, love my cat',
             'You love my dog!',
             'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)

# word_index is a dict that maps each word to its integer token;
# '<OOV>' takes index 1, and more frequent words get smaller numbers
word_index = tokenizer.word_index
print(word_index, '\n')

sequences = tokenizer.texts_to_sequences(sentences)
# sentences converted into sequences of tokens that represent the words
print (sequences, '\n')

# pad_sequences returns a 2D NumPy array; padding = 'post' appends zeros (the default is 'pre'),
# and maxlen defaults to the length of the longest sequence
padded_train_seq_1 = pad_sequences(sequences, padding = 'post', maxlen = 8)
print(padded_train_seq_1, '\n')

# truncating = 'post' cuts tokens from the end of sequences longer than maxlen (the default is 'pre');
# the longest sequence here has 7 tokens, so maxlen = 5 drops its last two
padded_train_seq_2 = pad_sequences(sequences, padding = 'post', truncating = 'post', maxlen = 5)
print(padded_train_seq_2, '\n')

test_data = [
             'i really love my dog',
             'My dog loves amazing food'
]

test_seq = tokenizer.texts_to_sequences(test_data)
# new sentences are converted into sequences of tokens; words that are not in
# word_index are mapped to the '<OOV>' token (1) instead of being dropped
print (test_seq, '\n')

padded_test_seq_1 = pad_sequences(test_seq, padding = 'post', truncating = 'post', maxlen = 10)
print (padded_test_seq_1, '\n')

padded_test_seq_2 = pad_sequences(test_seq, padding = 'post', truncating = 'post', maxlen = 3)
print (padded_test_seq_2)


{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11} 

[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]] 

[[ 5  3  2  4  0  0  0  0]
 [ 5  3  2  7  0  0  0  0]
 [ 6  3  2  4  0  0  0  0]
 [ 8  6  9  2  4 10 11  0]] 

[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]] 

[[5, 1, 3, 2, 4], [2, 4, 1, 11, 1]] 

[[ 5  1  3  2  4  0  0  0  0  0]
 [ 2  4  1 11  1  0  0  0  0  0]] 

[[5 1 3]
 [2 4 1]]
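The padding and truncating options used above can be sketched in plain Python. `pad` is a hypothetical helper, not the Keras function: it returns nested lists rather than a NumPy array and assumes `maxlen` is at least 1.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical sketch of padding/truncating; assumes maxlen >= 1 when given."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' keeps the last maxlen tokens, 'post' keeps the first maxlen.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # Then pad: 'pre' puts the fill in front, 'post' puts it at the end.
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

test_seq = [[5, 1, 3, 2, 4], [2, 4, 1, 11, 1]]

print(pad(test_seq, maxlen=3, padding='post', truncating='post'))
# [[5, 1, 3], [2, 4, 1]]

print(pad(test_seq, maxlen=3))
# [[3, 2, 4], [1, 11, 1]]

print(pad([[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]))
# [[0, 0, 0, 5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
```

The second call shows the 'pre' defaults: the same sentences keep their last three tokens instead of their first three.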
