<a href="https://colab.research.google.com/github/dswh/lil_nlp_with_tensorflow/blob/main/01_03_begin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating sequences of tokens

This notebook shows how to turn the words of a sentence into sequences of integer tokens using the Keras `Tokenizer`.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer



## Define training sentences in a list

In [2]:
##define list of sentences to tokenize
train_sentences = [
    'It is a sunny day',
    'It is a cloudy day',
    'Will it rain today?',
]

## Train the tokenizer

In [4]:
##set up the tokenizer; with num_words=100, only words whose index is below 100 are kept when creating sequences
tokenizer = Tokenizer(num_words=100)

##train the tokenizer on training sentences
tokenizer.fit_on_texts(train_sentences)

##store the word index (word -> integer mapping) learned from the training sentences
word_index = tokenizer.word_index

## Create sequences

In [7]:
##create sequences using tokenizer
sequences = tokenizer.texts_to_sequences(train_sentences)

In [8]:
##print word index dictionary and sequences
print(f"Word index -->{word_index}")
print(f"Sequences of words -->{sequences}")

Word index -->{'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}
Sequences of words -->[[1, 2, 3, 5, 4], [1, 2, 3, 6, 4], [7, 1, 8, 9]]
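
The indices above follow word frequency: `it` occurs three times and gets index 1, while words with equal counts keep the order in which they were first seen. The sketch below reproduces that ranking in plain Python. It is a simplified approximation, not the Keras implementation; `simple_tokenize` and `build_word_index` are hypothetical helpers, and `FILTERS` is assumed to match the Tokenizer's default `filters` argument.

```python
from collections import Counter

# Punctuation and whitespace characters removed before splitting.
# Assumption: this mirrors the Keras Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Hypothetical helper: lowercase, turn filtered characters into spaces, split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(sentences):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(simple_tokenize(sentence))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Numbering starts at 1; index 0 is never assigned to a word.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


train_sentences = [
    'It is a sunny day',
    'It is a cloudy day',
    'Will it rain today?',
]

sketch_index = build_word_index(train_sentences)
sketch_sequences = [
    [sketch_index[word] for word in simple_tokenize(sentence)]
    for sentence in train_sentences
]
print(sketch_index)
print(sketch_sequences)
```

For these three sentences the sketch yields the same word index and sequences as the tokenizer output above.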


In [9]:
##print sample sentence and sequence
print(train_sentences[0])
print(sequences[0])

It is a sunny day
[1, 2, 3, 5, 4]


## Tokenizing new data using the same tokenizer

In [10]:
new_sentences = [
    'Will it be raining today?',
    'It is a pleasant day.',
]

In [11]:
##reuse the tokenizer fitted on the training sentences, without refitting
new_sequences = tokenizer.texts_to_sequences(new_sentences)

In [12]:
print(new_sentences)
print(new_sequences)

['Will it be raining today?', 'It is a pleasant day.']
[[7, 1, 9], [1, 2, 3, 4]]
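
Both sequences are shorter than their sentences: `be`, `raining`, and `pleasant` were never seen during fitting, so they are silently skipped. The plain-Python sketch below illustrates that lookup using the word index printed earlier. It is an approximation of the behaviour, not the library code; `simple_tokenize` is a hypothetical helper and `FILTERS` is assumed to match the Tokenizer's default `filters`.

```python
# Assumption: this mirrors the Keras Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Hypothetical helper: lowercase, turn filtered characters into spaces, split."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


# Word index copied from the output of the fitted tokenizer.
word_index = {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5,
              'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}

new_sentences = [
    'Will it be raining today?',
    'It is a pleasant day.',
]

# Keep only the words that exist in the vocabulary; unknown words vanish.
kept = [
    [word_index[word] for word in simple_tokenize(sentence) if word in word_index]
    for sentence in new_sentences
]

# Collect the words that were skipped, to make the information loss visible.
dropped = [
    [word for word in simple_tokenize(sentence) if word not in word_index]
    for sentence in new_sentences
]

print(kept)
print(dropped)
```

Because the skipped words leave no trace, the resulting sequence alone cannot tell you that a word was missing or where it stood.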


## Replacing newly encountered words with special values

In [13]:
##set up the tokenizer again with an out-of-vocabulary (OOV) token for unseen words
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")

##train the new tokenizer on training sentences
tokenizer.fit_on_texts(train_sentences)

##store the word index (word -> integer mapping) learned from the training sentences
word_index = tokenizer.word_index

In [14]:
##create sequences of the new sentences
new_sequences = tokenizer.texts_to_sequences(new_sentences)
print(word_index)
print(new_sequences)

{'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6, 'cloudy': 7, 'will': 8, 'rain': 9, 'today': 10}
[[8, 2, 1, 1, 10], [2, 3, 4, 1, 5]]
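
With the OOV token, every unseen word maps to index 1, so the sequences keep their original length and word positions. One way to check this is to invert the word index and decode the sequences back into text. The sketch below does so in plain Python with the values printed above; `index_word` and `decoded` are illustrative names. The Keras `Tokenizer` also offers `sequences_to_texts` for the same purpose.

```python
# Word index and sequences copied from the output above.
word_index = {'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6,
              'cloudy': 7, 'will': 8, 'rain': 9, 'today': 10}
new_sequences = [[8, 2, 1, 1, 10], [2, 3, 4, 1, 5]]

# Invert the mapping: integer index -> word.
index_word = {index: word for word, index in word_index.items()}

# Decode each sequence; unseen words show up as the <oov> placeholder.
decoded = [
    " ".join(index_word[index] for index in sequence)
    for sequence in new_sequences
]
print(decoded)
```

The decoded text shows `<oov>` where `be`, `raining`, and `pleasant` stood in the new sentences.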
