# Creating sequences of tokens

This notebook shows how to convert the words in a sentence into sequences of integer tokens.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer

## Define training sentences in a list

In [2]:
# Define the list of sentences to tokenize
train_sentences = [
    'It is a sunny day',
    'It is a cloudy day',
    'Will it rain today?',
]

## Train the tokenizer

In [3]:
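# Set up the tokenizer; with num_words=100, only the 99 most frequent
# words are kept when texts are converted to sequences
tokenizer = Tokenizer(num_words=100)

# Fit the tokenizer on the training sentences to build its vocabulary
tokenizer.fit_on_texts(train_sentences)

# Store the word-to-index mapping learned from the training sentences
word_index = tokenizer.word_index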
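
The word index assigns lower integers to more frequent words. The sketch below is a rough pure-Python approximation of that idea, not the Keras implementation: `build_word_index` is a hypothetical helper, and the regular expression only approximates the tokenizer's default lowercasing and punctuation filtering.

```python
import re
from collections import Counter


def build_word_index(sentences):
    """Hypothetical stand-in for fit_on_texts: rank words by frequency."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase, then keep runs of letters, digits, and apostrophes
        # (an assumption that roughly mimics the default filters).
        counts.update(re.findall(r"[a-z0-9']+", sentence.lower()))
    # most_common() sorts by descending count; ties keep first-seen order.
    # Indices start at 1, leaving 0 unused.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


train_sentences = [
    'It is a sunny day',
    'It is a cloudy day',
    'Will it rain today?',
]

print(build_word_index(train_sentences))
```

For these three sentences, the sketch yields the same mapping that the fitted tokenizer reports: 'it' appears three times and gets index 1, while words that appear once are numbered in the order they were first seen.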

## Create sequences

In [4]:
# Convert each training sentence into a sequence of word indices
sequences = tokenizer.texts_to_sequences(train_sentences)


In [5]:
# Print the word index dictionary and the sequences
print(f"Word index -->{tokenizer.word_index}")
print(f"Sequences of words -->{sequences}")

Word index -->{'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}
Sequences of words -->[[1, 2, 3, 5, 4], [1, 2, 3, 6, 4], [7, 1, 8, 9]]


In [6]:
# Print a sample sentence and its sequence
print(train_sentences[0])
print(sequences[0])

It is a sunny day
[1, 2, 3, 5, 4]


## Tokenizing new data using the same tokenizer

In [7]:
new_sentences = [
    'Will it be raining today?',
    'It is a pleasant day.',
]

In [8]:
new_sequences = tokenizer.texts_to_sequences(new_sentences)

In [9]:
print(new_sentences)
print(new_sequences)

['Will it be raining today?', 'It is a pleasant day.']
[[7, 1, 9], [1, 2, 3, 4]]
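
Note that the words 'be', 'raining', and 'pleasant' are missing from the sequences above: without an out-of-vocabulary token, words that were not seen during fitting are silently dropped. The following pure-Python sketch illustrates that lookup; `to_sequences` is a hypothetical helper, the word index is copied from the output earlier in this notebook, and the regular expression only approximates the tokenizer's default text cleaning.

```python
import re

# Word index copied from the tokenizer output shown earlier
word_index = {'it': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5,
              'cloudy': 6, 'will': 7, 'rain': 8, 'today': 9}


def to_sequences(sentences, word_index):
    """Hypothetical stand-in for texts_to_sequences without an OOV token."""
    sequences = []
    for sentence in sentences:
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        # Words missing from the index are skipped rather than encoded
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


new_sentences = [
    'Will it be raining today?',
    'It is a pleasant day.',
]

print(to_sequences(new_sentences, word_index))
# [[7, 1, 9], [1, 2, 3, 4]]
```

Because unknown words vanish, the sequence length no longer matches the sentence length, which is the motivation for reserving a dedicated out-of-vocabulary index.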


## Replacing newly encountered words with special values

In [10]:
# Set up the tokenizer again, this time with an out-of-vocabulary token
tokenizer = Tokenizer(num_words=100, oov_token="<oov>")

# Fit the new tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)

# Store the word-to-index mapping, which now includes the OOV token
word_index = tokenizer.word_index

In [12]:
# Create sequences for the new sentences
new_sequences = tokenizer.texts_to_sequences(new_sentences)
print(word_index)
print(new_sentences)
print(new_sequences)

{'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6, 'cloudy': 7, 'will': 8, 'rain': 9, 'today': 10}
['Will it be raining today?', 'It is a pleasant day.']
[[8, 2, 1, 1, 10], [2, 3, 4, 1, 5]]
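
To check what the tokenizer actually encoded, it can help to map the integers back to words. The sketch below does this in plain Python by inverting the word index; the dictionary and sequences are copied from the output above, and `index_word` and `decoded` are illustrative names chosen here.

```python
# Word index and sequences copied from the output above
word_index = {'<oov>': 1, 'it': 2, 'is': 3, 'a': 4, 'day': 5, 'sunny': 6,
              'cloudy': 7, 'will': 8, 'rain': 9, 'today': 10}
new_sequences = [[8, 2, 1, 1, 10], [2, 3, 4, 1, 5]]

# Invert the mapping so each integer points back to its word
index_word = {index: word for word, index in word_index.items()}

# Rebuild each sentence from its sequence of indices
decoded = [" ".join(index_word[i] for i in seq) for seq in new_sequences]
print(decoded)
# ['will it <oov> <oov> today', 'it is a <oov> day']
```

Every unseen word ('be', 'raining', 'pleasant') now maps to index 1, so each sequence keeps the same length as its sentence, at the cost of losing which unknown word was there.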
