In [1]:
# Install the tensorflow library
!pip install tensorflow



In [2]:
# Import the tensorflow library and the Tokenizer class
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Text to Sequences

In [3]:
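# Initialize the Tokenizer; num_words=10 keeps only words whose index is below 10
# (at most the 9 most frequent words) when texts are converted to sequences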
tok = Tokenizer(num_words=10)

In [4]:
# Define a corpus of text data
corp=['coffee is hot','water is cold']

In [5]:
# Display the corpus
corp

['coffee is hot', 'water is cold']

In [6]:
# Fit the tokenizer on the corpus to build the vocabulary
tok.fit_on_texts(corp)

In [7]:
# Print the word index, which maps words to their integer IDs
tok.word_index

{'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}
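
The ordering of the word index follows word frequency: `is` appears twice, so it gets index 1, and the remaining words keep the order in which they first appeared. The sketch below reproduces that ranking in plain Python. It is a simplified illustration, not the library's implementation; `build_word_index` is a hypothetical helper, and it skips the punctuation filtering that the real `Tokenizer` applies.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Counter keeps first-seen order and sorted() is stable,
    # so words with equal counts stay in order of first appearance.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Numbering starts at 1; index 0 is never assigned to a word.
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

corp = ['coffee is hot', 'water is cold']
word_index = build_word_index(corp)
print(word_index)
# {'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}
```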

In [8]:
# Convert the corpus texts to sequences of integer IDs
tok.texts_to_sequences(corp)

[[2, 1, 3], [4, 1, 5]]

In [9]:
# Convert new texts to sequences using the fitted tokenizer
# ('black' was never seen during fitting, so it is silently dropped)
tok.texts_to_sequences(['water is hot','black coffee is cold'])

[[4, 1, 3], [2, 1, 5]]
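
Note that the second sentence has four words but only three IDs come back: `black` is not in the word index, so it is skipped. A minimal plain-Python sketch of that lookup, assuming the word index shown earlier (`to_sequences` is a hypothetical stand-in, not the Keras method):

```python
# Word index as printed by the fitted tokenizer above.
word_index = {'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}

def to_sequences(texts, word_index):
    """Hypothetical helper: map each known word to its ID, skip unknown words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        sequences.append(ids)
    return sequences

new_seqs = to_sequences(['water is hot', 'black coffee is cold'], word_index)
print(new_seqs)
# [[4, 1, 3], [2, 1, 5]]
```

Silently dropping words changes the sequence length, which is one reason an OOV placeholder is often preferred.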

# Adding OOV

**OOV stands for Out-Of-Vocabulary.**

In Natural Language Processing (NLP), it refers to words that don’t exist in the vocabulary your model or tokenizer was trained on.

For example:

Suppose your model’s vocabulary = {this, is, a, good, time, talk}

Your input sentence = "This is not a good time to talk"

The words "not" and "to" are not in the vocabulary, so each is replaced by an OOV token (such as <OOV> or UNK).

This is important because models can’t handle unknown words directly. Instead, they replace them with a placeholder token (<OOV>) so the model knows "something is here, but I don’t know what it is."
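
The example above can be sketched in a few lines of plain Python. The placeholder string `'<OOV>'` is an arbitrary choice for illustration. Both "not" and "to" fall outside the six-word vocabulary, so both are replaced.

```python
# Vocabulary and sentence from the example above.
vocabulary = {'this', 'is', 'a', 'good', 'time', 'talk'}
sentence = "This is not a good time to talk"

OOV = '<OOV>'  # placeholder string, chosen here for illustration

# Lowercase the sentence, split on whitespace, and replace unknown words.
tokens = [w if w in vocabulary else OOV for w in sentence.lower().split()]
print(tokens)
# ['this', 'is', '<OOV>', 'a', 'good', 'time', '<OOV>', 'talk']
```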

In [10]:
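# Initialize the Tokenizer with an out-of-vocabulary (OOV) token
# Note: the OOV placeholder here is the literal string 'black';
# a more conventional choice would be something like '<OOV>'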
from tensorflow.keras.preprocessing.text import Tokenizer
tok=Tokenizer(oov_token='black')
corp=['coffee is hot','water is cold']

In [11]:
# Display the corpus
corp

['coffee is hot', 'water is cold']

In [12]:
# Fit the tokenizer on the corpus and print the word index
tok.fit_on_texts(corp)
print(tok.word_index)
# Convert new texts to sequences; the OOV token takes index 1,
# so every other word's index shifts up by one
tok.texts_to_sequences(['water is hot','black coffee is cold'])

{'black': 1, 'is': 2, 'coffee': 3, 'hot': 4, 'water': 5, 'cold': 6}


[[5, 2, 4], [1, 3, 2, 6]]
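
Because the OOV token in this cell is the string `'black'`, it can look as if `black` was learned from the corpus. It was not: the placeholder simply takes index 1, and any unknown word maps to it. The plain-Python sketch below (hypothetical helpers, not the Keras implementation) shows that a differently named placeholder such as `'<OOV>'` gives the same integer sequences.

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Hypothetical helper: frequency-ranked index, OOV token first if given."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]
    if oov_token is not None:
        ranked = [oov_token] + ranked  # the OOV token takes index 1
    return {w: i for i, w in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, oov_token=None):
    """Hypothetical helper: unknown words map to the OOV index when one exists."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    out = []
    for text in texts:
        ids = []
        for w in text.lower().split():
            if w in word_index:
                ids.append(word_index[w])
            elif oov_id is not None:
                ids.append(oov_id)
        out.append(ids)
    return out

corp = ['coffee is hot', 'water is cold']
new_texts = ['water is hot', 'black coffee is cold']

results = {}
for token in ('black', '<OOV>'):
    wi = build_word_index(corp, oov_token=token)
    results[token] = to_sequences(new_texts, wi, oov_token=token)
    print(token, wi, results[token])
```

Both runs give `[[5, 2, 4], [1, 3, 2, 6]]`; only the key stored at index 1 differs.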

# Limiting the number of words

In [13]:
# Initialize the Tokenizer and limit the vocabulary: num_words=5 keeps indices 1 through 4
from tensorflow.keras.preprocessing.text import Tokenizer
tok=Tokenizer(num_words=5)
corp=['coffee is hot','water is cold']
tok.fit_on_texts(corp)
print(tok.word_index)
# Convert new texts to sequences, observing the effect of limited vocabulary
tok.texts_to_sequences(['water is hot','black coffee is cold'])

{'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}


[[4, 1, 3], [2, 1]]

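`cold` is dropped because its index is 5, and `num_words=5` keeps only words whose index is below 5 (indices 1 through 4). `black` is dropped because it never appeared in the corpus. Note that `word_index` still lists all five words; `num_words` is applied only when texts are converted to sequences.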
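
The filtering rule can be sketched in plain Python as "keep a word only if its index is strictly less than `num_words`". This is a simplified stand-in for the Keras method (`to_sequences` is a hypothetical helper), assuming the word index printed above.

```python
# Word index as printed above; it is the same regardless of num_words.
word_index = {'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}

def to_sequences(texts, word_index, num_words=None):
    """Hypothetical helper: keep only known words with index < num_words."""
    out = []
    for text in texts:
        ids = []
        for w in text.lower().split():
            i = word_index.get(w)
            if i is None:
                continue  # word never seen during fitting
            if num_words is not None and i >= num_words:
                continue  # word ranked too low to be kept
            ids.append(i)
        out.append(ids)
    return out

new_texts = ['water is hot', 'black coffee is cold']
by_limit = {n: to_sequences(new_texts, word_index, num_words=n) for n in (5, 4, 3)}
for n, seqs in by_limit.items():
    print(n, seqs)
# 5 [[4, 1, 3], [2, 1]]
# 4 [[1, 3], [2, 1]]
# 3 [[1], [2, 1]]
```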

In [14]:
# Initialize the Tokenizer and further limit the vocabulary: num_words=3 keeps indices 1 and 2
from tensorflow.keras.preprocessing.text import Tokenizer
tok=Tokenizer(num_words=3)
corp=['coffee is hot','water is cold']
tok.fit_on_texts(corp)
print(tok.word_index)
# Convert new texts to sequences, observing the effect of a very limited vocabulary
tok.texts_to_sequences(['water is hot','black coffee is cold'])

{'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}


[[1], [2, 1]]

In [15]:
# Initialize the Tokenizer with num_words=4, which keeps indices 1 through 3
from tensorflow.keras.preprocessing.text import Tokenizer
tok=Tokenizer(num_words=4)
corp=['coffee is hot','water is cold']
tok.fit_on_texts(corp)
print(tok.word_index)
# Convert new texts to sequences, observing the effect of a limited vocabulary
tok.texts_to_sequences(['water is hot','black coffee is cold'])

{'is': 1, 'coffee': 2, 'hot': 3, 'water': 4, 'cold': 5}


[[1, 3], [2, 1]]
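
The two options can also be combined. As far as I understand the Keras behaviour, when both `num_words` and `oov_token` are set, a word that is cut off by `num_words` is replaced by the OOV index instead of being dropped, so sequence lengths are preserved. The sketch below models that rule in plain Python; the helper and the `'<OOV>'` placeholder are assumptions for illustration, and the result is worth checking against the installed TensorFlow version.

```python
# Word index as the tokenizer would build it with oov_token='<OOV>' (assumed placeholder).
word_index = {'<OOV>': 1, 'is': 2, 'coffee': 3, 'hot': 4, 'water': 5, 'cold': 6}

def to_sequences(texts, word_index, num_words=None, oov_token=None):
    """Hypothetical helper: unknown or cut-off words map to the OOV index."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    out = []
    for text in texts:
        ids = []
        for w in text.lower().split():
            i = word_index.get(w)
            kept = i is not None and (num_words is None or i < num_words)
            if kept:
                ids.append(i)
            elif oov_id is not None:
                ids.append(oov_id)  # placeholder instead of dropping the word
        out.append(ids)
    return out

new_texts = ['water is hot', 'black coffee is cold']
combined = to_sequences(new_texts, word_index, num_words=4, oov_token='<OOV>')
print(combined)
# [[1, 2, 1], [1, 3, 2, 1]]
```

Since the OOV token occupies index 1, `num_words=4` here leaves room for only two real words (`is` and `coffee`).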

# Project Summary

This notebook demonstrates the use of the `Tokenizer` class from TensorFlow Keras for text preprocessing. The key steps covered are:

1.  **Installation and Import:** Installing TensorFlow and importing the `Tokenizer`.
2.  **Tokenizer Initialization:** Initializing the `Tokenizer` with various parameters like `num_words` and `oov_token`.
3.  **Corpus Definition:** Defining a sample text corpus.
4.  **Fitting the Tokenizer:** Using `fit_on_texts` to build the vocabulary and word index based on the corpus.
5.  **Text to Sequences:** Converting text data into sequences of integer IDs using `texts_to_sequences`.
6.  **Handling Out-of-Vocabulary (OOV) Tokens:** Demonstrating how the `oov_token` handles words not present in the vocabulary.
7.  **Limiting Vocabulary Size:** Showing the effect of the `num_words` parameter on the resulting sequences: only words with an index below `num_words` are kept, while `word_index` itself still lists every word.

This process is a fundamental step in preparing text data for use in various natural language processing tasks, especially with neural networks.