<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-1-public/blob/master/C3/W1/ungraded_labs/C3_W1_Lab_1_tokenize_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Tokenizer Basics

In most NLP tasks, the initial step in preparing your data is to extract a vocabulary of words from your *corpus* (i.e. input texts). You will need to define how to convert the texts into numerical representations that can be used to train a neural network. These representations are called *tokens*, and TensorFlow and Keras make it easy to generate them using their APIs. You will see how to do that in the next cells.

## Generating the vocabulary

In this notebook, you will first look at how to build a lookup dictionary that maps each word to an integer. The code below takes a list of sentences and assigns an integer to each distinct word in them. This is done using the [fit_on_texts()](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) method, and you can get the result from the `word_index` property. More frequent words have a lower index, and indices start at 1.

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
    ]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words=100)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word-to-index mapping and print it
word_index = tokenizer.word_index
print(word_index)

The `num_words` parameter passed to the initializer caps the vocabulary used when generating sequences: only the `num_words - 1` most frequent words are kept. You will see this in a later exercise. For now, the important thing to note is that it does not affect how the `word_index` dictionary is generated. You can try passing `1` instead of `100` and you will arrive at the same `word_index`.
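
To make the effect of `num_words` concrete, here is a minimal pure-Python sketch of the filtering applied when text is converted to sequences. The helper `to_sequence` is hypothetical and is not the Keras implementation. It assumes that any word whose index is at or above `num_words` is dropped, and it hard-codes the `word_index` that the cell above is expected to print.

```python
def to_sequence(text, word_index, num_words=None):
    """Simplified stand-in for the sequence step (not the Keras code)."""
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None:
            continue  # word was never seen during fitting
        if num_words and index >= num_words:
            continue  # outside the num_words - 1 most frequent words
        sequence.append(index)
    return sequence


# Mapping assumed to match the output of the cell above
word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

print(to_sequence('i love my dog', word_index, num_words=100))  # [1, 2, 3, 4]
print(to_sequence('i love my dog', word_index, num_words=3))    # [1, 2]
print(to_sequence('i love my dog', word_index, num_words=1))    # []
```

The dictionary itself is never modified; only the generated sequences shrink as `num_words` gets smaller.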

Also notice that by default, all punctuation is ignored and words are converted to lower case. You can override these behaviors by modifying the `filters` and `lower` arguments of the `Tokenizer` class as described [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#arguments). The `Tokenizer` cell further below sets `filters=''` and `lower=False`, so you can compare its output to the one generated above.
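
The default behavior described above (lowercasing, dropping punctuation, and ranking words by frequency) can be sketched in plain Python. This is only an illustration: `build_word_index` is a hypothetical helper, not part of Keras. The filter string is copied from the documented `Tokenizer` default, and ties in frequency are assumed to keep first-seen order.

```python
from collections import Counter

# Default filter characters documented for the Keras Tokenizer
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(texts, filters=DEFAULT_FILTERS, lower=True, split=' '):
    """Rank words by frequency and number them starting at 1."""
    counts = Counter()
    # Replace every filtered character with the separator
    table = str.maketrans({ch: split for ch in filters})
    for text in texts:
        if lower:
            text = text.lower()
        words = [w for w in text.translate(table).split(split) if w]
        counts.update(words)
    # most_common() sorts by descending count and keeps first-seen order for ties
    ranked = counts.most_common()
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


sentences = ['i love my dog', 'I, love my cat']
print(build_word_index(sentences))
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

# With filtering and lowercasing disabled, 'I,' becomes its own entry
print(build_word_index(sentences, filters='', lower=False))
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'I,': 5, 'cat': 6}
```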

## Using TextVectorization instead of Tokenizer

In [None]:
import tensorflow as tf
# TextVectorization is available directly under keras.layers since TF 2.6
from tensorflow.keras.layers import TextVectorization

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
]

# Create an instance of the TextVectorization layer
vectorizer = TextVectorization(max_tokens=100, output_mode='int', output_sequence_length=None)

# Adapt the TextVectorization layer to the input sentences
vectorizer.adapt(sentences)

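# Get the vocabulary (index 0 is the padding token '', index 1 is the OOV token '[UNK]')
vocabulary = vectorizer.get_vocabulary()
# Map each vocabulary entry to its position, mirroring Tokenizer's word_index
word_index = {word: i for i, word in enumerate(vocabulary)}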

# Print the word_index
print(word_index)


In [None]:
# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the Tokenizer class, overriding the default filtering and lowercasing
tokenizer = Tokenizer(
    num_words=100,
    filters='',         # keep punctuation
    lower=False,        # keep the original casing
    split=' ',          # limitation: Tokenizer accepts only a single separator string
    char_level=False,
    oov_token=None,
    analyzer=None,
)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the word-to-index mapping and print it
word_index = tokenizer.word_index
print(word_index)



1.   Tokenizer:

`Tokenizer` is a class provided by the Keras API in TensorFlow for text preprocessing. It handles tokenization, builds a vocabulary from the training data, and converts text to sequences of integers. Its main purpose is to create a mapping between words and their corresponding integer values. It does not pad sequences itself, so it is typically used together with the `pad_sequences` function. The TensorFlow documentation marks it as deprecated and recommends `TextVectorization` for new code.

2.   TextVectorization:

`TextVectorization` is a Keras layer that can be added directly to a neural network model. It covers the same ground as `Tokenizer` (lowercasing, punctuation removal, and whitespace splitting by default) and additionally accepts custom standardization and splitting functions. Besides integer sequences, it can be configured through `output_mode` to produce multi-hot, count-based, or TF-IDF representations of text. Its main advantage is that the preprocessing becomes part of the model itself, which makes the model more portable and easier to deploy.
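
To illustrate how integer, multi-hot, and count outputs differ, here is a small pure-Python sketch. It does not call TensorFlow: the `vocab` list and the `encode` helper are hypothetical, and the sketch simplifies the real layer by reserving only index 0 for unknown words in every mode.

```python
# Hypothetical vocabulary; index 0 stands in for unknown words
vocab = ['[UNK]', 'my', 'love', 'i', 'dog', 'cat']
index = {word: i for i, word in enumerate(vocab)}


def encode(text, mode='int'):
    """Encode text as integer ids, a multi-hot vector, or a count vector."""
    ids = [index.get(word, 0) for word in text.lower().split()]
    if mode == 'int':
        return ids
    vector = [0] * len(vocab)
    for i in ids:
        if mode == 'multi_hot':
            vector[i] = 1       # mark presence only
        else:
            vector[i] += 1      # 'count': tally occurrences
    return vector


text = 'my dog my bird'
print(encode(text, 'int'))        # [1, 4, 1, 0]
print(encode(text, 'multi_hot'))  # [1, 1, 0, 0, 1, 0]
print(encode(text, 'count'))      # [1, 2, 0, 0, 1, 0]
```

The integer output keeps word order and varies in length with the input, while the multi-hot and count outputs always have one slot per vocabulary entry and discard word order.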




That concludes this short exercise on tokenizing input texts!