<a href="https://colab.research.google.com/github/zenikigai/Pengembangan_Machine_Learning_IDcamp2023/blob/main/NLP_tokenization_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization Exercise

In this module, we will learn how to tokenize text and convert it into sequences.

In [3]:
# tokenizer library
from keras.preprocessing.text import Tokenizer

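The num_words parameter limits the vocabulary used when text is converted to sequences: words are ranked by how often they appear in the dataset, and only the most common num_words - 1 words are kept. With num_words set to 15, at most 14 of the most frequent words are used.

The oov_token parameter specifies a placeholder token that replaces any out-of-vocabulary (OOV) word, that is, a word the tokenizer has no index for.

Replacing an unrecognized word with a placeholder is better than skipping it, because the word's position in the sentence is kept and less information is lost. Setting oov_token enables this.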

In [4]:
# create the tokenizer object
tokenizer = Tokenizer(num_words = 15, oov_token = "_")

In [9]:
# Text to be tokenized; the tokenizer is fitted on it
teks = ["I love coding",
        "Coding is fun!",
        "Machine learning is diferent from conventional programming"]

To tokenize, call the fit_on_texts() method on the tokenizer object and pass our text as the argument.

In [6]:
tokenizer.fit_on_texts(teks)

Then, we'll convert the previously created text into sequences using the texts_to_sequences() method.

In [7]:
sequences = tokenizer.texts_to_sequences(teks)

To see the tokenization result, we can read the word_index attribute of the tokenizer object. It returns a dictionary that maps each word (the key) to its token, an integer index (the value). Note that the tokenizer lowercases text and strips punctuation by default, so "Congratulations!" and "CONGRATULATIONS" are treated as the same word. The output of the cell below also shows that the out-of-vocabulary token is assigned index 1.

In [8]:
print(tokenizer.word_index)

{'_': 1, 'coding': 2, 'is': 3, 'i': 4, 'love': 5, 'fun': 6, 'machine': 7, 'learning': 8, 'diferent': 9, 'from': 10, 'conventional': 11, 'programming': 12}
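
The ranking behind this dictionary can be approximated in plain Python. The following is a simplified sketch, not Keras's actual implementation: the helper `build_word_index` and its regex-based cleanup are assumptions made for illustration. It lowercases the text, drops punctuation, counts the words, and numbers them by descending frequency, reserving index 1 for the OOV token.

```python
import re
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Hypothetical helper: approximate a frequency-ranked word index."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep only alphanumeric runs, which drops punctuation such as "!"
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    # most_common() sorts by count; words with equal counts keep first-seen order
    ranked = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        # The OOV placeholder always takes the first index
        ranked.insert(0, oov_token)
    # Numbering starts at 1, so 0 stays free to act as a padding value
    return {word: i for i, word in enumerate(ranked, start=1)}

teks = ["I love coding",
        "Coding is fun!",
        "Machine learning is diferent from conventional programming"]
word_index = build_word_index(teks, oov_token="_")
print(word_index)
```

For this sample text the sketch reproduces the dictionary printed above, and "Congratulations!" and "CONGRATULATIONS" would collapse into the single key 'congratulations'.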


Next, take a look at the code below. Its output shows how the tokens are used to convert a sentence into numeric form. In the first sentence, the words 'learn', 'since', and '2022' are all mapped to 1, which indicates that they are not in the dictionary created earlier (OOV). Without an OOV token, words that have no token are simply left out of the sequence. With an OOV token, every such word receives the same placeholder token, so the length of the sentence and the position of each word are preserved.

In [10]:
print(tokenizer.texts_to_sequences(["I learn programming since 2022"]))
print(tokenizer.texts_to_sequences(["I love programming"]))

[[4, 1, 12, 1, 1]]
[[4, 5, 12]]
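
The difference between running with and without an OOV token can be sketched in plain Python. This is an illustrative approximation rather than the Keras source: the helper `to_sequence` is hypothetical, and the word index is copied from the output shown earlier. It also assumes that a word whose index is at or beyond num_words is treated like an unknown word.

```python
import re

# Word index copied from the output shown earlier
word_index = {'_': 1, 'coding': 2, 'is': 3, 'i': 4, 'love': 5, 'fun': 6,
              'machine': 7, 'learning': 8, 'diferent': 9, 'from': 10,
              'conventional': 11, 'programming': 12}

def to_sequence(text, word_index, oov_token=None, num_words=None):
    """Hypothetical helper: map each word to its index, handling unknown words."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index = word_index.get(word)
        # Unknown words, and words ranked at or beyond num_words, count as OOV
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        # Without an OOV token, oov_index is None and the word is skipped
        if index is not None:
            sequence.append(index)
    return sequence

sentence = "I learn programming since 2022"
with_oov = to_sequence(sentence, word_index, oov_token="_", num_words=15)
without_oov = to_sequence(sentence, word_index, num_words=15)
print(with_oov)     # [4, 1, 12, 1, 1]
print(without_oov)  # [4, 12]
```

Without the placeholder, the five-word sentence shrinks to two tokens, and nothing in the result shows where the missing words were.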


After tokenization, we convert each sentence into its corresponding token values by calling the texts_to_sequences() method and passing our text as the argument. Once the sequences have been created, the next step is padding. Padding is the process of making every sequence in the text the same length, much like resizing images so that each one has the same resolution. To use padding, we first import the pad_sequences function from keras.preprocessing.sequence. Then, we call pad_sequences() and pass the tokenized sequences as its argument.

In [11]:
from keras.preprocessing.sequence import pad_sequences
sequences_sameLength = pad_sequences(sequences)

After padding, every sequence has the same length. By default, pad_sequences does this by adding 0s at the beginning of the shorter sequences until they match the longest one.

In [12]:
print(sequences_sameLength)

[[ 0  0  0  0  4  5  2]
 [ 0  0  0  0  2  3  6]
 [ 7  8  3  9 10 11 12]]
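
The default behaviour shown above can be mimicked with a few lines of plain Python. This is only a sketch: the helper `pad_pre` is hypothetical and returns nested lists instead of the NumPy array that pad_sequences produces. The input sequences are the token lists for the three sample sentences.

```python
def pad_pre(sequences, value=0):
    """Hypothetical helper: left-pad every sequence to the length of the longest one."""
    target = max(len(seq) for seq in sequences)
    # Prepend as many fill values as each sequence is short of the target length
    return [[value] * (target - len(seq)) + list(seq) for seq in sequences]

sequences = [[4, 5, 2], [2, 3, 6], [7, 8, 3, 9, 10, 11, 12]]
padded = pad_pre(sequences)
for row in padded:
    print(row)
```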


To add the 0s at the end of each sequence instead, we can set the padding parameter to 'post'. In addition, we can set the maxlen parameter to the sequence length we want. If we set it to 5, every sequence will have length 5: shorter sequences are padded and longer ones are truncated.

In [14]:
sequences_sameLength = pad_sequences(sequences,
                                     padding = "post",
                                     maxlen = 5)

If a sequence is longer than maxlen (5 in this example), then by default only its last 5 values are kept, that is, the last 5 words of the sentence, and the earlier words are discarded. To keep the first 5 words of each sentence instead, we can set the truncating parameter to 'post'.

In [15]:
sequences_sameLength = pad_sequences(sequences,
                                     padding = "post",
                                     maxlen = 5,
                                     truncating = "post")
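
To make the interaction between maxlen, padding, and truncating concrete, here is a small plain-Python sketch. The helper `fit_length` is hypothetical and only imitates the options described above for a single sequence, assuming maxlen is at least 1. It is not the pad_sequences implementation.

```python
def fit_length(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: truncate or pad one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        # "pre" drops items from the front, "post" drops them from the back
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # "pre" puts the fill values first, "post" puts them last
    return fill + list(seq) if padding == "pre" else list(seq) + fill

long_seq = [7, 8, 3, 9, 10, 11, 12]   # "Machine learning is diferent from ..."
short_seq = [4, 5, 2]                 # "I love coding"

print(fit_length(long_seq, 5, padding="post"))                     # [3, 9, 10, 11, 12]
print(fit_length(long_seq, 5, padding="post", truncating="post"))  # [7, 8, 3, 9, 10]
print(fit_length(short_seq, 5, padding="post"))                    # [4, 5, 2, 0, 0]
```

The first two lines differ only in the truncating setting: the default keeps the end of the long sentence, while 'post' keeps its beginning.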