<div style="text-align: center;">
        <img src="./static/tokenizer_header.png" width="400px" style="height: auto;"></img>
</div>

---

This notebook demonstrates how to use the `WordPieceTokenizer` class for training and tokenization.

#### 📦 Importing dependencies

Let's begin by importing the necessary dependencies.

In [1]:
from babybert.data import load_corpus
from babybert.tokenizer import TokenizerConfig, WordPieceTokenizer

#### 📖 Loading corpus

Next, let's load our corpus, which is simply a `.txt` file containing newline-separated sentences. We'll use this corpus to train our tokenizer.

In [2]:
corpus = load_corpus("./data/corpus.txt")
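
For intuition, here is a minimal sketch of what reading a newline-separated corpus might look like. The `read_sentences` helper is hypothetical and is not part of `babybert`; we are only assuming that `load_corpus` does something broadly similar, and it may well treat blank lines or whitespace differently.

```python
import tempfile
from pathlib import Path


def read_sentences(path):
    """Hypothetical stand-in for a corpus loader: one sentence per line."""
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Write a tiny throwaway corpus so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "corpus.txt"
    demo_path.write_text("Hello, world!\n\nHere is a sentence.\n", encoding="utf-8")
    sentences = read_sentences(demo_path)

print(sentences)
```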

#### ⚙️ Instantiating tokenizer

Here, we instantiate our tokenizer object. Since our corpus is relatively small, let's confine ourselves to a small vocabulary size - 5000 tokens should suffice.

In [3]:
config = TokenizerConfig(target_vocab_size=5000)

tokenizer = WordPieceTokenizer(config)

#### 🏋️ Training tokenizer

Now that we have everything ready, we can train our tokenizer! We simply call its `train` method, passing in our corpus as an argument.

In [4]:
tokenizer.train(corpus)
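
To get a feel for what training involves, here is a rough sketch of the pair-scoring step that WordPiece training is commonly described as using: each adjacent pair of symbols is scored as `freq(pair) / (freq(first) * freq(second))`, and the highest-scoring pair is merged into a new vocabulary token. Whether `WordPieceTokenizer.train` works exactly this way is an assumption on our part; the `pair_scores` function and the toy word counts below are purely illustrative.

```python
from collections import Counter


def pair_scores(word_freqs):
    """Score adjacent symbol pairs as freq(pair) / (freq(a) * freq(b))."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in word_freqs.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }


# Toy word counts (made up). Each word is pre-split into characters,
# with "##" marking symbols that continue a word.
word_freqs = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

scores = pair_scores(word_freqs)
best_pair = max(scores, key=scores.get)
print(f"Best pair to merge: {best_pair}")  # ('##g', '##s')
```

Dividing by the individual symbol frequencies favors pairs whose parts rarely occur apart, which is why `('##g', '##s')` wins here even though `'##u'` appears in far more pairs.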

Now that training's complete, let's inspect our tokenizer!

In [5]:
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

Tokenizer vocab size: 5000


It looks like the tokenizer was able to learn the 5000 vocabulary tokens that we specified earlier.

We can also check out the tokens that the tokenizer learned during training.

In [6]:
print(f"First ten tokens in vocab: {tokenizer.vocab[:10]}")

First ten tokens in vocab: ['##)', '##f', '##m', 'r', '##s', 'j', 'b', 'p', '4', '##e']


#### 🚀 Using trained tokenizer

Let's put our tokenizer to work! Here, we use it to tokenize a few example sentences.

In [7]:
examples = ["Hello, world!", "Here is a sentence.", "How are you today?"]

for example in examples:
    tokenized_example = tokenizer.tokenize(example)
    print(f"Original sentence: {example}")
    print(f"Tokenized sentence: {tokenized_example}")

Original sentence: Hello, world!
Tokenized sentence: ['h', '##e', '##ll', '##o', ',', 'world', '!']
Original sentence: Here is a sentence.
Tokenized sentence: ['h', '##e', '##r', '##e', 'is', 'a', 'sentence', '.']
Original sentence: How are you today?
Tokenized sentence: ['how', 'ar', '##e', 'you', 'to', '##d', '##ay', '[UNK]']


Looks like it does a pretty good job, given the small vocabulary size! The one exception is the `'[UNK]'` token at the end of the last example sentence; this occurs because `'?'` never appeared in our training corpus, so it isn't in the vocabulary and our tokenizer doesn't know how to handle it. In a production scenario, we would use a much larger corpus containing more varied punctuation.
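
The splits above are consistent with greedy longest-match-first tokenization, the approach usually associated with WordPiece. Here is a minimal sketch of that idea for a single word. The `wordpiece_tokenize` function and `toy_vocab` are hypothetical, and we are assuming (not claiming) that `tokenizer.tokenize` behaves similarly after lowercasing the text and splitting off punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing matched at this position, so the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (an assumption, loosely based on the outputs above).
toy_vocab = {"h", "##e", "##ll", "##o", "world", "to", "##d", "##ay", ",", "!"}

print(wordpiece_tokenize("hello", toy_vocab))  # ['h', '##e', '##ll', '##o']
print(wordpiece_tokenize("today", toy_vocab))  # ['to', '##d', '##ay']
print(wordpiece_tokenize("?", toy_vocab))      # ['[UNK]']
```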

We can also use our tokenizer to encode text, converting it to a list of token IDs. This will be handy later on when we're training our model, as these token IDs serve as the model's input features.

In [8]:
token_ids = tokenizer.encode(examples[0])
print(f"Token IDs: {token_ids}")

Token IDs: [39, 9, 3671, 75, 62, 4498, 13]


Let's decode these token IDs to make sure they correspond to the proper tokens.

In [9]:
tokens = tokenizer.decode(token_ids)
print(f"Decoded tokens: {tokens}")

Decoded tokens: ['h', '##e', '##ll', '##o', ',', 'world', '!']
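
Conceptually, encoding and decoding amount to a lookup between tokens and their positions in the vocabulary. Here is a minimal sketch of that round trip. The vocabulary, the `encode_tokens` and `decode_ids` helpers, and the resulting IDs are all made up for illustration, so they won't match the IDs produced by our trained tokenizer.

```python
# Toy vocabulary: a token's ID is simply its index in this list.
toy_vocab = ["[UNK]", "h", "##e", "##ll", "##o", ",", "world", "!"]
token_to_id = {token: index for index, token in enumerate(toy_vocab)}


def encode_tokens(tokens):
    """Map tokens to IDs, falling back to the [UNK] ID for unseen tokens."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk_id) for token in tokens]


def decode_ids(token_ids):
    """Map IDs back to their tokens."""
    return [toy_vocab[token_id] for token_id in token_ids]


tokens = ["h", "##e", "##ll", "##o", ",", "world", "!"]
ids = encode_tokens(tokens)
print(f"Token IDs: {ids}")                      # [1, 2, 3, 4, 5, 6, 7]
print(f"Decoded tokens: {decode_ids(ids)}")
print(f"Unknown token ID: {encode_tokens(['?'])}")  # [0]
```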


#### 💾 Saving trained tokenizer

Now, let's save our tokenizer so that we can reuse it later! The `save_pretrained` instance method makes this easy; all we need to do is specify a path to which the tokenizer data should be saved.

In [10]:
path = "./my_tokenizer"
tokenizer.save_pretrained(path)

If we want to load this tokenizer again later, all we have to do is call the `from_pretrained` class method!

In [11]:
del tokenizer

tokenizer = WordPieceTokenizer.from_pretrained(path)
print(f"Tokenized text: {tokenizer.tokenize(examples[0])}")

Tokenized text: ['h', '##e', '##ll', '##o', ',', 'world', '!']
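
For a sense of what persisting a tokenizer can involve, here is a minimal sketch that writes a vocabulary to a JSON file and reads it back. The `save_vocab` and `load_vocab` helpers and the `vocab.json` file name are hypothetical; the files that `save_pretrained` actually writes may look quite different.

```python
import json
import tempfile
from pathlib import Path


def save_vocab(vocab, directory):
    """Write the vocabulary to <directory>/vocab.json (hypothetical layout)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "vocab.json").write_text(json.dumps(vocab), encoding="utf-8")


def load_vocab(directory):
    """Read the vocabulary back from <directory>/vocab.json."""
    return json.loads((Path(directory) / "vocab.json").read_text(encoding="utf-8"))


toy_vocab = ["[UNK]", "h", "##e", "##ll", "##o", ",", "world", "!"]

# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_vocab(toy_vocab, Path(tmp_dir) / "my_tokenizer")
    restored_vocab = load_vocab(Path(tmp_dir) / "my_tokenizer")

print(f"Round trip preserved vocab: {restored_vocab == toy_vocab}")
```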
