# BMS Molecular Translation - Train InChI Tokenizer

## 1. Introduction

Many recent deep learning NLP models use subword tokenization (e.g. [BPE](https://arxiv.org/abs/1508.07909v5), [WordPiece](https://arxiv.org/pdf/1609.08144v2.pdf)) rather than word-level or character-level tokenization.
Subword tokenization keeps the vocabulary at a chosen size while letting the model learn richer semantics than character-level tokenization allows.
For these reasons, subword tokenization has been adopted for tasks such as translation, question answering, reading comprehension, and text generation.

In [this competition](https://www.kaggle.com/c/bms-molecular-translation/code), we need to train a model that generates **InChI** text by attending to old chemical images.
Since generating InChI sequences is essentially a text generation task, we can apply subword tokenization to **InChI** strings.
In this notebook, we train a subword tokenizer for the **InChI** format and examine the distribution of tokenized sequence lengths to choose a suitable maximum sequence length.

## 2. Train Tokenizer

First, install [`tokenizers`](https://huggingface.co/docs/tokenizers/python/latest/).
`tokenizers` is a library that implements today's most widely used tokenizers, including those mentioned above.
It also provides an interface for training, tokenizing, and pipelining the whole encoding procedure.

In [None]:
!pip install -qq -U allennlp transformers tokenizers

After installing the library, load the necessary modules.

In [None]:
import tqdm
import pandas as pd
import matplotlib.pyplot as plt
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Punctuation
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import WordPiece as WordPieceDecoder

Now, let's train our **InChI** tokenizer.
As you can see, `train_labels.csv` contains image ids and **InChI** strings.

In [None]:
samples = pd.read_csv('../input/bms-molecular-translation/train_labels.csv')
samples.head()

The tokens below, which do not appear in the target texts, will be added to the vocabulary.
- `[UNK]`: Unknown token. It is used when a word cannot be represented as a combination of subwords.
- `[BOS]`: Begin-of-sequence token. It is added to the front of every sequence. You can feed this token to start generating a sequence without any previous context.
- `[EOS]`: End-of-sequence token. It signals that the sequence has ended and that any tokens after it are meaningless.
- `[PAD]`: Padding token. It is added so that all sequences in the same batch have the same length.
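
The roles of `[BOS]`, `[EOS]`, and `[PAD]` can be sketched in plain Python. This is only an illustration: the helper names (`frame`, `pad_batch`, `trim_at_eos`) and the character-level token split are hypothetical and are not part of the `tokenizers` library.

```python
BOS, EOS, PAD = "[BOS]", "[EOS]", "[PAD]"

def frame(tokens):
    """Wrap one token sequence with the begin/end markers."""
    return [BOS] + list(tokens) + [EOS]

def pad_batch(batch):
    """Right-pad every framed sequence to the longest one in the batch."""
    framed = [frame(tokens) for tokens in batch]
    width = max(len(seq) for seq in framed)
    return [seq + [PAD] * (width - len(seq)) for seq in framed]

def trim_at_eos(tokens):
    """Keep only what lies between [BOS] and the first [EOS]."""
    if tokens and tokens[0] == BOS:
        tokens = tokens[1:]
    if EOS in tokens:
        tokens = tokens[:tokens.index(EOS)]
    return tokens

# Hypothetical character-level pieces, used only to show the framing.
batch = pad_batch([["C", "H", "4"], ["H", "2", "O", "/", "h", "1"]])
```

After padding, both rows of `batch` have the same length, and `trim_at_eos(batch[0])` recovers the original three tokens by discarding everything from `[EOS]` onward.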

The code below trains the tokenizer by building a vocabulary of 256 subword tokens, including the special tokens above.
Frequently appearing subword pairs are merged and added to the vocabulary.
This is repeated until the vocabulary is full.
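
To make the merge loop concrete, here is a toy, BPE-style sketch in plain Python. It is a simplification and not the library's implementation: the function `learn_merges` is hypothetical, the three input strings are made-up formula fragments, ties are broken arbitrarily but deterministically, and the `##` continuation prefix that WordPiece uses is left out.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair (toy version)."""
    # Start from single characters.
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the lexicographically larger pair.
        (left, right), count = max(pairs.items(), key=lambda item: (item[1], item[0]))
        if count < 2:  # mirrors min_frequency=2
            break
        merges.append(left + right)
        # Replace each occurrence of the pair with the merged symbol.
        merged_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(left + right)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged_sequences.append(out)
        sequences = merged_sequences
    return merges, sequences

merges, segmented = learn_merges(["C13H20", "C13H18", "C12H20"], num_merges=3)
```

Each merge adds one entry to the vocabulary, so the loop stops once the requested number of entries has been reached or no pair is frequent enough.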

In [None]:
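# WordPiece model; anything it cannot segment falls back to [UNK].
tokenizer = Tokenizer(WordPiece(unk_token='[UNK]'))
# Split on punctuation first, so separators such as '/' and ',' become their own pieces.
tokenizer.pre_tokenizer = Punctuation()

trainer = WordPieceTrainer(
    vocab_size=256,
    min_frequency=2,
    special_tokens=['[UNK]', '[BOS]', '[EOS]', '[PAD]']
)
tokenizer.train_from_iterator(samples['InChI'], trainer=trainer)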

That's all! We've successfully trained our own **InChI** tokenizer. You can now use it to split **InChI** strings into subword tokens.

## 3. Visualize Sequence Lengths

Next, we need to know the range of sequence lengths.
Restricting the maximum sequence length is important.
Which `max_seq_len` is appropriate?

To decide on `max_seq_len`, let's visualize the distribution of sequence lengths.
First, configure the encoding template.
As mentioned above, `[BOS]` and `[EOS]` tokens are added before and after each sequence, respectively.
We can build this post-processing into the pipeline with the `TemplateProcessing` class.

In [None]:
tokenizer.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[
        ("[BOS]", tokenizer.token_to_id("[BOS]")),
        ("[EOS]", tokenizer.token_to_id("[EOS]")),
    ],
)

Below are some examples of tokenized **InChI** texts.
All tokens are mapped to their indices before being passed to the model.

In [None]:
print(' '.join(tokenizer.encode(samples.iloc[80000, 1]).tokens))
print(' '.join(tokenizer.encode(samples.iloc[53242, 1]).tokens))
print(' '.join(tokenizer.encode(samples.iloc[45212, 1]).tokens))
print(' '.join(tokenizer.encode(samples.iloc[782120, 1]).tokens))

Using the trained tokenizer and encoding template, let's tokenize all **InChI** sequences and plot the histogram of their lengths.

In [None]:
lengths = []
for inchi in tqdm.tqdm(samples['InChI']):
    lengths.append(len(tokenizer.encode(inchi).ids))

In [None]:
print(max(lengths))

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(lengths, bins=500)
plt.show()
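
Alongside the histogram, it can help to check what fraction of sequences a candidate cutoff would keep untruncated. The sketch below is one possible way to do that; the helper names and the `toy_lengths` values are made up for illustration, and in the notebook you would pass the `lengths` list computed above instead.

```python
def coverage(lengths, cutoff):
    """Fraction of sequences whose length is at most `cutoff`."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

def smallest_cutoff(lengths, target, multiple=8):
    """Smallest multiple of `multiple` covering at least `target` of the data.

    `target` is expected to be a fraction between 0 and 1.
    """
    cutoff = multiple
    while coverage(lengths, cutoff) < target:
        cutoff += multiple
    return cutoff

# Made-up lengths for illustration only.
toy_lengths = [40, 55, 61, 72, 80, 95, 110, 130, 180, 300]

share_128 = coverage(toy_lengths, 128)
share_256 = coverage(toy_lengths, 256)
cutoff_90 = smallest_cutoff(toy_lengths, 0.9)
```

A cutoff that covers nearly all sequences while staying well below the maximum length avoids spending memory on a few outliers.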

Great! It seems that `256` is a suitable `max_seq_len`!
Let's configure the decoding, padding, and truncation settings of the tokenizer.

In [None]:
tokenizer.decoder = WordPieceDecoder()
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token='[PAD]', pad_to_multiple_of=8)
tokenizer.enable_truncation(max_length=256)
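
As a rough sketch of how these two settings interact, assuming that a batch is padded to its longest sequence and then rounded up to the next multiple of 8, the padded width can be estimated as follows. The function `padded_length` is hypothetical and only mirrors that assumption; it does not call the library.

```python
def padded_length(longest, multiple=8, max_length=256):
    """Estimated padded width: truncate first, then round up to a multiple."""
    longest = min(longest, max_length)
    # Ceiling division without importing math.
    return -(-longest // multiple) * multiple

widths = [padded_length(n) for n in (1, 61, 64, 300)]
```

Because `256` is itself a multiple of 8, rounding up never pushes a truncated sequence past the maximum length.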

## 4. Save Tokenizer

So, how can we use this tokenizer? The answer is simple.
We're going to save this tokenizer to `tokenizer.json`.
You can use the trained tokenizer anytime, by simply loading the `tokenizer.json` file.

In [None]:
tokenizer.save('tokenizer.json')
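
Loading the file back can be sketched with `Tokenizer.from_file`. To keep the example self-contained, it trains a tiny stand-in tokenizer on three short InChI strings and writes it to a temporary directory; the small corpus, the `vocab_size=64` setting, and the temporary path are assumptions for illustration only, and in practice you would point `from_file` at the `tokenizer.json` saved above.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Punctuation
from tokenizers.trainers import WordPieceTrainer

# Tiny stand-in corpus; the notebook itself trains on train_labels.csv.
corpus = [
    "InChI=1S/CH4/h1H4",
    "InChI=1S/C2H6/c1-2/h1-2H3",
    "InChI=1S/H2O/h1H2",
]

toy = Tokenizer(WordPiece(unk_token="[UNK]"))
toy.pre_tokenizer = Punctuation()
trainer = WordPieceTrainer(
    vocab_size=64,
    special_tokens=["[UNK]", "[BOS]", "[EOS]", "[PAD]"],
    show_progress=False,
)
toy.train_from_iterator(corpus, trainer=trainer)

# Save to a temporary location, then restore from the JSON file.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
toy.save(path)
restored = Tokenizer.from_file(path)

text = corpus[1]
same_ids = toy.encode(text).ids == restored.encode(text).ids
```

The restored tokenizer should produce the same ids as the one that was saved, so a training or inference script only needs the JSON file.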