## Subtoken Tokenization
- Recent work on tokenization has focused on "subtoken" (subword) tokenization. **Subtokens** extend the previous splitting strategy by further breaking a word into grammatically logical sub-components learned from the data. <br>
- Subtoken construction reduces the size of the vocabulary you have to carry to train a Machine Learning model. <br>
- On the other hand, since one token may be split into multiple subtokens, the input to your model gets longer, which can become an issue for models whose complexity is non-linear in the input sequence length. <br>
<br>
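
To make the trade-off above concrete, here is a minimal sketch of subtoken splitting. It uses a hand-written toy vocabulary and a greedy longest-match rule, both of which are assumptions made for illustration; a real tokenizer learns its pieces from data and may split differently.

```python
# Toy vocabulary (hypothetical): a handful of reusable pieces.
VOCAB = {"token", "ized", "izer", "s", "un", "related", "input"}

def split_into_subtokens(word, vocab):
    """Greedy longest-match-first split of `word` into pieces from `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]  # no piece matches: treat the whole word as unknown
        pieces.append(word[start:end])
        start = end
    return pieces

words = ["tokenized", "tokenizers", "unrelated", "input"]
subtokens = {w: split_into_subtokens(w, VOCAB) for w in words}
for word, pieces in subtokens.items():
    print(word, "->", pieces)

# 4 words become 8 subtokens: a smaller vocabulary, but a longer input sequence.
total_subtokens = sum(len(p) for p in subtokens.values())
print(len(words), "words ->", total_subtokens, "subtokens")
```

Seven pieces are enough to cover all four words, but the model now sees eight positions instead of four.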

## @huggingface/tokenizers library
- Along with the transformers library, @huggingface provides a blazing fast tokenization library able to train, tokenize and decode dozens of Gb/s of text on a common multi-core machine.
- The library is designed to provide all the required blocks to create end-to-end tokenizers in an interchangeable way. The various components of the library are:
  - **Normalizer**: Executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, strip it, or apply one of the common unicode normalization processes, you add a Normalizer.
  - **PreTokenizer**: In charge of splitting the initial input string. This is the component that decides where and how to pre-segment the original string. The simplest example, as we saw before, is to split on spaces.
  - **Model**: Handles all the sub-token discovery and generation. This part is trainable and heavily dependent on your input data.
  - **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence in [CLS] and [SEP] tokens.
  - **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the PreTokenizer we used previously.
  - **Trainer**: Provides training capabilities to each model.
<br><br>
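
The way these components chain together can be sketched with plain Python functions. This is a toy illustration only, not the library's implementation: every function name and the tiny vocabulary below are hypothetical, and the "model" is a simple word-level lookup rather than a trained subtoken model.

```python
import unicodedata

# Hypothetical toy vocabulary mapping tokens to integer ids.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}

def normalize(text):
    # Normalizer: unicode NFKC normalization followed by lower-casing.
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text):
    # PreTokenizer: the simplest strategy, splitting on whitespace.
    return text.split()

def model(words, vocab):
    # Model: word-level lookup with an [UNK] fallback (no subtoken splitting here).
    return [vocab.get(word, vocab["[UNK]"]) for word in words]

def post_process(ids, vocab):
    # Post-Processor: wrap the sequence in [CLS] ... [SEP], as BERT expects.
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

def decode(ids, vocab):
    # Decoder: map ids back to text, dropping the special tokens.
    id_to_token = {i: t for t, i in vocab.items()}
    specials = {"[CLS]", "[SEP]"}
    return " ".join(id_to_token[i] for i in ids if id_to_token[i] not in specials)

ids = post_process(model(pre_tokenize(normalize("Hello   WORLD")), TOY_VOCAB), TOY_VOCAB)
print(ids)                     # [1, 3, 4, 2]
print(decode(ids, TOY_VOCAB))  # hello world
```

Because each stage only depends on the output of the previous one, any stage can be swapped for another implementation without touching the rest, which is the interchangeability the list above describes.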

- There are multiple implementations for each of the components above:
  - **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...
  - **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...
  - **Model**: WordLevel, BPE, WordPiece
  - **Post-Processor**: BertProcessor, ...
  - **Decoder**: WordLevel, BPE, WordPiece, ...

- All of these building blocks can be combined to create working tokenization pipelines. In the next section we will go over our first pipeline.

- This notebook trains a **Byte-Pair Encoding (BPE)** tokenizer on a deliberately small input. We will work with **the file from Peter Norvig**. This file contains around 130,000 lines of raw text that will be processed by the library to generate a working tokenizer.
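
Before handing the work to the library, it may help to see what BPE training does in principle. The sketch below follows the textbook procedure (count adjacent symbol pairs, merge the most frequent pair, repeat) on a tiny hand-made corpus. The function name and corpus are hypothetical, and the library's actual trainer is far more optimized and works on byte-level symbols.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    word_freqs = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        merged_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_freqs[tuple(out)] += freq
        word_freqs = merged_freqs
    return merges

toy_corpus = "hug hug hug hug pug pug bug"
merges = learn_bpe_merges(toy_corpus, 2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

The pair `('u', 'g')` appears 7 times and is merged first; `('h', 'ug')` then appears 4 times and is merged second. The ordered merge list plays the same role as the `merges.txt` file saved later in this notebook.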

In [1]:
# check execution time for whole code
import time
s_time = time.time()

In [2]:
# document : https://pypi.org/project/tokenizers/
!pip install tokenizers

Collecting tokenizers
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
     |████████████████████████████████| 3.3MB 25.4MB/s 
Installing collected packages: tokenizers
Successfully installed tokenizers-0.10.2


In [3]:
file_url = 'https://raw.githubusercontent.com/dscape/spell/master/test/resources/big.txt'

# Download the file first, and write it to disk only if the request succeeded
# (opening big.txt before the request would leave an empty file behind on failure)
from requests import get
response = get(file_url)
if response.status_code == 200:
    with open('big.txt', 'wb') as big_f:
        big_f.write(response.content)
    !ls
else:
    print("Unable to get the file: {}".format(response.reason))

big.txt  sample_data


In [4]:
# Everything described below can be replaced by the ByteLevelBPETokenizer class. 

import tokenizers
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase, NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# tokenizers : 0.10.2
print(f'tokenizers : {tokenizers.__version__}')

tokenizers : 0.10.2


In [5]:
# empty Byte-Pair Encoding model (i.e. not yet trained)
tokenizer = Tokenizer(BPE())

# lower-casing & unicode-normalization
# The Sequence normalizer combines multiple normalizers, applied in order
tokenizer.normalizer = Sequence([
    NFKC(),
    Lowercase()
])

# pre-tokenizer to convert the input to a ByteLevel representation.
tokenizer.pre_tokenizer = ByteLevel()

# decoder to recover from a tokenized input to the original one
tokenizer.decoder = ByteLevelDecoder()

tokenizer

<tokenizers.Tokenizer at 0x5599104e5cc0>

In [6]:
%%time
# initialize trainer, with details about the vocabulary
trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(files=["big.txt"], trainer=trainer)

print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))

Trained vocab size: 25000
CPU times: user 7.1 s, sys: 322 ms, total: 7.42 s
Wall time: 4.02 s


In [7]:
# save the output
tokenizer.model.save('.')

['./vocab.json', './merges.txt']

In [8]:
# load the tokenizer model
tokenizer.model = BPE.from_file('vocab.json', 'merges.txt')

# test the tokenizer model
sent = "This is a simple input to be tokenized"
encoded = tokenizer.encode(sent)
decoded = tokenizer.decode(encoded.ids)

print(encoded)
print(f">>> Encoded string: {encoded.tokens}")

print(f"\n>>> Decoded string: {decoded}")

Encoding(num_tokens=10, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
>>> Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'put', 'Ġto', 'Ġbe', 'Ġtoken', 'ized']

>>> Decoded string:  this is a simple input to be tokenized
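
The `Ġ` prefix in the tokens above comes from the byte-level representation: every byte is mapped to a visible unicode character so that spaces and control bytes survive inside token strings. The sketch below rebuilds the GPT-2 style byte-to-character table, which we assume is the one the ByteLevel pre-tokenizer follows (the `Ġ` in the output is consistent with it). The helper names are hypothetical.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own character.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control bytes, ...) are shifted past U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_visible(text):
    # Encode to UTF-8 bytes, then map each byte to its visible character.
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_visible(visible):
    # Reverse the mapping, which is what a byte-level decoder has to do.
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")

print(to_visible(" this"))       # the space (byte 32) becomes U+0120
print(from_visible(to_visible(" this is a simple input")))
```

Under this mapping the space byte (32) becomes U+0120, which is the `Ġ` seen in `'Ġthis'`. The leading space in the decoded string is consistent with a space being prepended to the input before encoding, which we believe is the ByteLevel pre-tokenizer's default behavior.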



The Encoding structure exposes multiple properties that are useful when working with transformers models:

- **normalized_str**: The input string after normalization (lower-casing, unicode, stripping, etc.)
- **original_str**: The input string as it was provided
- **tokens**: The generated tokens with their string representation
- **ids**: The generated tokens with their integer representation (this is what `encoded.ids` returns above)
- **offsets**: For each token, the (start, end) character span it covers in the input string
- **attention_mask**: If your input has been padded by the tokenizer, this is a vector with 1 for every non-padded token and 0 for padded ones.
- **special_tokens_mask**: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], this is a vector with 1 in the positions where a special token has been added.
- **type_ids**: If your input is made of multiple "parts" such as (question, context), this is a vector giving, for each token, the segment it belongs to.
- **overflowing**: If your input has been truncated into multiple subparts because of a length limit (for BERT, for example, the sequence length is limited to 512), this contains all the remaining overflowing parts.

Note that `normalized_str` and `original_str` do not appear in the attribute list printed above for tokenizers 0.10.2, so they may not be available in this version.
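
To see how the mask-like fields relate to one another, here is a toy construction for a padded (question, context) pair. It builds plain Python lists by hand rather than calling the library, and the function name is hypothetical. Marking padding positions as special and giving them type id 0 are assumptions of this sketch.

```python
def build_pair_encoding(tokens_a, tokens_b, pad_to, pad_token="[PAD]"):
    """Toy construction of Encoding-like fields for a BERT-style sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + first part + [SEP]; segment 1 covers second part + [SEP].
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    n_pad = max(0, pad_to - len(tokens))
    # Real tokens get attention 1, padding gets 0.
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens = tokens + [pad_token] * n_pad
    type_ids = type_ids + [0] * n_pad  # assumption: padding uses type id 0
    specials = {"[CLS]", "[SEP]", pad_token}
    special_tokens_mask = [1 if t in specials else 0 for t in tokens]
    return {
        "tokens": tokens,
        "type_ids": type_ids,
        "attention_mask": attention_mask,
        "special_tokens_mask": special_tokens_mask,
    }

enc = build_pair_encoding(["what", "is", "bpe"], ["a", "merge", "rule"], pad_to=10)
for name, values in enc.items():
    print(f"{name:20s} {values}")
```

All four lists have the same length, one entry per token position, which is what lets a model consume them side by side.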

In [9]:
# Using a pretrained tokenizer (pretrained BERT tokenizer)
# https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#using-a-pretrained-tokenizer

## download vocabulary file
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
!ls

## load vocab & generate tokenizer
from tokenizers import BertWordPieceTokenizer
tokenizer_bert = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

print('\n>>>', tokenizer_bert)

--2021-05-07 14:34:27--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.17.163
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.17.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘bert-base-uncased-vocab.txt’


2021-05-07 14:34:27 (23.5 MB/s) - ‘bert-base-uncased-vocab.txt’ saved [231508/231508]

bert-base-uncased-vocab.txt  big.txt  merges.txt  sample_data  vocab.json

>>> Tokenizer(vocabulary_size=30522, model=BertWordPiece, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], pad_token=[PAD], mask_token=[MASK], clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True, wordpieces_prefix=##)


In [10]:
sent = "This is a simple input to be tokenized"
encoded = tokenizer_bert.encode(sent)
decoded = tokenizer_bert.decode(encoded.ids)

print(encoded)
print(f">>> Encoded string: {encoded.tokens}")

print(f"\n>>> Decoded string: {decoded}")

Encoding(num_tokens=11, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
>>> Encoded string: ['[CLS]', 'this', 'is', 'a', 'simple', 'input', 'to', 'be', 'token', '##ized', '[SEP]']

>>> Decoded string: this is a simple input to be tokenized
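
The `##` prefix marks a piece that continues the previous token, so decoding amounts to gluing those pieces back onto the word before them and dropping the special tokens. A minimal sketch of that idea, using a hypothetical helper rather than the library's decoder:

```python
def decode_wordpieces(tokens, prefix="##", special=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild a string from WordPiece tokens (toy version of the decoding step)."""
    words = []
    for token in tokens:
        if token in special:
            continue  # special tokens are not part of the text
        if token.startswith(prefix) and words:
            # Continuation piece: append to the previous word without a space.
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)

pieces = ['[CLS]', 'this', 'is', 'a', 'simple', 'input', 'to', 'be', 'token', '##ized', '[SEP]']
restored = decode_wordpieces(pieces)
print(restored)  # this is a simple input to be tokenized
```

Applied to the tokens printed above, this reproduces the decoded string shown in the cell output.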


In [11]:
# check execution time for whole code
e_time = time.time()
time_elapsed = e_time - s_time
print(f'Total time elapsed : {int(time_elapsed//60)} min {int(time_elapsed%60)} sec')

Total time elapsed : 0 min 10 sec
