# Train new tokenizer

This notebook walks through training a new `NllbTokenizer` on the Dyula-to-French dataset.

**Process Overview:**
 1. Create training data for the tokenizer from the dataset and save it to a text file.
 2. Load and save the existing tokenizer's configuration files from the specified model.
 3. Train a new tokenizer using the `SentencePieceBPETokenizer` class on the saved text data.
 4. Convert the trained tokenizer to a `PreTrainedTokenizerFast` object.
 5. Update the tokenizer configuration files with the new vocabulary and special tokens.
 6. Save the updated tokenizer configuration, special tokens, and additional tokens.
 7. Update the tokenizer's `sentencepiece` model with the new vocabulary and save it.
 8. Test the new tokenizer to ensure it loads correctly and is compatible with `transformers`.

**Notes:**
 - The `create_tokenizer_train_data` function is used to prepare the dataset for tokenizer training.
 - The existing tokenizer's configuration is used to ensure compatibility with the new tokenizer.
 - Special tokens and added tokens are preserved and updated in the new tokenizer.
 - The `SentencePieceBPETokenizer` class is used for training the tokenizer. 
 - The `SentencePiece` model is updated and serialized to ensure that the new vocabulary is included.

## Setup environment

Install the required packages with `pip install` in the cell below, then restart the kernel so that the newly installed versions are picked up.

In [1]:
!pip install -q -U sentencepiece transformers huggingface_hub datasets sacrebleu lxml sentence-transformers accelerate fastai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradient 2.0.6 requires attrs<=19, but you have attrs 23.1.0 which is incompatible.

In [1]:
BASE_MODEL_ID = "facebook/nllb-200-distilled-600M"
VOCAB_SIZE = 2_000
SAVE_DIR = "tokenizers/tokenizer_2k"
HFHUB_LOGIN = False
LOAD_LOCAL_DATA = True

In [3]:
if HFHUB_LOGIN:
    from huggingface_hub import notebook_login
    notebook_login(new_session=False)

In [2]:
import os
from pathlib import Path
import json
from tqdm.auto import tqdm
from tokenizers import SentencePieceBPETokenizer
from transformers import PreTrainedTokenizerFast, NllbTokenizer
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

from data import load_from_json
from utils import preproc

## Prepare tokenizer training data

In [3]:
df = load_from_json(
    train_files="data/dataset_train.json",
    valid_files="data/dataset_validation.json",
    test_files="data/dataset_test.json",
    return_format="df"
)
print(df.shape)
df.head()

(10929, 3)


|   | dyu | fr | split |
|---|-----|----|-------|
| 0 | A bi ji min na | Il boit de l’eau. | train |
| 1 | A le dalakolontɛ lon bɛ. | Il se plaint toujours. | train |
| 2 | Mun? Fɛn dɔ. | Quoi ? Quelque chose. | train |
| 3 | O bɛ bi bɔra fo Gubeta. | Tous sortent excepté Gubetta. | train |
| 4 | A ale lo bi da bugɔ la! | Ah ! c’est lui… il sonne… | train |


In [4]:
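# Make sure the output folder exists.
os.makedirs("data", exist_ok=True)

# Write one preprocessed sentence per line: Dyula first, then French.
# UTF-8 is set explicitly because the text contains non-ASCII characters.
with open("data/tokenizer_train_data.txt", "w", encoding="utf-8") as f:
    for _, row in tqdm(df.iterrows()):
        f.write(f"{preproc(row['dyu'])}\n")
        # Skip the French side when it is the literal string "0".
        if row['fr'] == "0":
            continue
        f.write(f"{preproc(row['fr'])}\n")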

0it [00:00, ?it/s]
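
The cell above can also be wrapped in a small reusable function. The following is a minimal sketch rather than the notebook's own helper: the name `write_tokenizer_corpus`, its parameters, and the identity default for `preproc` are assumptions made for illustration (the real `preproc` lives in `utils` and is not shown here). It writes one sentence per line and skips the `"0"` value on the French side, as the loop above does.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple


def write_tokenizer_corpus(
    pairs: Iterable[Tuple[str, str]],
    out_path: str,
    preproc: Callable[[str], str] = lambda s: s,  # stand-in for utils.preproc (assumption)
    missing_marker: str = "0",  # value the loop above skips on the French side
) -> int:
    """Write one sentence per line and return the number of lines written."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    n_lines = 0
    with out.open("w", encoding="utf-8") as f:
        for dyu, fr in pairs:
            f.write(f"{preproc(dyu)}\n")
            n_lines += 1
            if fr == missing_marker:
                continue  # nothing to write for the French side of this row
            f.write(f"{preproc(fr)}\n")
            n_lines += 1
    return n_lines
```

With the DataFrame loaded above, a call could look like `write_tokenizer_corpus(zip(df["dyu"], df["fr"]), "data/tokenizer_train_data.txt", preproc)`.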

## Load the existing tokenizer

We load the existing tokenizer for the model and save it locally. We'll only use this to get the structure and format of the config files right.

In [5]:
tokenizer_old = NllbTokenizer.from_pretrained(BASE_MODEL_ID)
tokenizer_old.save_pretrained("/tmp")



('/tmp/tokenizer_config.json',
 '/tmp/special_tokens_map.json',
 '/tmp/sentencepiece.bpe.model',
 '/tmp/added_tokens.json')

Read the `tokenizer_config.json` file:

In [6]:
with open("/tmp/tokenizer_config.json", "r") as f:
    tokenizer_config = json.load(f)
tokenizer_config.keys()

dict_keys(['added_tokens_decoder', 'additional_special_tokens', 'bos_token', 'clean_up_tokenization_spaces', 'cls_token', 'eos_token', 'legacy_behaviour', 'mask_token', 'model_max_length', 'pad_token', 'sep_token', 'sp_model_kwargs', 'src_lang', 'tgt_lang', 'tokenizer_class', 'unk_token'])

Read the `special_tokens_map.json` file:

In [7]:
with open("/tmp/special_tokens_map.json", "r") as f:
    special_tokens_map = json.load(f)
special_tokens_map.keys()

dict_keys(['additional_special_tokens', 'bos_token', 'cls_token', 'eos_token', 'mask_token', 'pad_token', 'sep_token', 'unk_token'])

Read the `added_tokens.json` file:

In [8]:
with open("/tmp/added_tokens.json", "r") as f:
    added_tokens = json.load(f)
added_tokens.keys()

dict_keys(['<mask>', 'ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'als_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gaz_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_La

## Train tokenizer

Specify the special tokens that we need for our tokenizer:

In [9]:
special_tokens = ['<s>', '<pad>', '</s>', '<unk>', 'dyu_Latn', 'fra_Latn', '<mask>']
add_special_tokens = ['dyu_Latn', 'fra_Latn']

In [10]:
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    "data/tokenizer_train_data.txt",  # the file written in the data-preparation cell
    vocab_size=VOCAB_SIZE,
    # min_frequency=5,
    show_progress=True,
    # limit_alphabet=500,
    special_tokens=special_tokens
)






Convert to a Hugging Face `PreTrainedTokenizerFast` object:

In [11]:
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, clean_up_tokenization_spaces=False
)

Tokenize an example sentence with the new and the old tokenizers:

In [12]:
tokenizer(['a bi ji min na']).input_ids

[[90, 117, 427, 187, 178]]

In [13]:
tokenizer_old(['a bi ji min na']).input_ids

[[256047, 9, 330, 850, 531, 62, 2]]

In [14]:
tokenizer.batch_decode(tokenizer(['a bi ji min na']).input_ids)



['a bi ji min na']

In [15]:
tokenizer.convert_ids_to_tokens(tokenizer(['a bi ji min na']).input_ids[0])

['▁a', '▁bi', '▁ji', '▁min', '▁na']

## Update tokenizer config files

Update the `tokenizer_config` that was loaded from the `tokenizer_config.json` file:

In [16]:
def _added_tokens_decoder_to_dict(token):
    return {
        'content': token,
        'lstrip': False,
        'normalized': False,
        'rstrip': False,
        'single_word': False,
        'special': True
    }
tokenizer_config["added_tokens_decoder"] = {
    str(i): _added_tokens_decoder_to_dict(t) for t, i in tokenizer.vocab.items() if t in special_tokens
}
tokenizer_config["additional_special_tokens"] = add_special_tokens
tokenizer_config

{'added_tokens_decoder': {'4': {'content': 'dyu_Latn',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '6': {'content': '<mask>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '1': {'content': '<pad>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '0': {'content': '<s>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '5': {'content': 'fra_Latn',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '3': {'content': '<unk>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '2': {'content': '</s>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True}},
 'add
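
The output above lists the entries in the iteration order of `tokenizer.vocab` rather than by ID. As a minimal sketch (the function name `build_added_tokens_decoder` and the toy vocabulary are hypothetical, not part of the notebook), the same mapping can be built from any token-to-ID dictionary, sorted by ID and with a check that every special token is actually present:

```python
from typing import Dict, Iterable


def build_added_tokens_decoder(
    vocab: Dict[str, int], special_tokens: Iterable[str]
) -> Dict[str, dict]:
    """Map str(token_id) to an AddedToken-style dict, ordered by token ID."""
    wanted = set(special_tokens)
    missing = wanted - set(vocab)
    if missing:
        raise KeyError(f"special tokens missing from vocab: {sorted(missing)}")
    # Sort by ID so the resulting JSON is easy to read and diff.
    entries = sorted((i, t) for t, i in vocab.items() if t in wanted)
    return {
        str(i): {
            "content": t,
            "lstrip": False,
            "normalized": False,
            "rstrip": False,
            "single_word": False,
            "special": True,
        }
        for i, t in entries
    }


# Toy vocabulary mirroring the IDs printed above, plus one ordinary piece.
toy_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "dyu_Latn": 4, "fra_Latn": 5, "<mask>": 6, "▁a": 90}
decoder = build_added_tokens_decoder(
    toy_vocab, ["<s>", "<pad>", "</s>", "<unk>", "dyu_Latn", "fra_Latn", "<mask>"]
)
print(list(decoder))
```

In the notebook, the first argument would be `tokenizer.get_vocab()` and the second the `special_tokens` list defined earlier.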

Update the `special_tokens_map` that was loaded from the `special_tokens_map.json` file:

In [17]:
special_tokens_map["additional_special_tokens"] = add_special_tokens
special_tokens_map

{'additional_special_tokens': ['dyu_Latn', 'fra_Latn'],
 'bos_token': {'content': '<s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'cls_token': {'content': '<s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'eos_token': {'content': '</s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'mask_token': {'content': '<mask>',
  'lstrip': True,
  'normalized': True,
  'rstrip': False,
  'single_word': False},
 'pad_token': {'content': '<pad>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'sep_token': {'content': '</s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'unk_token': {'content': '<unk>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False}}

Update the `added_tokens` that was loaded from the `added_tokens.json` file:

In [18]:
added_tokens = {t: tokenizer.convert_tokens_to_ids(t) for t in ["<mask>", "dyu_Latn", "fra_Latn"]}
added_tokens

{'<mask>': 6, 'dyu_Latn': 4, 'fra_Latn': 5}

## Save the tokenizer

Create a folder to save the new tokenizer:

In [19]:
new_tokenizer_dir = Path(SAVE_DIR)
new_tokenizer_dir.mkdir(parents=True, exist_ok=True)

Save the new `tokenizer_config.json` file:

In [20]:
with open(new_tokenizer_dir/"tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f)

Save the `special_tokens_map.json` file:

In [21]:
with open(new_tokenizer_dir/"special_tokens_map.json", "w") as f:
    json.dump(special_tokens_map, f)

Save the `added_tokens.json` file:

In [22]:
with open(new_tokenizer_dir/"added_tokens.json", "w") as f:
    json.dump(added_tokens, f)
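
The three save cells above follow the same pattern. A minimal sketch of a shared helper is given below; the name `save_json_files` is hypothetical, and `ensure_ascii=False` and `indent=2` are choices made here for readability, not settings the notebook uses. Each file is reloaded after writing to confirm that it round-trips, which assumes payloads with string keys only, as is the case for the three dictionaries above.

```python
import json
from pathlib import Path
from typing import Dict


def save_json_files(out_dir: str, files: Dict[str, dict]) -> None:
    """Write each {filename: payload} pair as JSON and check that it reloads unchanged."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, payload in files.items():
        path = out / name
        with path.open("w", encoding="utf-8") as f:
            json.dump(payload, f, ensure_ascii=False, indent=2)
        # Reload to catch payloads that JSON cannot represent faithfully
        # (for example, dictionaries with non-string keys).
        with path.open("r", encoding="utf-8") as f:
            if json.load(f) != payload:
                raise ValueError(f"{name} did not round-trip through JSON")
```

A call such as `save_json_files(SAVE_DIR, {"tokenizer_config.json": tokenizer_config, "special_tokens_map.json": special_tokens_map, "added_tokens.json": added_tokens})` would write the same three files.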

Load the pre-trained `NllbTokenizer`'s `sentencepiece` model so that its pieces can be replaced with the new vocabulary:

In [23]:
m = sp_model.ModelProto()
m.ParseFromString(Path("/tmp/sentencepiece.bpe.model").read_bytes())

4852054

Loop over `m.pieces` and keep only the tokens that are in the new vocab. This can take a few minutes.

In [24]:
len(m.pieces)

256000

In [25]:
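# tokenizer.vocab rebuilds the dict on every access, so look it up once.
new_vocab = set(tokenizer.get_vocab())

# Rotate through m.pieces once: pop from the front, re-append pieces to keep.
# The loop ends when the front piece is one that was already re-appended.
seen = set()
while True:
    if m.pieces[0].piece in seen:
        break
    x = m.pieces.pop(0)
    seen.add(x.piece)
    if x.piece in new_vocab:
        m.pieces.append(x)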
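
To make the pop-and-append loop easier to follow, here is a minimal sketch of the same idea on a plain Python list of strings instead of protobuf pieces. The function name `rotate_filter` and the toy data are hypothetical. Assuming the original list has no duplicate pieces, the result equals an order-preserving filter, and `seen` ends up holding every old piece.

```python
from typing import Iterable, List, Set, Tuple


def rotate_filter(pieces: List[str], keep: Iterable[str]) -> Tuple[List[str], Set[str]]:
    """Pop-and-append filter, as in the cell above, on a plain list of strings."""
    keep = set(keep)
    pieces = list(pieces)  # work on a copy
    seen: Set[str] = set()
    # Stop once the front element is one we already moved to the back,
    # or once nothing is left (the cell above would raise IndexError here).
    while pieces and pieces[0] not in seen:
        piece = pieces.pop(0)
        seen.add(piece)
        if piece in keep:
            pieces.append(piece)
    return pieces, seen


old_pieces = ["<unk>", "<s>", "</s>", "▁a", "▁the", "▁bi", "is"]
new_vocab = {"<unk>", "<s>", "</s>", "▁a", "▁bi", "▁ji"}

kept, seen = rotate_filter(old_pieces, new_vocab)
print(kept)                       # old pieces that survive, in their original order
print(sorted(new_vocab - seen))   # new tokens that still have to be added
```

The second printed list corresponds to the tokens that the next cell appends with a score of 0.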

Add tokens that were not in the old tokenizer's vocab:

In [26]:
add_tokens = set(tokenizer.vocab.keys()) - seen

for token in add_tokens:
    new_token = sp_model.ModelProto().SentencePiece()
    new_token.piece = token
    new_token.score = 0
    m.pieces.append(new_token)

In [27]:
assert len(m.pieces) == len(tokenizer.vocab)
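
The length check above would still pass if one piece were duplicated while another was missing. A slightly stricter check compares the names themselves. This is a minimal sketch on plain strings; the function name `vocab_mismatch` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def vocab_mismatch(piece_names: Iterable[str], vocab: Iterable[str]) -> Dict[str, List[str]]:
    """Report duplicated, missing, and unexpected piece names relative to vocab."""
    counts = Counter(piece_names)
    piece_set = set(counts)
    vocab_set = set(vocab)
    return {
        "duplicates": sorted(p for p, c in counts.items() if c > 1),
        "missing": sorted(vocab_set - piece_set),      # in vocab, not in the model
        "unexpected": sorted(piece_set - vocab_set),   # in the model, not in vocab
    }


# Same length as the vocab, but one duplicate hides one missing token.
report = vocab_mismatch(["<unk>", "▁a", "▁a"], ["<unk>", "▁a", "▁bi"])
print(report)
```

In the notebook, this could be called as `vocab_mismatch((p.piece for p in m.pieces), tokenizer.get_vocab())`, with all three lists expected to be empty.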

In [28]:
m.pieces[0]

piece: "<unk>"
score: 0
type: UNKNOWN

In [29]:
with open(new_tokenizer_dir/'sentencepiece.bpe.model', 'wb') as f:
    f.write(m.SerializeToString())

## Test loading the new tokenizer

Load the new `sentencepiece` model:

In [30]:
sp = spm.SentencePieceProcessor()
sp.load(str(new_tokenizer_dir/'sentencepiece.bpe.model'))

True

In [31]:
print(sp.encode_as_pieces('this is a test'))
print(sp.encode_as_ids('this is a test'))

['▁th', 'is', '▁', 'is', '▁a', '▁t', 'est']
[134, 20, 1426, 20, 8, 6, 195]


Test loading the new tokenizer with `transformers`:

In [32]:
tokenizer = NllbTokenizer.from_pretrained(new_tokenizer_dir)

In [33]:
tokenizer.vocab_size

2001

In [38]:
t = "il boit de l’eau"
tokenizer.src_lang = "fra_Latn"
print(tokenizer.tokenize(t))
print(tokenizer.decode(tokenizer.encode(t)))

['▁il', '▁bo', 'it', '▁de', '▁l’', 'e', 'au']
fra_Latn il boit de l’eau</s>


In [39]:
tokenizer.convert_tokens_to_ids("dyu_Latn")

4

In [40]:
tokenizer.convert_tokens_to_ids("fra_Latn")

5

In [41]:
tokenizer.convert_tokens_to_ids("<mask>")

6