# Limit tokenizer vocabulary

This notebook walks step by step through limiting the vocabulary of a pretrained `NllbTokenizer` to the tokens that actually occur in a dataset.

**Process Overview:**
 1. Loads the pretrained tokenizer from `model_id`.
 2. Counts the frequency of each token appearing in the dataset.
 3. Retains tokens that meet the frequency threshold defined by `min_token_freq`.
 4. Checks for frequently occurring tokens that are not in the tokenizer's vocabulary and adds them.
 5. Saves the modified tokenizer's configuration, vocabulary, and sentencepiece model to `save_dir`.
 6. Ensures that the final tokenizer can be loaded and used with the `transformers` library.

**Steps in Detail:**
 1. The original tokenizer is loaded and applied to the dataset to generate token IDs.
 2. Token frequencies are computed for both source (`dyu_Latn`) and target (`fra_Latn`) languages.
 3. Tokens are filtered based on the minimum frequency and unknown tokens are flagged.
 4. The tokenizer's configuration files are modified to reflect the new vocabulary and saved.
 5. The sentencepiece model is updated by removing unused tokens and adding new tokens.
 6. The new tokenizer is tested to ensure compatibility with `transformers` and is saved.

**Notes:**
 - Core special tokens are preserved; language tags other than `dyu_Latn` and `fra_Latn` are removed.
 - The tokenizer's vocabulary is extended with high-frequency tokens that were previously mapped to the unknown token.
 - The notebook updates both the tokenizer's configuration files and the sentencepiece model.

## Setup environment

Restart the kernel after you have installed packages with `pip install` in the Notebook cell below.

In [1]:
!pip install -q -U sentencepiece transformers huggingface_hub datasets sacrebleu lxml sentence-transformers accelerate fastai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradient 2.0.6 requires attrs<=19, but you have attrs 23.1.0 which is incompatible.

In [2]:
BASE_MODEL_ID = "facebook/nllb-200-distilled-600M"
MIN_TOKEN_FREQ = 1
SAVE_DIR = "tokenizers/tokenizer_freq1"
HFHUB_LOGIN = False
LOAD_LOCAL_DATA = True

In [3]:
if HFHUB_LOGIN:
    from huggingface_hub import notebook_login
    notebook_login(new_session=False)

In [4]:
import os
from pathlib import Path
import json
from collections import Counter
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from tokenizers import SentencePieceBPETokenizer
from transformers import AutoTokenizer, PreTrainedTokenizerFast, NllbTokenizer
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

from data import load_from_json, save_data_locally, ds2df
from utils import preproc

## Prepare tokenizer training data

In [5]:
if LOAD_LOCAL_DATA:
    df = load_from_json(
        train_files="data/dataset_train.json",
        valid_files="data/dataset_validation.json",
        test_files="data/dataset_test.json",
        return_format="df"
    )
else:
    ds = load_dataset("uvci/Koumankan_mt_dyu_fr")
    save_data_locally(ds, save_dir="./data")
    df = ds2df(ds)
print(df.shape)
df.head()

(10929, 3)


Unnamed: 0,dyu,fr,split
0,A bi ji min na,Il boit de l’eau.,train
1,A le dalakolontɛ lon bɛ.,Il se plaint toujours.,train
2,Mun? Fɛn dɔ.,Quoi ? Quelque chose.,train
3,O bɛ bi bɔra fo Gubeta.,Tous sortent excepté Gubetta.,train
4,A ale lo bi da bugɔ la!,Ah ! c’est lui… il sonne…,train


## Load existing tokenizer

In [6]:
tokenizer_old = NllbTokenizer.from_pretrained(BASE_MODEL_ID)



Get a count of how frequently each token appears in the dataset:

In [7]:
token_counts = Counter()

df_counts = df.copy()
tokenizer_old.src_lang = "dyu_Latn"
df_counts["tokens_dyu"] = df_counts["dyu"].apply(lambda s: tokenizer_old(preproc(s)).input_ids)
tokenizer_old.src_lang = "fra_Latn"
df_counts["tokens_fra"] = df_counts["fr"].apply(lambda s: tokenizer_old(preproc(s)).input_ids)
df_counts["tokens"] = df_counts["tokens_dyu"] + df_counts["tokens_fra"]

for tokens in df_counts["tokens"]:
    # Counter.update accepts an iterable and adds one count per element.
    token_counts.update(tokens)
token_counts = dict(token_counts)

token_ids = sorted(list(token_counts.keys()))
print("Number of tokens used:", len(token_ids))

Number of tokens used: 10802
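As a minimal, self-contained sketch of the counting step: the token-ID lists below are made up for illustration and are not real tokenizer output. Each list is fed straight into `Counter.update`:

```python
from collections import Counter

# Hypothetical token-ID sequences standing in for tokenizer output.
toy_sequences = [
    [256044, 9, 99, 2],
    [256057, 9, 1083, 2],
]

toy_counts = Counter()
for ids in toy_sequences:
    # Passing an iterable adds one count per element.
    toy_counts.update(ids)

# Sorting a Counter sorts its keys, i.e. the distinct token IDs.
toy_token_ids = sorted(toy_counts)
print(toy_counts.most_common(2))
print("Number of tokens used:", len(toy_token_ids))
```

Because the language tag and the end-of-sequence token are part of every encoded sentence, they end up with the highest counts, as the table further down also shows.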


Convert to a Pandas `DataFrame`:

In [8]:
token_counts = pd.DataFrame({"token": token_counts.keys(), "count": token_counts.values()}).sort_values(by="count", ascending=False)
token_counts

Unnamed: 0,token,count
6,2,21858
0,256044,10929
7,256057,10929
1,9,4207
119,99,3522
...,...,...
7401,209546,1
7402,189625,1
7403,239427,1
7404,1083,1


Number of tokens with at least `MIN_TOKEN_FREQ` occurrences:

In [9]:
(token_counts["count"] >= MIN_TOKEN_FREQ).sum()

10802

Let's limit the vocabulary to tokens with at least `MIN_TOKEN_FREQ` occurrences:

In [10]:
token_ids = sorted(token_counts[token_counts["count"] >= MIN_TOKEN_FREQ]["token"].tolist())
print("Number of tokens used:", len(token_ids))

Number of tokens used: 10802
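With `MIN_TOKEN_FREQ = 1`, every token that occurs at all is kept, which is why the count stays at 10802. The following toy sketch shows how a higher threshold would shrink the list. The counts are made up and `ids_with_min_freq` is a hypothetical helper, not part of this notebook:

```python
from collections import Counter

# Made-up counts: two frequent tokens, one moderately frequent, two singletons.
toy_counts = Counter({2: 8, 9: 5, 99: 3, 1083: 1, 209546: 1})


def ids_with_min_freq(counts, min_freq):
    """Return the sorted token IDs whose count is at least min_freq."""
    return sorted(t for t, c in counts.items() if c >= min_freq)


kept_at_1 = ids_with_min_freq(toy_counts, 1)
kept_at_2 = ids_with_min_freq(toy_counts, 2)
print(len(kept_at_1), len(kept_at_2))
```

Raising the threshold trades a smaller vocabulary against more text being split into shorter pieces or mapped to the unknown token.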


We also want to check if there are frequent tokens that are not in the tokenizer's vocabulary:

In [11]:
token_counts = Counter()

df_counts = df.copy()
tokenizer_old.src_lang = "dyu_Latn"
df_counts["tokens_dyu"] = df_counts["dyu"].apply(lambda s: tokenizer_old.tokenize(preproc(s)))
tokenizer_old.src_lang = "fra_Latn"
df_counts["tokens_fra"] = df_counts["fr"].apply(lambda s: tokenizer_old.tokenize(preproc(s)))
df_counts["tokens"] = df_counts["tokens_dyu"] + df_counts["tokens_fra"]

for tokens in df_counts["tokens"]:
    # Counter.update accepts an iterable and adds one count per element.
    token_counts.update(tokens)
token_counts = dict(token_counts)

In [12]:
def _is_unk(t):
    return tokenizer_old.convert_tokens_to_ids(t) == tokenizer_old.unk_token_id

token_counts = pd.DataFrame({"token": token_counts.keys(), "count": token_counts.values()}).sort_values(by="count", ascending=False)
token_counts["is_unk"] = token_counts["token"].apply(_is_unk)
token_counts[token_counts["is_unk"]]

Unnamed: 0,token,count,is_unk
9,’,1625,True
4262,—,26,True
1821,»,15,True
4022,«,14,True
7072,“,3,True
7073,”,3,True
1489,–,2,True
8961,—«,2,True


The top token, the typographic apostrophe `’`, occurs 1625 times, far more often than any other unknown token. We'll add it to the tokenizer vocabulary.

In [13]:
# Use a set literal: set(<str>) would split a multi-character token into single characters.
new_tokens = {token_counts[token_counts["is_unk"]].iloc[0, 0]}
new_tokens

{'’'}
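A note on building a set from a single token string: `set(s)` iterates over the characters of `s`, so it only coincides with `{s}` when the token is one character long, as `’` is. A short illustration, where `«»` is a made-up multi-character token used only for this example:

```python
single_char_token = "’"
multi_char_token = "«»"  # hypothetical token, for illustration only

# set(<str>) iterates over the string and yields one element per character.
split_into_chars = set(multi_char_token)

# A set literal keeps the token whole.
kept_whole = {multi_char_token}

print(split_into_chars == {"«", "»"})
print(kept_whole == {"«»"})
print(set(single_char_token) == {single_char_token})
```

All three comparisons print `True`, so the two spellings differ only once a token is longer than one character.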

## Update the tokenizer's configuration files

In [14]:
tokenizer_old.save_pretrained("/tmp")

('/tmp/tokenizer_config.json',
 '/tmp/special_tokens_map.json',
 '/tmp/sentencepiece.bpe.model',
 '/tmp/added_tokens.json')

Special tokens: keep only the relevant language tags (`dyu_Latn` and `fra_Latn`).

In [15]:
with open("/tmp/special_tokens_map.json", "r") as f:
    special_tokens_map = json.load(f)
add_special_tokens = ["dyu_Latn", "fra_Latn"]
add_special_tokens_remove = set([
    t for t in special_tokens_map["additional_special_tokens"] if t not in add_special_tokens
])
special_tokens_map["additional_special_tokens"] = add_special_tokens
special_tokens_map

{'additional_special_tokens': ['dyu_Latn', 'fra_Latn'],
 'bos_token': {'content': '<s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'cls_token': {'content': '<s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'eos_token': {'content': '</s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'mask_token': {'content': '<mask>',
  'lstrip': True,
  'normalized': True,
  'rstrip': False,
  'single_word': False},
 'pad_token': {'content': '<pad>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'sep_token': {'content': '</s>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False},
 'unk_token': {'content': '<unk>',
  'lstrip': False,
  'normalized': False,
  'rstrip': False,
  'single_word': False}}

In [16]:
with open("/tmp/added_tokens.json", "r") as f:
    added_tokens = json.load(f)
added_tokens = {k: v for k, v in added_tokens.items() if k not in add_special_tokens_remove}
added_tokens

{'<mask>': 256203, 'dyu_Latn': 256044, 'fra_Latn': 256057}
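The same filtering pattern can be tried on plain dictionaries without touching any files. In this sketch `xxx_Latn` and its ID are hypothetical placeholders for the language tags being dropped. The other three IDs are the ones shown above:

```python
# Toy stand-ins for special_tokens_map.json and added_tokens.json.
# "xxx_Latn" and 999999 are hypothetical placeholders for an unwanted tag.
toy_special_tokens_map = {
    "additional_special_tokens": ["dyu_Latn", "fra_Latn", "xxx_Latn"],
    "unk_token": "<unk>",
}
toy_added_tokens = {
    "<mask>": 256203,
    "dyu_Latn": 256044,
    "fra_Latn": 256057,
    "xxx_Latn": 999999,
}

keep_tags = ["dyu_Latn", "fra_Latn"]

# Everything listed as an additional special token but not explicitly kept.
drop_tags = {
    t for t in toy_special_tokens_map["additional_special_tokens"]
    if t not in keep_tags
}

toy_special_tokens_map["additional_special_tokens"] = keep_tags
# Filter by the removal set so that other added tokens, such as <mask>, survive.
toy_added_tokens = {k: v for k, v in toy_added_tokens.items() if k not in drop_tags}
print(toy_added_tokens)
```

Filtering against the removal set rather than the keep list is what lets `<mask>` stay in `added_tokens` even though it is not a language tag.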

In [17]:
with open("/tmp/tokenizer_config.json", "r") as f:
    tokenizer_config = json.load(f)

added_tokens_decoder = {}
for k, v in tokenizer_config["added_tokens_decoder"].items():
    if v["content"] not in add_special_tokens_remove:
        added_tokens_decoder[k] = v
tokenizer_config["added_tokens_decoder"] = added_tokens_decoder

tokenizer_config["additional_special_tokens"] = add_special_tokens
tokenizer_config

{'added_tokens_decoder': {'0': {'content': '<s>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '1': {'content': '<pad>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '2': {'content': '</s>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '3': {'content': '<unk>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '256044': {'content': 'dyu_Latn',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '256057': {'content': 'fra_Latn',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '256203': {'content': '<mask>',
   'lstrip': True,
   'normalized': True,
   'rstrip': False,
   'single_word': False,
   'special': 

In [18]:
new_tokenizer_dir = Path(SAVE_DIR)
# Create the output directory (and any missing parents) if it does not exist yet.
new_tokenizer_dir.mkdir(parents=True, exist_ok=True)

with open(new_tokenizer_dir/"tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f,  indent=4)

with open(new_tokenizer_dir/"special_tokens_map.json", "w") as f:
    json.dump(special_tokens_map, f,  indent=4)

with open(new_tokenizer_dir/"added_tokens.json", "w") as f:
    json.dump(added_tokens, f,  indent=4)

## Build new vocabulary

The new vocabulary consists of:

 - Tokens present in the tokenizer's vocabulary and in our dataset
 - Tokens with high frequency in our dataset that are not in the tokenizer's vocab
 - The tokenizer's added and special tokens

In [19]:
new_vocab = (
    {tokenizer_old.convert_ids_to_tokens(i) for i in token_ids}
    .union(
        {v["content"] for v in tokenizer_config["added_tokens_decoder"].values()}
    )
    .union(
        new_tokens
    )
)
len(new_vocab)

10806
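Because the three sources overlap, the size of the union is smaller than the sum of its parts: special tokens that also occur in the encoded data are counted once. A toy sketch with made-up token sets:

```python
# Made-up stand-ins for the three sources of the new vocabulary.
used_tokens = {"</s>", "<unk>", "dyu_Latn", "fra_Latn", "▁il", "▁de"}
special_tokens = {"<s>", "<pad>", "</s>", "<unk>", "dyu_Latn", "fra_Latn", "<mask>"}
extra_tokens = {"’"}

# The | operator on sets is equivalent to chained .union() calls.
toy_vocab = used_tokens | special_tokens | extra_tokens

# Special tokens that never appeared in the encoded data.
only_special = special_tokens - used_tokens
print(len(toy_vocab), sorted(only_special))
```

Here 6 used tokens, 7 special tokens and 1 extra token give a vocabulary of 10, since 4 of the special tokens were already among the used ones.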

## Update the `sentencepiece` model

In [20]:
m = sp_model.ModelProto()
m.ParseFromString(open("/tmp/sentencepiece.bpe.model", "rb").read())

4852054

In [21]:
len(m.pieces)

256000

Iterate over `m.pieces` and keep only the tokens that are in the new vocab:

In [22]:
seen = set()
while True:
    if m.pieces[0].piece in seen:
        break
    x = m.pieces.pop(0)
    seen.add(x.piece)
    if x.piece in new_vocab:
        m.pieces.append(x)
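The loop above rotates through the list exactly once: each piece is popped from the front and re-appended at the back only if it should be kept, and the loop stops when a piece that was already visited reaches the front again. Here is a sketch of the same idea on a plain Python list. `filter_in_place` is a hypothetical helper, it assumes the items are unique, and it adds an emptiness check so the loop also ends if nothing is kept:

```python
def filter_in_place(pieces, keep):
    """Drop items not in `keep`, preserving order, by rotating through the list once.

    Returns the set of all items that were visited.
    """
    seen = set()
    # Stop when the front item was already visited (a kept item came back around)
    # or when the list is empty (nothing was kept).
    while pieces and pieces[0] not in seen:
        x = pieces.pop(0)
        seen.add(x)
        if x in keep:
            pieces.append(x)
    return seen


toy_pieces = ["<unk>", "<s>", "</s>", "a", "b", "c"]
toy_keep = {"<unk>", "</s>", "b", "’"}
toy_seen = filter_in_place(toy_pieces, toy_keep)

print(toy_pieces)            # kept items, in their original relative order
print(toy_keep - toy_seen)   # wanted tokens that were never in the list
```

The kept items retain their original relative order, and the difference `toy_keep - toy_seen` identifies tokens that still have to be appended.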

Add new tokens:

In [23]:
add_tokens = new_vocab - seen

for token in add_tokens:
    new_token = sp_model.ModelProto().SentencePiece()
    new_token.piece = token
    new_token.score = 0
    m.pieces.append(new_token)

assert len(m.pieces) == len(new_vocab)
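Python does not guarantee the same iteration order for a set of strings across interpreter runs, so the positions of the appended pieces can differ from run to run. One optional variation is to append in sorted order, which makes the resulting indices reproducible. This is not what the cell above does, and it would generally produce different indices from those printed later in this notebook. The sketch uses a hypothetical `Piece` dataclass in place of the protobuf message:

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical stand-in for the protobuf SentencePiece message."""
    piece: str
    score: float = 0.0


# Pieces that survived filtering (made-up scores).
kept = [Piece("<unk>"), Piece("▁il", -5.0), Piece("▁de", -3.0)]
toy_new_vocab = {"<unk>", "▁il", "▁de", "<mask>", "dyu_Latn", "’"}
toy_seen = {"<unk>", "<s>", "▁il", "▁de", "▁la"}

# Sorting fixes the order, and therefore the index, of each appended token.
for token in sorted(toy_new_vocab - toy_seen):
    kept.append(Piece(token, 0.0))

appended = [p.piece for p in kept[3:]]
assert len(kept) == len(toy_new_vocab)
print(appended)
```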

In [24]:
m.pieces[0]

piece: "<unk>"
score: 0
type: UNKNOWN

In [25]:
len(m.pieces)

10806

In [26]:
with open(new_tokenizer_dir/'sentencepiece.bpe.model', 'wb') as f:
    f.write(m.SerializeToString())

Test loading the updated model:

In [27]:
sp = spm.SentencePieceProcessor()
sp.load(str(new_tokenizer_dir/'sentencepiece.bpe.model'))

True

In [28]:
print(sp.encode_as_pieces('this is a test'))
print(sp.encode_as_ids('this is a test'))

['▁th', 'is', '▁is', '▁a', '▁test']
[162, 20, 170, 8, 2248]


Test loading the model with `transformers`:

In [29]:
tokenizer = NllbTokenizer.from_pretrained(new_tokenizer_dir)

In [30]:
tokenizer.vocab_size

10807

In [31]:
len(new_vocab)

10806

## Update indices of `added_tokens`

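The added tokens (`dyu_Latn`, `fra_Latn` and `<mask>`) still carry their original indices in the config files, and those indices now lie beyond the reduced vocabulary. We need to update them to the tokens' positions in the new vocabulary.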

In [32]:
added_tokens_new = {}
for i in range(tokenizer.vocab_size):
    t = tokenizer.convert_ids_to_tokens(i)
    if t in added_tokens:
        print(f"Updated index: {i}: {t}")
        added_tokens_new[t] = i
added_tokens_new

Updated index: 10802: dyu_Latn
Updated index: 10804: fra_Latn
Updated index: 10806: <mask>


{'dyu_Latn': 10802, 'fra_Latn': 10804, '<mask>': 10806}
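The remapping amounts to looking up each added token's position in the new vocabulary and re-keying the config entries accordingly. A toy sketch: the nine-entry vocabulary list and the trimmed decoder entries are made up, while the old IDs are the ones shown earlier:

```python
# Made-up new vocabulary, indexed by position.
toy_id_to_token = [
    "<s>", "<pad>", "</s>", "<unk>", "▁il", "▁de", "dyu_Latn", "fra_Latn", "<mask>",
]
old_added_tokens = {"<mask>": 256203, "dyu_Latn": 256044, "fra_Latn": 256057}

# New index of every added token = its position in the new vocabulary.
new_added_tokens = {
    t: i for i, t in enumerate(toy_id_to_token) if t in old_added_tokens
}

# Re-key decoder entries; tokens that were not remapped keep their old index.
old_decoder = {
    "3": {"content": "<unk>"},
    "256044": {"content": "dyu_Latn"},
    "256057": {"content": "fra_Latn"},
    "256203": {"content": "<mask>"},
}
new_decoder = {}
for old_id, entry in old_decoder.items():
    new_id = new_added_tokens.get(entry["content"], int(old_id))
    new_decoder[str(new_id)] = entry

print(new_added_tokens)
print(sorted(new_decoder, key=int))
```

The decoder keys are strings because they are stored as JSON object keys, hence the `str(new_id)` conversion.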

In [33]:
added_tokens_decoder_new = {}
for k, v in added_tokens_decoder.items():
    t = v["content"]
    i = int(k)
    assert t == tokenizer.convert_ids_to_tokens(i)
    if i >= tokenizer.vocab_size:
        i = added_tokens_new[t]
        print(f"Updated index: {i}: {t}")
    added_tokens_decoder_new[str(i)] = v
tokenizer_config["added_tokens_decoder"] = added_tokens_decoder_new
tokenizer_config

Updated index: 10802: dyu_Latn
Updated index: 10804: fra_Latn
Updated index: 10806: <mask>


{'added_tokens_decoder': {'0': {'content': '<s>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '1': {'content': '<pad>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '2': {'content': '</s>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '3': {'content': '<unk>',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '10802': {'content': 'dyu_Latn',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '10804': {'content': 'fra_Latn',
   'lstrip': False,
   'normalized': False,
   'rstrip': False,
   'single_word': False,
   'special': True},
  '10806': {'content': '<mask>',
   'lstrip': True,
   'normalized': True,
   'rstrip': False,
   'single_word': False,
   'special': Tru

In [34]:
with open(new_tokenizer_dir/"tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f,  indent=4)

with open(new_tokenizer_dir/"added_tokens.json", "w") as f:
    json.dump(added_tokens_new, f,  indent=4)

Test loading the model with `transformers`:

In [35]:
tokenizer = NllbTokenizer.from_pretrained(new_tokenizer_dir)

In [36]:
tokenizer.vocab_size

10807

In [37]:
t = "il boit de l’eau"
print(tokenizer.tokenize(t))
print(tokenizer.decode(tokenizer.encode(t)))

['▁il', '▁boit', '▁de', '▁l', '’', 'eau']




fra_Latn il boit de l’eau</s>


In [38]:
tokenizer.convert_tokens_to_ids("dyu_Latn")

10802

In [39]:
tokenizer.convert_tokens_to_ids("fra_Latn")

10804

In [40]:
tokenizer.convert_tokens_to_ids("<mask>")

10806