Although language model tokenizers are trained on large amounts of data, they may still perform poorly on low-resource languages due to the scarcity of digitized data in those languages. To assess how effectively a given model’s tokenizer represents your text, you can use CalculateTokenizerStats.

CalculateTokenizerStats takes a list of tokenizer names (or paths) to test, along with the name (or path) of the dataset to test them on. To display the results, you can use VisHelper, which presents them in a table.

The metrics computed by CalculateTokenizerStats are currently hard-coded. The following are implemented:
1. Total number of tokens
2. Number of sentences containing the unk token
3. Total number of unk tokens
4. Total number of characters in the text

In the future, metric computation will be more flexible: the user will be able to specify exactly which metrics to compute.
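
To make the four metrics concrete, here is a minimal sketch of how they could be counted. It is not the implementation used by CalculateTokenizerStats: the names `TokenizerStats`, `compute_stats`, and `toy_tokenize` are hypothetical, and the toy whitespace tokenizer only stands in for a real one so that the example runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TokenizerStats:
    """Hypothetical container for the four metrics listed above."""
    total_tokens: int = 0
    sentences_with_unk: int = 0
    total_unk_tokens: int = 0
    total_chars: int = 0


def compute_stats(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",
) -> TokenizerStats:
    """Accumulate the metrics over an iterable of sentences."""
    stats = TokenizerStats()
    for sentence in sentences:
        tokens = tokenize(sentence)
        unk_count = sum(1 for token in tokens if token == unk_token)
        stats.total_tokens += len(tokens)
        stats.total_unk_tokens += unk_count
        if unk_count > 0:
            stats.sentences_with_unk += 1
        stats.total_chars += len(sentence)
    return stats


def toy_tokenize(text: str) -> List[str]:
    """Toy stand-in: split on whitespace, map unknown words to [UNK]."""
    vocab = {"hello", "world"}
    return [word if word in vocab else "[UNK]" for word in text.split()]


example = compute_stats(["hello world", "hello there"], toy_tokenize)
print(example)
```

With a Hugging Face tokenizer, one could pass `tokenizer.tokenize` as `tokenize` and `tokenizer.unk_token` as `unk_token`, assuming the tokenizer defines an unk token.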

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
from eeve.utils.tokenizer_calc_stats import CalculateTokenizerStats
from eeve.utils.vis_helper import VisHelper

In [11]:
counter = CalculateTokenizerStats(
    tokenizer_names_or_paths=[
        'intfloat/multilingual-e5-large-instruct',
        'sentence-transformers/LaBSE',
        'facebook/nllb-200-distilled-600M'
    ],
    dataset='alexantonov/chuvash_russian_parallel',
    load_kwargs={'split': 'train', 'columns': ['chv']}
)

vis = VisHelper()

In [12]:
stats = counter.run()

Generating train split: 100%|██████████| 1461485/1461485 [00:01<00:00, 836223.31 examples/s]
2000it [00:00, 5002.92it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (985 > 512). Running this sequence through the model will result in indexing errors
6000it [00:00, 8539.54it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (517 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1059 > 1024). Running this sequence through the model will result in indexing errors
1461485it [02:31, 9659.80it/s] 


In [13]:
stats

{'intfloat/multilingual-e5-large-instruct': StatStorage(total_unk_tokens=4532187, overall_sentences=1461485, total_sentences_with_unk=1142502, overall_tokens=55880452, overall_chars=103103353),
 'sentence-transformers/LaBSE': StatStorage(total_unk_tokens=7312619, overall_sentences=1461485, total_sentences_with_unk=1351852, overall_tokens=28796686, overall_chars=103103353),
 'facebook/nllb-200-distilled-600M': StatStorage(total_unk_tokens=10283202, overall_sentences=1461485, total_sentences_with_unk=1391186, overall_tokens=55871597, overall_chars=103103353)}
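
The raw counts are easier to compare once they are turned into ratios. The sketch below copies the numbers printed above into a hypothetical namedtuple called `Stats` that mirrors the printed field names (the real StatStorage class may differ), then derives characters per token, the share of tokens that are unk, and the share of sentences containing at least one unk token. The helper `derived_ratios` is an illustration, not part of eeve.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields printed above; the real
# StatStorage class lives in eeve and may differ.
Stats = namedtuple(
    "Stats",
    "total_unk_tokens overall_sentences total_sentences_with_unk "
    "overall_tokens overall_chars",
)

# Values copied from the output above.
stats_example = {
    "intfloat/multilingual-e5-large-instruct": Stats(
        4532187, 1461485, 1142502, 55880452, 103103353
    ),
    "sentence-transformers/LaBSE": Stats(
        7312619, 1461485, 1351852, 28796686, 103103353
    ),
    "facebook/nllb-200-distilled-600M": Stats(
        10283202, 1461485, 1391186, 55871597, 103103353
    ),
}


def derived_ratios(s):
    """Turn raw counts into ratios that are comparable across tokenizers."""
    return {
        "chars_per_token": s.overall_chars / s.overall_tokens,
        "unk_token_share": s.total_unk_tokens / s.overall_tokens,
        "unk_sentence_share": s.total_sentences_with_unk / s.overall_sentences,
    }


ratios = {name: derived_ratios(s) for name, s in stats_example.items()}
for name, r in ratios.items():
    print(name + ": " + ", ".join(f"{k}={v:.3f}" for k, v in r.items()))
```

On these figures, LaBSE comes out at roughly 3.6 characters per token and the other two tokenizers at roughly 1.8, while the share of sentences with at least one unk token ranges from about 78% to about 95%.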

In [14]:
vis.print_comparisons(stats)