# Tokenization

This section explores tokenization with the Hugging Face
`tokenizers` library.

## Regex

Notably, many features of the library rely on regular expressions. The regex
wrapper lives in the `tokenizers` package: `from tokenizers import Regex`.

## Normalization

Normalizers are functions responsible for cleaning up
strings before any splitting happens. They handle tasks
such as separating concept-level languages, translating Unicode
characters into more meaningful forms, and other kinds
of text refinement.

In other words, anything involving "cleaning" the text
belongs in this step.

In [None]:
from tokenizers import normalizers

norms = [
    normalizers.NFKD(),
    normalizers.Strip(),
    normalizers.StripAccents(),
]
normalizer = normalizers.Sequence(norms)
normalizer.normalize_str("this is a test")
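
To make the effect of this kind of pipeline concrete, here is a rough standard-library approximation of the NFKD, Strip, StripAccents sequence. It is only a sketch: the `normalize` helper is a hypothetical name, and treating "accents" as Unicode combining marks is an assumption about how accent stripping behaves, not a statement about the library's internals.

```python
import unicodedata


def normalize(text: str) -> str:
    """Rough stdlib stand-in for NFKD -> Strip -> StripAccents."""
    # NFKD: compatibility decomposition, e.g. "é" becomes "e" + a combining accent
    decomposed = unicodedata.normalize("NFKD", text)
    # Strip: remove leading and trailing whitespace
    stripped = decomposed.strip()
    # StripAccents (assumed behavior): drop the combining marks left by the decomposition
    return "".join(ch for ch in stripped if not unicodedata.combining(ch))


print(normalize("  H\u00e9llo w\u00f6rld \ufb01ne  "))  # prints "Hello world fine"
```

The plain ASCII test string used in the cell above would pass through unchanged; the difference only shows up on accented or compatibility characters such as the "ﬁ" ligature.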



# Pre-Tokenization

Pre-tokenization is the process of splitting the string into
subcomponents, which are then used for the actual tokenization. It performs the "split into arrays" portion of tokenization.

Importantly, pre-tokenizers can be chained in sequence without any problem: each one further splits the pieces produced by the one before it.


In [24]:
from tokenizers import pre_tokenizers

#Standard tokenization
pretoksequence = [
    pre_tokenizers.Punctuation(),
    pre_tokenizers.Whitespace(),
    pre_tokenizers.Digits(individual_digits=True),

]

pretokenizer = pre_tokenizers.Sequence(pretoksequence)
item = pretokenizer.pre_tokenize_str("Hey, let's see how this does: 123445")
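
Since the cell above does not show its result, the following standard-library sketch illustrates the kind of `(piece, (start, end))` output a punctuation, whitespace, and single-digit split is meant to produce. The regular expression and the `pre_tokenize` helper are assumptions made for illustration; they mimic the intent of the sequence and are not the library's implementation.

```python
import re

# Assumed pattern: a single digit, a run of letters/underscores,
# or a single punctuation mark. Whitespace is skipped entirely.
PIECE = re.compile(r"\d|[^\W\d]+|[^\w\s]")


def pre_tokenize(text):
    """Return (piece, (start, end)) tuples with character offsets."""
    return [(m.group(), m.span()) for m in PIECE.finditer(text)]


pieces = pre_tokenize("Hey, let's see how this does: 123445")
print(pieces[:5])
# [('Hey', (0, 3)), (',', (3, 4)), ('let', (5, 8)), ("'", (8, 9)), ('s', (9, 10))]
```

The offsets always refer to positions in the original string, which is what lets later stages map tokens back to the source text.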



In [26]:
# Custom character-level pre-tokenizer. Note the use of Regex.
#
# The core library is written in Rust, so this is faster
# than a custom Python class.
from tokenizers import Regex

pattern = Regex('.')
pretokenizer = pre_tokenizers.Split(pattern, 'isolated')
pretokenizer.pre_tokenize_str("Hey, let's see how this does: 123445")

[('H', (0, 1)),
 ('e', (1, 2)),
 ('y', (2, 3)),
 (',', (3, 4)),
 (' ', (4, 5)),
 ('l', (5, 6)),
 ('e', (6, 7)),
 ('t', (7, 8)),
 ("'", (8, 9)),
 ('s', (9, 10)),
 (' ', (10, 11)),
 ('s', (11, 12)),
 ('e', (12, 13)),
 ('e', (13, 14)),
 (' ', (14, 15)),
 ('h', (15, 16)),
 ('o', (16, 17)),
 ('w', (17, 18)),
 (' ', (18, 19)),
 ('t', (19, 20)),
 ('h', (20, 21)),
 ('i', (21, 22)),
 ('s', (22, 23)),
 (' ', (23, 24)),
 ('d', (24, 25)),
 ('o', (25, 26)),
 ('e', (26, 27)),
 ('s', (27, 28)),
 (':', (28, 29)),
 (' ', (29, 30)),
 ('1', (30, 31)),
 ('2', (31, 32)),
 ('3', (32, 33)),
 ('4', (33, 34)),
 ('4', (34, 35)),
 ('5', (35, 36))]

# Model

Any kind of tokenization requires training. At some point along the way, the tokenization system must construct a vocabulary of some kind, such that words can be turned into tokens and vice versa. The degree of training, however, varies.

## WordLevel

The simplest kind of tokenizer is perhaps the word-level tokenizer. It simply maps incoming words onto a vocabulary of limited size. Words outside the vocabulary are, by virtue of being unknown, all represented by a single unknown token, and the downstream model has to make its best guess as to what that token stood for.
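
A minimal pure-Python sketch of this idea follows. The `[UNK]` token name, the `build_vocab` and `encode` helpers, and the toy corpus are all hypothetical choices for illustration; the library's own WordLevel model is not used here.

```python
from collections import Counter

UNK = "[UNK]"  # hypothetical name for the unknown token


def build_vocab(words, max_size):
    """Keep the most frequent words, reserving id 0 for the unknown token."""
    counts = Counter(words)
    vocab = {UNK: 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab


def encode(words, vocab):
    """Map each word to its id, falling back to the unknown token's id."""
    return [vocab.get(w, vocab[UNK]) for w in words]


corpus = "the cat sat the cat sat the cat on the mat".split()
vocab = build_vocab(corpus, max_size=4)
print(vocab)                                 # {'[UNK]': 0, 'the': 1, 'cat': 2, 'sat': 3}
print(encode("the dog sat".split(), vocab))  # [1, 0, 3]
```

Note how "dog" collapses to id 0: once a word falls outside the vocabulary, its identity is lost.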

## BPE, WordPiece

Byte Pair Encoding (BPE) and WordPiece are both tokenization algorithms that require training, and both build up their vocabulary in much the same way:

* Collect all the individual characters that occur in the input stream.
* Look at the text and work out which groups of symbols, when merged together, would most reduce the overall token count.
* Add the merged group to the vocabulary and repeat until the vocabulary is within acceptable limits.

In other words, they start from the smallest units in the corpus and keep merging until the vocabulary reaches a particular size.
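
The merge loop can be sketched in a few lines of pure Python. This is a toy version of the BPE-style procedure using raw pair counts; the `train_bpe` helper and the word list are assumptions for illustration, and WordPiece's different way of scoring candidate merges is not shown.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Every distinct word starts out as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges


merges = train_bpe(["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3)
print(merges)
```

Each learned merge adds one new symbol to the vocabulary, so the number of merges directly controls the final vocabulary size.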

## Unigram

Unigram works in the opposite direction. It starts by initializing a large, complete candidate vocabulary, and then prunes it down until it reaches the target size.
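
The "start big, prune down" shape can be illustrated with a deliberately simplified toy. Real Unigram training scores tokens by how much the corpus likelihood suffers when they are removed; the sketch below swaps in a crude stand-in score (frequency times length), and the `prune_vocab` helper, `max_len` limit, and word list are all hypothetical.

```python
from collections import Counter


def prune_vocab(words, target_size, max_len=4):
    """Start from every substring up to max_len, then prune the weakest tokens."""
    counts = Counter(
        w[i:j]
        for w in words
        for i in range(len(w))
        for j in range(i + 1, min(len(w), i + max_len) + 1)
    )
    vocab = dict(counts)
    while len(vocab) > target_size:
        # Single characters are never removed, so every word stays encodable.
        candidates = [t for t in vocab if len(t) > 1]
        if not candidates:
            break
        # Stand-in score (NOT the real likelihood-based loss): frequency * length.
        worst = min(candidates, key=lambda t: vocab[t] * len(t))
        del vocab[worst]
    return vocab


vocab = prune_vocab(["hug", "hugs", "pug", "pugs", "bug"], target_size=10)
print(sorted(vocab, key=len))
```

On this toy corpus the rare pieces such as "bu" and "bug" are pruned first, while the frequent "ug" survives alongside the single characters.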










# Extra

A few additional prebuilt tokenizers are available. They still require training.

## SentencePiece

SentencePiece works well when a language has no clear boundary between words, as in Chinese, for example. It treats an entire sentence as a single unit and then attempts to break it apart into pieces.
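
One reason this approach can operate on raw sentences is that whitespace is kept as an ordinary symbol rather than thrown away. The sketch below shows that idea using the `▁` replacement character (the same character appears in the `SentencePieceBPETokenizer()` repr in this notebook). The `escape_whitespace` and `restore` helpers are hypothetical and only mimic the convention; they are not library calls.

```python
META = "\u2581"  # "▁", used in place of spaces


def escape_whitespace(sentence: str) -> str:
    """Replace spaces with a visible symbol and add one as a prefix."""
    return META + sentence.replace(" ", META)


def restore(text: str) -> str:
    """Undo escape_whitespace: turn symbols back into spaces, drop the prefix."""
    return text.replace(META, " ")[1:]


escaped = escape_whitespace("hello world")
print(escaped)           # ▁hello▁world
print(restore(escaped))  # hello world
```

Because the space is part of the symbol stream, pieces can be concatenated back into the original sentence without any language-specific rules.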


The two varieties are:

* SentencePieceUnigramTokenizer
* SentencePieceBPETokenizer

Due to the way the imports are run, PyCharm, and likely other IDEs, do not detect these as valid imports before the code is run.

## Extra

A few extra tokenizers are available, but are perhaps not clearly listed:

* CharBPETokenizer
* ByteLevelBPETokenizer
* BertWordPieceTokenizer


# Training

## Training a basic pretrained tokenizer

Training a tokenizer takes quite a bit of time, but is nonetheless fairly straightforward: build a dataset of the text content, feed it through an iterator, and let the trainer work from there. One additional note: it is likely beneficial to start from a premade tokenizer that is reasonably close to your application.


In [1]:
import tokenizers
import transformers

## Load a premade tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.tokenize("Test this tokenization 1190")



['Test', 'this', 'token', '##ization', '119', '##0']

In [2]:
from fastai.text.all import get_text_files
import datasets


# Adapt the tokenizer to the particular application by training it further.
# Use the datasets library to load the text; it is significantly faster than raw Python.

src = r'C:\Users\chris\PycharmProjects\qa\Data\lambada-dataset\train-novels'
files = [str(item) for item in get_text_files(src)]
ds = datasets.load_dataset(src, '.txt', data_files=files )
ds

Resolving data files: 100%|██████████| 2662/2662 [00:00<00:00, 14921.24it/s]
Using custom data configuration train-novels-042faceeba55f51c
Reusing dataset text (C:\Users\chris\.cache\huggingface\datasets\text\train-novels-042faceeba55f51c\0.0.0\acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08)
100%|██████████| 1/1 [00:00<00:00,  3.02it/s]


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 14747362
    })
})

In [38]:

training = ds['train']
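# Note: shuffle() returns a new dataset rather than shuffling in place,
# so this call leaves `training` in its original order.
training.shuffle()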
training[0:10]

{'text': [" prism story of ci prism -lrb- story of ci -rrb- copyright 2010 rachel moschell published by rachel moschell at smashwords third edition the prism table of contents 0 gilded 1 purple 2 gaudy gold 3 electric blue 4 mocha 5 aquamarine 6 emerald 7 white plaster 8 scarlet 9 silver 10 midnight blue 11 coffee 12 hazel 13 olive green 14 beet red 15 pale 16 canary yellow 17 brick red 18 dark 19 transparent 20 red white and blue 21 cinnamon 22 sickly pink 23 bittersweet 24 sea green 25 crimson 26 pale blue 27 sapphire 28 fiery 29 turquoise 30 blond 31 plaid 32 white 33 grape 34 lilac preview of reverb -lrb- story of ci 2 -rrb- 0 gilded the silvery branches of the molle tree whispered in the shade , sprinkling soft leaves in the dirt of the boy 's path . ",
  ' the sky shone sapphire behind the lacy branches , and the only other life along this back road were two tawny cows grazing in the silver white grass of the ditch . ',
  ' behind them ran the same crumbling adobe wall the boy sa

In [5]:
# Create a generator that yields batches of raw text.

batch_size = 128
def loader(ds_split, max_batches=10000):
    total_batches = min(len(ds_split) // batch_size + 1, max_batches)
    batch_count = 0
    print("total batches: %s" % total_batches)
    collection = []

    for instance in ds_split:
        collection.append(instance['text'])
        if len(collection) >= batch_size:
            if batch_count % 1000 == 0:
                print("batch %s of %s" % (batch_count, total_batches))
            yield collection
            batch_count += 1
            collection = []
        # Note: the strict '>' lets one batch past max_batches through
        # before stopping (hence "batch 10000 of 10000" in the log).
        if batch_count > max_batches:
            break
    # Yield whatever is left over, skipping an empty final batch.
    if collection:
        yield collection

# Train a new tokenizer on our text, reusing the premade tokenizer's pipeline.
ds_loader = loader(training)
new_tokenizer = tokenizer.train_new_from_iterator(ds_loader, vocab_size=60000)

total batches: 10000
batch 0 of 10000
batch 1000 of 10000
batch 2000 of 10000
batch 3000 of 10000
batch 4000 of 10000
batch 5000 of 10000
batch 6000 of 10000
batch 7000 of 10000
batch 8000 of 10000
batch 9000 of 10000
batch 10000 of 10000


In [6]:
#Once trained, you want to save it so you can use it again later
new_tokenizer.save_pretrained("my_example_pretrained")

('my_example_pretrained\\tokenizer_config.json',
 'my_example_pretrained\\special_tokens_map.json',
 'my_example_pretrained\\vocab.txt',
 'my_example_pretrained\\added_tokens.json',
 'my_example_pretrained\\tokenizer.json')

In [35]:
#And it can now be loaded from here!
import random
my_tokenizer = transformers.AutoTokenizer.from_pretrained("my_example_pretrained")
example = training[random.randint(0, 1000000)]['text']
print(example)
my_tokenizer.tokenize(example)

 they were in position , ten feet away from their home ship . 


['they',
 'were',
 'in',
 'position',
 ',',
 'ten',
 'feet',
 'away',
 'from',
 'their',
 'home',
 'ship',
 '.']

## Training tokenizers completely from scratch: A comparison

Sometimes it becomes necessary to train a tokenizer completely from scratch. In that case, it is possible to initialize an untrained tokenizer and then call its `train_from_iterator` method.

Let's train a few different tokenizers from scratch. We will keep track of the time for comparison, and only train them briefly. We will train essentially all of them. These are all predefined pipelines:

* SentencePieceBPETokenizer
* CharBPETokenizer
* SentencePieceUnigramTokenizer
* ByteLevelBPETokenizer
* BertWordPieceTokenizer



In [92]:
#Run imports
import datasets
import timeit

from timeit import default_timer as timer
from functools import partial
from itertools import product

#Get tokenizers
from tokenizers import SentencePieceBPETokenizer
from tokenizers import CharBPETokenizer
from tokenizers import SentencePieceUnigramTokenizer
from tokenizers import ByteLevelBPETokenizer
from tokenizers import BertWordPieceTokenizer

SentencePieceBPETokenizer()

Tokenizer(vocabulary_size=0, model=SentencePieceBPE, unk_token=<unk>, replacement=▁, add_prefix_space=True, dropout=None)

"<class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>"

In [95]:
#Configure our dataset

ds = datasets.load_dataset("lambada")
training = ds['train']

def filter_length(x, min_length, max_length):
    return min_length < len(x['text']) < max_length


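def yield_batches(dataset: datasets.Dataset,
                  min_length: int,
                  max_length: int,
                  batch_size: int,
                  num_batches: int):
    """
    Yield up to num_batches lists of batch_size texts whose character
    length lies strictly between min_length and max_length.
    """
    dataset = dataset.filter(partial(filter_length, min_length=min_length, max_length=max_length))
    # Report progress roughly every 10% (guard against a zero modulus).
    report_every = max(num_batches // 10, 1)
    batch_collection = []
    batch_num = 0
    for item in dataset:
        batch_collection.append(item['text'])
        if len(batch_collection) >= batch_size:
            if batch_num % report_every == 0:
                print("batch %s of %s" % (batch_num, num_batches))

            yield batch_collection
            # Reset before checking the limit so the last batch is not yielded twice.
            batch_collection = []
            batch_num += 1
            if batch_num >= num_batches:
                break
    # Yield the final partial batch, if any.
    if batch_collection:
        yield batch_collection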


def train_tokenizer(dset, n_batches, batch_size, tokenizer_cls, lengths):
    print("training tokenizer: number %s, batch %s, token %s, lengths %s"
          % (n_batches, batch_size, str(tokenizer_cls), lengths))
    statistics = {
        'n_batches': n_batches,
        'batch_size': batch_size,
        'tokenizer': str(tokenizer_cls)}

    train_batches = yield_batches(dset, lengths[0], lengths[1], batch_size, n_batches)
    tokenizer = tokenizer_cls()
    start = timer()

    # Note it is "train_from_iterator", not "train_new_from_iterator".
    # It trains in place and returns None, so the result is not reassigned.
    tokenizer.train_from_iterator(train_batches)
    end = timer()
    print("elapsed %s" % (end - start))
    statistics['elapsed'] = end - start
    statistics['tkn'] = tokenizer
    return statistics





Reusing dataset lambada (C:\Users\chris\.cache\huggingface\datasets\lambada\plain_text\1.1.0\e32d76a7236c9ebb30099bc73d677c3acf32ddffb411836fe9ffc091ad3f3bec)
100%|██████████| 3/3 [00:00<00:00, 25.88it/s]


In [None]:
# Define our examination parameters

n_batches = [100, 1000, 2000]
batch_sizes = [2, 4, 8, 16]
# Named tokenizer_classes to avoid shadowing the `tokenizers` package.
tokenizer_classes = [SentencePieceBPETokenizer,
                     CharBPETokenizer,
                     SentencePieceUnigramTokenizer,
                     ByteLevelBPETokenizer,
                     BertWordPieceTokenizer,
                     ]
lengths = [[0, 5000], [5000, 10000], [10000, 15000]]
options = product(n_batches, batch_sizes, tokenizer_classes, lengths)

# Train tokenizers
statistics = []
for option in options:
    print(option)
    statistics.append(train_tokenizer(training, *option))



(100, 2, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  1.85ba/s]


batch 0 of 100
elapsed 1.6916866999963531
(100, 2, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.44ba/s]


elapsed 1.2769699999989825
(100, 2, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


batch 0 of 100
elapsed 1.3438907000017934
(100, 2, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.44ba/s]


batch 0 of 100
elapsed 1.2478760000012699
(100, 2, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.38ba/s]


elapsed 1.2954339999996591
(100, 2, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.33ba/s]


batch 0 of 100
elapsed 1.356816700004856
(100, 2, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


batch 0 of 100
elapsed 1.3084346000032383
(100, 2, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.44ba/s]


elapsed 1.2705461000005016
(100, 2, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.34ba/s]


batch 0 of 100
elapsed 1.353767799999332
(100, 2, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


batch 0 of 100
elapsed 1.3020074000014574
(100, 2, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.44ba/s]


elapsed 1.2809663999942131
(100, 2, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.49ba/s]


batch 0 of 100
elapsed 1.2934448999949382
(100, 2, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


batch 0 of 100
elapsed 1.2777342000044882
(100, 2, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.48ba/s]


elapsed 1.2510975999975926
(100, 2, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 2, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.46ba/s]


batch 0 of 100
elapsed 1.2936815000066417
(100, 4, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.35ba/s]


elapsed 1.2979200999980094
(100, 4, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.38ba/s]


elapsed 1.3050847999984398
(100, 4, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.44ba/s]


elapsed 1.2955463999969652
(100, 4, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.49ba/s]


elapsed 1.2201088999936474
(100, 4, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.31ba/s]


elapsed 1.3380495000019437
(100, 4, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.46ba/s]


elapsed 1.2862878000014462
(100, 4, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.44ba/s]


elapsed 1.2521962999962852
(100, 4, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.33ba/s]


elapsed 1.330471100001887
(100, 4, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


elapsed 1.3325036999958684
(100, 4, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


elapsed 1.2884132000035606
(100, 4, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.45ba/s]


elapsed 1.2724520999981905
(100, 4, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


elapsed 1.3571018000002368
(100, 4, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.49ba/s]


elapsed 1.2159792000020389
(100, 4, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.48ba/s]


elapsed 1.2472769000014523
(100, 4, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 4, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.38ba/s]


elapsed 1.340424100002565
(100, 8, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


elapsed 1.2701121000063722
(100, 8, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.46ba/s]


elapsed 1.260712200004491
(100, 8, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.40ba/s]


elapsed 1.3175147000001743
(100, 8, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.37ba/s]


elapsed 1.2809833999999682
(100, 8, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.46ba/s]


elapsed 1.2536536999978125
(100, 8, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.42ba/s]


elapsed 1.296337699997821
(100, 8, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.24ba/s]


elapsed 1.3572767000005115
(100, 8, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.31ba/s]


elapsed 1.3498147999998764
(100, 8, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.33ba/s]


elapsed 1.367449499994109
(100, 8, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.34ba/s]


elapsed 1.3058919999966747
(100, 8, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.38ba/s]


elapsed 1.307646900000691
(100, 8, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


elapsed 1.3384625999970012
(100, 8, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.30ba/s]


elapsed 1.3132393999985652
(100, 8, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.42ba/s]


elapsed 1.278508299998066
(100, 8, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 8, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.40ba/s]


elapsed 1.3063079999992624
(100, 16, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.33ba/s]


elapsed 1.2898939999940922
(100, 16, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.10ba/s]


elapsed 1.93993380000029
(100, 16, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.16ba/s]


elapsed 1.4736048999984632
(100, 16, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.28ba/s]


elapsed 1.3292274000050384
(100, 16, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.28ba/s]


elapsed 1.358046800000011
(100, 16, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.22ba/s]


elapsed 1.4136472000027425
(100, 16, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.40ba/s]


elapsed 1.2757873000009567
(100, 16, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.27ba/s]


elapsed 1.362846999996691
(100, 16, <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.sentencepiece_unigram.SentencePieceUnigramTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.20ba/s]


elapsed 1.4509859000027063
(100, 16, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [0, 5000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


elapsed 1.2730617999986862
(100, 16, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.33ba/s]


elapsed 1.3370118000020739
(100, 16, <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.byte_level_bpe.ByteLevelBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


elapsed 1.3413400999997975
(100, 16, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [0, 5000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


elapsed 1.3001385000025039
(100, 16, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [5000, 10000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.29ba/s]


elapsed 1.3491367000024184
(100, 16, <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, [10000, 15000])
training tokenizer: number 100, batch 16, token <class 'tokenizers.implementations.bert_wordpiece.BertWordPieceTokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.39ba/s]


elapsed 1.3230317999987165
(1000, 2, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [0, 5000])
training tokenizer: number 1000, batch 2, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.36ba/s]


batch 0 of 1000
elapsed 1.279745500003628
(1000, 2, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [5000, 10000])
training tokenizer: number 1000, batch 2, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [5000, 10000]


100%|██████████| 3/3 [00:01<00:00,  2.31ba/s]


elapsed 1.3476896000065608
(1000, 2, <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, [10000, 15000])
training tokenizer: number 1000, batch 2, token <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>, lengths [10000, 15000]


100%|██████████| 3/3 [00:01<00:00,  2.31ba/s]


batch 0 of 1000
elapsed 1.380329299994628
(1000, 2, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [0, 5000])
training tokenizer: number 1000, batch 2, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [0, 5000]


100%|██████████| 3/3 [00:01<00:00,  2.35ba/s]


batch 0 of 1000
elapsed 1.3010429999994813
(1000, 2, <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, [5000, 10000])
training tokenizer: number 1000, batch 2, token <class 'tokenizers.implementations.char_level_bpe.CharBPETokenizer'>, lengths [5000, 10000]


 33%|███▎      | 1/3 [00:00<00:01,  1.82ba/s]