In [23]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Getting the text examples

We fetch some 'nice' chapters from various books in different languages from project Gutenberg. The diversity will demonstrate some of the challenges when tokenizing

In [24]:
text = """
I buy my parents' 10% of U.K. startup for $1.4 billion. Dr. Watson's cat called Mrs. Hersley and it was w.r.o.n.g., more to come ...
""".strip()

# Project Gutenberg, 244: A Study in Scarlet (en), Arthur Conan Doyle
text_en = """
This was a lofty chamber, lined and littered with countless bottles.
Broad, low tables were scattered about, which bristled with retorts,
test-tubes, and little Bunsen lamps, with their blue flickering flames.
There was only one student in the room, who was bending over a distant
table absorbed in his work. At the sound of our steps he glanced round
and sprang to his feet with a cry of pleasure. “I’ve found it! I’ve
found it,” he shouted to my companion, running towards us with a
test-tube in his hand. “I have found a re-agent which is precipitated
by hæmoglobin, and by nothing else.” Had he discovered a gold mine,
greater delight could not have shone upon his features.
""".strip()

# Project Gutenberg, 34811: Buddenbrooks: Verfall einer Familie (de), Thomas Mann
text_de = """
»Ich rechne«, sagte der Konsul trocken. Die Kerze flammte auf, und man
sah, wie er gerade aufgerichtet und mit Augen, so kalt und aufmerksam,
wie sie während des ganzen Nachmittags noch nicht darein geschaut
hatten, fest in die tanzende Flamme blickte. -- »Einerseits: Sie geben
33335 an Gotthold und 15000 an die in Frankfurt, und das macht 48335 in
Summa. Andererseits: Sie geben nur 25000 an die in Frankfurt, und das
bedeutet für die Firma einen Gewinn von 23335. Das ist aber nicht alles.
Gesetzt, Sie leisten an Gotthold eine Entschädigungssumme für den Anteil
am Hause, so ist das Prinzip durchbrochen, so ist er damals =nicht=
endgültig abgefunden worden, so kann er nach Ihrem Tode ein gleich
großes Erbe beanspruchen, wie meine Schwester und ich, und dann handelt
es sich für die Firma um einen Verlust von Hunderttausenden, mit dem sie
nicht rechnen kann, mit dem ich als künftiger alleiniger Inhaber nicht
rechnen kann ... Nein, Papa!« beschloß er mit einer energischen
Handbewegung und richtete sich noch höher auf. »Ich muß Ihnen abraten,
nachzugeben!«
""".strip()

# Project Gutenberg, 13951: Les trois mousquetaires (fr), Alexandre Dumas
text_fr = """
D’Artagnan, tout en marchant et en monologuant, était arrivé à quelques
pas de l’hôtel d’Aiguillon, et devant cet hôtel il avait aperçu Aramis
causant gaiement avec trois gentilshommes des gardes du roi. De son
côté, Aramis aperçut d’Artagnan; mais comme il n’oubliait point que
c’était devant ce jeune homme que M. de Tréville s’était si fort
emporté le matin, et qu’un témoin des reproches que les mousquetaires
avaient reçus ne lui était d’aucune façon agréable, il fit semblant de
ne pas le voir. D’Artagnan, tout entier au contraire à ses plans de
conciliation et de courtoisie, s’approcha des quatre jeunes gens en
leur faisant un grand salut accompagné du plus gracieux sourire. Aramis
inclina légèrement la tête, mais ne sourit point. Tous quatre, au
reste, interrompirent à l’instant même leur conversation.
""".strip()

# Project Gutenberg, 27729: Bajki (pl), Adam Mickiewicz
text_pl = """
Powolny bóg wszechżabstwu na króla użycza
Małego jako Łokiet Kija Kijowicza.
Spadł Kij i pluskiem wszemu obwieścił się błotu.
Struchlały żaby na ten majestat łoskotu.
Milczą, dzień i noc, ledwie śmiejąc dychać,
Nazajutrz jedna drugiej pytają: „Co słychać?
Czy niema co od króla?” Aż śmielsze i starsze
Ruszają przed oblicze stawić się monarsze.
Zrazu zdala, w bojaźni, by się nie narazić;
Potem, przemógłszy te strachy,
Brat za brat z królem biorą się pod pachy
I zaczynają na kark mu włazić.
„Toż to taki ma być król?... Najjaśniejszy Bela,
Nie wiele z niego będziem mieć wesela;
Król, co po karku bezkarnie go gładzim,
Niechaj nam abdykuje zaraz, niedołęga!
Potrzebna nam jest władza, ale władza tęga!”
""".strip()

# Project Gutenberg, 23585: 佛說四十二章經 (zh)
text_zh="""
沙門夜誦迦葉佛遺教經，其聲悲緊，思悔欲退。佛問之曰：汝昔在家，曾為何業？對
曰：愛彈琴。佛言：弦緩如何？對曰：不鳴矣！弦急如何？對曰：聲絕矣！急緩得中
如何？對曰：諸音普矣！佛言：沙門學道亦然，心若調適，道可得矣。於道若暴，暴
即身疲。其身若疲，意即生惱。意若生惱，行即退矣。其行既退，罪必加矣。但清淨
安樂，道不失矣。
""".strip()

texts = {
    'abbreviations': text,
    'english': text_en,
    'german': text_de,
    'french': text_fr,
    'polish': text_pl,
    'chinese': text_zh
}

# Word based tokenization

In [26]:
# !python -m spacy download en_core_web_sm
# !python -m spacy download fr_core_news_sm
# !python -m spacy download de_core_news_sm
# !python -m spacy download pl_core_news_sm
# !python -m spacy download zh_core_web_sm

In [27]:
import re, nltk, jieba, spacy
nltk.download('punkt')
spacy.prefer_gpu()
nlp = {
    'english': spacy.load('en_core_web_sm'),
    'french': spacy.load('fr_core_news_sm'),
    'german': spacy.load('de_core_news_sm'),
    'polish': spacy.load('pl_core_news_sm'),
    'chinese': spacy.load('zh_core_web_sm'),
}

def jieba_word(text: str, language: str = 'english') -> list[str]:
    return [t[0] for t in jieba.tokenize(text) if t[0] != ' ']

def python_word(text: str, language: str = 'english') -> list[str]:
    text = re.sub(r'[^\w\-]+', ' ', text)
    return [token for token in text.split(' ') if token]

def nltk_word(text: str, language: str = 'english') -> list[str]:
    return nltk.word_tokenize(text, language == 'chinese' and 'english' or language)

def spacy_word(text: str, language: str = 'english') -> list[str]:
    global nlp
    return [token.text for token in nlp[language](text)]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\roger\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output
from tabulate import tabulate
from itertools import zip_longest
from unidecode import unidecode

def show_tokens(language: str, alpha: bool, single: bool, decode: bool):
    tokens = []
    for tokenizer in tokenizers.values():
        t = tokenizer(texts[language], language == "abbreviations" and "english" or language)
        if alpha:
            t = [token for token in t if re.match(r'^[a-zA-Z][\w\-\.]*$', token)]
        if decode:
            t = [unidecode(token) for token in t]
        if single:
            t = [token for token in t if len(token) > 1]
        tokens.append(t)
    with (out_text := widgets.Output()):
        display(Markdown(texts[language]))
    with (out_tokens := widgets.Output(layout = {'padding': '0px 50px', 'min_width': '60%'})):
        if language == 'chinsese':
            for i in range(len(tokens)):
                nl = '\n  '
                print(f'{list(tokenizers.keys())[i]}:\n  {nl.join(tokens[i][:10])}\n\n')
        else:
            headers = tokenizers.keys()
            rows = [[col[:16] for col in row] for row in zip_longest(*tokens, fillvalue='')]
            display(Markdown(tabulate(rows[:40], headers, tablefmt="pipe")))
    with out_result:
        clear_output()
        display(widgets.HBox([out_text, out_tokens]))

opt_alpha = widgets.Checkbox(description='only words')
opt_single = widgets.Checkbox(description='no single letter words')
opt_decode = widgets.Checkbox(description='unicode decode')
opt_language = widgets.Dropdown(description='language', options=['abbreviations', 'english', 'french', 'german', 'polish', 'chinese'])
opt_method = widgets.Dropdown(description='method', options=[
    ('python (split on whitespace)', 'python-split'),
    ('nltk (word)', 'nltk-word'),
    ('spaCy (word)', 'spacy-word'),
    ('jieba (chinese)', 'jieba-word'),
])
tokenizers = {
    'python-word': python_word,
    'nltk-word': nltk_word,
    'spacy-word': spacy_word,
    'jieba-word': jieba_word,
}

out_result = widgets.Output()
display(widgets.interactive(show_tokens, language=opt_language, alpha=opt_alpha, single=opt_single, decode=opt_decode))
display(out_result)

interactive(children=(Dropdown(description='language', options=('abbreviations', 'english', 'french', 'german'…

Output()

## Sub-word tokenization

In [6]:
def sub_words_tokenize(text: str, k: int, mark: str = '#') -> list[str]:
    text = re.sub(r'\W+', ' ', text)
    tokens = []
    for token in text.split():
        if len(token) <= k:
            tokens.append(token)
            continue
        for i in range(len(token) - k + 1):
            tokens.append(i == 0 and mark + token[i:i + k] or token[i:i + k])
    return tokens

In [8]:
print(sub_words_tokenize("teach multtimedia", 3))

['#tea', 'eac', 'ach', '#mul', 'ult', 'ltt', 'tti', 'tim', 'ime', 'med', 'edi', 'dia']


In [28]:
MAX_K = 11

def show_sub_words(language: str, mark: bool):
    tokens = []
    for k in range(1, MAX_K, 1):
        tokens.append(sub_words_tokenize(texts[language], k, mark and "#" or ""))
    with (out_text := widgets.Output()):
        display(Markdown(texts[language]))
    with (out_tokens := widgets.Output(layout = {'padding': '0px 50px', 'min_width': '60%'})):
        headers = [str(i) for i in range(1, MAX_K, 1)]
        rows = [[f'<{len(t)}>' for t in tokens]] + [row for row in zip_longest(*tokens, fillvalue='')]
        print(tabulate(rows[:30], headers, tablefmt="github"))
    with out_result:
        clear_output()
        display(widgets.HBox([out_text, out_tokens]))

opt_language = widgets.Dropdown(description='language', options=['abbreviations', 'english', 'french', 'german', 'polish', 'chinese'])
opt_mark = widgets.Checkbox(description='mark start-of-word sequences')

out_result = widgets.Output()
display(widgets.interactive(show_sub_words, language=opt_language, mark=opt_mark))
display(out_result)

interactive(children=(Dropdown(description='language', options=('abbreviations', 'english', 'french', 'german'…

Output()

## Bi-gram extraction

In [10]:
from utils.gutenberg import get_book
book = get_book(244)

Naive approach: create bi-grams, count frequencies in text, then pick top-20

In [11]:
from nltk.collocations import (
    BigramCollocationFinder, BigramAssocMeasures,  
    TrigramCollocationFinder, TrigramAssocMeasures,  
    QuadgramCollocationFinder, QuadgramAssocMeasures
)
from nltk.corpus import stopwords
nltk.download('stopwords')

ignored_words = stopwords.words('english')
stopword_filter = lambda w: len(w) < 3 or w.lower() in ignored_words

nResults = 20

def ngram_measures(n: int, metric: str):
    measure = BigramAssocMeasures
    if n == 3: measure = TrigramAssocMeasures
    if n == 4: measure = QuadgramAssocMeasures
    if metric == 'freq': return measure.raw_freq
    if metric == 'pmi': return measure.pmi
    return measure.likelihood_ratio

def ngrams_from_words(n: int, tokens: list[str]):
    if n == 3: return TrigramCollocationFinder.from_words(tokens)
    if n == 4: return QuadgramCollocationFinder.from_words(tokens)
    return BigramCollocationFinder.from_words(tokens)

def ngrams_result(n: int, finder, scores, metric: str):
    if n == 3:
        headers = ['trigram', 'tf(w1)', 'tf(w2)', 'tf(w3)', 'tf(trigram)', f'score ({metric})']
        rows = [[' '.join([t1,t2,t3]), finder.word_fd[t1], finder.word_fd[t2], finder.word_fd[t3], finder.ngram_fd[(t1,t2,t3)], score] for ((t1,t2,t3),score) in scores]
        return (headers, rows)
    if n == 4:
        headers = ['quadgram', 'tf(w1)', 'tf(w2)', 'tf(w3)', 'tf(w4)', 'tf(quadgram)', f'score ({metric})']
        rows = [[' '.join([t1,t2,t3,t4]), finder.word_fd[t1], finder.word_fd[t2], finder.word_fd[t3], finder.word_fd[t4], finder.ngram_fd[(t1,t2,t3,t4)], score] for ((t1,t2,t3,t4),score) in scores]
        return (headers, rows)
    headers = ['bigram', 'tf(w1)', 'tf(w2)', 'tf(bigram)', f'score ({metric})']
    rows = [[' '.join([t1,t2]), finder.word_fd[t1], finder.word_fd[t2], finder.ngram_fd[(t1,t2)], score] for ((t1,t2),score) in scores]
    return (headers, rows)

def ngrams(tokens: list[str], n: int, metric: str, filters: list[str]) -> tuple[list[str],list[list[str]]]:
    measure = ngram_measures(n, metric)
    finder = ngrams_from_words(n, tokens)
    for f in filters:
        if f == 'freq3': finder.apply_freq_filter(3)
        if f == 'freq5': finder.apply_freq_filter(5)
        if f == 'stopwords': finder.apply_word_filter(stopword_filter)
    scores = finder.score_ngrams(measure)[:nResults]
    return ngrams_result(n, finder, scores, metric)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\roger\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [29]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output
from tabulate import tabulate

last_book = 0
tokens = []

def show_ngrams(book, n, metric, stopwords, freq3, freq5):
    global last_book, tokens
    if last_book != book:
        tokens = [token for token in nltk.word_tokenize(get_book(book).page_content) if token.isalpha()]
        last_book = book
    headers, rows = ngrams(tokens, n, metric, [stopwords and 'stopwords', freq3 and 'freq3', freq5 and 'freq5'])
    display(Markdown(tabulate(rows, headers, "pipe")))

opt_book = widgets.Dropdown(description='book', options=[
    ('A Study in Scarlet (en)', 244),
    ('Buddenbrooks: Verfall einer Familie (de)', 34811),
    ('Les trois mousquetaires (fr)', 13951),
    ('Bajki (pl)', 27729),
])
opt_metric = widgets.Dropdown(description='metric', options=['freq', 'pmi', 'lhr'])
opt_n = widgets.BoundedIntText(description='n', value=2, min=2, max=4)
opt_stopword = widgets.Checkbox(description='no stopwords')
opt_freq3 = widgets.Checkbox(description='freq > 3')
opt_freq5 = widgets.Checkbox(description='freq > 5')

display(widgets.interactive(show_ngrams, book=opt_book, n=opt_n ,metric=opt_metric, stopwords=opt_stopword, freq3=opt_freq3, freq5=opt_freq5))


interactive(children=(Dropdown(description='book', options=(('A Study in Scarlet (en)', 244), ('Buddenbrooks: …

### Manual calclualtions of PMI

In [None]:
import math
N = len(tokens)
tf1, tf2, tf12 = 5, 5, 5
p1, p2, p12 = tf1/N, tf2/N, tf12/N
math.log(p12/p1/p2, 2)

### Manual calulcation of LHR

In [None]:
c1=48
c2=94
c12=48
p=c2/N
p1=c12/c1
p2=(c2-c12)/(N-c1)
def L(k,n,p):
    return (p**k)*(1-p)**(n-k)
-2*(math.log(L(c12,c1,p))+math.log(L(c2-c12,N-c1,p))-math.log(L(c12,c1,p1))-math.log(L(c2-c12,N-c1,p2)))

### Simple summary of bi-gram calculation

In [13]:
from nltk.collocations import (
    BigramCollocationFinder, BigramAssocMeasures,  
    TrigramCollocationFinder, TrigramAssocMeasures,  
    QuadgramCollocationFinder, QuadgramAssocMeasures
)
from nltk.corpus import stopwords
tokens = [token for token in nltk.word_tokenize(get_book(244).page_content) if token.isalpha()]

# choose bi-grams, tri-grams, quad-grams
finder = QuadgramCollocationFinder.from_words(tokens)
finder = TrigramCollocationFinder.from_words(tokens)
finder = BigramCollocationFinder.from_words(tokens)

# choose a measure (must match with the finder, here for bi-grams)
measure = BigramAssocMeasures.raw_freq
measure = BigramAssocMeasures.pmi
measure = BigramAssocMeasures.likelihood_ratio


# apply frequency filter
finder.apply_freq_filter(3)

#apply stop word filter
ignored_words = stopwords.words('english')
stopword_filter = lambda w: len(w) < 3 or w.lower() in ignored_words
finder.apply_word_filter(stopword_filter)

# obtain results (top-k)
k = 20
scores = finder.score_ngrams(measure)[:k]

# output term 1, term 2, freq of term 1, freq of term 2, freq of bigram, score
for ((t1,t2),score) in scores:
    print(f'{t1} {t2} {finder.word_fd[t1]} {finder.word_fd[t2]} {finder.ngram_fd[(t1,t2)]} {score}')


Sherlock Holmes 48 94 48 618.2639397345179
Jefferson Hope 37 42 34 491.94785591179664
John Ferrier 31 58 26 330.1825164533987
Brixton Road 15 13 13 224.9205970576053
Salt Lake 9 9 9 170.48968950333003
Lake City 9 13 8 129.82912069112837
Enoch Drebber 9 62 9 119.12601098633465
Scotland Yard 8 6 6 109.52843062923058
Baker Street 6 11 6 103.36758969662593
Private Hotel 5 5 5 100.59482597348887
Lucy Ferrier 29 58 10 96.68084062615571
Lauriston Gardens 4 4 4 82.26110221688567
Joseph Stangerson 13 43 7 79.9799404000117
Never mind 5 37 5 71.28837734133977
little girl 80 27 8 68.66609037830294
young hunter 40 14 6 65.60029780468453
Audley Court 3 3 3 63.42198886696671
could see 96 56 9 61.56793780013763
young man 40 154 9 59.46754617578934
CHAPTER III 28 4 4 59.29458839273476


# Tokenization for machine learning / language models

## Byte Pair Encoding

### Step-by-step BPE Implementation (not efficient)

In [2]:
from collections import defaultdict
import nltk

text = "This course is about this topic"
word_list = [token.lower() for token in nltk.word_tokenize(text) if token.isalpha()]
word_list[:10]

['this', 'course', 'is', 'about', 'this', 'topic']

In [3]:
from collections import Counter

words = { word: (freq, [c for c in word]) for word, freq in Counter(word_list).items()}
list(words.items())[0:10]

[('this', (2, ['t', 'h', 'i', 's'])),
 ('course', (1, ['c', 'o', 'u', 'r', 's', 'e'])),
 ('is', (1, ['i', 's'])),
 ('about', (1, ['a', 'b', 'o', 'u', 't'])),
 ('topic', (1, ['t', 'o', 'p', 'i', 'c']))]

In [4]:
vocab = set()
for (freq, parts) in words.values():
    vocab = vocab | set(parts)
vocabulary = sorted(list(vocab))
print(vocabulary)

['a', 'b', 'c', 'e', 'h', 'i', 'o', 'p', 'r', 's', 't', 'u']


In [5]:
def new_pair_freqs(words):
    pair_freqs = Counter()
    for token, (freq, parts) in words.items():
        for pair in zip(parts[:-1], parts[1:]):
            pair_freqs[pair] += freq
    return pair_freqs

pair_freqs = new_pair_freqs(words)
print(pair_freqs.most_common(10))
best_pair = pair_freqs.most_common()[0][0]
best_pair

[(('i', 's'), 3), (('t', 'h'), 2), (('h', 'i'), 2), (('o', 'u'), 2), (('c', 'o'), 1), (('u', 'r'), 1), (('r', 's'), 1), (('s', 'e'), 1), (('a', 'b'), 1), (('b', 'o'), 1)]


('i', 's')

In [6]:
def merge_pair(pair, words):
    merged = [pair[0] + pair[1]]
    for token, (freq, parts) in words.items():
        i = 0
        while i < len(parts) - 1:
            if parts[i] == pair[0] and parts[i+1] == pair[1]:
                parts[i:i+2] = merged
            else:
                i += 1

vocabulary.append(''.join(best_pair))
merge_pair(best_pair, words)
words['this']

(2, ['t', 'h', 'is'])

In [7]:
vocabulary_size = 20

while len(vocabulary) < vocabulary_size:
    pair_freqs = new_pair_freqs(words)
    best_pair = pair_freqs.most_common()[0][0]
    vocabulary.append(''.join(best_pair))
    merge_pair(best_pair, words)

vocabulary.sort()
for i in range(0, len(vocabulary), 30):
    print(' '.join(vocabulary[i:i+30]))

a b c cou cour cours course e h i is o ou p r s t th this u


In [8]:
words

{'this': (2, ['this']),
 'course': (1, ['course']),
 'is': (1, ['is']),
 'about': (1, ['a', 'b', 'ou', 't']),
 'topic': (1, ['t', 'o', 'p', 'i', 'c'])}

### BPE trainer from transformers library

In [15]:
from IPython.display import display, Markdown, clear_output
from tabulate import tabulate

def print_vocabulary(tokenizer):
    headers = []
    columns = []
    for start in range(0,400,40):
        headers.append(f'{start} -> {start+40}')
        columns.append([f'{i} -> {tokenizer.decode([i])}' for i in range(start,start+40)])
    display(Markdown(tabulate(zip(*columns), headers, tablefmt="github")))

In [17]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from utils.gutenberg import get_book

def batch_loader():
    yield get_book(244).page_content

# 1) define special tokens and create tokenizer object
unknown_token = "[UNK]"
special_tokens = [unknown_token, "[SEP]", "[MASK]", "[CLS]"]
tokenizer = Tokenizer(models.BPE(unk_token = unknown_token))

# 2) setup the trainer for BPE tokenization
trainer = trainers.BpeTrainer(
    vocab_size=5000,  
    min_frequency=3, 
    special_tokens = special_tokens, 
    continuing_subword_prefix='#', 
    end_of_word_suffix='>'
)

# 3) define how to split the text and normalize words
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

# 4) train the tokens from an iterator (we just load one book)
tokenizer.train_from_iterator(batch_loader(), trainer=trainer)

# 5) Build a post-processor templates for classification (example)
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ])   
tokenizer.decoder = decoders.ByteLevel()

# 6) encode a test text
display(Markdown(" ".join(tokenizer.encode(text_en).tokens)))
print(tokenizer.encode(text_en).tokens)

# 7) print vocabulary
print_vocabulary(tokenizer)


[CLS] this> was> a> lo #f #ty> ch #am #b #er,> l #ined> and> lit #tered> with> coun #tless> bo #t #tl #es.> bro #ad,> low> t #abl #es> were> scattered> about,> which> br #ist #led> with> ret #ort #s,> test #-t #ub #es,> and> little> bun #s #en> l #amp #s,> with> their> blue> fl #ick #ering> fl #am #es.> there> was> only> one> stud #ent> in> the> room,> who> was> b #ending> over> a> distant> table> absor #bed> in> his> work.> at> the> sound> of> our> steps> he> glanced> round> and> sprang> to> his> feet> with> a> cry> of> pleas #ure.> “i’ve> found> it #!> i’ve> found> it,”> he> shouted> to> my> companion,> running> towards> us> with> a> test #-t #u #be> in> his> hand.> “i> have> found> a> re- #ag #ent> which> is> precip #itated> by> h #æ #m #og #l #ob #in,> and> by> nothing> els #e.”> had> he> discovered> a> gold> min #e,> grea #ter> del #ight> could> not> have> sh #one> upon> his> features.> [SEP]

['[CLS]', 'this>', 'was>', 'a>', 'lo', '#f', '#ty>', 'ch', '#am', '#b', '#er,>', 'l', '#ined>', 'and>', 'lit', '#tered>', 'with>', 'coun', '#tless>', 'bo', '#t', '#tl', '#es.>', 'bro', '#ad,>', 'low>', 't', '#abl', '#es>', 'were>', 'scattered>', 'about,>', 'which>', 'br', '#ist', '#led>', 'with>', 'ret', '#ort', '#s,>', 'test', '#-t', '#ub', '#es,>', 'and>', 'little>', 'bun', '#s', '#en>', 'l', '#amp', '#s,>', 'with>', 'their>', 'blue>', 'fl', '#ick', '#ering>', 'fl', '#am', '#es.>', 'there>', 'was>', 'only>', 'one>', 'stud', '#ent>', 'in>', 'the>', 'room,>', 'who>', 'was>', 'b', '#ending>', 'over>', 'a>', 'distant>', 'table>', 'absor', '#bed>', 'in>', 'his>', 'work.>', 'at>', 'the>', 'sound>', 'of>', 'our>', 'steps>', 'he>', 'glanced>', 'round>', 'and>', 'sprang>', 'to>', 'his>', 'feet>', 'with>', 'a>', 'cry>', 'of>', 'pleas', '#ure.>', '“i’ve>', 'found>', 'it', '#!>', 'i’ve>', 'found>', 'it,”>', 'he>', 'shouted>', 'to>', 'my>', 'companion,>', 'running>', 'towards>', 'us>', 'with>', '

| 0 -> 40   | 40 -> 80   | 80 -> 120   | 120 -> 160   | 160 -> 200   | 200 -> 240    | 240 -> 280    | 280 -> 320   | 320 -> 360    | 360 -> 400    |
|-----------|------------|-------------|--------------|--------------|---------------|---------------|--------------|---------------|---------------|
| 0 ->      | 40 -> o    | 80 -> #d    | 120 -> #2    | 160 -> and>  | 200 -> #ri    | 240 -> #ing   | 280 -> #ter> | 320 -> man    | 360 -> up>    |
| 1 ->      | 41 -> p    | 81 -> #m    | 121 -> #1    | 161 -> of>   | 201 -> the    | 241 -> have>  | 281 -> fro   | 321 -> #un    | 361 -> #es,>  |
| 2 ->      | 42 -> q    | 82 -> #g>   | 122 -> a>    | 162 -> ha    | 202 -> #ai    | 242 -> is>    | 282 -> #ce>  | 322 -> been>  | 362 -> do     |
| 3 ->      | 43 -> r    | 83 -> #v    | 123 -> #_>   | 163 -> #on   | 203 -> #e,>   | 243 -> #us    | 283 -> #t,>  | 323 -> #ab    | 363 -> #sel   |
| 4 -> !    | 44 -> s    | 84 -> #—    | 124 -> #c>   | 164 -> #er>  | 204 -> #ol    | 244 -> #os    | 284 -> se    | 324 -> #ill>  | 364 -> #led>  |
| 5 -> (    | 45 -> t    | 85 -> #e>   | 125 -> #x>   | 165 -> to>   | 205 -> #gh    | 245 -> #et    | 285 -> #ion  | 325 -> kn     | 365 -> #own>  |
| 6 -> )    | 46 -> u    | 86 -> #l>   | 126 -> #6    | 166 -> #ea   | 206 -> sh     | 246 -> #ould> | 286 -> from> | 326 -> #ch    | 366 -> #itt   |
| 7 -> ,    | 47 -> v    | 87 -> #-    | 127 -> #4    | 167 -> #es   | 207 -> it>    | 247 -> an     | 287 -> #ain  | 327 -> un     | 367 -> who>   |
| 8 -> -    | 48 -> w    | 88 -> #,    | 128 -> #u>   | 168 -> hi    | 208 -> #ou>   | 248 -> #ear   | 288 -> #ul   | 328 -> #ag    | 368 -> or>    |
| 9 -> .    | 49 -> x    | 89 -> #”>   | 129 -> #:>   | 169 -> #ing> | 209 -> be     | 249 -> on     | 289 -> ex    | 329 -> #im    | 369 -> #ist   |
| 10 -> 0   | 50 -> y    | 90 -> #n>   | 130 -> #8    | 170 -> wa    | 210 -> #re    | 250 -> #ent   | 290 -> #t.>  | 330 -> ab     | 370 -> di     |
| 11 -> 1   | 51 -> z    | 91 -> #d>   | 131 -> #0    | 171 -> #en   | 211 -> wit    | 251 -> #ra    | 291 -> but>  | 331 -> #ked>  | 371 -> said>  |
| 12 -> 2   | 52 -> �    | 92 -> #k    | 132 -> #_    | 172 -> #it   | 212 -> #or>   | 252 -> be>    | 292 -> they> | 332 -> ne     | 372 -> app    |
| 13 -> 3   | 53 -> œ    | 93 -> #x    | 133 -> #—>   | 173 -> wh    | 213 -> #ac    | 253 -> up     | 293 -> #ay>  | 333 -> #ers   | 373 -> #ay    |
| 14 -> 4   | 54 -> —    | 94 -> #!>   | 134 -> #)>   | 174 -> #or   | 214 -> #es>   | 254 -> #is>   | 294 -> not>  | 334 -> en     | 374 -> #st    |
| 15 -> 5   | 55 -> ‘    | 95 -> #t>   | 135 -> #b>   | 175 -> #an   | 215 -> in     | 255 -> #ver   | 295 -> were> | 335 -> so>    | 375 -> #ir>   |
| 16 -> 6   | 56 -> ’    | 96 -> #s>   | 136 -> #q>   | 176 -> #at>  | 216 -> #al    | 256 -> com    | 296 -> #ore> | 336 -> #ame>  | 376 -> pro    |
| 17 -> 7   | 57 -> “    | 97 -> #f    | 137 -> #[    | 177 -> he>   | 217 -> #ti    | 257 -> #s,>   | 297 -> him>  | 337 -> #ak    | 377 -> #es.>  |
| 18 -> 8   | 58 -> ”    | 98 -> #;>   | 138 -> #]>   | 178 -> #ar   | 218 -> you>   | 258 -> you    | 298 -> li    | 338 -> when>  | 378 -> some>  |
| 19 -> 9   | 59 -> #a   | 99 -> #r>   | 139 -> #7    | 179 -> in>   | 219 -> #oo    | 259 -> #am    | 299 -> we>   | 339 -> one>   | 379 -> pl     |
| 20 -> :   | 60 -> #w   | 100 -> #’   | 140 -> #8>   | 180 -> #ic   | 220 -> #ly>   | 260 -> #id    | 300 -> #ff   | 340 -> #op    | 380 -> rea    |
| 21 -> ;   | 61 -> #.>  | 101 -> #.   | 141 -> #�    | 181 -> #at   | 221 -> with>  | 261 -> #ter   | 301 -> #oc   | 341 -> #res   | 381 -> tw     |
| 22 -> ?   | 62 -> #o   | 102 -> #’>  | 142 -> #;    | 182 -> his>  | 222 -> as>    | 262 -> #nd    | 302 -> him   | 342 -> #gh>   | 382 -> “i>    |
| 23 -> [   | 63 -> #b   | 103 -> #3   | 143 -> #“    | 183 -> #el   | 223 -> #e.>   | 263 -> ar     | 303 -> all>  | 343 -> are>   | 383 -> #our   |
| 24 -> ]   | 64 -> #e   | 104 -> #f>  | 144 -> #3>   | 184 -> #on>  | 224 -> #as    | 264 -> #pp    | 304 -> #ow>  | 344 -> #ight> | 384 -> #ess>  |
| 25 -> _   | 65 -> #r   | 105 -> #z   | 145 -> #9    | 185 -> was>  | 225 -> #ut>   | 265 -> #igh   | 305 -> #ion> | 345 -> an>    | 385 -> “w     |
| 26 -> a   | 66 -> #y>  | 106 -> #k>  | 146 -> #5>   | 186 -> #en>  | 226 -> whic   | 266 -> on>    | 306 -> #ent> | 346 -> int    | 386 -> #ts>   |
| 27 -> b   | 67 -> #u   | 107 -> #?   | 147 -> #:    | 187 -> that> | 227 -> al     | 267 -> sai    | 307 -> som   | 347 -> #ther> | 387 -> #er,>  |
| 28 -> c   | 68 -> #p   | 108 -> #p>  | 148 -> i>    | 188 -> #om   | 228 -> at>    | 268 -> ch     | 308 -> #et>  | 348 -> cl     | 388 -> #ong>  |
| 29 -> d   | 69 -> #s   | 109 -> #q   | 149 -> #0>   | 189 -> #ro   | 229 -> #ld>   | 269 -> #ir    | 309 -> by>   | 349 -> “th    | 389 -> #ap    |
| 30 -> e   | 70 -> #i   | 110 -> #j   | 150 -> #7>   | 190 -> #is   | 230 -> which> | 270 -> #uc    | 310 -> her>  | 350 -> hol    | 390 -> #out>  |
| 31 -> f   | 71 -> #t   | 111 -> #i>  | 151 -> 2>    | 191 -> st    | 231 -> #ec    | 271 -> there> | 311 -> no>   | 351 -> #ang   | 391 -> into>  |
| 32 -> g   | 72 -> #n   | 112 -> #?>  | 152 -> #)    | 192 -> #ed   | 232 -> #an>   | 272 -> upon>  | 312 -> #ed.> | 352 -> #ad    | 392 -> ac     |
| 33 -> h   | 73 -> #,>  | 113 -> #w>  | 153 -> th    | 193 -> #th   | 233 -> no     | 273 -> #s.>   | 313 -> #rea  | 353 -> #ep    | 393 -> could> |
| 34 -> i   | 74 -> #h   | 114 -> #m>  | 154 -> the>  | 194 -> #ve>  | 234 -> #,”>   | 274 -> #y,>   | 314 -> #?”>  | 354 -> #rou   | 394 -> #ef    |
| 35 -> j   | 75 -> #c   | 115 -> #!   | 155 -> #er   | 195 -> #il   | 235 -> for>   | 275 -> con    | 315 -> #y.>  | 355 -> sp     | 395 -> holm   |
| 36 -> k   | 76 -> #l   | 116 -> #‘   | 156 -> #in   | 196 -> #ur   | 236 -> #le>   | 276 -> #ver>  | 316 -> me>   | 356 -> #al>   | 396 -> fac    |
| 37 -> l   | 77 -> #y   | 117 -> #o>  | 157 -> #ed>  | 197 -> had>  | 237 -> my>    | 277 -> #’s>   | 317 -> #d,>  | 357 -> ou     | 397 -> wor    |
| 38 -> m   | 78 -> #g   | 118 -> #a>  | 158 -> #nd>  | 198 -> #ow   | 238 -> #.”>   | 278 -> this>  | 318 -> as    | 358 -> #um    | 398 -> what>  |
| 39 -> n   | 79 -> #h>  | 119 -> #œ   | 159 -> #ou   | 199 -> #ere> | 239 -> #em    | 279 -> #ed,>  | 319 -> #ted> | 359 -> would> | 399 -> #se>   |

### Using a pre-trained BPE tokenizer (GPT-2)

In [17]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text_example="A mouse called Petar sits on the legendary throne in the ivory tower."
print(tokenizer(text_example)["input_ids"])
print([tokenizer.decode(token) for token in tokenizer(text_example)["input_ids"]])
print(tokenizer.tokenize(text_example))
print(tokenizer.decode(tokenizer(text_example)["input_ids"]))
print()

text_example="Auf dem legendären Thron im Elfenbeinturm sitzt eine Maus namens Petar."
print(tokenizer(text_example)["input_ids"])
print([tokenizer.decode(token) for token in tokenizer(text_example)["input_ids"]])
print(tokenizer.tokenize(text_example))
print(tokenizer.decode(tokenizer(text_example)["input_ids"]))

[32, 10211, 1444, 4767, 283, 10718, 319, 262, 13273, 19262, 287, 262, 32630, 10580, 13]
['A', ' mouse', ' called', ' Pet', 'ar', ' sits', ' on', ' the', ' legendary', ' throne', ' in', ' the', ' ivory', ' tower', '.']
['A', 'Ġmouse', 'Ġcalled', 'ĠPet', 'ar', 'Ġsits', 'Ġon', 'Ġthe', 'Ġlegendary', 'Ġthrone', 'Ġin', 'Ġthe', 'Ġivory', 'Ġtower', '.']
A mouse called Petar sits on the legendary throne in the ivory tower.

[32, 3046, 1357, 8177, 11033, 918, 536, 1313, 545, 19067, 268, 1350, 600, 333, 76, 1650, 89, 83, 304, 500, 6669, 385, 299, 321, 641, 4767, 283, 13]
['A', 'uf', ' dem', ' legend', 'ä', 'ren', ' Th', 'ron', ' im', ' Elf', 'en', 'be', 'int', 'ur', 'm', ' sit', 'z', 't', ' e', 'ine', ' Ma', 'us', ' n', 'am', 'ens', ' Pet', 'ar', '.']
['A', 'uf', 'Ġdem', 'Ġlegend', 'Ã¤', 'ren', 'ĠTh', 'ron', 'Ġim', 'ĠElf', 'en', 'be', 'int', 'ur', 'm', 'Ġsit', 'z', 't', 'Ġe', 'ine', 'ĠMa', 'us', 'Ġn', 'am', 'ens', 'ĠPet', 'ar', '.']
Auf dem legendären Thron im Elfenbeinturm sitzt eine Maus namens

## WordPiece Encoding

### Step-by-step WordPiece Implementation (not efficient)

In [73]:
from collections import defaultdict
import nltk

text = "This course is about this topic"
word_list = [token.lower() for token in nltk.word_tokenize(text) if token.isalpha()]
word_list[:10]

['this', 'course', 'is', 'about', 'this', 'topic']

In [74]:
from collections import Counter

words = { word: (freq, [word[0]] + ['##'+c for c in word[1:]]) for word, freq in Counter(word_list).items()}
list(words.items())[0:10]

[('this', (2, ['t', '##h', '##i', '##s'])),
 ('course', (1, ['c', '##o', '##u', '##r', '##s', '##e'])),
 ('is', (1, ['i', '##s'])),
 ('about', (1, ['a', '##b', '##o', '##u', '##t'])),
 ('topic', (1, ['t', '##o', '##p', '##i', '##c']))]

In [75]:
vocab = set()
for (freq, parts) in words.values():
    vocab = vocab | set(parts)
vocabulary = sorted(list(vocab))
print(vocabulary)

['##b', '##c', '##e', '##h', '##i', '##o', '##p', '##r', '##s', '##t', '##u', 'a', 'c', 'i', 't']


In [77]:
def new_pair_freqs(words):
    part_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for token, (freq, parts) in words.items():
        for p in parts:
            part_freqs[p] += freq
        for pair in zip(parts[:-1], parts[1:]):
            pair_freqs[pair] += freq
    return part_freqs, pair_freqs

def best_pair_wordpiece(part_freqs, pair_freqs):
    for pair, freq in pair_freqs.items():
        pair_freqs[pair] = freq / part_freqs[pair[0]] / part_freqs[pair[1]]
    return max(pair_freqs, key=pair_freqs.get)

part_freqs, pair_freqs = new_pair_freqs(words)
print(sorted(pair_freqs.items(), key=lambda x: x[1], reverse=True))
best_pair = best_pair_wordpiece(part_freqs, pair_freqs)
print(sorted(pair_freqs.items(), key=lambda x: x[1], reverse=True))
best_pair

[(('t', '##h'), 2), (('##h', '##i'), 2), (('##i', '##s'), 2), (('##o', '##u'), 2), (('c', '##o'), 1), (('##u', '##r'), 1), (('##r', '##s'), 1), (('##s', '##e'), 1), (('i', '##s'), 1), (('a', '##b'), 1), (('##b', '##o'), 1), (('##u', '##t'), 1), (('t', '##o'), 1), (('##o', '##p'), 1), (('##p', '##i'), 1), (('##i', '##c'), 1)]
[(('a', '##b'), 1.0), (('##u', '##r'), 0.5), (('##u', '##t'), 0.5), (('t', '##h'), 0.3333333333333333), (('##h', '##i'), 0.3333333333333333), (('c', '##o'), 0.3333333333333333), (('##o', '##u'), 0.3333333333333333), (('##b', '##o'), 0.3333333333333333), (('##o', '##p'), 0.3333333333333333), (('##p', '##i'), 0.3333333333333333), (('##i', '##c'), 0.3333333333333333), (('##r', '##s'), 0.25), (('##s', '##e'), 0.25), (('i', '##s'), 0.25), (('##i', '##s'), 0.16666666666666666), (('t', '##o'), 0.1111111111111111)]


('a', '##b')

In [67]:
def merge_pair(pair, words):
    merged = [pair[0] + pair[1][2:]]
    for token, (freq, parts) in words.items():
        i = 0
        while i < len(parts) - 1:
            if parts[i] == pair[0] and parts[i+1] == pair[1]:
                parts[i:i+2] = merged
            else:
                i += 1

vocabulary.append(best_pair[0] + best_pair[1][2:])
merge_pair(best_pair, words)
words['about']

(1, ['ab', '##o', '##u', '##t'])

In [70]:
vocabulary_size = 25

while len(vocabulary) < vocabulary_size:
    part_freqs, pair_freqs = new_pair_freqs(words)
    best_pair = best_pair_wordpiece(part_freqs, pair_freqs)
    vocabulary.append(best_pair[0] + best_pair[1][2:])
    merge_pair(best_pair, words)

vocabulary.sort()
for i in range(0, len(vocabulary), 30):
    print(' '.join(vocabulary[i:i+30]))

##b ##c ##e ##h ##i ##o ##p ##pi ##pic ##r ##s ##t ##u ##ur ##ut a ab abo c co cour i t th thi


In [71]:
words

{'this': (2, ['thi', '##s']),
 'course': (1, ['cour', '##s', '##e']),
 'is': (1, ['i', '##s']),
 'about': (1, ['abo', '##ut']),
 'topic': (1, ['t', '##o', '##pic'])}

### WordPiece Trainer from transformers library

In [79]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer
from utils.gutenberg import get_book

def batch_loader():
    yield get_book(244).page_content

# 1) define special tokens and create tokenizer object
unknown_token = "[UNK]"
special_tokens = [unknown_token, "[SEP]", "[MASK]", "[CLS]"]
tokenizer = Tokenizer(models.WordPiece(unk_token = unknown_token))

# 2) setup the trainer for WordPiece tokenization
trainer = trainers.WordPieceTrainer(
    vocab_size=5000,  
    min_frequency=3, 
    special_tokens = special_tokens, 
    continuing_subword_prefix='#', 
    end_of_word_suffix='>'
)

# 3) define how to split the text and normalize words
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

# 4) train the tokens from an iterator (we just load one book)
tokenizer.train_from_iterator(batch_loader(), trainer=trainer)

# 5) Build a post-processor templates for classification (example)
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ])   
tokenizer.decoder = decoders.ByteLevel()

# 6) encode a test text
display(Markdown(" ".join(tokenizer.encode(text_en).tokens)))
print(tokenizer.encode(text_en).tokens)

# 7) print vocabulary
print_vocabulary(tokenizer)


[CLS] th #is wa #s a lo #f #t #y cha #mb #er , lin #ed an #d litt #er #ed with count #l #ess bott #l #es . broad , low ta #bl #es we #re sc #att #er #ed ab #ou #t , whi #ch bri #st #l #ed with ret #ort #s , te #st - tu #b #es , an #d litt #l #e bun #s #en la #mp #s , with the #ir bl #u #e fl #ick #er #ing fla #m #es . ther #e wa #s on #l #y on #e stud #ent in the roo #m , who wa #s ben #din #g over a dist #ant ta #bl #e absor #b #ed in hi #s wor #k . at the sou #n #d o #f our ste #p #s he gla #n #ce #d rou #n #d an #d spr #ang to hi #s fee #t with a cr #y o #f pleas #ur #e . “ i ’ v #e fo #un #d it ! i ’ v #e fo #un #d it , ” he sho #ut #ed to my compan #ion , ru #n #ni #n #g tow #ar #ds us with a te #st - tu #b #e in hi #s hand . “ i ha #v #e fo #un #d a re - ag #ent whi #ch is precipit #at #ed by h #æ #mo #g #lo #b #in , an #d by not #h #ing el #s #e . ” ha #d he discov #er #ed a gol #d min #e , grea #ter del #ight cou #l #d not ha #v #e sho #n #e up #on hi #s fea #t #ur #es . [SEP]

['[CLS]', 'th', '#is', 'wa', '#s', 'a', 'lo', '#f', '#t', '#y', 'cha', '#mb', '#er', ',', 'lin', '#ed', 'an', '#d', 'litt', '#er', '#ed', 'with', 'count', '#l', '#ess', 'bott', '#l', '#es', '.', 'broad', ',', 'low', 'ta', '#bl', '#es', 'we', '#re', 'sc', '#att', '#er', '#ed', 'ab', '#ou', '#t', ',', 'whi', '#ch', 'bri', '#st', '#l', '#ed', 'with', 'ret', '#ort', '#s', ',', 'te', '#st', '-', 'tu', '#b', '#es', ',', 'an', '#d', 'litt', '#l', '#e', 'bun', '#s', '#en', 'la', '#mp', '#s', ',', 'with', 'the', '#ir', 'bl', '#u', '#e', 'fl', '#ick', '#er', '#ing', 'fla', '#m', '#es', '.', 'ther', '#e', 'wa', '#s', 'on', '#l', '#y', 'on', '#e', 'stud', '#ent', 'in', 'the', 'roo', '#m', ',', 'who', 'wa', '#s', 'ben', '#din', '#g', 'over', 'a', 'dist', '#ant', 'ta', '#bl', '#e', 'absor', '#b', '#ed', 'in', 'hi', '#s', 'wor', '#k', '.', 'at', 'the', 'sou', '#n', '#d', 'o', '#f', 'our', 'ste', '#p', '#s', 'he', 'gla', '#n', '#ce', '#d', 'rou', '#n', '#d', 'an', '#d', 'spr', '#ang', 'to', 'hi', '#s'

| 0 -> 40   | 40 -> 80   | 80 -> 120   | 120 -> 160   | 160 -> 200   | 200 -> 240   | 240 -> 280    | 280 -> 320    | 320 -> 360    | 360 -> 400     |
|-----------|------------|-------------|--------------|--------------|--------------|---------------|---------------|---------------|----------------|
| 0 ->      | 40 -> o    | 80 -> #w    | 120 -> #2>   | 160 -> 3>    | 200 -> #es>  | 240 -> #ich>  | 280 -> up     | 320 -> #ery>  | 360 -> ev      |
| 1 ->      | 41 -> p    | 81 -> #t>   | 121 -> )>    | 161 -> :>    | 201 -> #el   | 241 -> which> | 281 -> #ow>   | 321 -> int    | 361 -> would>  |
| 2 ->      | 42 -> q    | 82 -> #d    | 122 -> #5>   | 162 -> ->    | 202 -> #at   | 242 -> for>   | 282 -> #id    | 322 -> som    | 362 -> #as     |
| 3 ->      | 43 -> r    | 83 -> #r>   | 123 -> #6    | 163 -> #7>   | 203 -> it>   | 243 -> #ol    | 283 -> li     | 323 -> #ff    | 363 -> #mes>   |
| 4 -> !    | 44 -> s    | 84 -> #u    | 124 -> #4    | 164 -> _>    | 204 -> #om   | 244 -> my>    | 284 -> #al    | 324 -> #ble>  | 364 -> cl      |
| 5 -> (    | 45 -> t    | 85 -> #n>   | 125 -> 2>    | 165 -> j>    | 205 -> #ve>  | 245 -> #ra    | 285 -> upon>  | 325 -> #rea   | 365 -> #ong>   |
| 6 -> )    | 46 -> u    | 86 -> 9>    | 126 -> #3>   | 166 -> #4>   | 206 -> #ere> | 246 -> #st>   | 286 -> con    | 326 -> #ds>   | 366 -> ab      |
| 7 -> ,    | 47 -> v    | 87 -> #m>   | 127 -> ]>    | 167 -> th    | 207 -> #ro   | 247 -> #ec    | 287 -> #ore>  | 327 -> #un    | 367 -> out>    |
| 8 -> -    | 48 -> w    | 88 -> #f    | 128 -> #u>   | 168 -> the>  | 208 -> st    | 248 -> is>    | 288 -> we>    | 328 -> been>  | 368 -> #us     |
| 9 -> .    | 49 -> x    | 89 -> #b    | 129 -> [>    | 169 -> #ed>  | 209 -> #ri   | 249 -> have>  | 289 -> #uc    | 329 -> en     | 369 -> who>    |
| 10 -> 0   | 50 -> y    | 90 -> #j    | 130 -> ’>    | 170 -> #in   | 210 -> #ly>  | 250 -> #em    | 290 -> but>   | 330 -> #ke>   | 370 -> sp      |
| 11 -> 1   | 51 -> z    | 91 -> #h>   | 131 -> a>    | 171 -> #er   | 211 -> #ai   | 251 -> #ould> | 291 -> #ted>  | 331 -> #al>   | 371 -> #ch     |
| 12 -> 2   | 52 -> �    | 92 -> #g>   | 132 -> t>    | 172 -> #nd>  | 212 -> had>  | 252 -> him>   | 292 -> all>   | 332 -> #ge>   | 372 -> to      |
| 13 -> 3   | 53 -> œ    | 93 -> #g    | 133 -> —>    | 173 -> #er>  | 213 -> #ll>  | 253 -> #ver>  | 293 -> fro    | 333 -> them>  | 373 -> #ess    |
| 14 -> 4   | 54 -> —    | 94 -> #k>   | 134 -> 4>    | 174 -> #ou   | 214 -> yo    | 254 -> there> | 294 -> un     | 334 -> kn     | 374 -> di      |
| 15 -> 5   | 55 -> ‘    | 95 -> #y    | 135 -> #9>   | 175 -> and>  | 215 -> #le>  | 255 -> #ght>  | 295 -> #ked>  | 335 -> when>  | 375 -> sa      |
| 16 -> 6   | 56 -> ’    | 96 -> #q    | 136 -> !>    | 176 -> of>   | 216 -> you>  | 256 -> be>    | 296 -> from>  | 336 -> #ill>  | 376 -> #sel    |
| 17 -> 7   | 57 -> “    | 97 -> #x    | 137 -> #v>   | 177 -> ha    | 217 -> #ur   | 257 -> #me>   | 297 -> #ir    | 337 -> #our>  | 377 -> #one>   |
| 18 -> 8   | 58 -> ”    | 98 -> #z    | 138 -> ;>    | 178 -> #ing> | 218 -> #or>  | 258 -> an     | 298 -> they>  | 338 -> ne     | 378 -> #ain>   |
| 19 -> 9   | 59 -> #e   | 99 -> #a>   | 139 -> #8    | 179 -> to>   | 219 -> the   | 259 -> ca     | 299 -> ex     | 339 -> #ough> | 379 -> or>     |
| 20 -> :   | 60 -> #a   | 100 -> #c>  | 140 -> #0>   | 180 -> hi    | 220 -> #is   | 260 -> #ter>  | 300 -> not>   | 340 -> are>   | 380 -> #ell>   |
| 21 -> ;   | 61 -> #d>  | 101 -> ?>   | 141 -> o>    | 181 -> #ea   | 221 -> sh    | 261 -> #ear   | 301 -> no>    | 341 -> up>    | 381 -> #ep     |
| 22 -> ?   | 62 -> #l   | 102 -> #o>  | 142 -> ‘>    | 182 -> wh    | 222 -> #an>  | 262 -> #ti    | 302 -> #la    | 342 -> #ir>   | 382 -> #ing    |
| 23 -> [   | 63 -> #p>  | 103 -> #w>  | 143 -> #7    | 183 -> #on>  | 223 -> #ld>  | 263 -> on     | 303 -> her>   | 343 -> al     | 383 -> #itt    |
| 24 -> ]   | 64 -> #i   | 104 -> #i>  | 144 -> #8>   | 184 -> wa    | 224 -> be    | 264 -> #is>   | 304 -> se     | 344 -> la     | 384 -> #king>  |
| 25 -> _   | 65 -> #s   | 105 -> 5>   | 145 -> 6>    | 185 -> #it   | 225 -> #re   | 265 -> #ent>  | 305 -> #il    | 345 -> #de>   | 385 -> #um     |
| 26 -> a   | 66 -> #y>  | 106 -> s>   | 146 -> h>    | 186 -> #at>  | 226 -> wit   | 266 -> on>    | 306 -> #ered> | 346 -> #ul    | 386 -> #ood>   |
| 27 -> b   | 67 -> #s>  | 107 -> ,>   | 147 -> e>    | 187 -> #or   | 227 -> #ion> | 267 -> #st    | 307 -> were>  | 347 -> #ent   | 387 -> hol     |
| 28 -> c   | 68 -> #k   | 108 -> #x>  | 148 -> #�    | 188 -> he>   | 228 -> #ce>  | 268 -> com    | 308 -> #son>  | 348 -> as     | 388 -> #out>   |
| 29 -> d   | 69 -> #t   | 109 -> #f>  | 149 -> c>    | 189 -> #en   | 229 -> #gh   | 269 -> #et>   | 309 -> #ts>   | 349 -> an>    | 389 -> #tion>  |
| 30 -> e   | 70 -> #c   | 110 -> “>   | 150 -> #q>   | 190 -> #on   | 230 -> #ow   | 270 -> #ay>   | 310 -> #oun   | 350 -> fa     | 390 -> ma      |
| 31 -> f   | 71 -> #h   | 111 -> #2   | 151 -> #6>   | 191 -> #es   | 231 -> in    | 271 -> ar     | 311 -> #ther> | 351 -> #led>  | 391 -> #ed     |
| 32 -> g   | 72 -> #n   | 112 -> #1   | 152 -> .>    | 192 -> in>   | 232 -> #oo   | 272 -> #aid>  | 312 -> #ess>  | 352 -> #ound> | 392 -> app     |
| 33 -> h   | 73 -> #o   | 113 -> #b>  | 153 -> (>    | 193 -> #en>  | 233 -> #ut>  | 273 -> #ac    | 313 -> one>   | 353 -> what>  | 393 -> pro     |
| 34 -> i   | 74 -> #l>  | 114 -> i>   | 154 -> 7>    | 194 -> #ar   | 234 -> no    | 274 -> this>  | 314 -> by>    | 354 -> #ain   | 394 -> rea     |
| 35 -> j   | 75 -> #m   | 115 -> l>   | 155 -> d>    | 195 -> #ic   | 235 -> #th   | 275 -> #et    | 315 -> so>    | 355 -> #th>   | 395 -> tw      |
| 36 -> k   | 76 -> #p   | 116 -> 8>   | 156 -> v>    | 196 -> that> | 236 -> with> | 276 -> #pp    | 316 -> #oc    | 356 -> #os    | 396 -> if>     |
| 37 -> l   | 77 -> #r   | 117 -> u>   | 157 -> 1>    | 197 -> #an   | 237 -> as>   | 277 -> me>    | 317 -> #gh>   | 357 -> #ers>  | 397 -> #ation> |
| 38 -> m   | 78 -> #v   | 118 -> m>   | 158 -> ”>    | 198 -> his>  | 238 -> #se>  | 278 -> ch     | 318 -> man>   | 358 -> #own>  | 398 -> some>   |
| 39 -> n   | 79 -> #e>  | 119 -> #œ   | 159 -> #1>   | 199 -> was>  | 239 -> at>   | 279 -> said>  | 319 -> #ight> | 359 -> #op    | 399 -> #thing> |

### Pre-defined WordPiece Tokenizer

In [21]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_example="A mouse called Petar sits on the legendary throne in the ivory tower."
print(tokenizer(text_example)["input_ids"])
print([tokenizer.decode(token) for token in tokenizer(text_example)["input_ids"]])
print(tokenizer.tokenize(text_example))
print(tokenizer.decode(tokenizer(text_example)["input_ids"]))
print()

text_example="Auf dem legendären Thron im Elfenbeinturm sitzt eine Maus namens Petar."
print(tokenizer(text_example)["input_ids"])
print([tokenizer.decode(token) for token in tokenizer(text_example)["input_ids"]])
print(tokenizer.tokenize(text_example))
print(tokenizer.decode(tokenizer(text_example)["input_ids"]))

[101, 1037, 8000, 2170, 9004, 2906, 7719, 2006, 1996, 8987, 6106, 1999, 1996, 11554, 3578, 1012, 102]
['[ C L S ]', 'a', 'm o u s e', 'c a l l e d', 'p e t', '# # a r', 's i t s', 'o n', 't h e', 'l e g e n d a r y', 't h r o n e', 'i n', 't h e', 'i v o r y', 't o w e r', '.', '[ S E P ]']
['a', 'mouse', 'called', 'pet', '##ar', 'sits', 'on', 'the', 'legendary', 'throne', 'in', 'the', 'ivory', 'tower', '.']
[CLS] a mouse called petar sits on the legendary throne in the ivory tower. [SEP]

[101, 21200, 17183, 5722, 12069, 2078, 16215, 4948, 10047, 17163, 2368, 19205, 3372, 3126, 2213, 4133, 2480, 2102, 27665, 5003, 2271, 2171, 3619, 9004, 2906, 1012, 102]
['[ C L S ]', 'a u f', 'd e m', 'l e g e n d', '# # a r e', '# # n', 't h', '# # r o n', 'i m', 'e l f', '# # e n', '# # b e i', '# # n t', '# # u r', '# # m', 's i t', '# # z', '# # t', 'e i n e', 'm a', '# # u s', 'n a m e', '# # n s', 'p e t', '# # a r', '.', '[ S E P ]']
['auf', 'dem', 'legend', '##are', '##n', 'th', '##ron', 'im'