Run the [setup notebook](./00-setup.ipynb) first.

# Getting the text examples

We take short excerpts from several Project Gutenberg books in different languages, plus one sentence packed with abbreviations, numbers, and punctuation. The variety demonstrates some of the challenges of tokenization.

In [1]:
text = """
I buy my parents' 10% of U.K. startup for $1.4 billion. Dr. Watson's cat called Mrs. Hersley and it was w.r.o.n.g., more to come ...
""".strip()

# Project Gutenberg, 244: A Study in Scarlet (en), Arthur Conan Doyle
text_en = """
This was a lofty chamber, lined and littered with countless bottles.
Broad, low tables were scattered about, which bristled with retorts,
test-tubes, and little Bunsen lamps, with their blue flickering flames.
There was only one student in the room, who was bending over a distant
table absorbed in his work. At the sound of our steps he glanced round
and sprang to his feet with a cry of pleasure. “I’ve found it! I’ve
found it,” he shouted to my companion, running towards us with a
test-tube in his hand. “I have found a re-agent which is precipitated
by hæmoglobin, and by nothing else.” Had he discovered a gold mine,
greater delight could not have shone upon his features.
""".strip()

# Project Gutenberg, 34811: Buddenbrooks: Verfall einer Familie (de), Thomas Mann
text_de = """
»Ich rechne«, sagte der Konsul trocken. Die Kerze flammte auf, und man
sah, wie er gerade aufgerichtet und mit Augen, so kalt und aufmerksam,
wie sie während des ganzen Nachmittags noch nicht darein geschaut
hatten, fest in die tanzende Flamme blickte. -- »Einerseits: Sie geben
33335 an Gotthold und 15000 an die in Frankfurt, und das macht 48335 in
Summa. Andererseits: Sie geben nur 25000 an die in Frankfurt, und das
bedeutet für die Firma einen Gewinn von 23335. Das ist aber nicht alles.
Gesetzt, Sie leisten an Gotthold eine Entschädigungssumme für den Anteil
am Hause, so ist das Prinzip durchbrochen, so ist er damals =nicht=
endgültig abgefunden worden, so kann er nach Ihrem Tode ein gleich
großes Erbe beanspruchen, wie meine Schwester und ich, und dann handelt
es sich für die Firma um einen Verlust von Hunderttausenden, mit dem sie
nicht rechnen kann, mit dem ich als künftiger alleiniger Inhaber nicht
rechnen kann ... Nein, Papa!« beschloß er mit einer energischen
Handbewegung und richtete sich noch höher auf. »Ich muß Ihnen abraten,
nachzugeben!«
""".strip()

# Project Gutenberg, 13951: Les trois mousquetaires (fr), Alexandre Dumas
text_fr = """
D’Artagnan, tout en marchant et en monologuant, était arrivé à quelques
pas de l’hôtel d’Aiguillon, et devant cet hôtel il avait aperçu Aramis
causant gaiement avec trois gentilshommes des gardes du roi. De son
côté, Aramis aperçut d’Artagnan; mais comme il n’oubliait point que
c’était devant ce jeune homme que M. de Tréville s’était si fort
emporté le matin, et qu’un témoin des reproches que les mousquetaires
avaient reçus ne lui était d’aucune façon agréable, il fit semblant de
ne pas le voir. D’Artagnan, tout entier au contraire à ses plans de
conciliation et de courtoisie, s’approcha des quatre jeunes gens en
leur faisant un grand salut accompagné du plus gracieux sourire. Aramis
inclina légèrement la tête, mais ne sourit point. Tous quatre, au
reste, interrompirent à l’instant même leur conversation.
""".strip()

# Project Gutenberg, 27729: Bajki (pl), Adam Mickiewicz
text_pl = """
Powolny bóg wszechżabstwu na króla użycza
Małego jako Łokiet Kija Kijowicza.
Spadł Kij i pluskiem wszemu obwieścił się błotu.
Struchlały żaby na ten majestat łoskotu.
Milczą, dzień i noc, ledwie śmiejąc dychać,
Nazajutrz jedna drugiej pytają: „Co słychać?
Czy niema co od króla?” Aż śmielsze i starsze
Ruszają przed oblicze stawić się monarsze.
Zrazu zdala, w bojaźni, by się nie narazić;
Potem, przemógłszy te strachy,
Brat za brat z królem biorą się pod pachy
I zaczynają na kark mu włazić.
„Toż to taki ma być król?... Najjaśniejszy Bela,
Nie wiele z niego będziem mieć wesela;
Król, co po karku bezkarnie go gładzim,
Niechaj nam abdykuje zaraz, niedołęga!
Potrzebna nam jest władza, ale władza tęga!”
""".strip()

# Project Gutenberg, 23585: 佛說四十二章經 (zh)
text_zh = """
沙門夜誦迦葉佛遺教經，其聲悲緊，思悔欲退。佛問之曰：汝昔在家，曾為何業？對
曰：愛彈琴。佛言：弦緩如何？對曰：不鳴矣！弦急如何？對曰：聲絕矣！急緩得中
如何？對曰：諸音普矣！佛言：沙門學道亦然，心若調適，道可得矣。於道若暴，暴
即身疲。其身若疲，意即生惱。意若生惱，行即退矣。其行既退，罪必加矣。但清淨
安樂，道不失矣。
""".strip()

texts = {
    'abbreviations': text,
    'english': text_en,
    'german': text_de,
    'french': text_fr,
    'polish': text_pl,
    'chinese': text_zh
}
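
To make the challenge concrete before any library is involved, here is a minimal sketch that splits two lines copied from the excerpts above on whitespace only. The names `sample_en`, `sample_zh`, and `whitespace_tokens` are illustrative and are not used elsewhere in this notebook.

```python
# Illustrative only: naive whitespace splitting on two lines from the excerpts above.
sample_en = "There was only one student in the room, who was bending over a distant"
sample_zh = "沙門夜誦迦葉佛遺教經，其聲悲緊，思悔欲退。"

def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace; punctuation stays attached to the words."""
    return text.split()

print(len(whitespace_tokens(sample_en)), whitespace_tokens(sample_en)[:8])
print(len(whitespace_tokens(sample_zh)), whitespace_tokens(sample_zh))
```

The English line splits into words with punctuation still attached (`room,`), while the Chinese line, which contains no spaces, comes back as a single token.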

# Word-based tokenization

In [2]:
import re, nltk, jieba, spacy

spacy.prefer_gpu()
nlp = {
    'english': spacy.load('en_core_web_sm'),
    'french': spacy.load('fr_core_news_sm'),
    'german': spacy.load('de_core_news_sm'),
    'polish': spacy.load('pl_core_news_sm'),
    'chinese': spacy.load('zh_core_web_sm'),
}

def jieba_word(text: str, language: str = 'english') -> list[str]:
    # jieba.tokenize yields (word, start, end) tuples; keep the word, drop single-space tokens
    return [t[0] for t in jieba.tokenize(text) if t[0] != ' ']

def python_word(text: str, language: str = 'english') -> list[str]:
    # every run of characters other than word characters and hyphens becomes a separator
    text = re.sub(r'[^\w\-]+', ' ', text)
    return [token for token in text.split(' ') if token]

def nltk_word(text: str, language: str = 'english') -> list[str]:
    # Chinese falls back to the English model
    return nltk.word_tokenize(text, 'english' if language == 'chinese' else language)

def spacy_word(text: str, language: str = 'english') -> list[str]:
    # nlp is only read here, so no global declaration is needed
    return [token.text for token in nlp[language](text)]
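
The regular-expression approach in `python_word` is easy to reason about but blunt. The sketch below repeats the same substitution on fragments of the sample texts, so the effect is visible without loading any models. `regex_words` is a stand-in name used only for this illustration.

```python
import re

def regex_words(text: str) -> list[str]:
    """Same idea as python_word: anything except word characters and hyphens separates tokens."""
    text = re.sub(r'[^\w\-]+', ' ', text)
    return [token for token in text.split(' ') if token]

print(regex_words("U.K. startup for $1.4 billion"))
print(regex_words("a test-tube in his hand"))
```

The periods in `U.K.` and `1.4` act as separators, so both are broken apart, while the hyphen in `test-tube` survives.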


In [7]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output
from tabulate import tabulate
from itertools import zip_longest
from unidecode import unidecode

def show_tokens(language: str, alpha: bool, single: bool, decode: bool, lower: bool):
    tokens = []
    for tokenizer in tokenizers.values():
        t = tokenizer(texts[language], "english" if language == "abbreviations" else language)
        if alpha:
            t = [token for token in t if re.match(r'^[a-zA-Z][\w\-\.]*$', token)]
        if decode:
            t = [unidecode(token) for token in t]
        if single:
            t = [token for token in t if len(token) > 1]
        if lower:
            # lowercase only; dropping single-letter tokens is the job of the `single` option
            t = [token.lower() for token in t]
        tokens.append(t)
    with (out_text := widgets.Output()):
        display(Markdown(texts[language]))
    with (out_tokens := widgets.Output(layout = {'padding': '0px 50px', 'min_width': '60%'})):
        if language == 'chinese':
            for i in range(len(tokens)):
                nl = '\n  '
                print(f'{list(tokenizers.keys())[i]}:\n  {nl.join(tokens[i][:10])}\n\n')
        else:
            headers = tokenizers.keys()
            rows = [[col[:16] for col in row] for row in zip_longest(*tokens, fillvalue='')]
            display(Markdown(tabulate(rows[:40], headers, tablefmt="pipe")))
    with out_result:
        clear_output()
        display(widgets.HBox([out_text, out_tokens]))

opt_alpha = widgets.Checkbox(description='only words')
opt_single = widgets.Checkbox(description='no single letter words')
opt_decode = widgets.Checkbox(description='unicode decode')
opt_lower = widgets.Checkbox(description='lowercase')
opt_language = widgets.Dropdown(description='language', options=['abbreviations', 'english', 'french', 'german', 'polish', 'chinese'])
opt_method = widgets.Dropdown(description='method', options=[
    ('python (regex word split)', 'python-word'),
    ('nltk (word)', 'nltk-word'),
    ('spaCy (word)', 'spacy-word'),
    ('jieba (chinese)', 'jieba-word'),
])
tokenizers = {
    'python-word': python_word,
    'nltk-word': nltk_word,
    'spacy-word': spacy_word,
    'jieba-word': jieba_word,
}

out_result = widgets.Output()
display(widgets.interactive(show_tokens, language=opt_language, alpha=opt_alpha, single=opt_single, decode=opt_decode, lower=opt_lower))
display(out_result)
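
One detail of the 'only words' filter is worth seeing in isolation: its pattern requires an ASCII letter in first position, and the filter runs before the 'unicode decode' step. The sketch below applies the same pattern to a few tokens taken from the samples. `WORD_PATTERN` and `is_word` are hypothetical names for this illustration, not part of the notebook.

```python
import re

# Same pattern as the 'only words' filter in show_tokens.
WORD_PATTERN = re.compile(r'^[a-zA-Z][\w\-\.]*$')

def is_word(token: str) -> bool:
    """True if the token starts with an ASCII letter followed by word characters, hyphens, or dots."""
    return WORD_PATTERN.match(token) is not None

for token in ['test-tube', 'U.K.', 'hæmoglobin', 'était', 'Łokiet', '10', '沙門']:
    print(f'{token!r}: {is_word(token)}')
```

Tokens such as `était` or `Łokiet` are rejected even though they are ordinary words, so with this option checked the French, Polish, and Chinese columns lose tokens that the English column would keep.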


## Sub-word tokenization

In [8]:
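def sub_words_tokenize(text: str, k: int, mark: str = '#') -> list[str]:
    """Split each word into overlapping character k-grams; `mark` prefixes the first k-gram of a word."""
    text = re.sub(r'\W+', ' ', text)  # every run of non-word characters becomes a separator
    tokens = []
    for token in text.split():
        if len(token) <= k:
            # words no longer than k are kept whole and unmarked
            tokens.append(token)
            continue
        for i in range(len(token) - k + 1):
            ngram = token[i:i + k]
            tokens.append(mark + ngram if i == 0 else ngram)
    return tokens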

In [9]:
print(sub_words_tokenize("teach multimedia", 3))

['#tea', 'eac', 'ach', '#mul', 'ult', 'lti', 'tim', 'ime', 'med', 'edi', 'dia']
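
Because the window slides one character at a time, a word of length n that is longer than k contributes n - k + 1 sub-words. The following sketch checks that count with a standalone helper. `char_ngrams` is a name made up for this example and mirrors only the inner loop of `sub_words_tokenize`, without the start-of-word mark.

```python
def char_ngrams(word: str, k: int) -> list[str]:
    """Overlapping character k-grams of one word; words no longer than k are returned whole."""
    if len(word) <= k:
        return [word]
    return [word[i:i + k] for i in range(len(word) - k + 1)]

word = "teach"
for k in range(1, 7):
    grams = char_ngrams(word, k)
    print(k, len(grams), grams)
```

Small values of k produce many short, highly overlapping pieces, and once k reaches the word length the word is passed through unchanged.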


In [None]:
MAX_K = 11

def show_sub_words(language: str, mark: bool):
    tokens = []
    for k in range(1, MAX_K):  # MAX_K is exclusive, so k runs from 1 to MAX_K - 1
        tokens.append(sub_words_tokenize(texts[language], k, "#" if mark else ""))
    with (out_text := widgets.Output()):
        display(Markdown(texts[language]))
    with (out_tokens := widgets.Output(layout = {'padding': '0px 50px', 'min_width': '60%'})):
        headers = [str(i) for i in range(1, MAX_K, 1)]
        rows = [[f'<{len(t)}>' for t in tokens]] + [row for row in zip_longest(*tokens, fillvalue='')]
        print(tabulate(rows[:30], headers, tablefmt="github"))
    with out_result:
        clear_output()
        display(widgets.HBox([out_text, out_tokens]))

opt_language = widgets.Dropdown(description='language', options=['abbreviations', 'english', 'french', 'german', 'polish', 'chinese'])
opt_mark = widgets.Checkbox(description='mark start-of-word sequences')

out_result = widgets.Output()
display(widgets.interactive(show_sub_words, language=opt_language, mark=opt_mark))
display(out_result)


---