# Tokenization
By Youssef Al Hariri

This notebook aims to simplify the concept of tokenization as used in text processing and LLM modelling.



## Learning objectives:
- Practice different tokenization strategies (split, regex, NLTK, spaCy, Hugging Face).
- Interactively compare tokenizer outputs side-by-side.
- Extract and view slides from the lecture PPTX at runtime.

## What is Tokenization?

Tokenization is the process of splitting text into smaller units (tokens). A token is a sequence of characters treated as a single unit, such as a word, a punctuation mark, or a subword piece.

For example, here is how the simple form of tokenization works:
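As a minimal sketch of that simple form, the snippet below splits a short sentence on whitespace. The sentence and variable names are made up for illustration and are not taken from the lecture material.

```python
# Simplest form of tokenization: split the text on whitespace.
# The example text is hypothetical, chosen only for illustration.
text = "Tokenization splits text into tokens"
tokens = text.split()
print(tokens)
# ['Tokenization', 'splits', 'text', 'into', 'tokens']
```

Each whitespace-separated chunk becomes one token, which is the baseline the later methods refine.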


## 1) Install the required libraries:

This step is not always required. You only need to install a library when an import fails with an error message. It is best to activate a virtual environment before installing libraries.

## Quick setup and install hints

If you run this notebook locally or in Colab, you may need to install a few packages. Recommended commands:

- For the interactive widgets: `pip install ipywidgets`
- For spaCy: `pip install spacy` and `python -m spacy download en_core_web_sm`
- For Hugging Face tokenizers: `pip install transformers tokenizers`
- For NLTK (if not already installed): `pip install nltk`, then run `nltk.download('punkt')` in Python. Newer NLTK releases may ask for `punkt_tab` instead.

The next cell has the required lines to install the libraries. You can simply uncomment these lines and run them. The examples below attempt safe imports and will print a friendly message if a package is missing.

In [None]:
# !pip install ipywidgets
# !pip install transformers tokenizers
# !pip install nltk
# !pip install spacy
# !python -m spacy download en_core_web_sm

# import nltk
# nltk.download('punkt')

## 2) Let us define the example sentence:

In [64]:
sentence = "@Youssef truly loves #AIresearch, but 🤖 & 🧠 make him go 'hmm...?'"
print('Sentence:', sentence)

Sentence: @Youssef truly loves #AIresearch, but 🤖 & 🧠 make him go 'hmm...?'


## 3) Let us exercise different tools and methods for tokenization:

#### A) Practice the methods:

##### 1. Using the str.split() method:


In [55]:
tokens_split = sentence.split()
print(f'Tokens (split): {tokens_split}')

Tokens (split): ['@Youssef', 'truly', 'loves', '#AIresearch,', 'but', '🤖', '&', '🧠', 'make', 'him', 'go', "'hmm...?'"]


##### 2. Using regex:

We need to use the re library with findall function as shown below.

In [42]:
import re
tokens_regex = re.findall(r'\w+', sentence)
print(f'Tokens (regex): {tokens_regex}')


Tokens (regex): ['Youssef', 'truly', 'loves', 'AIresearch', 'but', 'make', 'him', 'go', 'hmm']


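Notice that `\w+` keeps only runs of word characters (letters, digits, and underscores), so the punctuation, symbols, and emojis are dropped.


Below, we extend the pattern so that every non-space character that is not a word character, including each emoji, becomes its own token. (In Python 3, `re.UNICODE` is already the default for string patterns, so the flag is optional.)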

In [43]:
tokens_regex = re.findall(r'\w+|[^\w\s]', sentence, re.UNICODE)
print(f'Tokens (regex): {tokens_regex}')

Tokens (regex): ['@', 'Youssef', 'truly', 'loves', '#', 'AIresearch', ',', 'but', '🤖', '&', '🧠', 'make', 'him', 'go', "'", 'hmm', '.', '.', '.', '?', "'"]
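The pattern above separates `@` and `#` from the word that follows them. As a hedged sketch, one way to keep mentions and hashtags in one piece is to add an alternative that is tried first. The pattern and the helper name `regex_tokenize` are assumptions made for illustration, not part of the original notebook.

```python
import re

# Assumed pattern (not from the notebook): try "@name" / "#tag" first,
# then plain words, then any single non-space symbol.
SOCIAL_PATTERN = re.compile(r"[@#]\w+|\w+|[^\w\s]")

def regex_tokenize(text):
    """Return tokens, keeping mentions and hashtags in one piece."""
    return SOCIAL_PATTERN.findall(text)

demo = "@Youssef loves #AIresearch, truly!"
print(regex_tokenize(demo))
# ['@Youssef', 'loves', '#AIresearch', ',', 'truly', '!']
```

The order of alternatives matters because the regex engine takes the first one that matches at each position. This simple pattern would also treat the `@b` in an email address such as `a@b.com` as a mention, so it is a starting point rather than a complete solution.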


##### 3. Using NLTK:

The first method uses `word_tokenize` from NLTK:

In [67]:
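try:
    import nltk
    tokens_nltk = nltk.word_tokenize(sentence)
    print(f'Tokens (NLTK): {tokens_nltk}')
except ImportError:
    print('NLTK is not installed. Run: pip install nltk')
except LookupError:
    # word_tokenize needs the Punkt data; newer NLTK releases may ask for 'punkt_tab' instead
    print("NLTK tokenizer data is missing. Run: nltk.download('punkt')")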

Tokens (NLTK): ['@', 'Youssef', 'truly', 'loves', '#', 'AIresearch', ',', 'but', '🤖', '&', '🧠', 'make', 'him', 'go', "'hmm", '...', '?', "'"]


The second method uses `TweetTokenizer` from NLTK:

In [68]:
try:
    import nltk
    tokens_nltk = nltk.TweetTokenizer().tokenize(sentence)
    print(f'Tokens (NLTK TweetTokenizer): {tokens_nltk}')
except ImportError:
    print('NLTK is not installed. Run: pip install nltk')


Tokens (NLTK TweetTokenizer): ['@Youssef', 'truly', 'loves', '#AIresearch', ',', 'but', '🤖', '&', '🧠', 'make', 'him', 'go', "'", 'hmm', '...', '?', "'"]


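Notice how the two NLTK tokenizers treat the same input differently: `word_tokenize` splits the mention and the hashtag into '@' + 'Youssef' and '#' + 'AIresearch', whereas `TweetTokenizer` keeps '@Youssef' and '#AIresearch' intact.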


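##### 4. Using spaCy:
spaCy is a powerful, open-source Python library designed for advanced Natural Language Processing (NLP). The cell below loads the `en_core_web_sm` model and falls back to a blank English pipeline, which still tokenizes, if that model is not available.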

In [66]:
try:
    import spacy
    try:
        nlp = spacy.load('en_core_web_sm')
    except Exception:
        # fallback: create blank English model (will still tokenize)
        nlp = spacy.blank('en')
    doc = nlp(sentence)
    print('spaCy tokens:', [t.text for t in doc])
except ImportError:
    print('spaCy is not installed. Run: pip install spacy && python -m spacy download en_core_web_sm')


spaCy tokens: ['@Youssef', 'truly', 'loves', '#', 'AIresearch', ',', 'but', '🤖', '&', '🧠', 'make', 'him', 'go', "'", 'hmm', '...', '?', "'"]


##### 5. Hugging Face / subword tokenizers

Subword tokenizers are used by modern LLMs. This demo uses a pretrained tokenizer (`bert-base-uncased`) to show how text is broken into subwords and token IDs. If `transformers` is not installed, the cell prints instructions for installing it.

A. AutoTokenizer:

In [69]:
try:
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    tokens_hf = tokenizer.tokenize(sentence)
    print(f'Tokens (HuggingFace): {tokens_hf}')
except ImportError:
    print('Transformers is not installed. Run: pip install transformers tokenizers')

Tokens (HuggingFace): ['@', 'you', '##sse', '##f', 'truly', 'loves', '#', 'aires', '##ear', '##ch', ',', 'but', '[UNK]', '&', '[UNK]', 'make', 'him', 'go', "'", 'hmm', '.', '.', '.', '?', "'"]


B. BertTokenizer:


In [75]:
try:
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokens_hf = tokenizer.tokenize(sentence)
    print(f'Tokens (HuggingFace BertTokenizer): {tokens_hf}')
except ImportError:
    print('Transformers is not installed. Run: pip install transformers tokenizers')
except Exception as e:
    print(f'Error loading BertTokenizer: {e}')

Tokens (HuggingFace BertTokenizer): ['@', 'you', '##sse', '##f', 'truly', 'loves', '#', 'aires', '##ear', '##ch', ',', 'but', '[UNK]', '&', '[UNK]', 'make', 'him', 'go', "'", 'hmm', '.', '.', '.', '?', "'"]


In both cases, the output is identical, and we observe how the name "Youssef" is lowercased (the model is uncased) and tokenized into three subword pieces: 'you', '##sse', and '##f'. This reflects BERT's WordPiece tokenizer strategy, where uncommon words are broken into known subword units, using the ## prefix to indicate _continuation_ of the preceding piece.

Additionally, emojis such as 🤖 and 🧠 are replaced with the special token [UNK], which stands for "unknown", meaning these characters are ___out-of-vocabulary___ (OOV): they do not appear in the tokenizer's vocabulary, so the model has no dedicated embedding for them.
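The greedy longest-match-first idea behind WordPiece can be sketched in a few lines of plain Python. This is a simplified illustration, not the Hugging Face implementation. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical, and the input is assumed to be already lowercased and split into words.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary
# has about 30,000 entries.
toy_vocab = {"you", "##sse", "##f", "truly", "loves"}

print(wordpiece("youssef", toy_vocab))  # ['you', '##sse', '##f']
print(wordpiece("truly", toy_vocab))    # ['truly']
print(wordpiece("🤖", toy_vocab))       # ['[UNK]']
```

With this toy vocabulary the sketch reproduces the split of "youssef" shown in the output, and any word that cannot be covered by vocabulary pieces collapses to a single [UNK].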

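Below, we display the tokenized output and the corresponding token IDs. Each ID indexes a row of the model's embedding table, and the resulting vectors serve as input to the transformer layers. The ID list is two entries longer than the token list because calling the tokenizer adds the special tokens [CLS] (101) and [SEP] (102) at the start and end.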

In [74]:
try:
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    tokens_hf = tokenizer.tokenize(sentence)
    tokens_ids = tokenizer(sentence)['input_ids']
    print(f'Tokens (HuggingFace AutoTokenizer): {tokens_hf}')
    print(f'Token IDs (HuggingFace AutoTokenizer): {tokens_ids}')
except ImportError:
    print('Transformers is not installed. Run: pip install transformers tokenizers')
except Exception as e:
    print(f'Error loading AutoTokenizer: {e}')

Tokens (HuggingFace AutoTokenizer): ['@', 'you', '##sse', '##f', 'truly', 'loves', '#', 'aires', '##ear', '##ch', ',', 'but', '[UNK]', '&', '[UNK]', 'make', 'him', 'go', "'", 'hmm', '.', '.', '.', '?', "'"]
Token IDs (HuggingFace AutoTokenizer): [101, 1030, 2017, 11393, 2546, 5621, 7459, 1001, 9149, 14644, 2818, 1010, 2021, 100, 1004, 100, 2191, 2032, 2175, 1005, 17012, 1012, 1012, 1012, 1029, 1005, 102]
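Conceptually, turning tokens into IDs is a dictionary lookup followed by wrapping the sequence in special tokens. The sketch below illustrates this with a handful of IDs read off the output above, assuming the tokens and IDs line up position by position after the leading [CLS]. The names `toy_ids` and `encode` are hypothetical, and this is not how the library is implemented internally.

```python
# IDs copied from the output above; only a few entries are reproduced here.
toy_ids = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
           "@": 1030, "you": 2017, "##sse": 11393, "##f": 2546}

def encode(tokens, vocab=toy_ids):
    """Map tokens to IDs and wrap them in [CLS] ... [SEP]."""
    body = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]

print(encode(["@", "you", "##sse", "##f"]))
# [101, 1030, 2017, 11393, 2546, 102]
print(encode(["@", "unseen"]))
# [101, 1030, 100, 102]
```

The first call reproduces the start of the ID list shown above, followed by the closing [SEP]. In the second call the token missing from the toy table falls back to the [UNK] ID, whereas the real tokenizer would first try to split an unseen word into subword pieces.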


## Comparison and notes

- `str.split()` is fast but naive: it splits only on whitespace, so punctuation stays attached to words (e.g. '#AIresearch,').
- Regex-based splitting can separate or remove punctuation, but may break up or drop tokens such as hashtags and mentions, depending on the pattern.
- NLTK and spaCy provide linguistically informed tokenization (better for downstream NLP).
- Hugging Face tokenizers break words into subwords and produce IDs for model input (essential for LLMs).

When choosing a tokenizer, consider the downstream task, the vocabulary, and whether the model you are using expects a specific subword tokenization.
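The first two points can be checked with the standard library alone. The following sketch runs three strategies on a short made-up string and prints the token counts next to the tokens. The helper name `compare` and the sample text are assumptions for illustration only.

```python
import re

def compare(text):
    """Tokenize text with three standard-library strategies."""
    return {
        "split": text.split(),
        "regex words": re.findall(r"\w+", text),
        "regex words+symbols": re.findall(r"\w+|[^\w\s]", text),
    }

# Print one row per strategy: name, number of tokens, tokens.
for name, tokens in compare("Hello, world! #nlp").items():
    print(f"{name:20} {len(tokens):2d} {tokens}")
```

On this input, `split()` leaves the comma and exclamation mark attached, the word-only pattern silently drops them along with the `#`, and the combined pattern keeps every symbol as a separate token.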

In [72]:
# Interactive tokenizer playground (ipywidgets)
# This cell creates a small UI to compare tokenizers side-by-side. It handles missing packages gracefully.
try:
    import ipywidgets as widgets
    from IPython.display import display, HTML, Markdown, clear_output
    import re
    # safe imports for tokenizers
    try:
        from nltk.tokenize import word_tokenize
    except Exception:
        word_tokenize = None
    try:
        import spacy
        try:
            _nlp = spacy.load('en_core_web_sm')
        except Exception:
            _nlp = spacy.blank('en')
    except Exception:
        _nlp = None
    try:
        from transformers import AutoTokenizer
    except Exception:
        AutoTokenizer = None

    # Controls
    text_in = widgets.Textarea(value=sentence, description='Text:', layout=widgets.Layout(width='100%', height='80px'))
    cb_split = widgets.Checkbox(value=True, description='split()')
    cb_regex = widgets.Checkbox(value=True, description=r'regex split (\W+)')
    cb_nltk = widgets.Checkbox(value=False, description='NLTK')
    cb_spacy = widgets.Checkbox(value=False, description='spaCy')
    cb_hf = widgets.Checkbox(value=False, description='HuggingFace (subword)')
    hf_model = widgets.Dropdown(options=['distilbert-base-uncased','bert-base-uncased'], value='distilbert-base-uncased', description='HF model:')
    out = widgets.Output()

    def run_tokenizers(_=None):
        with out:
            clear_output()
            s = text_in.value
            results = {}
            if cb_split.value:
                results['split'] = [f"\"{x}\"" for x in s.split()]
            if cb_regex.value:
                # use regex "\W+" to split on non-word characters (keeps underscores and letters/digits)
                results['regex'] = [f"\"{t}\"" for t in re.split(r'\W+', s) if t!='']
            if cb_nltk.value:
                if word_tokenize is not None:
                    results['nltk'] = [f"\"{x}\"" for x in word_tokenize(s)]
                else:
                    results['nltk'] = "NLTK not installed (pip install nltk; then nltk.download('punkt'))"
            if cb_spacy.value:
                if _nlp is not None:
                    results['spacy'] = [f"\"{t.text}\"" for t in _nlp(s)]
                else:
                    results['spacy'] = 'spaCy not installed (pip install spacy; python -m spacy download en_core_web_sm)'
            if cb_hf.value:
                if AutoTokenizer is not None:
                    try:
                        tok = AutoTokenizer.from_pretrained(hf_model.value)
                        results['hf_tokens'] = [f"\"{x}\"" for x in tok.tokenize(s)]
                        # tokenizer(...) returns a dict-like BatchEncoding; 'input_ids' gives token ids
                        results['hf_ids'] = [f"\"{x}\"" for x in tok(s)['input_ids']]
                    except Exception as e:
                        results['hf_tokens'] = f'Failed to load HF tokenizer: {e}'
                else:
                    results['hf_tokens'] = 'Transformers not installed (pip install transformers tokenizers)'

            cols = list(results.keys())
            if not cols:
                display(Markdown('No tokenizers selected.'))
                return
            # Build a simple HTML table showing each tokenizer's output in a column
            html_lines = []
            # header row
            html_lines.append('<table style=\"width:100%; border-collapse:collapse; font-family: monospace;\">')
            html_lines.append('<tr>')
            html_lines.append(f'<th style=\"text-align:left; border-bottom:1px solid #ccc; padding:6px\">Method</th><th style=\"text-align:left; border-bottom:1px solid #ccc; padding:6px\">Value</th>')
            html_lines.append('</tr>')
            for c in cols:
                html_lines.append('<tr>')
                val = results[c]
                if isinstance(val, list):
                    out_txt = ' '.join(str(x) for x in val)
                else:
                    out_txt = str(val)
                html_lines.append(f'<th style=\"text-align:left; border-bottom:1px solid #ccc; padding:6px\">{c}</th>')
                # escape &, < and > to avoid HTML issues in tokens (& must be replaced first)
                out_txt = out_txt.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
                html_lines.append(f'<td style=\"vertical-align:top; text-align:left; padding:6px\"><pre>{out_txt}</pre></td>')

                html_lines.append('</tr>')
            # close the table
            html_lines.append('</table>')
            display(HTML('\n'.join(html_lines)))

    run_btn = widgets.Button(description='Run tokenizers', button_style='primary')
    run_btn.on_click(run_tokenizers)

    controls = widgets.HBox([widgets.VBox([text_in, widgets.HBox([run_btn])]), widgets.VBox([cb_split, cb_regex, cb_nltk, cb_spacy, cb_hf, hf_model])])
    display(controls, out)

except Exception as e:
    print('ipywidgets not available or failed to render widgets. Install via: pip install ipywidgets')
    print('Error:', e)


I hope this notebook has explained the concept of tokenization through practical examples of different methods and applications.