# Positional Encoding and BPE Tokenization — Interactive Demo (Gradio)

This notebook walks step-by-step through:

1. Installing required libraries (Gradio for UI, Tokenizers for BPE, NumPy for math, ReportLab for the PDF report).
2. Implementing a minimal Byte-Pair Encoding (BPE) tokenizer using `huggingface/tokenizers`.
3. Building a simple embedding matrix and producing token embeddings for an input sentence.
4. Displaying the number of tokens created by BPE and the token→id mapping (tokenization).
5. Saving a new vocabulary file built from the tokens encountered.
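6. Computing sinusoidal positional encodings, as described in the original Transformer paper:
   - Even dimensions: `PE[pos, 2i]   = sin(pos / (10000^(2i/d_model)))`
   - Odd  dimensions: `PE[pos, 2i+1] = cos(pos / (10000^(2i/d_model)))`
7. Combining embeddings and encodings as `X = sqrt(d_model) * E + PE`.
8. Writing a short PDF report of the inputs and results.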
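
As a quick sanity check on the sinusoidal formula, the encoding at position 1 can be worked out by hand for a small model with `d_model = 4` (a value chosen here purely for illustration):

```latex
\begin{aligned}
PE_{(1,0)} &= \sin\!\left(1 / 10000^{0/4}\right) = \sin(1) \approx 0.8415 \\
PE_{(1,1)} &= \cos\!\left(1 / 10000^{0/4}\right) = \cos(1) \approx 0.5403 \\
PE_{(1,2)} &= \sin\!\left(1 / 10000^{2/4}\right) = \sin(0.01) \approx 0.0100 \\
PE_{(1,3)} &= \cos\!\left(1 / 10000^{2/4}\right) = \cos(0.01) \approx 0.99995
\end{aligned}
```

Each sine/cosine pair shares one frequency, and that frequency drops as the column index grows.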

You will be able to type a sentence and see:
- The embeddings for its tokens,
- The token count,
- The token→id mapping,
- The saved vocabulary file generated from these tokens,
- The positional encodings for the sequence,
- The combined input `X = sqrt(d_model) * E + PE`,
- A downloadable PDF report of the results.

The code cells are commented throughout.


In [8]:
# Install the required libraries with pip.
# The leading exclamation mark tells Jupyter to run the line as a shell command.
# We install:
#  - gradio: to build an interactive UI right inside this notebook.
#  - tokenizers: Hugging Face's fast BPE implementation for tokenization.
#  - numpy: numerical computations for embeddings and positional encodings.
#  - reportlab: PDF generation for the downloadable report.
!pip -q install gradio tokenizers numpy reportlab


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
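
The warning above is emitted by `tokenizers` when the process forks after the library has already used parallelism; in a notebook the `!pip` shell call is a likely trigger, though that is an assumption about this session. Following the message's own suggestion, a minimal sketch for silencing it is to set the environment variable before `tokenizers` is imported or used:

```python
import os

# Set the variable before `tokenizers` is imported or used.
# "false" disables the library's internal parallelism; setdefault leaves any
# value that was already exported in the shell untouched.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

print(os.environ["TOKENIZERS_PARALLELISM"])
```

For a demo that tokenizes one sentence at a time, disabling parallelism should not have a noticeable cost.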


In [9]:
# Import standard libraries and third-party packages. Each import is commented.
import os  # Operating system interfaces (used for file paths and saving vocab).
from typing import Dict, List, Tuple  # Type hints to clarify function inputs/outputs.

# We attempt to import third-party packages and, if missing, install them on the fly.
try:
    import numpy as np  # Core numerical library for arrays, random numbers, and math ops.
except ImportError:
    import sys, subprocess  # Fallback to install numpy if not present.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numpy"])
    import numpy as np

# Try to import gradio, install if it's missing so this cell never fails when run first.
try:
    import gradio as gr  # Gradio provides a simple UI for interacting with Python functions.
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "gradio"])
    import gradio as gr

# Try to import tokenizers, install if it's missing.
try:
    from tokenizers import Tokenizer  # The main Tokenizer object.
    from tokenizers.models import BPE  # The BPE model specification.
    from tokenizers.trainers import BpeTrainer  # Trainer to learn merges/vocab.
    from tokenizers.pre_tokenizers import ByteLevel  # Byte-level pre-tokenization (robust for text).
    from tokenizers.decoders import ByteLevel as ByteLevelDecoder  # Decoder to map ids back to text.
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "tokenizers"])
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import ByteLevel
    from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Try to import reportlab for PDF generation; install on-the-fly if missing.
try:
    from reportlab.lib.pagesizes import letter  # Standard US letter page size.
    from reportlab.pdfgen import canvas  # Canvas API to draw text on PDF pages.
    from reportlab.lib.units import inch  # Convenience for positioning.
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "reportlab"])
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas
    from reportlab.lib.units import inch

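# Seed NumPy's global generator once. Restarting the kernel and re-running the
# notebook reproduces the same sequence of random values, but every later call
# that draws from the generator (such as the embedding helper) gets fresh values.
np.random.seed(42)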




In [10]:
# Helper: Train a small BPE tokenizer from an iterator of texts.
# We use ByteLevel pre-tokenization, which operates on bytes (robust, as in GPT-2).
def train_bpe_tokenizer(texts: List[str],
                        vocab_size: int = 2000,
                        special_tokens: Tuple[str, ...] = (
                            "[PAD]", "[UNK]", "[BOS]", "[EOS]"
                        )) -> Tokenizer:
    """
    Trains a BPE tokenizer on the provided texts.

    Parameters
    ----------
    texts : List[str]
        Sentences or documents to learn BPE vocab/merges from.
    vocab_size : int
        Maximum vocabulary size (including special tokens).
    special_tokens : Tuple[str, ...]
        Special tokens to reserve at the start of the vocabulary.

    Returns
    -------
    Tokenizer
        A trained Hugging Face `Tokenizer` configured for BPE.
    """
    # Initialize a BPE model with an unknown token (used when encountering OOV tokens).
    model = BPE(unk_token="[UNK]")

    # Create the Tokenizer object with this model.
    tokenizer = Tokenizer(model)

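    # Use the byte-level pre-tokenizer: it maps every byte to a printable character
    # and splits the text into word-like chunks, so any input string can be represented.
    # add_prefix_space=True prepends a space so the first word is treated like the others.
    tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=True)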

    # Use a matching byte-level decoder to reconstruct text if needed.
    tokenizer.decoder = ByteLevelDecoder()

    # Trainer defines training parameters like vocab size and special tokens.
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=list(special_tokens))

    # Train the tokenizer on the provided texts iterator.
    tokenizer.train_from_iterator(texts, trainer=trainer)

    # Return the trained tokenizer ready for encoding.
    return tokenizer
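
To make the idea behind the trainer more concrete, here is a toy, pure-Python sketch of how BPE learns merges: count adjacent symbol pairs, fuse the most frequent pair, and repeat. It is an illustration only, not what `tokenizers` runs internally; the function name `learn_bpe_merges`, the tie-breaking rule, and the four-word corpus are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary but deterministic choice for this sketch).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_corpus: Counter = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the shared stem `low` has become a single symbol, which is the behaviour the `vocab_size` budget controls in the real trainer: a larger budget allows more merges and therefore longer tokens.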


In [11]:
# Helper: Create an embedding matrix and look up embeddings for a token id sequence.
def build_embeddings_and_lookup(vocab_size: int,
                                 d_model: int,
                                 token_ids: List[int]) -> Tuple[np.ndarray, np.ndarray]:
    """
    Builds a random embedding matrix and returns the matrix and the embeddings
    corresponding to the provided token ids.

    Parameters
    ----------
    vocab_size : int
        Size of the vocabulary (number of rows in the embedding matrix).
    d_model : int
        Dimensionality of each embedding vector (number of columns).
    token_ids : List[int]
        Sequence of token indices to look up embeddings for.

    Returns
    -------
    Tuple[np.ndarray, np.ndarray]
        (embedding_matrix, token_embeddings) where:
        - embedding_matrix has shape (vocab_size, d_model)
        - token_embeddings has shape (len(token_ids), d_model)
    """
    # Create a random embedding matrix with small values (mean=0, std ~ 0.02 typical for NLP inits).
    embedding_matrix = np.random.normal(loc=0.0, scale=0.02, size=(vocab_size, d_model)).astype(np.float32)

    # Convert token_ids to a NumPy array for indexing the embedding matrix.
    token_ids_array = np.array(token_ids, dtype=np.int64)

    # Gather the rows corresponding to each token id to get their embeddings.
    token_embeddings = embedding_matrix[token_ids_array]

    # Return both the full matrix (for inspection) and the looked-up embeddings.
    return embedding_matrix, token_embeddings
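
Because the helper above draws from NumPy's global generator, two calls with the same token ids return different embeddings. If repeatable values per call are wanted, one option is a local generator with a fixed seed. The sketch below assumes a hypothetical helper `lookup_embeddings` with a `seed` parameter that is not part of the notebook:

```python
import numpy as np


def lookup_embeddings(vocab_size: int, d_model: int, token_ids, seed: int = 42) -> np.ndarray:
    """Build a seeded random embedding matrix and return the rows for token_ids."""
    # A local Generator keeps the draw independent of NumPy's global random state.
    rng = np.random.default_rng(seed)
    matrix = rng.normal(loc=0.0, scale=0.02, size=(vocab_size, d_model)).astype(np.float32)
    # Integer-array indexing gathers one row per token id, repeats included.
    return matrix[np.asarray(token_ids, dtype=np.int64)]


emb = lookup_embeddings(vocab_size=10, d_model=4, token_ids=[3, 7, 3])
print(emb.shape)                       # (3, 4)
print(np.array_equal(emb[0], emb[2]))  # True: the same id maps to the same row
```

Calling the function again with the same arguments returns an identical array, which makes the tables in the UI stable for a given sentence.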


In [12]:
# Helper: Compute sinusoidal positional encodings as used in Transformers.
def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """
    Computes a (seq_len, d_model) positional encoding matrix using sinusoidal functions:

      PE[pos, 2i]   = sin(pos / (10000^(2i/d_model)))
      PE[pos, 2i+1] = cos(pos / (10000^(2i/d_model)))

    Parameters
    ----------
    seq_len : int
        Length of the sequence (number of positions).
    d_model : int
        Embedding dimension of the model.

    Returns
    -------
    np.ndarray
        Positional encodings of shape (seq_len, d_model).
    """
    # Create an array of positions [0, 1, 2, ..., seq_len-1] with shape (seq_len, 1).
    positions = np.arange(seq_len, dtype=np.float32)[:, np.newaxis]

    # Create the even column indices 2i = [0, 2, 4, ...] with shape (1, ceil(d_model/2)).
    two_i = np.arange(0, d_model, 2, dtype=np.float32)[np.newaxis, :]

    # Compute the denominator term 10000^(2i/d_model) as in the paper.
    denom = np.power(10000.0, two_i / d_model)

    # Compute the angles = positions / denom, broadcast to shape (seq_len, ceil(d_model/2)).
    angle_rates = positions / denom

    # Initialize PE matrix with zeros for both even and odd columns.
    pe = np.zeros((seq_len, d_model), dtype=np.float32)

    # Even indices (0, 2, 4, ...): apply sine.
    pe[:, 0::2] = np.sin(angle_rates)

    # Odd indices (1, 3, 5, ...): apply cosine. The slice drops the extra
    # column when d_model is odd, so the shapes always match.
    pe[:, 1::2] = np.cos(angle_rates[:, : d_model // 2])

    # Return the completed positional encoding matrix.
    return pe
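
Two properties of the sinusoidal encoding are easy to verify and useful as a smoke test: position 0 always encodes to the pattern `[0, 1, 0, 1, ...]`, and every sine/cosine pair has unit length. The sketch below recomputes one row with only the standard library; `pe_row` is a helper written for this check, not a function from the notebook.

```python
import math
from typing import List


def pe_row(pos: int, d_model: int) -> List[float]:
    """Return the sinusoidal positional encoding for a single position."""
    row = []
    for k in range(d_model):
        two_i = k - (k % 2)  # columns 2i and 2i+1 share the exponent 2i/d_model
        angle = pos / (10000 ** (two_i / d_model))
        row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return row


print(pe_row(0, 4))  # [0.0, 1.0, 0.0, 1.0]

# Each (sin, cos) pair lies on the unit circle, whatever the position.
row = pe_row(5, 8)
pair_norms = [row[k] ** 2 + row[k + 1] ** 2 for k in range(0, 8, 2)]
print(all(abs(n - 1.0) < 1e-12 for n in pair_norms))  # True
```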


In [13]:
# Helper: Create a concise PDF report capturing inputs, calculations, and results.
def create_pdf_report(
    sentence: str,
    d_model: int,
    tokens: List[str],
    token_ids: List[int],
    token_embeddings: np.ndarray,
    positional_enc: np.ndarray,
    summed_matrix: np.ndarray,
    out_path: str,
    preview_rows: int = 6,
    preview_cols: int = 8,
) -> str:
    """
    Builds a PDF report summarizing the pipeline:
      - Input sentence and d_model
      - Tokens and IDs
      - Shapes of matrices
      - A small preview (top-left corner) of embeddings, positional encodings, and the summed matrix

    Parameters
    ----------
    sentence : str
        The input sentence provided by the user.
    d_model : int
        The embedding/hidden dimension used.
    tokens : List[str]
        Token strings from BPE.
    token_ids : List[int]
        Corresponding token ids.
    token_embeddings : np.ndarray
        Raw token embeddings (seq_len, d_model).
    positional_enc : np.ndarray
        Positional encodings (seq_len, d_model).
    summed_matrix : np.ndarray
        Result of sqrt(d_model) * token_embeddings + positional_enc.
    out_path : str
        Full file path where the PDF should be saved.
    preview_rows : int
        Number of rows to preview for matrices.
    preview_cols : int
        Number of columns to preview for matrices.

    Returns
    -------
    str
        The path to the created PDF file.
    """
    c = canvas.Canvas(out_path, pagesize=letter)
    width, height = letter

    # Vertical cursor: start one inch below the top edge of the page.
    current_y = height - 1*inch

    def writeln(text, x=1*inch, y=None, leading=14):
        # Draw one logical line of text and move the cursor down.
        nonlocal current_y
        if y is not None:
            current_y = y
        # Wrap long lines manually by cutting them into fixed-width chunks.
        max_chars = 95
        lines = [text[i:i+max_chars] for i in range(0, len(text), max_chars)] or [""]
        for line in lines:
            # Start a new page before drawing below the bottom margin, so long
            # token lists cannot run off the page.
            if current_y < 1*inch:
                c.showPage()
                current_y = height - 1*inch
            c.drawString(x, current_y, line)
            current_y -= leading

    def write_matrix_preview(title, mat: np.ndarray):
        # Print the top-left (preview_rows x preview_cols) corner of a matrix.
        writeln(title)
        r = min(preview_rows, mat.shape[0])
        k = min(preview_cols, mat.shape[1])
        # Header
        header = "cols 0..{} (rounded to 4 dp)".format(k-1)
        writeln(header)
        for i in range(r):
            row_vals = ", ".join(f"{v:.4f}" for v in mat[i, :k])
            writeln(f"row {i}: [ {row_vals} ]")

    writeln("Positional Encoding & BPE — Report", y=current_y)
    writeln("Reference: Vaswani et al., 2017 — 'Attention Is All You Need'")
    writeln("")
    writeln(f"Input sentence: {sentence}")
    writeln(f"d_model: {d_model}")
    writeln(f"Token count: {len(token_ids)}")
    writeln(f"Tokens: {tokens}")
    writeln(f"Token IDs: {token_ids}")
    writeln("")

    # Shapes and formula
    writeln("Shapes:")
    writeln(f" - token_embeddings: {token_embeddings.shape}")
    writeln(f" - positional_enc:   {positional_enc.shape}")
    writeln(f" - summed_matrix:     {summed_matrix.shape}")
    writeln("")
    writeln("We compute: X = sqrt(d_model) * E + PE, where E are token embeddings and PE are sinusoidal encodings.")
    writeln("")

    # Previews
    write_matrix_preview("Preview: token embeddings (E)", token_embeddings)
    write_matrix_preview("Preview: positional encodings (PE)", positional_enc)
    write_matrix_preview("Preview: X = sqrt(d_model) * E + PE", summed_matrix)

    c.showPage()
    c.save()
    return out_path
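
The `writeln` helper wraps text by cutting it every 95 characters, which can split a word or token in the middle. A word-aware alternative, sketched here with the standard library's `textwrap` module, could look like the following; the name `wrap_for_pdf` is made up for this example and the 95-character width simply mirrors the value used in the report helper.

```python
import textwrap
from typing import List


def wrap_for_pdf(text: str, max_chars: int = 95) -> List[str]:
    """Split text into lines of at most max_chars, breaking at whitespace where possible."""
    # textwrap.wrap returns [] for empty input, so fall back to one blank line
    # to keep vertical spacing in the report.
    return textwrap.wrap(text, width=max_chars) or [""]


lines = wrap_for_pdf("Tokens: " + ", ".join(f"tok{i}" for i in range(60)))
for line in lines:
    print(len(line), line)
```

Words longer than the width are still broken, because `textwrap.wrap` does that by default, so no line exceeds `max_chars`.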


In [14]:
# Core processing function used by the Gradio UI.
def process_sentence(sentence: str, d_model: int = 64):
    """
    Given an input sentence and embedding size (d_model), this function:
      1) Trains a small BPE tokenizer on the sentence itself (for demo purposes).
      2) Encodes the sentence to token ids and collects token strings.
      3) Builds a random embedding matrix and looks up embeddings for these token ids.
      4) Saves a new vocabulary file built from the tokenizer's learned vocab.
      5) Computes sinusoidal positional encodings for the sequence length and the
         sum X = sqrt(d_model) * E + PE.
      6) Writes a PDF report summarizing the inputs and results.

    Returns a tuple of objects compatible with Gradio components in this order:
      - embeddings_df: 2D list (seq_len x d_model) initial token embedding values.
      - posenc_df: 2D list (seq_len x d_model) positional encoding values.
      - summed_df: 2D list (seq_len x d_model) for X = sqrt(d_model) * E + PE.
      - token_count: integer number of tokens.
      - token_map_df: 2D list mapping token string -> token id.
      - vocab_file_path: path to the saved vocab JSON file.
      - pdf_report_path: path to a generated PDF capturing inputs and results.
    """
    # Fallback to a default sentence if user input is empty to avoid training on nothing.
    if not sentence or not sentence.strip():
        sentence = "Transformers are powerful sequence models!"

    # 1) Train a small BPE tokenizer on the provided sentence (demo-only training).
    tokenizer = train_bpe_tokenizer([sentence], vocab_size=256)

    # 2) Encode the sentence to get token ids and the matching token strings.
    encoding = tokenizer.encode(sentence)
    token_ids = encoding.ids  # List of integers representing tokens.
    tokens = encoding.tokens  # Corresponding token strings.

    # Number of tokens created by BPE for the input sentence.
    token_count = len(token_ids)

    # Create a token→id mapping table (list of [token, id]) for human-readable display.
    token_map_df = [[tok, int(tid)] for tok, tid in zip(tokens, token_ids)]

    # 3) Build a random embedding matrix and look up embeddings for the token ids.
    vocab_size = len(tokenizer.get_vocab())  # Size of the learned vocabulary.
    _, token_embeddings = build_embeddings_and_lookup(vocab_size=vocab_size,
                                                     d_model=int(d_model),
                                                     token_ids=token_ids)

    # Convert embeddings to a nested list for Gradio Dataframe display (initial embeddings E).
    embeddings_df = token_embeddings.astype(float).tolist()

    # 4) Save the vocabulary learned by the tokenizer to a JSON file.
    #    We store it under the current directory with a clear name.
    vocab_dict: Dict[str, int] = tokenizer.get_vocab()
    # Sort by id for readability (tokenizers may return unsorted dict).
    vocab_items = sorted(vocab_dict.items(), key=lambda kv: kv[1])

    # Resolve the current working directory (normally the notebook's folder) as the output directory.
    out_dir = os.path.abspath(".")
    vocab_file_path = os.path.join(out_dir, "demo_vocab.json")

    # Write vocab to JSON using the standard library.
    import json  # Standard library for JSON serialization.
    with open(vocab_file_path, "w", encoding="utf-8") as f:
        json.dump({k: int(v) for k, v in vocab_items}, f, ensure_ascii=False, indent=2)

    # 5) Compute positional encodings for the sequence length token_count.
    pe = positional_encoding(seq_len=token_count, d_model=int(d_model))

    # Compute X = sqrt(d_model) * E + PE, as in the Transformer paper (Vaswani et al., 2017).
    emb_scaled = token_embeddings * np.sqrt(float(d_model))
    summed = emb_scaled + pe

    # Prepare tables for display.
    posenc_df = pe.astype(float).tolist()
    summed_df = summed.astype(float).tolist()

    # 6) Create a PDF report with a concise summary and previews.
    #    `out_dir` was already resolved in step 4, so it is reused here.
    pdf_file_path = os.path.join(out_dir, "positional_encoding_report.pdf")
    try:
        _ = create_pdf_report(
            sentence=sentence,
            d_model=int(d_model),
            tokens=tokens,
            token_ids=token_ids,
            token_embeddings=token_embeddings,
            positional_enc=pe,
            summed_matrix=summed,
            out_path=pdf_file_path,
        )
    except Exception as e:
        # If PDF generation fails for any reason, we still want the app to work.
        # In that case, we create a small text file explaining the error.
        pdf_file_path = os.path.join(out_dir, "positional_encoding_report_failed.txt")
        with open(pdf_file_path, "w", encoding="utf-8") as f:
            f.write(f"PDF generation failed: {e}\n")
            f.write("Proceeding without the PDF preview.\n")

    # Return all outputs in the order expected by the Gradio interface.
    return embeddings_df, posenc_df, summed_df, int(token_count), token_map_df, vocab_file_path, pdf_file_path
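
The paper multiplies the embeddings by `sqrt(d_model)` without spelling out why. One common reading, offered here as an interpretation rather than a statement from the source, is that it keeps small-valued embeddings from being swamped by positional encodings whose entries range over [-1, 1]. The sketch below compares the spread before and after scaling, using a stand-in matrix drawn with the same standard deviation of 0.02 as the embedding helper; the seed and sample size are arbitrary.

```python
import numpy as np

d_model = 64
rng = np.random.default_rng(0)  # local generator; the seed is arbitrary

# Stand-in for token embeddings E, drawn with std 0.02 like the helper above.
E = rng.normal(loc=0.0, scale=0.02, size=(1000, d_model))
scaled = E * np.sqrt(d_model)

# Scaling by sqrt(64) = 8 should raise the spread from about 0.02 to about 0.16.
print(f"std of E:               {E.std():.3f}")
print(f"std of sqrt(d_model)*E: {scaled.std():.3f}")
```

Even after scaling, the embeddings in this demo stay smaller than the largest positional values, so the PE pattern remains clearly visible in the summed table.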


In [15]:
# Build the Gradio Interface. Each component and parameter is explained.
# Define input components:
#  - A Textbox for entering the sentence.
#  - A Slider to choose d_model (embedding dimension).
inp_sentence = gr.Textbox(label="Enter a sentence",
                          placeholder="Type any sentence to tokenize with BPE...")
inp_dmodel = gr.Slider(minimum=16, maximum=512, step=16, value=64,
                       label="Embedding dimension (d_model)")

# Define output components in the requested order:
out_embeddings = gr.Dataframe(label="Initial Token Embeddings E (rows=tokens, cols=d_model)")
out_posenc = gr.Dataframe(label="Positional Encodings PE (rows=positions, cols=d_model)")
out_summed = gr.Dataframe(label="X = sqrt(d_model) * E + PE (rows=tokens, cols=d_model)")
out_token_count = gr.Number(label="Number of BPE tokens")
out_token_map = gr.Dataframe(headers=["token", "id"],
                             label="Token→ID mapping (BPE tokenization)")
out_vocab_file = gr.File(label="Saved vocabulary file (JSON)")
out_pdf = gr.File(label="Download Report (PDF)")

# Create the Interface, binding inputs to the 'process_sentence' function and mapping outputs.
demo = gr.Interface(
    fn=process_sentence,
    inputs=[inp_sentence, inp_dmodel],
    outputs=[
        out_embeddings,   # initial embeddings E
        out_posenc,       # positional encodings PE
        out_summed,       # X = sqrt(d_model) * E + PE
        out_token_count,  # token count
        out_token_map,    # token->id map
        out_vocab_file,   # vocab JSON
        out_pdf           # PDF report
    ],
    title="BPE Tokenization, Embeddings, Positional Encodings, and Summation",
    description=(
        "Enter a sentence to see BPE tokenization, initial token embeddings (E), sinusoidal positional encodings (PE),\n"
        "their sum X = sqrt(d_model) * E + PE as in 'Attention Is All You Need', a saved vocab, and a downloadable PDF report."
    )
)

# Launch the app in the notebook (inline).
# share=False keeps it local; set share=True if you want a public link (not needed here).
demo.launch(share=False)


* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.


