<img src="https://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Deep Learning Basics with PyTorch

**Dr. Yves J. Hilpisch with GPT-5**

# Chapter 13 — Sequences and Language Data

## Overview

This Colab‑ready notebook mirrors Chapter 13 and walks through:
- Tokenization + padding/mask
- Embedding lookup with simple sinusoidal positions
- A mean‑embedding baseline and a GRU classifier
- A 3D toy embedding space for intuition

Run cells top‑to‑bottom; keep seeds fixed when comparing variants.
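Fixing the seed can be sketched as follows. This is a minimal illustration, not part of the chapter's code: the helper `make_embedding` and the layer sizes are hypothetical, chosen only to show that the same seed reproduces the same initial weights.

```python
import torch
from torch import nn

def make_embedding(seed: int) -> nn.Embedding:
    """Hypothetical helper: fix the global seed, then create a small layer."""
    torch.manual_seed(seed)  # resets PyTorch's default random generator
    return nn.Embedding(num_embeddings=7, embedding_dim=4)

a = make_embedding(0)
b = make_embedding(0)  # same seed, so the same initial weights
c = make_embedding(1)  # different seed, so different initial weights

same_seed_equal = torch.equal(a.weight, b.weight)
diff_seed_equal = torch.equal(a.weight, c.weight)
print(same_seed_equal, diff_seed_equal)
```

Calling `torch.manual_seed` once at the top of a cell that builds a model is usually enough for the small CPU examples in this notebook.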

## Tokenization and padding

Turn raw strings into token IDs and pad to a common length. We also build a mask (1 on tokens, 0 on pads) for use in pooling/attention.

In [None]:
# Minimal vocabulary for demonstration (pad has ID 0)  #  build lookup
vocab = {'<pad>':0, 'good':1, 'bad':2, 'movie':3, 'is':4, 'indeed':5, 'great':6}

import torch  #  tensor library
from torch.nn.utils.rnn import pad_sequence  #  pad variable-length sequences

def encode(text: str, vocab: dict) -> torch.Tensor:
    """Turn a whitespace string into token IDs."""
    ids = [vocab.get(tok, vocab['<pad>']) for tok in text.split()]  #  OOV -> <pad> for simplicity
    return torch.tensor(ids, dtype=torch.long)

def encode_batch(texts, vocab):
    """Return (ids, lengths, mask) for a list of texts."""
    seqs = [encode(t, vocab) for t in texts]  #  list of 1D tensors
    lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)  #  original lengths
    ids = pad_sequence(seqs, batch_first=True, padding_value=vocab['<pad>'])  #  (B, T_max)
    mask = (torch.arange(ids.size(1)).unsqueeze(0) < lengths.unsqueeze(1)).long()  #  1 on tokens, 0 on pads (from lengths, so OOV tokens mapped to <pad> still count)
    return ids, lengths, mask

texts = ["this movie is indeed great", "good movie", "bad movie"]
ids, lengths, mask = encode_batch(texts, vocab)
print(ids)      #  (B, T_max)
print(lengths)  #  originals
print(mask[0])  #  mask for sample 0
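In the cell above, out-of-vocabulary words such as "this" receive the `<pad>` ID, so by ID alone they cannot be told apart from padding. One common alternative is a dedicated unknown token. The sketch below assumes a hypothetical vocabulary `vocab_unk` with an added `'<unk>'` entry and a hypothetical helper `encode_with_unk`; neither appears in the chapter.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical vocabulary: the same words plus an assumed '<unk>' entry
vocab_unk = {'<pad>': 0, '<unk>': 1, 'good': 2, 'bad': 3, 'movie': 4,
             'is': 5, 'indeed': 6, 'great': 7}

def encode_with_unk(text: str, vocab: dict) -> torch.Tensor:
    """Map whitespace tokens to IDs; unknown words go to '<unk>', not '<pad>'."""
    ids = [vocab.get(tok, vocab['<unk>']) for tok in text.split()]
    return torch.tensor(ids, dtype=torch.long)

texts = ["this movie is indeed great", "good movie"]
seqs = [encode_with_unk(t, vocab_unk) for t in texts]
lengths_unk = torch.tensor([len(s) for s in seqs], dtype=torch.long)
ids_unk = pad_sequence(seqs, batch_first=True, padding_value=vocab_unk['<pad>'])

# With a separate '<unk>' ID, comparing against the pad ID is a reliable mask
mask_unk = (ids_unk != vocab_unk['<pad>']).long()
print(ids_unk)
print(mask_unk.sum(dim=1), lengths_unk)  # mask sums match the true lengths
```

With this variant the embedding table needs `len(vocab_unk)` rows, one more than in the original vocabulary.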

## Embedding lookup

Combine learned token embeddings with simple sinusoidal positions to create representations of shape (B, T, D): batch size, sequence length, and embedding dimension.

In [None]:
import torch
from torch import nn

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)  #  tiny dim for demo
ids, lengths, _ = encode_batch(["good movie", "bad movie"], vocab)  #  batch of IDs
pos = torch.arange(ids.size(1)).unsqueeze(0)  #  (1, T)

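def sinusoidal_positions(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """Return sinusoidal position encodings of shape (1, T, dim) for pos of shape (1, T)."""
    i = torch.arange(dim)  #  channel index 0..dim-1
    angle = pos.unsqueeze(-1) / (10000 ** (2 * (i//2) / dim))  #  (1, T, dim); each channel pair shares a frequency
    enc = torch.zeros(pos.size(0), pos.size(1), dim)
    enc[..., 0::2] = torch.sin(angle[..., 0::2])  #  even channels: sine
    enc[..., 1::2] = torch.cos(angle[..., 1::2])  #  odd channels: cosine
    return enc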

pos_enc = sinusoidal_positions(pos, dim=8)  #  (1, T, D)
emb = embed(ids) + pos_enc  #  token + position -> (B, T, D)
print(emb.shape)
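A few properties of the sinusoidal encoding can be checked directly. The sketch below repeats the function from the cell above so that it runs on its own; the sequence length of 4 and the batch size of 3 are arbitrary choices for illustration.

```python
import torch

def sinusoidal_positions(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """Same construction as above: sine on even channels, cosine on odd ones."""
    i = torch.arange(dim)
    angle = pos.unsqueeze(-1) / (10000 ** (2 * (i // 2) / dim))
    enc = torch.zeros(pos.size(0), pos.size(1), dim)
    enc[..., 0::2] = torch.sin(angle[..., 0::2])
    enc[..., 1::2] = torch.cos(angle[..., 1::2])
    return enc

pos = torch.arange(4).unsqueeze(0)      # (1, T) with T = 4
enc = sinusoidal_positions(pos, dim=8)  # (1, T, D)

print(enc.shape)
print(enc[0, 0])       # position 0: sin(0) = 0 and cos(0) = 1, alternating
print(enc.abs().max()) # all values stay within [-1, 1]

# The leading dimension of size 1 broadcasts over any batch size
batch = torch.zeros(3, 4, 8) + enc      # (B, T, D)
print(torch.equal(batch[0], batch[2]))  # every sample gets the same positions
```

Because the encoding depends only on the position index, it is computed once and broadcast, which is why `pos_enc` has shape (1, T, D) while `embed(ids)` has shape (B, T, D).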

## Tiny sentiment toy (mean embedding)

A fast baseline: average embeddings over time and feed to a linear head.

In [None]:
import torch, torch.nn.functional as F
from torch import nn

class MeanEmbeddingClassifier(nn.Module):  #  baseline without RNN
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, 2)  #  binary
    def forward(self, ids):
        x = self.embed(ids).mean(dim=1)  #  mean over time
        return self.fc(x)

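model = MeanEmbeddingClassifier(len(vocab), 16)
ids, _, _ = encode_batch(["good movie", "bad movie"], vocab)
with torch.no_grad():  #  inference only, no gradient tracking
    probs = model(ids).softmax(-1)  #  untrained weights, so the values are arbitrary
print(probs)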
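The plain `.mean(dim=1)` in the baseline averages over every position, pad positions included. That is harmless when all sentences in a batch have the same length, as in the example above, but it skews the average otherwise. A masked mean is one way to handle this. The sketch below uses hand-written `ids` and `mask` tensors and arbitrary layer sizes as assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
embed = nn.Embedding(7, 4)  # arbitrary small sizes for the demo

ids = torch.tensor([[1, 3, 0],    # "good movie" plus one pad
                    [1, 1, 3]])   # "good good movie"
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])  # 1 on tokens, 0 on pads

with torch.no_grad():
    x = embed(ids)                      # (B, T, D)
    m = mask.unsqueeze(-1).to(x.dtype)  # (B, T, 1), broadcasts over D
    # Sum only real tokens, then divide by the number of real tokens
    masked_mean = (x * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)
    naive_mean = x.mean(dim=1)          # also averages the pad position
    # Reference for sample 0: mean over its two real tokens only
    reference = embed(torch.tensor([1, 3])).mean(dim=0)

print(torch.allclose(masked_mean[0], reference))
print(torch.allclose(naive_mean[0], reference))
```

The `clamp(min=1)` guards against division by zero for an all-pad row; whether such rows can occur depends on the data pipeline.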

## GRU classifier (packed sequences)

Pack padded sequences to avoid computing on pad tokens and classify with the final hidden state.

In [None]:
import torch
from torch import nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)
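    def forward(self, ids, lengths):
        x = self.embed(ids)  #  (B, T, E)
        #  lengths must be on the CPU; enforce_sorted=False accepts any batch order
        packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, h = self.gru(packed)  #  h: (num_layers, B, hidden)
        return self.fc(h[-1])  #  last layer's final hidden state -> logits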

texts = ["good movie", "bad movie", "good good movie"]
ids, lengths, _ = encode_batch(texts, vocab)
probs = GRUClassifier(len(vocab))(ids, lengths).softmax(-1)
print(probs)
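The effect of packing can be checked by comparing final hidden states. In the sketch below, the standalone `embed` and `gru` layers and the hand-written `ids` are assumptions for illustration. With packing, the shorter sequence ends in the same hidden state as when it is run alone; without packing, the GRU takes one extra step on the pad embedding.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embed = nn.Embedding(7, 16)
gru = nn.GRU(16, 32, batch_first=True)

ids = torch.tensor([[1, 3, 0],   # "good movie" plus one pad
                    [1, 1, 3]])  # "good good movie"
lengths = torch.tensor([2, 3])   # true lengths, kept on the CPU

with torch.no_grad():
    x = embed(ids)  # (B, T, E)
    packed = pack_padded_sequence(x, lengths, batch_first=True,
                                  enforce_sorted=False)
    _, h_packed = gru(packed)             # stops at each true length
    _, h_padded = gru(x)                  # also steps over the pad position
    _, h_alone = gru(embed(ids[:1, :2]))  # sample 0 on its own, no padding

packed_matches = torch.allclose(h_packed[-1, 0], h_alone[-1, 0], atol=1e-5)
padded_matches = torch.allclose(h_padded[-1, 0], h_alone[-1, 0], atol=1e-5)
print(packed_matches, padded_matches)
```

The final hidden states come back in the original batch order even with `enforce_sorted=False`, so `h[-1]` lines up with the input rows.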

## 3D Toy embedding space (visual check)

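A quick inline plot of a 3D toy embedding space. The coordinates are hand-picked for illustration (not learned) so that sentiment words and neutral nouns form visible clusters; no file writes are needed.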

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa

points = {
    'good': (1.6, 1.1, 0.9),
    'great': (1.7, 1.0, 1.1),
    'bad': (-1.6, -1.1, -0.9),
    'awful': (-1.7, -0.9, -1.1),
    'movie': (0.2, 0.4, 0.15),
    'film': (0.25, 0.45, 0.25),
}
colors = {'good':'tab:green','great':'tab:green','bad':'tab:red','awful':'tab:red','movie':'tab:blue','film':'tab:blue'}

fig = plt.figure(figsize=(5.2, 3.6))
ax = fig.add_subplot(111, projection='3d')
for w,(x,y,z) in points.items():
    ax.scatter([x],[y],[z], c=colors[w], s=50)
    dz = 0.10 if w in {'good','awful','movie'} else -0.12
    va = 'bottom' if dz>0 else 'top'
    ax.text(x, y, z+dz, w, ha='center', va=va)
ax.set_box_aspect((1.2,1.0,0.9))
ax.view_init(elev=24, azim=-35)
ax.grid(True, alpha=0.25)
plt.show()
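The visual impression can be backed by numbers. The sketch below computes cosine similarities between the same toy points; the helper `cosine` is a hypothetical name introduced here, and cosine similarity is one possible choice of measure, not something the chapter prescribes.

```python
import numpy as np

# Same hand-picked toy coordinates as in the plot above
points = {
    'good': (1.6, 1.1, 0.9),
    'great': (1.7, 1.0, 1.1),
    'bad': (-1.6, -1.1, -0.9),
    'awful': (-1.7, -0.9, -1.1),
    'movie': (0.2, 0.4, 0.15),
    'film': (0.25, 0.45, 0.25),
}

def cosine(a, b) -> float:
    """Hypothetical helper: cosine of the angle between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_good_great = cosine(points['good'], points['great'])  # same cluster
sim_good_bad = cosine(points['good'], points['bad'])      # opposite clusters
sim_movie_film = cosine(points['movie'], points['film'])  # neutral nouns

print(f"good/great: {sim_good_great:.3f}")
print(f"good/bad:   {sim_good_bad:.3f}")
print(f"movie/film: {sim_movie_film:.3f}")
```

Words in the same cluster score close to 1, while `good` and `bad` point in opposite directions and score close to -1.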
