# Workshop: Build GPT from Scratch Using Thai Text
Train a Minimal GPT Model on Thai Ramayana: rammajana.txt

**Objectives:**
- Learn word-level preprocessing and tokenization for Thai
- Build a minimal GPT (Transformer) model in PyTorch
- Train on Thai text at the word level
- Generate new Thai text word-by-word

You need the file `rammajana.txt` in the same directory as this notebook.

## Step 1: Data Preparation (Word Tokenization)
Thai is written without spaces between words, so we use PyThaiNLP to segment the text into words.
- Build a vocabulary of words, map each word to an index.
- Encode all text as a sequence of word indices.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
try:
    from pythainlp.tokenize import word_tokenize
except ImportError as e:
    # Stop here: the cells below cannot run without the tokenizer
    raise ImportError('Please install PyThaiNLP: pip install pythainlp') from e

# Load Thai text
with open('rammajana.txt', encoding='utf-8') as f:
    text = f.read()

# Word-tokenize the text
words = word_tokenize(text)
print(f'Number of words: {len(words):,}')
print('First 20 words:', words[:20])

Number of words: 49,689
First 20 words: ['รามเกียรติ์', '\n', 'ความเป็นมา', 'ของ', 'เรื่อง', 'รามเกียรติ์', '\n', 'รามเกียรติ์', ' ', 'มี', 'ที่', 'มาจาก', 'เรื่อง', ' ', 'ราม', 'ยณะ', ' ', 'ที่', 'ฤาษี', 'วาล']


In [2]:
# Build word vocabulary
word_counts = Counter(words)
vocab = sorted(word_counts, key=lambda w: -word_counts[w])
vocab_size = len(vocab)
print(f'Vocabulary size: {vocab_size}')
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

# Encode the text as word indices
data = np.array([word2idx[w] for w in words], dtype=np.int32)
print('Encoded:', data[:20])
print('Decoded:', [idx2word[i] for i in data[:20]])

Vocabulary size: 4913
Encoded: [ 123    6 2765   19   43  123    6  123    0   16    9  199   43    0
  637 1931    0    9   61 1932]
Decoded: ['รามเกียรติ์', '\n', 'ความเป็นมา', 'ของ', 'เรื่อง', 'รามเกียรติ์', '\n', 'รามเกียรติ์', ' ', 'มี', 'ที่', 'มาจาก', 'เรื่อง', ' ', 'ราม', 'ยณะ', ' ', 'ที่', 'ฤาษี', 'วาล']
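
The vocabulary-and-encoding step above can be seen in miniature on a toy token list. The sketch below is an illustration, not part of the workshop code: the `<unk>` entry and the `encode`/`decode` helpers are assumptions added here so that words outside the vocabulary get a dedicated index instead of raising `KeyError`.

```python
from collections import Counter

# Toy token list standing in for the output of word_tokenize (illustrative only).
toy_words = ['พระราม', ' ', 'ไป', ' ', 'ป่า', ' ', 'พระราม']

# Most frequent first, as in the cell above; '<unk>' is a hypothetical extra entry.
counts = Counter(toy_words)
toy_vocab = ['<unk>'] + sorted(counts, key=lambda w: -counts[w])
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}
toy_idx2word = {i: w for w, i in toy_word2idx.items()}

def encode(tokens):
    """Map tokens to indices; unseen tokens fall back to '<unk>' (index 0)."""
    return [toy_word2idx.get(t, 0) for t in tokens]

def decode(indices):
    """Map indices back to tokens."""
    return [toy_idx2word[i] for i in indices]

ids = encode(['พระราม', ' ', 'ลงกา'])   # 'ลงกา' is not in the toy vocabulary
print(ids)
print(decode(ids))
```

In the workshop vocabulary there is no such reserved entry: index 0 is the space token, because it is the most frequent one.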


## Step 2: Build a Mini Word-level GPT Model
We use a simple Transformer encoder as our GPT backbone.
- Embedding for words and positions
- Transformer layers
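- Linear layer to output a score (logit) for each vocabulary word
- No causal masking, for simplicity. Unlike a real GPT, each position can therefore attend to later words, including its own target, so the training loss looks better than the model's true next-word accuracy.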

In [3]:
class WordMiniGPT(nn.Module):
    def __init__(self, vocab_size, n_embd=128, n_head=4, n_layer=2, block_size=16):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.drop = nn.Dropout(0.1)
        encoder_layer = nn.TransformerEncoderLayer(d_model=n_embd, nhead=n_head)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layer)
        self.fc_out = nn.Linear(n_embd, vocab_size)
    def forward(self, x):
        B, T = x.size()
        tok_emb = self.token_emb(x)
        pos = torch.arange(T, device=x.device).unsqueeze(0)
        pos_emb = self.pos_emb(pos)
        h = self.drop(tok_emb + pos_emb)
        h = h.permute(1, 0, 2)  # (seq_len, batch, emb)
        h = self.transformer(h)
        h = h.permute(1, 0, 2)
        logits = self.fc_out(h)
        return logits
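
Because the model above applies no mask, every position can attend to the words that follow it. GPT-style models prevent this with a causal (upper-triangular) attention mask. The sketch below is one possible variant, not the workshop's model: the class name `CausalWordMiniGPT` is hypothetical, and it assumes a PyTorch version in which `nn.TransformerEncoderLayer` accepts `batch_first=True` (1.9 or later).

```python
import torch
import torch.nn as nn

class CausalWordMiniGPT(nn.Module):
    """Hypothetical masked variant of the workshop model (illustrative sketch)."""
    def __init__(self, vocab_size, n_embd=128, n_head=4, n_layer=2, block_size=16):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.drop = nn.Dropout(0.1)
        layer = nn.TransformerEncoderLayer(d_model=n_embd, nhead=n_head, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.fc_out = nn.Linear(n_embd, vocab_size)

    def forward(self, x):
        B, T = x.size()
        pos = torch.arange(T, device=x.device).unsqueeze(0)
        h = self.drop(self.token_emb(x) + self.pos_emb(pos))
        # Boolean mask: True marks positions that may NOT be attended to (the future).
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.transformer(h, mask=causal_mask)
        return self.fc_out(h)

# Check: changing the last token must not change the logits at earlier positions.
demo_model = CausalWordMiniGPT(vocab_size=100).eval()
x_a = torch.randint(0, 100, (1, 8))
x_b = x_a.clone()
x_b[0, -1] = (x_b[0, -1] + 1) % 100
with torch.no_grad():
    out_a = demo_model(x_a)
    out_b = demo_model(x_b)
print(torch.allclose(out_a[0, :-1], out_b[0, :-1], atol=1e-5))
```

With a mask like this, the logits at position `t` depend only on words `0..t`, so the shifted-target loss measures genuine next-word prediction.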

## Step 3: Prepare Training Batches (Word sequences)
- Each batch is a set of random windows (sequences) of `block_size` words.
- Targets are the same windows shifted one word ahead, so position `t` is trained to predict word `t+1`.

In [4]:
block_size = 16  # sequence length
def get_batch(data, batch_size=32):
    ix = np.random.randint(0, len(data) - block_size - 1, (batch_size,))
    x = np.stack([data[i:i+block_size] for i in ix])
    y = np.stack([data[i+1:i+block_size+1] for i in ix])
    return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

# Example batch
x, y = get_batch(data)
print('Batch:', x.shape, y.shape)
print('First sequence:', [idx2word[i.item()] for i in x[0]])

Batch: torch.Size([32, 16]) torch.Size([32, 16])
First sequence: ['วัน', ' ', 'จึง', 'จะ', 'เสร็จ', 'พิธี', ' ', 'ระหว่าง', 'ที่', 'อิน', 'ท', 'ชิต', 'ไป', 'ทำพิธี', 'ชุบ', 'ศร']
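
To make the input/target relationship concrete, the toy example below applies the same windowing to a short integer array. It is a sketch with made-up data: `toy_data`, `toy_block_size`, and the helper `window_pair` are hypothetical names introduced for illustration.

```python
import numpy as np

toy_data = np.arange(10, 20)   # stand-in for the encoded word indices
toy_block_size = 4

def window_pair(data, start, block_size):
    """Return (input, target) windows; the target is the input shifted by one."""
    x = data[start:start + block_size]
    y = data[start + 1:start + block_size + 1]
    return x, y

x_toy, y_toy = window_pair(toy_data, start=2, block_size=toy_block_size)
print('x:', x_toy)   # x: [12 13 14 15]
print('y:', y_toy)   # y: [13 14 15 16]
```

Every element of `y_toy` except the last also appears in `x_toy`, one position later; only the final target lies outside the input window.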


## Step 4: Training the Word-level GPT
- The loop is set to run 50,000 steps and prints the loss every 100 steps; the log below covers the first 4,300.

In [8]:
n_embd, n_head, n_layer = 128, 4, 2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = WordMiniGPT(vocab_size, n_embd, n_head, n_layer, block_size).to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training loop: one random batch per step, loss logged every 100 steps
for step in range(50000):
    x, y = get_batch(data, batch_size=32)
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    logits = model(x)
    loss = loss_fn(logits.view(-1, vocab_size), y.view(-1))
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}: loss {loss.item():.4f}")



Step 0: loss 8.6596
Step 100: loss 5.5689
Step 200: loss 3.0961
Step 300: loss 2.0265
Step 400: loss 1.5017
Step 500: loss 1.0633
Step 600: loss 0.9228
Step 700: loss 0.7065
Step 800: loss 0.7099
Step 900: loss 0.6685
Step 1000: loss 0.5328
Step 1100: loss 0.4799
Step 1200: loss 0.4502
Step 1300: loss 0.4091
Step 1400: loss 0.4800
Step 1500: loss 0.4294
Step 1600: loss 0.4189
Step 1700: loss 0.3899
Step 1800: loss 0.3573
Step 1900: loss 0.4289
Step 2000: loss 0.4078
Step 2100: loss 0.3676
Step 2200: loss 0.3375
Step 2300: loss 0.3648
Step 2400: loss 0.3517
Step 2500: loss 0.3449
Step 2600: loss 0.3429
Step 2700: loss 0.3305
Step 2800: loss 0.3887
Step 2900: loss 0.3441
Step 3000: loss 0.3526
Step 3100: loss 0.3171
Step 3200: loss 0.3667
Step 3300: loss 0.2930
Step 3400: loss 0.3836
Step 3500: loss 0.3370
Step 3600: loss 0.2848
Step 3700: loss 0.3489
Step 3800: loss 0.3382
Step 3900: loss 0.2666
Step 4000: loss 0.2384
Step 4100: loss 0.3590
Step 4200: loss 0.3245
Step 4300: loss 0.2962
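
Two quick sanity checks help when reading a loss log like the one above. A model that guesses uniformly over the vocabulary has cross-entropy ln(vocab_size), so the first logged loss should be in that neighbourhood, and exp(loss) gives the perplexity. The sketch below uses the vocabulary size and two loss values printed earlier; the helper name `perplexity` is introduced here for illustration.

```python
import math

vocab_size = 4913   # from the vocabulary cell's output

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)

uniform_loss = math.log(vocab_size)
print(f'Uniform-guess loss: {uniform_loss:.4f}')
print(f'Perplexity at step 0 (loss 8.6596): {perplexity(8.6596):.0f}')
print(f'Perplexity at step 4000 (loss 0.2384): {perplexity(0.2384):.2f}')
```

These figures describe the training batches only; a held-out split would be needed to tell whether a low loss reflects generalization.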


## Step 5: Generate Thai Text (Word-by-word)
- Start from a prompt, generate next words one at a time.

In [9]:
def generate_words(model, prompt, length=40, temperature=1.0):
    model.eval()  # disable dropout during generation
    words_list = word_tokenize(prompt)
    # Unknown words fall back to index 0 (the space token in this vocabulary)
    idxs = [word2idx.get(w, 0) for w in words_list]
    idxs = idxs[-block_size:]
    context = torch.tensor([idxs], dtype=torch.long, device=device)
    generated = list(words_list)
    with torch.no_grad():  # no gradients are needed for sampling
        for _ in range(length):
            if context.size(1) < block_size:
                # Left-pad short contexts with index 0 up to block_size
                pad = torch.zeros((1, block_size - context.size(1)), dtype=torch.long, device=device)
                inp = torch.cat([pad, context], dim=1)
            else:
                inp = context[:, -block_size:]
            logits = model(inp)
            # Scale the last position's logits by temperature, then sample
            next_logits = logits[0, -1, :] / temperature
            probs = torch.softmax(next_logits, dim=-1).cpu().numpy()
            next_idx = int(np.random.choice(vocab_size, p=probs))
            generated.append(idx2word[next_idx])
            context = torch.cat([context, torch.tensor([[next_idx]], device=device)], dim=1)
    # Joined with spaces so word boundaries are visible; ''.join(generated) gives unspaced Thai
    return ' '.join(generated)

In [11]:
# Example: Generate Thai text from a word prompt
prompt = 'พระราม'
print(generate_words(model, prompt, length=50))

พระราม   สุครีพ ถือ   นารายณ์   ท่าน   แต่ ต้อง   ตัด อัศ กรรณ ได้ ฟัง จึง ขึ้น อีกครั้ง จะ ติดมือ มา จะ ไม่ สนุก   ดัง มา มัน หมด   ให้ มัน   ระหว่าง ได้ ฟัง จึง ขอให้ พบ กับ พระราม กลับ ยัง ไม่ ยอม เปิด ผอบ   จึง
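
The `temperature` argument of `generate_words` divides the logits before the softmax. The sketch below, on made-up logits, shows the effect: values below 1 concentrate probability on the top word, and values above 1 flatten the distribution. The `top_k` filter is an extra that the workshop code does not use, and the function name `sample_probs` is hypothetical.

```python
import numpy as np

def sample_probs(logits, temperature=1.0, top_k=None):
    """Softmax over temperature-scaled logits, optionally keeping only the top_k entries."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]            # k-th largest scaled logit
        z = np.where(z >= cutoff, z, -np.inf)  # drop everything below it
    z = z - z.max()                            # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

toy_logits = [2.0, 1.0, 0.5, -1.0]             # made-up scores for a 4-word vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, np.round(sample_probs(toy_logits, temperature=t), 3))
print('top-2:', np.round(sample_probs(toy_logits, top_k=2), 3))

# Draw one index from the filtered distribution.
rng = np.random.default_rng(0)
next_idx = int(rng.choice(len(toy_logits), p=sample_probs(toy_logits, temperature=0.8, top_k=2)))
```

Lowering the temperature or restricting sampling to the top few words trades variety for coherence, which can be useful when the raw samples wander as in the output above.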
