# Machine Translation with Attention RNN (PyTorch)

This notebook replicates the simple **machine translation** pipeline you built in Rust (tch-rs), but in **PyTorch (Python)**:

1. **Data loader**: reads a two-column CSV `(source, target)`.
2. **Dual tokenizer** (separate source and target, simple word-level) plus special tokens `PAD=0, SOS=1, EOS=2, UNK=3`.
3. **Model**: bi-LSTM encoder + LSTM decoder + **Bahdanau attention** + output projection onto the **target** vocabulary.
4. **Training** with teacher forcing and **CrossEntropy(ignore_index=PAD)**.
5. **Evaluation**: mini BLEU and ROUGE-1, plus **greedy decoding** on sample pairs.

> Note: this notebook **does not need internet access**. Make sure `torch`, `pandas`, and `numpy` are available in your environment.

In [2]:
# %% [markdown]
# ## 0) Setup & Utilities
import math, os, random, time, gc
from collections import Counter
from typing import List, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

try:
    import pandas as pd
except ImportError as e:
    raise RuntimeError("pandas is required to read the CSV. Install it with: pip install pandas") from e

SEED = 42  # applied to every RNG by set_seed() below

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✓ Device: {device}")

def set_seed(seed=42):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
set_seed(SEED)

✓ Device: cpu


In [None]:
# %% [markdown]
# ## 1) Load Data (two-column CSV: source, target)
from pathlib import Path

def robust_read_csv(path):
    """Robust CSV reader: try progressively more lenient parsers before giving up."""
    try:
        return pd.read_csv(path)
    except Exception:
        pass
    try:
        return pd.read_csv(path, engine='python')
    except Exception:
        pass
    # Last resort: skip malformed rows and let any remaining error propagate
    return pd.read_csv(path, on_bad_lines='skip', engine='python')

CSV_CANDIDATES = [
    'Translate500.csv',
    '../Translate500.csv',
]

df = None
picked = None
for p in CSV_CANDIDATES:
    if Path(p).exists():
        try:
            tmp = robust_read_csv(p)
            if tmp.shape[1] >= 2:
                df = tmp
                picked = p
                break
        except Exception as e:
            print(f"Failed to read {p}: {e}")

if df is None:
    print("CSV not found. Building a small toy dataset...")
    df = pd.DataFrame({
        'source': [
            'hello how are you',
            'what is your name',
            'good morning',
            'see you later',
            'thank you very much',
            'i love machine learning',
            'this model uses attention',
        ],
        'target': [
            'halo apa kabar',
            'siapa namamu',
            'selamat pagi',
            'sampai jumpa',
            'terima kasih banyak',
            'aku suka pembelajaran mesin',
            'model ini memakai atensi',
        ]
    })
else:
    print(f"✓ Using CSV: {picked}")

# Normalize column names and apply light cleaning
if 'source' not in df.columns or 'target' not in df.columns:
    # Fall back to the first two columns when the header does not match
    df = df.iloc[:, :2].copy()
    df.columns = ['source', 'target']
df = df[['source', 'target']].dropna()
df['source'] = df['source'].astype(str).str.strip()
df['target'] = df['target'].astype(str).str.strip()
df = df[(df['source']!='') & (df['target']!='')]
df = df.drop_duplicates()
print(df.head())
print("Total pairs:", len(df))

# Train/Val split
ratio = 0.9
idx = list(range(len(df)))
random.shuffle(idx)
cut = int(len(idx)*ratio)
train_df = df.iloc[idx[:cut]].reset_index(drop=True)
val_df   = df.iloc[idx[cut:]].reset_index(drop=True)
print(f"Train: {len(train_df)} | Val: {len(val_df)}")

✓ Using CSV: Translate500.csv
                    source                    target
0              shop, store               toko, kedai
1                    price                     harga
2  to be near, to be close                     dekat
3  sometimes, occasionally  kadang-kadang, terkadang
4           to be possible              memungkinkan
Total pairs: 492
Train: 442 | Val: 50
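
The split sizes printed above follow directly from `int(len(idx) * ratio)`. As a minimal sketch, the arithmetic can be checked without the CSV. `split_sizes` is a hypothetical helper introduced only for illustration, and the batch size of 32 is the value set later in the notebook.

```python
import math

def split_sizes(n_pairs: int, ratio: float = 0.9):
    """Return (train, val) sizes using the same truncating cut as the notebook."""
    cut = int(n_pairs * ratio)   # int() truncates toward zero
    return cut, n_pairs - cut

train_n, val_n = split_sizes(492)
print(train_n, val_n)            # 442 50

# Batches per epoch with drop_last=False: the last partial batch is kept.
batch_size = 32
print(math.ceil(train_n / batch_size), math.ceil(val_n / batch_size))  # 14 2
```

Because the cut truncates, the validation set gets the leftover pair whenever `n_pairs * ratio` is not an integer.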


In [4]:
# %% [markdown]
# ## 2) Dual Tokenizer (simple word-level)
PAD, SOS, EOS, UNK = 0, 1, 2, 3

class SimpleWordTokenizer:
    def __init__(self, max_vocab=10000):
        self.max_vocab = max_vocab
        self.stoi = {"<pad>":PAD, "<sos>":SOS, "<eos>":EOS, "<unk>":UNK}
        self.itos = {v:k for k,v in self.stoi.items()}

    def fit(self, texts: List[str]):
        counter = Counter()
        for t in texts:
            counter.update(t.split())
        # reserve slots for the 4 special tokens
        for tok, _ in counter.most_common(self.max_vocab - len(self.stoi)):
            if tok not in self.stoi:
                idx = len(self.stoi)
                self.stoi[tok] = idx
                self.itos[idx] = tok

    def vocab_size(self):
        return len(self.stoi)

    def encode(self, text: str) -> List[int]:
        return [self.stoi.get(tok, UNK) for tok in text.split()]

    def encode_with_special(self, text: str, max_len: int, add_sos=False, add_eos=False) -> List[int]:
        ids = self.encode(text)
        # Truncate the content first so that SOS/EOS are never cut off
        ids = ids[:max(max_len - int(add_sos) - int(add_eos), 0)]
        if add_sos:
            ids = [SOS] + ids
        if add_eos:
            ids = ids + [EOS]
        # Right-pad with PAD up to max_len
        ids = ids + [PAD] * (max_len - len(ids))
        return ids

    def decode(self, ids: List[int]) -> str:
        toks = []
        for i in ids:
            if i == EOS:
                break
            if i in (PAD, SOS):
                continue
            toks.append(self.itos.get(int(i), '<unk>'))
        return ' '.join(toks)

class DualTokenizer:
    def __init__(self, src_max_vocab=10000, tgt_max_vocab=10000):
        self.source = SimpleWordTokenizer(src_max_vocab)
        self.target = SimpleWordTokenizer(tgt_max_vocab)

    def fit(self, src_texts: List[str], tgt_texts: List[str]):
        self.source.fit(src_texts)
        self.target.fit(tgt_texts)

dual_tok = DualTokenizer(10000, 10000)
dual_tok.fit(train_df['source'].tolist(), train_df['target'].tolist())
srcV = dual_tok.source.vocab_size()
tgtV = dual_tok.target.vocab_size()
print(f"Source Vocab: {srcV} | Target Vocab: {tgtV}")

Source Vocab: 662 | Target Vocab: 599
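
To make the role of the special tokens concrete, here is a minimal sketch of how one target sentence becomes the decoder input and the expected output under teacher forcing. The toy vocabulary and the helper names `encode` and `pad_to` are assumptions for illustration; the real vocabulary is fitted on `train_df`.

```python
PAD, SOS, EOS, UNK = 0, 1, 2, 3

# Toy target vocabulary (hypothetical; the real one is fitted on train_df).
stoi = {"<pad>": PAD, "<sos>": SOS, "<eos>": EOS, "<unk>": UNK,
        "terima": 4, "kasih": 5, "banyak": 6}

def encode(text):
    # Whitespace split; unseen words fall back to UNK.
    return [stoi.get(tok, UNK) for tok in text.split()]

def pad_to(ids, max_len):
    # Right-pad with PAD up to max_len.
    return ids + [PAD] * (max_len - len(ids))

ids = encode("terima kasih banyak sekali")   # "sekali" is out of vocabulary
tgt_in = pad_to([SOS] + ids, 6)              # decoder input
tgt_out = pad_to(ids + [EOS], 6)             # expected output, shifted by one

print(tgt_in)    # [1, 4, 5, 6, 3, 0]
print(tgt_out)   # [4, 5, 6, 3, 2, 0]
```

At every position `t`, `tgt_out[t]` is the token the decoder should predict after reading `tgt_in[t]`. Note that splitting on whitespace keeps punctuation attached, so a token such as `shop,` is a different vocabulary entry from `shop`.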


In [5]:
# %% [markdown]
# ## 3) Dataset & DataLoader
class TranslationDataset(Dataset):
    def __init__(self, df, dual_tok: DualTokenizer, max_src_len=64, max_tgt_len=64):
        self.df = df.reset_index(drop=True)
        self.tok = dual_tok
        self.max_src_len = max_src_len
        self.max_tgt_len = max_tgt_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        src = str(self.df.loc[idx, 'source'])
        tgt = str(self.df.loc[idx, 'target'])

        src_ids = self.tok.source.encode_with_special(src, self.max_src_len, add_sos=False, add_eos=True)
        tgt_in  = self.tok.target.encode_with_special(tgt, self.max_tgt_len, add_sos=True,  add_eos=False)
        tgt_out = self.tok.target.encode_with_special(tgt, self.max_tgt_len, add_sos=False, add_eos=True)

        # Boolean mask: True for every non-PAD token
        src_mask = [1 if t != PAD else 0 for t in src_ids]

        return (
            torch.tensor(src_ids, dtype=torch.long),
            torch.tensor(tgt_in,  dtype=torch.long),
            torch.tensor(tgt_out, dtype=torch.long),
            torch.tensor(src_mask, dtype=torch.bool),
        )

BATCH_SIZE = 32
MAX_SRC_LEN = 64
MAX_TGT_LEN = 64

train_ds = TranslationDataset(train_df, dual_tok, MAX_SRC_LEN, MAX_TGT_LEN)
val_ds   = TranslationDataset(val_df,   dual_tok, MAX_SRC_LEN, MAX_TGT_LEN)

train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=False)
val_dl   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, drop_last=False)
len(train_dl), len(val_dl)

(14, 2)
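
The `src_mask` returned by the dataset exists so that attention can ignore padded source positions. The sketch below shows the idea in plain Python rather than PyTorch so that it runs anywhere; `masked_softmax` and the toy numbers are assumptions for illustration, not the notebook's implementation.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving masked-out (False) positions zero weight."""
    neg = -1e30
    masked = [s if keep else neg for s, keep in zip(scores, mask)]
    top = max(masked)                        # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# One source sentence with 2 real tokens (word + EOS), padded to length 4.
scores = [2.0, 1.0, 5.0, 5.0]                # raw scores; padded slots happen to be large
mask = [True, True, False, False]            # same convention as src_mask
weights = masked_softmax(scores, mask)

# Context vector = attention-weighted sum of the per-position encoder vectors.
keys = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 9.0]]
context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(2)]

print([round(w, 3) for w in weights])        # [0.731, 0.269, 0.0, 0.0]
```

Even though the padded positions carry the largest raw scores, they contribute nothing to the context vector once the mask is applied.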

In [6]:
# %% [markdown]
# ## 4) Bahdanau Attention & Model (Encoder bi-LSTM, Decoder LSTM + Attention)
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim2):
        super().__init__()
        self.wq = nn.Linear(hidden_dim2, hidden_dim2, bias=False)
        self.wk = nn.Linear(hidden_dim2, hidden_dim2, bias=False)
        self.v  = nn.Linear(hidden_dim2, 1, bias=False)

    def forward(self, query, keys, mask=None):
        # query: [B, 2H], keys: [B, T, 2H], mask: [B, T]
        q = self.wq(query).unsqueeze(1)          # [B,1,2H]
        k = self.wk(keys)                        # [B,T,2H]
        e = self.v(torch.tanh(q + k)).squeeze(-1) # [B,T]
        if mask is not None:
            # Push padded positions to a very negative score so softmax gives them ~0 weight
            e = e.masked_fill(~mask.bool(), -1e30)
        attn = torch.softmax(e, dim=-1)          # [B,T]
        context = torch.bmm(attn.unsqueeze(1), keys).squeeze(1)  # [B,2H]
        return context, attn

class AttentionRNN(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512, num_layers=2, dropout=0.3):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim, padding_idx=PAD)

        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True, bidirectional=True, dropout=dropout)
        # decoder input: [emb_dim + 2*hidden_dim]
        self.decoder = nn.LSTM(emb_dim + 2*hidden_dim, hidden_dim, num_layers=num_layers, batch_first=True, dropout=dropout)

        self.attn = BahdanauAttention(hidden_dim * 2)
        self.query_proj = nn.Linear(hidden_dim, hidden_dim * 2, bias=False)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def encode(self, src_ids):
        # src_ids: [B,T]
        x = self.src_emb(src_ids)                     # [B,T,E]
        enc_out, (h, c) = self.encoder(x)             # enc_out: [B,T,2H], h: [2L,B,H]
        # Zero-initialize the decoder state (matching the Rust implementation);
        # the encoder's final (h, c) are not passed to the decoder
        B = src_ids.size(0)
        L = self.decoder.num_layers
        H = self.decoder.hidden_size
        device = src_ids.device
        h0 = torch.zeros(L, B, H, device=device)
        c0 = torch.zeros(L, B, H, device=device)
        return enc_out, (h0, c0)

    def decode_step(self, y_prev, state, enc_out, mask):
        # y_prev: [B,1], state: (h,c) [L,B,H], enc_out: [B,T,2H], mask: [B,T]
        emb = self.tgt_emb(y_prev)                    # [B,1,E]
        h, c = state                                  # h: [L,B,H]
        query = h[-1]                                 # [B,H]
        q2 = self.query_proj(query)                   # [B,2H]
        context, attn = self.attn(q2, enc_out, mask)  # [B,2H], [B,T]
        dec_in = torch.cat([emb, context.unsqueeze(1)], dim=-1)  # [B,1,E+2H]
        out, (h2, c2) = self.decoder(dec_in, (h, c))            # out: [B,1,H]
        logits = self.out(out.squeeze(1))                        # [B,V]
        return logits, (h2, c2), attn

    def forward(self, src_ids, tgt_in, mask):
        # Teacher forcing full pass → logits [B,T,V]
        enc_out, state = self.encode(src_ids)
        T = tgt_in.size(1)
        logits_list = []
        for t in range(T):
            y_prev = tgt_in[:, t:t+1]            # [B,1]
            logits, state, _ = self.decode_step(y_prev, state, enc_out, mask)
            logits_list.append(logits.unsqueeze(1))
        return torch.cat(logits_list, dim=1)      # [B,T,V]

    @torch.no_grad()
    def generate(self, src_text, tok: DualTokenizer, max_len=64, device='cpu'):
        self.eval()
        src_ids = tok.source.encode_with_special(src_text, max_len, add_sos=False, add_eos=True)
        mask = [1 if t != PAD else 0 for t in src_ids]
        src = torch.tensor([src_ids], dtype=torch.long, device=device)
        msk = torch.tensor([mask],    dtype=torch.bool, device=device)
        enc_out, state = self.encode(src)
        cur = torch.tensor([[SOS]], dtype=torch.long, device=device)
        outs = []
        for _ in range(max_len):
            logits, state, _ = self.decode_step(cur, state, enc_out, msk)
            cur = torch.argmax(logits, dim=-1, keepdim=True)  # greedy
            token = cur.item()
            if token == EOS:
                break
            outs.append(token)
        return tok.target.decode(outs)

emb_dim   = 256
hidden_dim= 512
num_layers= 2
dropout   = 0.3

model = AttentionRNN(srcV, tgtV, emb_dim, hidden_dim, num_layers, dropout).to(device)
sum(p.numel() for p in model.parameters() if p.requires_grad)

18481495
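
The parameter count above can be reproduced by hand, which is a useful sanity check on the layer shapes. The sketch below assumes the standard `torch.nn.LSTM` layout (four gates, separate input and recurrent weights, two bias vectors per layer and direction); `lstm_params` is a hypothetical helper, and the vocabulary sizes are the ones printed earlier.

```python
def lstm_params(input_dim, hidden, layers, bidirectional=False):
    """Trainable parameters of a stacked LSTM laid out like torch.nn.LSTM."""
    dirs = 2 if bidirectional else 1
    total = 0
    for layer in range(layers):
        in_dim = input_dim if layer == 0 else hidden * dirs
        # 4 gates: input weights + recurrent weights + two bias vectors
        per_dir = 4 * hidden * (in_dim + hidden) + 2 * 4 * hidden
        total += dirs * per_dir
    return total

src_v, tgt_v, emb, hid, layers = 662, 599, 256, 512, 2

parts = {
    "src_emb": src_v * emb,
    "tgt_emb": tgt_v * emb,
    "encoder": lstm_params(emb, hid, layers, bidirectional=True),
    "decoder": lstm_params(emb + 2 * hid, hid, layers),
    "attention": 2 * (2 * hid) ** 2 + 2 * hid,   # wq, wk, v (no biases)
    "query_proj": hid * 2 * hid,                  # no bias
    "out": hid * tgt_v + tgt_v,                   # weight + bias
}
print(sum(parts.values()))   # 18481495
```

Roughly half of the parameters sit in the bidirectional encoder, which is large relative to a corpus of a few hundred short pairs.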

In [7]:
# %% [markdown]
# ## 5) Training Utilities (Loss, Loop, Eval)
def train_one_epoch(model, dl, optimizer, pad_id=PAD, max_norm=1.0):
    model.train()
    total = 0.0
    n = 0
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    for batch in dl:
        src, tgt_in, tgt_out, mask = [x.to(device) for x in batch]
        optimizer.zero_grad(set_to_none=True)
        logits = model(src, tgt_in, mask)            # [B,T,V]
        B,T,V = logits.shape
        loss = criterion(logits.view(B*T, V), tgt_out.view(B*T))
        loss.backward()
        if max_norm is not None:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        total += loss.item()
        n += 1
    return total / max(n,1)

@torch.no_grad()
def evaluate(model, dl, pad_id=PAD):
    model.eval()
    total = 0.0
    n = 0
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    for batch in dl:
        src, tgt_in, tgt_out, mask = [x.to(device) for x in batch]
        logits = model(src, tgt_in, mask)            # [B,T,V]
        B,T,V = logits.shape
        loss = criterion(logits.view(B*T, V), tgt_out.view(B*T))
        total += loss.item()
        n += 1
    return total / max(n,1)

def bleu_score_simple(ref: str, hyp: str, max_n=4):
    # Simple sentence-level BLEU with a basic brevity penalty (indicative only)
    def ngrams(tokens, n):
        return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    ref_toks = ref.split()
    hyp_toks = hyp.split()
    if not ref_toks or not hyp_toks:
        return 0.0
    # Cap the n-gram order at the shorter sentence; otherwise a perfect one-word
    # match scores ~0 because it has no 2-, 3-, or 4-grams
    max_n = min(max_n, len(ref_toks), len(hyp_toks))
    precisions = []
    for n in range(1, max_n+1):
        ref_ngrams = Counter(ngrams(ref_toks, n))
        hyp_ngrams = Counter(ngrams(hyp_toks, n))
        overlap = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    # geometric mean
    score = 1.0
    for p in precisions:
        score *= max(p, 1e-9)
    score = score ** (1/len(precisions))
    # brevity penalty (simple)
    bp = math.exp(min(0, 1 - len(ref_toks) / max(len(hyp_toks),1)))
    return bp * score

def rouge1(ref: str, hyp: str):
    ref_t = ref.split(); hyp_t = hyp.split()
    ref_c = Counter(ref_t); hyp_c = Counter(hyp_t)
    overlap = sum(min(ref_c[w], hyp_c[w]) for w in ref_c)
    P = overlap / max(len(hyp_t), 1)
    R = overlap / max(len(ref_t), 1)
    F1 = 2*P*R / max(P+R, 1e-9)
    return P, R, F1
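
As a worked example of the ROUGE-1 arithmetic, the sketch below restates the same unigram-overlap logic as the `rouge1` helper above in a self-contained form and applies it to a made-up reference and hypothesis.

```python
from collections import Counter

def rouge1(ref, hyp):
    # Same unigram-overlap logic as the rouge1 helper in the notebook.
    ref_t, hyp_t = ref.split(), hyp.split()
    ref_c, hyp_c = Counter(ref_t), Counter(hyp_t)
    overlap = sum(min(ref_c[w], hyp_c[w]) for w in ref_c)
    p = overlap / max(len(hyp_t), 1)
    r = overlap / max(len(ref_t), 1)
    f1 = 2 * p * r / max(p + r, 1e-9)
    return p, r, f1

# Hypothesis covers 2 of the 3 reference words and adds nothing wrong.
p, r, f1 = rouge1("terima kasih banyak", "terima kasih")
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")   # P=1.000 R=0.667 F1=0.800

# An empty hypothesis scores zero on all three.
print(rouge1("kata", ""))                    # (0.0, 0.0, 0.0)
```

Precision is perfect because every generated word appears in the reference, while recall drops to 2/3 because one reference word is missing.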

In [8]:
# %% [markdown]
# ## 6) Train!
EPOCHS = 1  # smoke test only; increase as needed for usable translations
LR = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

for epoch in range(1, EPOCHS+1):
    t0 = time.time()
    tr_loss = train_one_epoch(model, train_dl, optimizer)
    va_loss = evaluate(model, val_dl)
    dt = time.time()-t0
    print(f"Epoch {epoch:02d} | train {tr_loss:.4f} | val {va_loss:.4f} | {dt:.1f}s")

Epoch 01 | train 5.1178 | val 5.0211 | 51.0s
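
One way to read these loss values is to compare them with the cross-entropy of a model that guesses uniformly over the target vocabulary, which is `ln(V)`. The sketch below does that arithmetic; the vocabulary size and the validation loss are copied from the outputs above, and the comparison itself is an interpretation aid rather than part of the original pipeline.

```python
import math

tgt_vocab = 599                      # target vocabulary size printed earlier
uniform_loss = math.log(tgt_vocab)   # cross-entropy of a uniform guess, in nats
val_loss = 5.0211                    # validation loss from the log line above
perplexity = math.exp(val_loss)

print(f"uniform baseline: {uniform_loss:.3f}")
print(f"val perplexity:   {perplexity:.1f}")
```

A validation loss below the uniform baseline of roughly 6.4 nats suggests the model has learned something about token frequencies after one epoch, but a perplexity on the order of 150 is still far from a usable translator.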


In [9]:
# %% [markdown]
# ## 7) Quick test: Greedy decoding + BLEU & ROUGE-1
model.eval()
SAMPLES = min(len(val_df), 5)
for i in random.sample(range(len(val_df)), SAMPLES):
    src = val_df.loc[i, 'source']
    ref = val_df.loc[i, 'target']
    hyp = model.generate(src, dual_tok, max_len=MAX_TGT_LEN, device=device)
    print("\n---")
    print("Source   :", src)
    print("Reference:", ref)
    print("Generated:", hyp)
    bleu = bleu_score_simple(ref, hyp)
    p,r,f1 = rouge1(ref, hyp)
    print(f"BLEU≈{bleu:.4f} | ROUGE-1 P={p:.3f} R={r:.3f} F1={f1:.3f}")


---
Source   : word
Reference: kata
Generated: 
BLEU≈0.0000 | ROUGE-1 P=0.000 R=0.000 F1=0.000

---
Source   : sometimes, occasionally
Reference: kadang-kadang, terkadang
Generated: 
BLEU≈0.0000 | ROUGE-1 P=0.000 R=0.000 F1=0.000

---
Source   : use | to use
Reference: kgunaan; mengunakan
Generated: 
BLEU≈0.0000 | ROUGE-1 P=0.000 R=0.000 F1=0.000

---
Source   : but, still
Reference: tetapi, masih
Generated: 
BLEU≈0.0000 | ROUGE-1 P=0.000 R=0.000 F1=0.000

---
Source   : fish
Reference: ikan
Generated: 
BLEU≈0.0000 | ROUGE-1 P=0.000 R=0.000 F1=0.000
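
One plausible reading of the empty `Generated:` lines above is that the under-trained model ranks `EOS` highest at the very first step, so the greedy loop stops before emitting anything. That is an assumption, not something the notebook verifies. The sketch below reproduces the mechanism with hypothetical scoring functions standing in for `decode_step`.

```python
SOS, EOS = 1, 2

def greedy_decode(step_fn, max_len=10):
    """Greedy loop: feed back the arg-max token until EOS or max_len."""
    cur, outs = SOS, []
    for _ in range(max_len):
        scores = step_fn(cur)                        # one score per token id
        cur = max(range(len(scores)), key=scores.__getitem__)
        if cur == EOS:
            break
        outs.append(cur)
    return outs

# Hypothetical under-trained model: EOS always has the highest score.
def undertrained(prev):
    return [0.0, 0.0, 3.0, 0.1, 1.0]

# Hypothetical trained model: emits token 4 after SOS, then EOS.
def trained(prev):
    return [0.0, 0.0, 0.5, 0.1, 2.0] if prev == SOS else [0.0, 0.0, 3.0, 0.1, 1.0]

print(greedy_decode(undertrained))   # []
print(greedy_decode(trained))        # [4]
```

Since every target sequence ends with `EOS`, it is the most frequent target token in a corpus of one- to three-word entries, which makes this failure mode likely early in training.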


In [10]:
# %% [markdown]
# ## 8) Save model & tokenizer
os.makedirs('artifacts', exist_ok=True)
torch.save(model.state_dict(), 'artifacts/mt_attention_rnn.pt')

import json
tok_art = {
    'src_stoi': dual_tok.source.stoi,
    'tgt_stoi': dual_tok.target.stoi,
    'special': {'PAD': PAD, 'SOS': SOS, 'EOS': EOS, 'UNK': UNK}
}
with open('artifacts/tokenizer.json', 'w', encoding='utf-8') as f:
    json.dump(tok_art, f, ensure_ascii=False, indent=2)

print("Saved to artifacts/")

Saved to artifacts/
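
Only the `stoi` mappings are written to `tokenizer.json`, so a consumer has to rebuild `itos` when loading. The sketch below round-trips a toy artifact with the same layout through a temporary file; the toy vocabulary and the standalone `decode` helper are assumptions for illustration.

```python
import json
import os
import tempfile

# Toy artifact with the same layout as artifacts/tokenizer.json.
tok_art = {
    "src_stoi": {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "fish": 4},
    "tgt_stoi": {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "ikan": 4},
    "special": {"PAD": 0, "SOS": 1, "EOS": 2, "UNK": 3},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tok_art, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Only stoi is stored, so itos has to be rebuilt by inverting it.
tgt_itos = {idx: tok for tok, idx in loaded["tgt_stoi"].items()}
eos = loaded["special"]["EOS"]

def decode(ids):
    out = []
    for i in ids:
        if i == eos:          # stop at the first EOS, as the notebook's decode does
            break
        out.append(tgt_itos.get(i, "<unk>"))
    return " ".join(out)

print(decode([4, 2, 0, 0]))   # ikan
```

To rebuild the model itself, the hyperparameters (`emb_dim`, `hidden_dim`, `num_layers`, `dropout`) must match the ones used at training time before calling `load_state_dict`, so it may be worth saving them alongside the tokenizer.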
