# Seq2Seq LSTM with Attention and Beam Search for English-Vietnamese Translation

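This notebook implements a seq2seq model with a bidirectional LSTM encoder, an LSTM decoder, and Bahdanau attention for English-Vietnamese translation. Text is split into subwords with the `bert-base-multilingual-cased` tokenizer, which uses WordPiece rather than BPE. Inference uses beam search (beam_width=5) instead of greedy decoding. The model is kept small (embedding size 128, hidden size 256, one layer) so that it trains comfortably on an RTX 3090 (24GB VRAM). A sentence pair is kept only if both the English and the Vietnamese side have fewer than 50 subword tokens.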

## Steps:
1. Load and preprocess data with subword (WordPiece) tokenization, filter both languages by length.
2. Define model architecture (LSTM + Bahdanau Attention).
3. Set up training (configs, optimizer, loss, BLEU metric).
4. Train the model.
5. Inference with Beam Search and evaluation.

In [5]:
%pip install torch transformers datasets evaluate sacrebleu numpy tqdm

Note: you may need to restart the kernel to use updated packages.


In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
import numpy as np
import evaluate
from tqdm import tqdm

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cuda


## 1. Data Preprocessing
Load the raw dataset, apply subword tokenization, keep only pairs where both the English and the Vietnamese sentence have fewer than 50 tokens, and create the DataLoaders.

In [7]:
# Paths to raw dataset
train_en = 'detokenization/train/train.en'
train_vi = 'detokenization/train/train.vi'
dev_en = 'detokenization/dev/dev.en'
dev_vi = 'detokenization/dev/dev.vi'
test_en = 'detokenization/test/test.en'
test_vi = 'detokenization/test/test.vi'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# Read dataset
def read_data(en_path, vi_path):
    with open(en_path, 'r', encoding='utf-8') as f:
        en_data = f.readlines()
    with open(vi_path, 'r', encoding='utf-8') as f:
        vi_data = f.readlines()
    return [(en.strip(), vi.strip()) for en, vi in zip(en_data, vi_data)]

train_data = read_data(train_en, train_vi)
dev_data = read_data(dev_en, dev_vi)
test_data = read_data(test_en, test_vi)

# Filter by subword length (< 50 tokens on both sides), then cap the size of each split.
# Note: tokenizer.tokenize() does not count [CLS]/[SEP], but the Dataset's max_length=50 does,
# so a sentence with exactly 49 subwords passes this filter and loses one token to truncation later.
max_len = 50
train_data = [(en, vi) for en, vi in train_data if len(tokenizer.tokenize(en)) < max_len and len(tokenizer.tokenize(vi)) < max_len][:200000]  # 200,000 pairs
dev_data = [(en, vi) for en, vi in dev_data if len(tokenizer.tokenize(en)) < max_len and len(tokenizer.tokenize(vi)) < max_len][:10000]  # 10,000 for validation
test_data = [(en, vi) for en, vi in test_data if len(tokenizer.tokenize(en)) < max_len and len(tokenizer.tokenize(vi)) < max_len][:5000]  # 5,000 for testing

# Dataset class
class TranslationDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=50):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        en, vi = self.data[idx]
        en_tokens = self.tokenizer(en, max_length=self.max_len, padding='max_length', truncation=True, return_tensors='pt')
        vi_tokens = self.tokenizer(vi, max_length=self.max_len, padding='max_length', truncation=True, return_tensors='pt')
        return en_tokens['input_ids'].squeeze(0), vi_tokens['input_ids'].squeeze(0)

# DataLoader
def collate_fn(batch):
    en_batch, vi_batch = zip(*batch)
    en_batch = torch.stack(en_batch)
    vi_batch = torch.stack(vi_batch)
    return en_batch, vi_batch

train_dataset = TranslationDataset(train_data, tokenizer, max_len)
dev_dataset = TranslationDataset(dev_data, tokenizer, max_len)
test_dataset = TranslationDataset(test_data, tokenizer, max_len)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
dev_loader = DataLoader(dev_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

Token indices sequence length is longer than the specified maximum sequence length for this model (571 > 512). Running this sequence through the model will result in indexing errors
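
The warning above most likely comes from the length filter: `tokenizer.tokenize` is called on a few very long raw sentences before they are discarded, so it is harmless here. A related detail is that the filter counts subwords without `[CLS]` and `[SEP]`, while the dataset pads and truncates to a `max_length` that includes them. A minimal sketch of a filter that reserves room for the special tokens and stops early once enough pairs are collected is shown below. The names `filter_pairs`, `num_special`, and `limit` are hypothetical, and `str.split` stands in for the real tokenizer so the example runs without downloading a model.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]

def filter_pairs(
    pairs: List[Pair],
    tokenize: Callable[[str], List[str]],
    max_len: int = 50,
    num_special: int = 2,          # assumption: one [CLS] and one [SEP] per sentence
    limit: Optional[int] = None,   # stop after this many kept pairs
) -> List[Pair]:
    """Keep pairs whose two sides both fit in max_len once special tokens are added."""
    budget = max_len - num_special
    kept: List[Pair] = []
    for en, vi in pairs:
        if len(tokenize(en)) <= budget and len(tokenize(vi)) <= budget:
            kept.append((en, vi))
            if limit is not None and len(kept) >= limit:
                break  # avoids tokenizing the rest of the corpus
    return kept

# Toy usage with a whitespace "tokenizer" and a tiny max_len
pairs = [("a b c", "x y z"), ("a b c d e f g h i", "x y"), ("a b", "x y z w")]
kept = filter_pairs(pairs, str.split, max_len=6)
```

In the notebook this would be called with `tokenizer.tokenize` in place of `str.split`. Switching to this filter changes which pairs are kept, so the BLEU scores reported below would not be directly comparable.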


## 2. Model Architecture
Seq2Seq model with a bidirectional LSTM encoder and a unidirectional LSTM decoder that uses Bahdanau (additive) attention. The sizes are chosen to train comfortably on an RTX 3090.
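
As a reading aid for the `BahdanauAttention` module in the next cell, the additive attention it computes can be written as follows. Here $s_{t-1}$ is the decoder hidden state passed in as `hidden`, $h_j$ is the encoder output at source position $j$, and $W_a$, $U_a$, $v_a$ correspond to `Wa`, `Ua`, `Va`. Bias terms of the linear layers are omitted for brevity.

$$
e_{t,j} = v_a^\top \tanh\left(W_a s_{t-1} + U_a h_j\right), \qquad
\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k} \exp(e_{t,k})}, \qquad
c_t = \sum_{j} \alpha_{t,j}\, h_j
$$

Because the encoder is bidirectional, both $s_{t-1}$ and $h_j$ have dimension `2 * hidden_size`. The softmax in this notebook runs over all source positions, including padded ones, since no padding mask is applied.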

In [8]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size * 2, hidden_size)
        self.Ua = nn.Linear(hidden_size * 2, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, hidden, encoder_outputs):
        hidden = hidden.unsqueeze(1).repeat(1, encoder_outputs.size(1), 1)
        energy = torch.tanh(self.Wa(hidden) + self.Ua(encoder_outputs))
        scores = self.Va(energy).squeeze(-1)
        attn_weights = torch.softmax(scores, dim=1)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, attn_weights

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

    def forward(self, src):
        embedded = self.embedding(src)
        outputs, (hidden, cell) = self.lstm(embedded)
        hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        cell = torch.cat((cell[-2], cell[-1]), dim=1)
        return outputs, hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size + hidden_size * 2, hidden_size * 2, num_layers, batch_first=True)
        self.attention = BahdanauAttention(hidden_size)
        self.fc = nn.Linear(hidden_size * 2, vocab_size)

    def forward(self, tgt, hidden, cell, encoder_outputs):
        embedded = self.embedding(tgt)
        context, attn_weights = self.attention(hidden, encoder_outputs)
        lstm_input = torch.cat((embedded, context.unsqueeze(1)), dim=2)
        output, (hidden, cell) = self.lstm(lstm_input, (hidden.unsqueeze(0), cell.unsqueeze(0)))
        output = self.fc(output.squeeze(1))
        return output, hidden.squeeze(0), cell.squeeze(0), attn_weights

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        batch_size = src.size(0)
        tgt_len = tgt.size(1)
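        # Take the vocabulary size and device from the model and inputs rather than from globals
        outputs = torch.zeros(batch_size, tgt_len, self.decoder.fc.out_features, device=src.device)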

        encoder_outputs, hidden, cell = self.encoder(src)
        input = tgt[:, 0].unsqueeze(1)

        for t in range(1, tgt_len):
            output, hidden, cell, _ = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1).unsqueeze(1)
            input = tgt[:, t].unsqueeze(1) if teacher_force else top1

        return outputs
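
The teacher-forcing logic in `Seq2Seq.forward` can be isolated in plain Python. At each step the loop flips one coin, shared by the whole batch, and feeds the decoder either the gold token or its own previous prediction. The sketch below uses a hypothetical helper `choose_decoder_inputs` on a single sequence of token ids. The `predicted` list is an assumed stand-in for the model's argmax outputs.

```python
import random
from typing import List

def choose_decoder_inputs(
    gold: List[int],
    predicted: List[int],
    teacher_forcing_ratio: float,
    rng: random.Random,
) -> List[int]:
    """Return the token fed to the decoder at steps 1 .. len(gold) - 1.

    gold[t] is the reference token at position t; predicted[t] is the
    model's argmax at step t (predicted[0] is unused).
    """
    inputs = [gold[0]]  # the first decoder input is always the start token
    for t in range(1, len(gold) - 1):
        use_gold = rng.random() < teacher_forcing_ratio
        inputs.append(gold[t] if use_gold else predicted[t])
    return inputs

gold = [101, 5, 6, 7, 102]       # toy ids; 101 and 102 play the role of [CLS] and [SEP]
predicted = [0, 50, 60, 70, 80]  # made-up model predictions
always_gold = choose_decoder_inputs(gold, predicted, 1.0, random.Random(0))
never_gold = choose_decoder_inputs(gold, predicted, 0.0, random.Random(0))
```

With a ratio of 1.0 the decoder only ever sees reference tokens, and with 0.0 it only sees its own predictions, which is the situation at inference time.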

## 3. Training Setup
Define configs, optimizer, loss function, and BLEU metric.
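
For intuition about the BLEU metric, the sketch below computes the clipped n-gram precision at its core for one sentence pair. It is an illustration only, not what `sacrebleu` runs: the real metric applies its own tokenization, accumulates counts over the whole corpus, combines the precisions for n = 1 to 4 geometrically, and multiplies by a brevity penalty. The function names are hypothetical.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(prediction: str, reference: str, n: int) -> float:
    """Fraction of predicted n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference, so repeating a correct word does not inflate the score.
    """
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total

p1 = clipped_precision("the the the cat", "the cat sat", 1)
p2 = clipped_precision("the the the cat", "the cat sat", 2)
```

In the toy call, "the" appears three times in the prediction but is credited only once, because the reference contains it once.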

In [9]:
# Model configs
vocab_size = len(tokenizer)
embed_size = 128  # Lightweight for RTX 3090
hidden_size = 256  # Lightweight for RTX 3090
num_layers = 1

encoder = Encoder(vocab_size, embed_size, hidden_size, num_layers)
decoder = Decoder(vocab_size, embed_size, hidden_size, num_layers)
model = Seq2Seq(encoder, decoder).to(device)

# Optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# BLEU metric
bleu = evaluate.load('sacrebleu')

# Beam Search inference function
def translate_sentence(model, src, tokenizer, beam_width=5, max_len=50):
    model.eval()
    src = src.to(device)
    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src)

        # Initialize beam
        beams = [(torch.tensor([[tokenizer.cls_token_id]], dtype=torch.long).to(device), 0.0, hidden, cell)]
        completed = []

        for _ in range(max_len):
            new_beams = []
            for input, score, h, c in beams:
                if input[0, -1].item() == tokenizer.sep_token_id:
                    completed.append((input, score))
                    continue

                output, new_hidden, new_cell, _ = model.decoder(input[:, -1:], h, c, encoder_outputs)
                probs = torch.log_softmax(output, dim=-1)  # log-probabilities, shape (1, vocab_size)
                top_probs, top_idx = probs.topk(beam_width, dim=-1)

                for i in range(beam_width):
                    new_input = torch.cat([input, top_idx[:, i:i+1]], dim=1)
                    new_score = score + top_probs[:, i].item()
                    new_beams.append((new_input, new_score, new_hidden, new_cell))

            beams = sorted(new_beams, key=lambda x: x[1], reverse=True)[:beam_width]
            # Stop once enough hypotheses have finished or no live beams remain
            if len(completed) >= beam_width or not beams:
                break

        # Select best completed sequence or top beam if none completed
        if completed:
            best_sequence = max(completed, key=lambda x: x[1])[0]
        else:
            best_sequence = beams[0][0]

        return tokenizer.decode(best_sequence[0, 1:], skip_special_tokens=True)

# Training loop
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0
    for src, tgt in tqdm(dataloader, desc='Training'):
        src, tgt = src.to(device), tgt.to(device)
        optimizer.zero_grad()
        output = model(src, tgt)
        output = output[:, 1:].reshape(-1, output.size(-1))
        tgt = tgt[:, 1:].reshape(-1)
        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

# Validation with BLEU (using Beam Search)
def evaluate_bleu(model, dataloader, tokenizer, beam_width=5, max_len=50):
    model.eval()
    predictions, references = [], []
    with torch.no_grad():
        for src, tgt in tqdm(dataloader, desc='Evaluating'):
            src, tgt = src.to(device), tgt.to(device)
            for i in range(src.size(0)):
                pred_text = translate_sentence(model, src[i:i+1], tokenizer, beam_width, max_len)
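                # The reference is rebuilt from token ids, so it reflects the tokenizer's
                # normalization (dropped [UNK] tokens, spacing) rather than the raw file text
                ref_text = tokenizer.decode(tgt[i, 1:], skip_special_tokens=True)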
                predictions.append(pred_text)
                references.append([ref_text])
    return bleu.compute(predictions=predictions, references=references)['score']
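
`translate_sentence` ranks hypotheses by their summed log-probability, which tends to favor short outputs because every extra token adds a negative term. A common remedy is to divide the score by the hypothesis length before picking the winner. The sketch below shows this on a hypothetical toy model (`TOY_LM`, where the next-token distribution depends only on the last token) in plain Python. The `alpha` parameter and the function name are assumptions for illustration, not part of the notebook.

```python
import math
from typing import Dict, List, Tuple

BOS, EOS = "<s>", "</s>"

# Hypothetical toy model: next-token probabilities given the last token
TOY_LM: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"b": 0.6, "</s>": 0.4},
    "b": {"</s>": 0.9, "a": 0.1},
}

def beam_search(lm, beam_width: int = 2, max_len: int = 5, alpha: float = 1.0) -> List[str]:
    """Beam search that ranks finished hypotheses by length-normalized log-probability."""
    beams: List[Tuple[List[str], float]] = [([BOS], 0.0)]
    completed: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in lm[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Finished hypotheses leave the beam as soon as they emit EOS
            (completed if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    pool = completed or beams

    def normalized(item: Tuple[List[str], float]) -> float:
        tokens, score = item
        return score / ((len(tokens) - 1) ** alpha)  # exclude BOS from the length

    return max(pool, key=normalized)[0]

with_norm = beam_search(TOY_LM, alpha=1.0)
without_norm = beam_search(TOY_LM, alpha=0.0)
```

On this toy model, `alpha=0.0` (no normalization) selects the shorter hypothesis `<s> b </s>`, while `alpha=1.0` selects `<s> a b </s>`, whose per-token log-probability is higher. Whether normalization helps the translation model in this notebook would need to be checked on the dev set.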

## 4. Training Loop
Train for 10 epochs, validate after each epoch, save best model.

In [None]:
num_epochs = 10
best_bleu = 0

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    bleu_score = evaluate_bleu(model, dev_loader, tokenizer)
    print(f'Epoch {epoch+1}: Loss = {train_loss:.4f}, BLEU = {bleu_score:.2f}')
    if bleu_score > best_bleu:
        best_bleu = bleu_score
        torch.save(model.state_dict(), 'best_model.pt')
        print('Saved best model')

Training: 100%|██████████| 6250/6250 [47:37<00:00,  2.19it/s]
Evaluating: 100%|██████████| 313/313 [15:16<00:00,  2.93s/it]


Epoch 1: Loss = 4.8506, BLEU = 7.40
Saved best model


Training: 100%|██████████| 6250/6250 [48:47<00:00,  2.13it/s]
Evaluating: 100%|██████████| 313/313 [15:23<00:00,  2.95s/it]


Epoch 2: Loss = 3.9536, BLEU = 10.01
Saved best model


Training: 100%|██████████| 6250/6250 [48:51<00:00,  2.13it/s]
Evaluating: 100%|██████████| 313/313 [15:40<00:00,  3.00s/it]


Epoch 3: Loss = 3.6231, BLEU = 11.32
Saved best model


Training: 100%|██████████| 6250/6250 [48:51<00:00,  2.13it/s]
Evaluating: 100%|██████████| 313/313 [15:44<00:00,  3.02s/it]


Epoch 4: Loss = 3.4305, BLEU = 12.21
Saved best model


Training: 100%|██████████| 6250/6250 [48:48<00:00,  2.13it/s]
Evaluating: 100%|██████████| 313/313 [15:49<00:00,  3.03s/it]


Epoch 5: Loss = 3.2906, BLEU = 12.60
Saved best model


Training: 100%|██████████| 6250/6250 [48:48<00:00,  2.13it/s]
Evaluating: 100%|██████████| 313/313 [16:02<00:00,  3.07s/it]


Epoch 6: Loss = 3.1844, BLEU = 13.23
Saved best model


Training: 100%|██████████| 6250/6250 [48:46<00:00,  2.14it/s]
Evaluating: 100%|██████████| 313/313 [16:06<00:00,  3.09s/it]


Epoch 7: Loss = 3.0990, BLEU = 13.46
Saved best model


Training: 100%|██████████| 6250/6250 [48:51<00:00,  2.13it/s]
Evaluating: 100%|██████████| 313/313 [15:39<00:00,  3.00s/it]


Epoch 8: Loss = 3.0290, BLEU = 13.25


Training: 100%|██████████| 6250/6250 [47:39<00:00,  2.19it/s]
Evaluating: 100%|██████████| 313/313 [14:27<00:00,  2.77s/it]


Epoch 9: Loss = 2.9664, BLEU = 13.67
Saved best model


Training: 100%|██████████| 6250/6250 [46:49<00:00,  2.22it/s]
Evaluating: 100%|██████████| 313/313 [13:27<00:00,  2.58s/it]


Epoch 10: Loss = 2.9147, BLEU = 14.03
Saved best model
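
The log above can be summarized programmatically. The sketch below copies the per-epoch loss and dev BLEU from the output and applies a simple early-stopping rule. The helper `summarize` and its `patience` and `min_delta` parameters are hypothetical and are not used by the training loop in this notebook.

```python
from typing import List, Optional, Tuple

# (train_loss, dev_bleu) per epoch, copied from the training log above
history: List[Tuple[float, float]] = [
    (4.8506, 7.40), (3.9536, 10.01), (3.6231, 11.32), (3.4305, 12.21),
    (3.2906, 12.60), (3.1844, 13.23), (3.0990, 13.46), (3.0290, 13.25),
    (2.9664, 13.67), (2.9147, 14.03),
]

def summarize(
    history: List[Tuple[float, float]],
    patience: int = 3,
    min_delta: float = 0.0,
) -> Tuple[int, float, Optional[int]]:
    """Return (best_epoch, best_bleu, stop_epoch).

    An epoch counts as an improvement only if BLEU beats the best so far
    by more than min_delta. stop_epoch is the epoch at which `patience`
    consecutive non-improving epochs were reached, or None if that never
    happens.
    """
    best, best_epoch, stale = float("-inf"), 0, 0
    for epoch, (_, bleu) in enumerate(history, start=1):
        if bleu > best + min_delta:
            best, best_epoch, stale = bleu, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, best, epoch
    return best_epoch, best, None

best_epoch, best_bleu, stop_epoch = summarize(history)
```

With the default settings, the rule never triggers on this run: dev BLEU dips only once (epoch 8) and is still rising at epoch 10, which suggests the model had not yet converged when training stopped.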


## 5. Inference
Translate using Beam Search and evaluate on test set.

In [12]:
# Load best model
model.load_state_dict(torch.load('best_model.pt', map_location=device))  # map_location lets the checkpoint load on CPU-only machines too

# Test sample translation
sample_sentence = 'Spokesperson for the European Union External Action Services.'
tokens = tokenizer(sample_sentence, max_length=max_len, padding='max_length', truncation=True, return_tensors='pt')
translated = translate_sentence(model, tokens['input_ids'], tokenizer)
print(f'Input: {sample_sentence}')
print(f'Translation: {translated}')

# Evaluate on test set
test_bleu = evaluate_bleu(model, test_loader, tokenizer)
print(f'Test BLEU: {test_bleu:.2f}')

Input: Spokesperson for the European Union External Action Services.
Translation: Chân mật của Von Kính Lực Minh Kiểm Minh Ki.


Evaluating: 100%|██████████| 157/157 [07:37<00:00,  2.91s/it]


Test BLEU: 16.49
