<a href="https://colab.research.google.com/github/qu1r0ra/philippine-machine-translation/blob/chore%2Fpolish-files/notebooks/02c_modeling_cbk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling with PyTorch Instead of OpenNMT-py

## Notes

The author who trained the GRU model (initially a Transformer) lacked the computational resources to train it locally, so he trained it on Google Colab instead. The free Colab plan also fell short, since even the free T4 GPU is rate-limited for free users, so the author purchased compute units to access Colab GPUs. You may also need Colab compute units to replicate the results of this notebook.

If you decide to do so, press the '**Open in Colab**' button in the topmost cell of this notebook to open it in Google Colab.

Once you have opened this notebook in Colab, connect to a T4 GPU runtime. You don't need High RAM for this. You may choose other GPUs, but the authors found through experimentation that the T4 is the most cost-efficient option that is still sufficient for this project.

### Acceptance Stage

So here we are with yet another pivot because of all the problems we were experiencing with `OpenNMT-py`. I am never ever using it again.

Lesson learned: `PyTorch` is inevitable. Or perhaps I should've used `spaCy` instead. \- CJ

Anyway, let's load the data we will be training our models on.

In [None]:
%mkdir data

Upload your processed parallel corpora to separate folders in `data/`, each containing the following files:
- `train.src` (training set for source)
- `train.tgt` (training set for target)
- `valid.src` (validation set for source)
- `valid.tgt` (validation set for target)

These can be found in `data/processed/<model-name>`.
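
Before training, it can help to confirm that each folder matches this layout and that the source and target files are line-aligned, since the loader later pairs them with `zip`, which silently drops unmatched trailing lines. A minimal sketch, where `check_corpus_folder` and `count_lines` are hypothetical helpers and not part of the original notebook:

```python
import os

REQUIRED_FILES = ["train.src", "train.tgt", "valid.src", "valid.tgt"]

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_corpus_folder(folder_path):
    """Return a list of problems found in one corpus folder (empty if none)."""
    missing = [name for name in REQUIRED_FILES
               if not os.path.isfile(os.path.join(folder_path, name))]
    if missing:
        return [f"missing file: {name}" for name in missing]

    problems = []
    for split in ("train", "valid"):
        n_src = count_lines(os.path.join(folder_path, f"{split}.src"))
        n_tgt = count_lines(os.path.join(folder_path, f"{split}.tgt"))
        if n_src != n_tgt:
            problems.append(f"{split}: {n_src} source lines vs {n_tgt} target lines")
    return problems
```

Calling `check_corpus_folder` on each subfolder of your data directory and printing any non-empty result should surface misaligned files before a long training run starts.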

Or if you have the data saved somewhere in your Google Drive with the same structure as above, load it from there.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Change the variable below to the path containing your data.

In [None]:
# Path to data folder, revise accordingly
DATA_PATH = "/content/drive/MyDrive/data"

## PyTorch implementation of our GRU model

Let's define the model's architecture in PyTorch, along with the data pipeline, training loop, and translation helper.
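
The cell that follows tokenizes by lowercasing and splitting on whitespace, then maps each word to an integer index, with `<pad>`, `<sos>`, and `<eos>` reserved as indices 0 to 2. The sketch below illustrates that word-level encoding in plain Python. The reserved `<unk>` entry and the `build_vocab` helper are assumptions added for illustration only; the notebook's own vocabularies do not define an `<unk>` token.

```python
# <unk> is an assumption for this sketch; the notebook reserves only the first three.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def tokenize(sentence):
    return sentence.lower().split()

def build_vocab(sentences):
    """Give each distinct token the next free index after the special tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for sentence in sentences:
        for word in tokenize(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Wrap the token ids in <sos> ... <eos>; unseen words become <unk>."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in tokenize(sentence)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]

vocab = build_vocab(["el sol", "la luna"])
print(encode("El mar", vocab))  # [1, 4, 3, 2]: "mar" was never seen, so it maps to <unk>
```

Reserving a dedicated unknown index keeps every id inside the embedding table, which matters because `nn.Embedding` raises an error on an index equal to or larger than its size.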

In [None]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

# ==== CONFIG ====
EMB_DIM = 300
HID_DIM = 512
N_LAYERS = 2
DROPOUT = 0.2
EPOCHS = 20
BATCH_SIZE = 64
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
PATIENCE = 3

# ==== TOKENIZATION & ENCODING ====
def tokenize(sentence):
    return sentence.lower().split()

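def encode(sentence, vocab):
    # Out-of-vocabulary words fall back to <unk> when the vocab defines it, otherwise to
    # <pad> (index 0), so every index stays inside the embedding table.
    unk = vocab.get("<unk>", vocab["<pad>"])
    return [vocab["<sos>"]] + [vocab.get(w, unk) for w in tokenize(sentence)] + [vocab["<eos>"]]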

# ==== DATASET & DATALOADER ====
class MTDataset(Dataset):
    def __init__(self, pairs, src_vocab, tgt_vocab):
        self.data = [(encode(s, src_vocab), encode(t, tgt_vocab)) for s, t in pairs]
    def __len__(self): return len(self.data)
    def __getitem__(self, idx): return self.data[idx]

def collate_fn(batch):
    srcs, tgts = zip(*batch)
    src_pad = nn.utils.rnn.pad_sequence([torch.tensor(s) for s in srcs], batch_first=True)
    tgt_pad = nn.utils.rnn.pad_sequence([torch.tensor(t) for t in tgts], batch_first=True)
    return src_pad, tgt_pad

def create_dataloaders(train_pairs, valid_pairs, src_vocab, tgt_vocab):
    train_loader = DataLoader(MTDataset(train_pairs, src_vocab, tgt_vocab), batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
    valid_loader = DataLoader(MTDataset(valid_pairs, src_vocab, tgt_vocab), batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
    return train_loader, valid_loader

# ==== MODEL ====
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hid_dim, num_layers=n_layers, batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden

class Attention(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        self.attn = nn.Linear(hid_dim * 2, hid_dim)
        self.v = nn.Linear(hid_dim, 1, bias=False)
    def forward(self, hidden, encoder_outputs):
        hidden = hidden[-1].unsqueeze(1).repeat(1, encoder_outputs.size(1), 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)
        return F.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(hid_dim + emb_dim, hid_dim, num_layers=n_layers, batch_first=True, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim * 2 + emb_dim, output_dim)
        self.attention = Attention(hid_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(1)
        embedded = self.dropout(self.embedding(input))
        attn = self.attention(hidden, encoder_outputs).unsqueeze(1)
        context = attn.bmm(encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)
        output, hidden = self.rnn(rnn_input, hidden)
        prediction = self.fc_out(torch.cat((output, context, embedded), dim=2).squeeze(1))
        return prediction, hidden

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
    def forward(self, src, tgt):
        encoder_outputs, hidden = self.encoder(src)
        input_tok = tgt[:, 0]
        outputs = []
        for t in range(1, tgt.size(1)):
            output, hidden = self.decoder(input_tok, hidden, encoder_outputs)
            outputs.append(output.unsqueeze(1))
            input_tok = tgt[:, t]
        return torch.cat(outputs, dim=1)

# ==== LOAD DATA ====
def load_parallel_corpus(folder_path):
    def read_file(path):
        with open(path, "r", encoding="utf-8") as f:
            return [line.strip() for line in f]
    train_src = read_file(os.path.join(folder_path, "train.src"))
    train_tgt = read_file(os.path.join(folder_path, "train.tgt"))
    valid_src = read_file(os.path.join(folder_path, "valid.src"))
    valid_tgt = read_file(os.path.join(folder_path, "valid.tgt"))
    return list(zip(train_src, train_tgt)), list(zip(valid_src, valid_tgt))

# ==== TRAINING ====
def train_model(folder_name, train_pairs, valid_pairs):
    # Build vocab from both train and valid pairs, so validation never hits an unseen word
    src_vocab = {"<pad>":0, "<sos>":1, "<eos>":2}
    tgt_vocab = {"<pad>":0, "<sos>":1, "<eos>":2}
    for src, tgt in train_pairs + valid_pairs:
        for word in tokenize(src): src_vocab.setdefault(word, len(src_vocab))
        for word in tokenize(tgt): tgt_vocab.setdefault(word, len(tgt_vocab))
    src_ivocab = {v:k for k,v in src_vocab.items()}
    tgt_ivocab = {v:k for k,v in tgt_vocab.items()}

    train_loader, valid_loader = create_dataloaders(train_pairs, valid_pairs, src_vocab, tgt_vocab)

    enc = Encoder(len(src_vocab), EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)
    dec = Decoder(len(tgt_vocab), EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)
    model = Seq2Seq(enc, dec, DEVICE).to(DEVICE)

    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss(ignore_index=0)

    best_valid_loss = float('inf')
    epochs_no_improve = 0

    for epoch in range(EPOCHS):
        model.train()
        total_loss = 0
        for src, tgt in train_loader:
            src, tgt = src.to(DEVICE), tgt.to(DEVICE)
            optimizer.zero_grad()
            output = model(src, tgt)
            loss = criterion(output.view(-1, output.size(-1)), tgt[:,1:].contiguous().view(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            optimizer.step()
            total_loss += loss.item()
        avg_train_loss = total_loss / len(train_loader)

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for src, tgt in valid_loader:
                src, tgt = src.to(DEVICE), tgt.to(DEVICE)
                output = model(src, tgt)
                loss = criterion(output.view(-1, output.size(-1)), tgt[:,1:].contiguous().view(-1))
                val_loss += loss.item()
        avg_val_loss = val_loss / len(valid_loader)

        print(f"[{folder_name}] Epoch {epoch+1}/{EPOCHS} | Train Loss: {avg_train_loss:.4f} | Valid Loss: {avg_val_loss:.4f}")

        # Save checkpoint with vocabs
        if avg_val_loss < best_valid_loss:
            best_valid_loss = avg_val_loss
            epochs_no_improve = 0
            save_path = f"outputs/gru_{folder_name}_model.pt"
            torch.save({
                'model_state': model.state_dict(),
                'src_vocab': src_vocab,
                'tgt_vocab': tgt_vocab,
                'tgt_ivocab': tgt_ivocab
            }, save_path)
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= PATIENCE:
                print(f"[{folder_name}] Early stopping triggered at epoch {epoch+1}")
                break

    return model, src_vocab, tgt_vocab, src_ivocab, tgt_ivocab

# ==== TRANSLATION ====
def translate(model, sentence, src_vocab, tgt_vocab, tgt_ivocab, max_len=50):
    """
    Translate a single sentence using a trained Seq2Seq model.

    Unknown source words are mapped to <eos> (or <pad> if preferred).
    """
    model.eval()
    # Encode sentence: unknown words map to <eos>
    src_indices = [src_vocab["<sos>"]] + [src_vocab.get(w, src_vocab["<eos>"]) for w in sentence.lower().split()] + [src_vocab["<eos>"]]
    src = torch.tensor(src_indices).unsqueeze(0).to(DEVICE)

    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src)
        input_tok = torch.tensor([tgt_vocab["<sos>"]]).to(DEVICE)
        result = []

        for _ in range(max_len):
            output, hidden = model.decoder(input_tok, hidden, encoder_outputs)
            token = output.argmax(1).item()
            if token == tgt_vocab["<eos>"]:
                break
            # Map token id back to word
            result.append(tgt_ivocab.get(token, "<unk>"))
            input_tok = torch.tensor([token]).to(DEVICE)

    return " ".join(result)

Let's train the models.

In [None]:
# ==== RUN TRAINING ====
os.makedirs("outputs", exist_ok=True)
for folder_name in os.listdir(DATA_PATH):
    folder_path = os.path.join(DATA_PATH, folder_name)
    if os.path.isdir(folder_path):
        print(f"\n=== Processing folder: {folder_name} ===")
        train_pairs, valid_pairs = load_parallel_corpus(folder_path)
        model, src_vocab, tgt_vocab, src_ivocab, tgt_ivocab = train_model(folder_name, train_pairs, valid_pairs)
        # Optional test translation for sanity check
        test_sentence = train_pairs[0][0]
        print(f"[{folder_name}] Test Translation: {translate(model, test_sentence, src_vocab, tgt_vocab, tgt_ivocab)}")


=== Processing folder: aug-cbk ===
[aug-cbk] Epoch 1/20 | Train Loss: 5.5122 | Valid Loss: 4.8026
[aug-cbk] Epoch 2/20 | Train Loss: 4.2875 | Valid Loss: 4.2235
[aug-cbk] Epoch 3/20 | Train Loss: 3.5322 | Valid Loss: 3.9153
[aug-cbk] Epoch 4/20 | Train Loss: 2.9487 | Valid Loss: 3.7305
[aug-cbk] Epoch 5/20 | Train Loss: 2.5212 | Valid Loss: 3.6274
[aug-cbk] Epoch 6/20 | Train Loss: 2.2155 | Valid Loss: 3.5877
[aug-cbk] Epoch 7/20 | Train Loss: 1.9911 | Valid Loss: 3.5709
[aug-cbk] Epoch 8/20 | Train Loss: 1.8183 | Valid Loss: 3.5670
[aug-cbk] Epoch 9/20 | Train Loss: 1.6815 | Valid Loss: 3.5715
[aug-cbk] Epoch 10/20 | Train Loss: 1.5714 | Valid Loss: 3.5973
[aug-cbk] Epoch 11/20 | Train Loss: 1.4824 | Valid Loss: 3.6227
[aug-cbk] Early stopping triggered at epoch 11
[aug-cbk] Test Translation: espigó pues hasta el día de noche y cuando desgranó su rama y las echó sobre la cama y las unió a un efa

=== Processing folder: base ===
[base] Epoch 1/20 | Train Loss: 5.6917 | Valid Loss: 5.0
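
The aug-cbk log above shows why training stopped at epoch 11: the validation loss last improved at epoch 8, and `PATIENCE = 3` allows three consecutive epochs without improvement. The sketch below replays the logged validation losses through the same counter logic used in `train_model`. The `early_stop_epoch` helper is hypothetical and written only for this illustration.

```python
def early_stop_epoch(valid_losses, patience):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = None
    epochs_no_improve = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch = loss, epoch
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation losses copied from the aug-cbk log
aug_cbk_valid = [4.8026, 4.2235, 3.9153, 3.7305, 3.6274, 3.5877,
                 3.5709, 3.5670, 3.5715, 3.5973, 3.6227]
stop_epoch, best_epoch = early_stop_epoch(aug_cbk_valid, patience=3)
print(stop_epoch, best_epoch)  # 11 8
```

Because the checkpoint is written only when the validation loss improves, the saved `gru_aug-cbk_model.pt` corresponds to epoch 8, while the `model` object returned by `train_model` (and used for the test translation printed in the log) holds the weights from the final epoch.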

In [None]:
# ==== TRANSLATE TEST CORPUS ====
def translate_test_corpus(test_file=f"{DATA_PATH}/test.src", outputs_folder="outputs", max_len=50):
    """
    Translate all sentences in a test file using all .pt models in the outputs folder.
    Saves translations to <model_name>_translations.txt.
    """
    with open(test_file, "r", encoding="utf-8") as f:
        test_sentences = [line.strip() for line in f]

    for model_file in os.listdir(outputs_folder):
        if model_file.endswith(".pt"):
            # Load checkpoint
            checkpoint = torch.load(os.path.join(outputs_folder, model_file), map_location=DEVICE)

            src_vocab = checkpoint['src_vocab']
            tgt_vocab = checkpoint['tgt_vocab']
            tgt_ivocab = checkpoint['tgt_ivocab']

            # Initialize model with exact vocab sizes from training
            enc = Encoder(len(src_vocab), EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)
            dec = Decoder(len(tgt_vocab), EMB_DIM, HID_DIM, N_LAYERS, DROPOUT)
            model = Seq2Seq(enc, dec, DEVICE).to(DEVICE)
            model.load_state_dict(checkpoint['model_state'])
            model.eval()

            output_file = os.path.join(outputs_folder, f"{os.path.splitext(model_file)[0]}_translations.txt")
            with open(output_file, "w", encoding="utf-8") as out_f:
                for sent in test_sentences:
                    translation = translate(model, sent, src_vocab, tgt_vocab, tgt_ivocab, max_len=max_len)
                    out_f.write(translation + "\n")

            print(f"[{model_file}] Translations saved to {output_file}")

## Translation

In [None]:
translate_test_corpus()

[gru_base_model.pt] Translations saved to outputs/gru_base_model_translations.txt
[gru_aug-noise_model.pt] Translations saved to outputs/gru_aug-noise_model_translations.txt
[gru_aug-cbk_model.pt] Translations saved to outputs/gru_aug-cbk_model_translations.txt
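
Since `translate_test_corpus` writes exactly one line per test sentence, each translations file should have the same number of lines as `test.src`. A quick sanity check of that property, sketched with hypothetical `count_lines` and `check_alignment` helpers that are not part of the original notebook:

```python
import os

def count_lines(path):
    """Count the lines in a UTF-8 text file (empty lines included)."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_alignment(test_file, outputs_folder="outputs"):
    """Map each *_translations.txt file to True if its line count matches test_file."""
    expected = count_lines(test_file)
    results = {}
    for name in sorted(os.listdir(outputs_folder)):
        if name.endswith("_translations.txt"):
            path = os.path.join(outputs_folder, name)
            results[name] = count_lines(path) == expected
    return results
```

In this notebook the call would look like `check_alignment(f"{DATA_PATH}/test.src")`; any `False` value points to a translations file worth regenerating.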


Let's save the results.

In [None]:
!mkdir -p /content/drive/MyDrive/pytorch_models
!cp -r outputs /content/drive/MyDrive/pytorch_models/