<a href="https://colab.research.google.com/github/sunmyeonglee/2025-1-NLP/blob/main/4_machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Live Coding: Machine Translation (Korean to English) with Seq2Seq

Welcome! In this session, we'll build a neural machine translation (NMT) system to translate Korean sentences into English. We'll use PyTorch and concepts from sequence-to-sequence (Seq2Seq) modeling.

## 1. Setting up the Environment 🛠️

First, we need to install the libraries that we'll be using:
-   `transformers`: From Hugging Face, for easy access to tokenizers and potentially pre-trained model components (though we'll build our own tokenizers and Seq2Seq model).
-   `tokenizers`: Hugging Face's library for training our own fast tokenizers.
-   `gdown`: To download datasets/files from Google Drive.
-   `pandas`: For data manipulation, especially to load our dataset.

Let's install them!

In [None]:
# Install necessary libraries
!pip install transformers tokenizers pandas gdown --quiet

## 2. Downloading the Dataset 📚

Machine translation models require a **parallel corpus**, which is a collection of texts in a source language aligned with their translations in a target language.

We'll use a Korean-English parallel corpus (originally from NIA AI-Hub). The command below will download a CSV version of this dataset.

In [None]:
# Download the dataset
!gdown 13CGLEULYccogSLByHXPAxSveLZTtnj8c --quiet
!unzip -q -o nia_korean_english_csv.zip # -o overwrites if exists

print("Dataset downloaded and unzipped.")

Dataset downloaded and unzipped.


## 3. Loading and Inspecting the Data 🧐

Now that our dataset is downloaded, let's load it using `pandas` and take a first look. We expect to see Korean sentences and their corresponding English translations.

In [None]:
import pandas as pd
from pathlib import Path

# Define the path to the CSV file
csv_file_path = "nia_korean_english.csv" # This should match the unzipped file name

# Load the dataframe
df = pd.read_csv(csv_file_path)

# Display the first few rows
print("First 5 rows of the dataset:")
display(df.head())

# Display some info about the dataframe
print("\nDataFrame Info:")
df.info()

# Display a few examples
print("\nSample sentence pairs:")
for i in range(3):
    print(f"  Korean (원문): {df['원문'].iloc[i]}")
    print(f"  English (번역문): {df['번역문'].iloc[i]}")
    print("-" * 20)

First 5 rows of the dataset:


| | 원문 | 번역문 |
|---|---|---|
| 0 | 'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 ... | Bible Coloring' is a coloring application that... |
| 1 | 씨티은행에서 일하세요? | Do you work at a City bank? |
| 2 | 푸리토의 베스트셀러는 해외에서 입소문만으로 4차 완판을 기록하였다. | PURITO's bestseller, which recorded 4th rough ... |
| 3 | 11장에서는 예수님이 이번엔 나사로를 무덤에서 불러내어 죽은 자 가운데서 살리셨습니다. | In Chapter 11 Jesus called Lazarus from the to... |
| 4 | 6.5, 7, 8 사이즈가 몇 개나 더 재입고 될지 제게 알려주시면 감사하겠습니다. | I would feel grateful to know how many stocks ... |



DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1602418 entries, 0 to 1602417
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   원문      1602418 non-null  object
 1   번역문     1602418 non-null  object
dtypes: object(2)
memory usage: 24.5+ MB

Sample sentence pairs:
  Korean (원문): 'Bible Coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 앱입니다.
  English (번역문): Bible Coloring' is a coloring application that allows you to experience beautiful stories in the Bible.
--------------------
  Korean (원문): 씨티은행에서 일하세요?
  English (번역문): Do you work at a City bank?
--------------------
  Korean (원문): 푸리토의 베스트셀러는 해외에서 입소문만으로 4차 완판을 기록하였다.
  English (번역문): PURITO's bestseller, which recorded 4th rough -cuts by words of mouth from abroad.
--------------------


## 4. Tokenization: Turning Text into Numbers 🔢

Neural networks don't understand words directly. They need numerical input. **Tokenization** is the process of converting text into a sequence of numerical IDs. This involves:
1.  **Splitting** text into smaller units called **tokens** (words, sub-words, or characters).
2.  Building a **vocabulary**: a mapping from unique tokens to integer IDs.
3.  **Converting** sequences of tokens into sequences of IDs.
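
The three steps can be sketched in plain Python with simple whitespace splitting. This is only an illustration of the idea, not what the `tokenizers` library does internally; `build_vocab` and `encode` are hypothetical helper names.

```python
def build_vocab(sentences, specials=("[PAD]", "[UNK]")):
    """Map each unique whitespace token to an integer ID, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for token in sentence.split():           # step 1: split into tokens
            vocab.setdefault(token, len(vocab))  # step 2: assign the next free ID
    return vocab


def encode(sentence, vocab):
    """Step 3: convert tokens to IDs, falling back to [UNK] for unseen tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in sentence.split()]


toy_vocab = build_vocab(["I like tea", "I like coffee"])
toy_ids = encode("I like milk", toy_vocab)
print(toy_vocab)  # {'[PAD]': 0, '[UNK]': 1, 'I': 2, 'like': 3, 'tea': 4, 'coffee': 5}
print(toy_ids)    # [2, 3, 1]
```

The unseen word `milk` collapses to `[UNK]`, which is the weakness of word-level vocabularies that sub-word tokenization addresses.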

We'll use the `BertWordPieceTokenizer` from the `tokenizers` library. WordPiece is effective because it can break down unknown words into known sub-word units. We need to train two separate tokenizers: one for Korean (source) and one for English (target).
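
To see how an unknown word is broken into known pieces, here is a minimal sketch of greedy longest-match-first splitting, the encoding strategy commonly described for WordPiece. The function name `wordpiece_split` and the toy piece set are assumptions made for illustration; the real tokenizer also normalizes and pre-splits text, and its training procedure is a separate algorithm.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate by one character and retry
        if match is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


toy_pieces = {"token", "##izer", "##s", "test"}
print(wordpiece_split("tokenizers", toy_pieces))  # ['token', '##izer', '##s']
print(wordpiece_split("xyz", toy_pieces))         # ['[UNK]']
```

A word falls back to `[UNK]` only when no combination of known pieces covers it, which is rare once the vocabulary contains single characters.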

### 4.1. Preparing Data for Tokenizer Training

The tokenizer trainer expects input as a list of paths to text files. Let's extract our Korean and English sentences into separate `.txt` files.

In [None]:
# Create a directory for tokenizer data
tokenizer_data_dir = Path("tokenizer_data")
tokenizer_data_dir.mkdir(exist_ok=True)

# Define file paths for the corpus
korean_corpus_file = tokenizer_data_dir / "korean_corpus.txt"
english_corpus_file = tokenizer_data_dir / "english_corpus.txt"

# Save Korean sentences (make sure they are strings)
with open(korean_corpus_file, "w", encoding="utf-8") as f:
    for sentence in df['원문']:
        f.write(str(sentence) + "\n")

# Save English sentences (make sure they are strings)
with open(english_corpus_file, "w", encoding="utf-8") as f:
    for sentence in df['번역문']:
        f.write(str(sentence) + "\n")

print(f"Korean corpus saved to: {korean_corpus_file}")
print(f"English corpus saved to: {english_corpus_file}")

Korean corpus saved to: tokenizer_data/korean_corpus.txt
English corpus saved to: tokenizer_data/english_corpus.txt


### 4.2. Training the Tokenizers

Now, let's train the `BertWordPieceTokenizer` for Korean and English.
Key parameters:
-   `vocab_size`: The maximum number of unique tokens the tokenizer will learn.
-   `min_frequency`: A candidate sub-word must occur at least this many times in the corpus to be added to the vocabulary.
-   `limit_alphabet`: The maximum number of distinct characters kept in the initial alphabet from which sub-word units are built.
-   `special_tokens`: We define the standard BERT special tokens: `[PAD]` (padding), `[UNK]` (unknown), `[CLS]` (classification/start), `[SEP]` (separator/end), and `[MASK]` (masking, unused in this notebook).

In [None]:
from tokenizers import BertWordPieceTokenizer

# Tokenizer parameters
VOCAB_SIZE = 32000
MIN_FREQUENCY = 5
LIMIT_ALPHABET = 6000 # Maximum number of distinct characters kept in the initial alphabet
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# --- Train Korean Tokenizer ---
korean_tokenizer_output_dir = Path(f'hugging_kor_{VOCAB_SIZE}')
korean_tokenizer_output_dir.mkdir(exist_ok=True)

kor_tokenizer_trainer = BertWordPieceTokenizer(
    strip_accents=False, # Keep accents
    lowercase=False      # Preserve case
)

print("Training Korean tokenizer...")
kor_tokenizer_trainer.train(
    files=[str(korean_corpus_file)],
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    limit_alphabet=LIMIT_ALPHABET,
    show_progress=True,
    special_tokens=SPECIAL_TOKENS
)
kor_tokenizer_trainer.save_model(str(korean_tokenizer_output_dir))
print(f"Korean tokenizer saved to: {korean_tokenizer_output_dir}")


# --- Train English Tokenizer ---
english_tokenizer_output_dir = Path(f'hugging_eng_{VOCAB_SIZE}')
english_tokenizer_output_dir.mkdir(exist_ok=True)

eng_tokenizer_trainer = BertWordPieceTokenizer(
    strip_accents=False, # Usually True for English, but False is fine
    lowercase=False      # Usually True for English, but False helps if case is important
)

print("\nTraining English tokenizer...")
eng_tokenizer_trainer.train(
    files=[str(english_corpus_file)],
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    limit_alphabet=LIMIT_ALPHABET,
    show_progress=True,
    special_tokens=SPECIAL_TOKENS
)
eng_tokenizer_trainer.save_model(str(english_tokenizer_output_dir))
print(f"English tokenizer saved to: {english_tokenizer_output_dir}")

Training Korean tokenizer...



Korean tokenizer saved to: hugging_kor_32000

Training English tokenizer...



English tokenizer saved to: hugging_eng_32000


### 4.3. Loading and Testing Trained Tokenizers

We can now load our trained tokenizers using `BertTokenizerFast` from the `transformers` library. This provides a convenient interface for encoding text to IDs and decoding IDs back to text.

In [None]:
from transformers import BertTokenizerFast

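# Note: `BertTokenizerFast` controls casing through `do_lower_case` (default True), not `lowercase`.
# The `lowercase=False` argument below is therefore not applied, which is why the English
# output further down is lowercased. Pass `do_lower_case=False` if case should be preserved.

# Load the trained Korean (source) tokenizer
tokenizer_src = BertTokenizerFast.from_pretrained(
    str(korean_tokenizer_output_dir),
    strip_accents=False,
    lowercase=False
)

# Load the trained English (target) tokenizer
tokenizer_tgt = BertTokenizerFast.from_pretrained(
    str(english_tokenizer_output_dir),
    strip_accents=False,
    lowercase=False
)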

# Test the Korean tokenizer
sample_korean_sentence = "이것은 한국어 토크나이저 테스트입니다."
print(f"Original Korean: {sample_korean_sentence}")
tokenized_src_sample = tokenizer_src(sample_korean_sentence)
print(f"Tokens: {tokenizer_src.tokenize(sample_korean_sentence)}")
print(f"Token IDs: {tokenized_src_sample['input_ids']}")
print(f"Decoded: {tokenizer_src.decode(tokenized_src_sample['input_ids'])}")

# Test the English tokenizer
sample_english_sentence = "This is an English tokenizer test."
print(f"\nOriginal English: {sample_english_sentence}")
tokenized_tgt_sample = tokenizer_tgt(sample_english_sentence)
print(f"Tokens: {tokenizer_tgt.tokenize(sample_english_sentence)}")
print(f"Token IDs: {tokenized_tgt_sample['input_ids']}")
print(f"Decoded: {tokenizer_tgt.decode(tokenized_tgt_sample['input_ids'])}")

# Vocabulary sizes
print(f"\nSource (Korean) tokenizer vocab size: {tokenizer_src.vocab_size}")
print(f"Target (English) tokenizer vocab size: {tokenizer_tgt.vocab_size}")
print(f"Source PAD ID: {tokenizer_src.pad_token_id}, CLS ID: {tokenizer_src.cls_token_id}, SEP ID: {tokenizer_src.sep_token_id}")
print(f"Target PAD ID: {tokenizer_tgt.pad_token_id}, CLS ID: {tokenizer_tgt.cls_token_id}, SEP ID: {tokenizer_tgt.sep_token_id}")

Original Korean: 이것은 한국어 토크나이저 테스트입니다.
Tokens: ['이것은', '한국어', '토크', '##나이', '##저', '테스트', '##입니다', '.']
Token IDs: [2, 8062, 8698, 16135, 15425, 4311, 10222, 6461, 18, 3]
Decoded: [CLS] 이것은 한국어 토크나이저 테스트입니다. [SEP]

Original English: This is an English tokenizer test.
Tokens: ['this', 'is', 'an', 'eng', '##lish', 'token', '##izer', 'test', '.']
Token IDs: [2, 1200, 1056, 1112, 3058, 2566, 15803, 10469, 2356, 18, 3]
Decoded: [CLS] this is an english tokenizer test. [SEP]

Source (Korean) tokenizer vocab size: 32000
Target (English) tokenizer vocab size: 32000
Source PAD ID: 0, CLS ID: 2, SEP ID: 3
Target PAD ID: 0, CLS ID: 2, SEP ID: 3


## 5. Creating a PyTorch Dataset 📦

PyTorch's `Dataset` class provides an abstraction over our data. We'll create a custom `TranslationDataset` that will:
1.  Take a DataFrame (train, val, or test) and our tokenizers.
2.  In its `__getitem__` method, fetch a Korean-English pair.
3.  Tokenize them.
4.  Return the tokenized source sentence, tokenized target sentence (for decoder input), and labels (also from target sentence, for loss calculation).

For a Seq2Seq model, typical inputs/outputs per sample are:
-   `encoder_input_ids`: Tokenized source sentence.
-   `decoder_input_ids`: Tokenized target sentence, usually starting with a start-of-sequence (SOS) token (e.g., `[CLS]`). This is fed to the decoder during training (teacher forcing).
-   `labels`: Tokenized target sentence, usually ending with an end-of-sequence (EOS) token (e.g., `[SEP]`). This is what the decoder aims to predict. Often, labels are a shifted version of `decoder_input_ids`.

Our `BertTokenizerFast` automatically adds `[CLS]` (start) and `[SEP]` (end) tokens.

In [None]:
import torch
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    def __init__(self, dataframe, src_tokenizer, tgt_tokenizer, src_col="원문", tgt_col="번역문", max_length=128):
        self.dataframe = dataframe
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer
        self.src_col = src_col
        self.tgt_col = tgt_col
        self.max_length = max_length # Max sequence length for truncation

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        src_text = str(self.dataframe.iloc[idx][self.src_col])
        tgt_text = str(self.dataframe.iloc[idx][self.tgt_col])

        # Tokenize source sentence
        # `encode` returns a list of IDs. We convert to tensor.
        # `add_special_tokens=True` adds [CLS] and [SEP]
        encoder_input_ids = torch.tensor(
            self.src_tokenizer.encode(src_text, add_special_tokens=True, truncation=True, max_length=self.max_length)
        )

        # Tokenize the target sentence once; decoder input and labels are both derived from it.
        # With teacher forcing, the decoder reads [CLS] w1 w2 ... and must predict w1 w2 ... [SEP],
        # so the labels are the decoder input shifted left by one position.
        # The tokenizer already adds [CLS] and [SEP], so slicing the encoded sequence is enough.
        target_token_ids = torch.tensor(
            self.tgt_tokenizer.encode(tgt_text, add_special_tokens=True, truncation=True, max_length=self.max_length)
        )

        # In the original notebook, pack_collate expects (source, target, shifted_target)
        # Let's prepare them such that:
        # - source = encoder_input_ids
        # - target = decoder_input_ids (e.g. [CLS] w1 w2)
        # - shifted_target = labels (e.g. w1 w2 [SEP])
        decoder_input_ids = target_token_ids[:-1]
        labels = target_token_ids[1:]

        return encoder_input_ids, decoder_input_ids, labels

# Create Dataset instances
MAX_SEQ_LENGTH = 100 # Define a max length for sequences
dataset = TranslationDataset(df, tokenizer_src, tokenizer_tgt, max_length=MAX_SEQ_LENGTH)
# Test a sample from the dataset
sample_src_ids, sample_tgt_ids, sample_lbl_ids = dataset[0]
print("Sample from TranslationDataset:")
print(f"  Source IDs: {sample_src_ids}")
print(f"  Source Decoded: {tokenizer_src.decode(sample_src_ids)}")
print(f"  Target (Decoder Input) IDs: {sample_tgt_ids}")
print(f"  Target (Decoder Input) Decoded: {tokenizer_tgt.decode(sample_tgt_ids)}")
print(f"  Labels IDs: {sample_lbl_ids}")
print(f"  Labels Decoded: {tokenizer_tgt.decode(sample_lbl_ids)}")

Sample from TranslationDataset:
  Source IDs: tensor([    2,    11,    70,  4360,  4551, 13306,    71, 12901,  9564, 12435,
           11,  3546, 14567,  4326,  8934,  8407,  7400,  4154,  3252,  6420,
        12985,  5025,  3397,  6461,    18,     3])
  Source Decoded: [CLS]'bible coloring'은 성경의 아름다운 이야기를 체험 할 수 있는 컬러링 앱입니다. [SEP]
  Target (Decoder Input) IDs: tensor([    2, 26268, 23067,    11,  1056,    69, 23067,  2803,  1067,  5155,
         1117,  1042,  2405,  4024,  5520,  1039,  1023, 26268,    18])
  Target (Decoder Input) Decoded: [CLS] bible coloring'is a coloring application that allows you to experience beautiful stories in the bible.
  Labels IDs: tensor([26268, 23067,    11,  1056,    69, 23067,  2803,  1067,  5155,  1117,
         1042,  2405,  4024,  5520,  1039,  1023, 26268,    18,     3])
  Labels Decoded: bible coloring'is a coloring application that allows you to experience beautiful stories in the bible. [SEP]


## 6. Splitting the Dataset

To train and evaluate our model robustly, we split our data into three sets:
1.  **Training set**: Used to train the model parameters.
2.  **Validation set**: Used during training to monitor performance, tune hyperparameters, and prevent overfitting.
3.  **Test set**: Used *only once* at the very end to get an unbiased evaluation of the final model.

We'll use a common split (e.g., 80% train, 10% validation, 10% test). For faster execution in this live session, we might use a subset of the full data.

In [None]:
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_ratio, val_ratio, 1.0 - train_ratio - val_ratio], generator=torch.Generator().manual_seed(42))

print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")


Train dataset size: 1281935
Validation dataset size: 160242
Test dataset size: 160241
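
The sizes above are not an exact 80/10/10 split because 1,602,418 is not divisible by 10. A minimal sketch of the arithmetic, assuming `random_split` floors each fractional share and then hands out the leftover items round-robin (the behavior the PyTorch documentation describes for fractional lengths); `split_sizes` is a hypothetical helper, not a PyTorch function:

```python
import math


def split_sizes(total, fractions):
    """Floor each fractional share, then hand out leftover items round-robin."""
    sizes = [math.floor(total * frac) for frac in fractions]
    remainder = total - sum(sizes)
    for i in range(remainder):
        sizes[i % len(sizes)] += 1
    return sizes


sizes = split_sizes(1602418, [0.8, 0.1, 0.1])
print(sizes)  # [1281935, 160242, 160241]
```

Flooring leaves two items unassigned, and they go to the first two subsets, which explains why the validation set is one sample larger than the test set.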


## 7. DataLoader and Collate Function for Variable Length Sequences 🔄

Sentences have different lengths. When creating batches of data with `DataLoader`, we need to handle these varying lengths.
-   **Padding**: Making all sequences in a batch the same length by adding `[PAD]` tokens.
-   **Packing**: More efficient for RNNs. `torch.nn.utils.rnn.pack_sequence` sorts sequences by length, concatenates non-padded elements, and stores `batch_sizes` (how many sequences are active at each timestep). PyTorch RNNs can process `PackedSequence` efficiently.

We'll define a **collate function**. This function is passed to `DataLoader` and takes a list of samples (from our `Dataset`) to form a batch. Our collate function will use `pack_sequence`.
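
To make the packed layout concrete, here is a plain-Python sketch of what a packed batch stores: the tokens in time-major order and the number of active sequences per timestep. `pack_lists` is a hypothetical helper written for illustration; it mirrors the `data` and `batch_sizes` fields of a `PackedSequence` but does not use PyTorch.

```python
def pack_lists(sequences):
    """Return (data, batch_sizes) in the time-major layout a packed batch uses."""
    ordered = sorted(sequences, key=len, reverse=True)  # longest sequence first
    data, batch_sizes = [], []
    for t in range(len(ordered[0])):
        active = [seq[t] for seq in ordered if len(seq) > t]  # sequences still running at step t
        data.extend(active)
        batch_sizes.append(len(active))
    return data, batch_sizes


packed_data, packed_batch_sizes = pack_lists([[6], [1, 2, 3], [4, 5]])
print(packed_data)         # [1, 4, 6, 2, 5, 3]
print(packed_batch_sizes)  # [3, 2, 1]
```

Padding the same three sequences to length 3 would occupy 9 slots, while the packed form stores only the 6 real tokens.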

In [None]:
from torch.nn.utils.rnn import pack_sequence, PackedSequence
from torch.utils.data import DataLoader

def pack_collate_fn(raw_batch_list):
    # raw_batch_list is a list of tuples: [(src1, tgt_in1, lbl1), (src2, tgt_in2, lbl2), ...]
    # Each src_i, tgt_in_i, lbl_i is a 1D tensor of token IDs.

    sources, target_inputs, labels = zip(*raw_batch_list)
    # Now, sources = (src1, src2, ...), target_inputs = (tgt_in1, tgt_in2, ...), etc.

    # `pack_sequence` expects a list of Tensors. By default (`enforce_sorted=True`) the list must
    # already be sorted by length in descending order; with `enforce_sorted=False` it sorts for us.
    packed_sources = pack_sequence(sources, enforce_sorted=False)
    packed_target_inputs = pack_sequence(target_inputs, enforce_sorted=False)
    packed_labels = pack_sequence(labels, enforce_sorted=False)

    return packed_sources, packed_target_inputs, packed_labels

# Create DataLoaders
BATCH_SIZE = 64 # Adjust based on GPU memory

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=pack_collate_fn, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, collate_fn=pack_collate_fn, shuffle=False, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=pack_collate_fn, shuffle=False, num_workers=2)

# Let's test the DataLoader and collate function
print("Testing DataLoader and pack_collate_fn...")
try:
    batch_src_packed, batch_tgt_in_packed, batch_lbl_packed = next(iter(train_loader))
    print("Batch loaded successfully!")

    print("\nSource Batch (PackedSequence):")
    print(f"  Data shape (flattened tokens): {batch_src_packed.data.shape}")
    print(f"  Batch sizes (active sequences per timestep): {batch_src_packed.batch_sizes}")
    # print(f"  Sorted indices (original pos of sorted seqs): {batch_src_packed.sorted_indices}")
    # print(f"  Unsorted indices (how to restore original order): {batch_src_packed.unsorted_indices}")

    print("\nTarget Input Batch (PackedSequence):")
    print(f"  Data shape: {batch_tgt_in_packed.data.shape}")
    print(f"  Batch sizes: {batch_tgt_in_packed.batch_sizes}")

    print("\nLabels Batch (PackedSequence):")
    print(f"  Data shape: {batch_lbl_packed.data.shape}")
    print(f"  Batch sizes: {batch_lbl_packed.batch_sizes}")

except Exception as e:
    print(f"Error loading batch: {e}")
    import traceback
    traceback.print_exc()

## 8. Defining the Sequence-to-Sequence (Seq2Seq) Model 🧠

We'll build a classic Encoder-Decoder model using GRU (Gated Recurrent Unit) layers.

![Seq2Seq Architecture](https://raw.githubusercontent.com/tensorflow/nmt/master/nmt/g3doc/img/seq2seq.jpg)
*(Image source: TensorFlow NMT Tutorial)*

**Encoder**:
1.  **Embedding Layer**: Converts input source tokens (Korean) into dense vectors. We'll use `padding_idx` so PAD tokens have a zero embedding and don't contribute to gradients.
2.  **GRU Layer**: Processes the sequence of embeddings and outputs all hidden states and the final hidden state (context vector).

**Decoder**:
1.  **Embedding Layer**: Converts input target tokens (English) into dense vectors.
2.  **GRU Layer**: Takes the current target token's embedding and the previous hidden state (initialized with the encoder's context vector) to generate an output.
3.  **Linear Layer (Projection)**: Maps the GRU's output to logits over the target vocabulary.

Our model components will handle `PackedSequence` inputs. The encoder's final hidden state must be correctly passed to the decoder, potentially reordering it if the source and target sequences within a batch were sorted differently by `pack_sequence`.

In [None]:
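import torch.nn as nn
from torch.nn.utils.rnn import PackedSequence # Already imported but good for clarity

def embed_packed(emb_layer, packed):
  # Embed the flat token IDs of a PackedSequence and re-wrap them with the same packing metadata
  return PackedSequence(emb_layer(packed.data), packed.batch_sizes,
                        packed.sorted_indices, packed.unsorted_indices)

class Encoder(nn.Module):
  def __init__(self, num_vocab, emb_dim, hidden_dim, num_layers=2, dropout=0.0, pad_idx=0):
    super().__init__()
    self.emb = nn.Embedding(num_vocab, emb_dim, padding_idx=pad_idx)
    # batch_first only affects padded tensor inputs; PackedSequence inputs carry their own layout
    self.rnn = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True,
                      dropout=dropout if num_layers > 1 else 0.0)

  def forward(self, src):
    # src: PackedSequence of source token IDs
    _, hidden = self.rnn(embed_packed(self.emb, src))
    return hidden  # (num_layers, batch, hidden_dim), restored to the original batch order

class Decoder(nn.Module):
  def __init__(self, num_vocab, emb_dim, hidden_dim, num_layers=2, dropout=0.0, pad_idx=0):
    super().__init__()
    self.emb = nn.Embedding(num_vocab, emb_dim, padding_idx=pad_idx)
    self.rnn = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True,
                      dropout=dropout if num_layers > 1 else 0.0)
    self.proj = nn.Linear(hidden_dim, num_vocab)

  def forward(self, tgt_in, hidden):
    # tgt_in: PackedSequence of decoder input IDs; hidden: the encoder's final hidden state
    out, hidden = self.rnn(embed_packed(self.emb, tgt_in), hidden)
    logits = self.proj(out.data)  # (total_target_tokens, num_vocab)
    return PackedSequence(logits, out.batch_sizes, out.sorted_indices, out.unsorted_indices), hidden

class Seq2Seq(nn.Module):
  def __init__(self, encoder, decoder, device):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.device = device

  def forward(self, src, tgt_in):
    # The GRU returns its hidden state in the original batch order and re-sorts it for
    # the packed target internally, so the two packings do not need to share a sort order.
    hidden = self.encoder(src)
    logits, _ = self.decoder(tgt_in, hidden)
    return logits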

In [None]:

# --- Model Hyperparameters & Instantiation ---
EMBEDDING_DIM = 256
HIDDEN_DIM = 512  # Hidden dimension for GRU
NUM_LAYERS = 2    # Number of GRU layers
DROPOUT_P = 0.3   # Dropout probability

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Instantiate model components
enc = Encoder(tokenizer_src.vocab_size, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT_P, tokenizer_src.pad_token_id).to(device)
dec = Decoder(tokenizer_tgt.vocab_size, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT_P, tokenizer_tgt.pad_token_id).to(device)
model = Seq2Seq(enc, dec, device).to(device)

print(f"Model created with {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters.")

# Test forward pass with a batch (if batch variables exist from DataLoader test)
if 'batch_src_packed' in locals() and 'batch_tgt_in_packed' in locals():
    print("\nTesting model forward pass with a sample batch...")
    try:
        # Move the packed batches to the model's device.
        # PackedSequence.to moves the token data and the sort indices together.
        src_dev = batch_src_packed.to(device)
        tgt_in_dev = batch_tgt_in_packed.to(device)

        model.train() # Set to train mode for consistent behavior if dropout/batchnorm were used differently
        output_logits_packed = model(src_dev, tgt_in_dev)

        print("Model forward pass successful!")
        print(f"Output (PackedSequence logits data shape): {output_logits_packed.data.shape}")
        # Expected: (total_num_target_tokens_in_batch, target_vocab_size)
    except Exception as e:
        print(f"Error during model forward pass test: {e}")
        import traceback
        traceback.print_exc()
else:
    print("\nSkipping model forward pass test as batch data is not loaded. Run DataLoader cell first.")
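
As a sanity check on the printed parameter count, the total can be worked out by hand. The sketch below assumes the layout described in this section (one embedding and one unidirectional GRU per side, plus a linear projection with bias in the decoder) and the hyperparameters defined above; `gru_params` and `seq2seq_params` are hypothetical helper names.

```python
def gru_params(input_dim, hidden_dim, num_layers):
    """Parameters of a unidirectional multi-layer GRU with biases."""
    total = 0
    for layer in range(num_layers):
        in_dim = input_dim if layer == 0 else hidden_dim
        # 3 gates, each with input weights, hidden weights, and two bias vectors
        total += 3 * hidden_dim * (in_dim + hidden_dim + 2)
    return total


def seq2seq_params(src_vocab, tgt_vocab, emb_dim, hidden_dim, num_layers):
    encoder = src_vocab * emb_dim + gru_params(emb_dim, hidden_dim, num_layers)
    decoder = (tgt_vocab * emb_dim + gru_params(emb_dim, hidden_dim, num_layers)
               + hidden_dim * tgt_vocab + tgt_vocab)  # final projection: weights + bias
    return encoder + decoder


expected_params = seq2seq_params(32000, 32000, 256, 512, 2)
print(f"{expected_params:,}")  # 38,317,312
```

Most of the parameters sit in the two embedding tables and the output projection, all of which scale with the vocabulary size.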

## 9. Training the Model 🔥

The training process involves iterating through the training dataset in epochs and batches:
1.  **Get batch**: From `train_loader`.
2.  **Zero gradients**: `optimizer.zero_grad()`.
3.  **Forward pass**: `model(source_batch, target_input_batch)` to get `output_logits`.
4.  **Calculate loss**: Compare `output_logits.data` with `labels_batch.data` using `CrossEntropyLoss`. Packed sequences contain no `[PAD]` positions, so setting `ignore_index` to the target PAD token ID is only a safeguard here; it matters when batches are padded instead of packed.
5.  **Backward pass**: `loss.backward()` to compute gradients.
6.  **Gradient clipping**: (Optional but recommended) `torch.nn.utils.clip_grad_norm_` to prevent exploding gradients.
7.  **Optimizer step**: `optimizer.step()` to update model weights.
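
Gradient clipping in step 6 rescales all gradients by a single factor whenever their combined norm exceeds a threshold. A minimal plain-Python sketch of that idea, assuming gradients are given as nested lists of floats; `clip_by_global_norm` is a hypothetical helper that illustrates the computation rather than PyTorch's implementation:

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before)  # 5.0
print(norm_after)   # just under 1.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.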

We'll also define an evaluation function to check performance on the validation set.

**Note**: Full training takes time. For this live session, we'll define the loop structure and then load pre-trained weights.

In [None]:
import torch.optim as optim
import time
import math

# Define Loss Function
# The output of our model (PackedSequence.data) is [total_target_tokens, target_vocab_size]
# The labels (PackedSequence.data) are [total_target_tokens]
# This is exactly what CrossEntropyLoss expects.
TARGET_PAD_IDX = tokenizer_tgt.pad_token_id
criterion = nn.CrossEntropyLoss(ignore_index=TARGET_PAD_IDX)

# Define Optimizer
LEARNING_RATE = 1e-3 # 0.001
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training configuration
N_EPOCHS = 1 # Set higher for actual training (e.g., 10-20)
CLIP = 1.0     # Gradient clipping value

def train_epoch(model, dataloader, optimizer, criterion, clip, device):
    model.train() # Set model to training mode
    epoch_loss = 0
    num_batches = len(dataloader)

    for i, (src_packed, tgt_in_packed, lbl_packed) in enumerate(dataloader):
        # Move batch data to the device (PackedSequence.to also moves the sort indices)
        src_dev = src_packed.to(device)
        tgt_in_dev = tgt_in_packed.to(device)
        lbl_dev_data = lbl_packed.data.to(device) # Only need .data for labels

        optimizer.zero_grad()

        # Forward pass: output_logits_packed.data is (sum_lengths, vocab_size)
        output_logits_packed = model(src_dev, tgt_in_dev)

        # Calculate loss: output_logits_packed.data vs lbl_dev_data
        # output_logits_packed.data shape: (total_tokens_in_batch, target_vocab_size)
        # lbl_dev_data shape: (total_tokens_in_batch)
        loss = criterion(output_logits_packed.data, lbl_dev_data)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip) # Clip gradients
        optimizer.step()

        epoch_loss += loss.item()

        if (i + 1) % (num_batches // 10 if num_batches >= 10 else 1) == 0: # Print progress ~10 times
            print(f'  Batch {i+1}/{num_batches} | Train Loss: {loss.item():.3f}')

    return epoch_loss / num_batches

def evaluate_epoch(model, dataloader, criterion, device):
    model.eval() # Set model to evaluation mode
    epoch_loss = 0
    with torch.no_grad(): # Disable gradient calculations
        for i, (src_packed, tgt_in_packed, lbl_packed) in enumerate(dataloader):
            src_dev = src_packed.to(device)
            tgt_in_dev = tgt_in_packed.to(device)
            lbl_dev_data = lbl_packed.data.to(device)

            output_logits_packed = model(src_dev, tgt_in_dev)
            loss = criterion(output_logits_packed.data, lbl_dev_data)
            epoch_loss += loss.item()

    return epoch_loss / len(dataloader)

print("Training and evaluation loop structures defined.")
print("We will skip actual training and load pre-trained weights for the live session.")

# --- Example of starting a training loop (commented out for live coding) ---
# best_valid_loss = float('inf')
# MODEL_SAVE_PATH = 'best-seq2seq-model.pt'

# for epoch in range(N_EPOCHS):
#     start_time = time.time()
#     print(f'Starting Epoch: {epoch+1:02}/{N_EPOCHS}')

#     train_loss = train_epoch(model, train_loader, optimizer, criterion, CLIP, device)
#     valid_loss = evaluate_epoch(model, val_loader, criterion, device)

#     end_time = time.time()
#     epoch_mins, epoch_secs = divmod(end_time - start_time, 60)

#     if valid_loss < best_valid_loss:
#         best_valid_loss = valid_loss
#         torch.save({
#             'epoch': epoch,
#             'model_state_dict': model.state_dict(),
#             'optimizer_state_dict': optimizer.state_dict(),
#             'loss': best_valid_loss,
#             }, MODEL_SAVE_PATH)
#         print(f'  ** Best validation loss: {best_valid_loss:.3f}. Model saved to {MODEL_SAVE_PATH} **')

#     print(f'Epoch: {epoch+1:02} | Time: {int(epoch_mins)}m {int(epoch_secs)}s')
#     print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
#     print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
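
The perplexity (PPL) printed by the loop is the exponential of the mean per-token cross-entropy. A small plain-Python sketch of that relationship; the `cross_entropy` helper here is a hypothetical stand-in for `nn.CrossEntropyLoss` applied to a single token:

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# Two tokens scored by a model that is uniform over a 4-word vocabulary
token_losses = [cross_entropy([0.0, 0.0, 0.0, 0.0], 2),
                cross_entropy([0.0, 0.0, 0.0, 0.0], 0)]
mean_loss = sum(token_losses) / len(token_losses)
perplexity = math.exp(mean_loss)
print(mean_loss)   # ln(4), about 1.386
print(perplexity)  # about 4.0
```

A perplexity of 4 means the model is as uncertain as a uniform guess among 4 tokens. By the same reasoning, an untrained model over a 32,000-token vocabulary should start near a loss of ln(32000), about 10.37.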

## 10. Loading Pre-trained Weights 💾

Training a good NMT model from scratch can take many hours or days. To save time, we'll load pre-trained weights into our model structure.

**Important**: The architecture of the model we define (embedding dimensions, hidden sizes, number of layers, vocabulary sizes) *must exactly match* the architecture used when these weights were saved.

The original notebook used `hidden_size = 512` and `num_layers=3` for its pre-trained weights. Let's define a new model instance with these parameters to load the weights.

In [None]:
# Download the pre-trained model weights
!gdown 15jL2TaRk6Q47uuPWDruUge6O_gCPv5mp --quiet -O kor_eng_translator_model_vanilla_best.pt
print("Pre-trained weights 'kor_eng_translator_model_vanilla_best.pt' downloaded.")

# Parameters matching the pre-trained model (from original notebook's loading cell)
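PRETRAINED_HIDDEN_DIM = 512
# Assumption: the model skeleton in Section 8 sized its embeddings with hidden_size,
# so the saved embeddings are taken to be 512-dimensional; adjust if loading reports a shape mismatch.
PRETRAINED_EMBEDDING_DIM = PRETRAINED_HIDDEN_DIM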
PRETRAINED_NUM_LAYERS = 3
PRETRAINED_DROPOUT_P = DROPOUT_P # Assuming dropout was consistent

print(f"\nInstantiating model for pre-trained weights:")
print(f"  Embedding Dim: {PRETRAINED_EMBEDDING_DIM}")
print(f"  Hidden Dim: {PRETRAINED_HIDDEN_DIM}")
print(f"  Num Layers: {PRETRAINED_NUM_LAYERS}")
print(f"  Dropout: {PRETRAINED_DROPOUT_P}")
print(f"  Source Vocab Size: {tokenizer_src.vocab_size}")
print(f"  Target Vocab Size: {tokenizer_tgt.vocab_size}")


# Instantiate the model with parameters matching the pre-trained file
enc_loaded = Encoder(tokenizer_src.vocab_size, PRETRAINED_EMBEDDING_DIM, PRETRAINED_HIDDEN_DIM,
                     PRETRAINED_NUM_LAYERS, PRETRAINED_DROPOUT_P, tokenizer_src.pad_token_id).to(device)
dec_loaded = Decoder(tokenizer_tgt.vocab_size, PRETRAINED_EMBEDDING_DIM, PRETRAINED_HIDDEN_DIM,
                     PRETRAINED_NUM_LAYERS, PRETRAINED_DROPOUT_P, tokenizer_tgt.pad_token_id).to(device)
model_loaded = Seq2Seq(enc_loaded, dec_loaded, device).to(device)

try:
    # The checkpoint file contains a dictionary, and the model's state_dict is under the 'model' key.
    checkpoint = torch.load("kor_eng_translator_model_vanilla_best.pt", map_location=device)
    model_loaded.load_state_dict(checkpoint['model'])
    print("\nPre-trained model weights loaded successfully!")
    print(f"Loaded model has {sum(p.numel() for p in model_loaded.parameters()):,} total parameters.")
except Exception as e:
    print(f"\nError loading pre-trained weights: {e}")
    print("Ensure the model architecture (vocab sizes, dimensions, layers) matches the saved state_dict.")
    import traceback
    traceback.print_exc()

## 11. Translation (Inference) 🗣️➡️💬

With our (pre-trained) model ready, let's translate some new Korean sentences into English!
The inference process (also called decoding) works step-by-step:
1.  Tokenize the input Korean sentence.
2.  Pass tokenized input through the **Encoder** to get the context vector (final hidden state).
3.  Initialize the **Decoder** with this context vector.
4.  Start decoding with a special start-of-sequence (SOS) token (e.g., `[CLS]`).
5.  In a loop, for each step:
    1.  Feed the current generated token (or SOS for the first step) and the previous decoder hidden state to the decoder.
    2.  Get the logits (raw scores) over the target vocabulary.
    3.  Select the token with the highest score (this is called **greedy decoding**).
    4.  If the selected token is an end-of-sequence (EOS) token (e.g., `[SEP]`), stop.
    5.  Otherwise, add the token to our list of output tokens and use it as input for the next decoding step.
6.  Convert the list of output token IDs back into an English sentence.
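
The decoding loop in step 5 can be sketched independently of the model. In this minimal sketch, `step_fn` is a hypothetical callable that takes the current token ID and hidden state and returns `(logits, new_hidden)`. With the notebook's model it would wrap one step of the decoder's embedding, GRU, and projection; here a toy function stands in for it so the loop can run on its own.

```python
def greedy_decode(step_fn, hidden, sos_id, eos_id, max_len=50):
    """Generate token IDs one at a time, always taking the highest-scoring token."""
    token, output = sos_id, []
    for _ in range(max_len):  # cap the length in case EOS is never produced
        logits, hidden = step_fn(token, hidden)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax over the vocabulary
        if token == eos_id:
            break
        output.append(token)
    return output


# Toy stand-in for the decoder: the "hidden state" is a step counter and the
# logits favor a fixed script of token IDs, ending with EOS (ID 3).
SCRIPT = [5, 7, 3]


def toy_step(token, hidden):
    logits = [0.0] * 8
    logits[SCRIPT[min(hidden, len(SCRIPT) - 1)]] = 1.0
    return logits, hidden + 1


decoded = greedy_decode(toy_step, hidden=0, sos_id=2, eos_id=3)
print(decoded)  # [5, 7]
```

The IDs 2 and 3 match the `[CLS]` and `[SEP]` IDs printed by the tokenizers earlier. The `max_len` cap matters in practice, since a model may fail to emit `[SEP]` on unusual inputs.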