In [1]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"  # Set JAVA_HOME to your Java installation path
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import re
import random
import numpy as np
from tqdm import tqdm
from pathlib import Path
import time
import pickle
import math
from konlpy.tag import Mecab
import nltk
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize
from torch.amp import autocast, GradScaler

# Bridging Languages with Deep Learning: Building a Korean-English Translator (with beam search)
#### **By Eunice Tu & Yewon Park**

In this project, we build a Neural Machine Translation (NMT) system to translate Korean text into English using deep learning models. We experiment with two main architectures: an LSTM-based Sequence-to-Sequence (Seq2Seq) model with attention, and a Transformer-based model. Our aim is to develop models that produce fluent, accurate translations that outperform traditional methods.

---
## Data Source:
We used the AI Hub Korean-English Parallel Corpus, a professionally curated dataset provided by the Korean government. It contains over one million aligned Korean-English sentence pairs across diverse domains such as news articles, spoken conversations, legal documents, IT, and patents. This dataset is well-suited for our translation project because:

- It provides clean, high-quality translations covering both formal and informal language.

- It offers sufficient volume to train deep learning models effectively.

- It reflects a variety of real-world contexts, improving the model's generalization ability.

---
## Setup and Configuration

This function sets up the computing environment to ensure consistent and reproducible results across different runs. It includes GPU configuration, seed initialization, and hardware information logging.

- Detects GPU availability using PyTorch and sets the device accordingly
- Configures deterministic behavior for consistent performance (optional)
- Sets random seeds for PyTorch, NumPy, and Python's `random` module
- Prints device and memory information when using a GPU

In [2]:
# Initial setup, device configuration, and random seed settings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

if device.type == 'cuda':
    torch.backends.cudnn.benchmark = False  # Avoid random spikes
    torch.cuda.empty_cache()

Using device: cuda


---
## Data Preprocessing

#### Initial Preparation
To facilitate understanding and further analysis, the original column names in Korean need to be translated into English.

In [13]:
# Base folder where original files are
base_folder = "AIHub_translation"

# Folder where translated files will be saved
translated_folder = "translated"
os.makedirs(translated_folder, exist_ok=True)

files = [
    "1_spoken(1)_200226.xlsx",
    "1_spoken(2)_200226.xlsx",
    "2_conversation_200226.xlsx",
    "3_news(1)_200226.xlsx",
    "3_news(2)_200226.xlsx",
    "3_news(3)_200226.xlsx",
    "3_news(4)_200226.xlsx",
    "4_korean_culture_200226.xlsx",
    "5_decree_200226.xlsx",
    "6_government_website_200226.xlsx"
]

translated_files = [
    "1_spoken(1)_200226_translated.xlsx",
    "1_spoken(2)_200226_translated.xlsx",
    "2_conversation_200226_translated.xlsx",
    "3_news(1)_200226_translated.xlsx",
    "3_news(2)_200226_translated.xlsx",
    "3_news(3)_200226_translated.xlsx",
    "3_news(4)_200226_translated.xlsx",
    "4_korean_culture_200226_translated.xlsx",
    "5_decree_200226_translated.xlsx",
    "6_government_website_200226_translated.xlsx"
]

# Translating the column names from Korean to English
col_translation = {
    'SID': 'sid',
    'ID': 'id',
    '원문': 'korean',
    '번역문': 'english',
    '대분류': 'main_category',
    '소분류': 'sub_category',
    '상황': 'situation',
    'Set Nr.': 'set_number',
    '발화자': 'speaker',
    '날짜': 'date',
    '자동분류1': 'auto_category1',
    '자동분류2': 'auto_category2',
    '자동분류3': 'auto_category3',
    'URL': 'url',
    '언론사': 'media',
    '키워드': 'keyword',
    '지자체': 'local_government'
}

for f in files:
    path = os.path.join(base_folder, f)
    df = pd.read_excel(path, engine='openpyxl')

    # Translate column names
    translated_columns = {col: col_translation.get(col, col) for col in df.columns}
    df.rename(columns=translated_columns, inplace=True)

    # Save to new translated folder
    save_path = os.path.join(translated_folder, f.replace(".xlsx", "_translated.xlsx"))
    df.to_excel(save_path, index=False)

    print(f"✅ Saved translated file: {save_path}")

✅ Saved translated file: translated/1_spoken(1)_200226_translated.xlsx
✅ Saved translated file: translated/1_spoken(2)_200226_translated.xlsx
✅ Saved translated file: translated/2_conversation_200226_translated.xlsx
✅ Saved translated file: translated/3_news(1)_200226_translated.xlsx
✅ Saved translated file: translated/3_news(2)_200226_translated.xlsx
✅ Saved translated file: translated/3_news(3)_200226_translated.xlsx
✅ Saved translated file: translated/3_news(4)_200226_translated.xlsx
✅ Saved translated file: translated/4_korean_culture_200226_translated.xlsx
✅ Saved translated file: translated/5_decree_200226_translated.xlsx
✅ Saved translated file: translated/6_government_website_200226_translated.xlsx


In [None]:
# Checking the translated files
path = os.path.join(translated_folder, translated_files[2])
df = pd.read_excel(path, engine='openpyxl')
df.head()

Unnamed: 0,main_category,sub_category,situation,set_number,speaker,korean,english
0,비즈니스,회의,의견 교환하기,1,A-1,이번 신제품 출시에 대한 시장의 반응은 어떤가요?,How is the market's reaction to the newly rele...
1,비즈니스,회의,의견 교환하기,1,B-1,판매량이 지난번 제품보다 빠르게 늘고 있습니다.,The sales increase is faster than the previous...
2,비즈니스,회의,의견 교환하기,1,A-2,그렇다면 공장에 연락해서 주문량을 더 늘려야겠네요.,"Then, we'll have to call the manufacturer and ..."
3,비즈니스,회의,의견 교환하기,1,B-2,"네, 제가 연락해서 주문량을 2배로 늘리겠습니다.","Sure, I'll make a call and double the volume o..."
4,비즈니스,회의,의견 교환하기,2,A-1,지난 회의 마지막에 논의했던 안건을 다시 볼까요?,Shall we take a look at the issues we discusse...


#### Text Preprocessing

To ensure that the input data is clean, standardized, and properly formatted for efficient model training, we applied the following preprocessing steps:

- **Normalization:**  
  We first converted all text to lowercase and standardized spacing by adding spaces around punctuation marks such as `.`, `,`, `?`, and `!`. Non-alphabetic characters (except essential punctuation) were removed, and any multiple consecutive spaces were collapsed into a single space. For robustness, the function also returns an empty string for non-string inputs, ensuring that edge cases are handled appropriately.

- **Tokenization:**  
  After normalization, we applied a simple word-level tokenizer. Each unique word is assigned a unique index, and special tokens — `<pad>`, `<sos>`, `<eos>`, and `<unk>` — were included to manage padding, sequence start, sequence end, and unknown words respectively.

- **Sequence Preparation:**  
  Each sentence was wrapped with `<sos>` (start of sentence) and `<eos>` (end of sentence) tokens. Sequences were either padded or truncated to a fixed maximum length, allowing for efficient batching during model training.

- **Data Splitting:**  
  Currently, the entire dataset is being used for model training. In future work, we plan to split the dataset into training, validation, and test sets to better evaluate the model’s generalization performance.

These preprocessing steps collectively transform raw text into a structured format that is clean, consistent, and optimized for training effective machine learning models.

In [3]:
def preprocess_english(sentence):
    """Clean English: lowercase, remove symbols, keep letters."""
    if isinstance(sentence, str):
        sentence = sentence.lower().strip()
        sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
        sentence = re.sub(r'[^a-zA-Z?.!,\s]', '', sentence)
        sentence = re.sub(r'\s+', ' ', sentence)
        return sentence
    return ""

def preprocess_korean(sentence):
    """For Korean, just trim extra spaces (DO NOT strip Hangul)."""
    if isinstance(sentence, str):
        sentence = sentence.strip()
        sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
        sentence = re.sub(r'\s+', ' ', sentence)
        return sentence
    return ""

#### Tokenizer Class
The ImprovedTokenizer class provides a custom way to convert text into token IDs and back, enabling consistent text preprocessing and decoding.:

- Initialized with a vocabulary size limit 
- Contains special tokens: `<pad>`(0), `<sos>`(1), `<eos>`(2), and `<unk>`(3)
- Builds vocabulary based on word frequency in the training data
- The most common words are included in the vocabulary up to the max limit
- Provides methods to encode sentences into token IDs and decode IDs back to text
- Implements save/load functionality for persistence between runs

`ImprovedKoreanTokenizer`

This tokenizer provides character-level support for Korean text:

- The `tokenize()` method uses `mecab.morphs()` to properly segment Korean text into morphological units\
    **Morphological units (or morphemes) are the smallest linguistic units in a language that have meaning or grammatical function.**
- During vocabulary building, it counts frequency of each morphological unit
- Words are sorted by frequency and limited to the maximum vocabulary size
- The `encode()` method converts Korean text into token IDs using the built vocabulary
- The `decode()` method converts token IDs back into readable Korean text

**📌 Why Morphological Analysis Matters for Korean**

The ImprovedKoreanTokenizer uses MeCab (through the mecab.morphs() function) to perform morphological analysis because:

- Korean is an agglutinative language where numerous grammatical elements attach to word stems
- Simply splitting by spaces would be ineffective as Korean sentences often have many morphemes within a single space-separated "word"
- Character-level splitting would lose the semantic meaning carried by morphological units
- Proper morphological segmentation provides much better input for NLP tasks like translation or sentiment analysis

This is why the Korean tokenizer needs specialized processing while the English tokenizer can rely on simpler space-based splitting.

In [4]:
# Initialize the Mecab tokenizer for Korean
mecab = Mecab()
def tokenize_korean(text):
    return mecab.morphs(text)

class ImprovedKoreanTokenizer:
    """Korean-specific tokenizer with character-level support"""
    def __init__(self, max_vocab=50000):
        self.word2idx = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
        self.idx2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.word_freq = {}
        self.max_vocab = max_vocab

    def tokenize(self, sentence):
        """Tokenize Korean text appropriately"""
        return tokenize_korean(sentence)

    def fit(self, sentences):
        """Build vocabulary from sentences"""
        for sent in tqdm(sentences, desc="Building vocabulary"):
            for word in self.tokenize(sent):
                self.word_freq[word] = self.word_freq.get(word, 0) + 1

        # Sort by frequency and limit vocab size
        sorted_words = sorted(self.word_freq.items(), key=lambda x: x[1], reverse=True)
        vocab_limit = min(len(sorted_words), self.max_vocab - 4)

        idx = 4
        for word, _ in sorted_words[:vocab_limit]:
            self.word2idx[word] = idx
            self.idx2word[idx] = word
            idx += 1

        print(f"Vocabulary size: {len(self.word2idx)}")

    def encode(self, sentence):
        """Convert sentence to token IDs"""
        return [self.word2idx.get(word, self.word2idx["<unk>"]) for word in self.tokenize(sentence)]

    def decode(self, indices):
        """Convert token IDs back to text"""
        return ' '.join([self.idx2word.get(idx, "<unk>") for idx in indices if idx != self.word2idx["<pad>"]])

    def save(self, path):
        """Save tokenizer state"""
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        """Load tokenizer state"""
        obj = cls()
        with open(path, 'rb') as f:
            obj.__dict__.update(pickle.load(f))
        return obj

In [5]:
class EnglishTokenizer:
    """Simple word-level tokenizer for English"""
    def __init__(self, max_vocab=30000):
        self.word2idx = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
        self.idx2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.word_freq = {}
        self.max_vocab = max_vocab

    def tokenize(self, sentence):
        """Split English text into words"""
        return word_tokenize(sentence)

    def fit(self, sentences):
        for sent in tqdm(sentences, desc="Building vocabulary"):
            for word in self.tokenize(sent):
                self.word_freq[word] = self.word_freq.get(word, 0) + 1

        sorted_words = sorted(self.word_freq.items(), key=lambda x: x[1], reverse=True)
        vocab_limit = min(len(sorted_words), self.max_vocab - 4)

        idx = 4
        for word, _ in sorted_words[:vocab_limit]:
            self.word2idx[word] = idx
            self.idx2word[idx] = word
            idx += 1

        print(f"Vocabulary size: {len(self.word2idx)}")

    def encode(self, sentence):
        return [self.word2idx.get(word, self.word2idx["<unk>"]) for word in self.tokenize(sentence)]

    def decode(self, indices):
        return ' '.join([self.idx2word.get(idx, "<unk>") for idx in indices if idx != self.word2idx["<pad>"]])

    def save(self, path):
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        obj = cls()
        with open(path, 'rb') as f:
            obj.__dict__.update(pickle.load(f))
        return obj

#### Dataset Class with Caching
The `CachedTranslationDataset` class is designed to improve training efficiency for machine translation tasks by preprocessing and caching tokenized data.

- **Built on PyTorch Dataset:**  
  Inherits from `torch.utils.data.Dataset`, making it fully compatible with PyTorch's `DataLoader` for efficient batching and shuffling.

- **Caching Mechanism:**  
  Preprocesses each data sample once and stores the result in memory, avoiding repeated preprocessing during each epoch and speeding up training.

- **Tokenization and Sequence Preparation:**  
  - Converts source sentences (e.g., Korean) and target sentences (e.g., English) into token ID sequences.
  - Adds special tokens like `<sos>` (start of sentence) and `<eos>` (end of sentence).
  - Applies padding or truncation to ensure all sequences have a fixed maximum length.

- **Attention Mask Creation:**  
  Generates attention masks that distinguish between actual tokens and padding tokens, helping the model ignore padded positions during training.

- **Model-Ready Outputs:**  
  Returns tensors for:
  - Source input IDs and attention masks
  - Target input IDs and labels  
  These tensors are ready to be directly fed into the model for training or evaluation.

In [6]:
class CachedTranslationDataset(Dataset):
    """Dataset with caching for faster loading"""
    def __init__(self, df, src_tokenizer, tgt_tokenizer, max_len=40, cache_dir='dataset_cache'):
        self.df = df
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer
        self.max_len = max_len
        self.cache_dir = cache_dir
        self.cache_file = os.path.join(cache_dir, f"cached_data_{len(df)}_{max_len}.pkl")
        
        # Create cache directory if it doesn't exist
        os.makedirs(cache_dir, exist_ok=True)
        
        # Try to load from cache, otherwise process data
        self.cached_data = self._load_or_create_cache()

    def _load_or_create_cache(self):
        if os.path.exists(self.cache_file):
            print(f"Loading cached dataset from {self.cache_file}")
            with open(self.cache_file, 'rb') as f:
                return pickle.load(f)
        
        print("Creating new dataset cache...")
        cached_data = []
        
        for idx in tqdm(range(len(self.df)), desc="Processing dataset"):
            src_text = self.df.iloc[idx]['korean']
            tgt_text = self.df.iloc[idx]['english']
            
            src_seq = [1] + self.src_tokenizer.encode(src_text) + [2]  # <sos> + sentence + <eos>
            tgt_seq = [1] + self.tgt_tokenizer.encode(tgt_text) + [2]  # <sos> + sentence + <eos>
            
            # Truncate sequences to max_len
            src_seq = src_seq[:self.max_len]
            tgt_seq = tgt_seq[:self.max_len]
            
            # Create source and target masks
            src_mask = [1] * len(src_seq) + [0] * (self.max_len - len(src_seq))
            tgt_mask = [1] * len(tgt_seq) + [0] * (self.max_len - len(tgt_seq))
            
            # Pad sequences
            src_seq += [0] * (self.max_len - len(src_seq))
            tgt_seq += [0] * (self.max_len - len(tgt_seq))
            
            cached_data.append({
                'src': torch.tensor(src_seq, dtype=torch.long),
                'tgt': torch.tensor(tgt_seq, dtype=torch.long),
                'src_mask': torch.tensor(src_mask, dtype=torch.bool),
                'tgt_mask': torch.tensor(tgt_mask, dtype=torch.bool),
                'src_len': len(src_seq),
                'tgt_len': len(tgt_seq)
            })
        
        # Save to cache
        print(f"Saving dataset cache to {self.cache_file}")
        with open(self.cache_file, 'wb') as f:
            pickle.dump(cached_data, f)
        
        return cached_data

    def __len__(self):
        return len(self.cached_data)

    def __getitem__(self, idx):
        return self.cached_data[idx]

#### Data Loading Functions
`load_excel_files`:
- combines all Excel files from a specified directory into a single DataFrame 
- automatically identifies the Korean and English columns, preprocesses the text
- caches the combined DataFrame for faster reloading in future runs.

`create_dataloaders`:
- Splits the dataset into training and validation sets
- Allows optional subsampling to use only a fraction of the available data if needed. Finally 
- Creates PyTorch `DataLoader` objects to efficiently batch and feed the data to the model during training and evaluation


In [7]:
def load_excel_files(directory='./translated', pattern="*translated.xlsx"):
    """Load and combine all Excel files matching the pattern"""
    all_data = []
    cache_file = os.path.join(directory, "combined_data_cache.pkl")
    
    # Try to load from cache
    if os.path.exists(cache_file):
        print(f"Loading combined data from cache: {cache_file}")
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    
    files = list(Path(directory).glob(pattern))
    
    if not files:
        print(f"No files matching pattern '{pattern}' found in directory '{directory}'")
        return pd.DataFrame()
    
    print(f"Found {len(files)} files matching the pattern")
    
    for file_path in tqdm(files, desc="Loading Excel files"):
        try:
            # Attempt to identify Korean and English columns based on common patterns
            df = pd.read_excel(file_path)
            
            # Try to automatically detect Korean and English columns
            korean_col = None
            english_col = None
            
            # Common column name patterns
            korean_patterns = ['korean', 'ko', '한국어', 'source']
            english_patterns = ['english', 'en', '영어', 'target']
            
            # Check column names
            for col in df.columns:
                col_lower = str(col).lower()
                if any(pattern in col_lower for pattern in korean_patterns):
                    korean_col = col
                if any(pattern in col_lower for pattern in english_patterns):
                    english_col = col
            
            # If automatic detection fails, use the first two columns
            if korean_col is None or english_col is None:
                if len(df.columns) >= 2:
                    korean_col = df.columns[0]
                    english_col = df.columns[1]
                    print(f"Using columns: {korean_col} and {english_col} for {file_path.name}")
                else:
                    print(f"Skipping {file_path.name}: Not enough columns")
                    continue
            
            # Extract and rename columns
            file_data = df[[korean_col, english_col]].copy()
            file_data.columns = ['korean', 'english']
            
            # Add source file information
            file_data['source_file'] = file_path.name
            
            # Append to combined data
            all_data.append(file_data)
            
        except Exception as e:
            print(f"Error processing {file_path.name}: {e}")
    
    if not all_data:
        return pd.DataFrame()
    
    # Combine all dataframes
    combined_data = pd.concat(all_data, ignore_index=True)
    
    # Clean the data
    combined_data = combined_data.dropna()
    combined_data['english'] = combined_data['english'].apply(preprocess_english)
    combined_data['korean'] = combined_data['korean'].apply(preprocess_korean)
    
    # Remove rows with too short sentences
    combined_data = combined_data[
    (combined_data['english'].str.split().str.len() > 3) &
    (combined_data['korean'].str.split().str.len() > 3)
    ]

    # Remove rows with empty strings
    combined_data = combined_data[(combined_data['english'] != '') & (combined_data['korean'] != '')]
    
    # Save to cache
    print(f"Saving combined data to cache: {cache_file}")
    with open(cache_file, 'wb') as f:
        pickle.dump(combined_data, f)
    
    return combined_data

def create_dataloaders(data, ko_tokenizer, en_tokenizer, train_ratio=0.8, 
                       batch_size=64, max_len=40, num_workers=4, pin_memory=True, 
                       subset_fraction=0.3):  # Added subset_fraction parameter
    """Create train and validation DataLoaders with optimized settings"""
    
    # Apply subset sampling - NEW
    if subset_fraction < 1.0:
        sample_size = int(len(data) * subset_fraction)
        data = data.sample(sample_size, random_state=42).reset_index(drop=True)
        print(f"Using {subset_fraction*100}% of the data: {len(data)} samples")
    
    # Split data into train and validation sets
    train_size = int(len(data) * train_ratio)
    train_data = data.iloc[:train_size]
    val_data = data.iloc[train_size:]
    
    print(f"Training data: {len(train_data)} samples")
    print(f"Validation data: {len(val_data)} samples")
    
    # Create datasets with caching
    train_dataset = CachedTranslationDataset(train_data, ko_tokenizer, en_tokenizer, max_len, cache_dir='dataset_cache/train')
    val_dataset = CachedTranslationDataset(val_data, ko_tokenizer, en_tokenizer, max_len, cache_dir='dataset_cache/val')
    
    # Create data loaders
    train_loader = DataLoader(
        train_dataset, 
        batch_size=batch_size, 
        shuffle=True,
        num_workers=num_workers,
        pin_memory=pin_memory
    )
    
    val_loader = DataLoader(
        val_dataset, 
        batch_size=batch_size, 
        shuffle=False,
        num_workers=num_workers,
        pin_memory=pin_memory
    )
    
    return train_loader, val_loader

---
## Model Implementation
#### Transformer-Based Model:
`PositionalEncoding`:

- Adds position-dependent signals to token embeddings since the Transformer has no inherent notion of order.
- Uses fixed sine and cosine patterns based on position and embedding dimension

`TransformerModel`:

- Implements a full encoder-decoder Transformer architecture using PyTorch’s nn.Transformer.
- Handles embedding, positional encoding, masking, and decoding logic.
- Supports both training (forward) and inference (beam_search) with step-wise prediction.
- `beam_search()` enables more fluent translations compared to greedy decoding.

In [8]:
class PositionalEncoding(nn.Module):
    """Injects positional information using sine and cosine signals"""
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sin to even indices, cos to odd
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0).transpose(0, 1)  # Shape: (max_len, 1, d_model)
        self.register_buffer('pe', pe)  # Avoids updating during training

    def forward(self, x):
        # Add positional encoding to input embeddings
        x = x + self.pe[:x.size(0), :]
        return x

In [9]:
class TransformerModel(nn.Module):
    """Full Transformer model for sequence-to-sequence translation"""
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
                 num_encoder_layers=4, num_decoder_layers=4, dim_feedforward=1024, dropout=0.2):
        super(TransformerModel, self).__init__()

        # Embedding + positional encodings
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)

        # Transformer backbone (encoder-decoder)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )

        # Final linear projection to vocab size
        self.output_layer = nn.Linear(d_model, tgt_vocab_size)

        self._init_parameters()

        # Store hyperparameters
        self.d_model = d_model
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size

    def _init_parameters(self):
        """Initialize model weights with Xavier uniform"""
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
    def create_mask(self, src, tgt):
        """Create padding and look-ahead masks for source and target"""
        src_padding_mask = (src == 0).to(device)
        tgt_padding_mask = (tgt == 0).to(device)
        tgt_len = tgt.size(1)

        # Prevent target positions from seeing future tokens
        tgt_attention_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1).to(device)
        return src_padding_mask, tgt_padding_mask, tgt_attention_mask

    def forward(self, src, tgt):
        """Standard forward pass for training"""
        src_padding_mask, tgt_padding_mask, tgt_attention_mask = self.create_mask(src, tgt)

        # Embed + add position encoding
        src_embedded = self.positional_encoding(self.src_embedding(src) * math.sqrt(self.d_model))
        tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt) * math.sqrt(self.d_model))

        # Transformer forward
        output = self.transformer(
            src=src_embedded,
            tgt=tgt_embedded,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=src_padding_mask,
            tgt_mask=tgt_attention_mask
        )

        return self.output_layer(output)
    def beam_search(self, src, src_tokenizer, tgt_tokenizer, beam_width=5, max_len=50):
        """Decodes translation using beam search for better quality than greedy"""
        self.eval()
        with torch.no_grad():
            # Tokenize if raw input
            if isinstance(src, str):
                src_tokens = [1] + src_tokenizer.encode(src) + [2]
                src = torch.tensor([src_tokens]).to(device)
            else:
                src = src.to(device)

            src_padding_mask = (src == 0).to(device)
            src_embedded = self.positional_encoding(self.src_embedding(src) * math.sqrt(self.d_model))

            # Encode source once
            memory = self.transformer.encoder(src_embedded, src_key_padding_mask=src_padding_mask)

            # Initialize beam with <sos> token
            sequences = [([1], 0.0)]  # ([tokens], score)

            for _ in range(max_len):
                all_candidates = []
                for seq, score in sequences:
                    tgt_input = torch.tensor([seq]).to(device)
                    tgt_mask = torch.triu(torch.ones(len(seq), len(seq), dtype=torch.bool), diagonal=1).to(device)
                    tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt_input) * math.sqrt(self.d_model))

                    # Decode next step
                    output = self.transformer.decoder(
                        tgt=tgt_embedded,
                        memory=memory,
                        tgt_mask=tgt_mask,
                        memory_key_padding_mask=src_padding_mask,
                        tgt_key_padding_mask=(tgt_input == 0)
                    )

                    logits = self.output_layer(output[:, -1, :])
                    probs = torch.log_softmax(logits, dim=-1)
                    topk = torch.topk(probs, beam_width)

                    # Expand each beam with top tokens
                    for i in range(beam_width):
                        token = topk.indices[0, i].item()
                        new_seq = seq + [token]
                        new_score = (score + topk.values[0, i].item()) / ((len(new_seq) + 1) ** 0.7)
                        all_candidates.append((new_seq, new_score))

                # Keep top-k candidates
                sequences = sorted(all_candidates, key=lambda tup: tup[1], reverse=True)[:beam_width]

                # Stop early if all sequences ended with <eos>
                if all(seq[-1] == 2 for seq, _ in sequences):
                    break

            best_seq = sequences[0][0]
            if 2 in best_seq:
                best_seq = best_seq[:best_seq.index(2)]

            return tgt_tokenizer.decode(best_seq)

#### LSTM + Attention-based Seq2Seq
`EncoderRNN` (LSTM Encoder):

- Takes the input (source language) sequence, embeds the tokens into dense vectors, and passes them through an LSTM network to capture contextual information.
- Outputs both the full sequence of hidden states (for attention) and the final hidden and cell states (for initializing the decoder).

In [10]:
class EncoderRNN(nn.Module):
    """LSTM Encoder"""
    def __init__(self, input_dim, embed_dim, hidden_dim, num_layers=1, dropout=0.2):
        super(EncoderRNN, self).__init__()
        
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, dropout=dropout, batch_first=True)
        
    def forward(self, src):
        embedded = self.embedding(src)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell 

`AttentionDecoderRNN` (LSTM Decoder with Attention):
- At each decoding step, it uses an attention mechanism to dynamically focus on different parts of the encoder's hidden states.
- Combines the current embedded decoder input with the attention context, processes it through an LSTM, and predicts the next token in the sequence.

In [11]:
class AttentionDecoderRNN(nn.Module):
    """LSTM Decoder with Attention"""
    def __init__(self, output_dim, embed_dim, hidden_dim, num_layers=1, dropout=0.2):
        super(AttentionDecoderRNN, self).__init__()
        
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.attention = nn.Linear(hidden_dim + embed_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim + embed_dim, hidden_dim, num_layers=num_layers, dropout=dropout, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, input, hidden, cell, encoder_outputs):
        input = input.unsqueeze(1)  # (batch_size, 1)
        embedded = self.embedding(input)  # (batch_size, 1, embed_dim)
        
        # Calculate attention weights
        hidden_broadcast = hidden[-1].unsqueeze(1)  # (batch_size, 1, hidden_dim)
        attn_weights = torch.bmm(hidden_broadcast, encoder_outputs.transpose(1, 2))  # (batch_size, 1, seq_len)
        attn_weights = torch.softmax(attn_weights, dim=-1)
        
        # Context vector
        context = torch.bmm(attn_weights, encoder_outputs)  # (batch_size, 1, hidden_dim)
        
        # Concatenate context and embedding
        rnn_input = torch.cat((embedded, context), dim=2)  # (batch_size, 1, hidden_dim + embed_dim)
        
        # Pass through LSTM
        output, (hidden, cell) = self.lstm(rnn_input, (hidden, cell))
        
        prediction = self.fc_out(output.squeeze(1))  # (batch_size, output_dim)
        
        return prediction, hidden, cell


`Seq2SeqModel` (Seq2Seq Wrapper):
- A wrapper that connects the encoder and decoder together into a full translation model.
- Defines the overall sequence-to-sequence forward pass, starting by encoding the source sequence and then decoding the target sequence one token at a time using teacher forcing during training.

In [12]:
class Seq2SeqModel(nn.Module):
    """Wrapper for LSTM Encoder-Decoder with Attention"""
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqModel, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, tgt):
        batch_size = src.size(0)
        tgt_len = tgt.size(1)
        tgt_vocab_size = self.decoder.fc_out.out_features
        
        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
        
        encoder_outputs, hidden, cell = self.encoder(src)
        
        # First input to the decoder is the <sos> tokens
        input = tgt[:, 0]
        
        for t in range(1, tgt_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t] = output
            input = tgt[:, t]  # Teacher forcing: next input is ground truth
        
        return outputs

## Methods
`train_epoch`:

- Trains the Transformer model for one full epoch using teacher forcing.
- Uses mixed precision (via autocast and GradScaler) for memory efficiency.
- Applies gradient clipping to prevent exploding gradients

In [13]:
scaler = GradScaler()  # For mixed precision training

def train_epoch(model, train_loader, optimizer, criterion, clip=1.0):
    """
    Train the model for one epoch using teacher forcing.
    
    Args:
        model (nn.Module): Transformer model
        train_loader (DataLoader): Batches of (src, tgt) pairs
        optimizer (Optimizer): Optimizer like Adam
        criterion (Loss): CrossEntropyLoss
        clip (float): Gradient clipping norm

    Returns:
        float: Average loss over the epoch
    """
    model.train()
    epoch_loss = 0
    progress_bar = tqdm(train_loader, desc="Training")

    for batch in progress_bar:
        src = batch['src'].to(device)  # Source sequence (input)
        tgt = batch['tgt'].to(device)  # Target sequence (label)

        tgt_input = tgt[:, :-1]  # Inputs for decoder (excluding <eos>)
        tgt_output = tgt[:, 1:]  # Ground truth for loss (excluding <sos>)

        optimizer.zero_grad()  # Clear previous gradients

        with autocast(device_type="cuda"):  # Enable mixed precision
            output = model(src, tgt_input)  # Forward pass
            output = output.contiguous().view(-1, output.shape[-1])
            tgt_output = tgt_output.contiguous().view(-1)

            loss = criterion(output, tgt_output)

        # Backpropagate with scaled loss
        scaler.scale(loss).backward()

        # Clip gradients to avoid exploding updates
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        # Step optimizer and update scale
        scaler.step(optimizer)
        scaler.update()

        # Track loss
        epoch_loss += loss.item()
        progress_bar.set_postfix({"loss": epoch_loss / (progress_bar.n + 1)})

    return epoch_loss / len(train_loader)


`evaluate`:

- Evaluate model performance on validation data without updating weights

**Function Workflow**:

- Set the model to evaluation mode (model.eval()).
- Initialize variables for total loss (epoch_loss) and BLEU scores (bleu_scores).
- Loop through validation data:
  - Make predictions with the model.
  - Compute loss by comparing predictions to target values.
  - Calculate BLEU score to evaluate prediction quality.
  - Optionally print translation examples for inspection.

**BLEU Score**:
- **Purpose**: BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated text by comparing n-grams (sequences of words) in the predicted output with those in reference translations.
- **Key Features**:
  - **N-gram Precision**: It checks how many n-grams (e.g., unigrams, bigrams) from the predicted sentence match those in the reference sentence.
  - **Brevity Penalty**: A penalty is applied if the predicted output is shorter than the reference, encouraging the model to produce more complete translations.
  - **Smoothing**: Smoothing is used to adjust BLEU calculations in cases where the model produces rare n-grams not present in the reference translations, ensuring more stable scores.
- **Usage**: BLEU is especially useful in tasks like machine translation, where we want to evaluate how well a machine-generated translation matches human-produced translations.
- **Limitations**: BLEU focuses on surface-level n-gram overlap and doesn't capture the full meaning or fluency of a sentence, so it's less effective for evaluating overall sentence quality or semantic accuracy.
  
**Output**:
- Returns the average loss and average BLEU score across the validation set.



In [14]:
def evaluate(model, val_loader, criterion, ko_tokenizer=None, en_tokenizer=None, print_examples=3):
    """
    Evaluate the Transformer model on validation data.
    
    Args:
        model (nn.Module): Trained Transformer model
        val_loader (DataLoader): Validation data batches
        criterion (Loss): Loss function (e.g. CrossEntropyLoss)
        ko_tokenizer (Tokenizer): Korean tokenizer (for decoding)
        en_tokenizer (Tokenizer): English tokenizer (for decoding)
        print_examples (int): Number of examples to print

    Returns:
        tuple: (average_loss, average_bleu_score)
    """
    model.eval()
    epoch_loss = 0
    bleu_scores = []
    smooth = SmoothingFunction().method4  # BLEU smoothing for short sentences
    examples_printed = 0

    with torch.no_grad():
        for i, batch in enumerate(tqdm(val_loader, desc="Evaluating")):
            if i >= 200:
                break  # For speed: evaluate only on 200 batches

            src = batch['src'].to(device)
            tgt = batch['tgt'].to(device)

            tgt_input = tgt[:, :-1]   # Decoder input (no <eos>)
            tgt_output = tgt[:, 1:]   # Ground-truth prediction (no <sos>)

            with autocast(device_type="cuda"):
                output = model(src, tgt_input)  # Forward pass
                output_flat = output.contiguous().view(-1, output.shape[-1])
                tgt_output_flat = tgt_output.contiguous().view(-1)

                loss = criterion(output_flat, tgt_output_flat)
                epoch_loss += loss.item()

            # --- BLEU Score Calculation + Sample Printing ---
            for j in range(src.size(0)):
                if examples_printed >= print_examples:
                    break

                # Run beam search for prediction
                src_seq = src[j].unsqueeze(0)
                pred_text = model.beam_search(src_seq, ko_tokenizer, en_tokenizer)

                # Decode ground-truth target
                tgt_sentence = [w for w in tgt[j].tolist() if w not in [0, 1, 2]]  # Remove <pad>, <sos>, <eos>
                ref_text = en_tokenizer.decode(tgt_sentence)

                # Calculate BLEU score (ref vs pred)
                bleu = sentence_bleu(
                    [ref_text.split()],
                    pred_text.split(),
                    smoothing_function=smooth
                )
                bleu_scores.append(bleu)

                # Print example translation
                src_text = ko_tokenizer.decode([w for w in src[j].tolist() if w not in [0, 1, 2]])
                print(f"\n--- Example {examples_printed + 1} ---")
                print(f"Source (KO): {src_text}")
                print(f"Target (EN): {ref_text}")
                print(f"Prediction : {pred_text}")
                examples_printed += 1

    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
    print(f"\nAverage BLEU score on validation set: {avg_bleu:.4f}")
    
    return epoch_loss / max(i, 1)  # Avoid divide-by-zero if no batches


`train_model`:

- Trains the model across multiple epochs with checkpointing and early stopping

**Function Workflow**:

- Trains the model using train_epoch() and validates using evaluate().
- Tracks the best validation loss and saves the corresponding model checkpoint.
- Stops training early if the validation loss does not improve for a specified number of epochs (patience).

Output:
- Returns the best validation loss achieved.

In [15]:
def train_model(model, train_loader, val_loader, optimizer, criterion,
                ko_tokenizer=None, en_tokenizer=None,
                num_epochs=20, patience=2, model_save_path='models'):

    os.makedirs(model_save_path, exist_ok=True)  # Ensure save directory exists

    best_val_loss = float('inf')  # Initialize with infinity
    patience_counter = 0          # Tracks how long validation loss hasn't improved
    scaler = GradScaler()         # AMP support

    for epoch in range(num_epochs):
        start_time = time.time()

        # === TRAIN ONE EPOCH ===
        train_loss = train_epoch(model, train_loader, optimizer, criterion)

        # === VALIDATION ===
        val_loss = evaluate(
            model, val_loader, criterion,
            ko_tokenizer=ko_tokenizer,
            en_tokenizer=en_tokenizer
        )

        end_time = time.time()
        print(f"\nEpoch {epoch+1}/{num_epochs} | Time: {end_time - start_time:.2f}s")
        print(f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

        # === SAVE BEST MODEL ===
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'train_loss': train_loss,
                'val_loss': val_loss,
            }, os.path.join(model_save_path, 'best_model.pt'))
            print(f"✅ Saved new best model (val loss {val_loss:.4f})")
            patience_counter = 0  # Reset patience if model improved
        else:
            patience_counter += 1
            print(f"⚠️ No improvement. Patience: {patience_counter}/{patience}")

        # === EARLY STOPPING ===
        if patience_counter >= patience:
            print(f"⏹️ Early stopping triggered after {epoch+1} epochs.")
            break

    return best_val_loss


## Experiments and Results
#### Load Data, Build Model, Train, and Test
`process_and_train()`
- Loads dataset from files.
- Loads or builds Korean/English tokenizers.
- Creates PyTorch DataLoaders.
- Instantiates either a Transformer or LSTM+Attention model.
- Trains the model and saves the best-performing checkpoint.

In [16]:
def process_and_train(
    directory='./translated',
    batch_size=64,
    d_model=512,
    num_epochs=20,
    learning_rate=0.0003,
    num_workers=4,
    use_cached_data=True,
    subset_fraction=0.5,
    model_type="transformer",
    model_save_path="models"
):
    start_time = time.time()

    # === LOAD DATA & TOKENIZERS ===
    if use_cached_data and os.path.exists('tokenizers/korean_tokenizer.pkl') and os.path.exists('tokenizers/english_tokenizer.pkl'):
        print("Loading tokenizers from cache...")
        ko_tokenizer = ImprovedKoreanTokenizer.load('tokenizers/korean_tokenizer.pkl')
        en_tokenizer = EnglishTokenizer.load('tokenizers/english_tokenizer.pkl')
        data = load_excel_files(directory)
    else:
        data = load_excel_files(directory)
        if len(data) == 0:
            print("No data loaded. Exiting.")
            return None
        print(f"Total data loaded: {len(data)} sentence pairs")

        # Build vocab from scratch
        ko_tokenizer = ImprovedKoreanTokenizer(max_vocab=50000)
        en_tokenizer = EnglishTokenizer(max_vocab=50000)

        print("Building Korean vocabulary...")
        ko_tokenizer.fit(data['korean'].tolist())

        print("Building English vocabulary...")
        en_tokenizer.fit(data['english'].tolist())

        os.makedirs('tokenizers', exist_ok=True)
        ko_tokenizer.save('tokenizers/korean_tokenizer.pkl')
        en_tokenizer.save('tokenizers/english_tokenizer.pkl')
        print("Tokenizers saved.")

    # === CREATE DATALOADERS ===
    train_loader, val_loader = create_dataloaders(
        data, ko_tokenizer, en_tokenizer,
        batch_size=batch_size,
        num_workers=num_workers,
        max_len=30,
        pin_memory=(device.type == 'cuda'),
        subset_fraction=subset_fraction
    )

    # === MODEL SELECTION ===
    if model_type == "transformer":
        model = TransformerModel(
            src_vocab_size=len(ko_tokenizer.word2idx),
            tgt_vocab_size=len(en_tokenizer.word2idx),
            d_model=d_model,
            nhead=8,
            num_encoder_layers=4,
            num_decoder_layers=4,
            dim_feedforward=1024,
            dropout=0.1
        ).to(device)

    elif model_type == "lstm":
        INPUT_DIM = len(ko_tokenizer.word2idx)
        OUTPUT_DIM = len(en_tokenizer.word2idx)
        EMBED_DIM = 256
        HIDDEN_DIM = 512
        encoder = EncoderRNN(INPUT_DIM, EMBED_DIM, HIDDEN_DIM)
        decoder = AttentionDecoderRNN(OUTPUT_DIM, EMBED_DIM, HIDDEN_DIM)
        model = Seq2SeqModel(encoder, decoder, device).to(device)

    else:
        raise ValueError(f"Unsupported model_type: {model_type}")

    # === SETUP TRAINING ===
    if device.type == 'cuda':
        torch.backends.cudnn.benchmark = True  # Optimize for fixed input size

    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)

    # === TRAINING LOOP ===
    print("\nStarting model training...")
    best_val_loss = train_model(
        model, train_loader, val_loader,
        optimizer, criterion,
        ko_tokenizer=ko_tokenizer,
        en_tokenizer=en_tokenizer,
        num_epochs=num_epochs,
        model_save_path=model_save_path
    )

    end_time = time.time()
    print(f"\nTraining completed in {end_time - start_time:.2f} seconds")
    print(f"Best validation loss: {best_val_loss:.4f}")

    return model, ko_tokenizer, en_tokenizer


#### Entry Point
Run the process when this file is executed.

- Change the `model_type` to switch between **trandformer** and **LSTM+Attention**

In [None]:
if __name__ == "__main__":
    model, ko_tokenizer, en_tokenizer = process_and_train(
        directory='./translated',
        use_cached_data=False,
        batch_size=16,             
        num_epochs=20,
        d_model=512,
        learning_rate=0.00025,
        num_workers=2,              
        subset_fraction=0.3,
        model_type="transformer"
    )

Loading combined data from cache: ./translated/combined_data_cache.pkl
Total data loaded: 1602058 sentence pairs
Building Korean vocabulary...


Building vocabulary: 100%|██████████| 1602058/1602058 [02:37<00:00, 10162.75it/s]


Vocabulary size: 50000
Building English vocabulary...


Building vocabulary: 100%|██████████| 1602058/1602058 [03:57<00:00, 6732.76it/s] 


Vocabulary size: 50000
Tokenizers saved.
Using 30.0% of the data: 480617 samples
Training data: 384493 samples
Validation data: 96124 samples
Loading cached dataset from dataset_cache/train/cached_data_384493_30.pkl


  return torch.load(io.BytesIO(b))


Loading cached dataset from dataset_cache/val/cached_data_96124_30.pkl

Starting model training...


Training: 100%|██████████| 24031/24031 [33:09<00:00, 12.08it/s, loss=5.91]
Evaluating:   0%|          | 0/6008 [00:00<?, ?it/s]


--- Example 1 ---
Source (KO): 심의회 의 위원장 을 제외 한 위원 은 소속 공무원 , 외부 전문가 로 지명 하 거나 위촉 하 되 , 그 중 2 분 의 1 은 정보
Target (EN): the members other than the caused of the college council shall be several if god as affiliated public officials and articles middle , and one half of them shall
Prediction : <sos> which official of the members shall be transport by the members of the members of the members of the members of the members of the members of the members of the members of the members of the members of the gu shall be the members of the members of the

--- Example 2 ---
Source (KO): 이 영화 에서 이미지 와 음악 의 결합 은 영화 에서 음악 을 어떻게 활용 해야 하 는지 를 정확 하 게 보여 주 었 습니다 .
Target (EN): the spaces of the release and music in this changed technique child how music should be used in settlement .
Prediction : <sos> this is because i think that the changed is the most civil of my raised in the world .


Evaluating:   0%|          | 6/6008 [00:04<1:01:14,  1.63it/s]


--- Example 3 ---
Source (KO): 가계 부채 증가 율 이 2015 년 과 2016 년 10 ∼ 11 % 수준 이 었 던 것 과 비교 하 면 속도 가 줄 었 지만 명목
Target (EN): the growth rate of recognized debt , not ships at between and percent in and , has slowed down , but it is still high science to the mw
Prediction : <sos> in the past years of the year , it was a lot of time in the past years , but it was an headquarters for years .


Evaluating:   3%|▎         | 200/6008 [00:08<04:14, 22.78it/s]



Average BLEU score on validation set: 0.0336

Epoch 1/20 | Time: 1997.90s
Train Loss: 5.9059 | Val Loss: 5.4147
✅ Saved new best model (val loss 5.4147)


Training:  96%|█████████▌| 23024/24031 [31:53<01:21, 12.30it/s, loss=5.24]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

Training:  18%|█▊        | 4232/24031 [05:55<30:49, 10.71it/s, loss=4.94]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

Training:  99%|█████████▉| 23860/24031 [32:45<00:13, 12.41it/s, loss=4.82]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change thi


--- Example 1 ---
Source (KO): 심의회 의 위원장 을 제외 한 위원 은 소속 공무원 , 외부 전문가 로 지명 하 거나 위촉 하 되 , 그 중 2 분 의 1 은 정보
Target (EN): the members other than the caused of the college council shall be several if god as affiliated public officials and articles middle , and one half of them shall
Prediction : <sos> members shall be transport if god by the caused from among the members of the college council , and the members shall be transport if god by the caused from among the members of the college council , and the members shall be transport by the caused from among the

--- Example 2 ---
Source (KO): 이 영화 에서 이미지 와 음악 의 결합 은 영화 에서 음악 을 어떻게 활용 해야 하 는지 를 정확 하 게 보여 주 었 습니다 .
Target (EN): the spaces of the release and music in this changed technique child how music should be used in settlement .
Prediction : <sos> the music inspection users how to use the changed and the music release of music in the changed .


Evaluating:   0%|          | 6/6008 [00:05<1:05:22,  1.53it/s]


--- Example 3 ---
Source (KO): 가계 부채 증가 율 이 2015 년 과 2016 년 10 ∼ 11 % 수준 이 었 던 것 과 비교 하 면 속도 가 줄 었 지만 명목
Target (EN): the growth rate of recognized debt , not ships at between and percent in and , has slowed down , but it is still high science to the mw
Prediction : <sos> although it was a rate of increase in recognized debt , the increase in the number of recognized debt increased by science to the year , and the increase rate of recognized debt increased . science to .


Evaluating:   3%|▎         | 200/6008 [00:09<04:28, 21.64it/s]



Average BLEU score on validation set: 0.0799

Epoch 6/20 | Time: 2011.47s
Train Loss: 4.1170 | Val Loss: 4.1875
✅ Saved new best model (val loss 4.1875)


Training:  69%|██████▊   | 16483/24031 [23:08<11:53, 10.58it/s, loss=3.94]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

Training:  92%|█████████▏| 22197/24031 [30:58<02:29, 12.26it/s, loss=3.94]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

Training:  24%|██▍       | 5774/24031 [08:07<25:16, 12.04it/s, loss=3.79]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change thi

#### RESULTS

We were unable to fully complete the training of our beam search translation model due to repeated kernel crashes—primarily caused by out-of-memory (OOM) errors during training on limited hardware.

Despite these interruptions, we observed promising improvements in early epochs:
- Epoch 1: Validation loss = 5.415, BLEU score = 0.0336
- Epoch 6: Validation loss dropped to 4.188, BLEU score improved to 0.0799

This upward trajectory suggests that with greater memory and a more powerful GPU, the beam search model has strong potential to continue improving and converge effectively.

We remain optimistic about resuming and finalizing training under better computational conditions.