## Bridging Languages with Deep Learning: Building a Korean-English Translator
**By Eunice Tu & Yewon Park**

In this project, we build a Neural Machine Translation (NMT) system to translate Korean text into English using deep learning models. We experiment with two main architectures: an LSTM-based Sequence-to-Sequence (Seq2Seq) model with attention, and a Transformer-based model. Our aim is to develop models that produce fluent, accurate translations that outperform traditional methods.

---

### Data Preprocessing
#### Data Source:
We used the AI Hub Korean-English Parallel Corpus, a professionally curated dataset provided by the Korean government. It contains over one million aligned Korean-English sentence pairs across diverse domains such as news articles, spoken conversations, legal documents, IT, and patents. This dataset is well-suited for our translation project because:

- It provides clean, high-quality translations covering both formal and informal language.

- It offers sufficient volume to train deep learning models effectively.

- It reflects a variety of real-world contexts, improving the model's generalization ability.

---

#### Import Libraries

In [5]:
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import re
import random
import numpy as np
from tqdm import tqdm
from pathlib import Path
import time
import pickle
import math
from konlpy.tag import Mecab
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


#### Setup and Configuration
Configures the computing environment so that results are consistent across runs:

- Checks for GPU availability with PyTorch and sets the device accordingly
- Sets random seeds for PyTorch, NumPy, and Python's random module to ensure reproducibility

In [6]:
# Initial setup, device configuration, and random seed settings
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

Using device: cpu


#### Text Preprocessing

To ensure that the input data is clean, standardized, and properly formatted for efficient model training, we applied the following preprocessing steps:

- **Normalization:**  
  English text is converted to lowercase, spaces are added around the punctuation marks `.`, `,`, `?`, and `!`, every other non-alphabetic character (digits and apostrophes included) is removed, and consecutive spaces are collapsed into a single space. Korean text gets only the punctuation spacing and whitespace collapsing, so Hangul is left intact. Both functions return an empty string for non-string inputs such as missing values.

- **Tokenization:**  
  After normalization, English is split into words on whitespace and Korean is segmented into morphemes with MeCab. Each vocabulary entry is assigned a unique index, and four special tokens (`<pad>`, `<sos>`, `<eos>`, and `<unk>`) handle padding, sequence start, sequence end, and unknown words respectively.

- **Sequence Preparation:**  
  Each sentence was wrapped with `<sos>` (start of sentence) and `<eos>` (end of sentence) tokens. Sequences were either padded or truncated to a fixed maximum length, allowing for efficient batching during model training.

- **Data Splitting:**  
  The data is split into training and validation sets (80/20 by default in `create_dataloaders`). There is no held-out test set yet; in future work, we plan to add one to better evaluate the model’s generalization performance.

These preprocessing steps collectively transform raw text into a structured format that is clean, consistent, and optimized for training effective machine learning models.

In [7]:
def preprocess_english(sentence):
    """Clean English: lowercase, remove symbols, keep letters."""
    if isinstance(sentence, str):
        sentence = sentence.lower().strip()
        sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
        sentence = re.sub(r'[^a-zA-Z?.!,\s]', '', sentence)
        # Collapse repeated spaces and drop the trailing space left after final punctuation
        sentence = re.sub(r'\s+', ' ', sentence).strip()
        return sentence
    return ""

def preprocess_korean(sentence):
    """For Korean, just trim extra spaces (DO NOT strip Hangul)."""
    if isinstance(sentence, str):
        sentence = sentence.strip()
        sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
        # Collapse repeated spaces and drop the trailing space left after final punctuation
        sentence = re.sub(r'\s+', ' ', sentence).strip()
        return sentence
    return ""
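
To make the normalization rules concrete, here is a small standalone sketch that applies the same English regex steps to one sentence. The name `normalize_english` and the sample sentence are illustrative only, and a final `.strip()` is assumed so the result carries no trailing space.

```python
import re

def normalize_english(sentence):
    """Illustrative re-implementation of the English cleaning steps."""
    if not isinstance(sentence, str):
        return ""  # e.g. NaN or None from a spreadsheet cell
    sentence = sentence.lower().strip()
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)      # space out punctuation
    sentence = re.sub(r"[^a-zA-Z?.!,\s]", "", sentence)     # drop digits, apostrophes, symbols
    return re.sub(r"\s+", " ", sentence).strip()            # collapse whitespace

print(normalize_english("Hello, World!  It's 2024."))
# hello , world ! its .
```

Note that digits and apostrophes disappear ("It's 2024" becomes "its"), which may matter for sentences that carry numbers or contractions.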

#### Tokenizer Class
The tokenizer classes (`ImprovedKoreanTokenizer` and `EnglishTokenizer`) convert text into token IDs and back, so that encoding and decoding stay consistent. Both classes share the same design:

- Initialized with a vocabulary size limit 
- Contains special tokens: `<pad>`(0), `<sos>`(1), `<eos>`(2), and `<unk>`(3)
- Builds vocabulary based on word frequency in the training data
- The most common words are included in the vocabulary up to the max limit
- Provides methods to encode sentences into token IDs and decode IDs back to text
- Implements save/load functionality for persistence between runs
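
The frequency-based vocabulary described above can be sketched in a few lines. This is a toy version under stated assumptions: the function names and the three-sentence corpus are hypothetical, and whitespace splitting stands in for the real tokenizers.

```python
from collections import Counter

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # IDs 0-3, as in the tokenizer classes

def build_vocab(sentences, max_vocab):
    """Map the most frequent words to IDs, after the four special tokens."""
    counts = Counter(word for sent in sentences for word in sent.split())
    word2idx = {tok: i for i, tok in enumerate(SPECIALS)}
    for word, _ in counts.most_common(max_vocab - len(SPECIALS)):
        word2idx[word] = len(word2idx)
    return word2idx

def encode(sentence, word2idx):
    unk = word2idx["<unk>"]
    return [word2idx.get(word, unk) for word in sentence.split()]

def decode(ids, word2idx):
    idx2word = {i: w for w, i in word2idx.items()}
    pad = word2idx["<pad>"]
    return " ".join(idx2word.get(i, "<unk>") for i in ids if i != pad)

corpus = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocab(corpus, max_vocab=6)   # room for only two real words
ids = encode("the dog sat", vocab)
print(ids)                  # [4, 3, 3]
print(decode(ids, vocab))   # the <unk> <unk>
```

With `max_vocab=6`, only "the" and "cat" make it into the vocabulary, so every other word falls back to `<unk>` (ID 3).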

`ImprovedKoreanTokenizer`

This tokenizer segments Korean text at the morpheme level:

- The `tokenize()` method uses `mecab.morphs()` to segment Korean text into morphological units. Morphological units (or morphemes) are the smallest units of a language that carry meaning or a grammatical function.
- During vocabulary building, it counts frequency of each morphological unit
- Words are sorted by frequency and limited to the maximum vocabulary size
- The `encode()` method converts Korean text into token IDs using the built vocabulary
- The `decode()` method converts token IDs back into readable Korean text

In [8]:
# Initialize the Mecab tokenizer for Korean
mecab = Mecab()
def tokenize_korean(text):
    return mecab.morphs(text)

#### Why Morphological Analysis Matters for Korean
The `ImprovedKoreanTokenizer` uses MeCab (through the `mecab.morphs()` function) to perform morphological analysis because:

- Korean is an agglutinative language where numerous grammatical elements attach to word stems
- Simply splitting by spaces would be ineffective as Korean sentences often have many morphemes within a single space-separated "word"
- Character-level splitting would lose the semantic meaning carried by morphological units
- Proper morphological segmentation provides much better input for NLP tasks like translation or sentiment analysis

This is why the Korean tokenizer needs specialized processing while the English tokenizer can rely on simpler space-based splitting.

In [9]:
class ImprovedKoreanTokenizer:
    """Korean-specific tokenizer with morpheme-level segmentation (MeCab)"""
    def __init__(self, max_vocab=50000):
        self.word2idx = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
        self.idx2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.word_freq = {}
        self.max_vocab = max_vocab

    def tokenize(self, sentence):
        """Tokenize Korean text appropriately"""
        return tokenize_korean(sentence)

    def fit(self, sentences):
        """Build vocabulary from sentences"""
        for sent in tqdm(sentences, desc="Building vocabulary"):
            for word in self.tokenize(sent):
                self.word_freq[word] = self.word_freq.get(word, 0) + 1

        # Sort by frequency and limit vocab size
        sorted_words = sorted(self.word_freq.items(), key=lambda x: x[1], reverse=True)
        vocab_limit = min(len(sorted_words), self.max_vocab - 4)

        idx = 4
        for word, _ in sorted_words[:vocab_limit]:
            self.word2idx[word] = idx
            self.idx2word[idx] = word
            idx += 1

        print(f"Vocabulary size: {len(self.word2idx)}")

    def encode(self, sentence):
        """Convert sentence to token IDs"""
        return [self.word2idx.get(word, self.word2idx["<unk>"]) for word in self.tokenize(sentence)]

    def decode(self, indices):
        """Convert token IDs back to text"""
        return ' '.join([self.idx2word.get(idx, "<unk>") for idx in indices if idx != self.word2idx["<pad>"]])

    def save(self, path):
        """Save tokenizer state"""
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        """Load tokenizer state"""
        obj = cls()
        with open(path, 'rb') as f:
            obj.__dict__.update(pickle.load(f))
        return obj

In [10]:
class EnglishTokenizer:
    """Simple word-level tokenizer for English"""
    def __init__(self, max_vocab=30000):
        self.word2idx = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
        self.idx2word = {0: "<pad>", 1: "<sos>", 2: "<eos>", 3: "<unk>"}
        self.word_freq = {}
        self.max_vocab = max_vocab

    def tokenize(self, sentence):
        """Split English text into words"""
        return sentence.split()

    def fit(self, sentences):
        for sent in tqdm(sentences, desc="Building vocabulary"):
            for word in self.tokenize(sent):
                self.word_freq[word] = self.word_freq.get(word, 0) + 1

        sorted_words = sorted(self.word_freq.items(), key=lambda x: x[1], reverse=True)
        vocab_limit = min(len(sorted_words), self.max_vocab - 4)

        idx = 4
        for word, _ in sorted_words[:vocab_limit]:
            self.word2idx[word] = idx
            self.idx2word[idx] = word
            idx += 1

        print(f"Vocabulary size: {len(self.word2idx)}")

    def encode(self, sentence):
        return [self.word2idx.get(word, self.word2idx["<unk>"]) for word in self.tokenize(sentence)]

    def decode(self, indices):
        return ' '.join([self.idx2word.get(idx, "<unk>") for idx in indices if idx != self.word2idx["<pad>"]])

    def save(self, path):
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        obj = cls()
        with open(path, 'rb') as f:
            obj.__dict__.update(pickle.load(f))
        return obj

#### Dataset Class with Caching
The `CachedTranslationDataset` class is designed to improve training efficiency for machine translation tasks by preprocessing and caching tokenized data.

- **Built on PyTorch Dataset:**  
  Inherits from `torch.utils.data.Dataset`, making it fully compatible with PyTorch's `DataLoader` for efficient batching and shuffling.

- **Caching Mechanism:**  
  Preprocesses each data sample once, keeps the result in memory, and pickles it to disk, so tokenization is not repeated on every epoch or on later runs.

- **Tokenization and Sequence Preparation:**  
  - Converts source sentences (e.g., Korean) and target sentences (e.g., English) into token ID sequences.
  - Adds special tokens like `<sos>` (start of sentence) and `<eos>` (end of sentence).
  - Applies padding or truncation to ensure all sequences have a fixed maximum length.

- **Attention Mask Creation:**  
  Generates attention masks that distinguish between actual tokens and padding tokens, helping the model ignore padded positions during training.

- **Model-Ready Outputs:**  
  Returns tensors for:
  - Source input IDs and attention masks
  - Target input IDs and labels  
  These tensors are ready to be directly fed into the model for training or evaluation.
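
The sequence preparation and mask creation can be illustrated without PyTorch. The sketch below is a variant, not the class's exact logic: the hypothetical helper `prepare` reserves two slots for `<sos>` and `<eos>` before truncating, so a long sentence still ends with `<eos>`.

```python
PAD, SOS, EOS = 0, 1, 2

def prepare(token_ids, max_len):
    """Wrap with <sos>/<eos>, truncate, pad, and build a 1/0 attention mask."""
    seq = [SOS] + token_ids[:max_len - 2] + [EOS]
    n_real = len(seq)
    mask = [1] * n_real + [0] * (max_len - n_real)   # 1 = real token, 0 = padding
    seq = seq + [PAD] * (max_len - n_real)
    return seq, mask

print(prepare([7, 8, 9], max_len=6))
# ([1, 7, 8, 9, 2, 0], [1, 1, 1, 1, 1, 0])
print(prepare([7, 8, 9, 10, 11, 12], max_len=6))
# ([1, 7, 8, 9, 10, 2], [1, 1, 1, 1, 1, 1])
```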

In [11]:
class CachedTranslationDataset(Dataset):
    """Dataset with caching for faster loading"""
    def __init__(self, df, src_tokenizer, tgt_tokenizer, max_len=40, cache_dir='dataset_cache'):
        self.df = df
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer
        self.max_len = max_len
        self.cache_dir = cache_dir
        self.cache_file = os.path.join(cache_dir, f"cached_data_{len(df)}_{max_len}.pkl")
        
        # Create cache directory if it doesn't exist
        os.makedirs(cache_dir, exist_ok=True)
        
        # Try to load from cache, otherwise process data
        self.cached_data = self._load_or_create_cache()

    def _load_or_create_cache(self):
        if os.path.exists(self.cache_file):
            print(f"Loading cached dataset from {self.cache_file}")
            with open(self.cache_file, 'rb') as f:
                return pickle.load(f)
        
        print("Creating new dataset cache...")
        cached_data = []
        
        for idx in tqdm(range(len(self.df)), desc="Processing dataset"):
            src_text = self.df.iloc[idx]['korean']
            tgt_text = self.df.iloc[idx]['english']
            
            src_seq = [1] + self.src_tokenizer.encode(src_text) + [2]  # <sos> + sentence + <eos>
            tgt_seq = [1] + self.tgt_tokenizer.encode(tgt_text) + [2]  # <sos> + sentence + <eos>
            
            # Truncate sequences to max_len (a sentence longer than max_len - 2 tokens loses its <eos>)
            src_seq = src_seq[:self.max_len]
            tgt_seq = tgt_seq[:self.max_len]
            
            # Record the true lengths before padding
            src_len = len(src_seq)
            tgt_len = len(tgt_seq)
            
            # Create source and target masks (1 = real token, 0 = padding)
            src_mask = [1] * src_len + [0] * (self.max_len - src_len)
            tgt_mask = [1] * tgt_len + [0] * (self.max_len - tgt_len)
            
            # Pad sequences
            src_seq += [0] * (self.max_len - src_len)
            tgt_seq += [0] * (self.max_len - tgt_len)
            
            cached_data.append({
                'src': torch.tensor(src_seq, dtype=torch.long),
                'tgt': torch.tensor(tgt_seq, dtype=torch.long),
                'src_mask': torch.tensor(src_mask, dtype=torch.bool),
                'tgt_mask': torch.tensor(tgt_mask, dtype=torch.bool),
                'src_len': src_len,
                'tgt_len': tgt_len
            })
        
        # Save to cache
        print(f"Saving dataset cache to {self.cache_file}")
        with open(self.cache_file, 'wb') as f:
            pickle.dump(cached_data, f)
        
        return cached_data

    def __len__(self):
        return len(self.cached_data)

    def __getitem__(self, idx):
        return self.cached_data[idx]

#### Data Loading Functions
`load_excel_files`:
- Combines all Excel files from a specified directory into a single DataFrame
- Automatically identifies the Korean and English columns and preprocesses the text
- Caches the combined DataFrame for faster reloading in future runs

`create_dataloaders`:
- Optionally subsamples the data to use only a fraction of it (30% by default)
- Splits the result into training and validation sets
- Creates PyTorch `DataLoader` objects to efficiently batch and feed the data to the model during training and evaluation
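
The subsample-then-split logic can be sketched on plain row indices. This is a minimal illustration, assuming a shuffle before the split; `split_indices` and its defaults are hypothetical names, not part of the notebook.

```python
import random

def split_indices(n_rows, train_ratio=0.8, subset_fraction=1.0, seed=42):
    """Shuffle row indices, keep a fraction of them, then split train/validation."""
    rng = random.Random(seed)            # local RNG, so the split is repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    indices = indices[:int(n_rows * subset_fraction)]
    cut = int(len(indices) * train_ratio)
    return indices[:cut], indices[cut:]

train_idx, val_idx = split_indices(100, subset_fraction=0.5)
print(len(train_idx), len(val_idx))   # 40 10
```

Shuffling first keeps the validation set from being drawn only from whichever file happened to be loaded last.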


In [None]:
def load_excel_files(directory='./translated', pattern="*translated.xlsx"):
    """Load and combine all Excel files matching the pattern"""
    all_data = []
    cache_file = os.path.join(directory, "combined_data_cache.pkl")
    
    # Try to load from cache
    if os.path.exists(cache_file):
        print(f"Loading combined data from cache: {cache_file}")
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    
    files = list(Path(directory).glob(pattern))
    
    if not files:
        print(f"No files matching pattern '{pattern}' found in directory '{directory}'")
        return pd.DataFrame()
    
    print(f"Found {len(files)} files matching the pattern")
    
    for file_path in tqdm(files, desc="Loading Excel files"):
        try:
            # Attempt to identify Korean and English columns based on common patterns
            df = pd.read_excel(file_path)
            
            # Try to automatically detect Korean and English columns
            korean_col = None
            english_col = None
            
            # Common column name patterns
            korean_patterns = ['korean', 'ko', '한국어', 'source']
            english_patterns = ['english', 'en', '영어', 'target']
            
            # Check column names
            for col in df.columns:
                col_lower = str(col).lower()
                if any(pattern in col_lower for pattern in korean_patterns):
                    korean_col = col
                if any(pattern in col_lower for pattern in english_patterns):
                    english_col = col
            
            # If automatic detection fails, use the first two columns
            if korean_col is None or english_col is None:
                if len(df.columns) >= 2:
                    korean_col = df.columns[0]
                    english_col = df.columns[1]
                    print(f"Using columns: {korean_col} and {english_col} for {file_path.name}")
                else:
                    print(f"Skipping {file_path.name}: Not enough columns")
                    continue
            
            # Extract and rename columns
            file_data = df[[korean_col, english_col]].copy()
            file_data.columns = ['korean', 'english']
            
            # Add source file information
            file_data['source_file'] = file_path.name
            
            # Append to combined data
            all_data.append(file_data)
            
        except Exception as e:
            print(f"Error processing {file_path.name}: {e}")
    
    if not all_data:
        return pd.DataFrame()
    
    # Combine all dataframes
    combined_data = pd.concat(all_data, ignore_index=True)
    
    # Clean the data
    combined_data = combined_data.dropna()
    combined_data['english'] = combined_data['english'].apply(preprocess_english)
    combined_data['korean'] = combined_data['korean'].apply(preprocess_korean)
    
    # Remove rows with too short sentences
    combined_data = combined_data[
        (combined_data['english'].str.split().str.len() > 3) &
        (combined_data['korean'].str.split().str.len() > 3)
    ]

    # Remove rows with empty strings
    combined_data = combined_data[(combined_data['english'] != '') & (combined_data['korean'] != '')]
    
    # Save to cache
    print(f"Saving combined data to cache: {cache_file}")
    with open(cache_file, 'wb') as f:
        pickle.dump(combined_data, f)
    
    return combined_data

def create_dataloaders(data, ko_tokenizer, en_tokenizer, train_ratio=0.8, 
                       batch_size=64, max_len=40, num_workers=4, pin_memory=True, 
                       subset_fraction=0.3):  # Added subset_fraction parameter
    """Create train and validation DataLoaders with optimized settings"""
    
    # Shuffle once (and optionally subsample) so the split below is not tied to file order
    data = data.sample(frac=min(subset_fraction, 1.0), random_state=42).reset_index(drop=True)
    if subset_fraction < 1.0:
        print(f"Using {subset_fraction:.0%} of the data: {len(data)} samples")
    
    # Split data into train and validation sets
    train_size = int(len(data) * train_ratio)
    train_data = data.iloc[:train_size]
    val_data = data.iloc[train_size:]
    
    print(f"Training data: {len(train_data)} samples")
    print(f"Validation data: {len(val_data)} samples")
    
    # Create datasets with caching
    train_dataset = CachedTranslationDataset(train_data, ko_tokenizer, en_tokenizer, max_len, cache_dir='dataset_cache/train')
    val_dataset = CachedTranslationDataset(val_data, ko_tokenizer, en_tokenizer, max_len, cache_dir='dataset_cache/val')
    
    # Create data loaders
    train_loader = DataLoader(
        train_dataset, 
        batch_size=batch_size, 
        shuffle=True,
        num_workers=num_workers,
        pin_memory=pin_memory
    )
    
    val_loader = DataLoader(
        val_dataset, 
        batch_size=batch_size, 
        shuffle=False,
        num_workers=num_workers,
        pin_memory=pin_memory
    )
    
    return train_loader, val_loader

#### Model Architectures
`PositionalEncoding`:

- Adds positional information to word embeddings (because transformer is order-agnostic otherwise).

- Uses sine/cosine functions for encoding.
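
For reference, the standard sinusoidal encoding for position $pos$ and dimension pair index $i$ is written out below. The `div_term` in the code computes $10000^{-2i/d_{model}}$ as $\exp\left(2i \cdot (-\ln 10000 / d_{model})\right)$, which is the same quantity.

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right)
```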

`TransformerModel`:

- Full Transformer encoder-decoder model.

- Embeds input/output sequences + positional encodings.

- Defines masking logic for padding and future tokens.

- Forward pass to compute outputs for loss calculation.

- Includes a translate() method to perform greedy translation (step-by-step prediction).
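
The two kinds of masks can be shown on small lists. This is a sketch in plain Python, assuming the same convention as `torch.triu(..., diagonal=1)` and a padding ID of 0: `True` marks a position that must be ignored. The helper names are illustrative.

```python
def causal_mask(size):
    """True marks a future position the decoder is NOT allowed to attend to."""
    return [[col > row for col in range(size)] for row in range(size)]

def padding_mask(token_ids, pad_id=0):
    """True marks a padding position."""
    return [tok == pad_id for tok in token_ids]

for row in causal_mask(4):
    print(" ".join("x" if blocked else "." for blocked in row))
# . x x x
# . . x x
# . . . x
# . . . .

print(padding_mask([1, 15, 27, 2, 0, 0]))
# [False, False, False, False, True, True]
```

Row `t` of the causal mask lets target position `t` see positions `0..t` only, which is what keeps training consistent with step-by-step decoding.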


In [None]:
# Positional Encoding for Transformer
class PositionalEncoding(nn.Module):
    """Positional encoding for the transformer model"""
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
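        # Shape (1, max_len, d_model): the model uses batch_first=True,
        # so positions run along dim 1 and the encoding broadcasts over the batch
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return x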

class TransformerModel(nn.Module):
    """Transformer model for machine translation"""
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256, nhead=8, 
                 num_encoder_layers=4, num_decoder_layers=4, dim_feedforward=1024, dropout=0.2):
        super(TransformerModel, self).__init__()
        
        # Embedding layers
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)
        
        # Transformer architecture
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True
        )
        
        # Output layer
        self.output_layer = nn.Linear(d_model, tgt_vocab_size)
        
        # Initialize parameters
        self._init_parameters()
        
        # Model hyper-parameters
        self.d_model = d_model
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size

    def _init_parameters(self):
        """Initialize model parameters"""
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def create_mask(self, src, tgt):
        """Create masks for transformer"""
        # Source padding mask (for encoder)
        src_padding_mask = (src == 0).to(device)
        
        # Target padding mask (for decoder)
        tgt_padding_mask = (tgt == 0).to(device)
        
        # Target attention mask (for decoder self-attention)
        tgt_len = tgt.size(1)
        tgt_attention_mask = torch.triu(
            torch.ones(tgt_len, tgt_len), diagonal=1
        ).bool().to(device)
        
        return src_padding_mask, tgt_padding_mask, tgt_attention_mask

    def forward(self, src, tgt):
        """Forward pass"""
        src_padding_mask, tgt_padding_mask, tgt_attention_mask = self.create_mask(src, tgt)
        
        # Embed source and target sequences
        src_embedded = self.positional_encoding(self.src_embedding(src) * math.sqrt(self.d_model))
        tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt) * math.sqrt(self.d_model))
        
        # Apply transformer model
        output = self.transformer(
            src=src_embedded,
            tgt=tgt_embedded,
            src_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=src_padding_mask,
            tgt_mask=tgt_attention_mask
        )
        
        # Apply output layer
        output = self.output_layer(output)
        
        return output

    def translate(self, src, src_tokenizer, tgt_tokenizer, max_len=50):
        """Translate a source sentence"""
        self.eval()
        with torch.no_grad():
            # Preprocess source sentence
            if isinstance(src, str):
                src_tokens = [1] + src_tokenizer.encode(src) + [2]  # <sos> + sentence + <eos>
                src = torch.tensor([src_tokens]).to(device)
            
            # Initialize target with <sos> token
            tgt = torch.tensor([[1]]).to(device)  # <sos> token
            
            for _ in range(max_len):
                # Generate prediction
                src_padding_mask, tgt_padding_mask, tgt_attention_mask = self.create_mask(src, tgt)
                
                # Embed source and target sequences
                src_embedded = self.positional_encoding(self.src_embedding(src) * math.sqrt(self.d_model))
                tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt) * math.sqrt(self.d_model))
                
                # Apply transformer model
                output = self.transformer(
                    src=src_embedded,
                    tgt=tgt_embedded,
                    src_key_padding_mask=src_padding_mask,
                    tgt_key_padding_mask=tgt_padding_mask,
                    memory_key_padding_mask=src_padding_mask,
                    tgt_mask=tgt_attention_mask
                )
                
                # Apply output layer and get next token prediction
                output = self.output_layer(output)
                next_token = output[:, -1].argmax(dim=1).unsqueeze(1)
                
                # Append to target sequence
                tgt = torch.cat([tgt, next_token], dim=1)
                
                # Stop if <eos> token is generated
                if next_token.item() == 2:
                    break
            
            # Convert token IDs to sentence
            output_tokens = tgt.squeeze().tolist()
            translated = tgt_tokenizer.decode(output_tokens)
            
            return translated
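
The greedy loop in `translate()` can be isolated from the model. In the sketch below, `toy_scores` is a hypothetical stand-in for the Transformer: it returns one score per vocabulary ID given the tokens generated so far.

```python
SOS, EOS = 1, 2

def greedy_decode(next_token_scores, max_len=10):
    """Repeatedly append the highest-scoring token until <eos> or max_len."""
    tokens = [SOS]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(best)
        if best == EOS:
            break
    return tokens

def toy_scores(prefix):
    """Hypothetical model: the preferred next token depends only on the last one."""
    preferred = {1: 4, 4: 5, 5: 2}
    scores = [0.0] * 6
    scores[preferred.get(prefix[-1], EOS)] = 1.0
    return scores

print(greedy_decode(toy_scores))   # [1, 4, 5, 2]
```

Greedy decoding commits to the single best token at each step; beam search would be the usual alternative when a locally best choice leads to a worse sentence overall.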

#### LSTM + Attention-based Seq2Seq
`EncoderRNN` (LSTM Encoder):

- Takes the input (source language) sequence, embeds the tokens into dense vectors, and passes them through an LSTM network to capture contextual information.
- Outputs both the full sequence of hidden states (for attention) and the final hidden and cell states (for initializing the decoder).

In [None]:
class EncoderRNN(nn.Module):
    """LSTM Encoder"""
    def __init__(self, input_dim, embed_dim, hidden_dim, num_layers=1, dropout=0.2):
        super(EncoderRNN, self).__init__()
        
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, dropout=dropout, batch_first=True)
        
    def forward(self, src):
        embedded = self.embedding(src)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell 


`AttentionDecoderRNN` (LSTM Decoder with Attention):
- At each decoding step, it uses an attention mechanism to dynamically focus on different parts of the encoder's hidden states.
- Combines the current embedded decoder input with the attention context, processes it through an LSTM, and predicts the next token in the sequence.
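
One attention step can be worked through by hand. The sketch below assumes dot-product scoring between the decoder's hidden state and each encoder state, followed by a softmax and a weighted sum; the function name and the two-dimensional vectors are illustrative.

```python
import math

def dot_attention(query, encoder_states):
    """Return softmax attention weights and the resulting context vector."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    top = max(scores)                                   # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(len(query))
    ]
    return weights, context

weights, context = dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(w, 3) for w in weights])   # [0.731, 0.269]
print([round(c, 3) for c in context])   # [0.731, 0.269]
```

The encoder state most similar to the query receives the largest weight, so the context vector leans toward that source position.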

In [15]:
class AttentionDecoderRNN(nn.Module):
    """LSTM Decoder with Attention"""
    def __init__(self, output_dim, embed_dim, hidden_dim, num_layers=1, dropout=0.2):
        super(AttentionDecoderRNN, self).__init__()
        
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.attention = nn.Linear(hidden_dim + embed_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim + embed_dim, hidden_dim, num_layers=num_layers, dropout=dropout, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, input, hidden, cell, encoder_outputs):
        input = input.unsqueeze(1)  # (batch_size, 1)
        embedded = self.embedding(input)  # (batch_size, 1, embed_dim)
        
        # Calculate attention weights
        hidden_broadcast = hidden[-1].unsqueeze(1)  # (batch_size, 1, hidden_dim)
        attn_weights = torch.bmm(hidden_broadcast, encoder_outputs.transpose(1, 2))  # (batch_size, 1, seq_len)
        attn_weights = torch.softmax(attn_weights, dim=-1)
        
        # Context vector
        context = torch.bmm(attn_weights, encoder_outputs)  # (batch_size, 1, hidden_dim)
        
        # Concatenate context and embedding
        rnn_input = torch.cat((embedded, context), dim=2)  # (batch_size, 1, hidden_dim + embed_dim)
        
        # Pass through LSTM
        output, (hidden, cell) = self.lstm(rnn_input, (hidden, cell))
        
        prediction = self.fc_out(output.squeeze(1))  # (batch_size, output_dim)
        
        return prediction, hidden, cell


`Seq2SeqModel` (Seq2Seq Wrapper):
- A wrapper that connects the encoder and decoder together into a full translation model.
- Defines the overall sequence-to-sequence forward pass, starting by encoding the source sequence and then decoding the target sequence one token at a time using teacher forcing during training.

In [16]:
class Seq2SeqModel(nn.Module):
    """Wrapper for LSTM Encoder-Decoder with Attention"""
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqModel, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, tgt):
        batch_size = src.size(0)
        tgt_len = tgt.size(1)
        tgt_vocab_size = self.decoder.fc_out.out_features
        
        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)
        
        encoder_outputs, hidden, cell = self.encoder(src)
        
        # Teacher forcing: feed the ground-truth token at every step, starting with <sos>.
        # outputs[:, t] scores the token that follows tgt[:, t], which matches the
        # tgt[:, :-1] / tgt[:, 1:] shift applied in train_epoch and evaluate.
        for t in range(tgt_len):
            output, hidden, cell = self.decoder(tgt[:, t], hidden, cell, encoder_outputs)
            outputs[:, t] = output
        
        return outputs

#### Training Functions
`train_epoch`:

- Train model on one full epoch.

- Applies teacher forcing during training (uses ground-truth inputs).

- Clips gradients to prevent exploding gradients.
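
Two of these ideas are easy to check on small lists: the one-position shift used for teacher forcing, and norm-based gradient clipping. The helpers below are illustrative names; the clipping rule is a sketch of the usual global-norm rescaling, with a small epsilon assumed in the denominator.

```python
import math

def shift_for_teacher_forcing(tgt):
    """Decoder input drops the last token; labels drop the first."""
    return tgt[:-1], tgt[1:]

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

decoder_input, labels = shift_for_teacher_forcing([1, 15, 27, 2, 0])
print(decoder_input)   # [1, 15, 27, 2]
print(labels)          # [15, 27, 2, 0]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
print(round(math.hypot(*clipped), 3))                     # 1.0
```

At each position the decoder sees the true previous token and is asked to predict the next one, so a padded label (0) at the end is normally excluded from the loss.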

In [None]:
def train_epoch(model, train_loader, optimizer, criterion, clip=1.0):
    model.train()
    epoch_loss = 0
    
    progress_bar = tqdm(train_loader, desc="Training")
    
    # Count steps explicitly: tqdm refreshes its internal counter only periodically
    for step, batch in enumerate(progress_bar, start=1):
        src = batch['src'].to(device)
        tgt = batch['tgt'].to(device)
        
        tgt_input = tgt[:, :-1]  # Inputs for teacher forcing
        tgt_output = tgt[:, 1:]  # Expected outputs
        
        optimizer.zero_grad()
        output = model(src, tgt_input)
        
        output = output.contiguous().view(-1, output.shape[-1])
        tgt_output = tgt_output.contiguous().view(-1)
        
        loss = criterion(output, tgt_output)
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        epoch_loss += loss.item()
        progress_bar.set_postfix({"loss": epoch_loss / step})  # running mean loss
    
    return epoch_loss / len(train_loader)

`evaluate`:

- Evaluate model performance on validation data without updating weights.

**Function Workflow**:

- Set the model to evaluation mode (`model.eval()`).
- Initialize variables for total loss (`epoch_loss`) and BLEU scores (`bleu_scores`).
- Loop through validation data:
  - Make predictions with the model.
  - Compute loss by comparing predictions to target values.
  - Calculate **BLEU score** from the teacher-forced argmax predictions to evaluate prediction quality.
  - Optionally print translation examples for inspection.

**BLEU Score**:
- **Purpose**: BLEU (Bilingual Evaluation Understudy) measures the quality of machine-generated text by comparing n-grams (sequences of words) in the predicted output with those in reference translations.
- **Key Features**:
  - **N-gram Precision**: It checks how many n-grams (e.g., unigrams, bigrams) from the predicted sentence match those in the reference sentence.
  - **Brevity Penalty**: A penalty is applied if the predicted output is shorter than the reference, encouraging the model to produce more complete translations.
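  - **Smoothing**: When a prediction shares no n-grams of some order with the reference (common for short sentences and higher-order n-grams), that precision is zero and the unsmoothed score collapses to zero. Smoothing replaces such zero counts with small positive values so sentence-level scores stay informative. We use NLTK's `SmoothingFunction().method4`.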
- **Usage**: BLEU is especially useful in tasks like machine translation, where we want to evaluate how well a machine-generated translation matches human-produced translations.
- **Limitations**: BLEU focuses on surface-level n-gram overlap and doesn't capture the full meaning or fluency of a sentence, so it's less effective for evaluating overall sentence quality or semantic accuracy.
  
**Output**:
- Returns the average validation loss. The average sentence-level BLEU score is printed but not returned.
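
To make the pieces of BLEU concrete, here is a minimal pure-Python sketch of single-reference BLEU: clipped n-gram precision, a geometric mean over n = 1..4, and the brevity penalty. It is not the code used in `evaluate` (which calls NLTK's `sentence_bleu`); the function names are ours, and the `eps` smoothing is a simplified stand-in, not NLTK's `method4`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, hypothesis, n):
    """Return (clipped n-gram matches, total hypothesis n-grams)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # An n-gram is credited at most as often as it occurs in the reference
    matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matches, sum(hyp_counts.values())

def simple_bleu(reference, hypothesis, max_n=4, eps=0.1):
    """Single-reference BLEU with a simple epsilon smoothing (illustrative only)."""
    if not hypothesis:
        return 0.0
    log_p = 0.0
    for n in range(1, max_n + 1):
        matches, total = modified_precision(reference, hypothesis, n)
        # Replace a zero numerator with eps so the geometric mean stays positive;
        # max(total, 1) guards hypotheses shorter than n tokens
        numerator = matches if matches > 0 else eps
        log_p += math.log(numerator / max(total, 1)) / max_n
    # Brevity penalty: 1 if the hypothesis is longer than the reference,
    # exp(1 - r/c) otherwise
    r, c = len(reference), len(hypothesis)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p)

reference = "the cat sat on the mat".split()
print(simple_bleu(reference, reference))                    # identical sentence
print(simple_bleu(reference, "the cat sat on".split()))     # correct but too short
print(simple_bleu(reference, "a dog ran in a park".split()))  # no overlap
```

The truncated hypothesis matches every n-gram it contains, so its score is reduced only by the brevity penalty. Note that `evaluate` averages sentence-level scores, which is not the same quantity as a corpus-level BLEU.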



In [None]:
def evaluate(model, val_loader, criterion, ko_tokenizer=None, en_tokenizer=None, print_examples=3):
    model.eval()
    epoch_loss = 0
    bleu_scores = []
    
    smooth = SmoothingFunction().method4  # BLEU smoothing

    examples_printed = 0
    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Evaluating"):
            src = batch['src'].to(device)
            tgt = batch['tgt'].to(device)

            tgt_input = tgt[:, :-1]
            tgt_output = tgt[:, 1:]

            output = model(src, tgt_input)

            output_flat = output.contiguous().view(-1, output.shape[-1])
            tgt_output_flat = tgt_output.contiguous().view(-1)
            loss = criterion(output_flat, tgt_output_flat)
            epoch_loss += loss.item()

            # BLEU evaluation
            pred_ids = output.argmax(dim=-1).tolist()
            tgt_ids = tgt.tolist()
            for i, (pred, tgt_ref) in enumerate(zip(pred_ids, tgt_ids)):
                # Remove <pad>, <sos>, <eos> (ids 0, 1, 2)
                pred_sentence = [w for w in pred if w not in [0, 1, 2]]
                tgt_sentence = [w for w in tgt_ref if w not in [0, 1, 2]]
                
                if tgt_sentence:
                    bleu = sentence_bleu(
                        [tgt_sentence], pred_sentence, smoothing_function=smooth
                    )
                    bleu_scores.append(bleu)

                # Print a few examples
                if examples_printed < print_examples and ko_tokenizer and en_tokenizer:
                    # Index the source by its position in this batch so it always
                    # matches the current pred/tgt pair (also safe for small batches)
                    src_tokens = [w for w in batch['src'][i].tolist() if w not in [0, 1, 2]]
                    src_text = ko_tokenizer.decode(src_tokens)
                    tgt_text = en_tokenizer.decode(tgt_sentence)
                    pred_text = en_tokenizer.decode(pred_sentence)
                    print(f"\n--- Example {examples_printed+1} ---")
                    print(f"Source (KO): {src_text}")
                    print(f"Target (EN): {tgt_text}")
                    print(f"Prediction : {pred_text}")
                    examples_printed += 1

    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
    print(f"\nAverage BLEU score on validation set: {avg_bleu:.4f}")
    
    return epoch_loss / len(val_loader)

`train_model`:

- Trains the model for up to `num_epochs` epochs.

- Saves the best-performing model (lowest validation loss).

- Applies early stopping if validation loss doesn’t improve.
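
The checkpoint-and-patience logic can be sketched in isolation by replaying a list of validation losses. This is a simplified illustration rather than the training loop itself: the function name is ours, and the loss values after the second epoch are hypothetical.

```python
def run_with_early_stopping(val_losses, patience=2):
    """Replay per-epoch validation losses.

    Returns (best_loss, best_epoch, stopped_epoch), with epochs 1-indexed.
    """
    best_loss = float("inf")
    best_epoch = None
    patience_counter = 0
    stopped_epoch = len(val_losses)
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            # This is where the best checkpoint would be saved
            best_loss, best_epoch = val_loss, epoch
            patience_counter = 0
        else:
            patience_counter += 1
        if patience_counter >= patience:
            stopped_epoch = epoch
            break
    return best_loss, best_epoch, stopped_epoch

# Hypothetical loss curve: improves twice, then gets worse for two epochs
losses = [4.38, 3.71, 3.75, 3.80, 3.60]
print(run_with_early_stopping(losses, patience=2))
```

Because the first epoch always improves on the initial value of infinity, a run with `num_epochs=2` and `patience=2` can never trigger early stopping; the counter reaches at most 1.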

In [18]:
def train_model(model, train_loader, val_loader, optimizer, criterion,
                ko_tokenizer=None, en_tokenizer=None,
                num_epochs=3, patience=2, model_save_path='models'):
    os.makedirs(model_save_path, exist_ok=True)
    
    best_val_loss = float('inf')
    patience_counter = 0
    
    for epoch in range(num_epochs):
        start_time = time.time()
        
        train_loss = train_epoch(model, train_loader, optimizer, criterion)
        val_loss = evaluate(model, val_loader, criterion, ko_tokenizer, en_tokenizer)
        
        end_time = time.time()
        epoch_time = end_time - start_time
        
        print(f"Epoch {epoch+1}/{num_epochs} | Time: {epoch_time:.2f}s")
        print(f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'train_loss': train_loss,
                'val_loss': val_loss,
            }, os.path.join(model_save_path, 'best_model.pt'))
            print(f"Saved new best model with validation loss {val_loss:.4f}")
            patience_counter = 0
        else:
            patience_counter += 1
        
        if patience_counter >= patience:
            print(f"Early stopping triggered after {epoch+1} epochs.")
            break
    
    return best_val_loss

#### Full process_and_train() Function
Loads the data, builds the model, and runs training with validation:

- Load data
- Build tokenizers
- Create DataLoaders
- Choose Transformer or LSTM+Attention
- Train and validate the model


In [19]:
def process_and_train(directory='./translated', batch_size=64, d_model=256,
                      num_epochs=3, learning_rate=0.0003, num_workers=4,
                      use_cached_data=True, subset_fraction=0.3,
                      model_type="transformer"):
    start_time = time.time()
    
    # Step 1: Load data, then load or build tokenizers
    data = load_excel_files(directory)
    if len(data) == 0:
        # Fail loudly: the caller unpacks three return values, so returning None would break it
        raise ValueError(f"No data loaded from {directory}")
    print(f"Total data loaded: {len(data)} sentence pairs")

    if use_cached_data and os.path.exists('tokenizers/korean_tokenizer.pkl') and os.path.exists('tokenizers/english_tokenizer.pkl'):
        print("Loading tokenizers from cache...")
        ko_tokenizer = ImprovedKoreanTokenizer.load('tokenizers/korean_tokenizer.pkl')
        en_tokenizer = EnglishTokenizer.load('tokenizers/english_tokenizer.pkl')
    else:
        ko_tokenizer = ImprovedKoreanTokenizer(max_vocab=50000)
        en_tokenizer = EnglishTokenizer(max_vocab=50000)
        print("Building Korean vocabulary...")
        ko_tokenizer.fit(data['korean'].tolist())
        print("Building English vocabulary...")
        en_tokenizer.fit(data['english'].tolist())
        os.makedirs('tokenizers', exist_ok=True)
        ko_tokenizer.save('tokenizers/korean_tokenizer.pkl')
        en_tokenizer.save('tokenizers/english_tokenizer.pkl')
        print("Tokenizers saved.")
    
    # Step 2: Create DataLoaders
    train_loader, val_loader = create_dataloaders(
        data, ko_tokenizer, en_tokenizer,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=(device.type == 'cuda'),
        subset_fraction=subset_fraction
    )
    
    # Step 3: Build Model
    if model_type == "transformer":
        model = TransformerModel(
            src_vocab_size=len(ko_tokenizer.word2idx),
            tgt_vocab_size=len(en_tokenizer.word2idx),
            d_model=d_model,
            nhead=8,
            num_encoder_layers=4,
            num_decoder_layers=4,
            dim_feedforward=1024,
            dropout=0.1
        ).to(device)
    elif model_type == "lstm":
        INPUT_DIM = len(ko_tokenizer.word2idx)
        OUTPUT_DIM = len(en_tokenizer.word2idx)
        EMBED_DIM = 256
        HIDDEN_DIM = 512
        encoder = EncoderRNN(INPUT_DIM, EMBED_DIM, HIDDEN_DIM)
        decoder = AttentionDecoderRNN(OUTPUT_DIM, EMBED_DIM, HIDDEN_DIM)
        model = Seq2SeqModel(encoder, decoder, device).to(device)
    else:
        raise ValueError(f"Unsupported model_type: {model_type}")
    
    if device.type == 'cuda':
        torch.backends.cudnn.benchmark = True
    
    # Step 4: Optimizer and Loss
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss(ignore_index=0, label_smoothing=0.05) # Add label smoothing 
    
    # Step 5: Train
    print("\nStarting model training...")
    best_val_loss = train_model(model, train_loader, val_loader,
        optimizer, criterion, ko_tokenizer=ko_tokenizer,
        en_tokenizer=en_tokenizer, num_epochs=num_epochs
    )

    
    end_time = time.time()
    print(f"\nTraining completed in {end_time - start_time:.2f} seconds")
    print(f"Best validation loss: {best_val_loss:.4f}")
    
    return model, ko_tokenizer, en_tokenizer
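
The criterion in Step 4 combines two options: `ignore_index=0` drops padded positions from the loss, and `label_smoothing=0.05` mixes the one-hot target with a uniform distribution over the vocabulary. The following pure-Python sketch shows that computation as we understand it from the PyTorch documentation; the function name and the toy logits are ours, for illustration only.

```python
import math

def smoothed_cross_entropy(logits, targets, ignore_index=0, label_smoothing=0.05):
    """Mean label-smoothed cross-entropy over tokens whose target != ignore_index.

    Returns 0.0 if every target is ignored (a simplification chosen for this sketch).
    """
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log-softmax
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        log_probs = [x - log_z for x in row]
        nll = -log_probs[target]                # loss against the one-hot target
        uniform = -sum(log_probs) / len(row)    # loss against a uniform target
        total += (1 - label_smoothing) * nll + label_smoothing * uniform
        counted += 1
    return total / counted if counted else 0.0

# Toy example with a 4-word vocabulary; the second position is padding (target 0)
logits = [[0.0, 4.0, 0.0, 0.0],
          [5.0, 1.0, 1.0, 1.0]]
targets = [1, 0]
print(smoothed_cross_entropy(logits, targets, label_smoothing=0.0))
print(smoothed_cross_entropy(logits, targets, label_smoothing=0.05))
```

For a confident, correct prediction the smoothed loss is higher than the plain loss, which discourages the model from pushing probabilities all the way to 1. This is also why loss values reported with label smoothing are not directly comparable to an unsmoothed cross-entropy.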

#### Entry Point
Run the process when this file is executed.

- Change `model_type` to switch between **Transformer** and **LSTM+Attention**.

In [22]:
if __name__ == "__main__":
    model, ko_tokenizer, en_tokenizer = process_and_train(
        directory='./translated',
        use_cached_data=False,
        batch_size=64,
        num_epochs=2,
        d_model=384,
        learning_rate=0.00025,
        num_workers=0,
        subset_fraction=0.3,
        model_type="transformer"  # Change to "lstm" if you want LSTM+Attention
    )

Loading combined data from cache: ./translated/combined_data_cache.pkl
Total data loaded: 1602058 sentence pairs
Building Korean vocabulary...


Building vocabulary: 100%|██████████| 1602058/1602058 [00:50<00:00, 31831.78it/s]


Vocabulary size: 50000
Building English vocabulary...


Building vocabulary: 100%|██████████| 1602058/1602058 [00:04<00:00, 353998.66it/s]


Vocabulary size: 50000
Tokenizers saved.
Using 30.0% of the data: 480617 samples
Training data: 384493 samples
Validation data: 96124 samples
Creating new dataset cache...


Processing dataset: 100%|██████████| 384493/384493 [00:28<00:00, 13452.69it/s]


Saving dataset cache to dataset_cache/train/cached_data_384493_40.pkl
Creating new dataset cache...


Processing dataset: 100%|██████████| 96124/96124 [00:07<00:00, 13621.41it/s]


Saving dataset cache to dataset_cache/val/cached_data_96124_40.pkl

Starting model training...


Training: 100%|██████████| 6008/6008 [3:49:39<00:00,  2.29s/it, loss=5.14]     
Evaluating:   0%|          | 1/1502 [00:00<14:32,  1.72it/s]


--- Example 1 ---
Source (KO): 겸재 는 금강산 을 오가 는 길 에 이 일대 에 은거 하 던 스승 <unk> <unk> 을 찾아왔 다가 이 폭포 의 경관 에 반해 진경산수 화 를 남겼 다 .
Target (EN): on his way to and from mt . geumgang , gyeomjae visited his teacher , <unk> kim <unk> , who had been living in the area , and left real landscape painting , falling in love the view of
Prediction : the the way to the <unk> the . <unk> , the was the house in and , <unk> , and was been in in the north , and was the life . . and into the . sea of

--- Example 2 ---
Source (KO): 구글 은 국내 차량 3 대 중 2 대 의 점유 율 을 자랑 하 는 현대 기아차 에 안드로이드 오토 를 탑재 하 면서 자연스레 IVI 시장 에서 우위 를 점할 수 있 는 토대 가 마련 됐
Target (EN): this is because google has laid the foundation for its dominance in the ivi market naturally as it is equipped with android auto to hyundaikia , which boasts a share of two of the three vehicles in korea .
Prediction : the is a the has a off largest for its own of the domestic of , , it has a with a <unk> engine the motors which is a large of its cars its worlds major 

Evaluating: 100%|██████████| 1502/1502 [14:58<00:00,  1.67it/s]



Average BLEU score on validation set: 0.0463
Epoch 1/2 | Time: 14678.55s
Train Loss: 5.1388 | Val Loss: 4.3817
Saved new best model with validation loss 4.3817


Training: 100%|██████████| 6008/6008 [7:10:37<00:00,  4.30s/it, loss=4.03]     
Evaluating:   0%|          | 1/1502 [00:00<08:07,  3.08it/s]


--- Example 1 ---
Source (KO): 겸재 는 금강산 을 오가 는 길 에 이 일대 에 은거 하 던 스승 <unk> <unk> 을 찾아왔 다가 이 폭포 의 경관 에 반해 진경산수 화 를 남겼 다 .
Target (EN): on his way to and from mt . geumgang , gyeomjae visited his teacher , <unk> kim <unk> , who had been living in the area , and left real landscape painting , falling in love the view of
Prediction : at the way to the left the . kumgang , the , the tomb , who , <unk> , who was been on in the mountains of and the the time . , was into the . mountains of

--- Example 2 ---
Source (KO): 구글 은 국내 차량 3 대 중 2 대 의 점유 율 을 자랑 하 는 현대 기아차 에 안드로이드 오토 를 탑재 하 면서 자연스레 IVI 시장 에서 우위 를 점할 수 있 는 토대 가 마련 됐
Target (EN): this is because google has laid the foundation for its dominance in the ivi market naturally as it is equipped with android auto to hyundaikia , which boasts a share of two of the three vehicles in korea .
Prediction : the is the the has been the base for its own of the domestic of , in it has equipped with a motors vehicles maximize motors which has its competi

Evaluating: 100%|██████████| 1502/1502 [08:15<00:00,  3.03it/s]



Average BLEU score on validation set: 0.0773
Epoch 2/2 | Time: 26332.46s
Train Loss: 4.0288 | Val Loss: 3.7120
Saved new best model with validation loss 3.7120

Training completed in 41150.55 seconds
Best validation loss: 3.7120


In [23]:
# Reload Excel files
data = load_excel_files(directory='./translated')

# Check the worst examples BEFORE caching
bad_rows = data[
    (data['korean'].str.strip().str.len() < 5) | 
    (data['korean'].str.split().str.len() <= 3)
]

print("Bad rows sample:")
print(bad_rows.head(10))


Loading combined data from cache: ./translated/combined_data_cache.pkl
Bad rows sample:
Empty DataFrame
Columns: [korean, english, source_file]
Index: []


In [8]:
# Step 1: Load data and tokenizers
data = load_excel_files(directory='./translated')
ko_tokenizer = ImprovedKoreanTokenizer.load('tokenizers/korean_tokenizer.pkl')
en_tokenizer = EnglishTokenizer.load('tokenizers/english_tokenizer.pkl')


# Step 2: Create val_loader (you don’t need to retrain)
_, val_loader = create_dataloaders(
    data,
    ko_tokenizer,
    en_tokenizer,
    batch_size=1,  # use batch size 1 for easy viewing
    subset_fraction=0.1,
    num_workers=0,
    pin_memory=False
)

# Step 3: Load model and weights
model = TransformerModel(
    src_vocab_size=len(ko_tokenizer.word2idx),
    tgt_vocab_size=len(en_tokenizer.word2idx),
    d_model=384,
    nhead=8,
    num_encoder_layers=4,
    num_decoder_layers=4,
    dim_feedforward=1024,
    dropout=0.1
).to(device)

checkpoint = torch.load('models/best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Step 4: Print a few translations from the validation set
from random import randint

for i in range(3):  # print 3 examples
    idx = randint(0, len(val_loader.dataset) - 1)
    sample = val_loader.dataset[idx]

    src_tokens = sample['src'].tolist()
    tgt_tokens = sample['tgt'].tolist()

    src_sentence = ko_tokenizer.decode([t for t in src_tokens if t not in [0, 1, 2]])
    tgt_sentence = en_tokenizer.decode([t for t in tgt_tokens if t not in [0, 1, 2]])

    # Get model translation
    translation = model.translate(src_sentence, ko_tokenizer, en_tokenizer)

    print(f"\n--- Example {i+1} ---")
    print(f"Source (KO): {src_sentence}")
    print(f"Target (EN): {tgt_sentence}")
    print(f"Prediction  : {translation}")


Loading combined data from cache: ./translated/combined_data_cache.pkl
Using 10.0% of the data: 160228 samples
Training data: 128182 samples
Validation data: 32046 samples
Loading cached dataset from dataset_cache/train/cached_data_128182_40.pkl
Loading cached dataset from dataset_cache/val/cached_data_32046_40.pkl





--- Example 1 ---
Source (KO): 어제 전화 로 예약 하 고 왔 고요 , 체크인 해 주 세요 .
Target (EN): i made a reservation yesterday via phone call , and im here for the checkin .
Prediction  : <sos> it is a good idea to see the new relatively times , so shall korea see the new relatively . <eos>

--- Example 2 ---
Source (KO): 오 는 19 일 까지 KT 채용 홈페이지 를 통해 지원 할 수 있 으며 남자 인 경우 병역 문제 가 해결 돼야 한다 .
Target (EN): the application can be made through kts recruitment website by th , and in case of a man , the military service issue must be resolved .
Prediction  : <sos> it is not urban to users that the korean wave is not urban , but it is not urban to users the situation . <eos>

--- Example 3 ---
Source (KO): 사용 허가 를 받 은 다음 사용 시작 전날 까지 미리 그 사용 을 취소 또는 연기 할 때 에 는 총 사용료 의 10 퍼센트 를 공제 후 반환 하 고 , 사용 시작 일 이후 는 이용
Target (EN): when the use is cancelled or postponed in advance by the day before the commencement of the use after the permission is granted , the fee shall be refunded after deducting ten percent of the total 