# Seq2Seq English-Polish Translation with PyTorch

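Character-level LSTM encoder-decoder with Bahdanau attention for English-to-Polish translation.

**Architecture:**
- Encoder: LSTM processes the English sentence character by character → per-character hidden states plus final context vectors (h, c)
- Decoder: LSTM initialized with (h, c) generates the Polish translation, attending over the encoder states at every step
- Teacher forcing during training
- Autoregressive (greedy) generation during inference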

---

## 📋 Setup Instructions for Kaggle

1. **Upload this notebook** to Kaggle
2. **Upload dataset**: Add `eng_to_pl.tsv` as input data
   - Click "Add Data" → "Upload" → Select your `eng_to_pl.tsv` file
   - Or create a Kaggle dataset first, then add it as input
3. **Enable GPU**: Settings → Accelerator → GPU T4 x2 (or P100)
4. **Run all cells**

The data loader first looks for the dataset at the Kaggle input path and falls back to a local path if that file does not exist, so the same notebook runs in both environments.


In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import csv
from pathlib import Path

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")


Using device: cpu


## 1. Data Loading

Upload your `eng_to_pl.tsv` file to Kaggle's input folder, or adjust the path below.
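
As a rough illustration of what the loader does with each sentence pair, the sketch below builds a character vocabulary with the same three special tokens and a list-based one-hot encoding. The toy pairs and the `one_hot` helper are made up for this example; the notebook's own `char_to_onehot` returns a PyTorch tensor instead.

```python
# Toy pairs (invented for illustration; not taken from eng_to_pl.tsv)
pairs = [("hi.", "cześć."), ("so what?", "no i co?")]

SPECIALS = ["<PAD>", "<START>", "<END>"]

# Collect every character that appears on either side, sorted for a stable order
all_chars = sorted(set().union(*(eng + pol for eng, pol in pairs)))
chars = SPECIALS + all_chars

char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}


def one_hot(char):
    """Return a plain-list one-hot vector for `char` (hypothetical helper)."""
    vec = [0] * len(chars)
    vec[char_to_idx[char]] = 1
    return vec


print(len(chars))               # vocabulary size including the special tokens
print(char_to_idx["<START>"])   # special tokens occupy the first indices
print(sum(one_hot("ś")))        # exactly one position is set
```

English and Polish share one vocabulary here, which is why the encoder and decoder in this notebook use the same input size.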


In [5]:
class EnglishToPolishTranslationData:
    def __init__(self, data_path="/kaggle/input/eng-to-pl/eng_to_pl.tsv"):
        self.pairs = []
        
        if not Path(data_path).exists():
            data_path = "data/eng_to_pl/eng_to_pl.tsv"  # Local path
        
        with open(data_path, encoding="utf-8") as tsv_file:
            reader = csv.reader(tsv_file, delimiter="\t")
            for row in reader:
                # Skip malformed rows that lack the expected four columns
                if len(row) < 4:
                    continue
                eng = row[1]
                pol = row[3]
                # Filter for shorter sentences (5-12 chars English, 5-15 chars Polish)
                if len(eng) < 5 or len(eng) > 12:
                    continue
                if len(pol) < 5 or len(pol) > 15:
                    continue
                self.pairs.append((eng.lower(), pol.lower()))
        
        self.build_vocabulary()
        print(f"Loaded {len(self.pairs)} translation pairs")
        print(f"Vocabulary size: {self.vocab_size}")
    
    def build_vocabulary(self):
        all_chars = set().union(*(eng + pol for eng, pol in self.pairs))
        self.chars = sorted(all_chars)

        self.chars = ["<PAD>", "<START>", "<END>"] + self.chars
        
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
        self.vocab_size = len(self.chars)
    
    def char_to_onehot(self, char):
        """Convert character to one-hot vector (1, vocab_size)"""
        char_idx = self.char_to_idx[char]
        onehot = torch.zeros(1, self.vocab_size, device=device)
        onehot[0, char_idx] = 1
        return onehot
    
    def get_pairs(self):
        return self.pairs

data = EnglishToPolishTranslationData()
print("\nFirst 3 pairs:")
for eng, pol in data.get_pairs()[:3]:
    print(f"  '{eng}' → '{pol}'")


Loaded 2723 translation pairs
Vocabulary size: 58

First 3 pairs:
  'hurry up.' → 'pośpiesz się!'
  'so what?' → 'no i co?'
  'so what?' → 'no i?'


## 2. LSTM Encoder

Processes the English sentence character by character and returns every intermediate hidden state (used later by attention) along with the final hidden and cell states.


In [6]:
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        self.lstm_cell = nn.LSTMCell(input_size, hidden_size)
    
    def encode(self, sentence, data_processor):
        hidden = torch.zeros(1, self.hidden_size, device=device)
        cell = torch.zeros(1, self.hidden_size, device=device)
        encoder_states = []
        
        # Process each character
        for char in sentence:
            onehot = data_processor.char_to_onehot(char)
            hidden, cell = self.lstm_cell(onehot, (hidden, cell))
            encoder_states.append(hidden)
        
        return encoder_states, hidden, cell
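
The loop above can be traced on toy sizes to see the shapes involved. This sketch assumes a 10-symbol vocabulary and a hidden size of 4, both chosen only for readability, and feeds arbitrary character indices rather than real text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, hidden_size = 10, 4          # toy sizes, not the notebook's values
lstm_cell = nn.LSTMCell(vocab_size, hidden_size)

# A "sentence" of three character indices (arbitrary values for illustration)
char_indices = [3, 7, 1]

hidden = torch.zeros(1, hidden_size)
cell = torch.zeros(1, hidden_size)
states = []

for idx in char_indices:
    onehot = torch.zeros(1, vocab_size)
    onehot[0, idx] = 1.0
    # One LSTM step: consumes one character and updates (hidden, cell)
    hidden, cell = lstm_cell(onehot, (hidden, cell))
    states.append(hidden)

print(len(states), tuple(states[0].shape), tuple(hidden.shape), tuple(cell.shape))
```

The list holds one `(1, hidden_size)` tensor per input character, and its last entry is the same tensor as the final hidden state.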


## 3. Attention

In [7]:
class Attention(nn.Module):
    def __init__(self, encoder_hidden_size, decoder_hidden_size, attention_size):
        """
        Bahdanau attention mechanism.
        """
        super().__init__()
        
        self.W_encoder = nn.Linear(encoder_hidden_size, attention_size, bias=False)

        self.W_decoder = nn.Linear(decoder_hidden_size, attention_size, bias=False)

        self.v = nn.Linear(attention_size, 1, bias=False)

    def forward(self, decoder_hidden, encoder_states):
        # Step 1: Stack the per-character encoder states into one tensor
        encoder_outputs = torch.stack(encoder_states, dim=0)  # (seq_len, 1, hidden)
        seq_len = encoder_outputs.size(0)
        
        encoder_outputs = encoder_outputs.squeeze(1)  # (seq_len, hidden)
        decoder_hidden = decoder_hidden.squeeze(0)    # (hidden,)
        
        # Step 2: Project encoder states and decoder hidden state
        # encoder_proj: (seq_len, attention_size)
        encoder_proj = self.W_encoder(encoder_outputs)
        
        # decoder_proj: (attention_size,) → expand to (seq_len, attention_size)
        decoder_proj = self.W_decoder(decoder_hidden)
        decoder_proj = decoder_proj.unsqueeze(0).expand(seq_len, -1)
        
        # Step 3: Compute alignment scores
        # combined: (seq_len, attention_size)
        combined = torch.tanh(encoder_proj + decoder_proj)
        
        # scores: (seq_len, 1)
        scores = self.v(combined)
        
        # Step 4: Softmax to get attention weights
        # attention_weights: (seq_len, 1)
        attention_weights = F.softmax(scores, dim=0)
        
        # Step 5: Compute weighted sum of encoder states (context vector)
        # context: (seq_len, hidden) * (seq_len, 1) → sum → (hidden,)
        context_vector = (encoder_outputs * attention_weights).sum(dim=0)
        
        # Reshape back to (1, hidden) to match expected shape
        context_vector = context_vector.unsqueeze(0)
        attention_weights = attention_weights.squeeze(1).unsqueeze(0)  # (1, seq_len)
        
        return context_vector, attention_weights
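
The same computation can be written compactly with plain tensors. The sketch below uses random matrices in place of the learned `nn.Linear` weights and toy sizes chosen for illustration; it is meant to show the shapes and the score formula, not to replace the module above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, hidden_size, attention_size = 5, 8, 6   # toy sizes

encoder_outputs = torch.randn(seq_len, hidden_size)   # one row per source character
decoder_hidden = torch.randn(hidden_size)             # current decoder state

# Random stand-ins for the learned projections (assumption for this sketch)
W_enc = torch.randn(attention_size, hidden_size)
W_dec = torch.randn(attention_size, hidden_size)
v = torch.randn(attention_size)

# score_i = v^T tanh(W_enc h_i + W_dec s); broadcasting adds the decoder term to every row
scores = torch.tanh(encoder_outputs @ W_enc.T + decoder_hidden @ W_dec.T) @ v  # (seq_len,)
weights = F.softmax(scores, dim=0)     # (seq_len,), non-negative and summing to 1
context = weights @ encoder_outputs    # (hidden_size,), weighted sum of encoder rows

print(weights.sum().item(), tuple(context.shape))
```

Because the weights sum to 1, the context vector is a convex combination of the encoder states and keeps their dimensionality.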
        

## 3. LSTM Decoder

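Generates the Polish translation one character at a time. The LSTM state is initialized from the encoder's final (h, c), and at each step an attention context vector is concatenated with the hidden state before the next character is predicted.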

- **Training mode**: Teacher forcing (uses ground truth as input)
- **Generation mode**: Autoregressive (uses its own predictions)
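
The two modes differ only in what is fed back into the decoder. Under teacher forcing, the inputs are `<START>` followed by the ground-truth characters, and the expected outputs are the same characters shifted by one position with `<END>` appended. A minimal sketch of that alignment, using a hypothetical helper `teacher_forcing_pairs` that is not part of the notebook:

```python
def teacher_forcing_pairs(target_sentence):
    """Return (decoder_input, expected_output) pairs for one target sentence."""
    inputs = ["<START>"] + list(target_sentence)
    outputs = list(target_sentence) + ["<END>"]
    return list(zip(inputs, outputs))


for step, (fed, expected) in enumerate(teacher_forcing_pairs("kot")):
    print(step, repr(fed), "->", repr(expected))
```

During generation the ground truth is unavailable, so the character fed at each step is whatever the model predicted at the previous one.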


In [8]:
class Decoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        self.lstm_cell = nn.LSTMCell(input_size, hidden_size)
        self.attention = Attention(
            encoder_hidden_size=hidden_size,
            decoder_hidden_size=hidden_size,
            attention_size=128  # Can tune this
        )
        
        self.output_layer = nn.Linear(hidden_size * 2, output_size)
    
    def predict_char_probs(self, hidden_state):
        logits = self.output_layer(hidden_state)
        log_probs = F.log_softmax(logits, dim=1)
        return log_probs
    
    def decode_train(self, encoder_states, context_h, context_c, target_sentence, data):
        hidden = context_h
        cell = context_c
        predictions = []
        
        current_char = "<START>"
        
        for target_char in target_sentence:
            onehot = data.char_to_onehot(current_char)
            hidden, cell = self.lstm_cell(onehot, (hidden, cell))
            context_vector, attention_weights = self.attention(hidden, encoder_states)
            combined = torch.cat([context_vector, hidden], dim = 1)
            prediction = self.predict_char_probs(combined)
            predictions.append(prediction)
            
            current_char = target_char  # Teacher forcing
        
        # Predict <END> token
        onehot = data.char_to_onehot(current_char)
        hidden, cell = self.lstm_cell(onehot, (hidden, cell))
        context_vector, attention_weights = self.attention(hidden, encoder_states)
        combined = torch.cat([context_vector, hidden], dim = 1)
        prediction = self.predict_char_probs(combined)
        predictions.append(prediction)
        
        return predictions
    
    def decode_generate(self, encoder_states, context_h, context_c, data, max_length=30):
        hidden = context_h
        cell = context_c
        generated = []
        
        current_char = "<START>"
        
        for _ in range(max_length):
            onehot = data.char_to_onehot(current_char)
            hidden, cell = self.lstm_cell(onehot, (hidden, cell))
            context_vector, attention_weights = self.attention(hidden, encoder_states)
            combined = torch.cat([context_vector, hidden], dim = 1)
            log_probs = self.predict_char_probs(combined)
            predicted_idx = log_probs.argmax().item()
            next_char = data.idx_to_char[predicted_idx]
            
            if next_char == "<END>" or next_char == "<PAD>":
                break
            
            generated.append(next_char)
            current_char = next_char
        
        return "".join(generated)


## 4. Loss Function

Negative log-likelihood loss averaged over every predicted character, including the final `<END>` token.


In [9]:
def compute_sequence_loss(predictions, target_sequence, data_processor):
    # Targets are the sentence characters followed by the <END> token
    target_chars = list(target_sequence) + ["<END>"]
    target_indices = torch.tensor(
        [data_processor.char_to_idx[ch] for ch in target_chars], device=device
    )

    # Stack the per-step (1, vocab_size) log-probabilities into (steps, vocab_size)
    log_probs = torch.cat(predictions, dim=0)

    # Mean NLL over all steps, equivalent to summing per-step losses and dividing
    avg_loss = F.nll_loss(log_probs, target_indices)
    return avg_loss
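
To make the averaging concrete, here is a small worked example in plain Python. The probabilities are invented for illustration, and `average_nll` is a hypothetical helper; the notebook itself works on log-probabilities and calls `F.nll_loss`.

```python
import math


def average_nll(step_probs, target_indices):
    """Mean of -log p(target) over all decoding steps."""
    losses = [-math.log(probs[t]) for probs, t in zip(step_probs, target_indices)]
    return sum(losses) / len(losses)


# Two decoding steps over a 3-symbol vocabulary (made-up probabilities)
step_probs = [
    [0.7, 0.2, 0.1],   # step 0: the model favours symbol 0
    [0.1, 0.1, 0.8],   # step 1: the model favours symbol 2
]
targets = [0, 2]

print(round(average_nll(step_probs, targets), 4))
```

The loss approaches zero as the probability assigned to each correct character approaches 1, and grows without bound as it approaches 0.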


## 5. Training Loop

Train encoder and decoder jointly with backpropagation through time.
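
A single joint update can be sketched as follows, with two small `nn.Linear` layers standing in for the encoder and decoder (an assumption made to keep the example short). It shows one optimizer covering both parameter sets and gradient-norm clipping applied before the step.

```python
import itertools

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Two stand-in modules playing the roles of encoder and decoder
enc = nn.Linear(4, 4)
dec = nn.Linear(4, 2)

# One optimizer over both parameter sets, so a single step updates both modules
params = list(itertools.chain(enc.parameters(), dec.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001)

x = torch.randn(1, 4)
target = torch.tensor([1])

optimizer.zero_grad()
loss = F.cross_entropy(dec(torch.tanh(enc(x))), target)
loss.backward()   # gradients flow back through the decoder into the encoder

# Rescale gradients in place so their combined L2 norm is at most 1.0;
# the return value is the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))

optimizer.step()
print(float(norm_before), float(norm_after))
```

Clipping matters more for recurrent models than for this toy, since gradients accumulated over many time steps can grow large.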


In [10]:
def test_translation(encoder, decoder, data, test_sentences=None):
    if test_sentences is None:
        test_sentences = ["i love you", "have fun", "why me?"]
    
    encoder.eval()
    decoder.eval()
    
    with torch.no_grad():
        for english in test_sentences:
            encoder_states, context_h, context_c = encoder.encode(english, data)
            polish = decoder.decode_generate(encoder_states, context_h, context_c, data, max_length=30)
            print(f"  {english} → {polish}")
    
    encoder.train()
    decoder.train()


def train_network(epochs=200, lr=0.001, max_pairs=2723):
    pairs = data.get_pairs()[:max_pairs]
    print(f"\nTraining on {len(pairs)} pairs\n")
    
    encoder = Encoder(input_size=data.vocab_size, hidden_size=256).to(device)
    decoder = Decoder(input_size=data.vocab_size, hidden_size=256, output_size=data.vocab_size).to(device)
    
    # Optimize encoder and decoder parameters together; lr is passed through to Adam
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    
    for epoch in range(epochs):
        epoch_loss = 0
        
        for english, polish in pairs:
            optimizer.zero_grad()
            
            encoder_states, context_h, context_c = encoder.encode(english, data)
            
            predictions = decoder.decode_train(encoder_states, context_h, context_c, polish, data)
            
            loss = compute_sequence_loss(predictions, polish, data)
            epoch_loss += loss.item()
            
            loss.backward()
            
            # Clip the global gradient norm to stabilize training
            torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
            
            optimizer.step()
        
        if epoch % 10 == 0:
            avg_loss = epoch_loss / len(pairs)
            print(f"Epoch {epoch}/{epochs}, Loss: {avg_loss:.4f}")
            test_translation(encoder, decoder, data)
            print()
    
    print("Training complete!")
    return encoder, decoder


## 6. Train the Model


In [11]:
encoder, decoder = train_network(epochs=100, lr=0.001, max_pairs=1000)



Training on 1000 pairs

Epoch 0/100, Loss: 2.7653
  i love you → po szesze.
  have fun → pa szesze.
  why me? → pa szesze.

Epoch 10/100, Loss: 0.7680
  i love you → może pochoni.
  have fun → jak leci?
  why me? → kiedy?

Epoch 20/100, Loss: 0.2621
  i love you → kocham cię!
  have fun → maw lię nie!
  why me? → dlaczego nia?

Epoch 30/100, Loss: 0.1184
  i love you → kocham cię.
  have fun → na zdlionie!
  why me? → dlaczego ja?

Epoch 40/100, Loss: 0.0809
  i love you → kocham cię!
  have fun → mar może!
  why me? → do kie?

Epoch 50/100, Loss: 0.0684
  i love you → kocham cię!
  have fun → obudź się.
  why me? → dlaczego ja?

Epoch 60/100, Loss: 0.0524
  i love you → kocham cię!
  have fun → marę mię!
  why me? → dlaczego ja?

Epoch 70/100, Loss: 0.0444
  i love you → kocham cię!
  have fun → mardzam!
  why me? → dlaczego ja?

Epoch 80/100, Loss: 0.0408
  i love you → kocham cię!
  have fun → alery nie!
  why me? →

## 7. Test on Custom Sentences
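
Custom sentences should be lowercase and use only characters seen in the training pairs, since `char_to_onehot` indexes `char_to_idx` directly and raises `KeyError` for anything else. A hypothetical pre-check (the `unknown_chars` helper and `toy_vocab` are illustrative, not part of the notebook) might look like:

```python
def unknown_chars(sentence, char_to_idx):
    """Return the characters of `sentence` that are missing from the vocabulary."""
    return sorted({ch for ch in sentence.lower() if ch not in char_to_idx})


# Toy vocabulary for illustration; in the notebook this would be data.char_to_idx
toy_vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ?.")}

print(unknown_chars("why me?", toy_vocab))       # nothing missing
print(unknown_chars("C3PO & R2D2", toy_vocab))   # digits and '&' are not in the toy vocabulary
```

An empty result means the sentence can be encoded safely; otherwise the offending characters can be reported or stripped before translation.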


In [12]:

custom_sentences = [
    "i love you",
    "have fun",
    "why me?",
    "hello",
    "good night"
]

print("\nCustom translations:")
test_translation(encoder, decoder, data, custom_sentences)



Custom translations:
  i love you → kocham cię!
  have fun → obudź się.
  why me? → dlaczego namau.
  hello → pomocy!
  good night → dobranoc.


## 8. Save Models (Optional)


In [13]:

torch.save({
    'encoder_state_dict': encoder.state_dict(),
    'decoder_state_dict': decoder.state_dict(),
    'vocab_size': data.vocab_size,
    'char_to_idx': data.char_to_idx,
    'idx_to_char': data.idx_to_char
}, 'seq2seq_eng_pol.pth')

print("Models saved to 'seq2seq_eng_pol.pth'")


Models saved to 'seq2seq_eng_pol.pth'
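
The checkpoint stores the weights together with the character mappings, so the model can be rebuilt without re-reading the dataset. A minimal round trip, assuming a small `nn.Linear` as a stand-in model and an in-memory buffer instead of a file:

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(3, 2)   # stand-in for the encoder/decoder
char_to_idx = {"<PAD>": 0, "<START>": 1, "<END>": 2, "a": 3}

buffer = io.BytesIO()     # in-memory file, avoids touching disk in this sketch
torch.save({"state_dict": model.state_dict(), "char_to_idx": char_to_idx}, buffer)

buffer.seek(0)
# map_location remaps tensors that were saved on another device
checkpoint = torch.load(buffer, map_location="cpu")

# Rebuild a module with the same architecture, then copy the saved weights in
restored = nn.Linear(3, 2)
restored.load_state_dict(checkpoint["state_dict"])
restored.eval()

print(checkpoint["char_to_idx"] == char_to_idx)
print(torch.equal(restored.weight, model.weight))
```

Passing `map_location` is what allows a checkpoint written on a GPU machine to load on a CPU-only one.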


## 9. Load Models (Optional)


In [14]:

# map_location lets a checkpoint saved on GPU load on a CPU-only machine
checkpoint = torch.load('seq2seq_eng_pol.pth', map_location=device)


loaded_encoder = Encoder(input_size=checkpoint['vocab_size'], hidden_size=256).to(device)
loaded_decoder = Decoder(input_size=checkpoint['vocab_size'], hidden_size=256, output_size=checkpoint['vocab_size']).to(device)


loaded_encoder.load_state_dict(checkpoint['encoder_state_dict'])
loaded_decoder.load_state_dict(checkpoint['decoder_state_dict'])

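# Restore the vocabulary exactly as it was at training time
data.char_to_idx = checkpoint['char_to_idx']
data.idx_to_char = checkpoint['idx_to_char']
data.vocab_size = checkpoint['vocab_size']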

print("Models loaded successfully!")
test_translation(loaded_encoder, loaded_decoder, data)


Models loaded successfully!
  i love you → kocham cię!
  have fun → obudź się.
  why me? → dlaczego namau.
