Inspired by https://web.stanford.edu/class/cs224n/index.html#coursework

### Assignment Overview:
1. **Neural Machine Translation (NMT)**: This involves training a model to translate text from one language to another.
2. **Sequence-to-Sequence (Seq2Seq)**: Seq2Seq models are based on an encoder-decoder architecture, where the encoder processes the input sequence and the decoder generates the output sequence.
3. **Attention Mechanism**: The attention mechanism allows the model to focus on specific parts of the input sequence when generating each part of the output sequence, addressing limitations of the basic Seq2Seq architecture.
4. **Subwords (Byte Pair Encoding, BPE)**: Subword tokenization methods like BPE are used to break down words into smaller, more manageable units, reducing the vocabulary size and helping with rare word translations.
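
To make the subword idea concrete, here is a minimal, illustrative sketch of how BPE learns merge rules on a toy word list. It is not the implementation used by `sentencepiece` or `subword-nmt`; the function name `learn_bpe`, the `</w>` end-of-word marker, and the toy corpus are assumptions made for this example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each word becomes a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

toy_corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(dict(vocab))
```

Frequent character sequences such as `low` become single units after a few merges, while rarer words stay split into smaller pieces. This is what keeps the vocabulary small without discarding rare words.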



### Steps to Implement the Project:

We'll need a variety of libraries and tools, so let's start by identifying the key components of the project:

1. **Data Preprocessing**:
   - Tokenize and preprocess data with a subword tool such as `sentencepiece` or `subword-nmt`; both support Byte Pair Encoding (BPE).
   - Use `nltk` for general text processing and `spaCy` for language-specific tokenization.

2. **Model Implementation**:
   - **Encoder-Decoder with Attention**:
     - Use `TensorFlow` or `PyTorch` to implement the sequence-to-sequence architecture with attention mechanisms.
     - For attention, you can use the Bahdanau (additive) or Luong (multiplicative) variants.
   
3. **Training**:
   - Set up a training pipeline using frameworks like `PyTorch` or `TensorFlow/Keras`.
   - Use GPUs via `CUDA` for faster training (especially if dealing with large datasets).

4. **Evaluation**:
   - Compute metrics like BLEU score (via `nltk.translate` or `sacrebleu`).

5. **Libraries and Tools**:
   - **Core Libraries**:
     - `torch` or `tensorflow`: For neural network building and training.
     - `transformers`: For pre-trained models and tokenizers (e.g., BERT, T5, GPT).
     - `sentencepiece` or `subword-nmt`: For subword tokenization.
     - `nltk`, `spacy`: For data preprocessing and tokenization.
   - **Data Handling and Logging**:
     - `torchtext` (for easier text preprocessing and data handling).
     - `tensorboardX` (for logging and monitoring training).
   - **Metrics**:
     - `sacrebleu` or `nltk` (for BLEU score evaluation).



### Breakdown of the Model:

1. **Preprocessing with Subwords**:
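   - **SentencePiece or BPE**: We'll use subword tokenization, either SentencePiece (which trains a unigram model by default and can also train BPE via `model_type='bpe'`) or BPE via `subword-nmt`. Both break words into smaller chunks (subwords), so out-of-vocabulary words can still be represented as sequences of known pieces.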

2. **Seq2Seq Model with Attention**:
   - Encoder: Typically an LSTM or GRU-based model.
   - Decoder: LSTM/GRU-based, but with the attention mechanism to focus on different parts of the input.
   - Attention Mechanism: We'll use the Bahdanau or Luong attention. This mechanism computes a context vector based on the encoder's hidden states and the current state of the decoder.

3. **Training**:
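   - We will use teacher forcing during training: at each decoding step, the ground-truth previous token (rather than the model's own prediction) is fed to the decoder as its next input with some probability.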
   - For optimization, we'll use Adam or RMSProp.

4. **Evaluation**:
   - Use BLEU score to evaluate the quality of the translation output.
   - Optionally, use other metrics like ROUGE or TER.
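
The context-vector computation described in the attention bullet can be sketched in plain Python. This is a simplified illustration that assumes Luong-style dot-product scoring; the additive (Bahdanau-style) variant replaces the dot product with a small feed-forward scorer. The names `dot_attention` and `softmax` and the toy vectors are assumptions for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for a single decoding step."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, enc)) for enc in encoder_states]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    # The context vector is the weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

# Three encoder time steps with 2-dimensional hidden states (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = dot_attention(decoder_state, encoder_states)
print(weights)
print(context)
```

Encoder states that align with the decoder state get larger weights, so they dominate the context vector that is passed on to the output layer.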



## Plan

### 1. **Data Preparation**
   - **Dataset**: We'll need a parallel corpus for training, such as the [WMT](http://www.statmt.org/wmt20/) datasets or the [IWSLT](https://sites.google.com/site/iwsltevaluation2017/) datasets.
   - **Preprocessing**: Tokenize, clean, and split the data into training, validation, and test sets.
     - We’ll use `nltk` or `spacy` for basic tokenization.
     - Use `sentencepiece` or `subword-nmt` to perform subword tokenization.
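
As a rough sketch of the clean-and-split step, the snippet below filters a list of (source, target) sentence pairs and splits it into training, validation, and test sets. The helper names `clean_pairs` and `split_parallel`, the 100-token length limit, and the 80/10/10 split are assumptions for illustration, not values prescribed by the assignment.

```python
import random

def clean_pairs(pairs, max_len=100):
    """Strip whitespace and drop empty or overly long sentence pairs."""
    cleaned = []
    for src, trg in pairs:
        src, trg = src.strip(), trg.strip()
        if not src or not trg:
            continue
        if len(src.split()) > max_len or len(trg.split()) > max_len:
            continue
        cleaned.append((src, trg))
    return cleaned

def split_parallel(pairs, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle with a fixed seed, then split into train/valid/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_frac)
    n_valid = int(len(pairs) * valid_frac)
    test = pairs[:n_test]
    valid = pairs[n_test:n_test + n_valid]
    train = pairs[n_test + n_valid:]
    return train, valid, test

# Toy parallel corpus: 20 usable pairs plus two that should be filtered out.
raw = [(f"source sentence {i}", f"target sentence {i}") for i in range(20)]
raw += [("", "empty source"), ("   ", "whitespace only")]
cleaned = clean_pairs(raw)
train, valid, test = split_parallel(cleaned)
print(len(train), len(valid), len(test))
```

Keeping source and target together as pairs during shuffling ensures the two sides of the corpus stay aligned.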

In [1]:
pip install sentencepiece nltk spacy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:

import sentencepiece as spm
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/scales/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:

# Example of SentencePiece tokenization
spm.SentencePieceTrainer.train(input='poem.txt', model_prefix='model', vocab_size=77)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: poem.txt
  input_format: 
  model_prefix: model
  model_type: UNIGRAM
  vocab_size: 77
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_p


### 2. **Model Architecture (Seq2Seq + Attention)**

We'll implement a Seq2Seq model with attention in PyTorch. Here's a simplified architecture:


In [6]:
pip install torch



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [11]:
import random

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: [batch, src_len]
        embedded = self.dropout(self.embedding(src))
        # outputs: [batch, src_len, hidden_dim]; hidden, cell: [n_layers, batch, hidden_dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        # Return the per-step outputs as well: attention needs them
        return outputs, hidden, cell

class Attention(nn.Module):
    """Additive (Bahdanau-style) attention over the encoder outputs."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        self.v = nn.Parameter(torch.rand(hidden_dim))

    def forward(self, hidden, encoder_outputs):
        # hidden: [batch, hidden_dim]; encoder_outputs: [batch, src_len, hidden_dim]
        src_len = encoder_outputs.shape[1]
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = torch.sum(self.v * energy, dim=2)
        # Weights over source positions: [batch, src_len]
        return torch.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, n_layers, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        self.attention = attention
        self.fc_out = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell, encoder_outputs):
        # input: [batch] -> [batch, 1], so the LSTM sees a length-1 sequence
        embedded = self.dropout(self.embedding(input.unsqueeze(1)))
        rnn_output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        # Attend using the top layer's hidden state: [batch, hidden_dim]
        attention_weights = self.attention(hidden[-1], encoder_outputs)
        # Weighted sum of encoder outputs: [batch, 1, hidden_dim]
        context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
        output = torch.cat((rnn_output.squeeze(1), context_vector.squeeze(1)), dim=1)
        prediction = self.fc_out(output)
        return prediction, hidden, cell, attention_weights

class Seq2Seq(nn.Module):
    # Assumes the encoder and decoder share hidden_dim and n_layers,
    # because the encoder's final state initializes the decoder.
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim
        # outputs[:, 0] stays zero: position 0 is <sos> and is never predicted
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        # First input to the decoder is the <sos> token
        input = trg[:, 0]

        for t in range(1, trg_len):
            # Carry the decoder state forward from step to step
            output, hidden, cell, _ = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs



### 3. **Training Loop**



In [14]:

import torch
import torch.optim as optim

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        optimizer.zero_grad()

        output = model(src, trg)

        # Drop position 0 (<sos>): the model never predicts it, so
        # outputs[:, 0] is all zeros and would distort the loss
        output_dim = output.shape[-1]
        output = output[:, 1:].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()

        # Gradient clipping to limit exploding gradients in the RNN
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)


### 4. **Evaluation**


In [15]:

pip install sacrebleu



In [16]:
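import sacrebleu
import torch

def evaluate(model, iterator, criterion, decode):
    # `decode` is an assumed helper, not defined in this notebook: it maps a list
    # of token ids to a detokenized string and drops <eos>/padding tokens
    # (for example, a wrapper around a trained SentencePieceProcessor's `decode`).
    model.eval()
    epoch_loss = 0
    predictions, targets = [], []

    with torch.no_grad():
        for batch in iterator:
            src = batch.src
            trg = batch.trg
            output = model(src, trg, teacher_forcing_ratio=0)

            # Skip position 0 (<sos>), which the model never predicts
            output_dim = output.shape[-1]
            loss = criterion(output[:, 1:].reshape(-1, output_dim), trg[:, 1:].reshape(-1))
            epoch_loss += loss.item()

            # Convert token ids to text, one sentence per batch row
            pred_ids = output[:, 1:].argmax(-1).cpu().tolist()
            trg_ids = trg[:, 1:].cpu().tolist()
            predictions.extend(decode(ids) for ids in pred_ids)
            targets.extend(decode(ids) for ids in trg_ids)

    # sacrebleu expects a list of hypothesis strings and a list of reference lists
    bleu_score = sacrebleu.corpus_bleu(predictions, [targets]).score
    return epoch_loss / len(iterator), bleu_score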
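
BLEU is computed over decoded text, not over raw token-id arrays. The sketch below shows the core of the metric (clipped n-gram precision combined with a brevity penalty) in plain Python. It is a simplified illustration assuming whitespace tokenization, one reference per hypothesis, and no smoothing, so its scores will not match `sacrebleu` exactly. The name `corpus_bleu_sketch` is hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_sketch(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU on a 0-100 scale, one reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

refs = ["the cat sat on the mat", "there is a dog in the garden"]
perfect = corpus_bleu_sketch(refs, refs)
partial = corpus_bleu_sketch(["the cat sat on a mat", "there is a dog in the park"], refs)
disjoint = corpus_bleu_sketch(["completely unrelated words here", "nothing matches at all"], refs)
print(perfect, partial, disjoint)
```

For reported results, a standard implementation such as `sacrebleu` is preferable, since it fixes tokenization and makes scores comparable across papers.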


### 5. **Training and Evaluation**

Set up the data loaders, optimizer, and loss function.

### Next Steps:

- **Data Handling**: Implement data loading, preprocessing (tokenization, padding).
- **Hyperparameter Tuning**: Adjust hidden layer sizes, embedding dimensions, etc.
- **Optimization**: Test the model with different batch sizes, learning rates, etc.
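
For the data-handling step, a minimal padding sketch might look like the following. It assumes token ids have already been produced by the subword tokenizer. The ids `bos_id=1` and `eos_id=2` match the SentencePiece training log shown earlier, while `pad_id=3` is a hypothetical choice: that log shows `pad_id: -1` (padding disabled), so a padding id would need to be configured at training time. The helper names are also assumptions.

```python
def add_specials(ids, bos_id=1, eos_id=2):
    """Wrap a token-id sequence in begin- and end-of-sentence markers."""
    return [bos_id] + list(ids) + [eos_id]

def pad_batch(sequences, pad_id=3):
    """Right-pad sequences to the batch's longest length; also return true lengths."""
    max_len = max(len(seq) for seq in sequences)
    lengths = [len(seq) for seq in sequences]
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

# Three toy sentences of different lengths (token ids are made up).
batch = [add_specials(ids) for ids in ([5, 6, 7], [8], [9, 10])]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```

The padded lists can then be converted to tensors, and passing the padding id as `ignore_index` to `nn.CrossEntropyLoss` keeps padded positions out of the loss.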
