# Assignment

In this assignment you will continue working on the machine translation task from [lecture 7](https://github.com/pacifikus/itmo_dl_nlp_course/blob/main/Lecture%207/itmo_dl_nlp_course_06_seq2seq.ipynb).

Try to improve the model's quality by testing the following hypotheses:

- change the vocabulary size / preprocessing during tokenization - **1 point**
- continue experimenting with different RNN units in the encoder and decoder: swap GRU/LSTM, change the number of layers, use a bidirectional RNN - **1 point**
- improve the training process: add LR scheduling and early stopping, experiment with the optimizer - **2 points**
- experiment with sampling: replace greedy inference with alternative decoding methods - **2 points**
- run an ablation study and draw conclusions about how your changes affect the final performance of the model - **2 points**

**General**

- The decisions made are justified (why a particular architecture/hyperparameter/optimizer/transformation, etc. was chosen) - **1 point**
- The solution is reproducible: random_state is fixed and the notebook runs from start to finish without errors - **1 point**

## Stage 0 - Preparation

### Import the required libraries and components

In [10]:
import warnings
from typing import Generator
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.model_selection import train_test_split
from razdel import tokenize
from nltk.translate.bleu_score import corpus_bleu
from tqdm import trange
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE


warnings.filterwarnings("ignore")

### Fix the seeds

In [2]:
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

device = "cuda" if torch.cuda.is_available() else "cpu"

## Stage 1 - Vocabulary size

### Build per-language text files as in class

In [3]:
with open("../data/raw/data.txt", encoding="utf-8") as f:
    print(f.readline())

Cordelia Hotel is situated in Tbilisi, a 3-minute walk away from Saint Trinity Church.	Отель Cordelia расположен в Тбилиси, в 3 минутах ходьбы от Свято-Троицкого собора.



In [4]:
def tokenize_text(s: str) -> str:
    return " ".join(list(t.text for t in tokenize(s.lower())))


def read_file_lines(file_path: str, encoding="utf-8") -> Generator[str, None, None]:
    with open(file_path, encoding=encoding) as file:
        for line in file:
            yield line.strip()


print(tokenize_text("Cordelia Hotel is situated in Tbilisi, a 3-minute walk away from Saint Trinity Church."))

cordelia hotel is situated in tbilisi , a 3-minute walk away from saint trinity church .


In [5]:
artifact_base_path = Path("../data/hw_2/").resolve()

In [None]:
with open(artifact_base_path / "train.en", "w", encoding="utf-8") as f_src, open(
    artifact_base_path / "train.ru", "w", encoding="utf-8"
) as f_dst:

    for line in read_file_lines("../data/raw/data.txt"):
        src_line, dst_line = line.strip().split("\t")
        print(tokenize_text(src_line), file=f_src)
        print(tokenize_text(dst_line), file=f_dst)

In [6]:
def apply_bpe(num_symbols: int, lang: str) -> Path:
    input_file = artifact_base_path / f"train.{lang}"
    rules_file = artifact_base_path / f"bpe_rules_{num_symbols}.{lang}"

    learn_bpe(open(input_file, encoding="utf-8"), open(rules_file, "w", encoding="utf-8"), num_symbols=num_symbols)
    bpe = BPE(open(rules_file, encoding="utf-8"))

    processed_text = artifact_base_path / f"train.bpe_{num_symbols}.{lang}"

    with open(processed_text, "w", encoding="utf-8") as f_out:
        for line in read_file_lines(input_file):
            print(bpe.process_line(line.strip()), file=f_out)

    return processed_text

In [11]:
class Vocab:
    def __init__(self, tokens, bos="_BOS_", eos="_EOS_", unk="_UNK_"):
        """
        A special class that converts lines of tokens into matrices and backwards
        """
        assert all(tok in tokens for tok in (bos, eos, unk))
        self.tokens = tokens
        self.token_to_ix = {t: i for i, t in enumerate(tokens)}
        self.bos, self.eos, self.unk = bos, eos, unk
        self.bos_ix = self.token_to_ix[bos]
        self.eos_ix = self.token_to_ix[eos]
        self.unk_ix = self.token_to_ix[unk]

    def __len__(self):
        return len(self.tokens)

    @staticmethod
    def from_lines(lines, bos="_BOS_", eos="_EOS_", unk="_UNK_"):
        flat_lines = "\n".join(list(lines)).split()
        tokens = sorted(set(flat_lines))
        tokens = [t for t in tokens if t not in (bos, eos, unk) and len(t)]
        tokens = [bos, eos, unk] + tokens
        return Vocab(tokens, bos, eos, unk)

    def tokenize(self, string):
        """converts string to a list of tokens"""
        tokens = [tok if tok in self.token_to_ix else self.unk for tok in string.split()]
        return [self.bos] + tokens + [self.eos]

    def to_matrix(self, lines, dtype=torch.int64, max_len=None):
        """
        convert variable length token sequences into  fixed size matrix
        example usage:
        >>>print(to_matrix(words[:3],source_to_ix))
        [[15 22 21 28 27 13 -1 -1 -1 -1 -1]
         [30 21 15 15 21 14 28 27 13 -1 -1]
         [25 37 31 34 21 20 37 21 28 19 13]]
        """
        lines = list(map(self.tokenize, lines))
        max_len = max_len or max(map(len, lines))

        matrix = torch.full((len(lines), max_len), self.eos_ix, dtype=dtype)
        for i, seq in enumerate(lines):
            row_ix = list(map(self.token_to_ix.get, seq))[:max_len]
            matrix[i, : len(row_ix)] = torch.as_tensor(row_ix)

        return matrix

    def to_lines(self, matrix, crop=True):
        """
        Convert matrix of token ids into strings
        :param matrix: matrix of tokens of int32, shape=[batch,time]
        :param crop: if True, crops BOS and EOS from line
        :return:
        """
        lines = []
        for line_ix in map(list, matrix):
            if crop:
                if line_ix[0] == self.bos_ix:
                    line_ix = line_ix[1:]
                if self.eos_ix in line_ix:
                    line_ix = line_ix[: line_ix.index(self.eos_ix)]
            line = " ".join(self.tokens[i] for i in line_ix)
            lines.append(line)
        return lines

    def compute_mask(self, input_ix):
        """compute a boolean mask that equals "1" until first EOS (including that EOS)"""
        return F.pad(torch.cumsum(input_ix == self.eos_ix, dim=-1)[..., :-1] < 1, pad=(1, 0, 0, 0), value=True)

### Compare different vocabulary sizes

To keep the check fast, we take three sizes: 8k as in class, 8k * 2, and 8k * 4.

Let's try to find the best trade-off between sequence length and vocabulary size.
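To make the trade-off concrete, here is a minimal sketch of how the number of BPE merges affects sequence length. It is a toy re-implementation for illustration only, not the `subword_nmt` algorithm used in this notebook (it has no end-of-word marker and no `@@` continuation symbols); the names `apply_merge`, `learn_toy_bpe`, `average_length`, and the word list are all assumptions made for this example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_toy_bpe(words, num_merges):
    """Greedily learn up to `num_merges` merges of the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges


def average_length(words, merges):
    """Average number of symbols per word after applying `merges` in order."""
    total = 0
    for word in words:
        symbols = tuple(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        total += len(symbols)
    return total / len(words)


toy_words = "low low low lower lower newest newest newest widest".split()
all_merges = learn_toy_bpe(toy_words, num_merges=8)
lengths = {n: average_length(toy_words, all_merges[:n]) for n in (0, 4, 8)}
print(lengths)
```

More merges mean a larger vocabulary and shorter sequences, but each additional merge covers a rarer pair, so the gain shrinks. The experiment below measures the same effect on the real corpus.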

In [7]:
vocab_sizes = [8192, 16384, 32768]

In [12]:
for selected_vocab_size in vocab_sizes:
    train_bpe_en = apply_bpe(selected_vocab_size, "en")
    train_bpe_ru = apply_bpe(selected_vocab_size, "ru")

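    # Russian is the source side and English the target, as in the final setup below.
    # The recorded output below was produced with train_bpe_en read for both sides,
    # which is why its input and output statistics are identical.
    data_inp = np.array(open(train_bpe_ru, encoding="utf-8").read().split("\n"))
    data_out = np.array(open(train_bpe_en, encoding="utf-8").read().split("\n"))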

    data_inp = data_inp[data_inp != ""]
    data_out = data_out[data_out != ""]

    train_inp, dev_inp, train_out, dev_out = train_test_split(data_inp, data_out, test_size=3000, random_state=seed)

    inp_voc = Vocab.from_lines(train_inp)
    out_voc = Vocab.from_lines(train_out)

    print(f"\n=== Vocab Size {selected_vocab_size} ===")
    print("Input vocab size:", len(inp_voc))
    print("Output vocab size:", len(out_voc))

    avg_tokens_inp = np.mean([len(line.split()) for line in train_inp])
    avg_tokens_out = np.mean([len(line.split()) for line in train_out])
    print("Average token count - Input:", avg_tokens_inp, "| Output:", avg_tokens_out)

100%|██████████| 8192/8192 [00:04<00:00, 1771.83it/s]
100%|██████████| 8192/8192 [00:05<00:00, 1586.32it/s]



=== Vocab Size 8192 ===
Input vocab size: 8009
Output vocab size: 8009
Average token count - Input: 17.71721276595745 | Output: 17.71721276595745


100%|██████████| 16384/16384 [00:11<00:00, 1376.58it/s]
100%|██████████| 16384/16384 [00:12<00:00, 1334.91it/s]



=== Vocab Size 16384 ===
Input vocab size: 14799
Output vocab size: 14799
Average token count - Input: 17.01168085106383 | Output: 17.01168085106383


 72%|███████▏  | 23588/32768 [00:16<00:02, 4184.45it/s]no pair has frequency >= 2. Stopping
 73%|███████▎  | 24060/32768 [00:17<00:06, 1403.09it/s]
100%|██████████| 32768/32768 [00:28<00:00, 1169.38it/s]



=== Vocab Size 32768 ===
Input vocab size: 20442
Output vocab size: 20442
Average token count - Input: 16.695531914893618 | Output: 16.695531914893618


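### Conclusion

Increasing the vocabulary size reduces the average sequence length (from 17.72 tokens at 8192 merges to 16.70 at 32768), which lowers the computational cost. Both columns of the recorded output were computed from the English file, so these figures describe the English side only.

However, going from 16k to 32k shortens sequences only marginally while the vocabulary grows considerably; for English the BPE learner even stopped early, after roughly 24k merges, because no remaining pair occurred at least twice. About 16k looks like the best choice.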
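The diminishing return can be quantified with a little arithmetic on the recorded numbers. The sketch below only reuses the vocabulary sizes and average lengths printed above; the metric "tokens saved per 1000 added vocabulary entries" is an illustrative choice made for this example, not something defined in the assignment.

```python
# Recorded statistics from the output above:
# (requested merges, actual vocabulary size, average tokens per line).
runs = [
    (8192, 8009, 17.71721276595745),
    (16384, 14799, 17.01168085106383),
    (32768, 20442, 16.695531914893618),
]

gains = []
for (_, voc_prev, len_prev), (merges, voc_cur, len_cur) in zip(runs, runs[1:]):
    added_entries = voc_cur - voc_prev   # how much the vocabulary grew
    saved_tokens = len_prev - len_cur    # how much shorter an average line became
    gain = 1000 * saved_tokens / added_entries
    gains.append(gain)
    print(
        f"up to {merges} merges: +{added_entries} entries, "
        f"-{saved_tokens:.3f} tokens/line, {gain:.3f} tokens saved per 1000 entries"
    )
```

The first step (8k to 16k) saves about 0.10 tokens per line for every 1000 added entries, the second (16k to 32k) only about 0.06, which supports stopping at 16k.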

In [13]:
selected_vocab_size = 16384

train_bpe_en = apply_bpe(selected_vocab_size, "en")
train_bpe_ru = apply_bpe(selected_vocab_size, "ru")

data_inp = np.array(open(artifact_base_path / train_bpe_ru, encoding="utf-8").read().split("\n"))
data_out = np.array(open(artifact_base_path / train_bpe_en, encoding="utf-8").read().split("\n"))

data_inp = data_inp[data_inp != ""]
data_out = data_out[data_out != ""]

train_inp, dev_inp, train_out, dev_out = train_test_split(data_inp, data_out, test_size=3000, random_state=2023)

inp_voc = Vocab.from_lines(train_inp)
out_voc = Vocab.from_lines(train_out)

100%|██████████| 16384/16384 [00:11<00:00, 1434.14it/s]
100%|██████████| 16384/16384 [00:11<00:00, 1405.21it/s]


## Stage 2 - Model experiments

### Baseline models from the lecture

In [18]:
class BasicModel(nn.Module):
    def __init__(self, inp_voc, out_voc, emb_size=64, hid_size=128):
        """
        A simple encoder-decoder seq2seq model
        """
        super().__init__()

        self.inp_voc, self.out_voc = inp_voc, out_voc
        self.hid_size = hid_size

        self.emb_inp = nn.Embedding(len(inp_voc), emb_size)
        self.emb_out = nn.Embedding(len(out_voc), emb_size)
        self.enc0 = nn.GRU(emb_size, hid_size, batch_first=True)

        self.dec_start = nn.Linear(hid_size, hid_size)
        self.dec0 = nn.GRUCell(emb_size, hid_size)
        self.logits = nn.Linear(hid_size, len(out_voc))

    def forward(self, inp, out):
        """Apply model in training mode"""
        initial_state = self.encode(inp)
        return self.decode(initial_state, out)

    def encode(self, inp, **flags):
        """
        Takes input sequence, computes initial state
        :param inp: matrix of input tokens [batch, time]
        :returns: initial decoder state tensors, one or many
        """
        inp_emb = self.emb_inp(inp)
        batch_size = inp.shape[0]

        enc_seq, last_state_but_not_really = self.enc0(inp_emb)
        # enc_seq: [batch, sequence length, hid_size], last_state: [batch, hid_size]

        # note: last_state is not _actually_ last because of padding, let's find the real last_state
        lengths = (inp != self.inp_voc.eos_ix).to(torch.int64).sum(dim=1).clamp_max(inp.shape[1] - 1)
        last_state = enc_seq[torch.arange(len(enc_seq)), lengths]
        # ^-- shape: [batch_size, hid_size]

        dec_start = self.dec_start(last_state)
        return [dec_start]

    def decode_step(self, prev_state, prev_tokens, **flags):
        """
        Takes previous decoder state and tokens, returns new state and logits for next tokens
        :param prev_state: a list of previous decoder state tensors, same as returned by encode(...)
        :param prev_tokens: previous output tokens, an int vector of [batch_size]
        :return: a list of next decoder state tensors, a tensor of logits [batch, len(out_voc)]
        """
        prev_gru0_state = prev_state[0]
        prev_emb = self.emb_out(prev_tokens)
        new_dec_state = self.dec0(prev_emb, prev_gru0_state)  # input & hidden states
        output_logits = self.logits(new_dec_state)

        return [new_dec_state], output_logits

    def decode(self, initial_state, out_tokens, **flags):
        """Iterate over reference tokens (out_tokens) with decode_step"""
        batch_size = out_tokens.shape[0]
        state = initial_state

        # initial logits: always predict BOS
        onehot_bos = F.one_hot(
            torch.full([batch_size], self.out_voc.bos_ix, dtype=torch.int64), num_classes=len(self.out_voc)
        ).to(device=out_tokens.device)
        first_logits = torch.log(onehot_bos.to(torch.float32) + 1e-9)

        logits_sequence = [first_logits]
        for i in range(out_tokens.shape[1] - 1):
            state, logits = self.decode_step(state, out_tokens[:, i])
            logits_sequence.append(logits)
        return torch.stack(logits_sequence, dim=1)

    def decode_inference(self, initial_state, max_len=100, **flags):
        """Generate translations from model (greedy version)"""
        batch_size, device = len(initial_state[0]), initial_state[0].device
        state = initial_state
        outputs = [torch.full([batch_size], self.out_voc.bos_ix, dtype=torch.int64, device=device)]
        all_states = [initial_state]

        for i in range(max_len):
            state, logits = self.decode_step(state, outputs[-1])
            outputs.append(logits.argmax(dim=-1))
            all_states.append(state)

        return torch.stack(outputs, dim=1), all_states

    def translate_lines(self, inp_lines, **kwargs):
        inp = self.inp_voc.to_matrix(inp_lines).to(device)
        initial_state = self.encode(inp)
        out_ids, states = self.decode_inference(initial_state, **kwargs)
        return self.out_voc.to_lines(out_ids.cpu().numpy()), states


class AttentionLayer(nn.Module):
    def __init__(self, enc_size, dec_size, hid_size):
        """A layer that computes additive attention response and weights"""
        super().__init__()
        self.enc_size = enc_size  # num units in encoder state
        self.dec_size = dec_size  # num units in decoder state
        self.hid_size = hid_size  # attention layer hidden units

        self.linear_enc = nn.Linear(enc_size, hid_size)
        self.linear_dec = nn.Linear(dec_size, hid_size)
        self.linear_out = nn.Linear(hid_size, 1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, enc, dec, inp_mask):
        """
        Computes attention response and weights
        :param enc: encoder activation sequence, float32[batch_size, ninp, enc_size]
        :param dec: single decoder state used as "query", float32[batch_size, dec_size]
        :param inp_mask: mask on enc activations (0 after first eos), float32 [batch_size, ninp]
        :returns: attn[batch_size, enc_size], probs[batch_size, ninp]
            - attn - attention response vector (weighted sum of enc)
            - probs - attention weights after softmax
        """
        batch_size, ninp, enc_size = enc.shape

        # Compute logits
        x = self.linear_dec(dec).reshape(-1, 1, self.hid_size)
        x = torch.tanh(self.linear_enc(enc) + x)
        x = self.linear_out(x)

        # Apply mask - if mask is 0, logits should be -inf or -1e9
        # You may need torch.where
        x[torch.where(inp_mask == False)] = -1e9

        # Compute attention probabilities (softmax)
        probs = self.softmax(x.reshape(batch_size, ninp))

        # Compute attention response using enc and probs
        attn = (probs.reshape(batch_size, ninp, 1) * enc).sum(1)

        return attn, probs


class AttentiveModel(BasicModel):
    def __init__(self, inp_voc, out_voc, emb_size=64, hid_size=128, attn_size=128):
        """Translation model that uses attention. See instructions above."""
        super().__init__(inp_voc, out_voc, emb_size, hid_size)
        self.inp_voc, self.out_voc = inp_voc, out_voc
        self.hid_size = hid_size

        self.enc0 = nn.LSTM(emb_size, hid_size, num_layers=2, batch_first=True)
        self.dec_start = nn.Linear(hid_size, hid_size)

        self.dec0 = nn.GRUCell(emb_size + hid_size, hid_size)
        self.attention = AttentionLayer(hid_size, hid_size, attn_size)

    def encode(self, inp, **flags):
        """
        Takes input sequences, computes initial state
        :param inp: matrix of input tokens [batch, time]
        :return: a list of initial decoder state tensors
        """
        # encode input sequence, create initial decoder states
        inp_emb = self.emb_inp(inp)
        enc_seq, last_state_but_not_really = self.enc0(inp_emb)
        # [dec_start] = super().encode(inp, **flags)

        lengths = (inp != self.inp_voc.eos_ix).to(torch.int64).sum(dim=1).clamp_max(inp.shape[1] - 1)
        last_state = enc_seq[torch.arange(len(enc_seq)), lengths]
        # ^-- shape: [batch_size, hid_size]
        dec_start = self.dec_start(last_state)

        # apply attention layer from initial decoder hidden state
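        # the mask belongs to the input sequence, so use the input vocabulary
        inp_mask = self.inp_voc.compute_mask(inp)
        # the attention layer returns (response, probabilities); keep only the probabilities
        _, first_attn_probas = self.attention(enc_seq, dec_start, inp_mask)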

        # Build first state: include
        # * initial states for decoder recurrent layers
        # * encoder sequence and encoder attn mask (for attention)
        # * make sure that last state item is attention probabilities tensor

        first_state = [dec_start, enc_seq, inp_mask, first_attn_probas]
        return first_state

    def decode_step(self, prev_state, prev_tokens, **flags):
        """
        Takes previous decoder state and tokens, returns new state and logits for next tokens
        :param prev_state: a list of previous decoder state tensors
        :param prev_tokens: previous output tokens, an int vector of [batch_size]
        :return: a list of next decoder state tensors, a tensor of logits [batch, n_tokens]
        """

        prev_gru0_state, enc_seq, enc_mask, _ = prev_state
        attn, attn_probs = self.attention(enc_seq, prev_gru0_state, enc_mask)

        x = self.emb_out(prev_tokens)
        x = torch.cat([attn, x], dim=-1)
        x = self.dec0(x, prev_gru0_state)

        new_dec_state = [x, enc_seq, enc_mask, attn_probs]
        output_logits = self.logits(x)
        return [new_dec_state, output_logits]
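The masking step in `AttentionLayer.forward` can be illustrated without tensors. The following is a minimal pure-Python sketch of a masked softmax; the function name `masked_softmax` and the example values are assumptions made for illustration. It shows why replacing padded positions with `-1e9` drives their attention weights to zero.

```python
import math


def masked_softmax(logits, mask, neg=-1e9):
    """Softmax over `logits` where positions with mask == False get (almost) zero weight."""
    masked = [x if keep else neg for x, keep in zip(logits, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Four encoder positions; the last two are padding after the first EOS.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
print(probs)
```

Even though the padded position with logit 3.0 would otherwise dominate, all probability mass ends up on the two real positions.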

### Build our own models for the assignment

1. Use a GRU instead of an LSTM
2. Use a bidirectional LSTM

We will not create too many models, to keep the training time manageable.

In [28]:
class AttentiveModel_GRU(BasicModel):
    def __init__(self, inp_voc, out_voc, emb_size=64, hid_size=128, attn_size=128):
        """Translation model with a 2-layer GRU encoder and GRUCell decoder."""
        super().__init__(inp_voc, out_voc, emb_size, hid_size)
        self.inp_voc, self.out_voc = inp_voc, out_voc
        self.hid_size = hid_size

        self.enc0 = nn.GRU(emb_size, hid_size, num_layers=2, batch_first=True)
        self.dec_start = nn.Linear(hid_size, hid_size)
        self.dec0 = nn.GRUCell(emb_size + hid_size, hid_size)
        self.attention = AttentionLayer(hid_size, hid_size, attn_size)

    def encode(self, inp, **flags):
        inp_emb = self.emb_inp(inp)
        enc_seq, last_state = self.enc0(inp_emb)
        last_state = last_state[-1]
        dec_start = self.dec_start(last_state)
        inp_mask = self.inp_voc.compute_mask(inp)  # mask of the input sequence
        _, first_attn = self.attention(enc_seq, dec_start, inp_mask)  # keep probabilities only
        return [dec_start, enc_seq, inp_mask, first_attn]

    def decode_step(self, prev_state, prev_tokens, **flags):
        prev_gru0_state, enc_seq, enc_mask, _ = prev_state
        attn, attn_probs = self.attention(enc_seq, prev_gru0_state, enc_mask)
        x = self.emb_out(prev_tokens)
        x = torch.cat([attn, x], dim=-1)
        x = self.dec0(x, prev_gru0_state)
        new_state = [x, enc_seq, enc_mask, attn_probs]
        output_logits = self.logits(x)
        return [new_state, output_logits]


class AttentiveModel_Bidirectional(BasicModel):
    def __init__(self, inp_voc, out_voc, emb_size=64, hid_size=128, attn_size=128):
        """Translation model with a 2-layer bidirectional LSTM encoder and GRUCell decoder."""
        super().__init__(inp_voc, out_voc, emb_size, hid_size)
        self.inp_voc, self.out_voc = inp_voc, out_voc
        self.hid_size = hid_size

        self.enc0 = nn.LSTM(emb_size, hid_size, num_layers=2, batch_first=True, bidirectional=True)
        self.dec_start = nn.Linear(hid_size, hid_size)  # combining forward and backward states
        self.attn_proj = nn.Linear(hid_size * 2, hid_size)
        self.dec0 = nn.GRUCell(emb_size + hid_size, hid_size)
        self.logits = nn.Linear(hid_size, len(out_voc))
        self.attention = AttentionLayer(hid_size * 2, hid_size, attn_size)

    def encode(self, inp, **flags):
        inp_emb = self.emb_inp(inp)
        enc_seq, (h_n, c_n) = self.enc0(inp_emb)
        forward = h_n[-2, :, :]
        backward = h_n[-1, :, :]
        last_state = forward + backward
        dec_start = self.dec_start(last_state)
        inp_mask = self.inp_voc.compute_mask(inp)  # mask of the input sequence
        _, first_attn = self.attention(enc_seq, dec_start, inp_mask)  # keep probabilities only
        return [dec_start, enc_seq, inp_mask, first_attn]

    def decode_step(self, prev_state, prev_tokens, **flags):
        prev_gru0_state, enc_seq, enc_mask, _ = prev_state
        attn, attn_probs = self.attention(enc_seq, prev_gru0_state, enc_mask)
        attn_proj = self.attn_proj(attn)
        x = self.emb_out(prev_tokens)
        x = torch.cat([attn_proj, x], dim=-1)
        x = self.dec0(x, prev_gru0_state)
        new_state = [x, enc_seq, enc_mask, attn_probs]
        output_logits = self.logits(x)
        return [new_state, output_logits]

### Training and evaluation helpers

In [20]:
def compute_loss(model, inp, out, **flags):
    """
    Compute loss (float32 scalar) as in the formula above
    :param inp: input tokens matrix, int32[batch, time]
    :param out: reference tokens matrix, int32[batch, time]

    In order to pass the tests, your function should
    * include loss at first EOS but not the subsequent ones
    * divide sum of losses by a sum of input lengths (use voc.compute_mask)
    """
    mask = model.out_voc.compute_mask(out)  # [batch_size, out_len]
    targets_1hot = F.one_hot(out, len(model.out_voc)).to(torch.float32)

    # outputs of the model, [batch_size, out_len, num_tokens]
    logits_seq = model(inp, out)

    # log-probabilities of all tokens at all steps, [batch_size, out_len, num_tokens]
    logprobs_seq = torch.log_softmax(logits_seq, dim=-1)

    # log-probabilities of correct outputs, [batch_size, out_len]
    logp_out = (logprobs_seq * targets_1hot).sum(dim=-1)
    # ^-- this will select the probability of the actual next token.
    # Note: you can compute loss more efficiently using F.cross_entropy

    # average cross-entropy over tokens where mask == True
    return -logp_out[mask].mean()


def compute_bleu(model, inp_lines, out_lines, bpe_sep="@@ ", **flags):
    """
    Estimates corpora-level BLEU score of model's translations given inp and reference out
    """
    with torch.no_grad():
        translations, _ = model.translate_lines(inp_lines, **flags)
        translations = [line.replace(bpe_sep, "") for line in translations]
        actual = [line.replace(bpe_sep, "") for line in out_lines]
        return (
            corpus_bleu(
                [[ref.split()] for ref in actual],
                [trans.split() for trans in translations],
                smoothing_function=lambda precisions, **kw: [p + 1.0 / p.denominator for p in precisions],
            )
            * 100
        )


def train_model(
    model,
    train_inp,
    train_out,
    dev_inp,
    dev_out,
    inp_voc,
    out_voc,
    n_steps=5000,
    batch_size=32,
    eval_interval=100,
):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    metrics = {"train_loss": [], "dev_bleu": []}
    for step in trange(n_steps):
        batch_ix = np.random.randint(len(train_inp), size=batch_size)
        batch_inp = inp_voc.to_matrix(train_inp[batch_ix]).to(device)
        batch_out = out_voc.to_matrix(train_out[batch_ix]).to(device)
        opt.zero_grad()
        loss = compute_loss(model, batch_inp, batch_out)
        loss.backward()
        opt.step()
        metrics["train_loss"].append(loss.item())
        if step % eval_interval == 0:
            bleu = compute_bleu(model, dev_inp, dev_out)
            metrics["dev_bleu"].append(bleu)
    return metrics

In [None]:
model_lstm = AttentiveModel(inp_voc, out_voc).to(device)
metrics_lstm = train_model(model_lstm, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000)

100%|██████████| 5000/5000 [06:00<00:00, 13.87it/s]


In [24]:
final_bleu_lstm = np.mean(metrics_lstm["dev_bleu"][-10:])
print(f"{final_bleu_lstm:.3f}")

11.255


In [25]:
model_gru = AttentiveModel_GRU(inp_voc, out_voc).to(device)
metrics_gru = train_model(model_gru, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000)

100%|██████████| 5000/5000 [05:54<00:00, 14.11it/s]


In [26]:
final_bleu_gru = np.mean(metrics_gru["dev_bleu"][-10:])
print(f"{final_bleu_gru:.3f}")

12.015


In [29]:
model_bi = AttentiveModel_Bidirectional(inp_voc, out_voc).to(device)
metrics_bi = train_model(model_bi, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000)

100%|██████████| 5000/5000 [06:29<00:00, 12.84it/s]


In [30]:
final_bleu_bi = np.mean(metrics_bi["dev_bleu"][-10:])
print(f"{final_bleu_bi:.3f}")

14.159


### Conclusion

The bidirectional LSTM (BLEU 14.159) clearly improves translation quality over the baseline model (BLEU 11.255) and the GRU-encoder model (BLEU 12.015). This is consistent with the idea that using context from both directions helps, although each configuration was trained only once.

## Stage 4 - Optimizing the training process

### Add the methods listed in the assignment to the training loop

In [31]:
def train_model_improved(
    model,
    train_inp,
    train_out,
    dev_inp,
    dev_out,
    inp_voc,
    out_voc,
    n_steps=5000,
    batch_size=32,
    eval_interval=100,
    patience=5,
    optimizer_choice="adam",
):

    if optimizer_choice == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=patience, verbose=True
    )

    metrics = {"train_loss": [], "dev_bleu": []}
    best_bleu = -float("inf")
    best_step = 0

    for step in trange(n_steps):
        batch_ix = np.random.randint(len(train_inp), size=batch_size)
        batch_inp = inp_voc.to_matrix(train_inp[batch_ix]).to(device)
        batch_out = out_voc.to_matrix(train_out[batch_ix]).to(device)

        optimizer.zero_grad()
        loss = compute_loss(model, batch_inp, batch_out)
        loss.backward()
        optimizer.step()

        metrics["train_loss"].append(loss.item())

        if step % eval_interval == 0:
            bleu = compute_bleu(model, dev_inp, dev_out)
            metrics["dev_bleu"].append(bleu)
            scheduler.step(bleu)

            if bleu > best_bleu:
                best_bleu = bleu
                best_step = step
            elif step - best_step > patience * eval_interval:
                print(f"Early stopping at step {step}, best BLEU: {best_bleu:.2f}")
                break

    return metrics
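The early-stopping rule inside `train_model_improved` can be isolated into a small helper. This is a minimal sketch under the same rule (stop once more than `patience` evaluations pass without a new best score); the class name `EarlyStopping` and the score sequence are assumptions made for illustration.

```python
class EarlyStopping:
    """Signals a stop after more than `patience` evaluations without improvement."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_score = float("-inf")
        self.bad_evals = 0  # evaluations since the last improvement

    def update(self, score: float) -> bool:
        """Record a new score and return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals > self.patience


stopper = EarlyStopping(patience=2)
dev_scores = [1.0, 2.0, 3.0, 2.9, 2.8, 2.7, 2.6]  # made-up BLEU values
stopped_at = None
for i, score in enumerate(dev_scores):
    if stopper.update(score):
        stopped_at = i
        break
print(stopped_at, stopper.best_score)
```

One possible extension, not done in this notebook, is to store a copy of `model.state_dict()` whenever the score improves, so that the best checkpoint rather than the last one is used for evaluation.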

In [32]:
model_lstm_adam = AttentiveModel(inp_voc, out_voc).to(device)
metrics_lstm_adam = train_model_improved(
    model_lstm_adam, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000
)

100%|██████████| 5000/5000 [06:01<00:00, 13.83it/s]


In [33]:
final_bleu_lstm_adam = np.mean(metrics_lstm_adam["dev_bleu"][-10:])
print(f"{final_bleu_lstm_adam:.3f}")

11.442


In [None]:
model_lstm_sgd = AttentiveModel(inp_voc, out_voc).to(device)
metrics_lstm_sgd = train_model_improved(
    model_lstm_sgd, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000, optimizer_choice="sgd"
)

 50%|█████     | 2500/5000 [03:04<03:04, 13.55it/s]

Early stopping at step 2500, best BLEU: 2.48





In [41]:
final_bleu_lstm_sgd = np.mean(metrics_lstm_sgd["dev_bleu"][-10:])
print(f"{final_bleu_lstm_sgd:.3f}")

2.159


In [36]:
model_gru_adam = AttentiveModel_GRU(inp_voc, out_voc).to(device)
metrics_gru_adam = train_model_improved(
    model_gru_adam, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000
)

100%|██████████| 5000/5000 [06:01<00:00, 13.84it/s]


In [37]:
final_bleu_gru_adam = np.mean(metrics_gru_adam["dev_bleu"][-10:])
print(f"{final_bleu_gru_adam:.3f}")

13.390


In [38]:
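model_gru_sgd = AttentiveModel_GRU(inp_voc, out_voc).to(device)
# Train the fresh model. The recorded output below came from a run that passed
# model_gru_adam here, i.e. it continued training the Adam-trained model with SGD.
metrics_gru_sgd = train_model_improved(
    model_gru_sgd, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000, optimizer_choice="sgd"
)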

 42%|████▏     | 2100/5000 [02:32<03:30, 13.74it/s]

Early stopping at step 2100, best BLEU: 15.99





In [39]:
final_bleu_gru_sgd = np.mean(metrics_gru_sgd["dev_bleu"][-10:])
print(f"{final_bleu_gru_sgd:.3f}")

15.727


In [None]:
model_bi_adam = AttentiveModel_Bidirectional(inp_voc, out_voc).to(device)
metrics_bi_adam = train_model_improved(
    model_bi_adam, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000
)

100%|██████████| 5000/5000 [06:43<00:00, 12.40it/s]


In [45]:
final_bleu_bi_adam = np.mean(metrics_bi_adam["dev_bleu"][-10:])
print(f"{final_bleu_bi_adam:.3f}")

11.144


In [46]:
model_bi_sgd = AttentiveModel_Bidirectional(inp_voc, out_voc).to(device)
metrics_bi_sgd = train_model_improved(
    model_bi_sgd, train_inp, train_out, dev_inp, dev_out, inp_voc, out_voc, n_steps=5000, optimizer_choice="sgd"
)

 80%|████████  | 4000/5000 [05:19<01:19, 12.52it/s]

Early stopping at step 4000, best BLEU: 3.06





In [48]:
final_bleu_bi_sgd = np.mean(metrics_bi_sgd["dev_bleu"][-10:])
print(f"{final_bleu_bi_sgd:.3f}")

2.743


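### Conclusion

The results are mixed. The highest score, ~16 BLEU (best 15.99, mean of the last 10 evaluations 15.727), came from the GRU "SGD" run. Note that the recorded run passed `model_gru_adam` to `train_model_improved`, so it continued training the Adam-trained GRU model with SGD and early stopping rather than training a fresh model with SGD.

Other results were far worse: the LSTM and bidirectional models trained with SGD from scratch stayed at ~2-3 BLEU. Perhaps we ended up in a poor local minimum or chose unsuitable hyperparameters.

## Stage 5 - Inference experiments

### Implement `sampling` in addition to `greedy`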

In [49]:
def perform_inference(model, inp_lines, max_len=100, method="greedy", top_k=10, temperature=1.0):
    inp = model.inp_voc.to_matrix(inp_lines).to(device)
    state = model.encode(inp)
    batch_size = state[0].shape[0]

    outputs = [torch.full([batch_size], model.out_voc.bos_ix, dtype=torch.int64, device=device)]
    all_states = [state]

    for _ in range(max_len):
        state, logits = model.decode_step(state, outputs[-1])
        if method == "greedy":
            next_tokens = logits.argmax(dim=-1)
        elif method == "sampling":
            scaled_logits = logits / temperature
            top_logits, top_indices = torch.topk(scaled_logits, top_k, dim=-1)
            probs = F.softmax(top_logits, dim=-1)
            next_idx = torch.multinomial(probs, 1).squeeze(dim=-1)
            next_tokens = top_indices.gather(dim=-1, index=next_idx.unsqueeze(-1)).squeeze(-1)
        else:
            raise ValueError("Unknown decoding method")
        outputs.append(next_tokens)
        all_states.append(state)

    output_ids = torch.stack(outputs, dim=1)
    translations = model.out_voc.to_lines(output_ids.cpu().numpy())
    return translations, all_states
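To see what `top_k` and `temperature` do in the `sampling` branch, here is a minimal pure-Python sketch on a single made-up logit vector. The helper names `top_k_probs` and `sample_token` are assumptions made for this example; the logic mirrors the steps above (scale by temperature, keep the top-k logits, apply softmax, draw one token).

```python
import math
import random


def top_k_probs(logits, top_k, temperature=1.0):
    """Return {token index: probability} over the top_k logits after temperature scaling."""
    scaled = [x / temperature for x in logits]
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in top)
    exps = {i: math.exp(scaled[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sample_token(logits, top_k, temperature, rng):
    """Draw one token index from the truncated, temperature-scaled distribution."""
    probs = top_k_probs(logits, top_k, temperature)
    indices = list(probs)
    return rng.choices(indices, weights=[probs[i] for i in indices], k=1)[0]


logits = [4.0, 3.0, 1.0, 0.5, -2.0]
sharp = top_k_probs(logits, top_k=3, temperature=0.5)  # closer to greedy
flat = top_k_probs(logits, top_k=3, temperature=2.0)   # more diverse
token = sample_token(logits, top_k=3, temperature=1.0, rng=random.Random(42))
print(sharp, flat, token)
```

A temperature below 1 concentrates probability on the best token, while a temperature above 1 spreads it out. Lowering `temperature` or `top_k` is therefore a natural first thing to try if sampled translations look too noisy.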

In [50]:
sample_inputs = dev_inp[:5]
sample_inputs

array(['всем гостям предоставляются японские халаты юката . на территории предусмотрена бесплатная парковка .',
       'апартаменты casa della g@@ hi@@ anda@@ ia расположены в 5 минутах ходьбы от ближайшего песчаного пляжа в порто-@@ чезаре@@ о . к услугам гостей апартаменты с кондиционером и телевизором с плоским экраном .',
       'в каждом номере отеля cand@@ i@@ ani имеется кондиционер , мини-бар и телевизор со спутниковыми каналами .',
       'поездка до ресторана lang@@ dal@@ es viney@@ ard занимает 20 минут .',
       'в лобби-баре некоторые местные алкогольные и безалкогольные напитки подаются бесплатно .'],
      dtype='<U490')

In [None]:
greedy_translations, _ = perform_inference(model_lstm_adam, sample_inputs, max_len=25, method="greedy")
print("Greedy decoding results:")
for t in greedy_translations:
    print(t)

Greedy decoding results:
free parking is available on site .
located in a quiet area , just a 5-minute walk from the beach , this apartment features a balcony with a terrace .
all rooms at the hotel feature a flat-screen tv , air conditioning and a minibar .
the hotel is a 10-minute drive from the hotel .
the bar offers a variety of drinks and snacks .


In [None]:
sampling_translations, _ = perform_inference(
    model_lstm_adam, sample_inputs, max_len=25, method="sampling", top_k=10, temperature=1.0
)
print("Sampling decoding results:")
for t in sampling_translations:
    print(t)

Sampling decoding results:
free wifi is available throughout the hotel .
featuring a garden terrace , villa casa is a self-catering accommodation located in a quiet area .
all rooms at the hotel ' s apartments have a flat-screen tv and a minibar .
it takes just off the town of rome , is 12 miles away .
the lobby bar serves breakfast and copy , .


### Conclusion

Clearly, the translation quality leaves much to be desired.

It is also visible that `sampling` makes the texts noticeably more diverse, although their quality subjectively seems lower than with `greedy`.
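Another common alternative to greedy decoding, not implemented in this notebook, is beam search. The sketch below is a simplified pure-Python version over a made-up next-token table, so every name in it (`beam_search`, `greedy_search`, `toy_step`, `TABLE`) is an assumption for illustration. It has no length normalization or batching, and adapting it to `decode_step` would require tracking a decoder state per hypothesis.

```python
import math

# Made-up next-token probabilities, keyed by the prefix generated so far.
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.55, "y": 0.45},
    ("<s>", "b"): {"w": 0.9, "v": 0.1},
}


def toy_step(prefix):
    """Return {token: log-probability}; unknown prefixes always end the sentence."""
    dist = TABLE.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}


def greedy_search(step_fn, bos, eos, max_len=10):
    prefix, score = (bos,), 0.0
    for _ in range(max_len):
        dist = step_fn(prefix)
        token = max(dist, key=dist.get)  # always take the locally best token
        prefix, score = prefix + (token,), score + dist[token]
        if token == eos:
            break
    return prefix, score


def beam_search(step_fn, bos, eos, beam_size=2, max_len=10):
    beams, finished = [((bos,), 0.0)], []
    for _ in range(max_len):
        candidates = [
            (prefix + (tok,), score + logp)
            for prefix, score in beams
            for tok, logp in step_fn(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:  # keep only the best hypotheses
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len still compete
    return max(finished, key=lambda c: c[1])


greedy = greedy_search(toy_step, "<s>", "</s>")
best = beam_search(toy_step, "<s>", "</s>", beam_size=2)
print(greedy, best)
```

In this toy table greedy decoding commits to `a` because it is locally more likely, while beam search keeps `b` alive and finds the sequence with the higher total probability (0.36 versus 0.33).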

## Stage 6 - Final conclusions

1. **Vocabulary size:**
Increasing the BPE vocabulary size reduces the average sequence length, from ~17.72 tokens at 8192 to ~16.70 at 32768. However, going from 16k to 32k shortens sequences only marginally while the vocabulary grows substantially, so ~16k looks optimal.

2. **Model architecture:**
The bidirectional LSTM noticeably improves translation quality (BLEU 14.159) over the baseline model (BLEU 11.255) and the GRU-encoder model (BLEU 12.015). This is consistent with the idea that using context from both directions helps, although each configuration was trained only once.

3. **Training optimization:**
The training improvements produced the highest score, ~16 BLEU, in the GRU "SGD" run. The recorded run passed `model_gru_adam` to `train_model_improved`, so this figure reflects continued SGD training of the Adam-trained model, not SGD from scratch. In other runs the quality was extremely low (~3 BLEU), which may indicate a poor local minimum or unsuitable hyperparameters.

4. **Inference methods:**
The `sampling` method makes the translations more diverse, but by subjective judgment their quality is lower than with `greedy`. The `sampling` parameters chosen here may not have been the best ones.

Overall, the results from class can be improved with the steps described above, which is what was done here.
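For the ablation study, the recorded scores can be gathered in one place and compared against the baseline. The sketch below only reuses the BLEU values printed in the cells above (mean dev BLEU over the last 10 evaluations); the dictionary labels are an illustrative naming choice. Each value comes from a single run, so small differences should be read with caution.

```python
# Mean dev BLEU over the last 10 evaluations, as recorded in the cells above.
recorded_bleu = {
    "LSTM encoder, Adam (baseline)": 11.255,
    "GRU encoder, Adam": 12.015,
    "BiLSTM encoder, Adam": 14.159,
    "LSTM encoder, Adam + scheduler + early stopping": 11.442,
    "LSTM encoder, SGD + scheduler + early stopping": 2.159,
    "GRU encoder, Adam + scheduler + early stopping": 13.390,
    # The recorded run passed model_gru_adam, so this is SGD on top of Adam training.
    "GRU encoder, SGD run (continued from the Adam-trained model)": 15.727,
    "BiLSTM encoder, Adam + scheduler + early stopping": 11.144,
    "BiLSTM encoder, SGD + scheduler + early stopping": 2.743,
}

baseline = recorded_bleu["LSTM encoder, Adam (baseline)"]
deltas = {name: bleu - baseline for name, bleu in recorded_bleu.items()}

# Print configurations from best to worst with their difference to the baseline.
for name, bleu in sorted(recorded_bleu.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{bleu:7.3f}  ({deltas[name]:+.3f})  {name}")

best_config = max(recorded_bleu, key=recorded_bleu.get)
```

The table also shows that the bidirectional model scored lower with the scheduler and early stopping (11.144) than in the plain training loop (14.159), which suggests the run-to-run variance is large enough that repeated runs would be needed for firm conclusions.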