<a href="https://colab.research.google.com/github/saahas-parise/german-to-english/blob/main/German_to_English_MachineTranslation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# German to English LSTM-based seq2seq Machine Translation

## Imports


In [None]:
%%capture
!pip install --upgrade sacrebleu sentencepiece gdown
# Standard library imports
import json
import math
import random

# Third party imports
import matplotlib.pyplot as plt
import numpy as np
import sacrebleu
import sentencepiece
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm.notebook

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


## Data

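This notebook uses the [Multi30K dataset](https://arxiv.org/abs/1605.00459) of German-English sentence pairs, which is downloaded below as separate training, validation, and test files.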

In [None]:
!gdown 1ll4fDiPLQ0u9osdtSlsUcehK_p_2dykV
!gdown 1OEBVpX9F2FX0Mqj17jOWJKI2efUN_HBR
!gdown 1zZF8EXtzcd3oXSGEfyKywkSMosX_T6Jo

Downloading...
From: https://drive.google.com/uc?id=1ll4fDiPLQ0u9osdtSlsUcehK_p_2dykV
To: /content/training_data.json
100% 4.28M/4.28M [00:00<00:00, 96.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1OEBVpX9F2FX0Mqj17jOWJKI2efUN_HBR
To: /content/validation_data.json
100% 152k/152k [00:00<00:00, 86.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zZF8EXtzcd3oXSGEfyKywkSMosX_T6Jo
To: /content/test_data.json
100% 80.2k/80.2k [00:00<00:00, 70.9MB/s]


In [None]:
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)
print("Number of training examples:", len(list(training_data)))
print("Number of validation examples:", len(list(validation_data)))
print("Number of test examples:", len(list(test_data)))
print()

for example in training_data[:10]:
  print(example[0])
  print(example[1])
  print()

Number of training examples: 29001
Number of validation examples: 1015
Number of test examples: 1000

Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.

Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.
Several men in hard hats are operating a giant pulley system.

Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
A little girl climbing into a wooden playhouse.

Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.
A man in a blue shirt is standing on a ladder cleaning a window.

Zwei Männer stehen am Herd und bereiten Essen zu.
Two men are at the stove preparing food.

Ein Mann in grün hält eine Gitarre, während der andere Mann sein Hemd ansieht.
A man in green holds a guitar while the other man observes his shirt.

Ein Mann lächelt einen ausgestopften Löwen an.
A man is smiling at a stuffed lion

Ein schickes Mädchen spricht mit dem Handy während sie langsam die Straße entla

## Vocabulary

We use `sentencepiece` to build a joint German-English subword vocabulary from the training corpus.

Download the training corpus:

In [None]:
!gdown 1bO7SVCjvVzp__ibwED8wbMRSiJQNNP52
!gdown 1A2w-F6kmUXNtuFtG2qdpfw0dx2Mk7phR

Downloading...
From: https://drive.google.com/uc?id=1bO7SVCjvVzp__ibwED8wbMRSiJQNNP52
To: /content/train.en
100% 1.80M/1.80M [00:00<00:00, 165MB/s]
Downloading...
From: https://drive.google.com/uc?id=1A2w-F6kmUXNtuFtG2qdpfw0dx2Mk7phR
To: /content/train.de
100% 2.11M/2.11M [00:00<00:00, 151MB/s]


In [None]:
args = {
    "pad_id": 0,
    "bos_id": 1,
    "eos_id": 2,
    "unk_id": 3,
    "input": "train.de,train.en",
    "vocab_size": 8000,
    "model_prefix": "Multi30k",
    # "model_type": "word",
}
combined_args = " ".join(
    "--{}={}".format(key, value) for key, value in args.items())
sentencepiece.SentencePieceTrainer.Train(combined_args)

In [None]:
!head -n 30 Multi30k.vocab

<pad>	0
<s>	0
</s>	0
<unk>	0
.	-2.72718
▁a	-3.21357
▁in	-3.43973
m	-3.78503
▁eine	-3.82141
▁A	-3.86856
s	-4.06457
▁Ein	-4.11399
,	-4.20405
▁the	-4.35217
▁und	-4.5704
▁mit	-4.57911
▁auf	-4.58144
▁on	-4.65674
n	-4.67038
▁Mann	-4.70521
▁is	-4.73988
▁man	-4.75331
▁and	-4.76404
▁	-4.76512
ing	-4.8072
▁of	-4.83344
▁einer	-4.86421
▁with	-4.93426
▁Eine	-4.98902
▁ein	-5.126
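The second column of `Multi30k.vocab` is a log-probability for each piece. As far as I know, SentencePiece trains a unigram language model by default and encodes text by picking the segmentation with the highest total score. The following is a minimal sketch of that idea in plain Python, not the library's actual implementation. The scores for `▁a`, `▁in`, `m`, `s`, `n`, `▁man`, and `▁` are copied from the listing above; the single-character pieces `a` and `i` and their scores are hypothetical, added only so that competing segmentations exist.

```python
import math

# Log-probabilities copied from the Multi30k.vocab listing above, plus two
# hypothetical single-character pieces ("a", "i") added only for illustration.
PIECES = {
    "▁a": -3.21357, "▁in": -3.43973, "m": -3.78503, "s": -4.06457,
    "n": -4.67038, "▁man": -4.75331, "▁": -4.76512,
    "a": -6.0, "i": -6.0,  # hypothetical scores, not from the real vocabulary
}

def viterbi_segment(text, pieces):
    """Return the highest-scoring split of `text` into known pieces."""
    # SentencePiece marks word boundaries with "▁" instead of spaces.
    s = "▁" + text.replace(" ", "▁")
    n = len(s)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any split of s[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in pieces and best[start] + pieces[piece] > best[end]:
                best[end] = best[start] + pieces[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the string to recover the split.
    out, end = [], n
    while end > 0:
        out.append(s[back[end]:end])
        end = back[end]
    return out[::-1], best[n]

segments, score = viterbi_segment("a man in", PIECES)
print(segments, score)
```

Whole-word pieces win here because one frequent piece scores higher than the sum of several rarer fragments, which is why common words in the real vocabulary stay intact while rare words such as "pulley" are split.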


In [None]:
vocab = sentencepiece.SentencePieceProcessor()
vocab.Load("Multi30k.model")

True

In [None]:
print("Vocabulary size:", vocab.GetPieceSize())
print()

for example in training_data[:3]:
  sentence = example[1]
  pieces = vocab.EncodeAsPieces(sentence)
  indices = vocab.EncodeAsIds(sentence)
  print(sentence)
  print(pieces)
  print(vocab.DecodePieces(pieces))
  print(indices)
  print(vocab.DecodeIds(indices))
  print()

piece = vocab.EncodeAsPieces("the")[0]
index = vocab.PieceToId(piece)
print(piece)
print(index)
print(vocab.IdToPiece(index))

Vocabulary size: 8000

Two young, White males are outside near many bushes.
['▁Two', '▁young', ',', '▁White', '▁males', '▁are', '▁outside', '▁near', '▁many', '▁bushes', '.']
Two young, White males are outside near many bushes.
[42, 54, 12, 2889, 2225, 36, 127, 173, 815, 3513, 4]
Two young, White males are outside near many bushes.

Several men in hard hats are operating a giant pulley system.
['▁Se', 'veral', '▁men', '▁in', '▁hard', '▁hats', '▁are', '▁operating', '▁a', '▁g', 'iant', '▁pull', 'e', 'y', '▁s', 'y', 'ste', 'm', '.']
Several men in hard hats are operating a giant pulley system.
[298, 240, 73, 6, 712, 730, 36, 3106, 5, 631, 1679, 583, 32, 96, 552, 96, 1076, 7, 4]
Several men in hard hats are operating a giant pulley system.

A little girl climbing into a wooden playhouse.
['▁A', '▁little', '▁girl', '▁climbing', '▁in', 'to', '▁a', '▁wooden', '▁play', 'house', '.']
A little girl climbing into a wooden playhouse.
[9, 132, 66, 500, 6, 112, 5, 542, 245, 4599, 4]
A little girl cli

In [None]:
pad_id = vocab.PieceToId("<pad>")
bos_id = vocab.PieceToId("<s>")
eos_id = vocab.PieceToId("</s>")
print(f"<pad>: {pad_id}, <s>: {bos_id}, </s>: {eos_id}")

<pad>: 0, <s>: 1, </s>: 2


In [None]:
sentence = training_data[0][1]
indices = vocab.EncodeAsIds(sentence)
indices_augmented = [bos_id] + indices + [eos_id, pad_id, pad_id, pad_id]
print(vocab.DecodeIds(indices))
print(vocab.DecodeIds(indices_augmented))
print(vocab.DecodeIds(indices) == vocab.DecodeIds(indices_augmented))

Two young, White males are outside near many bushes.
Two young, White males are outside near many bushes.
True


In [None]:
# Please do not change the code below
def generate_predictions_file_for_submission(filepath, model, dataset, method, batch_size=64):
    assert method in {"greedy", "beam"}
    source_sentences = [example[0] for example in dataset]
    model.eval()
    predictions = []
    with torch.no_grad():
      for start_index in range(0, len(source_sentences), batch_size):
        if method == "greedy":
          prediction_batch = predict_greedy(
              model, source_sentences[start_index:start_index + batch_size])
          prediction_batch = [[x] for x in prediction_batch]
        else:
          prediction_batch = predict_beam(
              model, source_sentences[start_index:start_index + batch_size])
        predictions.extend(prediction_batch)
    with open(filepath, "w") as outfile:
        json.dump(predictions, outfile, indent=2)
    print("Finished writing predictions to {}.".format(filepath))


## Seq2Seq Machine Translation Model

The `encode` method below encodes input sequences with a bidirectional LSTM, i.e. two LSTMs that read the sequence in opposite directions (each with `num_layers` stacked layers). The representation of each position is the concatenation of the forward and backward hidden states at that position.
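To make the "one state per direction, concatenated per position" idea concrete, here is a toy sketch in plain Python. It uses a scalar tanh recurrence rather than an LSTM, and it reuses the same weights for both directions, whereas a real bidirectional LSTM learns separate weights per direction. The function names and weights are illustrative assumptions.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Toy scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_states(inputs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs)
    # Run the recurrence right-to-left, then flip the result so that
    # backward[t] lines up with position t.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

states = bidirectional_states([1.0, -2.0, 0.5])
for t, (f, b) in enumerate(states):
    print(t, round(f, 3), round(b, 3))
```

The forward state at position `t` has seen inputs `0..t`, and the backward state has seen inputs `t..end`, so the pair summarizes the whole sequence from the point of view of position `t`.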



In [None]:
def make_batch(sentences):
  """Convert a list of sentences into a batch of subword indices.

  Args:
    sentences: A list of sentences, each of which is a string.

  Returns:
    A LongTensor of size (max_sequence_length, batch_size) containing the
    subword indices for the sentences, where max_sequence_length is the length
    of the longest sentence as encoded by the subword vocabulary and batch_size
    is the number of sentences in the batch. A beginning-of-sentence token
    should be included before each sequence, and an end-of-sentence token should
    be included after each sequence. Empty slots at the end of shorter sequences
    should be filled with padding tokens. The tensor should be located on the
    device defined at the beginning of the notebook.
  """

  batch_indices = []
  for sentence in sentences:
    indices = vocab.EncodeAsIds(sentence)
    indices_augmented = [bos_id] + indices + [eos_id]
    indices_augmented = torch.LongTensor(indices_augmented)
    batch_indices.append(indices_augmented)

  batched_seq = torch.nn.utils.rnn.pad_sequence(batch_indices, padding_value=pad_id).to(device)
  return batched_seq

def make_batch_iterator(dataset, batch_size, shuffle=False):
  """Make a batch iterator that yields source-target pairs.

  Args:
    dataset: A list of (source, target) sentence pairs, such as training_data.
    batch_size: An integer batch size.
    shuffle: A boolean indicating whether to shuffle the examples.

  Yields:
    Pairs of tensors constructed by calling the make_batch function on the
    source and target sentences in the current group of examples. The max
    sequence length can differ between the source and target tensor, but the
    batch size will be the same. The final batch may be smaller than the given
    batch size.
  """

  examples = list(dataset)
  if shuffle:
    random.shuffle(examples)

  for start_index in range(0, len(examples), batch_size):
    example_batch = examples[start_index:start_index + batch_size]
    source_sentences = [example[0] for example in example_batch]
    target_sentences = [example[1] for example in example_batch]
    yield make_batch(source_sentences), make_batch(target_sentences)

test_batch = make_batch(["a test input", "a longer input than the first"])
print("Example batch tensor:")
print(test_batch)
assert test_batch.shape[1] == 2
assert test_batch[0, 0] == bos_id
assert test_batch[0, 1] == bos_id
assert test_batch[-1, 0] == pad_id
assert test_batch[-1, 1] == eos_id

Example batch tensor:
tensor([[   1,    1],
        [   5,    5],
        [3966,  354],
        [   6,   60],
        [ 236,    6],
        [ 698,  236],
        [   2,  698],
        [   0, 5285],
        [   0,   13],
        [   0, 3759],
        [   0,    2]], device='cuda:0')
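The layout of this tensor can be reproduced without PyTorch. The sketch below is a plain-Python illustration of what `make_batch` does after encoding (wrap, pad, transpose); the function name `pad_time_major` is hypothetical, and the subword ids are copied from the tensor printed above.

```python
def pad_time_major(sequences, bos_id=1, eos_id=2, pad_id=0):
    """Wrap each sequence in <s>/</s>, pad to equal length, and transpose."""
    wrapped = [[bos_id] + list(seq) + [eos_id] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]
    # Transpose so that rows are time steps and columns are sentences.
    return [list(step) for step in zip(*padded)]

# Subword ids taken from the example tensor printed above.
short = [5, 3966, 6, 236, 698]
long = [5, 354, 60, 6, 236, 698, 5285, 13, 3759]
batch = pad_time_major([short, long])
for row in batch:
    print(row)
```

Each row is one time step, which matches the `(max_sequence_length, batch_size)` shape that `nn.LSTM` expects when `batch_first` is left at its default.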


In [None]:
class Seq2seqBaseline(nn.Module):
  def __init__(self, hidden_dim, word_vector_dim, dropout, num_layers):
    """
    Args:
      hidden_dim: hidden state size of the LSTMs
      word_vector_dim: dimension of the word embedding vectors
      dropout: dropout probability applied to LSTM outputs
      num_layers: number of stacked LSTM layers in the encoder and decoder
    """
    super().__init__()
    ### Encoder Params. Please do not change these functions at all.
    self.dropout = nn.Dropout(dropout)
    # Embedding table over the joint subword vocabulary (shared with the decoder)
    self.embedder = nn.Embedding(vocab.GetPieceSize(), word_vector_dim)
    self.lstm = nn.LSTM(word_vector_dim, hidden_dim, bidirectional=True, num_layers=num_layers)
    # Extra linear layers; not used by any method of this class.
    self.layer = nn.Linear(hidden_dim*num_layers*2, hidden_dim*num_layers)
    self.layer2 = nn.Linear(hidden_dim*num_layers*2, hidden_dim*num_layers)
    self.num_layers = num_layers
    ### Decoder Params.
    self.dropout2 = nn.Dropout(dropout)
    # The decoder depth must match the encoder so that the encoder's final
    # states can be used as the decoder's initial states.
    self.lstm2 = nn.LSTM(word_vector_dim, hidden_dim, num_layers=num_layers)
    self.output_layer2 = nn.Linear(hidden_dim, vocab.GetPieceSize())
    self.log_softmax_layer = nn.LogSoftmax(dim=2)

  def encode(self, source):
    """Encode the source batch using a bidirectional LSTM encoder.

    Args:
      source: An integer tensor with shape (max_source_sequence_length,
        batch_size) containing subword indices for the source sentences.

    Returns:
      A tuple with three elements:
        encoder_output: The output of the bidirectional LSTM with shape
          (max_source_sequence_length, batch_size, 2 * hidden_size).
        encoder_mask: A boolean tensor with shape (max_source_sequence_length,
          batch_size) indicating which encoder outputs correspond to padding
          tokens. Its elements should be True at positions corresponding to
          padding tokens and False elsewhere.
        encoder_hidden: The final hidden states of the bidirectional LSTM (after
          a suitable projection) that will be used to initialize the decoder.
          This should be a pair of tensors (h_n, c_n), each with shape
          (num_layers, batch_size, hidden_size). Note that the hidden state
          returned by the LSTM cannot be used directly. Its initial dimension is
          twice the required size because it contains state from two directions.

    The first two return values are not required for the baseline model and will
    only be used later in the attention model. If desired, they can be replaced
    with None for the initial implementation.
    """

    # Using packed sequences to more easily work
    # with the variable-length sequences represented by the source tensor.
    # See https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.PackedSequence.


    # Compute a tensor containing the length of each source sequence.
    lengths = torch.sum(source != pad_id, axis=0).cpu()

    seq_len, batch_size = source.size()

    # embedded_sentence: seq_len x batch_size x word_vector_dim
    embedded_sentence = self.embedder(source)

    # pack it for rnn input
    embedded_sentence = torch.nn.utils.rnn.pack_padded_sequence(embedded_sentence,lengths,enforce_sorted=False)

    # lstm_out: seq_len x batch_size x 2 * hidden_dim
    # h_n, c_n: num_lay*2 x batch_size x hidden_dim
    lstm_out, (h_n, c_n) = self.lstm(embedded_sentence)

    # Sum the final states of the forward and reverse directions.
    # h_n and c_n are laid out as (layer0_fwd, layer0_bwd, layer1_fwd, ...),
    # so the layer axis comes first when splitting the leading dimension.
    h_n = h_n.view(self.num_layers, 2, batch_size, h_n.shape[-1]).sum(1)
    c_n = c_n.view(self.num_layers, 2, batch_size, c_n.shape[-1]).sum(1)

    encoder_mask = source == pad_id

    lstm_out, lens_unpacked = torch.nn.utils.rnn.pad_packed_sequence(lstm_out)
    lstm_out = self.dropout(lstm_out)

    return lstm_out, encoder_mask,(h_n, c_n)



  def decode(self, decoder_input, initial_hidden, encoder_output, encoder_mask):
    """Run the decoder LSTM starting from an initial hidden state.

    The third and fourth arguments are not used in the baseline model, but are
    included for compatibility with the attention model in the next section.

    Args:
      decoder_input: An integer tensor with shape (max_decoder_sequence_length,
        batch_size) containing the subword indices for the decoder input. During
        evaluation, where decoding proceeds one step at a time, the initial
        dimension should be 1.
      initial_hidden: A pair of tensors (h_0, c_0) representing the initial
        state of the decoder, each with shape (num_layers, batch_size,
        hidden_size).
      encoder_output: The output of the encoder with shape
        (max_source_sequence_length, batch_size, 2 * hidden_size).
      encoder_mask: The output mask from the encoder with shape
        (max_source_sequence_length, batch_size). Encoder outputs at positions
        with a True value correspond to padding tokens and should be ignored.

    Returns:
      A tuple with three elements:
        log_probs: A tensor with shape (max_decoder_sequence_length, batch_size,
          vocab_size) containing scores for the next-word
          predictions at each position.
        decoder_hidden: A pair of tensors (h_n, c_n) with the same shape as
          initial_hidden representing the updated decoder state after processing
          the decoder input.
        attention_weights: This will be implemented later in the attention
          model, but in order to maintain compatible type signatures, we also
          include it here. This can be None or any other placeholder value.
    """

    # These arguments are not used in the baseline model.
    del encoder_output
    del encoder_mask


    #lengths = torch.sum(decoder_input != pad_id, axis=0).cpu()
    attention_weights = None

    embedded_decoder_input = self.embedder(decoder_input)

    lstm_vals, decoder_hidden = self.lstm2(embedded_decoder_input, initial_hidden)

    log_probs = self.log_softmax_layer(self.output_layer2(lstm_vals))


    return (log_probs, decoder_hidden, attention_weights)





  def compute_loss(self, source, target):
    """Run the model on the source and compute the loss on the target.

    Args:
      source: An integer tensor with shape (max_source_sequence_length,
        batch_size) containing subword indices for the source sentences.
      target: An integer tensor with shape (max_target_sequence_length,
        batch_size) containing subword indices for the target sentences.

    Returns:
      A scalar float tensor representing cross-entropy loss on the current batch.
    """

    # Note that for a target sequence like <s> A B C </s>, you would
    # want to run the decoder on the prefix <s> A B C and have it predict the
    # suffix A B C </s>.

    _, batch_size = source.size()
    enc_output, encoder_mask, curr_state = self.encode(source)

    lengths = torch.sum(target != pad_id, axis=0).cpu()-1
    target_prefix = torch.clone(target).cpu()
    target_prefix[lengths,torch.arange(target_prefix.size(1))] = pad_id
    decoder_input = target_prefix[:-1,:].to(device)

    log_probs, _, _ = self.decode(decoder_input, curr_state, enc_output, encoder_mask)

    criterion = nn.NLLLoss(ignore_index=pad_id)
    target = target[1:,:]
    loss = criterion(log_probs.view(-1, vocab.GetPieceSize()), target.view(-1))

    return loss



In [None]:
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """
  optimizer = torch.optim.Adam(model.parameters())
  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy

In [None]:
# These hyperparameters were tuned to work reasonably well, so ideally you don't have to change them.
# But you are welcome to adjust them.

num_epochs = 10
batch_size = 16
hidden_dim = 256
word_vector_dim = 256
num_layers = 2
dropout = 0.3

baseline_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
train(baseline_model, num_epochs, batch_size, "baseline_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 45.27, saving model checkpoint to baseline_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 49.89, saving model checkpoint to baseline_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 52.47, saving model checkpoint to baseline_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 54.11, saving model checkpoint to baseline_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 54.85, saving model checkpoint to baseline_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.30, saving model checkpoint to baseline_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.39, saving model checkpoint to baseline_model.pt...


epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from baseline_model.pt...
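The validation accuracy reported above comes from `evaluate_next_token`, which feeds the gold prefix to the decoder (teacher forcing) and scores the prediction at every non-padding position. The sketch below mirrors that bookkeeping in plain Python on a single toy sequence. The token ids and the per-step probabilities are hypothetical; only the shift-by-one and the padding handling are taken from the notebook.

```python
import math

PAD = 0

def shift_for_teacher_forcing(target):
    """Split <s> A B </s> into decoder input <s> A B and labels A B </s>."""
    return target[:-1], target[1:]

def perplexity_and_accuracy(log_probs, labels, pad_id=PAD):
    """log_probs[t] maps token id -> log-probability at decoder step t."""
    total_nll, total, correct = 0.0, 0, 0
    for dist, gold in zip(log_probs, labels):
        if gold == pad_id:  # padding never counts toward the metrics
            continue
        total_nll -= dist[gold]
        total += 1
        correct += max(dist, key=dist.get) == gold
    return math.exp(total_nll / total), 100 * correct / total

target = [1, 7, 8, 2, 0]  # <s> A B </s> <pad>
decoder_input, labels = shift_for_teacher_forcing(target)
# Hypothetical model outputs for the four decoder steps.
log_probs = [
    {7: math.log(0.6), 8: math.log(0.4)},
    {7: math.log(0.25), 8: math.log(0.75)},
    {2: math.log(0.4), 8: math.log(0.6)},
    {0: math.log(1.0)},
]
perplexity, accuracy = perplexity_and_accuracy(log_probs, labels)
print(decoder_input, labels)
print(perplexity, accuracy)
```

Because the decoder always sees the correct prefix here, token accuracy is an optimistic measure: at inference time an early mistake changes every later input, which is one reason the BLEU scores of the generated translations are much lower than the token accuracies.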


In [None]:
def predict_greedy(model, sentences, max_length=100):
  """Make predictions for the given inputs using greedy inference.

  Args:
    model: A sequence-to-sequence model.
    sentences: A list of input sentences, represented as strings.
    max_length: The maximum length at which to truncate outputs in order to
      avoid non-terminating inference.

  Returns:
    A list of predicted translations, represented as strings.
  """

  model.eval()
  batch_size = len(sentences)
  indices = make_batch(sentences)
  pred_translations = torch.zeros(max_length,batch_size, dtype=torch.long) # max_seq_length x batch_size

  enc_output, encoder_mask, curr_state = model.encode(indices)
  input = torch.LongTensor([bos_id] * batch_size).view(1, -1).to(device)
  # Track which sequences have already produced </s>, since some sentences
  # in a batch finish earlier than others.
  finished_mask = torch.zeros(batch_size, dtype=torch.bool).to(device)
  for i in range(0, max_length):
      log_probs, hidden, _ = model.decode(input, curr_state, enc_output, encoder_mask)

      # Prevent finished sequences from producing non-padding tokens.
      log_probs[:, finished_mask, pad_id] = 1e9

      # Get the most likely next token and its index.
      _, next_tokens = log_probs.squeeze(0).max(dim=1)
      pred_translations[i] = next_tokens

      # Update the input for the next decoding step.
      input = next_tokens.unsqueeze(0)

      # Update the state and finished masks.
      curr_state = hidden

      finished_mask = finished_mask | next_tokens.eq(eos_id)
      if finished_mask.all():
          break

  pred_translation_str = []
  for i in range(batch_size):
      string = vocab.DecodeIds(pred_translations[:,i].detach().cpu().numpy().astype(int).tolist())
      pred_translation_str.append(string)
  return pred_translation_str


def evaluate(model, dataset, batch_size=64, method="greedy"):
  assert method in {"greedy", "beam"}
  source_sentences = [example[0] for example in dataset]
  target_sentences = [example[1] for example in dataset]
  model.eval()
  predictions = []
  with torch.no_grad():
    for start_index in range(0, len(source_sentences), batch_size):
      if method == "greedy":
        prediction_batch = predict_greedy(
            model, source_sentences[start_index:start_index + batch_size])
      else:
        prediction_batch = predict_beam(
            model, source_sentences[start_index:start_index + batch_size])
        prediction_batch = [candidates[0] for candidates in prediction_batch]
      predictions.extend(prediction_batch)
  return sacrebleu.corpus_bleu(predictions, [target_sentences]).score

print("Baseline model validation BLEU using greedy search:",
      evaluate(baseline_model, validation_data))

### Generate the predictions for the baseline model using greedy decoding on the test_data.
generate_predictions_file_for_submission("seq2seq_predictions_baseline.json", baseline_model, test_data, "greedy")

Baseline model validation BLEU using greedy search: 20.79330043469394
Finished writing predictions to seq2seq_predictions_baseline.json.
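The BLEU score is built from clipped n-gram precisions. The sketch below computes only that ingredient, with whitespace tokenization, for one sentence pair taken from the baseline's sample predictions shown later in this notebook. It is an illustration under those simplifying assumptions and is not how `sacrebleu.corpus_bleu` is implemented: the real metric also applies its own tokenizer, combines several n-gram orders, and includes a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference,
    counting each reference n-gram at most as often as it appears there."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / max(sum(hyp.values()), 1)

hypothesis = "A man is sleeping in a green room with a blue blanket."
reference = "A man sleeping in a green room on a couch."
p1 = clipped_precision(hypothesis, reference, 1)
p2 = clipped_precision(hypothesis, reference, 2)
print(p1, p2)
```

The unigram precision is noticeably higher than the bigram precision for this pair, which reflects a hypothesis that gets many individual words right but drifts away from the reference toward the end of the sentence.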


In [None]:
def show_predictions(model, num_examples=4, num_beam=5,include_beam=False):
  for example in validation_data[:num_examples]:
    print("Input:")
    print(" ", example[0])
    print("Target:")
    print(" ", example[1])
    print("Greedy prediction:")
    print(" ", predict_greedy(model, [example[0]])[0])
    if include_beam:
      print(f"Beam predictions (showing top {num_beam}):")
      for candidate in predict_beam(model, [example[0]])[0][:num_beam]:
        print(" ", candidate)
    print()

print("Baseline model sample predictions:")
print()
show_predictions(baseline_model)

Baseline model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men are cooking on a tree.

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man is sleeping in a green room with a blue blanket.

Input:
  Ein Junge mit Kopfhörern sitzt auf den Schultern einer Frau.
Target:
  A boy wearing headphones sits on a woman's shoulders.
Greedy prediction:
  A boy wearing a hat is sitting on a porch's shoulders.

Input:
  Zwei Männer bauen eine blaue Eisfischerhütte auf einem zugefrorenen See auf
Target:
  Two men setting up a blue ice fishing hut on an iced over lake
Greedy prediction:
  Two men are in a yellow boat on a dock near a pile of water.



## Attention-based Mechanism
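
At each decoder step, the `decode` method below scores every encoder output against the current decoder state, normalizes the scores with a softmax over source positions, and uses the resulting weights to form a context vector. The following plain-Python sketch shows that computation for a single decoder step. It makes two simplifying assumptions that differ from the notebook's code: it uses a plain dot product as the score (the learned matrix `Wa` is omitted), and it masks padding positions explicitly before the softmax. It also assumes at least one source position is not padding.

```python
import math

def attend(query, encoder_states, is_padding):
    """Dot-product attention for a single decoder step.

    query: decoder state (list of floats).
    encoder_states: one vector per source position.
    is_padding: True where the source position is a padding token.
    """
    scores = [
        -math.inf if pad else sum(q * k for q, k in zip(query, state))
        for state, pad in zip(encoder_states, is_padding)
    ]
    # Softmax over source positions; exp(-inf) is 0, so padding gets no weight.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

weights, context = attend(
    query=[1.0, 0.0],
    encoder_states=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    is_padding=[False, False, True],
)
print(weights, context)
```

The third source position would dominate if it were not masked, since its dot product with the query is the largest. In the model, the context vector is then concatenated with the decoder state and passed through `Wc` and a `tanh` before the output layer.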


In [None]:
class Seq2seqAttention(Seq2seqBaseline):
  # Note that this class inherits from Seq2seqBaseline, so all the parameters in Seq2seqBaseline are initialized when this class is
  # initialized.
  def __init__(self, hidden_dim, enc_output_size, word_vector_dim, dropout,num_layers):
    super().__init__(hidden_dim, word_vector_dim, dropout,num_layers)




    dEnc = enc_output_size  # size of each encoder output (2 * hidden_dim)
    dDec = hidden_dim       # size of each decoder LSTM output

    self.Wc = nn.Linear(dDec+dEnc, dDec+dEnc, bias=False)

    self.softmax = nn.Softmax(dim=-1)
    self.Wa = nn.Linear(dEnc, dDec, bias=False)
    self.outputlayer = nn.Linear(dDec+dEnc, vocab.GetPieceSize())
    self.tanh = nn.Tanh()





  def decode(self, decoder_input, initial_hidden, encoder_output, encoder_mask):
    """Run the decoder LSTM starting from an initial hidden state.

    Args:
      decoder_input: An integer tensor with shape (max_decoder_sequence_length,
        batch_size) containing the subword indices for the decoder input. During
        evaluation, where decoding proceeds one step at a time, the initial
        dimension should be 1.
      initial_hidden: A pair of tensors (h_0, c_0) representing the initial
        state of the decoder, each with shape (num_layers, batch_size,
        hidden_size).
      encoder_output: The output of the encoder with shape
        (max_source_sequence_length, batch_size, 2 * hidden_size).
      encoder_mask: The output mask from the encoder with shape
        (max_source_sequence_length, batch_size). Encoder outputs at positions
        with a True value correspond to padding tokens and should be ignored.

    Returns:
      A tuple with three elements:
        logits: A tensor with shape (max_decoder_sequence_length, batch_size,
          vocab_size) containing scores for the next-word
          predictions at each position.
        decoder_hidden: A pair of tensors (h_n, c_n) with the same shape as
          initial_hidden representing the updated decoder state after processing
          the decoder input.
        attention_weights: A tensor with shape (max_decoder_sequence_length,
          batch_size, max_source_sequence_length) representing the normalized
          attention weights. This should sum to 1 along the last dimension.
    """


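    # emDec: (dec_len, batch, word_vector_dim)
    emDec = self.embedder(decoder_input)

    # lstm_out: (dec_len, batch, hidden_dim)
    lstm_out, decoder_hidden = self.lstm2(emDec,initial_hidden)

    # Attention scores h_t^T Wa h_s for every pair of decoder position t
    # and source position s.
    waOut = self.Wa(encoder_output)            # (src_len, batch, dDec)
    lstm_out_trans = lstm_out.permute(1,0,2)   # (batch, dec_len, hidden_dim)
    waOut = waOut.permute(1, 2, 0)             # (batch, dDec, src_len)
    out = torch.bmm(lstm_out_trans, waOut)     # (batch, dec_len, src_len)
    h_t = lstm_out.permute(1,2,0)              # (batch, hidden_dim, dec_len)

    # Normalize over source positions. encoder_mask is not applied here:
    # padded positions have all-zero encoder outputs, so they add nothing to
    # the context vector but still receive some attention weight.
    out = self.softmax(out)
    cont_enc = encoder_output.permute(1,2,0)   # (batch, dEnc, src_len)
    attention_weights = out.permute(1,0,2)     # (dec_len, batch, src_len)

    # Context vectors: attention-weighted sums of the encoder outputs.
    cont_att = attention_weights.permute(1,2,0)  # (batch, src_len, dec_len)
    context = torch.bmm(cont_enc, cont_att)      # (batch, dEnc, dec_len)
    cont_h_t = torch.cat((context, h_t), dim=1).permute(0,2,1)

    # Attentional hidden state tanh(Wc [c_t; h_t]), then log-probabilities.
    h_tild = self.tanh(self.Wc(cont_h_t))
    logs = self.log_softmax_layer(self.outputlayer(self.dropout(h_tild)))
    logits = logs.permute(1,0,2)               # (dec_len, batch, vocab_size)

    return logits.contiguous(), decoder_hidden, attention_weights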



In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16
hidden_dim = 256
enc_output_size = hidden_dim * 2
word_vector_dim = 256
num_layers = 2
dropout = 0.3

attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train(attention_model, num_epochs, batch_size, "attention_model.pt")
print("Attention model validation BLEU using greedy search:",
      evaluate(attention_model, validation_data))
# Generate the predictions for the attention model using greedy decoding on the test_data.
# A correct implementation of the baseline model and attention model should earn full credit here.
generate_predictions_file_for_submission("seq2seq_predictions_attention.json", attention_model, test_data, "greedy")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 58.11, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.05, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 65.74, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.06, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.07, saving model checkpoint to attention_model.pt...


epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.14, saving model checkpoint to attention_model.pt...


epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...
Attention model validation BLEU using greedy search: 38.05422988176068
Finished writing predictions to seq2seq_predictions_attention.json.


## Beam Search

In [None]:
class Hyp:
  """
  A helper class representing a hypothesis (i.e., the prefix of a prediction) on the beam,
  using a linked list.
  """
  def __init__(self, token_id:int, parent, score:float):
    """
    args:
      token_id: the vocabulary id (an int) of the most recent token added to this hypothesis; None for the dummy root
      parent: the Hyp object representing the prefix to which we've added this token
      score: the score of this hypothesis
    """
    self.token_id = token_id
    self.parent = parent
    self.score = score


  def trace(self):
    """
    Traces backward through the linked-list to recover the whole hypothesis.
    returns:
      A list of token ids representing the entire hypothesis, in generation order.
    """
    pred = []
    temp = self
    while temp.token_id is not None:
      pred.append(temp.token_id)
      temp = temp.parent
    return pred[::-1]

def predict_beam(model, sentences, k=5, max_length=100):
    """Output the beam search result for the given sentences.

    Args:
      model: The model that will be used to generate the beams.
      sentences: A list of sentences (str) that the model will encode and do
        beam search over. To keep things simple, this will just be a list of
        length 1. So it will be a list of a single string.
      k: Beam size.
      max_length: Maximum timesteps you will generate. If it exceeds this
        timestep stop no matter what.

    Returns:
      A list of decoded generations (strings) sorted by their scores in descending order.
      The format should be [[beam decoded sentence1, beam decoded sentence2,...]]
      Note the extra list outside. We do this to keep the format compatible when
      batch_size > 1 in case you want to implement batched beam search (It is not required).
    """
    model.eval()
    all_dg = []
    V = vocab.GetPieceSize()
    indices = make_batch(sentences)
    enc_output, encoder_mask, curr_state = model.encode(indices) # seq_len x batch_size x 2d, seq_len x batch_size, (2 x batch_size x d, 2 x batch_size x d) batch_size would be 1 here.
    # always start with SOS_SYMBOL
    input = torch.LongTensor([bos_id]).view(1, 1).to(device)
    hyps = [Hyp(None, None, 0)] # holds current hypotheses; initialized with a dummy hypothesis.
    dg = [] # holds decoded generations (obtained from finished hypotheses)
    beam_logprobs = torch.zeros(1).to(device) # log probabilities of all hypotheses on the beam




    for t in range(max_length):
      # out has shape 1 x beam x V: next-token log-probabilities for every hypothesis on the beam.
      out, decoder_hidden, _ = model.decode(input, curr_state, enc_output, encoder_mask)
      out = out.squeeze(0)
      # Add each hypothesis's running score to its V continuations, then flatten to (beam * V).
      ns = (beam_logprobs + out).reshape(-1)

      ns_top, topk = torch.topk(ns, k)
      bi = topk // V  # index of the parent hypothesis on the beam
      nwi = topk % V  # id of the token appended to that hypothesis

      nhs = []  # hypotheses that stay on the beam
      niis = []  # their most recent tokens (the next decoder input)
      new_bi = []  # their parent beam indices
      new_beam_logprobs = []  # their cumulative log-probabilities

      for i in range(len(bi)):
        beam_index = bi[i].item()
        token_id = nwi[i].item()
        score = ns_top[i].item()

        if token_id == eos_id:
          # Finished hypothesis: take it off the beam and stop once k are complete.
          dg.append(Hyp(token_id, hyps[beam_index], score))
          if len(dg) == k:
            dg_sorted = sorted(dg, key=lambda x: x.score, reverse=True)
            return [[vocab.DecodeIds(x.trace()) for x in dg_sorted]]
          continue

        nhs.append(Hyp(token_id, hyps[beam_index], score))
        new_beam_logprobs.append(score)
        new_bi.append(beam_index)
        niis.append(token_id)

      # Carry the surviving scores forward so the next step accumulates them.
      beam_logprobs = torch.Tensor(new_beam_logprobs).unsqueeze(1).to(device)
      input = torch.LongTensor(niis).unsqueeze(0).to(device)
      new_bi = torch.LongTensor(new_bi).to(device)

      # Reorder the decoder state and encoder tensors to follow the surviving hypotheses.
      curr_state = tuple(torch.index_select(s, 1, new_bi) for s in decoder_hidden)
      enc_output = torch.index_select(enc_output, 1, new_bi)
      encoder_mask = torch.index_select(encoder_mask, 1, new_bi)

      hyps = nhs

    # max_length reached: fill the remaining slots with unfinished hypotheses.
    for hyp in hyps:
      if len(dg) >= k:
        break
      dg.append(hyp)

    dg_sorted = sorted(dg, key=lambda x: x.score, reverse=True)[:k]
    return [[vocab.DecodeIds(x.trace()) for x in dg_sorted]]


    ##### END YOUR CODE HERE!!!!!!!!

print("Baseline model validation BLEU using beam search:",
      evaluate(baseline_model, validation_data, batch_size=1, method="beam"))
print()
print("Baseline model sample predictions:")
print()
show_predictions(baseline_model, include_beam=True)

Baseline model validation BLEU using beam search: 21.71012593650597

Baseline model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men are cooking on a tree.
Beam predictions (showing top 5):
  A group of men are cooking on a tree.
  A group of men are picking up a tree.
  A group of men are picking a tree.
  A group of men are picking up a tree
  A group of men are cooking on a tree

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man is sleeping in a green room with a blue blanket.
Beam predictions (showing top 5):
  A man is sleeping on a green blanket in a room.
  A man sleeping in a green chair in a room.
  A man is sleeping in a green chair on a city street.
  A man is sleeping in a green room with a blue blanket.
  A man is sleeping on a green blanket in the mi
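
The core of each beam search step is a top-k over the flattened (beam x vocabulary) score table, after which integer division and modulo recover the parent hypothesis and the new token. The pure-Python sketch below illustrates that bookkeeping on a toy problem. The three-token vocabulary, the `LOGP` table, and the `toy_beam_search` helper are all assumptions made up for illustration; it does not use the notebook's models or `Hyp` class.

```python
import heapq
import math

# Toy next-token model (an assumption for illustration): the log-probabilities
# depend only on the previous token id. Token 0 plays the role of EOS.
V = 3
EOS = 0
LOGP = {
    None: [math.log(0.1), math.log(0.6), math.log(0.3)],  # start of sentence
    1: [math.log(0.5), math.log(0.1), math.log(0.4)],
    2: [math.log(0.7), math.log(0.2), math.log(0.1)],
}

def toy_beam_search(k=2, max_length=5):
    beam = [((), 0.0)]  # (token ids so far, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        # Flatten the (beam x V) score table into one list, as reshape(-1) does.
        flat = []
        for tokens, score in beam:
            prev = tokens[-1] if tokens else None
            flat.extend(score + lp for lp in LOGP[prev])
        top = heapq.nlargest(k, range(len(flat)), key=flat.__getitem__)
        new_beam = []
        for idx in top:
            beam_index, token_id = divmod(idx, V)  # idx // V and idx % V
            hyp = (beam[beam_index][0] + (token_id,), flat[idx])
            if token_id == EOS:
                finished.append(hyp)
                if len(finished) == k:
                    return sorted(finished, key=lambda h: h[1], reverse=True)
            else:
                new_beam.append(hyp)
        beam = new_beam
    # Length limit reached: fill the remaining slots with unfinished hypotheses.
    finished.extend(beam[:k - len(finished)])
    return sorted(finished, key=lambda h: h[1], reverse=True)

results = toy_beam_search(k=2)
for tokens, score in results:
    print(tokens, round(math.exp(score), 3))
```

Note that the cumulative score of each surviving hypothesis is carried into the next step; without that, every step would rank continuations by their local log-probability only.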

In [None]:
print("Attention model validation BLEU using beam search:",
      evaluate(attention_model, validation_data, batch_size=1, method="beam"))
print()
print("Attention model sample predictions:")
print()
show_predictions(attention_model, include_beam=True)

Attention model validation BLEU using beam search: 38.61895427870179

Attention model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men are putting on a sidewalk
Beam predictions (showing top 5):
  A group of men are putting up to a truck
  A group of men are putting on a sidewalk
  A group of men are putting cotton for a truck
  A group of men are putting on tree collection.
  A group of men are putting cotton on a truck

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man is sleeping in a green room on a couch.
Beam predictions (showing top 5):
  A man sleeping on a couch in a green room.
  Man sleeping in green room on a couch.
  A man is sleeping in a green room on a couch.
  Man sleeping in a green room on a couch.
  A man is sleeping on a couch in a green room


In [None]:
!gdown 1zKM1vgKkRye1COYh4IlH_m0xDCq7chFF

Downloading...
From: https://drive.google.com/uc?id=1zKM1vgKkRye1COYh4IlH_m0xDCq7chFF
To: /content/special_model_beam_search.pt
100% 7.89M/7.89M [00:00<00:00, 186MB/s]


In [None]:
# Load the special model on the CPU to compute the beams. Beam search results can differ
# slightly between CPU and GPU because of how PyTorch implements the underlying operations.
device = torch.device("cpu")
hidden_dim = 100
word_vector_dim = 100
num_layers = 1
dropout = 0.3

special_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
sd = torch.load("special_model_beam_search.pt", map_location=device)
special_model.load_state_dict(sd)

V = vocab.GetPieceSize()
nsrcs, srcsize = 11, 6
special_preds = {}
for beam_size in [1, 5, 10, 15]:
  torch.manual_seed(beam_size)
  srcs = [(vocab.DecodeIds(torch.LongTensor(srcsize).random_(0, V).numpy().tolist()),
           'filler target sentence filler target sentence filler target sentence') for _ in range(nsrcs)]
  predictions = []
  source_sentences = [x[0] for x in srcs]
  for start_index in range(0, len(source_sentences), 1):
        prediction_batch = predict_beam(
            special_model, source_sentences[start_index:start_index + 1], k=beam_size,max_length=50)
        predictions.extend(prediction_batch)
  special_preds[beam_size] = predictions

with open("beam_seqs.json", "w") as f:
    json.dump(special_preds, f)

In [None]:
generate_predictions_file_for_submission("seq2seq_predictions_attention.json", attention_model, test_data, "beam", batch_size=1)

Finished writing predictions to seq2seq_predictions_attention.json.


# Experimentation

## Baseline, without weight decay, without back translation

In [None]:
# Reload the original training, validation, and test data (no back translation)
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

In [None]:
# WITHOUT weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  #optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy

In [None]:
# These parameters have been tuned to work fairly well, so ideally you don't have to change them,
# but you are welcome to adjust them.

num_epochs = 10
batch_size = 16
hidden_dim = 256
word_vector_dim = 256
num_layers = 2
dropout = 0.3

baseline_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
train(baseline_model, num_epochs, batch_size, "baseline_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 45.30, saving model checkpoint to baseline_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 49.77, saving model checkpoint to baseline_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 52.24, saving model checkpoint to baseline_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 53.76, saving model checkpoint to baseline_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 54.72, saving model checkpoint to baseline_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.12, saving model checkpoint to baseline_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.35, saving model checkpoint to baseline_model.pt...


epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from baseline_model.pt...
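
The "validation accuracy" logged above is a token-level metric, and the perplexity reported alongside it is the exponential of the average per-token negative log-likelihood, with padding positions excluded from both. The sketch below reproduces that arithmetic in plain Python on a three-position toy example. The `token_metrics` helper and the `PAD = 0` id are hypothetical names used only here, not part of the notebook.

```python
import math

PAD = 0  # hypothetical padding id, used only in this toy example

def token_metrics(log_probs, targets, pad_id=PAD):
    """log_probs[i] holds the log-probabilities over the vocabulary at
    position i; targets[i] is the gold token id at that position."""
    total_nll, total, correct = 0.0, 0, 0
    for dist, gold in zip(log_probs, targets):
        if gold == pad_id:
            continue  # padding positions are excluded from both metrics
        total_nll -= dist[gold]
        total += 1
        predicted = max(range(len(dist)), key=dist.__getitem__)
        correct += int(predicted == gold)
    perplexity = math.exp(total_nll / total)
    accuracy = 100 * correct / total
    return perplexity, accuracy

uniform = [math.log(0.25)] * 4
confident = [math.log(p) for p in (0.1, 0.7, 0.1, 0.1)]
ppl, acc = token_metrics([uniform, confident, uniform], [2, 1, PAD])
print(ppl, acc)
```

Because the tokens are subword pieces, neither number is directly comparable to a word-level figure, and a higher token accuracy does not guarantee a higher BLEU score.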


In [None]:
print("Baseline model validation BLEU using beam search:",
      evaluate(baseline_model, validation_data, batch_size=1, method="beam"))

Baseline model validation BLEU using beam search: 21.214919074473436


## Baseline, WITH weight decay, without back translation


In [None]:
# Reload the original training, validation, and test data (no back translation)
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

In [None]:
# WITH weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  #optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy

In [None]:
# These parameters have been tuned to work fairly well, so ideally you don't have to change them,
# but you are welcome to adjust them.

num_epochs = 10
batch_size = 16
hidden_dim = 256
word_vector_dim = 256
num_layers = 2
dropout = 0.3

baseline_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
train(baseline_model, num_epochs, batch_size, "baseline_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 44.72, saving model checkpoint to baseline_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 49.45, saving model checkpoint to baseline_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 52.01, saving model checkpoint to baseline_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 54.19, saving model checkpoint to baseline_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.25, saving model checkpoint to baseline_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 56.20, saving model checkpoint to baseline_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 56.90, saving model checkpoint to baseline_model.pt...


epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 57.21, saving model checkpoint to baseline_model.pt...


epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 57.30, saving model checkpoint to baseline_model.pt...


epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 57.41, saving model checkpoint to baseline_model.pt...
Reloading best model checkpoint from baseline_model.pt...


In [None]:
print("Baseline model validation BLEU using beam search:",
      evaluate(baseline_model, validation_data, batch_size=1, method="beam"))

Baseline model validation BLEU using beam search: 24.488207774208043
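
The only change between this run and the previous one is `weight_decay=1e-5` in the optimizer. In `torch.optim.Adam`, weight decay acts as an L2 penalty: `weight_decay * parameter` is added to the gradient before the moment estimates are updated (unlike the decoupled decay of `AdamW`). The scalar sketch below shows that single step. The `adam_step` function is a hand-written illustration with Adam's usual default hyperparameters, not PyTorch's implementation.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a single scalar parameter."""
    # L2-style weight decay: the penalty's gradient is added before the
    # moment estimates are updated.
    grad = grad + weight_decay * theta
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad * grad
    m_hat = m / (1 - betas[0] ** t)  # bias-corrected first moment
    v_hat = v / (1 - betas[1] ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# With a zero loss gradient, only the decayed parameter moves toward zero.
plain, _, _ = adam_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
decayed, _, _ = adam_step(1.0, grad=0.0, m=0.0, v=0.0, t=1, weight_decay=1e-5)
print(plain, decayed)
```

Each configuration was trained once here, so part of the BLEU gap between the two runs (21.21 versus 24.49) may be run-to-run variance rather than an effect of the regularizer.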


## Baseline, without weight decay, WITH back translation

In [None]:
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

training_data_switched = [[english, german] for german, english in training_data]
validation_data_switched = [[english, german] for german, english in validation_data]

In [None]:
# Variant of train() that uses the switched (English-to-German) data
def train_switched(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """
  optimizer = torch.optim.Adam(model.parameters())
  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data_switched, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data_switched) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data_switched)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def show_predictions_switched(model, num_examples=4, num_beam=5,include_beam=False):
  for example in validation_data_switched[:num_examples]:
    print("Input:")
    print(" ", example[0])
    print("Target:")
    print(" ", example[1])
    print("Greedy prediction:")
    print(" ", predict_greedy(model, [example[0]])[0])
    if include_beam:
      print(f"Beam predictions (showing top {num_beam}):")
      for candidate in predict_beam(model, [example[0]])[0][:num_beam]:
        print(" ", candidate)
    print()

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16

# hidden_dim and word_vector_dim must match, because d_dec is set to word_vector_dim.
hidden_dim = 256

enc_output_size = hidden_dim * 2

word_vector_dim = 256

num_layers = 2
dropout = 0.3

switched_attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train_switched(switched_attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.25, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 61.11, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 63.01, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.35, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.92, saving model checkpoint to attention_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
model = switched_attention_model

new_german_to_english_sentences = []

# Back-translate the first 3000 training examples.
for i, example in enumerate(training_data_switched[:3000]):
  if i % 100 == 0:
    print("Iteration " + str(i))

  english = example[0]
  # The top beam-search candidate of the English-to-German model becomes the synthetic German source.
  german = predict_beam(model, [english])[0][0]
  new_german_to_english_sentences.append([german, english])

Iteration 0
Iteration 100
Iteration 200
Iteration 300
Iteration 400
Iteration 500
Iteration 600
Iteration 700
Iteration 800
Iteration 900
Iteration 1000
Iteration 1100
Iteration 1200
Iteration 1300
Iteration 1400
Iteration 1500
Iteration 1600
Iteration 1700
Iteration 1800
Iteration 1900
Iteration 2000
Iteration 2100
Iteration 2200
Iteration 2300
Iteration 2400
Iteration 2500
Iteration 2600
Iteration 2700
Iteration 2800
Iteration 2900


In [None]:
training_data = training_data + new_german_to_english_sentences

In [None]:
len(training_data)

32001
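
The augmentation above pairs a synthetic German sentence, produced by the English-to-German model, with the genuine English sentence it was generated from, and appends those pairs to the original training set. A minimal sketch of the same data flow, with the trained model replaced by a lookup table: `back_translate`, `toy_reverse`, and the two-sentence corpus are hypothetical stand-ins for illustration only.

```python
def back_translate(parallel_pairs, reverse_translate, limit=None):
    """Build synthetic [source, target] pairs from the target side of a corpus.

    parallel_pairs: list of [german, english] pairs.
    reverse_translate: hypothetical callable mapping an English sentence to a
      German one (in the notebook this role is played by the top beam
      candidate of the English-to-German model).
    limit: number of pairs to back-translate (None means all of them).
    """
    synthetic = []
    for _, english in parallel_pairs[:limit]:
        synthetic_german = reverse_translate(english)
        # Synthetic source, genuine target.
        synthetic.append([synthetic_german, english])
    return synthetic

# Toy stand-in for a trained reverse model: a lookup table.
toy_reverse = {
    "A dog runs.": "Ein Hund rennt.",
    "A man sleeps.": "Ein Mann schläft.",
}.get
corpus = [["Ein Hund läuft.", "A dog runs."], ["Ein Mann schläft.", "A man sleeps."]]
augmented = corpus + back_translate(corpus, toy_reverse, limit=1)
print(len(augmented), augmented[-1])
```

The target side of every added pair is human-written English, so the forward model still learns to produce clean output even when the synthetic German input is noisy.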

In [None]:
# WITHOUT weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  #optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy

In [None]:
baseline_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
train(baseline_model, num_epochs, batch_size, "baseline_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 45.47, saving model checkpoint to baseline_model.pt...


epoch 2:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 50.33, saving model checkpoint to baseline_model.pt...


epoch 3:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 53.05, saving model checkpoint to baseline_model.pt...


epoch 4:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 54.70, saving model checkpoint to baseline_model.pt...


epoch 5:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.70, saving model checkpoint to baseline_model.pt...


epoch 6:   0%|          | 0/2001 [00:00<?, ?batch/s]

epoch 7:   0%|          | 0/2001 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/2001 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/2001 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/2001 [00:00<?, ?batch/s]

Reloading best model checkpoint from baseline_model.pt...


In [None]:
print("Baseline model validation BLEU using beam search:",
      evaluate(baseline_model, validation_data, batch_size=1, method="beam"))

Baseline model validation BLEU using beam search: 21.932380659234063


## Baseline, WITH weight decay, WITH back translation


In [None]:
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

training_data_switched = [[english, german] for german, english in training_data]
validation_data_switched = [[english, german] for german, english in validation_data]

In [None]:
# Variant of train() that uses the switched (English-to-German) data
def train_switched(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """
  optimizer = torch.optim.Adam(model.parameters())
  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data_switched, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data_switched) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data_switched)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def show_predictions_switched(model, num_examples=4, num_beam=5,include_beam=False):
  for example in validation_data_switched[:num_examples]:
    print("Input:")
    print(" ", example[0])
    print("Target:")
    print(" ", example[1])
    print("Greedy prediction:")
    print(" ", predict_greedy(model, [example[0]])[0])
    if include_beam:
      print(f"Beam predictions (showing top {num_beam}):")
      for candidate in predict_beam(model, [example[0]])[0][:num_beam]:
        print(" ", candidate)
    print()
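
The checkpointing rule in `train_switched` keeps only the weights from the epoch with the highest validation token accuracy. Below is a minimal sketch of that selection rule in plain Python, with no model involved. The accuracy values are hypothetical and `select_best_epoch` is an illustrative helper, not part of the notebook.

```python
def select_best_epoch(validation_accuracies):
    """Return (best_epoch, best_accuracy) using a strict '>' comparison.

    Mirrors the rule in the training loop: a checkpoint is saved only when
    the accuracy strictly exceeds the best value seen so far, so a tie keeps
    the earlier epoch. Epochs are numbered from 1.
    """
    best_epoch, best_accuracy = None, 0.0
    for epoch, accuracy in enumerate(validation_accuracies, start=1):
        if accuracy > best_accuracy:
            best_epoch, best_accuracy = epoch, accuracy
    return best_epoch, best_accuracy


# Hypothetical per-epoch validation accuracies (not taken from a real run).
accuracies = [55.0, 60.0, 63.0, 62.5, 63.0, 61.0]
best_epoch, best_accuracy = select_best_epoch(accuracies)
print(best_epoch, best_accuracy)
```

Because the comparison is strict, the tie at epoch 5 does not replace the checkpoint from epoch 3. This is also why some epochs in the training logs print no "new best" message.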

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16

hidden_dim = 256

enc_output_size = hidden_dim * 2

# word_vector_dim must match hidden_dim, because d_dec is set to this value.
word_vector_dim = 256

num_layers = 2
dropout = 0.3

switched_attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train_switched(switched_attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.50, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 60.89, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 63.08, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.19, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.47, saving model checkpoint to attention_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.83, saving model checkpoint to attention_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
model = switched_attention_model

# Back translation: translate English training sentences into German with the
# English-to-German model, then pair each synthetic German sentence with its
# original English sentence as an extra German-to-English training example.
new_german_to_english_sentences = []

# Only the first 3000 training examples are back-translated.
for i, example in enumerate(training_data_switched[:3000]):
  if i % 100 == 0:
    print("Iteration " + str(i))

  english = example[0]
  # predict_beam returns a list of candidates per input; keep the first one.
  german = predict_beam(model, [english])[0][0]

  new_german_to_english_sentences.append([german, english])

Iteration 0
Iteration 100
Iteration 200
Iteration 300
Iteration 400
Iteration 500
Iteration 600
Iteration 700
Iteration 800
Iteration 900
Iteration 1000
Iteration 1100
Iteration 1200
Iteration 1300
Iteration 1400
Iteration 1500
Iteration 1600
Iteration 1700
Iteration 1800
Iteration 1900
Iteration 2000
Iteration 2100
Iteration 2200
Iteration 2300
Iteration 2400
Iteration 2500
Iteration 2600
Iteration 2700
Iteration 2800
Iteration 2900


In [None]:
training_data = training_data + new_german_to_english_sentences

In [None]:
len(training_data)

32001
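
The augmentation step above pairs each synthetic German sentence with the English sentence it was generated from, then appends those pairs to the original German-to-English data. Here is a small sketch of that bookkeeping. It assumes a stand-in `fake_translate` function in place of `predict_beam` and a made-up two-example corpus. `back_translate` is likewise a hypothetical helper, not a function defined in this notebook.

```python
def back_translate(pairs, translate, limit=None):
    """Build synthetic [german, english] pairs from the English side of `pairs`.

    `pairs` holds [german, english] examples. Each English sentence is passed
    through `translate` (an English-to-German function) and the result is
    paired with the original English sentence.
    """
    subset = pairs if limit is None else pairs[:limit]
    return [[translate(english), english] for _, english in subset]


def fake_translate(english):
    # Hypothetical stand-in for the reverse model; it only tags the input.
    return "<de> " + english


corpus = [
    ["Ein Hund rennt.", "A dog runs."],
    ["Eine Katze schläft.", "A cat sleeps."],
]
synthetic = back_translate(corpus, fake_translate, limit=1)
augmented = corpus + synthetic
print(len(augmented))
print(augmented[-1])
```

The English side of each synthetic pair is always a genuine sentence, so only the source side is model-generated.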

In [None]:
# WITH weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  #optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy
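
`evaluate_next_token` sums the cross-entropy over all non-padding target tokens, divides by the number of such tokens, and exponentiates to get perplexity. The same mask decides which positions count towards accuracy. The plain-Python sketch below reproduces that arithmetic on four hand-written positions. The padding id `PAD = 0`, the probabilities, and the helper name are all assumptions made for illustration.

```python
import math

PAD = 0  # hypothetical padding id for this sketch


def perplexity_and_accuracy(targets, correct_token_probs, predictions):
    """Compute perplexity and accuracy (in percent) over non-padding positions.

    `correct_token_probs[k]` is the probability the model assigned to the
    true token at position k; `predictions[k]` is the argmax token id.
    """
    total_cross_entropy = 0.0
    total_predictions = 0
    correct_predictions = 0
    for target, prob, predicted in zip(targets, correct_token_probs, predictions):
        if target == PAD:
            continue  # padding positions contribute to neither metric
        total_cross_entropy += -math.log(prob)
        total_predictions += 1
        correct_predictions += int(predicted == target)
    perplexity = math.exp(total_cross_entropy / total_predictions)
    accuracy = 100 * correct_predictions / total_predictions
    return perplexity, accuracy


targets = [5, 7, 9, PAD]
correct_token_probs = [0.5, 0.25, 0.5, 1.0]
predictions = [5, 3, 9, 0]
perplexity, accuracy = perplexity_and_accuracy(
    targets, correct_token_probs, predictions)
print(round(perplexity, 4), round(accuracy, 2))
```

Here the three real tokens have a mean cross-entropy of (4/3)·ln 2, so the perplexity is 2^(4/3). Two of the three are predicted correctly, and the padded position is ignored even though its prediction matches.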

In [None]:
baseline_model = Seq2seqBaseline(hidden_dim,word_vector_dim,dropout,num_layers).to(device)
train(baseline_model, num_epochs, batch_size, "baseline_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 44.98, saving model checkpoint to baseline_model.pt...


epoch 2:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 49.85, saving model checkpoint to baseline_model.pt...


epoch 3:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 52.66, saving model checkpoint to baseline_model.pt...


epoch 4:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 54.40, saving model checkpoint to baseline_model.pt...


epoch 5:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.89, saving model checkpoint to baseline_model.pt...


epoch 6:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 56.51, saving model checkpoint to baseline_model.pt...


epoch 7:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 57.14, saving model checkpoint to baseline_model.pt...


epoch 8:   0%|          | 0/2001 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 57.65, saving model checkpoint to baseline_model.pt...


epoch 9:   0%|          | 0/2001 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/2001 [00:00<?, ?batch/s]

Reloading best model checkpoint from baseline_model.pt...


In [None]:
print("Baseline model validation BLEU using beam search:",
      evaluate(baseline_model, validation_data, batch_size=1, method="beam"))

Baseline model validation BLEU using beam search: 24.686687242731608


## Attention, without weight decay, without back translation


In [None]:
# Reload the original training, validation, and test data (no back-translated pairs).
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

In [None]:
# WITHOUT weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  #optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16
hidden_dim = 256
enc_output_size = hidden_dim * 2
word_vector_dim = 256
num_layers = 2
dropout = 0.3

attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train(attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 59.18, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.25, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 65.70, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.86, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.89, saving model checkpoint to attention_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.92, saving model checkpoint to attention_model.pt...


epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.17, saving model checkpoint to attention_model.pt...


epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
print("Attention model validation BLEU using beam search:",
      evaluate(attention_model, validation_data, batch_size=1, method="beam"))
print()
print("Attention model sample predictions:")
print()
show_predictions(attention_model, include_beam=True)

Attention model validation BLEU using beam search: 38.89841600023651

Attention model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men are farming on a truck
Beam predictions (showing top 5):
  A group of men patche from a truck
  A group of men are farming on a truck
  A group of men patch a truck
  A group of men patche of tree.
  A group of men are farming on a truck.

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man sleeping on a couch in a green room.
Beam predictions (showing top 5):
  A man sleeping on a couch in a green room.
  A man sleeping on a couch on a green room.
  A man sleeping on a couch inside a green room.
  A man sleeping on a sofa in a green room.
  A man sleeps on a couch in a green room.

Input:
  Ein Junge mit Kopfhörern sitzt auf den Sch

## Attention, WITH weight decay, without back translation


In [None]:
# Reload the original training, validation, and test data (no back-translated pairs).
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

In [None]:
# WITH weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  #optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy
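
In `torch.optim.Adam`, `weight_decay` acts as an L2 penalty: the term `weight_decay * parameter` is added to each gradient before Adam updates its moment estimates. This differs from the decoupled decay used by `torch.optim.AdamW`. The sketch below shows only that gradient adjustment for a single scalar weight. The numeric values and the helper name are hypothetical, and the Adam update itself is left out.

```python
def l2_adjusted_gradient(gradient, weight, weight_decay):
    """Gradient seen by the optimizer when weight decay is an L2 penalty."""
    return gradient + weight_decay * weight


# Hypothetical scalar weight and loss gradient.
weight, gradient = 2.0, 0.1

plain = l2_adjusted_gradient(gradient, weight, weight_decay=0.0)
decayed = l2_adjusted_gradient(gradient, weight, weight_decay=1e-5)
print(plain, decayed)
```

With `weight_decay=1e-5` the adjustment is small relative to typical gradients, nudging each weight slightly towards zero at every step.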

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16
hidden_dim = 256
enc_output_size = hidden_dim * 2
word_vector_dim = 256
num_layers = 2
dropout = 0.3

attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train(attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 58.72, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 63.74, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 65.83, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.91, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.67, saving model checkpoint to attention_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 68.15, saving model checkpoint to attention_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 68.30, saving model checkpoint to attention_model.pt...


epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
print("Attention model validation BLEU using beam search:",
      evaluate(attention_model, validation_data, batch_size=1, method="beam"))
print()
print("Attention model sample predictions:")
print()
show_predictions(attention_model, include_beam=True)

Attention model validation BLEU using beam search: 39.11135178426874

Attention model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men are protesting-ing down a truck.
Beam predictions (showing top 5):
  A group of men hang out of a truck.
  A group of men unloading crops from a truck.
  A group of men are unloading treells.
  A group of men hang out of a truck
  A group of men unloading crops from a truck

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man is sleeping on a couch in a green room.
Beam predictions (showing top 5):
  A man sleeps on a couch in a green room.
  A man sleeps in a green room on a couch.
  Man sleeping on couch in green room.
  A man is sleeping on a couch in a green room
  A man sleeps in a green room on a couch

Input:
  Ein Junge mit K

## Attention, without weight decay, WITH back translation

In [None]:
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

training_data_switched = [[english, german] for german, english in training_data]
validation_data_switched = [[english, german] for german, english in validation_data]

In [None]:
# Modified train(): trains and validates on the switched (English-to-German) data.
def train_switched(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """
  optimizer = torch.optim.Adam(model.parameters())
  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data_switched, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data_switched) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data_switched)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def show_predictions_switched(model, num_examples=4, num_beam=5,include_beam=False):
  for example in validation_data_switched[:num_examples]:
    print("Input:")
    print(" ", example[0])
    print("Target:")
    print(" ", example[1])
    print("Greedy prediction:")
    print(" ", predict_greedy(model, [example[0]])[0])
    if include_beam:
      print(f"Beam predictions (showing top {num_beam}):")
      for candidate in predict_beam(model, [example[0]])[0][:num_beam]:
        print(" ", candidate)
    print()

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16

hidden_dim = 256

enc_output_size = hidden_dim * 2

# word_vector_dim must match hidden_dim, because d_dec is set to this value.
word_vector_dim = 256

num_layers = 2
dropout = 0.3

switched_attention_model = Seq2seqAttention(hidden_dim,enc_output_size, word_vector_dim,dropout,num_layers).to(device)
train_switched(switched_attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 56.26, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 61.52, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 63.06, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 63.90, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.65, saving model checkpoint to attention_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.97, saving model checkpoint to attention_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
model = switched_attention_model

# Back translation: translate English training sentences into German with the
# English-to-German model, then pair each synthetic German sentence with its
# original English sentence as an extra German-to-English training example.
new_german_to_english_sentences = []

# Only the first 6000 training examples are back-translated.
for i, example in enumerate(training_data_switched[:6000]):
  if i % 100 == 0:
    print("Iteration " + str(i))

  english = example[0]
  # predict_beam returns a list of candidates per input; keep the first one.
  german = predict_beam(model, [english])[0][0]

  new_german_to_english_sentences.append([german, english])

Iteration 0
Iteration 100
Iteration 200
Iteration 300
Iteration 400
Iteration 500
Iteration 600
Iteration 700
Iteration 800
Iteration 900
Iteration 1000
Iteration 1100
Iteration 1200
Iteration 1300
Iteration 1400
Iteration 1500
Iteration 1600
Iteration 1700
Iteration 1800
Iteration 1900
Iteration 2000
Iteration 2100
Iteration 2200
Iteration 2300
Iteration 2400
Iteration 2500
Iteration 2600
Iteration 2700
Iteration 2800
Iteration 2900
Iteration 3000
Iteration 3100
Iteration 3200
Iteration 3300
Iteration 3400
Iteration 3500
Iteration 3600
Iteration 3700
Iteration 3800
Iteration 3900
Iteration 4000
Iteration 4100
Iteration 4200
Iteration 4300
Iteration 4400
Iteration 4500
Iteration 4600
Iteration 4700
Iteration 4800
Iteration 4900
Iteration 5000
Iteration 5100
Iteration 5200
Iteration 5300
Iteration 5400
Iteration 5500
Iteration 5600
Iteration 5700
Iteration 5800
Iteration 5900


In [None]:
training_data = training_data + new_german_to_english_sentences

In [None]:
len(training_data)

35001
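
The reported size can be checked against the progress bars in the training logs. The short calculation below recovers the original training set size from the augmented size and the 6000 back-translated examples. It then derives the batches per epoch using the same `math.ceil(len(training_data) / batch_size)` expression as `train`. The variable names are illustrative.

```python
import math

augmented_size = 35001      # len(training_data) reported above
synthetic_examples = 6000   # size of the back-translated slice
batch_size = 16             # value used throughout the notebook

original_size = augmented_size - synthetic_examples
original_batches = math.ceil(original_size / batch_size)
augmented_batches = math.ceil(augmented_size / batch_size)
synthetic_share = 100 * synthetic_examples / augmented_size

print(original_size, original_batches, augmented_batches)
print(round(synthetic_share, 1))
```

The original size works out to 29001 examples, or 1813 batches per epoch, matching the `0/1813` shown for runs on the unaugmented data. The augmented set needs 2188 batches per epoch, and roughly 17% of its examples are synthetic.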

In [None]:
# WITHOUT weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  #optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device),target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy
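
To make the two metrics concrete, here is a minimal pure-Python sketch of what `evaluate_next_token` accumulates: summed cross-entropy and correct predictions over non-padding target positions only. The toy probabilities, the `PAD` value, and the helper name `toy_next_token_metrics` are illustrative assumptions, not part of the notebook.

```python
import math

PAD = 0  # hypothetical padding id, standing in for the notebook's pad_id


def toy_next_token_metrics(predicted_probs, targets):
    """Return (perplexity, accuracy in %) over non-padding target tokens.

    predicted_probs: one dict per position mapping token id -> probability.
    targets: the gold token id at each position.
    """
    total_cross_entropy = 0.0
    total_predictions = 0
    correct_predictions = 0
    for probs, gold in zip(predicted_probs, targets):
        if gold == PAD:
            continue  # padding positions are skipped, like ignore_index=pad_id
        total_cross_entropy += -math.log(probs[gold])
        total_predictions += 1
        if max(probs, key=probs.get) == gold:
            correct_predictions += 1
    perplexity = math.exp(total_cross_entropy / total_predictions)
    accuracy = 100 * correct_predictions / total_predictions
    return perplexity, accuracy


toy_probs = [
    {1: 0.5, 2: 0.25, 3: 0.25},   # gold token 1 is the argmax: correct
    {1: 0.25, 2: 0.5, 3: 0.25},   # gold token 3 is not the argmax: wrong
    {1: 0.9, 2: 0.05, 3: 0.05},   # gold is PAD: ignored entirely
]
toy_targets = [1, 3, PAD]
toy_perplexity, toy_accuracy = toy_next_token_metrics(toy_probs, toy_targets)
print(toy_perplexity, toy_accuracy)
```

The padded position contributes to neither metric, so the perplexity is the exponential of the mean cross-entropy over the two real tokens and the accuracy is one correct prediction out of two.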

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16
hidden_dim = 256
enc_output_size = hidden_dim * 2
word_vector_dim = 256
num_layers = 2
dropout = 0.3

attention_model = Seq2seqAttention(
    hidden_dim, enc_output_size, word_vector_dim, dropout, num_layers).to(device)
train(attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 60.57, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.91, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.11, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.99, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 6:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 7:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/2188 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
print("Attention model validation BLEU using beam search:",
      evaluate(attention_model, validation_data, batch_size=1, method="beam"))
print()
print("Attention model sample predictions:")
print()
show_predictions(attention_model, include_beam=True)

Attention model validation BLEU using beam search: 38.307577579855554

Attention model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men are sweeping a truck to a truck
Beam predictions (showing top 5):
  A group of men are picking snow cones on a truck
  A group of men are picking fallen branches on a truck
  A group of men are picking tree branches on a truck
  A group of men are picking fallen branches on a truck.
  A group of men are sweeping a truck to a truck

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man sleeping on a couch in a green room.
Beam predictions (showing top 5):
  A man sleeping on a couch in a green room.
  A man sleeping in a green room on a couch.
  A man asleep in a green room on a couch.
  A man is sleeping on a couch in a green room.
  

## Attention, WITH weight decay, WITH back translation

In [None]:
with open("training_data.json","r") as f:
    training_data = json.load(f)
with open("validation_data.json","r") as f:
    validation_data = json.load(f)
with open("test_data.json","r") as f:
    test_data = json.load(f)

training_data_switched = [[english, german] for german, english in training_data]
validation_data_switched = [[english, german] for german, english in validation_data]

In [None]:
# Modified train: uses the switched (English-to-German) data for both training
# and validation.
def train_switched(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """
  optimizer = torch.optim.Adam(model.parameters())
  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data_switched, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data_switched) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device), target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data_switched)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def show_predictions_switched(model, num_examples=4, num_beam=5, include_beam=False):
  for example in validation_data_switched[:num_examples]:
    print("Input:")
    print(" ", example[0])
    print("Target:")
    print(" ", example[1])
    print("Greedy prediction:")
    print(" ", predict_greedy(model, [example[0]])[0])
    if include_beam:
      print(f"Beam predictions (showing top {num_beam}):")
      for candidate in predict_beam(model, [example[0]])[0][:num_beam]:
        print(" ", candidate)
    print()

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16

# hidden_dim (kept at 256 here)
hidden_dim = 256

enc_output_size = hidden_dim * 2

# word_vector_dim: d_dec is set to this value, so it needs to match hidden_dim.
word_vector_dim = 256

num_layers = 2
dropout = 0.3

switched_attention_model = Seq2seqAttention(
    hidden_dim, enc_output_size, word_vector_dim, dropout, num_layers).to(device)
train_switched(switched_attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 55.94, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 60.53, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 62.84, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 63.64, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.53, saving model checkpoint to attention_model.pt...


epoch 6:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.70, saving model checkpoint to attention_model.pt...


epoch 7:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 8:   0%|          | 0/1813 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.77, saving model checkpoint to attention_model.pt...


epoch 9:   0%|          | 0/1813 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/1813 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
# Back translation: use the English-to-German model to translate the English
# side of the first 6000 training pairs, then pair each synthetic German
# sentence with its original English sentence.
model = switched_attention_model

new_german_to_english_sentences = []

for i, example in enumerate(training_data_switched[:6000]):
  if i % 100 == 0:
    print("Iteration " + str(i))

  # example is [english, german] in the switched data.
  english = example[0]

  # Keep only the top beam search candidate as the synthetic German source.
  german = predict_beam(model, [english])[0][0]

  new_german_to_english_sentences.append([german, english])

Iteration 0
Iteration 100
Iteration 200
Iteration 300
Iteration 400
Iteration 500
Iteration 600
Iteration 700
Iteration 800
Iteration 900
Iteration 1000
Iteration 1100
Iteration 1200
Iteration 1300
Iteration 1400
Iteration 1500
Iteration 1600
Iteration 1700
Iteration 1800
Iteration 1900
Iteration 2000
Iteration 2100
Iteration 2200
Iteration 2300
Iteration 2400
Iteration 2500
Iteration 2600
Iteration 2700
Iteration 2800
Iteration 2900
Iteration 3000
Iteration 3100
Iteration 3200
Iteration 3300
Iteration 3400
Iteration 3500
Iteration 3600
Iteration 3700
Iteration 3800
Iteration 3900
Iteration 4000
Iteration 4100
Iteration 4200
Iteration 4300
Iteration 4400
Iteration 4500
Iteration 4600
Iteration 4700
Iteration 4800
Iteration 4900
Iteration 5000
Iteration 5100
Iteration 5200
Iteration 5300
Iteration 5400
Iteration 5500
Iteration 5600
Iteration 5700
Iteration 5800
Iteration 5900
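
The loop above follows the usual back-translation recipe: run a reverse-direction model on the target side of real pairs and keep each output as a synthetic source for the original target. Below is a minimal sketch of that data flow, assuming a toy word-by-word dictionary in place of the trained English-to-German model. `back_translate`, `toy_reverse_translate`, and the example sentences are hypothetical names and data, not from the notebook.

```python
def back_translate(parallel_pairs, reverse_translate, limit):
    """Build synthetic [source, target] pairs from the target side of real pairs."""
    synthetic = []
    for _source, target in parallel_pairs[:limit]:
        # The synthetic source is model output; the target stays human-written.
        synthetic.append([reverse_translate(target), target])
    return synthetic


# Hypothetical stand-in for the reverse model: a word-by-word lookup.
toy_dictionary = {"a": "ein", "dog": "Hund", "runs": "rennt"}


def toy_reverse_translate(english):
    return " ".join(toy_dictionary.get(word, word) for word in english.split())


pairs = [["Ein Hund rennt", "a dog runs"], ["Ein Hund", "a dog"], ["Hund", "dog"]]
synthetic_pairs = back_translate(pairs, toy_reverse_translate, limit=2)
augmented_pairs = pairs + synthetic_pairs
print(synthetic_pairs)
print(len(augmented_pairs))
```

Only the source side is synthetic, so the German-to-English model still learns to produce clean, human-written English even when its inputs are noisy.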


In [None]:
training_data = training_data + new_german_to_english_sentences

In [None]:
len(training_data)

35001
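
This count is consistent with the progress bars in the training logs: 35001 pairs minus the 6000 synthetic ones leaves 29001 original pairs, which matches the 1813 batches per epoch for the reverse model, and 35001 pairs give the 2188 batches per epoch for the runs on augmented data. A quick check, where 29001 is inferred from these numbers rather than printed in this section:

```python
import math

batch_size = 16
augmented_size = 35001   # len(training_data) after adding synthetic pairs
synthetic_size = 6000    # back-translated pairs added by the loop
original_size = augmented_size - synthetic_size  # inferred original size

batches_original = math.ceil(original_size / batch_size)
batches_augmented = math.ceil(augmented_size / batch_size)
print(original_size, batches_original, batches_augmented)  # 29001 1813 2188
```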

In [None]:
# WITH weight decay
def train(model, num_epochs, batch_size, model_file):
  """Train the model and save its best checkpoint.

  Model performance across epochs is evaluated using token-level accuracy on the
  validation set. The best checkpoint obtained during training will be stored on
  disk and loaded back into the model at the end of training.
  """

  optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
  #optimizer = torch.optim.Adam(model.parameters())

  best_accuracy = 0.0
  for epoch in tqdm.notebook.trange(num_epochs, desc="training", unit="epoch"):
    with tqdm.notebook.tqdm(
        make_batch_iterator(training_data, batch_size, shuffle=True),
        desc="epoch {}".format(epoch + 1),
        unit="batch",
        total=math.ceil(len(training_data) / batch_size)) as batch_iterator:
      model.train()
      total_loss = 0.0
      for i, (source, target) in enumerate(batch_iterator, start=1):
        source, target = source.to(device), target.to(device)
        optimizer.zero_grad()
        loss = model.compute_loss(source, target)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        batch_iterator.set_postfix(mean_loss=total_loss / i)
      validation_perplexity, validation_accuracy = evaluate_next_token(
          model, validation_data)
      batch_iterator.set_postfix(
          mean_loss=total_loss / i,
          validation_perplexity=validation_perplexity,
          validation_token_accuracy=validation_accuracy)
      if validation_accuracy > best_accuracy:
        print(
            "Obtained a new best validation accuracy of {:.2f}, saving model "
            "checkpoint to {}...".format(validation_accuracy, model_file))
        torch.save(model.state_dict(), model_file)
        best_accuracy = validation_accuracy
  print("Reloading best model checkpoint from {}...".format(model_file))
  model.load_state_dict(torch.load(model_file))

def evaluate_next_token(model, dataset, batch_size=64):
  """Compute token-level perplexity and accuracy metrics.

  Note that the perplexity here is over subwords, not words.

  This function is used for validation set evaluation at the end of each epoch
  and should not be modified.
  """
  model.eval()
  total_cross_entropy = 0.0
  total_predictions = 0
  correct_predictions = 0
  with torch.no_grad():
    for source, target in make_batch_iterator(dataset, batch_size):
      encoder_output, encoder_mask, encoder_hidden = model.encode(source)
      decoder_input, decoder_target = target[:-1], target[1:]
      logits, decoder_hidden, attention_weights = model.decode(
          decoder_input, encoder_hidden, encoder_output, encoder_mask)
      total_cross_entropy += F.cross_entropy(
          logits.permute(1, 2, 0), decoder_target.permute(1, 0),
          ignore_index=pad_id, reduction="sum").item()
      total_predictions += (decoder_target != pad_id).sum().item()
      correct_predictions += (
          (decoder_target != pad_id) &
          (decoder_target == logits.argmax(2))).sum().item()
  perplexity = math.exp(total_cross_entropy / total_predictions)
  accuracy = 100 * correct_predictions / total_predictions
  return perplexity, accuracy
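
The only change in this `train` compared with the earlier version is `weight_decay=1e-5`. In `torch.optim.Adam`, `weight_decay` acts as an L2 penalty: `weight_decay * param` is added to the gradient before the moment estimates are updated (the decoupled variant is `torch.optim.AdamW`). The following is a minimal scalar sketch of one such step in pure Python, assuming Adam's default hyperparameters. `adam_step` and the `state` dict are illustrative names, not the library's internals.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a scalar parameter with L2-style weight decay."""
    grad = grad + weight_decay * param  # L2 penalty folded into the gradient
    state["step"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["step"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["step"])  # bias-corrected variance
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def fresh_state():
    return {"step": 0, "m": 0.0, "v": 0.0}


# With a zero loss gradient, only the decayed run moves the weight.
plain = adam_step(2.0, 0.0, fresh_state())
decayed = adam_step(2.0, 0.0, fresh_state(), weight_decay=1e-5)
print(plain, decayed)
```

Without decay the weight stays where it is when the loss gradient is zero. With decay it is pulled toward zero, which is the regularizing effect being tested in this experiment.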

In [None]:
# You are welcome to adjust these parameters based on your model implementation.
num_epochs = 10
batch_size = 16
hidden_dim = 256
enc_output_size = hidden_dim * 2
word_vector_dim = 256
num_layers = 2
dropout = 0.3

attention_model = Seq2seqAttention(
    hidden_dim, enc_output_size, word_vector_dim, dropout, num_layers).to(device)
train(attention_model, num_epochs, batch_size, "attention_model.pt")

training:   0%|          | 0/10 [00:00<?, ?epoch/s]

epoch 1:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 60.01, saving model checkpoint to attention_model.pt...


epoch 2:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 64.95, saving model checkpoint to attention_model.pt...


epoch 3:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 66.36, saving model checkpoint to attention_model.pt...


epoch 4:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.52, saving model checkpoint to attention_model.pt...


epoch 5:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 67.80, saving model checkpoint to attention_model.pt...


epoch 6:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 7:   0%|          | 0/2188 [00:00<?, ?batch/s]

Obtained a new best validation accuracy of 68.23, saving model checkpoint to attention_model.pt...


epoch 8:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 9:   0%|          | 0/2188 [00:00<?, ?batch/s]

epoch 10:   0%|          | 0/2188 [00:00<?, ?batch/s]

Reloading best model checkpoint from attention_model.pt...


In [None]:
print("Attention model validation BLEU using beam search:",
      evaluate(attention_model, validation_data, batch_size=1, method="beam"))
print()
print("Attention model sample predictions:")
print()
show_predictions(attention_model, include_beam=True)

Attention model validation BLEU using beam search: 39.966018276463686

Attention model sample predictions:

Input:
  Eine Gruppe von Männern lädt Baumwolle auf einen Lastwagen
Target:
  A group of men are loading cotton onto a truck
Greedy prediction:
  A group of men unloading tree-side a truck
Beam predictions (showing top 5):
  A group of men unloading the tree.
  A group of men unloading the tree's trucks.
  A group of men unloading the tree-side.
  A group of men are putting last items on a truck
  A group of men unloading the tree's trucks

Input:
  Ein Mann schläft in einem grünen Raum auf einem Sofa.
Target:
  A man sleeping in a green room on a couch.
Greedy prediction:
  A man sleeps on a couch in a green room.
Beam predictions (showing top 5):
  A man sleeps on a couch in a green room.
  A man sleeping on a couch in a green room.
  A man sleeps in a green room on a couch.
  Man sleeping on a couch in a green room.
  A man is sleeping on a couch in a green room

Input:
  Ein 

In [None]:
generate_predictions_file_for_submission("seq2seq_predictions_attention.json", attention_model, test_data, "beam", batch_size=1)

Finished writing predictions to seq2seq_predictions_attention.json.


#### Results from Experimentation:
- Weight decay on its own gave a more consistent and larger improvement in BLEU than back translation on its own.
- Combining weight decay and back translation with the attention-based model gave the highest BLEU score across all experiments (~39.97 validation BLEU using beam search).