# Problem 1: Training a Simple Chatbot using a Seq-to-Seq Model (25 points)

You will train a simple chatbot on movie scripts from the Cornell Movie-Dialogs Corpus, following the [PyTorch Chatbot Tutorial](https://pytorch.org/tutorials/beginner/chatbot_tutorial.html).

This tutorial walks you through training a recurrent sequence-to-sequence model. You will learn how to:

- Load and preprocess [the Cornell Movie-Dialogs Corpus dataset](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- Implement a sequence-to-sequence model with [Luong attention mechanism(s)](https://arxiv.org/abs/1508.04025)
- Jointly train encoder and decoder models using mini-batches
- Implement a greedy-search decoding module
- Interact with the trained chatbot

---

## Scoring Breakdown

| Task | Points |
|------|--------|
| Task 1: Run the tutorial end-to-end in Colab | 5 |
| Task 3: Create W&B sweep configuration | 5 |
| Task 4: Run hyperparameter sweeps on GPU Colab | 5 |
| Task 5: Analysis of best hyperparameters & feature importance | 10 |
| **Total** | **25** |

---

## References
- [The Cornell Movie Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- [Hyperparameter sweeps with Weights and Biases (video tutorial)](https://www.youtube.com/watch?v=9zrmUIlScdY)
- [Sample Google Colab project for W&B sweeps](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W%26B.ipynb)
- [Weights and Biases Website](https://wandb.ai/site)

---
## Task 1 [5 points]: Run the Tutorial End-to-End

Make a copy of the [PyTorch Chatbot Tutorial](https://pytorch.org/tutorials/beginner/chatbot_tutorial.html) notebook and follow its instructions to train and evaluate the chatbot model in your **Google Colab** environment (GPU recommended).

The tutorial code is provided below as your starting point. Run each cell in order and verify that the model trains successfully and you can interact with the chatbot at the end.

### Setup: Install Dependencies and Download Data

The Cornell Movie-Dialogs Corpus must be downloaded before running the tutorial. The dataset is available via [ConvoKit](https://convokit.cornell.edu/documentation/movie.html).

In [None]:
# Install required packages (run once in Colab)
# Requires PyTorch >= 2.4.0 (pre-installed in Colab; verify with: import torch; print(torch.__version__))
!pip install "convokit>=3.0,<4.0" -q
!pip install "wandb>=0.18" -q

# Download the Cornell Movie-Dialogs Corpus via ConvoKit
import convokit
corpus = convokit.Corpus(filename=convokit.download("movie-corpus"))

### Preparations

In [None]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math
import json

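# Compare version numbers numerically; a plain string comparison would rank "2.10" below "2.4".
_torch_version = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:2])
assert _torch_version >= (2, 4), f"PyTorch >= 2.4 required, got {torch.__version__}"

# torch.accelerator is only present in newer PyTorch releases, so fall back to explicit checks.
if hasattr(torch, "accelerator") and torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator().type
elif torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using {device} device")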

### Load & Preprocess Data

The Cornell Movie-Dialogs Corpus is stored as a JSON Lines file, `utterances.jsonl`, with one utterance per line. We parse the raw file to extract consecutive question-answer sentence pairs from each conversation.

In [ ]:
def loadLinesAndConversations(fileName):
    lines = {}
    conversations = {}
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        for line in f:
            lineJson = json.loads(line)
            lineObj = {}
            lineObj["lineID"] = lineJson["id"]
            lineObj["characterID"] = lineJson["speaker"]
            lineObj["text"] = lineJson["text"]
            lines[lineObj['lineID']] = lineObj

            if lineJson["conversation_id"] not in conversations:
                convObj = {}
                convObj["conversationID"] = lineJson["conversation_id"]
                convObj["movieID"] = lineJson["meta"]["movie_id"]
                convObj["lines"] = [lineObj]
            else:
                convObj = conversations[lineJson["conversation_id"]]
                convObj["lines"].insert(0, lineObj)
            conversations[convObj["conversationID"]] = convObj

    return lines, conversations


def extractSentencePairs(conversations):
    qa_pairs = []
    for conversation in conversations.values():
        for i in range(len(conversation["lines"]) - 1):
            inputLine = conversation["lines"][i]["text"].strip()
            targetLine = conversation["lines"][i+1]["text"].strip()
            if inputLine and targetLine:
                qa_pairs.append([inputLine, targetLine])
    return qa_pairs
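
The pairing logic can be tried on a toy input first. The sketch below is a minimal illustration, not part of the tutorial code: the three utterances are made up, `toy_conversations` and `toy_pairs` are hypothetical names, and it assumes (as the `insert(0, ...)` call in `loadLinesAndConversations` suggests) that the file lists each conversation's utterances newest-first.

```python
import json

# Toy utterances in the assumed newest-first file order.
# The field names mirror utterances.jsonl; the IDs and text are made up.
raw_lines = [
    '{"id": "L3", "conversation_id": "L1", "text": "Fine, thanks."}',
    '{"id": "L2", "conversation_id": "L1", "text": "How are you?"}',
    '{"id": "L1", "conversation_id": "L1", "text": "Hello."}',
]

toy_conversations = {}
for raw in raw_lines:
    utt = json.loads(raw)
    # Prepending restores chronological order within each conversation.
    toy_conversations.setdefault(utt["conversation_id"], []).insert(0, utt["text"])

toy_pairs = []
for texts in toy_conversations.values():
    # Each utterance is paired with the one that follows it.
    toy_pairs.extend([a, b] for a, b in zip(texts, texts[1:]))

print(toy_pairs)
```

A conversation with three utterances yields two pairs, because every utterance except the last serves as an input and every utterance except the first serves as a target.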

#### Create Formatted Data File

Parse the corpus and write tab-separated input/output pairs to a text file for training.

In [ ]:
# Define paths — update corpus_path to where ConvoKit downloaded the dataset
corpus_name = "movie-corpus"
corpus_path = os.path.join("/root/.convokit/saved-corpora", corpus_name)
datafile = os.path.join(corpus_path, "formatted_movie_lines.txt")
save_dir = os.path.join("data", "save")

delimiter = '\t'
delimiter = str(codecs.decode(delimiter, "unicode_escape"))

print("\nProcessing corpus...")
lines, conversations = loadLinesAndConversations(
    os.path.join(corpus_path, "utterances.jsonl")
)

print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding='utf-8') as outputfile:
    writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
    for pair in extractSentencePairs(conversations):
        writer.writerow(pair)

print("\nSample lines from file:")
with open(datafile, 'rb') as f:
    lines_sample = f.readlines()
for line in lines_sample[:3]:
    print(line)

#### Vocabulary Class

The `Voc` class maintains word-to-index and index-to-word mappings. Three special tokens are reserved:
- `PAD` (0): padding token used to equalize batch sequence lengths
- `SOS` (1): start-of-sequence token fed as the first decoder input
- `EOS` (2): end-of-sequence token appended to every target sequence

In [ ]:
PAD_token = 0
SOS_token = 1
EOS_token = 2

class Voc:
    def __init__(self, name):
        self.name = name
        self.trimmed = False
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1

    def trim(self, min_count):
        if self.trimmed:
            return
        self.trimmed = True
        keep_words = []
        for k, v in self.word2count.items():
            if v >= min_count:
                keep_words.append(k)
        print('keep_words {} / {} = {:.4f}'.format(
            len(keep_words), len(self.word2index), len(keep_words) / len(self.word2index)
        ))
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3
        for word in keep_words:
            self.addWord(word)

#### Text Normalization & Data Loading

In [ ]:
MAX_LENGTH = 10  # Maximum sentence length (in words) to consider

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

def readVocs(datafile, corpus_name):
    print("Reading lines...")
    lines = open(datafile, encoding='utf-8').read().strip().split('\n')
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]
    voc = Voc(corpus_name)
    return voc, pairs

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

def loadPrepareData(corpus, corpus_name, datafile, save_dir):
    print("Start preparing training data ...")
    voc, pairs = readVocs(datafile, corpus_name)
    print("Read {!s} sentence pairs".format(len(pairs)))
    pairs = filterPairs(pairs)
    print("Trimmed to {!s} sentence pairs".format(len(pairs)))
    print("Counting words...")
    for pair in pairs:
        voc.addSentence(pair[0])
        voc.addSentence(pair[1])
    print("Counted words:", voc.num_words)
    return voc, pairs

voc, pairs = loadPrepareData(corpus_name, corpus_name, datafile, save_dir)
print("\nSample pairs:")
for pair in pairs[:5]:
    print(pair)
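
To see what normalization does to a single string, the following sketch repeats the same steps in a standalone function. `normalize_demo` is a hypothetical name used only for this illustration, and the sample sentences are made up.

```python
import re
import unicodedata

def normalize_demo(s):
    """Hypothetical stand-in that repeats the steps of normalizeString."""
    s = s.lower().strip()
    # Decompose accented characters and drop the combining marks.
    s = ''.join(c for c in unicodedata.normalize('NFD', s)
                if unicodedata.category(c) != 'Mn')
    s = re.sub(r"([.!?])", r" \1", s)      # detach sentence punctuation
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # replace everything else with a space
    return re.sub(r"\s+", r" ", s).strip()

print(normalize_demo("Aren't you  coming?!"))  # aren t you coming ? !
print(normalize_demo("Café, please."))         # cafe please .
```

Note that apostrophes are replaced by spaces, so contractions such as "aren't" are split into two tokens.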

#### Trim Rare Words

Remove words appearing fewer than `MIN_COUNT` times to reduce vocabulary size and improve generalization.

In [ ]:
MIN_COUNT = 3

def trimRareWords(voc, pairs, MIN_COUNT):
    voc.trim(MIN_COUNT)
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        for word in input_sentence.split(' '):
            if word not in voc.word2index:
                keep_input = False
                break
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break
        if keep_input and keep_output:
            keep_pairs.append(pair)
    print("Trimmed from {} pairs to {}, {:.4f} of total".format(
        len(pairs), len(keep_pairs), len(keep_pairs) / len(pairs)
    ))
    return keep_pairs

pairs = trimRareWords(voc, pairs, MIN_COUNT)
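
The effect of trimming can be checked by hand on a few pairs. This sketch uses `collections.Counter` instead of the `Voc` class and a made-up threshold (`min_count = 2`), so it shows the idea rather than the tutorial's implementation.

```python
from collections import Counter

toy_pairs = [
    ["hello there", "hello"],
    ["hello friend", "goodbye friend"],
    ["goodbye", "hello"],
]
min_count = 2  # hypothetical threshold, playing the role of MIN_COUNT

# Count every word on both sides of every pair.
counts = Counter(w for pair in toy_pairs for sent in pair for w in sent.split(' '))
kept_words = {w for w, c in counts.items() if c >= min_count}

# A pair survives only if all of its words survive.
kept_pairs = [p for p in toy_pairs
              if all(w in kept_words for sent in p for w in sent.split(' '))]

print(sorted(kept_words))
print(kept_pairs)
```

Here "there" appears only once, so the first pair is dropped entirely even though its other words are frequent.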

### Prepare Data for Models

Convert sentence pairs into padded tensors suitable for batch training. Sequences in a batch are padded to the same length, and a binary mask is created so that the loss function ignores padding positions.

In [ ]:
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

def zeroPadding(l, fillvalue=PAD_token):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
    """Return a 0/1 matrix: 1 for real tokens, 0 for positions equal to `value`."""
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == value:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

def inputVar(l, voc):
    """Returns padded input sequence tensor and lengths."""
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

def outputVar(l, voc):
    """Returns padded target sequence tensor, mask, and max target length."""
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeroPadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.BoolTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

def batch2TrainData(voc, pair_batch):
    """Returns all items for a given batch of pairs."""
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len

# Sanity check
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengths, target_variable, mask, max_target_len = batches
print("input_variable:", input_variable)
print("lengths:", lengths)
print("target_variable:", target_variable)
print("mask:", mask)
print("max_target_len:", max_target_len)
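
Padding and masking can be reproduced with plain lists before tensors are involved. The sketch below assumes three already-indexed sentences with made-up token indices; only `PAD = 0` and the EOS index 2 come from this document.

```python
import itertools

PAD = 0  # same index this document reserves for padding

# Three already-indexed sentences of different lengths (token 2 is EOS).
batch = [[5, 6, 7, 2], [8, 2], [9, 4, 2]]

# zip_longest pads the short rows and transposes to (max_len, batch_size).
padded = list(itertools.zip_longest(*batch, fillvalue=PAD))
mask = [[int(tok != PAD) for tok in step] for step in padded]

for step, m in zip(padded, mask):
    print(step, m)
```

Each row of `padded` holds one time step across the whole batch, which is the layout the encoder and decoder consume.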

### Define Models

#### Encoder

The encoder is a bidirectional GRU that produces one output vector per input token in each direction; the forward and backward outputs are **summed** so that each time step yields a single vector of size `hidden_size`.

In [ ]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
        # TODO: Define a bidirectional GRU with the given hidden_size, n_layers, and dropout.
        # Use nn.GRU — remember to set bidirectional=True.
        # YOUR CODE HERE
        raise NotImplementedError

    def forward(self, input_seq, input_lengths, hidden=None):
        # TODO: Embed input_seq, pack the padded sequence, run through the GRU,
        # unpack, then SUM the forward and backward outputs to get a single
        # context vector per time step. Return (outputs, hidden).
        # YOUR CODE HERE
        raise NotImplementedError


#### Attention Layer

The [Luong attention mechanism](https://arxiv.org/abs/1508.04025) computes a context vector as a weighted sum of encoder outputs. Three scoring functions are supported:

| Method | Formula |
|--------|---------|
| `dot` | $h_t^\top \bar{h}_s$ |
| `general` | $h_t^\top W_a \bar{h}_s$ |
| `concat` | $v_a^\top \tanh(W_a [h_t ; \bar{h}_s])$ |

In [ ]:
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        if self.method not in ['dot', 'general', 'concat']:
            raise ValueError(f"{self.method} is not an appropriate attention method.")
        self.hidden_size = hidden_size
        # TODO: For 'general', create a Linear(hidden_size, hidden_size).
        # For 'concat', create Linear(hidden_size*2, hidden_size) and a learnable
        # parameter vector v of size hidden_size.
        # YOUR CODE HERE

    def dot_score(self, hidden, encoder_output):
        # TODO: Compute element-wise product and sum over the last dimension.
        # YOUR CODE HERE
        raise NotImplementedError

    def general_score(self, hidden, encoder_output):
        # TODO: Apply self.attn to encoder_output, then dot with hidden.
        # YOUR CODE HERE
        raise NotImplementedError

    def concat_score(self, hidden, encoder_output):
        # TODO: Expand hidden to match encoder_output shape, concatenate,
        # apply self.attn + tanh, then dot with self.v.
        # YOUR CODE HERE
        raise NotImplementedError

    def forward(self, hidden, encoder_outputs):
        # TODO: Dispatch to the correct scoring function based on self.method,
        # transpose the energies, and return a softmax probability distribution
        # with an added dimension (shape: batch x 1 x src_len).
        # YOUR CODE HERE
        raise NotImplementedError
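
Before writing the batched tensor version, it may help to trace the `dot` score for a single decoder state by hand. The sketch below uses plain Python lists with made-up numbers; `dot_attention` is a hypothetical helper, not part of the `Attn` class, and it handles one example rather than a batch.

```python
import math

def dot_attention(hidden, encoder_outputs):
    """Return (weights, context) for one decoder state using the `dot` score."""
    scores = [sum(h * e for h, e in zip(hidden, enc)) for enc in encoder_outputs]
    # Softmax with the usual max-subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Context vector: attention-weighted sum of the encoder outputs.
    dim = len(hidden)
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context

toy_hidden = [1.0, 0.0]
toy_encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_weights, toy_context = dot_attention(toy_hidden, toy_encoder_outputs)
print([round(w, 3) for w in toy_weights])
print([round(c, 3) for c in toy_context])
```

The raw scores are 1, 0, and 2, so the third encoder position receives the largest weight and dominates the context vector.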


#### Decoder

The `LuongAttnDecoderRNN` generates one output token per step. It attends to encoder outputs via the `Attn` module, concatenates the context vector with the GRU output, and projects the result to a vocabulary-sized distribution.

In [ ]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        # TODO: Define a unidirectional GRU (hidden_size -> hidden_size, n_layers).
        # TODO: Define a concat Linear(hidden_size*2, hidden_size) to merge context + GRU output.
        # TODO: Define an output Linear(hidden_size, output_size) for the vocabulary projection.
        # TODO: Instantiate an Attn(attn_model, hidden_size) attention module.
        # YOUR CODE HERE
        raise NotImplementedError

    def forward(self, input_step, last_hidden, encoder_outputs):
        # TODO:
        # 1. Embed input_step and apply dropout.
        # 2. Run through self.gru to get rnn_output and new hidden state.
        # 3. Compute attention weights over encoder_outputs using self.attn.
        # 4. Compute context vector via batch matrix multiply (bmm).
        # 5. Concatenate rnn_output and context, apply self.concat + tanh.
        # 6. Project to vocabulary size with self.out and apply softmax.
        # Return (output, hidden).
        # YOUR CODE HERE
        raise NotImplementedError


### Define Training Procedure

#### Masked NLL Loss

Because sequences are padded to the same length within a batch, we compute loss only over non-padding positions using a binary mask.

In [ ]:
def maskNLLLoss(inp, target, mask):
    """Compute NLL loss over non-padded positions only.

    Args:
        inp:    (batch, vocab_size) softmax probabilities from the decoder
        target: (batch,) ground-truth token indices
        mask:   (batch,) boolean mask — True for real tokens, False for PAD
    Returns:
        loss (scalar tensor), nTotal (int count of real tokens)
    """
    # TODO:
    # 1. Count the number of non-padded tokens (mask.sum()).
    # 2. Gather the probability of the correct token for each item in the batch and take its negative log.
    # 3. Select only the masked (real) tokens and take the mean.
    # 4. Move loss to device and return (loss, nTotal).
    # YOUR CODE HERE
    raise NotImplementedError
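
The arithmetic behind the masked loss can be verified on a tiny example. This is a pure-Python sketch with made-up probabilities; `masked_nll` is a hypothetical name, and the function assumes at least one position is unmasked.

```python
import math

def masked_nll(probs, targets, mask):
    """Mean negative log-likelihood over positions where mask is True."""
    losses = [-math.log(row[t]) for row, t, keep in zip(probs, targets, mask) if keep]
    n_total = len(losses)
    return sum(losses) / n_total, n_total

toy_probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05]]
toy_targets = [0, 1, 0]
toy_mask = [True, True, False]  # the third position is padding
toy_loss, toy_n = masked_nll(toy_probs, toy_targets, toy_mask)
print(round(toy_loss, 4), toy_n)  # 0.4581 2
```

The third row is ignored even though the model assigns probability 0.9 to its target, so padding positions neither reward nor penalize the model.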


#### Single Training Iteration

The `train` function performs one forward and backward pass over a single batch. Key techniques:

- **Teacher forcing**: with probability `teacher_forcing_ratio`, the ground-truth token is fed as the next decoder input instead of the model's own prediction. Higher values accelerate early convergence but may hurt generalization.
- **Gradient clipping**: gradients are clipped to a maximum norm of `clip` to prevent exploding gradients, which are common in RNN training.
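
The teacher-forcing decision can be sketched without a model. In the illustration below, `decoder_inputs` is a hypothetical helper and the prediction list is fixed in advance; in real free-running decoding each prediction depends on the previous input, so this only shows which token stream feeds the decoder.

```python
import random

SOS = 1  # same start-of-sequence index used in this document

def decoder_inputs(targets, predictions, teacher_forcing_ratio, rng=random):
    """Return the token fed to the decoder at each step of one sequence."""
    # One coin flip decides the mode for the whole sequence.
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    source = targets if use_teacher_forcing else predictions
    # The first input is always SOS; step t then sees the token from step t-1.
    return [SOS] + list(source[:-1])

toy_targets = [7, 8, 9, 2]       # ground-truth tokens
toy_predictions = [7, 5, 5, 2]   # hypothetical model outputs
print(decoder_inputs(toy_targets, toy_predictions, 1.0))  # [1, 7, 8, 9]
print(decoder_inputs(toy_targets, toy_predictions, 0.0))  # [1, 7, 5, 5]
```

With a ratio of 1.0 the decoder never sees its own mistakes during training, which is the source of the exposure bias discussed later in this assignment.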

In [ ]:
def train(input_variable, lengths, target_variable, mask, max_target_len,
          encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer,
          batch_size, clip, max_length=MAX_LENGTH):
    """Run one mini-batch forward + backward pass.

    Steps:
      1. Zero gradients on both optimizers.
      2. Move tensors to device (keep lengths on CPU for pack_padded_sequence).
      3. Run encoder to get encoder_outputs and encoder_hidden.
      4. Initialize decoder_input with SOS tokens (shape: 1 x batch_size).
      5. Set decoder_hidden from encoder_hidden[:decoder.n_layers].
      6. Decide teacher forcing: if random() < teacher_forcing_ratio use ground truth,
         otherwise use the decoder's own top-1 prediction as the next input.
      7. Loop over max_target_len steps, accumulate maskNLLLoss.
      8. loss.backward(), clip gradients for both encoder and decoder, step optimizers.
    Returns:
      Average loss per real token (float).
    """
    # YOUR CODE HERE
    raise NotImplementedError
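
Norm-based gradient clipping reduces to a single rescaling step. The sketch below works on a flat list of numbers and is only meant to show the idea behind utilities such as `torch.nn.utils.clip_grad_norm_`; `clip_by_global_norm` is a hypothetical name and omits details of the PyTorch implementation.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm == 0.0 or total_norm <= max_norm:
        # Already within the limit: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The direction of the gradient is preserved and only its length changes. In this sketch a `max_norm` of 0 scales every nonzero gradient to zero rather than disabling clipping, which is worth keeping in mind when choosing how to represent "no clipping".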


#### Training Loop

`trainIters` manages the full training loop: it prints the average loss every `print_every` iterations and saves a checkpoint every `save_every` iterations.

In [ ]:
def trainIters(model_name, voc, pairs, encoder, decoder,
               encoder_optimizer, decoder_optimizer, embedding,
               encoder_n_layers, decoder_n_layers, save_dir,
               n_iteration, batch_size, print_every, save_every,
               clip, corpus_name, loadFilename):

    training_batches = [
        batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
        for _ in range(n_iteration)
    ]

    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
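    if loadFilename:
        # Read the checkpoint here so that the saved iteration count is available
        # even when no global `checkpoint` variable has been defined.
        checkpoint = torch.load(loadFilename, map_location=device)
        start_iteration = checkpoint['iteration'] + 1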

    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        loss = train(input_variable, lengths, target_variable, mask, max_target_len,
                     encoder, decoder, embedding,
                     encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(
                iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0

        if iteration % save_every == 0:
            directory = os.path.join(
                save_dir, model_name, corpus_name,
                '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size)
            )
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))

### Define Evaluation

#### Greedy Search Decoder

At inference time we use greedy decoding: at each step, select the token with the highest probability and feed it as input to the next step.

In [ ]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        """Greedily decode up to max_length tokens.

        Steps:
          1. Run self.encoder to get encoder_outputs and encoder_hidden.
          2. Initialize decoder_hidden and decoder_input (SOS token).
          3. At each step: run self.decoder, take the argmax token,
             append it to all_tokens and its score to all_scores.
          4. Feed the chosen token back as the next decoder input.
        Returns:
          (all_tokens, all_scores) — both 1-D tensors of length max_length.
        """
        # YOUR CODE HERE
        raise NotImplementedError


def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    """Convert a normalized sentence string to a list of decoded word strings."""
    # TODO:
    # 1. Convert sentence to index tensor (indexesFromSentence), get lengths.
    # 2. Transpose to (seq_len, 1) for the encoder.
    # 3. Run searcher to get token indices.
    # 4. Map indices back to words via voc.index2word.
    # YOUR CODE HERE
    raise NotImplementedError


def evaluateInput(encoder, decoder, searcher, voc):
    """Interactive loop: read input from stdin, print bot response. Type 'q' to quit."""
    input_sentence = ''
    while True:
        try:
            input_sentence = input('> ')
            if input_sentence == 'q' or input_sentence == 'quit':
                break
            input_sentence = normalizeString(input_sentence)
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))
        except KeyError:
            print("Error: Encountered unknown word.")
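
Greedy decoding can be traced on a toy "model" that is just a lookup table. Everything in this sketch is hypothetical: `toy_model` maps the previous token to a made-up next-token distribution, and unlike the `GreedySearchDecoder` described above, `greedy_decode` stops as soon as it emits EOS.

```python
SOS, EOS = "SOS", "EOS"

# Hypothetical next-token distributions keyed by the previous token.
toy_model = {
    "SOS":   {"hello": 0.6, "hi": 0.3, "EOS": 0.1},
    "hello": {"there": 0.7, "EOS": 0.3},
    "there": {"EOS": 0.9, "hello": 0.1},
}

def greedy_decode(model, max_length):
    """Pick the most probable next token at every step, up to max_length steps."""
    tokens, scores = [], []
    prev = SOS
    for _ in range(max_length):
        dist = model[prev]
        # Greedy choice: the single most probable next token.
        best = max(dist, key=dist.get)
        tokens.append(best)
        scores.append(dist[best])
        if best == EOS:
            break
        prev = best
    return tokens, scores

print(greedy_decode(toy_model, max_length=5))
```

Greedy search commits to the locally best token at every step, so it can miss a sequence whose overall probability is higher but whose first token is not the top choice.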


### Model Initialization & Run Training

Configure the model hyperparameters, build the encoder and decoder, initialize optimizers, and start training.

In [ ]:
# ---- Model hyperparameters ----
model_name = 'cb_model'
attn_model = 'dot'        # attention scoring: 'dot', 'general', or 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# ---- Training hyperparameters ----
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 100
save_every = 500

loadFilename = None  # set to a .tar checkpoint path to resume training

# ---- Build models ----
embedding = nn.Embedding(voc.num_words, hidden_size)
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size,
                               voc.num_words, decoder_n_layers, dropout)
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

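# ---- Optimizers ----
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(),
                               lr=learning_rate * decoder_learning_ratio)

# ---- Optionally resume from a checkpoint ----
if loadFilename:
    # Restore the weights and optimizer state saved by trainIters.
    checkpoint = torch.load(loadFilename, map_location=device)
    encoder.load_state_dict(checkpoint['en'])
    decoder.load_state_dict(checkpoint['de'])
    encoder_optimizer.load_state_dict(checkpoint['en_opt'])
    decoder_optimizer.load_state_dict(checkpoint['de_opt'])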

# ---- Train ----
encoder.train()
decoder.train()
trainIters(model_name, voc, pairs, encoder, decoder,
           encoder_optimizer, decoder_optimizer, embedding,
           encoder_n_layers, decoder_n_layers, save_dir,
           n_iteration, batch_size, print_every, save_every,
           clip, corpus_name, loadFilename)

### Interact with the Chatbot

Switch models to evaluation mode and start a conversation. Type `q` to quit.

In [ ]:
encoder.eval()
decoder.eval()

searcher = GreedySearchDecoder(encoder, decoder)
evaluateInput(encoder, decoder, searcher, voc)

---
## Task 2: Learn Weights & Biases (W&B) — No Points, Required for Tasks 3–5

Before proceeding, watch the [Hyperparameter Sweeps with W&B video tutorial](https://www.youtube.com/watch?v=9zrmUIlScdY) and review the [accompanying Colab notebook](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W%26B.ipynb).

Then install and authenticate the W&B library below.

In [None]:
# wandb was already installed with a version pin in the setup cell above
import wandb
wandb.login()  # Enter your API key when prompted — sign up free at https://wandb.ai/site

---
## Task 3 [5 points]: Create a W&B Sweep Configuration

Define a sweep configuration using the **W&B Random Search** strategy over the following hyperparameters:

| Hyperparameter | Search values |
|---|---|
| `learning_rate` | 0.0001, 0.00025, 0.0005, 0.001 |
| `optimizer` | adam, sgd |
| `clip` | 0, 25, 50, 100 |
| `teacher_forcing_ratio` | 0, 0.5, 1.0 |
| `decoder_learning_ratio` | 1.0, 3.0, 5.0, 10.0 |

The sweep should **minimize** the metric `train_loss`.

**Hints:**
- Use `method: "random"` and specify each hyperparameter under `parameters` with a `values` list.
- Instrument your training loop with `wandb.init(config=...)` and `wandb.log({"train_loss": ...})`.
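- Register your sweep with `sweep_id = wandb.sweep(sweep_config, project="chatbot-sweep")`.
- Treat `clip = 0` as "clipping disabled" and skip the clipping call in that case: `torch.nn.utils.clip_grad_norm_` with `max_norm=0` scales every gradient to zero, so the model would not train.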

In [ ]:
# TODO: Define sweep_config using W&B random search over the hyperparameters listed above.
# Then register the sweep:
#   sweep_id = wandb.sweep(sweep_config, project='chatbot-sweep')
# YOUR CODE HERE
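
To get a feel for how much of the search space ten random trials cover, the values from the table above can be enumerated locally. This sketch does not use W&B and is not how the W&B agent is implemented; `search_space` and `sample_config` are hypothetical names, and the sampler simply draws each value independently and uniformly.

```python
import math
import random

# The search values from the table above, as a plain dictionary.
search_space = {
    "learning_rate": [0.0001, 0.00025, 0.0005, 0.001],
    "optimizer": ["adam", "sgd"],
    "clip": [0, 25, 50, 100],
    "teacher_forcing_ratio": [0, 0.5, 1.0],
    "decoder_learning_ratio": [1.0, 3.0, 5.0, 10.0],
}

# Number of distinct configurations in the full grid.
n_configs = math.prod(len(v) for v in search_space.values())
print(n_configs)  # 384

def sample_config(space, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)
for _ in range(3):
    print(sample_config(search_space, rng))
```

With 384 possible combinations, ten trials sample under 3% of the grid, which is why random search is preferred here over an exhaustive grid search.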


### Instrument the Training Loop for W&B

Write a `train_sweep()` function that:
1. Calls `wandb.init()` to start a new run
2. Reads hyperparameter values from `wandb.config` (e.g. `wandb.config.learning_rate`)
3. Builds the models and optimizers using those values
4. Calls your training loop and logs `train_loss` every `print_every` iterations with `wandb.log()`

In [ ]:
def train_sweep():
    """Single sweep run invoked by the W&B agent.

    Steps:
      1. Call wandb.init() (use as a context manager).
      2. Read hyperparameters from wandb.config
         (learning_rate, optimizer, clip, teacher_forcing_ratio, decoder_learning_ratio).
      3. Build fresh encoder/decoder and optimizers using those values.
      4. Run the training loop; after every print_every iterations call
         wandb.log({'train_loss': avg_loss, 'iteration': iteration}).
    """
    # YOUR CODE HERE
    raise NotImplementedError


---
## Task 4 [5 points]: Run Hyperparameter Sweeps on GPU Colab

Launch a W&B sweep agent on a **GPU-enabled** Colab runtime (Runtime > Change runtime type > T4 GPU). The agent will automatically sample configurations and execute `train_sweep()` for each one.

- Run at least **10 sweep trials** to cover a meaningful portion of the search space.
- Monitor results live in the [W&B console](https://wandb.ai).
- Paste your W&B project URL or a screenshot of the sweep results dashboard below.

In [ ]:
# TODO: Launch the W&B sweep agent.
# Run at least 10 trials on a GPU-enabled Colab runtime.
#   wandb.agent(sweep_id, function=train_sweep, count=10)
# YOUR CODE HERE


**W&B Project Link / Screenshot:**

*Paste your W&B sweep URL here (e.g. https://wandb.ai/\<username\>/chatbot-sweep/sweeps/\<sweep-id\>) or embed a screenshot of the parallel coordinates / loss curves.*

---
## Task 5 [10 points]: Analysis of Best Hyperparameters & Feature Importance

After completing your sweeps, answer all four questions below.

### 5a. Best Configuration
Report the hyperparameter values that achieved the **lowest `train_loss`** across all sweep runs.

In [ ]:
# (Optional) Retrieve the best run programmatically via the W&B API:
#
# api = wandb.Api()
# runs = api.runs("<your-entity>/chatbot-sweep")
# best_run = min(runs, key=lambda r: r.summary.get("train_loss", float("inf")))
# print("Best config:", best_run.config)
# print("Best loss: ", best_run.summary["train_loss"])

**Best hyperparameter configuration:**

| Hyperparameter | Best Value |
|---|---|
| `learning_rate` | *(your answer)* |
| `optimizer` | *(your answer)* |
| `clip` | *(your answer)* |
| `teacher_forcing_ratio` | *(your answer)* |
| `decoder_learning_ratio` | *(your answer)* |
| **Best `train_loss`** | *(your answer)* |

### 5b. Feature Importance

Use the **W&B feature importance panel** (available in the sweep UI under "Parameter Importance") to identify which hyperparameters had the greatest and least impact on `train_loss`.

*Paste a screenshot of the feature importance chart here and list the top-3 most important hyperparameters.*

### 5c. Convergence Analysis

Explain, in your own words, **why** the top hyperparameters from 5b affect model convergence. Address each of the following:

- **Learning rate** — how does its magnitude affect gradient update steps and the risk of overshooting minima?
- **Optimizer choice (Adam vs SGD)** — how do adaptive vs. fixed learning rates influence training on sparse/noisy sequence data?
- **Gradient clipping (`clip`)** — why does clipping stabilize RNN training, and what happens when `clip=0` (no clipping) or `clip=100` (very loose)?
- **Teacher forcing ratio** — how does the tradeoff between training with ground-truth vs. predicted tokens affect convergence speed and exposure bias?
- **Decoder learning ratio** — why might the decoder benefit from a different learning rate than the encoder?

*Write your analysis here (aim for 200–400 words).*

### 5d. Chatbot Quality

Load the best checkpoint and interact with the chatbot. Report **at least 5 example exchanges** and briefly comment on the quality of the responses.

```
> <your input>
Bot: <model output>

> <your input>
Bot: <model output>
```

*Replace the template above with your actual exchanges.*