# HW3

In this homework, we'll learn about transformers and chatbots.

It will probably be easiest to run this on http://colab.research.google.com

## minGPT Character Language Model

First, we'll inspect Karpathy's [minGPT](https://github.com/karpathy/minGPT/tree/master) library to learn more about transformers.

We'll first fit a character language model using minGPT. As training data, we'll use a corpus of Shakespeare's text (the tinyshakespeare dataset).
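To make "character language model" concrete, here is a minimal sketch of how text can be turned into training examples: each character is mapped to an integer, and the target sequence is the input sequence shifted one position to the right. It mirrors the idea behind the `CharDataset` class defined later in this notebook, but the toy `text`, the `block_size` value, and the `make_example` helper are made up for illustration.

```python
# Toy illustration only: `text` and `block_size` are made-up values.
text = "to be or not to be"
block_size = 4

# Build the vocabulary: one integer id per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def make_example(idx):
    """Return (inputs, targets) for the window of block_size + 1 characters starting at idx."""
    chunk = [stoi[ch] for ch in text[idx:idx + block_size + 1]]
    # the target at each position is the character that follows the input character
    return chunk[:-1], chunk[1:]

x, y = make_example(0)
print([itos[i] for i in x])
print([itos[i] for i in y])
```

At every position the model is asked to predict the next character, so a single window of `block_size + 1` characters yields `block_size` prediction targets.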

In [None]:
# clone the library
!git clone https://github.com/karpathy/minGPT.git

In [None]:
# Add mingpt to your Python path, so you can import it.
import sys
sys.path.insert(0, './minGPT')
from mingpt.model import GPT
from mingpt.trainer import Trainer
from mingpt.utils import set_seed
import pandas as pd
import pickle
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
set_seed(3407)

In [None]:
# download shakespeare data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

### Data loading and training code

In [None]:
from mingpt.utils import set_seed, setup_logging, CfgNode as CN
import os
import sys

class CharDataset(Dataset):
    """
    This represents a dataset of characters.
    """
    @staticmethod
    def get_default_config():
        C = CN()
        C.block_size = 128
        return C

    def __init__(self, config, data):
        self.config = config
        self.parse_data(data)

    def parse_data(self, data):
        print('parsing char data')
        # get list of all characters
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        # map from char to int
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        # map from int to char
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.vocab_size = vocab_size
        self.data = data

    def get_vocab_size(self):
        return self.vocab_size

    def get_block_size(self):
        return self.config.block_size

    def __len__(self):
        return len(self.data) - self.config.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.config.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        # return as tensors
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

def get_config():

    C = CN()

    # system
    C.system = CN()
    C.system.seed = 3407
    C.system.work_dir = './out'

    # data
    C.data = CharDataset.get_default_config()

    # model
    C.model = GPT.get_default_config()
    C.model.model_type = 'gpt-micro'

    # trainer
    C.trainer = Trainer.get_default_config()
    C.trainer.learning_rate = 5e-4 # the model we're using is so small that we can go a bit faster

    return C


def train_model(config, train_dataset, sample_fn):
    """
    Train the model.
    config..........CfgNode
    train_dataset...Dataset that emits (input, target) tensor pairs for training
    sample_fn.......function to call during training to show sample output.
    """
    # construct the model
    config.model.vocab_size = train_dataset.get_vocab_size()
    config.model.block_size = train_dataset.get_block_size()
    model = GPT(config.model)

    # construct the trainer object
    trainer = Trainer(config.trainer, model, train_dataset)

    # iteration callback
    def batch_end_callback(trainer):

        if trainer.iter_num % 10 == 0:
            print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

        if trainer.iter_num % 500 == 0:
            # switch to eval mode and generate a sample to monitor progress
            model.eval()
            with torch.no_grad():
                # sample from the model...
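                # seed with the first vocabulary token, wrapped in a list so that
                # multi-character (word) tokens are not split into characters
                context = [list(train_dataset.itos.values())[0]]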
                completion = sample_fn(context, model, trainer, train_dataset, maxlen=100, temperature=1.)
                print('sample from the model:')
                print(completion)
            # save the latest model
            print("saving model")
            ckpt_path = os.path.join(config.system.work_dir, "model.pt")
            torch.save(model.state_dict(), ckpt_path)
            # revert model to training mode
            model.train()

    trainer.set_callback('on_batch_end', batch_end_callback)

    # run the optimization
    trainer.run()
    model.eval()
    return model, trainer

def configure_model(max_iters=100, block_size=128):
    config = get_config()
    config.merge_from_args(['--trainer.max_iters=%d' % max_iters,
                            '--data.block_size=%d' % block_size,
                            '--model.block_size=%d' % block_size])
    setup_logging(config)
    set_seed(config.system.seed)
    return config


def create_char_data(config):
    # construct the training dataset
    with open('input.txt', 'r') as f:
        text = f.read()
    return CharDataset(config.data, text)

def sample_from_char_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ''.join([train_dataset.itos[int(i)] for i in y])

In [None]:
# train the character model.
config = configure_model(max_iters=100, block_size=64)
train_dataset = create_char_data(config)
model, trainer = train_model(config, train_dataset, sample_from_char_model)

In [None]:
print(sample_from_char_model("Romeo:", model, trainer, train_dataset, maxlen=10, temperature=1))

**What is the `block_size` variable? Describe in detail what it does.**

You might want to consult the code for [model.py](https://github.com/karpathy/minGPT/blob/master/mingpt/model.py).



**YOUR ANSWER**

**What is the relationship between `block_size` and the total number of parameters in the model?** That is, if we double `block_size`, what happens to the total number of model parameters?
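One way to investigate this empirically is to count parameters directly. The sketch below assumes only standard PyTorch; `count_parameters` is a hypothetical helper name, and the small linear layer is a stand-in used for demonstration rather than the GPT model itself.

```python
import torch

def count_parameters(module):
    """Total number of scalar values across all parameters of a PyTorch module."""
    return sum(p.numel() for p in module.parameters())

# Stand-in module for demonstration: a 3 -> 2 linear layer has a 2x3 weight plus 2 biases.
layer = torch.nn.Linear(3, 2)
print(count_parameters(layer))
```

The same helper could be applied to the `model` returned by `train_model` after calling `configure_model` with different `block_size` values, so that the counts can be compared.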

**YOUR ANSWER**

**What is the `n_layer` parameter? Describe in detail what it does. If we double this parameter, what happens to the total number of model parameters?**

**YOUR ANSWER**

**What does the temperature parameter do?** See the generate method in [model.py](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L283).

Try setting temperature to different values. What do you observe about the output?
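For experimenting outside the model, here is a minimal pure-Python sketch of dividing logits by a temperature before applying softmax, which is the operation the linked `generate` method performs on the final-position logits. The `softmax_with_temperature` helper and the example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    # subtract the max before exponentiating for numerical stability
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Comparing the printed distributions across temperatures may help when interpreting the samples produced by `sample_from_char_model`.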

**YOUR ANSWER**

**What does [line 148](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L148) in model.py do? How does this relate to the transformer model?**  

**YOUR ANSWER**

## Word Model
Now, let's fit a word model instead of a character model.

Given a string like:

> The cow     jumped over the moon. The moon is full tonight!

The `WordDataset` class below should create one token for each whitespace-delimited word, with punctuation marks split off as separate tokens:

> ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.', 'The', 'moon', 'is', 'full', 'tonight', '!']

Note that multiple consecutive space characters are treated as one. (Hint: `re` may help here.)

Using the `CharDataset.parse_data` method above as an example, complete the `parse_data` method below to set the `stoi`, `itos`, `vocab_size`, and `data` attributes of the `WordDataset` class.
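As a starting point for the tokenization step only, the sketch below shows one possible regular expression that reproduces the token list above. The `tokenize` helper name and the pattern are assumptions rather than a required solution; building `stoi`, `itos`, `vocab_size`, and `data` from the tokens is still left to `parse_data`.

```python
import re

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one character that is
    # neither a word character nor whitespace. Whitespace itself is never matched,
    # so runs of spaces simply disappear.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The cow     jumped over the moon. The moon is full tonight!")
print(tokens)
```

Note that this pattern also splits apostrophes and hyphens into separate tokens, which may or may not be the behavior you want on real text.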

In [None]:
import re

class WordDataset(CharDataset):
  def parse_data(self, data):
    """
    data.....A single string representing many sentences.
    """
    ### YOUR CODE HERE
    pass

word_config = configure_model(max_iters=200, block_size=4)
word_data = WordDataset(word_config.data, 'The cow jumped over the moon. The moon is full tonight!')
word_data.data

In [None]:
# we can now reuse the training code to fit the word language model.
def sample_from_word_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ' '.join([train_dataset.itos[int(i)] for i in y])

word_model, word_trainer = train_model(word_config, word_data, sample_from_word_model)

In [None]:
sample_from_word_model(["The"], word_model, word_trainer, word_data, maxlen=50, temperature=1.)

### Wikipedia

With our word model, let's now fit a language model on the Wikipedia page for [New Orleans](https://en.wikipedia.org/wiki/New_Orleans).

First, we'll install a library to help us fetch the plain text of a Wikipedia page.

In [None]:
!pip install wikipedia

In [None]:
import wikipedia
wikipedia.set_lang('en')
page = wikipedia.page('New Orleans')
print(page.content[:100])

**Create new variables `word_config`, `word_data`, `word_model`, `word_trainer` that are analogous to the ones used previously. These should fit a model to the `page` text defined in the previous cell.**

In [None]:
### YOUR CODE HERE


In [None]:
sample_from_word_model(["A", "local", "variant", "for", "hip", "hop", "is"], word_model, word_trainer, word_data, maxlen=200, temperature=1.)

Investigate different model settings (`block_size, max_iters, learning_rate, n_embd, n_layer`).

**What effects do you notice from trying different values? Which settings appear to produce the best generated text?**


**YOUR ANSWER**



Suppose you wanted to take the word model trained on the New Orleans Wikipedia page and use supervised fine-tuning to create a chatbot that answers questions about New Orleans.

**What type of additional training data would you need to do this?**

Provide example data below.

**YOUR ANSWER**




**If this new data contains words that don't appear in the New Orleans Wikipedia page, what will happen? How can you fix this?**

**YOUR ANSWER**