# Homework and bakeoff: Compositional generalization

In [83]:
__author__ = "Christopher Potts and Zhengxuan Wu"
__version__ = "CS224u, Stanford, Spring 2023"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/master/hw_recogs.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/master/hw_recogs.ipynb)

If you open Colab with this badge, please save a copy to Drive (from the File menu) before running the notebook.

## Overview

This assignment is about _compositional generalization_. We are going to assess the degree to which our apparently very good models have learned to process and interpret language _systematically_. To do this, we are going to ask them to interpret novel combinations of familiar words and phrases. For humans, these tasks are very easy. For our models, the situation seems to be quite different.

The basis for the assignment is the ReCOGS dataset of [Wu, Manning, and Potts 2023](https://arxiv.org/abs/2303.13716). ReCOGS modifies the COGS dataset of [Kim and Linzen 2020](https://aclanthology.org/2020.emnlp-main.731) in a number of ways, with the goal of more directly assessing the interpretive abilities of models.

The assignment questions are fairly diverse. Question 1 asks you to conduct a specific analysis of the ReCOGS dataset, and Question 2 follows this up with a corresponding analysis of the errors made by a top-performing ReCOGS model. For Question 3, you try some in-context learning with DSP. And then we open things up as usual: you can do anything you want for your original system, and you enter that system's predictions into a bakeoff.

There is only one rule that we need to enforce throughout this work:

__You cannot train your system on any examples from `dataset["gen"]`, nor can the output representations from those examples be included in any prompts used for in-context learning.__

The nature of your original system is otherwise unconstrained.

## Set-up

In [84]:
try:
    # This library is our indicator that the required installs
    # need to be done.
    import datasets
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    import sys
    sys.path.append("cs224u")

In [85]:
import os
import pandas as pd
from compgen import recogs_exact_match

In [86]:
# Show the full contents of each pandas column (no truncation):
pd.set_option('display.max_colwidth', None)

The default location of the data:

In [87]:
SRC_DIRNAME = os.path.join("data", "recogs")

The following code should grab the dataset for you; if it fails for any reason, you can manually download it from [this link](https://web.stanford.edu/class/cs224u/data/recogs.tgz) and then put it in `SRC_DIRNAME`.

In [88]:
if not os.path.exists(SRC_DIRNAME):
    !mkdir -p data
    !wget https://web.stanford.edu/class/cs224u/data/recogs.tgz -P data
    !tar xvf data/recogs.tgz -C data/

## Load the COGS and ReCOGS datasets

In [89]:
def load_split(filename):
    return pd.read_csv(
        filename,
        delimiter="\t",
        names=['input', 'output', 'category'])

In [90]:
dataset = {}

for splitname in ("train", "dev", "gen", "test"):
    dataset[splitname] = load_split(f"{SRC_DIRNAME}/{splitname}.tsv")

Here's a look at the dataset. Fundamentally, the task is to map simple English sentences to logical forms. For ReCOGS, you need only predict these forms up to semantic equivalence, which means that we abstract away from the order of the conjuncts and the names of specific variables.

In [91]:
dataset['train'].head(2)

Unnamed: 0,input,output,category
0,A rose was helped by a dog .,"rose ( 1 ) ; dog ( 6 ) ; help ( 3 ) AND theme ( 3 , 1 ) AND agent ( 3 , 6 )",in_distribution
1,The sailor dusted a boy .,"* sailor ( 1 ) ; boy ( 4 ) ; dust ( 2 ) AND agent ( 2 , 1 ) AND theme ( 2 , 4 )",in_distribution


In [92]:
# assert len(dataset['train']) == 135547, "V2"
# assert len(dataset['train']) == 135546, "V1"
assert len(dataset['train']) == 27227, "Positional Index"

The `dataset['gen']` section is divided up into 21 different categories. A category name `X_to_Y` or `only_seen_as_X_as_Y` means that specific phrases were seen only as `X` in training and that models will encounter those phrases as `Y` at test time.

In [93]:
sorted(dataset['gen'].category.unique())

['active_to_passive',
 'cp_recursion',
 'do_dative_to_pp_dative',
 'obj_omitted_transitive_to_transitive',
 'obj_pp_to_subj_pp',
 'obj_to_subj_common',
 'obj_to_subj_proper',
 'only_seen_as_transitive_subj_as_unacc_subj',
 'only_seen_as_unacc_subj_as_obj_omitted_transitive_subj',
 'only_seen_as_unacc_subj_as_unerg_subj',
 'passive_to_active',
 'pp_dative_to_do_dative',
 'pp_recursion',
 'prim_to_inf_arg',
 'prim_to_obj_common',
 'prim_to_obj_proper',
 'prim_to_subj_common',
 'prim_to_subj_proper',
 'subj_to_obj_common',
 'subj_to_obj_proper',
 'unacc_to_transitive']
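
To make the naming convention concrete, here is a minimal sketch that splits a category name into its "seen as" and "tested as" parts. The helper `split_category` is hypothetical (it is not part of the course code), and it simply returns `None` for names such as `cp_recursion` that follow neither pattern.

```python
def split_category(name):
    """Split a generalization category name into (seen_as, tested_as).

    Hypothetical helper for illustration only. Returns None for names
    that follow neither the `X_to_Y` nor the `only_seen_as_X_as_Y`
    pattern (e.g., the recursion categories).
    """
    prefix = "only_seen_as_"
    if name.startswith(prefix):
        # `only_seen_as_X_as_Y`: split the remainder on the first "_as_".
        seen_as, sep, tested_as = name[len(prefix):].partition("_as_")
    else:
        # `X_to_Y`: split on the first "_to_".
        seen_as, sep, tested_as = name.partition("_to_")
    if not sep:
        return None
    return seen_as, tested_as


for cat in ["active_to_passive",
            "only_seen_as_unacc_subj_as_unerg_subj",
            "cp_recursion"]:
    print(cat, "->", split_category(cat))
```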

## Question 1: Proper names and their semantic roles

A number of the COGS/ReCOGS generalization categories assess models on their ability to handle proper names appearing in novel positions at test time. For example, in the `obj_to_subj_proper` category, models encounter proper names that appeared in the train set only in grammatical object position (e.g., _see Sandy_), and then they are asked to make predictions about cases where those names are grammatical subjects (_Sandy left_). These changes have systematic effects on the semantic roles that these names play. In particular, subjects are likely to play `agent` roles and objects are likely to play `theme` roles.

### Task 1: Pattern-based analysis function [1 point]

Write a function that scans ReCOGS logical forms to determine what role proper names play. The following are the core steps:

1. Identify proper names. All and only proper names begin with capital letters in these LFs, and proper names consist only of ascii letters. The format is, informally, `Name ( d+ )`, as in `Sandy ( 47 )`.

2. Identify role expressions. The pattern is always `role ( d+ , d+ )`, as in `agent ( 1 , 47 )`. Here, the first variable is for the associated event, and the second is the role argument. The possible roles are `agent`, `theme`, and `recipient`. (The dataset includes other roles, but these involve events, not people.)

3. Determine which of the above are linked in the sense that the variable names are the same. A given name can link to multiple role expressions (or none at all), and LFs can contain multiple names and multiple role expressions.

To do the above, you just need to complete the function `get_propername_role`. The test should clear up any ambiguity and help you iterate to a solution.

In [94]:
import re

def get_propername_role(s):
    """Extract from `s` all the pairs `(name, role)` determined by
    binding relationships. There can be multiple tokens of the same
    name with different variables, as in "Kim ( 1 )" and "Kim ( 47 )",
    and there can be instances in which a single name with variable
    like "Kim ( 1 )" binds into multiple role expressions like
    "agent ( 4 , 1 )" and "theme ( 6 , 1 )". Your function should
    cover all these cases.

    We've suggested a particular program design to get you started,
    but you are free to do something different and perhaps cleverer
    if you wish!

    Parameters
    ----------
    s: str

    Returns
    -------
    set of tuples `(name, role)` where `name` and `role` are str
    """
    # Step 1: Define a regex for "name ( var )" expressions:
    ##### YOUR CODE HERE
    name_re = r"([A-Z]\w+)\s+\(\s+(\d+)\s+\)"

    # Step 2: Define a regex for "role ( var , var )" expressions:
    ##### YOUR CODE HERE
    role_re = r"(\w+)\s+\(\s+(\d+)\s*,\s+(\d+)\s+\)"

    # Step 3: Use `findall` with both of your regexes:
    ##### YOUR CODE HERE
    name_matches = re.findall(name_re, s)
    role_matches = re.findall(role_re, s)

    # Step 4: Loop over all combinations of matches from your regexes
    # to build `data` as a set of pairs `(name, role)`:
    data = set()
    ##### YOUR CODE HERE
    # Map each variable to the proper name that introduces it:
    var2name = {var: name for name, var in name_matches}
    # A name is linked to a role when the role's second variable (the
    # role argument) is the name's variable:
    for role, _event_var, arg_var in role_matches:
        if arg_var in var2name:
            data.add((var2name[arg_var], role))

    # Step 5: Return `data`:
    ##### YOUR CODE HERE
    return data




In [95]:
def test_get_propername_role(func):
    examples = [
        # Standard case:
        (
            "Bella ( 7 ) ; smile ( 4 ) AND agent ( 4 , 7 )",
            {("Bella", "agent")}
        ),
        # No binding:
        (
            "Riley ( 37 ) ; theme ( 4 , 7 )",
            set()
        ),
        # Two tokens of the same name referring to different entities:
        (
            "Riley ( 37 ) ; Riley ( 4 ) ; theme ( 1 , 37 ) AND agent ( 1 , 4 )",
            {("Riley", "theme"), ("Riley", "agent")},
        ),
        # Two names:
        (
            "Riley ( 4 ) ; Emma ( 243 ) ; recipient ( 6 , 4 ) AND agent ( 6 , 243 )",
            {("Riley", "recipient"), ("Emma", "agent")},
        ),
        # One name binding into multiple role expressions:
        (
            "Riley ( 4 ) ; agent ( 6 , 4 ) AND theme ( 6 , 4 )",
            {("Riley", "theme"), ("Riley", "agent")},
        ),
        # Nothing to match:
        (
            "no proper names",
            set()
        )
    ]
    errcount = 0
    for ex, expected in examples:
        result = func(ex)
        if expected != result:
            errcount += 1
            print(f"Error for `{func.__name__}`:"
                  f"\n\tInput: {ex}"
                  f"\n\tExpected: {expected}"
                  f"\n\tGot: {result}")
    if errcount == 0:
        print(f"No errors detected for `{func.__name__}`")

In [96]:
test_get_propername_role(get_propername_role)

No errors detected for `get_propername_role`


### Task 2: Finding challenging names [1 point]

You can now use your code to find the names that will be the most challenging because their train/gen roles are disjoint. To do this, you just need to complete the function `find_name_roles`:

In [97]:
from collections import defaultdict

def find_name_roles(split_df, colname="output"):
    """Create a map from names to dicts mapping roles to counts: the
    number of times the name appears with that role in `split_df`.

    Parameters
    ----------
    split_df : pd.DataFrame
        Needs to have a column called `colname`.
    colname: str
        Column to target with `get_propername_role`. Default: "output".

    Returns
    -------
    `defaultdict` mapping names to roles to counts
    """
    # This is a convenient way to create a multidimensional count dict:
    # You can access it out of the box as `all_roles[key1][key2] += 1`.
    all_roles = defaultdict(lambda : defaultdict(int))

    # Apply `get_propername_role` to every value in the target column
    # and aggregate the results into `all_roles`:
    ##### YOUR CODE HERE
    for s in split_df[colname]:
        for name, role in get_propername_role(s):
            all_roles[name][role] += 1



    # Return `all_roles`:
    return all_roles

A quick test:

In [98]:
def test_find_name_roles(func):
    df = pd.DataFrame({
        "tester": [
            "Bella ( 7 ) ; agent ( 4 , 7 )",
            "Bella ( 7 ) ; agent ( 4 , 7 )",
            "Riley ( 37 ) ; agent ( 4 , 37 )",
            "Riley ( 3 ) ; theme ( 4 , 3 )",
            "Emma ( 37 ) ; theme ( 4 , 7 )"
        ]})
    expected = {
        "Bella": {"agent": 2},
        "Riley": {"agent": 1, "theme": 1}
    }
    result = func(df, colname="tester")
    if result != expected:
        print(f"Error for `{func.__name__}`:"
              f"\n\tExpected:{expected}"
              f"\n\tGot: {result}")
    else:
        print(f"No errors found for `{func.__name__}`")

In [99]:
test_find_name_roles(find_name_roles)

No errors found for `find_name_roles`


Once the test passes, this analysis should be informative:

In [100]:
train_roles = find_name_roles(dataset['train'])

sorted(train_roles.items(), key=lambda x: len(x[1]))[: 3]

[('Charlie', defaultdict(int, {'theme': 2})),
 ('Lina', defaultdict(int, {'agent': 1})),
 ('Jayden', defaultdict(int, {'agent': 34, 'recipient': 14}))]

In [101]:
dev_roles = find_name_roles(dataset['dev'])

sorted(dev_roles.items(), key=lambda x: len(x[1]))[: 3]

[('Ryan', defaultdict(int, {'agent': 6})),
 ('Grayson', defaultdict(int, {'agent': 5})),
 ('Lincoln', defaultdict(int, {'recipient': 5}))]

In [102]:
test_roles = find_name_roles(dataset['test'])

sorted(test_roles.items(), key=lambda x: len(x[1]))[: 3]

[('Skylar', defaultdict(int, {'agent': 5})),
 ('Christopher', defaultdict(int, {'agent': 2})),
 ('Joshua', defaultdict(int, {'agent': 4}))]

In [103]:
gen_roles = find_name_roles(dataset["gen"])

sorted(gen_roles.items(), key=lambda x: len(x[1]))[: 3]

[('Charlie', defaultdict(int, {'agent': 1000})),
 ('Lina', defaultdict(int, {'theme': 1000})),
 ('Paula', defaultdict(int, {'agent': 1000, 'theme': 1000}))]

We will return to these troublemakers in a bit.
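
To pick out such names programmatically, one option is to compare the role sets per name across the two splits. The sketch below is only an illustration: `disjoint_role_names` is a hypothetical helper, and the toy dictionaries stand in for `train_roles` and `gen_roles`. The counts for Charlie and Lina are copied from the outputs above, while `SomeName` and its counts are made up.

```python
def disjoint_role_names(train_roles, gen_roles):
    """Return the sorted names that occur in both maps but share no
    role between them. Hypothetical helper for illustration only."""
    names = []
    for name, gen_counts in gen_roles.items():
        # `.get` avoids creating new keys if the input is a defaultdict.
        train_counts = train_roles.get(name)
        if train_counts and not set(train_counts) & set(gen_counts):
            names.append(name)
    return sorted(names)


# Toy stand-ins for `train_roles` and `gen_roles`. "SomeName" is an
# invented entry that shares the `agent` role across the two splits.
toy_train = {
    "Charlie": {"theme": 2},
    "Lina": {"agent": 1},
    "SomeName": {"agent": 5, "theme": 3}}
toy_gen = {
    "Charlie": {"agent": 1000},
    "Lina": {"theme": 1000},
    "SomeName": {"agent": 10}}

print(disjoint_role_names(toy_train, toy_gen))
```

The same function should work directly on the `defaultdict` objects returned by `find_name_roles`, since it only reads keys.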

## Pretrained ReCOGS models

We launch now into an extended interlude before Question 2. For Question 2, you will work with a ReCOGS model that we trained for you. This interlude presents the code needed to work with this model. We are exposing these details to you in case you want to use this code to train or fine-tune your own models for your original system.

### Tokenization

Here is a function for creating Hugging Face `PreTrainedTokenizerFast` tokenizers based on a provided vocab file. It pretty much just splits on whitespace and adds special tokens. Chris originally planned to have writing this be a homework question, but it turned out to be very difficult and confusing for him to write, so he decided to just present it to you in the hope that it helps you with similar tasks in the future.

In [104]:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast


def get_tokenizer(vocab_filename):
    with open(vocab_filename) as f:
        vocab = f.read().splitlines()
    vocab_size = len(vocab)
    vocab = dict(zip(vocab, list(range(vocab_size))))
    tok = Tokenizer(WordLevel(vocab, unk_token='[UNK]'))
    # This definitely needs to be done here and in the construction of
    # `PreTrainedTokenizerFast`. Don't be tempted to "clean this up"!
    tok.add_special_tokens(["[BOS]", "[UNK]", "[PAD]", "[EOS]"])
    tok.pre_tokenizer = WhitespaceSplit()
    tok.post_processor = TemplateProcessing(
        single="[BOS]:0 $A:0 [EOS]:0",
        special_tokens=[
            ("[BOS]", tok.token_to_id("[BOS]")),
            ("[EOS]", tok.token_to_id("[EOS]"))])
    return PreTrainedTokenizerFast(
        tokenizer_object=tok,
        bos_token="[BOS]",
        unk_token="[UNK]",
        pad_token="[PAD]",
        eos_token="[EOS]",
        # This is vital; otherwise any periods will have their leading
        # spaces removed, which is wrong for COGS/ReCOGS.
        clean_up_tokenization_spaces=False)

We will have separate tokenizers for the encoder and the decoder:

In [105]:
enc_tokenizer = get_tokenizer(os.path.join(SRC_DIRNAME, "src_vocab.txt"))

In [106]:
enc_tokenizer.tokenize(
    "A sailor was helped", 
    add_special_tokens=True)

['[BOS]', 'A', 'sailor', 'was', 'helped', '[EOS]']

In [107]:
dec_tokenizer = get_tokenizer(os.path.join(SRC_DIRNAME, "tgt_vocab.txt"))

In [108]:
dec_tokenizer.tokenize(
    "sailor ( 53 ) ; help ( 7 ) AND theme ( 7 , 53 )", 
    add_special_tokens=True)

['[BOS]',
 'sailor',
 '(',
 '53',
 ')',
 ';',
 'help',
 '(',
 '7',
 ')',
 'AND',
 'theme',
 '(',
 '7',
 ',',
 '53',
 ')',
 '[EOS]']
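
As a rough mental model of what these tokenizers do, the following pure-Python sketch splits on whitespace, wraps the result in `[BOS]`/`[EOS]`, and maps unknown tokens to `[UNK]`. It is not the Hugging Face implementation, `toy_encode` is a hypothetical helper, and the ids in `toy_vocab` are invented rather than taken from `src_vocab.txt`.

```python
def toy_encode(text, vocab, bos="[BOS]", eos="[EOS]", unk="[UNK]"):
    """Whitespace-split `text`, add special tokens, and look up ids.

    Hypothetical helper that only approximates the behavior of the
    tokenizers built by `get_tokenizer`.
    """
    tokens = [bos] + text.split() + [eos]
    # Out-of-vocabulary tokens fall back to the id of `unk`.
    ids = [vocab.get(tok, vocab[unk]) for tok in tokens]
    return tokens, ids


# Invented vocabulary for illustration; the real ids come from the
# vocab files in `SRC_DIRNAME`.
toy_vocab = {
    "[PAD]": 0, "[BOS]": 1, "[EOS]": 2, "[UNK]": 3,
    "A": 4, "sailor": 5, "was": 6, "helped": 7}

tokens, ids = toy_encode("A sailor was helped", toy_vocab)
print(tokens)
print(ids)
```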

### Dataset

Next is a dataset utility. Chris was originally going to have you write this yourselves, since it is useful to know how to write these utilities, and the task is really just to use our tokenizers appropriately. However, since `collate_fn` has to be a static method with fixed arguments, we can't easily pass in these tokenizers to it! As a result, we have to do all the tokenization at once ahead of time and then redo all the masking work for each batch. So Chris did this for you in the hope that this will be useful to you in the future.

In [109]:
import torch

class RecogsDataset(torch.utils.data.Dataset):
    def __init__(self, enc_tokenizer, dec_tokenizer, X, y=None):
        self.X = [enc_tokenizer.encode(s) for s in X]
        self.y = y
        if y is not None:
            self.y = [dec_tokenizer.encode(s) for s in y]

    @staticmethod
    def collate_fn(batch):
        """Unfortunately, we can't pass the tokenizer in as an argument
        to this method, since it is a static method, so we need to do
        the work of creating the necessary attention masks."""
        def get_pad_and_mask(vals):
            lens = [len(i) for i in vals]
            maxlen = max(lens)
            pad = []
            mask = []
            for ex, length in zip(vals, lens):
                diff = maxlen - length
                pad.append(ex + ([0] * diff))
                mask.append(([1] * length) + ([0] * diff))
            return torch.tensor(pad), torch.tensor(mask)
        batch_elements = list(zip(*batch))
        X = batch_elements[0]
        X_pad, X_mask = get_pad_and_mask(X)
        if len(batch_elements) == 1:
            return X_pad, X_mask
        else:
            y = batch_elements[1]
            y_pad, y_mask = get_pad_and_mask(y)
            # Repeat `y_pad` because our optimizer expects to find
            # labels in final position. These will not be used because
            # Hugging Face will calculate the loss for us.
            return X_pad, X_mask, y_pad, y_mask, y_pad

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if self.y is None:
            return (self.X[idx],)
        else:
            return (self.X[idx], self.y[idx])

The following just illustrates how to work with the above utility:

In [110]:
ex_dataset = RecogsDataset(
    enc_tokenizer,
    dec_tokenizer,
    dataset['train'].input.head(20),
    y=dataset['train'].output.head(20))

In [111]:
ex_dataloader = torch.utils.data.DataLoader(
    ex_dataset,
    batch_size=2,
    shuffle=True,
    pin_memory=True,
    collate_fn=ex_dataset.collate_fn)

In [112]:
ex_batch = iter(ex_dataloader)

This will show you batches. Since `batch_size=2` for `ex_dataloader`, each batch is a tuple of tensors, each with two rows. The structure is determined by `collate_fn` in `RecogsDataset`:

`X_pad, X_mask, y_pad, y_mask, y_pad`

where `y_pad` is repeated in the final position to meet the interface specifications of `torch_model_base.py`, in case you decide to train models yourself. (See details below; Hugging Face calculates the loss itself, which is ultimately nice but a bit non-standard.)

In [113]:
next(ex_batch)

(tensor([[  1, 115, 361, 484, 698, 247,  17,   2,   0,   0,   0,   0,   0,   0],
         [  1, 115, 203, 159, 124, 679, 124, 400, 355, 698, 690, 476,  17,   2]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 tensor([[  1,   7, 381,   5,  11,   6,  67, 486,   5,  22,   6,  68, 177,   5,
           22,   8,  11,   6,  68, 724,   5,  22,   8,  44,   6,  68, 288,   5,
           44,   6,  68, 177,   5,  44,   8,  11,   6,   2,   0,   0,   0,   0,
            0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
            0,   0],
         [  1,   7, 247,   5,  11,   6,  67, 655,   5,  44,   6,  67, 415,   5,
           63,   6,  67,   7, 479,   5,  12,   6,  67, 490,   9, 207,   5,  11,
            8,  44,   6,  68, 382,   5,  64,   6,  68, 664,   5,  64,   8,  11,
            6,  68, 177,   5,  64,   8,  63,   6,  68, 567,   5,  64,   8,  12,
            6,   2]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
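
To make the padding and masking in `collate_fn` concrete, here is a torch-free sketch of the same idea on plain lists. The function `pad_and_mask` is a hypothetical stand-in for the inner `get_pad_and_mask`, and the input ids are arbitrary; like the original, it assumes that the padding id is 0.

```python
def pad_and_mask(seqs, pad_id=0):
    """Right-pad every sequence to the longest length and build the
    matching attention masks (1 = real token, 0 = padding).

    Hypothetical list-based version of `get_pad_and_mask`.
    """
    maxlen = max(len(s) for s in seqs)
    padded = [list(s) + [pad_id] * (maxlen - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (maxlen - len(s)) for s in seqs]
    return padded, mask


# Two arbitrary id sequences of different lengths:
padded, mask = pad_and_mask([[1, 115, 2], [1, 115, 361, 484, 2]])
print(padded)
print(mask)
```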

### Model basics

Now we come to the model itself. We will first load it and explore it a bit, and then we will define a nice classifier interface for it.

In [114]:
from transformers import EncoderDecoderModel

encdec = EncoderDecoderModel.from_pretrained("ReCOGS/ReCOGS-model")

A single illustrative example:

In [115]:
ex_inputs = enc_tokenizer.batch_encode_plus(
    ["A rose was helped by a dog ."], 
    return_tensors='pt')

ex_outputs = dec_tokenizer.batch_encode_plus(
    ['rose ( 53 ) ; dog ( 38 ) ; help ( 7 ) AND theme ( 7 , 53 ) AND agent ( 7 , 38 )'], 
    return_tensors='pt')

Here is the forward method. For training, it is vital to have `labels=` here so that the model returns a loss value.

In [116]:
ex_rep = encdec(
    input_ids=ex_inputs["input_ids"],
    attention_mask=ex_inputs["attention_mask"],
    labels=ex_outputs["input_ids"],
    decoder_attention_mask=ex_outputs["attention_mask"],
)



In [117]:
ex_rep.keys(), ex_rep.loss

(odict_keys(['loss', 'logits', 'past_key_values', 'encoder_last_hidden_state']),
 tensor(0.3451, grad_fn=<NllLossBackward0>))

And here is how we will do generation:

In [118]:
ex_gen = encdec.generate(
    ex_inputs['input_ids'],
    attention_mask=ex_inputs['attention_mask'],
    max_new_tokens=512,
    eos_token_id=encdec.config.eos_token_id)

ex_gen

tensor([[  1,   1, 581,   5,  41,   6,  67, 328,   5,  58,   6,  67, 408,   5,
          17,   6,  68, 664,   5,  17,   8,  41,   6,  68, 177,   5,  17,   8,
          58,   6,   2]])

In [119]:
ex_pred = dec_tokenizer.batch_decode(
    ex_gen, 
    skip_special_tokens=False, 
    # Our tokenizers have this set already, but I am nervous:
    clean_up_tokenization_spaces=False)

ex_pred

# "A rose was helped by a dog ."

['[BOS] [BOS] rose ( 37 ) ; dog ( 52 ) ; help ( 15 ) AND theme ( 15 , 37 ) AND agent ( 15 , 52 ) [EOS]']

### Model interface

Okay, finally, the main interface. If you do not plan to train your own models using our code, then you can treat `RecogsModel` as an interface and not worry about these details.

In [120]:
from torch_model_base import TorchModelBase
import torch.nn as nn
from transformers import EncoderDecoderModel

As I mentioned above, Hugging Face `EncoderDecoderModel` instances will calculate a loss internally if you provide them with `labels`. Normally, one's optimization loop would need to do this manually. In order to rely on Hugging Face and still use the trainer in `torch_model_base.py`, we define this simple loss that just takes in model outputs and labels and returns `outputs.loss`. The labels argument is present for compatibility; it was already used internally to get the value of `outputs.loss` and so can be ignored.

In [121]:
class RecogsLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.reduction = "mean"

    def forward(self, outputs, labels):
        """`labels` is ignored, as it was already used to assign a
        value of `outputs.loss`, and that value is all we need."""
        return outputs.loss

Here is a basic `nn.Module`. Its sole purpose is to organize the examples created by our `RecogsDataset` and feed them to the trained `EncoderDecoderModel`:

In [218]:
class RecogsModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.encdec = EncoderDecoderModel.from_pretrained(
            "ReCOGS/ReCOGS-model")

        # self.encdec = EncoderDecoderModel.from_pretrained(
        #     '../ReCOGS/results_cogs/cogs_pipeline.model.ende_transformer.lf.cogs.glove.False.seed.42/model-last/'
        # )

    def forward(self, X_pad, X_mask, y_pad, y_mask, labels=None):
        outputs = self.encdec(
            input_ids=X_pad, 
            attention_mask=X_mask,
            decoder_attention_mask=y_mask,
            labels=y_pad)
        return outputs

And, at last, our interface. The keyword parameter `initialize=True` is the default because we are initially going to use this just for making predictions, and so we need the instance to establish all its parameters when we initialize it as opposed to waiting to do that when we call `fit` (which we may never do).

Aside: the Hugging Face course uses `MarianMTModel` for its translation task and mT5 for its summarization task. Both should come with tokenizers that cover more than one language, so they might work for this assignment.

In [219]:
class RecogsModel(TorchModelBase):
    def __init__(self, *args,
            initialize=True,
            enc_vocab_filename=f"{SRC_DIRNAME}/src_vocab.txt",
            dec_vocab_filename=f"{SRC_DIRNAME}/tgt_vocab.txt",
            **kwargs):
        self.enc_vocab_filename = enc_vocab_filename
        self.dec_vocab_filename = dec_vocab_filename
        self.enc_tokenizer = get_tokenizer(self.enc_vocab_filename)
        self.dec_tokenizer = get_tokenizer(self.dec_vocab_filename)
        super().__init__(*args, **kwargs)
        self.loss = RecogsLoss()
        if initialize:
            self.initialize()

    def build_graph(self):
        return RecogsModule()

    def build_dataset(self, X, y=None):
        return RecogsDataset(
            self.enc_tokenizer, self.dec_tokenizer, X, y=y)

    def predict(self, X, device=None):
        device = self.device if device is None else torch.device(device)
        dataset = self.build_dataset(X)
        dataloader = self._build_dataloader(dataset, shuffle=False)
        self.model.to(device)
        self.model.eval()
        preds = []
        with torch.no_grad():
            for batch in dataloader:
                X_pad, X_mask = [x.to(device) for x in batch]
                outputs = self.model.encdec.generate(
                    X_pad,
                    attention_mask=X_mask,
                    max_new_tokens=512,
                    eos_token_id=self.model.encdec.config.eos_token_id)
                results = self.dec_tokenizer.batch_decode(
                    outputs, 
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=False)
                preds += results
        return preds

    def score(self, X, y, device=None):
        # An overall accuracy score:
        preds = self.predict(X, device=device)
        vals = [int(recogs_exact_match(gold, pred)) for gold, pred in zip(y, preds)]
        return sum(vals) / len(vals)

In [221]:
recogs_model = RecogsModel()

Predictions for our first two dev cases:

In [222]:
recogs_model.predict(dataset['dev'].input[: 2])

['Liam ( 15 ) ; box ( 47 ) ; girl ( 35 ) ; hope ( 40 ) AND agent ( 40 , 15 ) AND ccomp ( 40 , 8 ) AND burn ( 8 ) AND theme ( 8 , 47 ) AND agent ( 8 , 35 )',
 '* donkey ( 48 ) ; * cookie ( 25 ) ; mother ( 50 ) ; lend ( 49 ) AND agent ( 49 , 48 ) AND theme ( 49 , 25 ) AND recipient ( 49 , 50 )']

## Question 2: Exploring predictions [2 points]

Now that we are set up to use the model, we can move to Question 2. There is just one final preliminary: for ReCOGS, we want to come as close as possible to assessing systems purely on semantic criteria, as opposed to assessing their ability to predict arbitrary features of logical forms. In particular, we want predictions to be independent of the particular choice of variable names and independent of the order of conjuncts.

### ReCOGS assessment function

The function `recogs_exact_match` does this. It's a complex function, and so you can ignore its precise implementation details. Here are some illustrative examples to give you a feel for it:

In [223]:
# The precise names of bound variables do not matter:

recogs_exact_match(
    "dog ( 4 ) AND happy ( 4 )", 
    "dog ( 7 ) AND happy ( 7 ) ")

True

In [224]:
# The order of conjuncts does not matter:

recogs_exact_match(
    "dog ( 4 ) AND happy ( 4 )", 
    "happy ( 7 ) AND dog ( 7 )")

True

In [225]:
# Consistency of variable names does matter:

recogs_exact_match(
    "dog ( 4 ) AND happy ( 4 )", 
    "dog ( 4 ) AND happy ( 7 )")

False
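
To build intuition for what these three examples are checking, here is a toy brute-force equivalence test. It is not the implementation of `recogs_exact_match` in `compgen`; `toy_lf_equivalent` is a hypothetical function that just tries every one-to-one renaming of variables and ignores conjunct order, so it is only practical for very small LFs.

```python
import re
from itertools import permutations


def toy_lf_equivalent(lf1, lf2):
    """Return True if the two LFs contain the same conjuncts up to
    conjunct order and a consistent one-to-one variable renaming.

    Hypothetical toy check: factorial in the number of variables.
    """
    def conjuncts(lf):
        # Split on ";" and "AND", then normalize internal whitespace.
        parts = re.split(r";|\bAND\b", lf)
        return [" ".join(p.split()) for p in parts if p.strip()]

    def variables(conjs):
        # Distinct variables in order of first appearance.
        seen = []
        for conj in conjs:
            for var in re.findall(r"\d+", conj):
                if var not in seen:
                    seen.append(var)
        return seen

    c1, c2 = conjuncts(lf1), conjuncts(lf2)
    v1, v2 = variables(c1), variables(c2)
    if len(c1) != len(c2) or len(v1) != len(v2):
        return False
    target = sorted(c2)
    for perm in permutations(v2):
        # Rename the variables of `lf1` and compare as sorted lists.
        mapping = dict(zip(v1, perm))
        renamed = sorted(
            re.sub(r"\d+", lambda m: mapping[m.group()], conj)
            for conj in c1)
        if renamed == target:
            return True
    return False


print(toy_lf_equivalent("dog ( 4 ) AND happy ( 4 )", "happy ( 7 ) AND dog ( 7 )"))
print(toy_lf_equivalent("dog ( 4 ) AND happy ( 4 )", "dog ( 4 ) AND happy ( 7 )"))
```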

### Task

Your task is to write a utility function to see how well a model does on a specific generalization category in the generalization dataset. The metric is accuracy according to `recogs_exact_match`.

In [226]:
def category_assess(gen_df, model, category):
    """Assess `model` against the `category` examples in `gen_df`.

    Parameters
    ----------
    gen_df: pd.DataFrame
        Should be `dataset["gen"]`
    model: A `RecogsModel` instance
    category: str
        A string from `gen_df.category`

    Returns
    -------
    `pd.DataFrame` limited to `category` examples and with columns
    "prediction" and "correct" added by this function
    """
    # This line is done for you because of how important it is to
    # operate on a copy of the dataframe rather than the original!
    cat_df = gen_df[gen_df.category == category].copy()

    # Step 1: Add a column called "prediction" to `cat_df`. This should
    # give the predicted LFs:
    ##### YOUR CODE HERE
    cat_df["prediction"] = model.predict(cat_df.input)



    # Step 2: Add a column "correct" that says whether the prediction
    # and the gold output are the same. Must use `recogs_exact_match`.
    ##### YOUR CODE HERE
    cat_df["correct"] = [recogs_exact_match(gold, pred) for gold, pred in zip(cat_df.output, cat_df.prediction)]


    # Step 3: Return the `pd.DataFrame` `cat_df`:
    ##### YOUR CODE HERE
    return cat_df




In [227]:
dataset['gen'].sample(5)

Unnamed: 0,input,output,category
1623,A child tolerated the cake on the road in the house in a car beside a stage on the pedestal on the chair on a bed on the yacht in a bottle on the futon on the table on the canvas .,"child ( 23 ) ; * cake ( 6 ) ; * road ( 51 ) ; * house ( 30 ) ; car ( 49 ) ; stage ( 18 ) ; * pedestal ( 52 ) ; * chair ( 29 ) ; bed ( 31 ) ; * yacht ( 21 ) ; bottle ( 38 ) ; * futon ( 3 ) ; * table ( 35 ) ; * canvas ( 44 ) ; nmod . on ( 6 , 51 ) AND nmod . in ( 51 , 30 ) AND nmod . in ( 30 , 49 ) AND nmod . beside ( 49 , 18 ) AND nmod . on ( 18 , 52 ) AND nmod . on ( 52 , 29 ) AND nmod . on ( 29 , 31 ) AND nmod . on ( 31 , 21 ) AND nmod . in ( 21 , 38 ) AND nmod . on ( 38 , 3 ) AND nmod . on ( 3 , 35 ) AND nmod . on ( 35 , 44 ) AND tolerate ( 26 ) AND agent ( 26 , 23 ) AND theme ( 26 , 6 )",pp_recursion
1657,Emma liked that a cockroach slept .,"Emma ( 31 ) ; cockroach ( 42 ) ; like ( 53 ) AND agent ( 53 , 31 ) AND ccomp ( 53 , 23 ) AND sleep ( 23 ) AND agent ( 23 , 42 )",obj_to_subj_common
98,Luna hoped that the hippo ate .,"Luna ( 39 ) ; * hippo ( 8 ) ; hope ( 50 ) AND agent ( 50 , 39 ) AND ccomp ( 50 , 28 ) AND eat ( 28 ) AND agent ( 28 , 8 )",only_seen_as_unacc_subj_as_obj_omitted_transitive_subj
18322,The fly squeezed the hero .,"* fly ( 1 ) ; * hero ( 45 ) ; squeeze ( 39 ) AND agent ( 39 , 1 ) AND theme ( 39 , 45 )",passive_to_active
5679,The monster shattered Evelyn .,"* monster ( 14 ) ; Evelyn ( 25 ) ; shatter ( 41 ) AND agent ( 41 , 14 ) AND theme ( 41 , 25 )",unacc_to_transitive


In [228]:
def test_category_assess(func):
    testmod = RecogsModel()
    samp_df = dataset['gen'].head(150)
    examples = [
        ("active_to_passive", 0.80),
        ("unacc_to_transitive", 0.86),
        ("obj_to_subj_proper", 0.78)
    ]
    result_df = func(samp_df, testmod, "active_to_passive")
    if not isinstance(result_df, pd.DataFrame):
        print(f"Error `{func.__name__}`: "
              "Return value should be a `pd.DataFrame`")
        return
    errcount = 0
    for colname in ("input", "output", "category", "prediction", "correct"):
        if colname not in result_df.columns:
            errcount += 1
            print(f"Error `{func.__name__}`: column '{colname}' is missing")
    if errcount != 0:
        return
    expected_len = 5
    result_len = result_df.shape[0]
    if result_len != expected_len:
        print(f"Error `{func.__name__}`: "
              f"Expected {expected_len} results, got {result_len}.")
        return
    errcount = 0
    for cat, expected in examples:
        result_df = func(samp_df, testmod, cat)
        result = result_df.correct.sum() / result_df.shape[0]
        result = round(result, 2)
        if result != expected:
            errcount += 1
            print(f"Error `{func.__name__}` with category {cat}: "
                  f"Expected acc {expected}, got {result}")
    if errcount == 0:
        print(f"No errors for `{func.__name__}`")

In [229]:
test_category_assess(category_assess)

No errors for `category_assess`


Question 1 above might lead you to expect that our model will struggle with examples in which proper names appear with totally unfamiliar roles. For that question, you wrote `get_propername_role` to get `(name, role)` pairs from examples and `find_name_roles` to do analyses with that function. We can now run that same analysis on our errors:

In [230]:
gen_df = dataset['gen']

In [231]:
# Depending on your computer, this could take a while. On a relatively
# new Apple laptop, it took about 3 minutes. Colab will be much more
# variable in the time it takes, depending on what kind of instance
# you are running.

pred_df = category_assess(gen_df, recogs_model, "obj_to_subj_proper")

Extract the errors:

In [232]:
err_df = pred_df[pred_df.correct == False]

Use `find_name_roles` to get the role distribution in the error set:

In [233]:
err_roles = find_name_roles(err_df, colname="output")
sorted(err_roles.items(), key=lambda x: len(x[1]))[: 3]

[('Charlie', defaultdict(int, {'agent': 64})),
 ('Ava', defaultdict(int, {'agent': 1})),
 ('Skylar', defaultdict(int, {'recipient': 1}))]

It's our old friend Charlie – in training, always a theme; in the generalization tests, always an agent.
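
To make the role analysis concrete, here is a minimal sketch of how `(name, role)` pairs could be read off a single logical form. It is not the `get_propername_role` you wrote for Question 1: the helper name `sketch_name_roles`, the two regular expressions, and the restriction to the roles `agent`, `theme`, and `recipient` are all assumptions made for illustration. Proper names are assumed to be the capitalized tokens that introduce a variable.

```python
import re

# Assumption: a proper name is a capitalized token followed by "( var )".
NAME_RE = re.compile(r"\b([A-Z][a-z]+) \( (\d+) \)")
# Assumption: only these three roles matter for the analysis.
ROLE_RE = re.compile(r"\b(agent|theme|recipient) \( \d+ , (\d+) \)")


def sketch_name_roles(lf):
    """Return (name, role) pairs for the proper names in one LF string."""
    var_to_name = {var: name for name, var in NAME_RE.findall(lf)}
    return [(var_to_name[var], role)
            for role, var in ROLE_RE.findall(lf)
            if var in var_to_name]


# Gold LF for "Charlie gave William the purse on the stage ."
lf = ("Charlie ( 51 ) ; William ( 32 ) ; * purse ( 8 ) ; * stage ( 36 ) ; "
      "nmod . on ( 8 , 36 ) AND give ( 59 ) AND agent ( 59 , 51 ) "
      "AND recipient ( 59 , 32 ) AND theme ( 59 , 8 )")
pairs = sketch_name_roles(lf)
print(pairs)
```

Counting these pairs per name over a whole split gives the kind of role distribution shown above.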

In [234]:
err_df[:5]

Unnamed: 0,input,output,category,prediction,correct
6,William tolerated that Charlie fed the cake on the stage to the boy .,"William ( 4 ) ; Charlie ( 3 ) ; * cake ( 27 ) ; * stage ( 31 ) ; * boy ( 17 ) ; nmod . on ( 27 , 31 ) AND tolerate ( 24 ) AND agent ( 24 , 4 ) AND ccomp ( 24 , 6 ) AND feed ( 6 ) AND agent ( 6 , 3 ) AND theme ( 6 , 27 ) AND recipient ( 6 , 17 )",obj_to_subj_proper,"William ( 24 ) ; Charlie ( 1 ) ; * cake ( 31 ) ; * stage ( 49 ) ; * boy ( 45 ) ; nmod . on ( 31 , 49 ) AND tolerate ( 51 ) AND agent ( 51 , 24 ) AND ccomp ( 51 , 31 ) AND feed ( 31 ) AND agent ( 31 , 1 ) AND theme ( 31 , 31 ) AND recipient ( 31 , 45 )",False
101,Charlie gave William the purse on the stage .,"Charlie ( 51 ) ; William ( 32 ) ; * purse ( 8 ) ; * stage ( 36 ) ; nmod . on ( 8 , 36 ) AND give ( 59 ) AND agent ( 59 , 51 ) AND recipient ( 59 , 32 ) AND theme ( 59 , 8 )",obj_to_subj_proper,"Charlie ( 22 ) ; William ( 20 ) ; * purse ( 31 ) ; * stage ( 23 ) ; nmod . on ( 31 , 23 ) AND give ( 20 ) AND agent ( 20 , 22 ) AND recipient ( 20 , 20 ) AND theme ( 20 , 31 )",False
188,Ava thought that the boy supported that Charlie studied .,"Ava ( 52 ) ; * boy ( 35 ) ; Charlie ( 54 ) ; think ( 36 ) AND agent ( 36 , 52 ) AND ccomp ( 36 , 7 ) AND support ( 7 ) AND agent ( 7 , 35 ) AND ccomp ( 7 , 43 ) AND study ( 43 ) AND agent ( 43 , 54 )",obj_to_subj_proper,"Ava ( 22 ) ; * boy ( 23 ) ; Charlie ( 0 ) ; think ( 25 ) AND agent ( 25 , 22 ) AND ccomp ( 25 , 20 ) AND support ( 20 ) AND agent ( 20 , 23 ) AND ccomp ( 20 , 46 ) AND study ( 46 ) AND theme ( 46 , 0 )",False
421,Charlie served a horse the drink on a computer on a bed .,"Charlie ( 52 ) ; horse ( 38 ) ; * drink ( 12 ) ; computer ( 3 ) ; bed ( 17 ) ; nmod . on ( 12 , 3 ) AND nmod . on ( 3 , 17 ) AND serve ( 19 ) AND agent ( 19 , 52 ) AND recipient ( 19 , 38 ) AND theme ( 19 , 12 )",obj_to_subj_proper,"Charlie ( 10 ) ; horse ( 17 ) ; * drink ( 26 ) ; computer ( 9 ) ; bed ( 35 ) ; nmod . on ( 26 , 9 ) AND nmod . on ( 9 , 33 ) AND serve ( 17 ) AND agent ( 17 , 10 ) AND recipient ( 17 , 17 ) AND theme ( 17 , 26 )",False
1078,Charlie gave the donut in a cup beside the stage to Skylar .,"Charlie ( 24 ) ; * donut ( 10 ) ; cup ( 22 ) ; * stage ( 42 ) ; Skylar ( 46 ) ; nmod . in ( 10 , 22 ) AND nmod . beside ( 22 , 42 ) AND give ( 34 ) AND agent ( 34 , 24 ) AND theme ( 34 , 10 ) AND recipient ( 34 , 46 )",obj_to_subj_proper,"Charlie ( 22 ) ; * donut ( 12 ) ; cup ( 56 ) ; * stage ( 56 ) ; Skylar ( 34 ) ; nmod . in ( 12 , 56 ) AND nmod . beside ( 56 , 48 ) AND give ( 18 ) AND agent ( 18 , 22 ) AND theme ( 18 , 12 ) AND recipient ( 18 , 34 )",False


In [235]:
def get_accuracy(pred_df):
    """Compute the accuracy of `pred_df`."""
    return pred_df.correct.sum() / pred_df.shape[0] * 100

In [236]:
get_accuracy(pred_df)

93.60000000000001

In [237]:
pred_df = category_assess(gen_df, recogs_model, "subj_to_obj_proper")
get_accuracy(pred_df)

86.8

In [238]:
err_df = pred_df[pred_df.correct == False]
err_roles = find_name_roles(err_df, colname="output")
sorted(err_roles.items(), key=lambda x: len(x[1]))[: 3]

[('Lina', defaultdict(int, {'theme': 132})),
 ('Abigail', defaultdict(int, {'recipient': 1})),
 ('Chloe', defaultdict(int, {'recipient': 1}))]

It's our old friend Lina – in training, always an agent; in the generalization tests, always a theme.

In [239]:
err_df[:5]

Unnamed: 0,input,output,category,prediction,correct
196,Noah gave Elizabeth Lina .,"Noah ( 49 ) ; Elizabeth ( 38 ) ; Lina ( 42 ) ; give ( 3 ) AND agent ( 3 , 49 ) AND recipient ( 3 , 38 ) AND theme ( 3 , 42 )",subj_to_obj_proper,"Noah ( 11 ) ; Elizabeth ( 15 ) ; give ( 38 ) AND agent ( 38 , 11 ) AND recipient ( 38 , 15 ) AND theme ( 38 , 15 )",False
387,Dylan believed that a girl sent the human Lina .,"Dylan ( 51 ) ; girl ( 46 ) ; * human ( 49 ) ; Lina ( 5 ) ; believe ( 18 ) AND agent ( 18 , 51 ) AND ccomp ( 18 , 14 ) AND send ( 14 ) AND agent ( 14 , 46 ) AND recipient ( 14 , 49 ) AND theme ( 14 , 5 )",subj_to_obj_proper,"Dylan ( 18 ) ; girl ( nmod . on ( 54 ) AND believe ( 26 ) AND agent ( 26 , 18 ) AND ccomp ( 26 , 46 ) AND send ( 46 ) AND agent ( 46 , 18 ) AND recipient ( 46 , 54 ) AND theme ( 46 , 54 )",False
418,The buyer forwarded Emma Lina .,"* buyer ( 52 ) ; Emma ( 24 ) ; Lina ( 6 ) ; forward ( 38 ) AND agent ( 38 , 52 ) AND recipient ( 38 , 24 ) AND theme ( 38 , 6 )",subj_to_obj_proper,"* buyer ( 7 ) ; Emma ( 38 ) ; forward ( 37 ) AND agent ( 37 , 7 ) AND recipient ( 37 , 38 ) AND theme ( 37 , 38 )",False
863,Emma tolerated that a boy passed Abigail Lina .,"Emma ( 6 ) ; boy ( 2 ) ; Abigail ( 39 ) ; Lina ( 37 ) ; tolerate ( 52 ) AND agent ( 52 , 6 ) AND ccomp ( 52 , 0 ) AND pass ( 0 ) AND agent ( 0 , 2 ) AND recipient ( 0 , 39 ) AND theme ( 0 , 37 )",subj_to_obj_proper,"Emma ( 25 ) ; boy ( 43 ) ; Abigail ( 40 ) ; tolerate ( 28 ) AND agent ( 28 , 25 ) AND ccomp ( 28 , 16 ) AND pass ( 16 ) AND agent ( 16 , 43 ) AND recipient ( 16 , 40 ) AND theme ( 16 , 40 )",False
1249,A girl sold Lina to Chloe .,"girl ( 15 ) ; Lina ( 37 ) ; Chloe ( 1 ) ; sell ( 2 ) AND agent ( 2 , 15 ) AND theme ( 2 , 37 ) AND recipient ( 2 , 1 )",subj_to_obj_proper,"girl ( 31 ) ; Lina ( 29 ) ; sell ( 8 ) AND agent ( 8 , 31 ) AND theme ( 8 , 29 ) AND recipient ( 8 , 29 )",False


In [240]:
pred_df = category_assess(gen_df, recogs_model, "obj_pp_to_subj_pp")
get_accuracy(pred_df)

0.0

## Question 3: In-context learning with DSP [2 points]

For this question, we are going to switch gears, from using our trained ReCOGS model to seeing whether we can get traction on this problem using only in-context learning. This question is meant to be very straightforward – our sole goal is to get you to the point where you have a working DSP program that you can build on.

### Set-up

Standard set-up for DSP, but we don't need a retriever:

In [241]:
import cohere
from datasets import load_dataset
import openai
import os
import dsp
from dotenv import load_dotenv

root_path = '.'

os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(root_path, 'cache')

load_dotenv(override=True)

openai_key = os.getenv('OPENAI_API_KEY')  # or replace with your API key (optional)

cohere_key = os.getenv('COHERE_API_KEY')  # or replace with your API key (optional)

Our language model:

In [242]:
# Options for Cohere: command-medium-nightly, command-xlarge-nightly
lm = dsp.Cohere(model='command-xlarge-nightly', api_key=cohere_key)

# Options for OpenAI:
# [d["root"] for d in openai.Model.list()["data"]]
# lm = dsp.GPT3(model='text-davinci-001', api_key=openai_key)

DSP settings:

In [243]:
dsp.settings.configure(lm=lm)

### Train examples in DSP format

This will convert the train set into a list of `dsp.Example` instances to use for demonstrations:

In [244]:
dsp_recogs_train = [dsp.Example(input=row['input'], output=row['output'])
                    for _, row in dataset['train'].iterrows()]

### Basic template

In [245]:
Input = dsp.Type(
    prefix="Input:", 
    desc="${the sentence to be translated}")

Output = dsp.Type(
    prefix="Output:", 
    desc="${a logical form}",
    format=dsp.format_answers)

cogs_template = dsp.Template(
    instructions="Translate sentences into logical forms.",
    input=Input(),
    output=Output())

Quick illustration:

In [246]:
ex = dsp.Example(
    input=dataset['train'].input[0],
    demos=dsp.sample(dsp_recogs_train, k=2))


result = cogs_template(ex)
print(result)

Translate sentences into logical forms.

---

Follow the following format.

Input: ${the sentence to be translated}
Output: ${a logical form}

---

Input: The resident was handed the cake beside a computer .
Output: * resident ( 1 ) ; * cake ( 5 ) ; computer ( 8 ) ; hand ( 3 ) AND recipient ( 3 , 1 ) AND theme ( 3 , 5 ) AND nmod . beside ( 5 , 8 )

---

Input: The cake was frozen by the baby .
Output: * cake ( 1 ) ; * baby ( 6 ) ; freeze ( 3 ) AND theme ( 3 , 1 ) AND agent ( 3 , 6 )

---

Input: A rose was helped by a dog .
Output:


### Task

Your task is just to complete the following very basic DSP program. The steps are laid out for you:

In [247]:
@dsp.transformation
def recogs_dsp(example, train=dsp_recogs_train, k=2): 
    # Step 1: Sample k train cases and add them to the `demos`
    # attribute of `example`:
    ##### YOUR CODE HERE
    example.demos = dsp.sample(train, k=k)



    # Run your program using `cogs_template`:
    ##### YOUR CODE HERE
    _, states_compl = dsp.generate(cogs_template)(example, stage='basics')



    # Return the `dsp.Completions`:
    ##### YOUR CODE HERE
    return states_compl



A quick test:

In [248]:
def test_recogs_dsp(func):
    k = 3
    ex = dsp.Example(input="Q0", output=["A0"])
    train = [
        dsp.Example(input="Q1", output=["A1"]),
        dsp.Example(input="Q2", output=["A2"]),
        dsp.Example(input="Q3", output=["A3"]),
        dsp.Example(input="Q4", output=["A4"])]
    compl = func(ex, train=train, k=k)
    errcount = 0
    # Check the LM was used as expected:
    if len(compl.data) != 1:
        errcount += 1
        print(f"Error for `{func.__name__}`: Unexpected LM output.")
    data = compl.data[0]
    # Check that the right number of demos was used:
    demos = data['demos']
    if len(demos) != k:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Unexpected demo count: {len(demos)}")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [249]:
# test_recogs_dsp(recogs_dsp)

In [250]:
# recogs_dsp(ex).output

In [251]:
lm.inspect_history(n=2)

### Optional assessment

Here we sample 10 dev cases for a small evaluation. If you adapt this code, remember to use `recogs_exact_match` so that you aren't unfairly penalized for conjunct order or variable name differences.
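
To see why plain string equality is too strict, here is a small sketch that compares two logical forms as bags of conjuncts. It covers only the conjunct-order half of the problem: it does not handle variable renaming, so it is stricter than `recogs_exact_match` and is not a substitute for it. The helper names `conjunct_bag` and `same_up_to_order` are hypothetical.

```python
import re
from collections import Counter


def conjunct_bag(lf):
    """Split an LF on `;` and `AND` and count the resulting conjuncts."""
    parts = re.split(r"\s*(?:;|\bAND\b)\s*", lf.strip())
    return Counter(p for p in parts if p)


def same_up_to_order(gold, pred):
    """True if the two LFs contain the same conjuncts in any order."""
    return conjunct_bag(gold) == conjunct_bag(pred)


gold = "* pancake ( 32 ) ; like ( 26 ) AND theme ( 26 , 32 )"
reordered = "* pancake ( 32 ) ; theme ( 26 , 32 ) AND like ( 26 )"
renamed = "* pancake ( 1 ) ; like ( 2 ) AND theme ( 2 , 1 )"

print(gold == reordered)                  # string equality fails
print(same_up_to_order(gold, reordered))  # order-insensitive check passes
print(same_up_to_order(gold, renamed))    # renaming is NOT handled here
```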

In [252]:
ssamp = dataset['dev'].sample(10)

In [253]:
# ssamp['prediction'] = ssamp.input.apply(
#     lambda x: recogs_dsp(dsp.Example(input=x)).output)

In [254]:
# ssamp['correct'] = ssamp.apply(
#     lambda row: recogs_exact_match(row['output'], row['prediction']), axis=1)

In [255]:
# ssamp['correct'].sum() / ssamp.shape[0]

A few random examples to see what's going on:

In [256]:
ssamp.sample(5).to_dict(orient='records')

[{'input': 'The pancake was liked .',
  'output': '* pancake ( 32 ) ; like ( 26 ) AND theme ( 26 , 32 )',
  'category': 'in_distribution'},
 {'input': 'The chalk was awarded to the penguin by Emma .',
  'output': '* chalk ( 7 ) ; * penguin ( 56 ) ; Emma ( 25 ) ; award ( 26 ) AND theme ( 26 , 7 ) AND recipient ( 26 , 56 ) AND agent ( 26 , 25 )',
  'category': 'in_distribution'},
 {'input': 'A butterfly rolled a jigsaw beside a stage .',
  'output': 'butterfly ( 1 ) ; jigsaw ( 45 ) ; stage ( 11 ) ; nmod . beside ( 45 , 11 ) AND roll ( 59 ) AND agent ( 59 , 1 ) AND theme ( 59 , 45 )',
  'category': 'in_distribution'},
 {'input': 'The teacher liked a boy .',
  'output': '* teacher ( 2 ) ; boy ( 6 ) ; like ( 22 ) AND agent ( 22 , 2 ) AND theme ( 22 , 6 )',
  'category': 'in_distribution'},
 {'input': 'Logan forwarded the girl the cake beside the stage .',
  'output': 'Logan ( 16 ) ; * girl ( 47 ) ; * cake ( 34 ) ; * stage ( 1 ) ; nmod . beside ( 34 , 1 ) AND forward ( 12 ) AND agent ( 12 , 

## Question 4: Original System [3 points]

For your original system, you can do anything at all. The only constraint (repeated from above):

__You cannot train your system on any examples from `dataset["gen"]`, nor can the output representations from those examples be included in any prompts used for in-context learning.__

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [257]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.

# To utilize the power of the pre-trained model, I will further train the ReCOGS model 


# STOP COMMENT: Please do not remove this comment.

Here are some potential paths – just a few of many options, though!

### Option: DSP program

This could build on Question 3 very directly. All we have tried ourselves so far is the simple approach from that question.
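
One possible direction, sketched here under stated assumptions, is to choose demonstrations by similarity to the input instead of sampling them at random. The functions `overlap_score` and `select_demos` are hypothetical helpers and are not part of DSP. Jaccard word overlap is only one simple choice of similarity, and we have not tested whether it helps on this task. The demonstration pool must come from `dataset["train"]`, in line with the rule about `dataset["gen"]`.

```python
def overlap_score(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_demos(query, train_pairs, k=2):
    """Return the k (input, output) pairs whose inputs overlap most with `query`."""
    ranked = sorted(
        train_pairs,
        key=lambda pair: overlap_score(query, pair[0]),
        reverse=True)
    return ranked[:k]


# Toy pool built from examples shown earlier in this notebook.
train_pairs = [
    ("The teacher liked a boy .",
     "* teacher ( 2 ) ; boy ( 6 ) ; like ( 22 ) AND agent ( 22 , 2 ) AND theme ( 22 , 6 )"),
    ("The cake was frozen by the baby .",
     "* cake ( 1 ) ; * baby ( 6 ) ; freeze ( 3 ) AND theme ( 3 , 1 ) AND agent ( 3 , 6 )"),
    ("The pancake was liked .",
     "* pancake ( 32 ) ; like ( 26 ) AND theme ( 26 , 32 )"),
]

demos = select_demos("A rose was helped by a dog .", train_pairs, k=2)
for inp, out in demos:
    print(inp)
```

In a DSP program, the selected pairs would be wrapped as `dsp.Example` instances and assigned to `example.demos` in place of the random sample.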

### Option: Further training of our model

This is very easy to do. For example, here we set up a model for further training, and we've exposed some keyword arguments that may be of interest:

In [258]:
recogs_ff = RecogsModel(
    batch_size=32,
    gradient_accumulation_steps=20,
    max_iter=300, 
    early_stopping=False,
    n_iter_no_change=100,
    optimizer_class=torch.optim.Adam,
    eta=1e-4,)

In [259]:
# ds = pd.concat((dataset['dev'], dataset['test']))
ds = dataset['train']

In [260]:
# _ = recogs_ff.fit(ds.input, ds.output)

For this, you will want to pay a lot of attention to the optimization-related parameters.
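
As a rough guide to how two of these parameters interact, the sketch below computes the effective batch size and the number of optimizer updates per epoch. It assumes that gradients are accumulated over `gradient_accumulation_steps` batches before each optimizer step, with one extra step for any leftover batches at the end of the epoch. Check the training loop you are using before relying on this. The function name and the example count of 1000 are hypothetical.

```python
import math


def updates_per_epoch(n_examples, batch_size, gradient_accumulation_steps):
    """Optimizer steps per epoch, assuming one step per group of
    `gradient_accumulation_steps` batches (a final partial group counts)."""
    n_batches = math.ceil(n_examples / batch_size)
    return math.ceil(n_batches / gradient_accumulation_steps)


batch_size = 32
gradient_accumulation_steps = 20

# Number of examples contributing to each parameter update:
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)

# With a hypothetical 1000 training examples, there are very few updates:
n_updates = updates_per_epoch(1000, batch_size, gradient_accumulation_steps)
print(n_updates)
```

With settings like these, a small fine-tuning set yields only a handful of updates per epoch, so `max_iter` and `eta` matter a great deal.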

In [261]:
def eval(valid_df, model):
    struct_cats = [
        "obj_pp_to_subj_pp",
        "cp_recursion",
        "pp_recursion",
        "subj_to_obj_proper",
        "prim_to_obj_proper",
        "prim_to_subj_proper",
    ]
    dfs = [category_assess(valid_df, model, cat) for cat in struct_cats]
    df = pd.concat(dfs)
    result = [get_accuracy(d) for d in dfs]
    result = dict(zip(struct_cats, result))
    lex_dfs = pd.concat([
        category_assess(valid_df, model, cat)
        for cat in valid_df.category.unique()
        if cat not in struct_cats
    ])
    result["LEX"] = get_accuracy(lex_dfs)
    df = pd.concat([df, lex_dfs])
    result["OVERALL"] = get_accuracy(df)
    return result

In [262]:
# valid_ds = dataset['gen'].sample(1000, random_state=42)
valid_df = dataset['gen'][: 1000]

In [263]:
# number of model params
sum(p.numel() for p in recogs_ff.model.parameters() if p.requires_grad)

4344077

In [264]:
eval(valid_df, recogs_ff)

{'obj_pp_to_subj_pp': 0.0,
 'cp_recursion': 6.896551724137931,
 'pp_recursion': 0.0,
 'subj_to_obj_proper': 90.9090909090909,
 'prim_to_obj_proper': 43.90243902439025,
 'prim_to_subj_proper': 90.38461538461539,
 'LEX': 64.58036984352773,
 'OVERALL': 56.3}

Original: 0.559  
pp_dative_to_do_dative 0.15384615384615385  
passive_to_active 0.06666666666666667  
obj_pp_to_subj_pp 0.0  
cp_recursion 0.029411764705882353  
obj_omitted_transitive_to_transitive 0.34615384615384615  
prim_to_obj_proper 0.4827586206896552  
pp_recursion 0.0  
prim_to_subj_proper 0.8653846153846154  
only_seen_as_unacc_subj_as_obj_omitted_transitive_subj 0.9056603773584906  
unacc_to_transitive 0.6346153846153846  
prim_to_obj_common 0.6410256410256411  
subj_to_obj_common 0.7555555555555555  
obj_to_subj_proper 0.9069767441860465  
obj_to_subj_common 0.9534883720930233  
subj_to_obj_proper 0.8297872340425532  
only_seen_as_unacc_subj_as_unerg_subj 0.9583333333333334  
prim_to_inf_arg 0.0  
do_dative_to_pp_dative 0.7407407407407407  
active_to_passive 1.0  
only_seen_as_transitive_subj_as_unacc_subj 0.9285714285714286  
prim_to_subj_common 0.5454545454545454  

In [265]:
import wandb
wandb.login()


%env WANDB_PROJECT=recogs_sweeps

wandb: Currently logged in as: zanqi-liang (bigz). Use `wandb login --relogin` to force relogin


env: WANDB_PROJECT=recogs_sweeps


In [266]:
# method
sweep_config = {
    'method': 'random'
}


# hyperparameters
parameters_dict = {
    'epochs': {
        'value': 1
        },
    'batch_size': {
        'values': [8, 16, 32, 64]
        },
    'gradient_accumulation_steps': {
        'values': [1, 2, 4, 8, 20, 40]
        },
    'n_iter_no_change': {
        'values': [2, 5, 10, 20]
        },
    'optimizer_class': {
        'values': ['Adam', 'AdamW']
        },
    'eta': {
        'distribution': 'log_uniform_values',
        'min': 1e-6,
        'max': 1e-4
    },
}


sweep_config['parameters'] = parameters_dict
# sweep_id = wandb.sweep(sweep_config, project='recogs-sweeps')


In [267]:
def train(config=None):
    with wandb.init(config=config):
        config = wandb.config

        recogs_ff = RecogsModel(
            batch_size=config.batch_size,
            gradient_accumulation_steps=config.gradient_accumulation_steps,
            max_iter=100, 
            early_stopping=False,
            # n_iter_no_change=config.n_iter_no_change,
            optimizer_class=torch.optim.Adam if config.optimizer_class == 'Adam' else torch.optim.AdamW,
            eta=config.eta)
        
        _ = recogs_ff.fit(dataset['dev'].input[: 400], dataset['dev'].output[: 400])
    
        dfs = [category_assess(valid_df, recogs_ff, c) for c in valid_df.category.unique()]
        df = pd.concat(dfs)
    
        wandb.log({'accuracy': get_accuracy(df)})

In [268]:
# wandb.agent(sweep_id, train, count=20)
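
To get a feel for what a `random` sweep draws from a configuration like the one above, here is a pure-Python sketch of the sampling. It does not call W&B. `sample_config` is a hypothetical helper covering only the three kinds of parameter spec used in this notebook, and `log_uniform_values` is taken to mean that the logarithm of the value is uniform between `log(min)` and `log(max)`.

```python
import math
import random


def sample_config(parameters, rng):
    """Draw one configuration from a W&B-style `parameters` dict."""
    config = {}
    for name, spec in parameters.items():
        if "value" in spec:
            # Fixed value: nothing to sample.
            config[name] = spec["value"]
        elif "values" in spec:
            # Categorical choice among the listed values.
            config[name] = rng.choice(spec["values"])
        elif spec.get("distribution") == "log_uniform_values":
            lo, hi = math.log(spec["min"]), math.log(spec["max"])
            config[name] = math.exp(rng.uniform(lo, hi))
        else:
            raise ValueError(f"Unsupported spec for {name}: {spec}")
    return config


toy_parameters = {
    "epochs": {"value": 1},
    "batch_size": {"values": [8, 16, 32, 64]},
    "optimizer_class": {"values": ["Adam", "AdamW"]},
    "eta": {"distribution": "log_uniform_values", "min": 1e-6, "max": 1e-4},
}

rng = random.Random(0)
configs = [sample_config(toy_parameters, rng) for _ in range(20)]
print(configs[0])
```

Sampling `eta` on a log scale spreads the trials evenly across orders of magnitude rather than concentrating them near the upper bound.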

### Option: Using a pretrained model

The code used for Question 2 should make this very easy. For example, the following is the start of a complete solution using T5:

In [269]:
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class T5RecogsModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.encdec = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    def forward(self, X_pad, X_mask, y_pad, y_mask, labels=None):
        outputs = self.encdec(
            input_ids=X_pad, 
            attention_mask=X_mask,
            decoder_attention_mask=y_mask,
            labels=y_pad)
        return outputs

class T5RecogsModel(RecogsModel):
    def __init__(self, *args, initialize=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.enc_tokenizer = AutoTokenizer.from_pretrained("t5-small")
        self.dec_tokenizer = self.enc_tokenizer

    def build_graph(self):
        return T5RecogsModule()

This will make predictions, but they will be almost entirely disconnected from our task right now:

In [270]:
t5mod = T5RecogsModel()

In [271]:
t5_exs = dataset['dev'].input[: 2]

t5_exs

0    Liam hoped that a box was burned by a girl .
1      The donkey lended the cookie to a mother .
Name: input, dtype: object

In [272]:
t5mod.predict(t5_exs)

['Liam hoffte, dass eine Box von einer Frau in der Hand gelegt wird.',
 'Der Donkey lended den Cookie an eine Mutter .']

In [273]:
t5mod = T5RecogsModel(
    batch_size=32,
    gradient_accumulation_steps=20,
    max_iter=10, 
    early_stopping=False,
    n_iter_no_change=100,
    optimizer_class=torch.optim.Adam,
    eta=1e-5,)

# t5mod.fit(ds.input, ds.output)

In [274]:
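# `eval` returns a dict of accuracies, not a DataFrame of predictions,
# so collect the per-category predictions directly:
dfs = [category_assess(valid_df, t5mod, c) for c in valid_df.category.unique()]
pred = pd.concat(dfs)
print(get_accuracy(pred))

for c in valid_df.category.unique():
    print(c, get_accuracy(pred[pred.category == c]))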

KeyboardInterrupt: 

In [None]:
# V1
# 0.466
# pp_dative_to_do_dative 0.038461538461538464
# passive_to_active 0.0
# obj_pp_to_subj_pp 0.0
# cp_recursion 0.029411764705882353
# obj_omitted_transitive_to_transitive 0.038461538461538464
# prim_to_obj_proper 0.5517241379310345
# pp_recursion 0.0
# prim_to_subj_proper 0.8269230769230769
# only_seen_as_unacc_subj_as_obj_omitted_transitive_subj 0.8113207547169812
# unacc_to_transitive 0.38461538461538464
# prim_to_obj_common 0.717948717948718
# subj_to_obj_common 0.7111111111111111
# obj_to_subj_proper 0.7906976744186046
# obj_to_subj_common 0.813953488372093
# subj_to_obj_proper 0.7872340425531915
# only_seen_as_unacc_subj_as_unerg_subj 0.7291666666666666
# prim_to_inf_arg 0.0
# do_dative_to_pp_dative 0.5740740740740741
# active_to_passive 0.8095238095238095
# only_seen_as_transitive_subj_as_unacc_subj 0.7380952380952381
# prim_to_subj_common 0.4727272727272727

# V2
# 0.381
# pp_dative_to_do_dative 0.09615384615384616
# passive_to_active 0.0
# obj_pp_to_subj_pp 0.0
# cp_recursion 0.0
# obj_omitted_transitive_to_transitive 0.019230769230769232
# prim_to_obj_proper 0.4482758620689655
# pp_recursion 0.0
# prim_to_subj_proper 0.46153846153846156
# only_seen_as_unacc_subj_as_obj_omitted_transitive_subj 0.8490566037735849
# unacc_to_transitive 0.3269230769230769
# prim_to_obj_common 0.3333333333333333
# subj_to_obj_common 0.37777777777777777
# obj_to_subj_proper 0.5116279069767442
# obj_to_subj_common 0.627906976744186
# subj_to_obj_proper 0.7872340425531915
# only_seen_as_unacc_subj_as_unerg_subj 0.6875
# prim_to_inf_arg 0.0
# do_dative_to_pp_dative 0.7222222222222222
# active_to_passive 0.6904761904761905
# only_seen_as_transitive_subj_as_unacc_subj 0.7142857142857143
# prim_to_subj_common 0.2909090909090909

This model needs to be fine-tuned on ReCOGS, which you can do with its `fit` method. In that case, you will want to pay a lot of attention to the optimization-related parameters to `TorchModelBase`.

### Option: Training a seq2seq model from scratch

The above code for T5 is easily adapted to use a randomly initialized model. The config files used to train our core model are `encoder_config.json` and `decoder_config.json` in `SRC_DIRNAME`. These might be a good starting point in terms of parameters and other set-up details.

### There are lots more options!

Maybe a symbolic solver? A learned semantic parser? Tree-structured neural network?

## Question 5: Bakeoff entry [1 point]

Here we read in the bakeoff dataset:

In [None]:
bakeoff_df = pd.read_csv(
    os.path.join(SRC_DIRNAME, "cs224u-recogs-test-unlabeled.tsv"), 
    sep="\t", index_col=0)

For the bakeoff entry, you should add a column "prediction" containing your predicted LFs and then use the following command to write the file to disk:

In [None]:
bakeoff_df.to_csv("cs224u-recogs-bakeoff-entry.tsv", sep="\t")

Here is what the first couple of lines of the file should look like:

```
	input	category	prediction
0	A cake was blessed by the wolf .	active_to_passive	PREDICTED LF
1	A melon was blessed by a boy .	active_to_passive	PREDICTED LF
```

where `PREDICTED LF` is what you predicted. Here is a quick test you can run locally to ensure that the autograder won't fail:

In [None]:
def test_bakeoff_file(filename="cs224u-recogs-bakeoff-entry.tsv"):
    ref_filename = os.path.join(SRC_DIRNAME, "cs224u-recogs-test-unlabeled.tsv")
    ref_df = pd.read_csv(ref_filename, sep="\t", index_col=0)

    entry_df = pd.read_csv(filename, sep="\t", index_col=0)

    errcount = 0

    # Check expected columns:
    expected_cols = ["input", "category", "prediction"]
    for col in expected_cols:
        if col not in entry_df.columns:
            errcount += 1
            print(f"Missing column: {col}")
    if errcount > 0:
        return

    # Use the "category" column as a check that the rows have not
    # been shuffled:
    if not entry_df.category.equals(ref_df.category):
        errcount += 1
        print("Rows do not seem to be aligned with reference file. "
              "Might they have gotten shuffled?")

    # Check that the predictions all have type str:
    for line_num, x in enumerate(entry_df.prediction, start=1):
        if not isinstance(x, str):
            errcount += 1
            print(f"Prediction on line {line_num} is not a str: {x}")

    if errcount == 0:
        print("Bakeoff file seems to be in good shape!")

In [None]:
test_bakeoff_file("cs224u-recogs-bakeoff-entry.tsv")

Missing column: prediction
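
One thing the str check in `test_bakeoff_file` can catch is an empty prediction: an empty string written to a TSV is read back by `pd.read_csv` as a missing value, which is a float rather than a str. The sketch below reproduces this on a toy frame. The `roundtrip` helper and the `"EMPTY"` placeholder are assumptions for illustration, and `"PREDICTED LF"` stands in for a real prediction.

```python
import io

import pandas as pd


def roundtrip(df):
    """Write `df` as TSV to memory and read it back, as the test does on disk."""
    buf = io.StringIO()
    df.to_csv(buf, sep="\t")
    buf.seek(0)
    return pd.read_csv(buf, sep="\t", index_col=0)


toy_df = pd.DataFrame({
    "input": ["A cake was blessed by the wolf .",
              "A melon was blessed by a boy ."],
    "category": ["active_to_passive", "active_to_passive"]})

# Suppose the system returned nothing for the second example:
toy_df["prediction"] = ["PREDICTED LF", ""]

n_bad = sum(not isinstance(x, str) for x in roundtrip(toy_df).prediction)
print(n_bad)

# One possible guard: replace empty predictions with a non-empty placeholder.
toy_df["prediction"] = toy_df["prediction"].replace("", "EMPTY")
n_bad_fixed = sum(not isinstance(x, str) for x in roundtrip(toy_df).prediction)
print(n_bad_fixed)
```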
