**Description**: demonstrates that the zero-shot text classification method [described
here](https://stats.stackexchange.com/q/601159/337906) works well on the [COPA
task](https://people.ict.usc.edu/~gordon/copa.html). It's one of the [SuperGLUE
tasks](https://super.gluebenchmark.com/tasks) in which labels have multiple tokens, in
some sense. An interesting result is that classification-via-sampling using
`text-curie-001` (a smaller GPT-3 model) performs worse than random guessing, while
CAPPr using `text-curie-001` is 80% accurate.

**Contamination notice**: I don't know whether the models used here were trained on any
COPA data. If they were, the absolute accuracies may be inflated. But as long as
contamination doesn't favor one method (CAPPr vs CVS) over the other, the difference
between their performances is still meaningful.

**Estimated run time**: ~1 min, plus up to ~6 min for the [chat
section](#evaluate-cvs-chat), which sends requests one at a time.

**Environment**: See the [Setup section in the
README](https://github.com/kddubey/cappr/#setup).

**Other**: You have to have an OpenAI API key stored in the environment variable
`OPENAI_API_KEY`. [Sign up here](https://openai.com/api/). This notebook will manually
ask you to give the go-ahead before incurring any costs. Running the whole notebook will
cost ya 30 cents.

**TODO**: analyze mispredictions.

[Load data](#load-data)

[Write prompt](#write-prompt)

[Run model](#run-model)

[Evaluate CVS](#evaluate-cvs)

[Evaluate CVS (chat)](#evaluate-cvs-chat)

[Evaluate question](#evaluate-question)

[Evaluate single-token](#evaluate-single-token)

In [1]:
from __future__ import annotations
import logging
import os
import sys
from typing import Literal, Sequence

import datasets as nlp_datasets
import pandas as pd

from cappr import Example
from cappr import openai

sys.path.insert(1, os.path.join(sys.path[0], ".."))
from utils import display_df, remove_prefix

In [2]:
# When hitting the OpenAI endpoints, we'll log any server errors
logging.basicConfig(
    level=logging.INFO,
    handlers=[logging.StreamHandler(stream=sys.stdout)],
    format="%(asctime)s :: %(name)s :: %(levelname)s :: " "%(message)s",
)
logger = logging.getLogger(__name__)

# Load data

For this MVP, let's evaluate on the [Choice of Plausible Alternatives (COPA)
task](https://people.ict.usc.edu/~gordon/copa.html). I picked this first b/c I read it
has multi-token labels, in some sense. It also looks cool.

The classification problem is to pick 1 of 2 alternatives which caused or resulted from
the premise. Here are two examples pulled from the website:

Example 1

> Premise: The man broke his toe. What was the CAUSE of this?
>
> Alternative 1: He got a hole in his sock.
>
> Alternative 2: He dropped a hammer on his foot.


Example 2

> Premise: I tipped the bottle. What happened as a RESULT?
>
> Alternative 1: The liquid in the bottle froze.
>
> Alternative 2: The liquid in the bottle poured out.

A classifier should predict Alternative 2 for Example 1, and Alternative 2 for Example
2.

The test set labels are hidden, so I'll score this zero-shot classifier on the train and validation sets. We'll be evaluating 5 methods, which is quite a few for only 500 examples. But I didn't tune much of anything.
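To get a feel for how much noise an accuracy estimate carries at this sample size, here's a rough sketch of the binomial standard error. It assumes the examples are independent, and the function name is mine, not part of `cappr`. The accuracy of 0.9 is a hypothetical input.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


# Hypothetical accuracy of 0.9 measured on 500 examples
se = accuracy_standard_error(0.9, 500)
print(f"standard error: {se:.3f}")
```

With a standard error a bit above 1 percentage point, differences of a point or two between methods are probably within noise.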

In [3]:
def load_super_glue(task_id: str, split: str):
    return pd.DataFrame(nlp_datasets
                        .load_dataset('super_glue', task_id, split=split))

df = (pd.concat((load_super_glue('copa', 'train'),
                 load_super_glue('copa', 'validation')))
      .reset_index(drop=True)) # the idx column is only unique w/in splits! fuhgetaboutit

In [4]:
len(df)

500

In [5]:
df.head()

| | premise | choice1 | choice2 | question | idx | label |
|---|---|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. | The sun was rising. | The grass was cut. | cause | 0 | 0 |
| 1 | The woman tolerated her friend's difficult beh... | The woman knew her friend was going through a ... | The woman felt that her friend took advantage ... | cause | 1 | 0 |
| 2 | The women met for coffee. | The cafe reopened in a new location. | They wanted to catch up with each other. | cause | 2 | 1 |
| 3 | The runner wore shorts. | The forecast predicted high temperatures. | She planned to run along the beach. | cause | 3 | 0 |
| 4 | The guests of the party hid behind the couch. | It was a surprise party. | It was a birthday party. | cause | 4 | 0 |


# Write prompt

A simple way to model COPA is to prompt an LM with (for Example 1):

```
The man broke his toe because 
```

and use the LM to estimate the probabilities of the 2 alternatives conditional on this
prompt. (See the **Example** section
[here](https://stats.stackexchange.com/q/601159/337906) for a full description of what
"estimate the probabilities" actually means.)

This method assumes GPT isn't miscalibrated in bad ways, as it relies entirely on the
comparison between averaged probabilities. Another potential issue is that any 2
alternatives are gonna have really low probabilities. As a result, discriminating
between alternatives may be, numerically and statistically, a bad idea. But that's why
this notebook is here: let's see if these issues significantly impact accuracy when
compared to classification via sampling (CVS). And even if they do, we could always
provide the alternatives in the prompt, as would be done w/ a sampling approach. That's
done in [Evaluate single-token](#evaluate-single-token).
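Here's a minimal sketch of the comparison between averaged probabilities. It assumes the score for each alternative is the exponentiated mean of its token log-probabilities, which are then normalized across the 2 alternatives. The helper names and the token log-probs are made up for illustration; they don't come from a real model.

```python
import math
from typing import List, Sequence


def avg_token_prob(token_logprobs: Sequence[float]) -> float:
    """Exponentiated mean of a completion's token log-probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores so that they sum to 1."""
    total = sum(scores)
    return [score / total for score in scores]


# Made-up token log-probs for the two alternatives of Example 1
logprobs_sock = [-4.1, -2.3, -5.0, -1.2, -3.3, -0.9]
logprobs_hammer = [-3.0, -1.1, -2.2, -0.8, -1.5, -0.4, -0.6]

scores = [avg_token_prob(logprobs_sock), avg_token_prob(logprobs_hammer)]
pred_probs_example = normalize(scores)
pred_class = pred_probs_example.index(max(pred_probs_example))
```

Averaging over tokens is what lets alternatives with different numbers of tokens be compared at all; a plain product of token probabilities would penalize the longer one.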

In [6]:
def _conjunction(question: Literal["cause", "effect"]):
    if question == "cause":
        return " because"
    elif question == "effect":
        return ", so"
    else:
        raise ValueError("question must be 'cause' or 'effect'. Got " f"{question}.")


def prompt(premise: str, question: Literal["cause", "effect"]):
    conjunction = _conjunction(question)
    return f'{premise.strip(". ")}{conjunction}'

In [7]:
df["prompt"] = [
    prompt(premise, question)
    for premise, question in zip(df["premise"], df["question"])
]

In [8]:
display_df(df, columns=["prompt", "choice1", "choice2", "label"])

| | prompt | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass because | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee because | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


Note: we need to lowercase the choices, since each one continues the prompt
mid-sentence.

# Run model

Note that for many SuperGLUE datasets, including COPA, the probability distribution over
classes (alternative 1, 2 for COPA) is uniform. So we'll use `prior=None`.
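For a dataset where the classes aren't balanced, a prior could be folded in. Here's a minimal sketch, assuming the prior simply reweights the per-class scores before normalizing. I believe that's the spirit of the `prior` argument, but treat it as an assumption. The function name and the numbers are made up.

```python
from typing import List, Optional, Sequence


def posterior(
    scores: Sequence[float], prior: Optional[Sequence[float]] = None
) -> List[float]:
    """Reweight per-class scores by a prior, then normalize to sum to 1."""
    if prior is None:
        # No prior given: treat every class as equally likely
        prior = [1 / len(scores)] * len(scores)
    weighted = [score * p for score, p in zip(scores, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


scores = [0.06, 0.25]  # hypothetical averaged probabilities for 2 choices
probs_uniform = posterior(scores)
probs_skewed = posterior(scores, prior=[0.9, 0.1])
```

With a uniform prior the second choice wins. With a prior that heavily favors the first class, the prediction flips.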

In [9]:
examples = [
    Example(
        prompt=record["prompt"],
        completions=(record["choice1"].lower(), record["choice2"].lower()),
        prior=None,
    )
    for record in df.to_dict("records")
]

In [10]:
len(examples)

500

We have 500 examples * 2 classes = 1000 OpenAI API requests

In [11]:
# $0.02
pred_probs = openai.classify.predict_proba_examples(
    examples, model="gpt-3.5-turbo-instruct", ask_if_ok=True
)


For COPA, the scoring metric is accuracy.

In [12]:
(pred_probs.argmax(axis=1) == df["label"]).mean()

0.9

To put this number in context, we'll evaluate zero-shot classification via sampling
(CVS) on `gpt-3.5-turbo-instruct`.

But first, let's see how zero-shot curie performs. Curie is a much smaller model.

In [13]:
# $0.03
pred_probs_curie = openai.classify.predict_proba_examples(
    examples, model="text-curie-001", ask_if_ok=True
)


In [14]:
(pred_probs_curie.argmax(axis=1) == df["label"]).mean()

0.802

TODO: diagnose these mispredictions. For example, are many caused by differing
completion lengths? It's possible that the average likelihood metric is getting thrown
off in those cases.

# Evaluate CVS

COPA isn't a great demo for this approach b/c there's a trivial way to transform
multi-token labels to single tokens: just point to each choice with a single letter!

For example:

```
The man broke his toe because
A. He got a hole in his sock.
B. He dropped a hammer on his foot.
Answer A or B.
```

This prompt is a multiple choice question. And it could probably work well for most of
the SuperGLUE tasks, because most of them are classification problems with just 2 or 3
classes.

In [15]:
def prompt_mc(
    premise: str, question: Literal["cause", "effect"], choice1: str, choice2: str
):
    return (
        f"{prompt(premise, question)}\n"
        f"A. {choice1}\n"
        f"B. {choice2}\n"
        "Answer A or B."
    )


df["prompt_mc"] = [
    prompt_mc(
        record["premise"], record["question"], record["choice1"], record["choice2"]
    )
    for record in df.to_dict("records")
]


display_df(df, columns=["prompt_mc", "label"])

| | prompt_mc | label |
|---|---|---|
| 0 | My body cast a shadow over the grass because A. The sun was rising. B. The grass was cut. Answer A or B. | 0 |
| 1 | The woman tolerated her friend's difficult behavior because A. The woman knew her friend was going through a hard time. B. The woman felt that her friend took advantage of her kindness. Answer A or B. | 0 |
| 2 | The women met for coffee because A. The cafe reopened in a new location. B. They wanted to catch up with each other. Answer A or B. | 1 |


(It turns out that GitHub doesn't render the newlines, but I promise they're there!)

In [16]:
# $0.03
choices = openai.api.gpt_complete(
    df["prompt_mc"],
    ask_if_ok=True,
    model="gpt-3.5-turbo-instruct",
    max_tokens=5,  # need to allow for "\n\nAnswer A"
)


In [17]:
completions_mc = [choice["text"] for choice in choices]

In [18]:
def process_completion(
    completion: str,
    class_chars: Sequence[str],
    prefix_remove: str = "Answer ",
    strip_chars: str = " \n.",
    default=-1,
) -> int:
    if any(len(class_char) != 1 for class_char in class_chars):
        raise ValueError("Elements of class_chars must be a single character.")
    completion = remove_prefix(completion, prefix_remove)
    completion_stripped = completion.strip(strip_chars)
    if not completion_stripped:
        return default
    completion_char_lower = completion_stripped[0].lower()
    class_chars_lower = [class_char.lower() for class_char in class_chars]
    try:
        return class_chars_lower.index(completion_char_lower)
    except ValueError:
        return default
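One thing to watch in `process_completion`: the comment in the sampling cell says completions can look like `\n\nAnswer A`. If `remove_prefix` only matches at the very start of the string (an assumption; its implementation lives in `utils` and isn't shown here), the leading newlines would keep the prefix from being removed. The first character after stripping would then be the `A` in `Answer`, whichever letter was chosen. Here's a sketch of a variant that strips first. `parse_choice` is a hypothetical name, and the accuracies reported in this notebook were not recomputed with it.

```python
from typing import Sequence


def parse_choice(
    completion: str,
    class_chars: Sequence[str] = ("A", "B"),
    prefix: str = "Answer ",
    strip_chars: str = " \n.",
    default: int = -1,
) -> int:
    """Map a sampled completion to a class index, or default if it can't be."""
    # Strip whitespace and periods BEFORE looking for the prefix
    text = completion.strip(strip_chars)
    if text.startswith(prefix):
        text = text[len(prefix):].strip(strip_chars)
    if not text:
        return default
    first_char = text[0].lower()
    class_chars_lower = [class_char.lower() for class_char in class_chars]
    if first_char in class_chars_lower:
        return class_chars_lower.index(first_char)
    return default


parsed = [parse_choice(c) for c in ["\n\nAnswer B", " A.", "The man", ""]]
```

Like the original, this only looks at the first character, so a completion such as "Because..." would still be read as B.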

In [19]:
class_chars = ("A", "B")

In [20]:
pred_classes_cvs = [
    process_completion(completion, class_chars) for completion in completions_mc
]

Check that most of the sampled completions could be mapped to a label 0 or 1:

In [21]:
(pd.Series(pred_classes_cvs) != -1).mean()

0.98

In [22]:
(pred_classes_cvs == df["label"]).mean()

0.908

This hovers between 0.89 - 0.92 in repeated runs.

Let's see how CVS w/ `text-curie-001` performs. Hypothesis: shouldn't be too bad given
the curie result above.

In [23]:
# $0.04
choices_curie = openai.api.gpt_complete(
    df["prompt_mc"], ask_if_ok=True, model="text-curie-001", max_tokens=5
)


In [24]:
completions_mc_curie = [choice["text"] for choice in choices_curie]
pred_classes_cvs_curie = [
    process_completion(completion, class_chars) for completion in completions_mc_curie
]

Let's see how many of these sampled completions are actually "valid", i.e., in the label set:

In [25]:
pred_classes_cvs_curie = pd.Series(pred_classes_cvs_curie, index=df.index)
(pred_classes_cvs_curie != -1).mean()

0.614

In [26]:
(pred_classes_cvs_curie == df["label"]).mean()

0.298

Ouch, much worse than random guessing. Hypothesis very rejected. Let's see how often the
valid completions are accurate.

In [27]:
_mask_valid = pred_classes_cvs_curie != -1
(pred_classes_cvs_curie[_mask_valid] == df.loc[_mask_valid, "label"]).mean()

0.48534201954397393
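The three numbers above fit together: overall accuracy is the fraction of valid completions times the accuracy among them, because an invalid completion (`-1`) can never match a label. A quick arithmetic check, using only the outputs printed above:

```python
n = 500  # number of examples
valid_rate = 0.614  # fraction of curie completions mapped to A or B
accuracy_on_valid = 0.48534201954397393  # accuracy among valid completions

n_valid = round(n * valid_rate)
n_correct = round(n_valid * accuracy_on_valid)
overall_accuracy = n_correct / n
print(n_valid, n_correct, overall_accuracy)
```

So the sub-random overall accuracy mostly comes from invalid completions. Among the valid ones, curie is roughly at chance.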

# Evaluate CVS (chat)

How does the chat completion endpoint perform on COPA? I think it makes sense to use the
same prompt as above.

In [28]:
system_prompt_copa = (
    "Identify the cause or effect of a premise given two choices. Each choice "
    "is identified by a letter, A or B.\n"
    "Respond only with the letter corresponding to the correct cause or effect."
)
# getting this right is crucial

In [29]:
# $0.04
# can take a while, ~6 minutes!
# idk and idc yet how to batch for the chat endpoint. For correctness, I'll just send
# texts 1-by-1
choices_chat = openai.api.gpt_chat_complete(
    df["prompt_mc"], ask_if_ok=True, max_tokens=5, system_msg=system_prompt_copa
)


2023-09-27 11:20:23,015 :: cappr.openai.api :: INFO :: openai error: The server is overloaded or not ready yet.
2023-09-27 11:20:23,016 :: cappr.openai.api :: INFO :: Try 1. Sleeping for 10 sec.
2023-09-27 11:22:17,213 :: cappr.openai.api :: INFO :: openai error: The server is overloaded or not ready yet.
2023-09-27 11:22:17,214 :: cappr.openai.api :: INFO :: Try 1. Sleeping for 10 sec.


In [30]:
completions_chat = pd.Series(
    [choice["message"]["content"] for choice in choices_chat], index=df.index
)

pred_classes_chat = pd.Series(
    [process_completion(completion, class_chars) for completion in completions_chat],
    index=df.index,
)

As usual, we need to check that completions are valid.

In [31]:
mask_valid = pred_classes_chat != -1
mask_valid.mean()

1.0

What do invalid completions look like?

In [32]:
completions_chat[~mask_valid]

Series([], dtype: object)

What's the accuracy on all completions?

In [33]:
(pred_classes_chat == df["label"]).mean()

0.912

What's the accuracy on *valid* completions?

In [34]:
(pred_classes_chat[mask_valid] == df.loc[mask_valid, "label"]).mean()

0.912

# Evaluate question

There are different ways to format a prompt-completion problem. Since `gpt-3.5-turbo-instruct` was trained w/ RLHF, it's worth asking whether a more RLHF-type of prompt would work better. Let's see how performance changes by formatting the problem as a question:

```
The man broke his toe. What was the cause of this? 
```

In [35]:
def prompt_question(premise: str, question: Literal["cause", "effect"]):
    if question == "cause":
        question_ = "What was the cause of this?"
    elif question == "effect":
        question_ = "What happened as a result?"
    else:
        raise ValueError("question must be 'cause' or 'effect'. Got " f"{question}.")
    return f"{premise} {question_}"


df["prompt_question"] = [
    prompt_question(premise, question)
    for premise, question in zip(df["premise"], df["question"])
]


display_df(df, columns=["prompt_question", "choice1", "choice2", "label"])

| | prompt_question | choice1 | choice2 | label |
|---|---|---|---|---|
| 0 | My body cast a shadow over the grass. What was the cause of this? | The sun was rising. | The grass was cut. | 0 |
| 1 | The woman tolerated her friend's difficult behavior. What was the cause of this? | The woman knew her friend was going through a hard time. | The woman felt that her friend took advantage of her kindness. | 0 |
| 2 | The women met for coffee. What was the cause of this? | The cafe reopened in a new location. | They wanted to catch up with each other. | 1 |


According to the [docs](https://platform.openai.com/docs/guides/fine-tuning/data-formatting), best practice is to separate prompts and completions using this string:

In [36]:
openai.api.end_of_prompt

'\n\n###\n\n'
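To make the role of this separator concrete, here's a sketch of the text that ends up being scored. It assumes the prompt, the separator, and the completion are simply concatenated, and that the default separator is a single space. `build_text` is a hypothetical helper, not a `cappr` function.

```python
def build_text(prompt: str, completion: str, end_of_prompt: str = " ") -> str:
    """Join a prompt and a completion with a separator (assumed behavior)."""
    return prompt + end_of_prompt + completion


# "because" formulation: the completion continues the sentence
text_because = build_text(
    "The man broke his toe because", "he dropped a hammer on his foot."
)

# Question formulation: the completion is a standalone answer
text_question = build_text(
    "The man broke his toe. What was the cause of this?",
    "He dropped a hammer on his foot.",
    end_of_prompt="\n\n###\n\n",
)
```

Only the completion's tokens would be used for the probability estimate; the separator just changes the context they're conditioned on.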

In [37]:
examples_question = [
    Example(
        prompt=record["prompt_question"],
        completions=(record["choice1"], record["choice2"]),
        prior=None,
        end_of_prompt=openai.api.end_of_prompt,
    )
    for record in df.to_dict("records")
]

In [38]:
# $0.03
pred_probs_question = openai.classify.predict_proba_examples(
    examples_question, model="gpt-3.5-turbo-instruct", ask_if_ok=True
)


In [39]:
(pred_probs_question.argmax(axis=1) == df["label"]).mean()

0.818

Bad!

# Evaluate single-token

Let's see how the single-token transformation performs for COPA. Based on the [Evaluate CVS](#evaluate-cvs) result, my hypothesis is that it'll perform slightly better than the [multi-token approach](#run-model). I wouldn't be bummed if it did: if I could control the backend, there'd still be a usability and computational benefit to returning probabilities for A and B instead of sampling from all possible token sequences.

In [40]:
examples_mc = [
    Example(
        prompt=record["prompt_mc"],
        completions=("A", "B"),
        prior=None,
        end_of_prompt=openai.api.end_of_prompt,
    )
    for record in df.to_dict("records")
]

In [41]:
# $0.05
# If I could control the backend, the cost would be $0.05/2 = $0.025
pred_probs_mc = openai.classify.predict_proba_examples(
    examples_mc, model="gpt-3.5-turbo-instruct", ask_if_ok=True
)


In [42]:
(pred_probs_mc.argmax(axis=1) == df["label"]).mean()

0.958

Conclusion: performs about 6 percentage points better than the [multi-token
approach](#run-model) (0.958 vs 0.9). I wonder how much better it'd be if we could use
the `<|endoftext|>` special token, which separates prompts from completions. That could
be impactful b/c that was used during training.

It costs twice as much. But that's an artifact of the way the endpoint works. For
prompts like this, it takes 1 `model()` call to give us the data we need: the
probability distribution of (single tokens) `'A'` and `'B'` conditional on the prompt.
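Here's a sketch of what that single call would give us, assuming the backend exposes the next-token log-probabilities after the prompt. The function name and the log-probs are made up; this isn't how the OpenAI endpoint is called in this notebook.

```python
import math
from typing import Dict, List, Sequence


def class_probs_from_next_token(
    next_token_logprobs: Dict[str, float],
    class_tokens: Sequence[str] = ("A", "B"),
) -> List[float]:
    """Renormalize next-token probabilities over the class tokens only."""
    # A class token missing from the dict gets probability 0
    probs = [
        math.exp(next_token_logprobs.get(token, -math.inf))
        for token in class_tokens
    ]
    total = sum(probs)
    if total == 0:
        raise ValueError("None of the class tokens have a log-probability.")
    return [prob / total for prob in probs]


# Hypothetical next-token log-probs after a multiple choice prompt
next_token_logprobs = {"A": -0.2, "B": -2.5, "\n": -3.0}
probs_ab = class_probs_from_next_token(next_token_logprobs)
```

Tokens outside the label set (like `"\n"` here) are dropped before normalizing, so every example gets a valid prediction.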