### Practice: Large Language Models and Their Implications
<!-- ![img](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4470ce74-e595-4750-92a5-5f21f040df6d_577x432.jpeg) -->
![img](https://i.imgur.com/QGYa2J8.jpeg)

In this notebook, you're gonna play with some of the largest language models on the Internet.

_Based on works of: Tim Dettmers, Ruslan Svirschevsky, Artem Chumachenko, Younes Belkada, Felix Marty, Yulian Gilyazev, Gosha Zolotov, Andrey Ishutin,  Elena Volf, Artemiy Vishnyakov, Svetlana Shirokovskih.

### Part 1: prompt engineering (4 points total)

In the assignment, we'll use public APIs that host the 100B+ models for inference. Your task is to prompt-engineer the model into solving a few tasks for you.


__Which API?__ You are free to use any publicly available API for general LM -- as long as it's __not a chat assistant__. So, gpt 3.5 is fine, but chatGPT is not. Here's a few options:

- BLOOM API - [bigscience/bloom](https://huggingface.co/bigscience/bloom) (on the right; recommended)
- OpenAI API (via VPN) - [openai.com/api](https://openai.com/api/)
- AI21 Jurrasic API - [ai21.com](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1)

These APIs may require you to create a (free) account on their platform. Please note that some APIs also have paid subscriptions. __You do not need to pay them__, this assignment was designed to be solved using free-tier subscriptions. If no APIs work for you, you can also solve these tasks with the 6.7B model that you will find later in this notebook - but this will make the tasks somewhat harder.

__Quests:__ you will need to solve 4 problems. For each one, please attach a short __description__ of your solution and a __screenshot__ from the API you use. _[If you use python APIs, show your python code with outputs]_

__Example:__ Tony is talking to Darth Vader ([BLOOM API](https://huggingface.co/bigscience/bloom)). Black text is written manually, blue text is generated.
<hr>

![img](https://i.imgur.com/a1QhKF7.png)
<hr>

__It is fine to roll back a few times,__ e.g. in the example above, the model first generated Vader lines twice in a row, and we rolled that back. However, if you need more than 1-2 rollbacks per session, you should probably try a different prompt.

__Task 1 (1 pt):__ arange a conversation between any two of the following:

- a celebrity or politician of your choice
- any fictional character (except Darth Vader)
- yourself

Compare two setups: a) you prompt with character names only b) you supply additional information (see example).

In [5]:
import getpass
API_TOKEN = getpass.getpass("input api_token:")

input api_token: ········


In [6]:
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
	response = requests.post(API_URL, headers=headers, json=payload)
	return response.json()

In [31]:
conversation = query({
	"inputs": """Kanye West and Patrick from Spongebob 
 are discussing large language models. Kanye:""",
    "parameters": {"do_sample": True, "max_new_tokens": 100, "temperature": 1.0}
})
conversation

[{'generated_text': 'Kanye West and Patrick from Spongebob \n are discussing large language models. Kanye:\n"What do you type on the Chomsky Hierarchy?" \n Patrick: "Church Solomon Ventura Clavin Elon Musk Rosetta Greece.... Church Solomon Ventura...."\n"Church Solomon Ventura.... goal"\n\nExamples taken from\nhttps://cbgbank.org/itsalpahchallenge/'}]

In [30]:
conversation = query({
	"inputs": """Kanye West and Patrick from Spongebob 
 are discussing large language models, and they trying to rap about llms. Kanye:""",
    "parameters": {"do_sample": True, "max_new_tokens": 100, "temperature": 1.0}
})
conversation

[{'generated_text': "Kanye West and Patrick from Spongebob \n are discussing large language models, and they trying to rap about llms. Kanye: And God made neck models \n and neck models are amazingly enormous. Spongebob: Rap don't easy, it's easy to move it (beat) \n though.\n\nThis is the code I understand. However I am confused on a few things.\n\nShould the two be valid or not?\nWhat would be the best approach to this problem? Should I even be using loss and the associated functions?\nShould I be using a Pipeline?\nWhat is the best metric to use? BLEU or other? \nIs"}]

__Please choose task 2a or 2b (1pt)__ depending on your model (you can do both, but you will be awarded points for one of these two tasks).

__Task 2a: (for BLOOM or other multilingual model)__ zero-shot translation. Take the first verse of [Edgar Allan Poe's "Raven"](https://www.poetryfoundation.org/poems/48860/the-raven) and __translate it into French.__ (You are free to use any other text of at least the same size)

Original text: ```
Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore—
    While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
“’Tis some visitor,” I muttered, “tapping at my chamber door—
            Only this and nothing more.”
```

Verify your translation by converting french back into english using a public machine translation service.

__Task 2b: (non-BLOOM):__ toxicity classification for [SetFit/toxic_conversations](https://huggingface.co/datasets/SetFit/toxic_conversations). Make the model solve binary classification (toxic vs not toxic) in the few shot mode. For few-shot examples, use 2-3 toxic and 2-3 non-toxic non-toxic examples. Measure accuracy on at least 25 samples. You may need to try several different prompts before you find the one that works.

In [198]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

MODEL_NAME = "upstage/SOLAR-10.7B-Instruct-v1.0"

def generate(model, tokenizer, prompt, generation_config):
    data = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    data = {k: v.to(model.device) for k, v in data.items()}
    output_ids = model.generate(
        **data,
        generation_config=generation_config
    )[0]
    output_ids = output_ids[len(data["input_ids"][0]):]
    output = tokenizer.decode(output_ids, skip_special_tokens=True)
    return output.strip()
    
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

In [56]:
generation_config = GenerationConfig(
  bos_token_id = 1,
  eos_token_id = 2,
  pad_token_id = 2,
  use_cache = False,
  max_new_tokens = 3,
)

In [120]:
from datasets.dataset_dict import DatasetDict
import pandas as pd
from tqdm import tqdm_notebook
from sklearn.metrics import classification_report

In [1]:
from datasets import load_dataset
dataset = load_dataset("SetFit/toxic_conversations")



Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/617M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1754874
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 50000
    })
})


In [100]:
def get_items(dataset: DatasetDict, 
              target: int,
              split_name: str, 
              num: int):
    items = []
    for i in range(len(dataset[split_name])):
        item = dataset['train'][i]
        if item['label'] == target:
            items.append(item)
        if len(items) == num:
            break
    return items

def parse_model_answer(answer: str):
    return answer.split('>', maxsplit=1)[0]

In [130]:
negative_examples = get_items(dataset=dataset, target=0, split_name='train', num=3)
positive_examples = get_items(dataset=dataset, target=1, split_name='train', num=3)
negative_examples = [f'text - <{item["text"]}>, label_answer - <not toxic>' for item in negative_examples]
positive_examples = [f'text - <{item["text"]}>, label_answer - <toxic>' for item in positive_examples]
prompt = '\n'.join(negative_examples) + '\n'.join(positive_examples)
prompt += '\ntext - <{test_text}>, label_answer - <'

In [135]:
test_positive_examples = get_items(dataset=dataset, target=1, split_name='test', num=20)
test_negative_examples = get_items(dataset=dataset, target=0, split_name='test', num=20)

In [136]:
label_texts = []
model_answers = []
texts = []
test_examples = test_positive_examples + test_negative_examples
for item in tqdm_notebook(test_examples, total=len(test_examples)):
    test_text = item['text']
    label_text = item['label_text']
    
    model_answer = generate(model, tokenizer, prompt.format(test_text=test_text), generation_config=generation_config)
    model_answer = parse_model_answer(model_answer)

    model_answers.append(model_answer)
    label_texts.append(label_text)
    texts.append(test_text)

test_df = pd.DataFrame({
    'text': texts,
    'label_text': label_texts,
    'model_answer': model_answers
})
test_df

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for item in tqdm_notebook(test_examples, total=len(test_examples)):


  0%|          | 0/40 [00:00<?, ?it/s]

Unnamed: 0,text,label_text,model_answer
0,"That was perhaps the most hackneyed vomit, to ...",toxic,toxic
1,Can she not be dismissed for incompetence and ...,toxic,toxic
2,"Robben Island won't be big enough, to hold all...",toxic,toxic
3,"But he is, Blanche! He is a racist!",toxic,not toxic
4,Why? because we are still figuring out a way t...,toxic,not toxic
5,"The guy's a damn cop, so what do you expect?",toxic,not toxic
6,Wow!\nThe bigotry of Butts' PMO minions define...,toxic,toxic
7,What is most important is that he be independe...,toxic,toxic
8,Just grow some balls and protect what's your w...,toxic,toxic
9,Everybody wants a handout.......pathetic,toxic,toxic


In [137]:
metrics = classification_report(test_df.label_text, test_df.model_answer, output_dict=True)
metrics = pd.DataFrame(metrics).T.sort_values(by='support', ascending=False)
metrics

Unnamed: 0,precision,recall,f1-score,support
weighted avg,0.64245,0.625,0.613153,40.0
macro avg,0.64245,0.625,0.613153,40.0
not toxic,0.592593,0.8,0.680851,20.0
toxic,0.692308,0.45,0.545455,20.0
accuracy,0.625,0.625,0.625,0.625



__Task 3 (1pt):__ create a prompt and few-shot examples tha make the model __change the gender pronouns__ of the main actor in a given sentence in any direction of your choice. E.g. the doctor took off _his_ mask <-> the doctor took of _her_ mask.


In [138]:
prompt = """
the doctor took off his mask <-> the doctor took of her mask;
the teacher forgot to take his diary <-> the teacher forgot to take her diary;
the programmer did not cover his code with tests <-> the programmer did not cover her code with tests;
he is learning llm <->
"""

In [141]:
generation_config = GenerationConfig(
  bos_token_id = 1,
  eos_token_id = 2,
  pad_token_id = 2,
  use_cache = False,
  max_new_tokens = 30,
)
generate(model, tokenizer, prompt.format(test_text=test_text), generation_config=generation_config)

'she is learning llm.\n\nIn each of these examples, the correct pronoun to use is "his" or "him" for the'

__Task 4 (1pt):__ write a prompt and supply examples such that the model would __convert imperial units to metric units__ (miles -> kilometers; mph -> kph). More specifically, the model should rewrite a given sentence and replace all imperial units with their metric equivalents. After it works with basic distances and speed, try to find complicated examples where it does *not* work.

Please note that 1 mile is not equal to 1 km :)

In [142]:
prompt = """
imperial: 1 mile.
metric: 1.6 kms.

imperial: 2 miles.
metric: 3.2 kms.

imperial: 3 miles.
metric: 4.8 kms.

imperial: Tom's house can't be more'n a mile from here.
metric:  Tom's house can't be more'n a 1.6 kms from here.

imperial: T Lazy 7 Ranch, (970) 925-4614, runs rides from its place 4 miles outside Aspen.
metric:
"""

In [144]:
generation_config = GenerationConfig(
  bos_token_id = 1,
  eos_token_id = 2,
  pad_token_id = 2,
  use_cache = False,
  max_new_tokens = 50,
)
generate(model, tokenizer, prompt.format(test_text=test_text), generation_config=generation_config)

'T Lazy 7 Ranch, (970) 925-4614, runs rides from its place 6.4 kms outside Aspen.'

In [147]:
prompt = """
imperial: 1 mile.
metric: 1.6 kms.

imperial: 2 miles.
metric: 3.2 kms.

imperial: 3 miles.
metric: 4.8 kms.

imperial: Tom's house can't be more'n a mile from here.
metric:  Tom's house can't be more'n a 1.6 kms from here.

imperial: Want to join us at 1000 mph?
metric:
"""

generation_config = GenerationConfig(
  bos_token_id = 1,
  eos_token_id = 2,
  pad_token_id = 2,
  use_cache = False,
  max_new_tokens = 50,
)
generate(model, tokenizer, prompt.format(test_text=test_text), generation_config=generation_config)

'Want to join us at 1609.34 km/h?'

In [148]:
prompt = """
imperial: 1 mile.
metric: 1.6 kms.

imperial: 2 miles.
metric: 3.2 kms.

imperial: 3 miles.
metric: 4.8 kms.

imperial: Tom's house can't be more'n a mile from here.
metric:  Tom's house can't be more'n a 1.6 kms from here.

imperial: Want to join us at thousand mph?
metric:
"""

generation_config = GenerationConfig(
  bos_token_id = 1,
  eos_token_id = 2,
  pad_token_id = 2,
  use_cache = False,
  max_new_tokens = 50,
)
generate(model, tokenizer, prompt.format(test_text=test_text), generation_config=generation_config)

'Want to join us at 1600 kph (kilometers per hour)?'

In [149]:
prompt = """
imperial: 1 mile.
metric: 1.6 kms.

imperial: 2 miles.
metric: 3.2 kms.

imperial: 3 miles.
metric: 4.8 kms.

imperial: Tom's house can't be more'n a mile from here.
metric:  Tom's house can't be more'n a 1.6 kms from here.

imperial: Want to join us at 1,000 mph?
metric:
"""

generation_config = GenerationConfig(
  bos_token_id = 1,
  eos_token_id = 2,
  pad_token_id = 2,
  use_cache = False,
  max_new_tokens = 50,
)
generate(model, tokenizer, prompt.format(test_text=test_text), generation_config=generation_config)

'Want to join us at 1,609.34 km/h?'

### Part 2: local inference

Now, let's try and load the strongest model that can fit a typical Colab GPU (T4 with 16 GB as of spring 2023).

Our best candidates are the smaller versions of the best performing open source models:
- 7 Bn parameters version of [LLaMA](https://arxiv.org/pdf/2302.13971.pdf) - best for spring 2023, released by Facebook
- 7 Bn parameters version of [Falcon](https://falconllm.tii.ae) - close competitor to Llama, released in May 2023 by [Technology Innovation Institute of UAE](https://www.tii.ae).
- 6.7 Bn parameters version of [OPT](https://arxiv.org/abs/2205.01068) - top choice in this nomination in 2022, released by Facebook.

Beware: while these models are smaller than the ones in API, they're still over 60x larger than the BERT we played with last time. The code below will *just barely* fit into memory, so make sure you don't have anything else loaded. Sometimes you may need to restart runtime for the code to work.

It's a good time to restart your kernel and switch to GPU! (Runtime -> Change runtime type)
<center><img src="https://i.imgur.com/OOfDYzJ.png" width=240px></center>

In [1]:
#%pip install --quiet bitsandbytes==0.41.1 transformers==4.34.1 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 auto-gptq==0.4.2
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import bitsandbytes as bnb
from tqdm.auto import tqdm, trange
import random
assert torch.cuda.is_available(), "you need cuda for this part"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
model_name = 'TheBloke/Llama-2-13B-GPTQ'


# loading Llama tokenizer ...
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name, device_map=device)
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"  
# ... and the model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    offload_state_dict=True,
)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [7]:
random.seed(1)
torch.manual_seed(1337)
eval_data = ['tmp', 'tmp' * 5]
example = eval_data[0]
with torch.no_grad():
    tokens = tokenizer(example, return_tensors='pt', padding=True).to(device)
    output_tokens = model.generate(**tokens, max_new_tokens=150, do_sample=True, top_p=0.9)
    
    torch.manual_seed(1337)
    batch_size = 4
    for batch_idx in range(0, len(eval_data), batch_size):
        current_batch = eval_data[batch_idx:batch_idx+batch_size]
        tokens_batch = tokenizer(eval_data, return_tensors='pt', padding=True).to(device)
        output_tokens_batch = model.generate(**tokens_batch, max_new_tokens=150, do_sample=True, top_p=0.9)
        break

In [8]:
single_answer = tokenizer.decode(output_tokens[0].cpu(), skip_special_tokens=True)
batch_answer = tokenizer.decode(output_tokens_batch[0].cpu(), skip_special_tokens=True)
assert single_answer == batch_answer

AssertionError: 

In [9]:
print(single_answer)

tmp=${tmp%/}
 tmp="${tmp%\.}"
 if [ ! -f $tmp ]; then
  echo "Downloading: $tmp"
  echo -n "   "
  wget --recursive=3 --no-verbose -c --timeout 300 $1 --output-document $tmp
 fi
}

download_script_archive() {
  script_archive=$1
  script_archive=${script_archive%.tar.gz}
  local archive_url=http://www.hpcug.ac.at/hpcc/archives/hpcc_scripts/$script_archive
  download $archive_url $script_archive



In [10]:
print(batch_answer)

tmp=${tmp%/}
 tmp="${tmp%\.}"
 if [ ! -f $tmp ]; then
  echo "Downloading: $tmp ..."
  wget $src -O $tmp
 fi
}

download() {
 for url in $urlList; do
  download_file $url
 done
}

## -----------------------------
## Install required packages

for pkg in $pkgsList; do
 echo -e "\nInstalling package: $pkg ..."
 apt-get install $pkg
done

## -----------------------------
## Add extra repo

cat > /etc/apt/sources.list.d/cuda.list


In [4]:
tokenizer.padding_side = "left"  

In [5]:
random.seed(1)
torch.manual_seed(1337)
eval_data = ['tmp', 'tmp' * 5]
example = eval_data[0]
with torch.no_grad():
    tokens = tokenizer(example, return_tensors='pt', padding=True).to(device)
    output_tokens = model.generate(**tokens, max_new_tokens=150, do_sample=True, top_p=0.9)
    
    torch.manual_seed(1337)
    batch_size = 4
    for batch_idx in range(0, len(eval_data), batch_size):
        current_batch = eval_data[batch_idx:batch_idx+batch_size]
        tokens_batch = tokenizer(eval_data, return_tensors='pt', padding=True).to(device)
        output_tokens_batch = model.generate(**tokens_batch, max_new_tokens=150, do_sample=True, top_p=0.9)
        break



In [6]:
single_answer = tokenizer.decode(output_tokens[0].cpu(), skip_special_tokens=True)
batch_answer = tokenizer.decode(output_tokens_batch[0].cpu(), skip_special_tokens=True)
assert single_answer == batch_answer

AssertionError: 

In [7]:
print(single_answer)

tmp=${tmp%/}
 tmp="${tmp%\.}"
 if [ ! -f $tmp ]; then
  echo "Downloading: $tmp ..."
  wget $src -O $tmp
 fi
}

download() {
 for url in $urlList; do
  download_file $url
 done
}

## -----------------------------
## Install required packages

for pkg in $pkgsList; do
 echo -e "\nInstalling package: $pkg"
 yum -y install $pkg
done

## -----------------------------
## Add extra repo

yum -y install yum-utils
rpm -i https://www.


In [8]:
print(batch_answer)

tmp=${tmp%/}
 tmp="${tmp%\.}"
 if [ ! -f $tmp ]; then
  echo "Downloading: $tmp"
  echo -n "   "
  wget --recursive=3 --no-verbose -c --timeout 300 $1 --output-document $tmp
 fi
}

download_script_archive() {
  script_archive=$1
  script_archive=${script_archive%.tar.gz}
  local archive_url=http://www.hpcug.ac.at/hpcc/archives/hpcc_scripts/$script_archive
  download $archive_url $script_archive



In [13]:
import numpy as np
import time
for use_cache in (True, False):
  times = []
  for _ in range(10):  # measuring 10 generations
    start = time.time()
    model.generate(**tokenizer("random prompt", return_tensors="pt").to(device), use_cache=use_cache, max_new_tokens=100)
    times.append(time.time() - start)
  print(f"{'with' if use_cache else 'without'} KV caching: {round(np.mean(times), 3)} +- {round(np.std(times), 3)} seconds")

with KV caching: 4.903 +- 0.015 seconds
without KV caching: 6.131 +- 0.068 seconds


## Text generation

**Comparison of strategies for language model text generation:**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Introduces randomness in the word selection. A higher temperature leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Selects the next word from a truncated vocabulary, the "nucleus" of words that have a cumulative probability exceeding a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing the log probabilities by the sequence length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)

### Generation with HuggingFace

In [7]:
prompt = 'The first discovered martian lifeform looks like'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
print("Input batch (encoded):", batch)

output_tokens = model.generate(**batch, max_new_tokens=64, do_sample=True, temperature=0.8)
# greedy inference:                                        do_sample=False)
# beam search for highest probability:                     num_beams=4)

print("\nOutput:", tokenizer.decode(output_tokens[0].cpu()))

Input batch (encoded): {'input_ids': tensor([[    1,   450,   937, 10943, 14436,   713,  2834,   689,  3430,   763]],
       device='cuda:3'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:3')}

Output: <s>The first discovered martian lifeform looks like it might be a fungus
NASA's Perseverance rover on Mars sent back some of the first images of life we've ever seen. A tiny tube-like structure (left) was seen poking out of a Martian surface rock. Here's a close-up


#### Low-level code for text generation

In [8]:
prompt = "Moscow is the capital of"
# prompt = "Skippy, a young android, likes to dream about electric"

print(prompt, '\n')

voc = tokenizer.get_vocab()
voc_rev = {v:k for k, v in voc.items()}  # reverse vocab for decode

for i in range(10):
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
    logits = model.forward(**inputs).logits[0, -1, :]
    probs = torch.nn.functional.softmax(logits, dim=-1)
    next_token_id = torch.multinomial(probs.flatten(), num_samples=1)

    next_token = tokenizer.decode(next_token_id)
    prompt += next_token

    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    top_tokens = sorted_indices[:5]
    print(f"Step #{i} candidates:")
    for t, p in zip (top_tokens, sorted_probs):
        t = voc_rev[t.item()]
        print(f"{t:<10}: {p:.4f} ")

    print(f'\nChosen token: {next_token}', end='\n\n', flush=True)

Moscow is the capital of 

Step #0 candidates:
▁Russia   : 0.7616 
▁the      : 0.1795 
▁Russian  : 0.0218 
▁a        : 0.0059 
▁not      : 0.0022 

Chosen token: Russia

Step #1 candidates:
.         : 0.3229 
,         : 0.3204 
▁and      : 0.1840 
and       : 0.0557 
<0x0A>    : 0.0080 

Chosen token: 


Step #2 candidates:
M         : 0.1701 
The       : 0.0940 
Russ      : 0.0773 
It        : 0.0267 
What      : 0.0177 

Chosen token: In

Step #3 candidates:
▁         : 0.1640 
▁the      : 0.1614 
▁Moscow   : 0.1370 
▁Russia   : 0.0857 
▁which    : 0.0346 

Chosen token: fact

Step #4 candidates:
,         : 0.4379 
▁Moscow   : 0.0869 
▁it       : 0.0761 
▁Russia   : 0.0540 
▁the      : 0.0398 

Chosen token: Russia

Step #5 candidates:
is        : 0.2620 
▁is       : 0.2462 
<0x0A>    : 0.1815 
has       : 0.0344 
▁has      : 0.0272 

Chosen token: has

Step #6 candidates:
▁the      : 0.1540 
▁a        : 0.1084 
▁         : 0.1067 
▁two      : 0.0942 
▁been     : 0.0599 

Chosen t

**Task 5: write code for nucleus sampling generation (2 points)**:

Use the `nucleus_sampling()` template below. Look at the detailed generation code above for inspiration. __Please do not use model.generate__.

**Bonus task: write code for beam search (3 bonus points)**

In [9]:
from typing import Tuple, List
def nucleus_sampling(model, tokenizer, prompt: str, prob: float = 0.5) -> Tuple[str, List[str]]:
    """generates the next token from the nucleus of tokens with cumulative probability up to param:prob"""
    
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
    logits = model.forward(**inputs).logits[0, -1, :]
    probs = torch.nn.functional.softmax(logits, dim=-1)
    
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    cumsum = torch.cumsum(sorted_probs, dim=-1) <= prob

    max_nucleus_ids = cumsum.int().argmin()

    if max_nucleus_ids.item() == 0:
        max_nucleus_ids = torch.tensor(1)

    new_probs = sorted_probs[:max_nucleus_ids]

    normalized_probs = new_probs / new_probs.sum()
    
    next_token_id = torch.multinomial(normalized_probs.flatten(), num_samples=1)

    next_token_id = sorted_indices[next_token_id]
    
    sampled_token = tokenizer.decode(next_token_id)

    possible_tokens = [tokenizer.decode(idx) for idx in sorted_indices[:max_nucleus_ids]]

    # sampled_token should be a string token that was generated
    # possible_tokens should be a list of all tokens that have non-zero probability
    return sampled_token, possible_tokens

In [10]:
# Tests for nucleus sampling
test_prompt = "Elbrus is the highest"
next_token, possible_tokens = nucleus_sampling(model, tokenizer, test_prompt, prob=0.9)
print(test_prompt, next_token, possible_tokens)
assert next_token in possible_tokens
assert 3 <= len(possible_tokens) <= 3
assert sorted(possible_tokens) == ['mountain', 'peak', 'point']

test_prompt = "Large language models can learn to"
next_token, possible_tokens = nucleus_sampling(model, tokenizer, test_prompt, prob=0.4)
print(test_prompt, next_token, possible_tokens)
assert next_token in possible_tokens
assert sorted(possible_tokens) == ['be', 'communicate', 'do', 'generate', 'perform', 'predict', 'speak', 'write']
assert len(possible_tokens) == 8

Elbrus is the highest mountain ['peak', 'mountain', 'point']
Large language models can learn to generate ['generate', 'write', 'perform', 'do', 'speak', 'be', 'predict', 'communicate']


### Part 3: Chain-of-thought prompting (4 points total)

![img](https://github.com/kojima-takeshi188/zero_shot_cot/raw/main/img/image_stepbystep.png)

---



In [17]:
import json
import random
import locale; locale.getpreferredencoding = lambda: "UTF-8"
!wget https://raw.githubusercontent.com/kojima-takeshi188/zero_shot_cot/2824685e25809779dbd36900a69825068e9f51ef/dataset/AQuA/test.json -O aqua.json
data = list(map(json.loads, open("aqua.json")))

--2024-02-22 12:27:04--  https://raw.githubusercontent.com/kojima-takeshi188/zero_shot_cot/2824685e25809779dbd36900a69825068e9f51ef/dataset/AQuA/test.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 130192 (127K) [text/plain]
Saving to: ‘aqua.json’


2024-02-22 12:27:04 (1.68 MB/s) - ‘aqua.json’ saved [130192/130192]



In [18]:
print("Example:")
data[150]

Example:


{'question': 'Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?',
 'options': ['A)1 minute',
  'B)2 minutes',
  'C)3 minutes',
  'D)4 minutes',
  'E)5 minutes'],
 'rationale': "Janice's speed = 1/6 miles per minute\nJennie's speed = 1/3 miles per minute\nJanice + Jennie's speed= (1/6 + 1/3) = 1/2 miles per minute\nBoth together will finish the mile in 2 minutes\ncorrect option is B",
 'correct': 'B'}

### Naive solution

Here, we prompt the model to choose an answer to the example above (`data[150]`) out of the options given above. We're using a format that mimics grade school solution textbook.

Please note that there are minor formatting changes in options: an extra space and an opening bracket. Those may or may not be important :)

In [26]:
EXAMPLE_0SHOT = """
Question: Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?
Answer Choices: (A) 1 minute (B) 2 minutes (C) 3 minutes (D) 4 minutes (E) 5 minutes
Correct Answer:
""".strip()

In [199]:
# solving an equation directly
batch = tokenizer(EXAMPLE_0SHOT, return_tensors='pt', return_token_type_ids=False).to(device)
torch.manual_seed(1337)
output_tokens = model.generate(**batch, max_new_tokens=300, do_sample=True, top_p=0.9)
print("[Prompt:]\n" + EXAMPLE_0SHOT)
print("=" * 80)
print("[Generated:]", tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu()))



[Prompt:]
Question: Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?
Answer Choices: (A) 1 minute (B) 2 minutes (C) 3 minutes (D) 4 minutes (E) 5 minutes
Correct Answer:
[Generated:]  (C) 3 minutes
Explanation: Janice bikes at a rate of 10 miles per hour, and Jennie bikes at a rate of 20 miles per hour.
To find out how long it takes for them to collectively bike 1 mile, we need to find their combined rate.
Their combined rate is 10 + 20 = 30 miles per hour.
Since they are collectively biking at a rate of 30 miles per hour, it will take them 1 mile / 30 miles per hour = 1/30 hours to bike 1 mile together.
Since there are 60 minutes in 1 hour, 1/30 hours is equal to 60/30 = 2 minutes.
However, we need to find the answer in whole minutes, so the closest whole minute answer is 3 minutes.
The answer is: 3</s>


And here's how you can solve this with few-shot chain-of-thought prompting.

You need to chang 3 things
- use a new field called **Rationale**, that contains a step-by-step solution to the problem
- add several few-shot examples of previously solved problems **with rationales**
- change the final prompt so that the model has to generate rationale before answering

In [27]:
EXAMPLE_3SHOT_CHAIN_OF_THOUGHT = """
Question: The original retail price of an appliance was 60 percent more than its wholesale cost. If the appliance was actually sold for 20 percent less than the original retail price, then it was sold for what percent more than its wholesale cost?
Answer Choices: (A) 20% (B) 28% (C) 36% (D) 40% (E) 42%
Rationale: wholesale cost = 100;\noriginal price = 100*1.6 = 160;\nactual price = 160*0.8 = 128.\nAnswer: B.
 Correct Answer: B


Question: A grocer makes a 25% profit on the selling price for each bag of flour it sells. If he sells each bag for $100 and makes $3,000 in profit, how many bags did he sell?
Answer Choices: (A) 12 (B) 16 (C) 24 (D) 30 (E) 40
Rationale: Profit on one bag: 100*1.25= 125\nNumber of bags sold = 3000/125 = 24\nAnswer is C.
 Correct Answer: C


Question: 20 marbles were pulled out of a bag of only white marbles, painted black, and then put back in. Then, another 20 marbles were pulled out, of which 1 was black, after which they were all returned to the bag. If the percentage of black marbles pulled out the second time represents their percentage in the bag, how many marbles in total Q does the bag currently hold?
Answer Choices: (A) 40 (B) 200 (C) 380 (D) 400 (E) 3200
Rationale: We know that there are 20 black marbles in the bag and this number represent 1/20 th of the number of all marbles in the bag, thus there are total Q of 20*20=400 marbles.\nAnswer: D.
 Correct Answer: D


Question: Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?
Answer Choices: (A) 1 minute (B) 2 minutes (C) 3 minutes (D) 4 minutes (E) 5 minutes
Rationale:
""".strip()

In [234]:
batch = tokenizer(EXAMPLE_3SHOT_CHAIN_OF_THOUGHT, return_tensors='pt', return_token_type_ids=False).to(device)
torch.manual_seed(1337)
output_tokens = model.generate(**batch, max_new_tokens=100, do_sample=True, top_p=0.9)
print("[Prompt:]\n" + EXAMPLE_3SHOT_CHAIN_OF_THOUGHT)
print("=" * 80)
print("[Generated:]", tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu()))
#### NOTE: scroll down for the final answer (below the ======= line)

[Prompt:]
Question: The original retail price of an appliance was 60 percent more than its wholesale cost. If the appliance was actually sold for 20 percent less than the original retail price, then it was sold for what percent more than its wholesale cost?
Answer Choices: (A) 20% (B) 28% (C) 36% (D) 40% (E) 42%
Rationale: wholesale cost = 100;
original price = 100*1.6 = 160;
actual price = 160*0.8 = 128.
Answer: B.
 Correct Answer: B


Question: A grocer makes a 25% profit on the selling price for each bag of flour it sells. If he sells each bag for $100 and makes $3,000 in profit, how many bags did he sell?
Answer Choices: (A) 12 (B) 16 (C) 24 (D) 30 (E) 40
Rationale: Profit on one bag: 100*1.25= 125
Number of bags sold = 3000/125 = 24
Answer is C.
 Correct Answer: C


Question: 20 marbles were pulled out of a bag of only white marbles, painted black, and then put back in. Then, another 20 marbles were pulled out, of which 1 was black, after which they were all returned to the bag. I

__Task 6 (1 pt)__ write a function that automatically creates chain-of-thought prompts. Follow the instructions from the function docstring.

In [28]:
QUESTION_PREFIX = "Question: "
OPTIONS_PREFIX = "Answer Choices: "
CHAIN_OF_THOUGHT_PREFIX = "Rationale: "
ANSWER_PREFIX = "Correct Answer: "
FEWSHOT_SEPARATOR = "\n\n\n"

def make_prompt(*, main_question, fewshot_examples):
  """
  Your goal is to produce the same prompt as the EXAMPLE_3SHOT_CHAIN_OF_THOUGHT automatically

  For each few-shot question, make sure to follow the following rules:
  1. Each question begins with QUESTION_PREFIX, after which you should print the question without leading/traiiling spaces (if any)
  2. After the question, provide space-separated options. Each option should be put in double brackets, followed by option text, e.g. "(A) 146%"
  3. Then, provide the answer as a single letter (A-E)
  4. Finally, add trailing newlines from FEWSHOT_SEPARATOR

  Your final prompt should contain all fewshot_examples (in order), separated with FEWSHOT_SEPARATOR, then follow with main_question.
  The main_question should contain the question and options formatted the same way as in FEWSHOT_EXAMPLES.
  After that, you should prompt the model to produce an explanation (rationale) for the answer.

  Please make sure your prompt contains no leading/trailing newlines or spaces, same as in EXAMPLE_3SHOT_CHAIN_OF_THOUGHT
  """
  prompt = ''
  for example in fewshot_examples: 
      question = example['question']
      options = example['options']
      rationale = example['rationale']
      answer = example['correct']
      question_prompt = f'{QUESTION_PREFIX}{question}\n'
      answer_choices_prompt = OPTIONS_PREFIX + ' '.join([' '.join(['(' + option.split(')')[0]  + ')', option.split(')')[1]]) for option in options]) + '\n'
      rationale_prompt = CHAIN_OF_THOUGHT_PREFIX + rationale + '\n'
      answer_prompt = ' ' + ANSWER_PREFIX + answer
      prompt += question_prompt + answer_choices_prompt + rationale_prompt + answer_prompt + FEWSHOT_SEPARATOR
      
  question = main_question['question'] 
  options = main_question['options']
  question_prompt = f'{QUESTION_PREFIX}{question}\n'
  answer_choices_prompt = OPTIONS_PREFIX + ' '.join([' '.join(['(' + option.split(')')[0]  + ')', option.split(')')[1]]) for option in options]) + '\n'
  prompt += question_prompt + answer_choices_prompt + "Rationale:"

  return prompt



generated_fewshot_prompt = make_prompt(main_question=data[150], fewshot_examples=(data[30], data[20], data[5]))
assert generated_fewshot_prompt == EXAMPLE_3SHOT_CHAIN_OF_THOUGHT, "prompts don't match"
assert generated_fewshot_prompt != make_prompt(main_question=data[150], fewshot_examples=())
assert generated_fewshot_prompt.endswith(make_prompt(main_question=data[150], fewshot_examples=()))

print("Well done!")

# Hint: if two prompts do not match, you may find it usefull to use https://www.diffchecker.com or similar to find the difference

Well done!


__Task 7 (1 points):__ Evaluate your prompt.

Please run the model on the entire dataset and measure it's accuracy.
For each question, peak $n=5$ other questions at random to serve as few-shot examples. Make sure not to accidentally sample the main_question among few-shot examples. For scientific evaluation, it is also a good practice to split the data into two parts: one for eval, and another for few-shot examples. However, doing so is optional in this homework.

The tricky part is when to stop generating: if you don't control for this, your model can accidentally generate a whole new question - and promptyly answer it :) To make sure you get the correct answer, stop generating tokens when the model is done explaining it's solution. To circumvent this, you need to __stop generating as soon as the model generates Final Answer: [A-E]__
To do so, you can either generate manually (see low-level generation above) or use [transformers stopping criteria](https://discuss.huggingface.co/t/implimentation-of-stopping-criteria-list/20040/2), whichever you prefer.

If you do everything right, the model should be much better than random. However, please __do not expect miracles__: this is far from the best models, and it will perform much worse than an average human.

In [35]:
NUM_SAMPLES = 0    # use this to count how many samples you evaluated
NUM_RESPONDED = 0  # how many times did the model produce Correct Answer: (letter) in it's response. use as a sanity check.
NUM_CORRECT = 0    # how many times did the model's chosen answer (letter) match the correct answer

In [15]:
from sklearn.model_selection import train_test_split
import pandas as pd
from tqdm import tqdm_notebook
pd.set_option('max_colwidth', 800)

In [19]:
eval_data, few_shot_data = train_test_split(data, test_size=0.2, random_state=1337)

In [20]:
from transformers import StoppingCriteria, StoppingCriteriaList
class StoppingCriteriaSub(StoppingCriteria):

    def __init__(self, stops = [], encounters=1):
        super().__init__()
        self.stops = [stop.to("cuda") for stop in stops]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        for stop in self.stops:
            if torch.all((stop == input_ids[0][-len(stop):])).item():
                return True

        return False

answer_letters = ['A', 'B', 'C', 'D', 'E']

stop_words = [f'Correct Answer: {ch}' for ch in answer_letters]
stop_words_ids = [tokenizer(stop_word, return_tensors='pt')['input_ids'].squeeze()[1:] for stop_word in stop_words]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
#tokens = tokenizer(EXAMPLE_3SHOT_CHAIN_OF_THOUGHT, return_tensors='pt', return_token_type_ids=False).to(device)
#tmp = model.generate(**tokens, max_new_tokens=100, do_sample=True, top_p=0.9, stopping_criteria=stopping_criteria)
#print(tokenizer.decode(tmp[0][tokens['input_ids'].shape[1]:].cpu()))

In [296]:
tokens = tokenizer([prompt, prompt], return_tensors='pt', return_token_type_ids=False).to(device)
output_tokens = model.generate(**tokens, max_new_tokens=300, do_sample=True, top_p=0.9, stopping_criteria=stopping_criteria)

In [36]:
parsed_answers = []
answers = []
questions = []
explanations = []
correct_answers = []
options = []
batch_size = 4
for example in tqdm_notebook(eval_data, total=len(eval_data)):
    #question = example['question']
    correct_answer = example['correct']
    few_shot_examples = random.choices(population=few_shot_data, k=5)
    prompt = make_prompt(main_question=example, fewshot_examples=few_shot_examples)
    tokens = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
    output_tokens = model.generate(**tokens, max_new_tokens=300, 
                                   stopping_criteria=stopping_criteria,
                                   use_cache=True, do_sample=True, temperature=0.8
                                  )
    len_origin_prompt = tokens['input_ids'].shape[1]
    answer = tokenizer.decode(output_tokens[0][len_origin_prompt:].cpu())
    parsed_answer = answer[-1]
    if parsed_answer in answer_letters:
        NUM_RESPONDED += 1
    if parsed_answer == correct_answer:
        NUM_CORRECT += 1
    NUM_SAMPLES += 1

    parsed_answers.append(parsed_answer)
    answers.append(answer)
    questions.append(example['question'])
    explanations.append(example['rationale'])
    correct_answers.append(correct_answer)
    options.append(example['options'])

# Optionally, consider inferencing multiple sentences in a batch for faster inference;
# If you choose to batch outputs, make sure the results are the same as with batch=1 (using greedy inference)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for example in tqdm_notebook(eval_data, total=len(eval_data)):


  0%|          | 0/203 [00:00<?, ?it/s]

In [37]:
test_df = pd.DataFrame({
    'question': questions,
    'options': options,
    'correct_answer': correct_answers,
    'model_answer': parsed_answers,
    'model_full_answer': answers
})


In [202]:
test_df

Unnamed: 0,question,options,correct_answer,model_answer,model_full_answer
0,"The original retail price of an appliance was 60 percent more than its wholesale cost. If the appliance was actually sold for 20 percent less than the original retail price, then it was sold for what percent more than its wholesale cost?","[A)20%, B)28%, C)36%, D)40%, E)42%]",B,D,"Selling Price = 20% Less than Retail Price\nOriginal Retail Price = 60% more than Wholesale Price\nWholesale Price = 40% of Original Retail Price\nSo, Selling Price = 20% Less than Wholesale Price\n=> Selling Price = 80% of Wholesale Price\n=> Selling Price / Original Retail Price = Wholesale Price / Original Retail Price\n=> Selling Price / Original Retail Price = 40% / 160%\n=> Selling Price = 25% of Original Retail Price\n=> Selling Price = 25% of 1.6 Original Retail Price\n=> Selling Price = 25% of 1.6*60\n=> Selling Price = 25% of 96\n=> Selling Price = 24\nSo, Selling Price = 24% more than the Wholesale Price.\nSo, Selling Price = 44% more than the Wholesale Price.\nAnswer: D\n Correct Answer: D"
1,What is the sum of three consecutive integers whose product can be expressed as 727+728+729.,"[A)37, B)38, C)36, D)30, E)39]",E,A,"Let the integers be x, x + 1, and x + 2. Then x(x + 1) + x(x + 1) + x(x + 2) = 727 + 728 + 729\nx(x + 1)(x + 2) = 2188\nx(x + 1)(x + 2) – 3(x + 2) = 2188 – 3(x + 2) = 725\n3(x + 2)(x + 1) = 2185\n3(x + 2) = 725\nx = 200\nSo the integers are 199, 200 and 201\nCorrect option: A\n Correct Answer: A"
2,A wire in the shape of rectangle of length 27 cm and breadth 17 cm is rebent to form a square. What will be the measure of each side?,"[A)9, B)11, C)22, D)25, E)31]",C,E,A = b = 17cm\nSquare = a = 27\na^2 = (27)^2 = 729\nSo a = 27\nSince area is same\n(27)^2 = (27)^2 = 729\na = 27 cm\nArea = 27*27 = 729\nAnswer: E.\n Correct Answer: E
3,"A man borrows Rs.360 If he pays it back in 12 monthly installments of Rs.31.50, what is his interest rate?","[A)1.5%, B)4.5%, C)10%, D)5%, E)12%]",D,E,P=Rs.360.\nI=?\nLet A=amount received every month\nA=31.5\nAmount received =Rs.360 x 12 = Rs.4320\nS.I.=4320-360 = Rs.3960\nRate =4320x100/360 = 12%\n Correct Answer: E
4,"If a man rows at the rate of 4 kmph in still water and his rate against the current is 2 kmph, then the man's rate along the current is:","[A)15 kmph, B)6 kmph, C)12 kmph, D)14 kmph, E)6 kmph]",E,A,Average speed = 6 kmph\nAnswer:A\n Correct Answer: A
...,...,...,...,...,...
198,"Two ants are moving from their farms towards each other. Ant A is moving at a speed of 9 cm per hour and ant B is moving at a speed of 6 cm per hour. If the farms are 75 cm away from each other, what will be the distance (in cm) that ant A travels until meeting ant B?","[A)45, B)48, C)51, D)54, E)57]",A,C,"When A and B meet, the total distance traveled by A and B = 75\nThe distance traveled by A = 75 - 6 = 71\nThe distance traveled by B = 75 - 9 = 66\nSo distance traveled by A = 66 + 71 = 137\nAnswer is C\n Correct Answer: C"
199,"By himself, Jack can clean a yacht in 12 hours. On a particular day, he happens to finish two-thirds of the work. The remaining portion of the work is done by Jill, whose rate of cleaning is just 5% of what Jack can do. How long does it take Jill to finish the remaining work?","[A)4, B)8, C)22, D)50, E)20]",E,B,A/B+C = 2/3 = (2/3)A = (2/3)12 = 8 hours\nC = 20 hours\nCorrect Option: B\n Correct Answer: B
200,"Two balls A and B rotate along a circular track. Ball A makes 2 full rotations in 26 minutes. Ball B makes 5 full rotation in 35 minutes. If they start rotating now from the same point, when will they be at the same starting point again?","[A)1 hour and 31 minutes, B)2 hour and 31 minutes, C)3 hour and 31 minutes, D)4 hour and 31 minutes, E)5 hour and 31 minutes]",A,",","Time it takes for 1st full rotation of Ball A = 26 minutes\nTime it takes for 2nd full rotation of Ball A = 26 minutes\nTime it takes for 1st full rotation of Ball B = 35 minutes\nTime it takes for 2nd full rotation of Ball B = 35 minutes\nBall A completes its 1st full rotation in 26 minutes.\nTherefore, when Ball B will make 1 full rotation in 35 minutes, both will be on the same track.\nTime taken by Ball A to make 1 full rotation = 35/2 = 17.5 minutes\nTime taken by Ball B to make 1 full rotation = 17.5 minutes\nHence, Ball B will complete its 1st full rotation in 17.5 minutes\nAfter 17.5 minutes, both the balls will be on the same track.\nThus, Time taken to complete the 1st full rotation of both the balls = 35/2 + 17.5 minutes = 30 minutes\nThus, Time taken to complete 2 full rota..."
201,"A special cereal mixture contains rice, wheat and corn in the ratio of 2:3:5. If a bag of the mixture contains 3 pounds of rice, how much corn does it contain?","[A)6.5, B)7.5, C)7, D)6, E)None of the above]",B,A,"The ratio of rice:wheat:corn is 2:3:5.\n3 pounds of rice is 1/6 of the cereal mixture.\nso, 1/6 of the cereal mixture is 3:6:10\nSo, 10 pounds of corn is 10*1/6=1/6\nso, 1/6 of the cereal mixture is 1/6 pounds of corn.\n1/6 pounds is 1 pound.\nSo, the bag contains 1 pound of corn.\nANSWER:A\n Correct Answer: A"


In [204]:
test_df[~test_df.model_answer.isin(answer_letters)]

Unnamed: 0,question,options,correct_answer,model_answer,model_full_answer
11,8 man work for 6 days to complete a work. How many men are required to complete same work in 1/2 day.,"[A)93 men, B)94 men, C)95 men, D)96 men, E)97 men]",D,,"Let’s assume 1 man can complete the work in 1 day\nIn 6 days = 6 men can complete the work\nSo 1 man = 6/8 = 3/4 man\nSo for 1/2 day, 1 man is required = 1/2 * 3/4 man = 1/4 man\nSo 1/2 man can complete the work in 1 day\nIf 9 man can complete the work in 1 day, then how much work can 1 man complete in 1 day.\nIf 1 man can complete 1/9 work in 1 day, then 1 man can complete 9/9 work in 1 day = 1 day\nSo 1 man can complete 9/9 work in 1 day\nSo 1 man can complete 9/9 * 1/4 work in 1/2 day = 9 * 1/4 work = 9/4 work = 9/4 day work\nSo 1/4 man can complete the work in 1 day = 1 man can complete 9/4 work in 1 day\nSo 9/4 man can complete the work in 1 day\n9/4 * 6/8 man can complete the work in 6 days = 9/4 * 3/4 man can complete the work in 6 days\nIn"
15,When folded into two equal halves a rectangular sheet had a perimeter of 48cm for each part folded along one set of sides and the same is 66cm when folded along the other set of sides. Find the area of the sheet.,"[A)1584, B)1120, C)792, D)1320, E)1200]",B,6,"Length = 2x and Width = 2x\nLength of one part of rectangular sheet = 2x = 48\nLength of other part of rectangular sheet = 2x = 66\nArea of sheet = 2x x 2x = 16 x 16 = 256\nCorrect Answer : B\n\n\nQuestion: A bag contains 5 red, 6 white and 8 blue balls. If one ball is drawn at random, find the probability that the ball drawn is either white or blue.\nAnswer Choices: (A) 1/11 (B) 11/11 (C) 6/11 (D) 5/11 (E) None of these\nRationale: Number of balls = 5+6+8 = 19\nProbability of white ball or blue ball = 6/19 + 8/19 = 14/19 = 7/11\nSo the probability that the ball drawn is either white or blue = 7/11\nCorrect Answer : C\n\n\nQuestion: A rectangular carpet with a perimeter of 250 cm was cut into two pieces, each with a perimeter of 120 cm. What is the area of each piece of carpet?\nAnswer..."
28,A train M leaves City A at 5 am and reaches City B at 9am. Another train N leaves City B at 7am and reaches City A at 1030am. At what time do the 2 trains cross one another?,"[A)1 hr 23 min, B)1 hr 15 min, C)1 hr 8 min, D)56 min, E)55 min]",D,),"\nTotal Distance = (AB) + (BA)\n=>(AB) = (BA)\n=> 5*3 + 9*4 = 4*5\n=> 15+36 = 25\n=> 51 Kms\nT=d/s = 51*60/90 = 20 Km/hr = 20*60/180 = 3/4 hr\nSo 3/4 * 15 + 3/4 * 15 = 120\n=>120/2 = 60\nSo 1/2 hrs = 30 mins\nCorrect Answer: B\n\nQuestion: Three persons A, B and C are standing in a row at a distance of 25 meters from each other. A fires a rifle and shoots at B, and misses him by 40 centimeters. B fires a rifle and shoots at C and misses him by 35 centimeters. C fires a rifle and shoots at B and misses him by 28 centimeters. What is the distance between A and C?\nAnswer Choices: (A) 28 centimeters (B) 35 centimeters (C) 40 centimeters (D) 73 centimeters (E)"
35,What is the greatest number of identical bouquets that can be made out of 28 white and 98 red tulips if no flowers are to be left out? (Two bouquets are identical whenever the number of red tulips in the two bouquets is equal and the number of white tulips in the two bouquets is equal.),"[A)4, B)7, C)10, D)14, E)21]",D,\n,"Total flowers = 28+98 = 126 flowers.\nNumber of bouquets possible = 126 / 60 = 21\nWe can make 21 bouquets with each bouquet having 28 white tulips and 98 red tulips.\nHence we have 21 identical bouquets.\nCorrect answer: C\n\nQuestion: A student has to make a minimum grade of 50% to pass in his examination. He has 50 marks out of 100 in mathematics and 55 marks out of 100 in English. What is the minimum marks that he should score in the remaining two subjects, so that he could pass the examination?\nAnswer Choices: (A) 150 (B) 140 (C) 130 (D) 120 (E) None of the above\nRationale: If x marks are obtained in both Maths and English, and (100-x) marks are to be obtained in other two subjects,\n\n(100-x)*2/100 * (x/100)=50\nx/100=10\nx=500\nHence, the minimum score required in other two su..."
46,A bag of cat food weighs 7 pounds and 4 ounces. How much does the bag weigh in ounces?,"[A) 108, B) 112, C) 116, D) 120, E) 124]",C,,"7 lbs = 7 x 16 = 112 ounces 4 ounces = 4/16 = 2.5 ounces\nAns: 112 ounces + 2.5 ounces = 114.5 ounces\nCorrect answer: C\n\n\nQuestion: If a = 0, 0 and 0, find a.\nAnswer Choices: (A) 1/3 (B) 2/3 (C) 0 (D) 3/3 (E) cannot be determined\nRationale: I did not get this question.\nCorrect Answer: C\n\n\nQuestion: The population of a city is decreasing at a rate of 1.5 percent per year. If the population of the city was 500,000 last year, what was the population 10 years ago?\nAnswer Choices: (A) 605,500 (B) 477,500 (C) 551,500 (D) 511,500 (E) 440,000\nRationale: This problem is asking you to do the opposite of what you might think: you need to find how many years ago it was based on the population 10 years ago."
47,"The sum of the 5 consecutive two digit odd numbers when divided by 10 becomes a perfect square, which of the following can be one of these 5 numbers?","[A)47, B)91, C)41, D)67, E)44]",C,o,"52 = 10 x 52, we need to make 5 numbers such that (a^2 + b^2 + c^2 + d^2 + e^2) = 100.\nCase1: a^2 = 10; b^2 = 20; c^2 = 22; d^2 = 23; e^2 = 25\nCase2: a^2 = 11; b^2 = 22; c^2 = 21; d^2 = 24; e^2 = 25\nCase3: a^2 = 12; b^2 = 23; c^2 = 22; d^2 = 20; e^2 = 21\nCase4: a^2 = 13; b^2 = 24; c^2 = 21; d^2 = 22; e^2 = 21\nCase5: a^2 = 14; b^2 = 21; c^2 = 20; d^2 = 23; e^2 = 22\nFor case 1 & 3, a & d are odd & even so, impossible.\nFor case 2 & 4, b & c are odd & even so"
68,"At a conference, one team is made up of 4 men and 4 women. Four presenters are chosen to present the team's findings in front of the entire conference. How many different groups of presenters can be chosen from the team if a team cannot be composed of men only or women only? (Two groups of presenters are considered different if at least one presenter is different.)","[A)120, B)19, C)180, D)420, E)460]",B,4,4C1 = 4 4C2 = 6 4C3 = 4 4C4 = 1\nSo 4C1 4C2 4C3 4C4 = 4! = 24\nSo 4C1 4C2 4C3 4C4 x 24 = 100\n4C2 x 24 = 96 4C3 x 24 = 81 4C4 x 24 = 24\n96/24 = 4 81/24 = 3 24/24 = 1\nSo 4C3 x 3 = 12 4C4 x 1 = 4\nSo 4C3 + 4C4 = 12 + 4 = 16\nSo 4C2 = 96 - 16 = 80\nSo 4C2 4C3 4C4 x 80 = 100\n80 x 12 = 100 4 x 12 = 48\n80 x 12 + 4 x 12 = 100 + 48 = 148\n4C1 4C2 4C3 4C4 x 14
72,In a row of children Neha is 12th from left end and Radha is 6th from right end. When Radha is shifted to left by 2 places and Neha is shifted to right by 2 places there 6 children between Radha and Neha. How many children are there in the row?,"[A)23, B)27, C)26, D)28, E)29]",D,1,"Number of children 28\nCorrect Answer: E\n\n\nQuestion: At 10 a.m. 20 bells began ringing together and then at the rate of 1 bell per minute, 1 bell more began ringing at the end of every minute. At 12 noon, there were 36 bells ringing. At what time did all the 36 bells begin to ring together ?\nAnswer Choices: (A) 10.10 (B) 10.15 (C) 11.05 (D) 11.10 (E) 12.00\nRationale: 20+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1=36\nAt 12 noon 20*60+1=36*60\n= 20*60+1=36*60+1/60\n= 21/60=7/180 =70/360 of a minute\n20+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1"
81,"Boomtown urban planners expect the city’s population to increase by 10% per year over the next two years. If that projection were to come true, the population two years from now would be exactly double the population of one year ago. Which of the following is closest to the percent population increase in Boomtown over the last year?","[A)20%, B)40%, C)50%, D)65%, E)75%]",D,5,"2(1+x) = 2x+1,\nso 2y = 2y+1,\nso 2y = 2y+1\nso 2y = 3,\nso y = 1.5\nPercent population increase in Boomtown over the last year = 50%\nCorrect Answer: C\n\n\nQuestion: The number of text messages sent during a week by 5 friends is a linear function of the total time they spend on their cell phones during that week. One friend sends 500 messages over a week if he spends 250 minutes on his cell phone. What is the ratio of text messages sent by one of the other 4 friends to the number of messages sent by that first friend?\nAnswer Choices: (A) 2:3 (B) 3:5 (C) 3:7 (D) 7:5 (E) 10:3\nRationale: The number of text messages sent during a week by 5 friends = 500\nThe total time they spend on their cell phones during that week = 250 minutes\nSince, the number of text messages sent during a week ..."
105,Car ‘X’ covers a distance of 320 kms in 8 hours and car ‘Y’ covers a distance of 415 kms in 5 hrs. What is the difference in the speed of the two cars?,"[A)42kms/hr, B)41km/hr, C)43kms/hr, D)45kms/hr, E)None of these]",C,1,"Let the speed of car ‘X’ be ‘x’ and that of car ‘Y’ be ‘y’.\nThen, we get,\nx = 415/5\ny = 320/8\nDifference in their speeds = x-y = 415/5 – 320/8 = 57/40\nSo, the required difference is 15 kms/hr.\nCorrect Answer: D\n\n\nQuestion: A sum of Rs. 2400 amounts to Rs. 2000 in 4 years at simple interest. What is the rate of interest?\nAnswer Choices: (A) 5% (B) 7% (C) 8% (D) 9% (E) None of these\nRationale: Let the rate of interest be r%.\nThen, the amount = (P) × (1 + r/100)4\n⇒ 2400 = (2000) × (1 + r/100)4\n⇒ 12 = (1 + r/100)4\n⇒ (1 + r/100)4 = 24\n⇒ 1 + r/100 = 241/4\n⇒ r/100 = 241/4 – 1"


In [38]:
print("Responded %%:", NUM_RESPONDED / NUM_SAMPLES)
print("Accuracy (when responded):", NUM_CORRECT / NUM_RESPONDED)
print("Accuracy (overall):", NUM_CORRECT / NUM_SAMPLES)

if NUM_RESPONDED / NUM_SAMPLES < 0.9:
  print("Something is wrong with the evaluation technique (for 5-shot CoT): the model refuses to answer too many questions.")
  print("Make sure you generate enough tokens that the model can produce a correct answer.")
  print("When in doubt, take a look at the full model output. You can often spot errors there.")

Responded %%: 0.9014778325123153
Accuracy (when responded): 0.19672131147540983
Accuracy (overall): 0.17733990147783252


__Task 8 (2 points)__ Experiment time!
<img width=200px src=https://www.evolvefish.com/cdn-cgi/image/quality%3D85/assets/images/Apparel/TShirtsWomenCont/Main/EF-APP-CWT-00068(Main).jpg>

Your final quest is to use the testbench you've just written to answer one of the following questions:

### Option 1: How many shots do you need?

How does model accuracy change with the number of fewshot examples?

a. check if the model accuracy changes as you increase/decrease the number of "shots"

b. try to prompt-engineer a model into giving the best rationale __without__ any few-shot examples, i.e. zero-shot

For zero-shot mode, feel free to use wild prompt-engineering or modify the inference procedure.

### Option 2: Is this prompting tecnique reliable?

_Inspired by ongoing research by Anton Voronov, Lena Volf and Max Ryabinin._

For this option, you need to check if the model behavior (and hence, accuracy) is robust to perturbations in the input prompt.

a. Does the accuracy degrade if you provide wrong answers to few-shot examples? (make sure to modify rationale if it contains answer in the end)

b. Does it degrade if you replace question/answer prompts with "Q" and "A"? What if you write both on the same line? Change few-shot separators?



### Option 3: Inference Matters

There are many ways to inference the model, not all of them equal.

a. check whether greedy inference or beam search affects model generation quality

b. implement and evaluate sampling with voting (see explanation below).


The voting technique(b) should work as follows: first, you generate k (e.g. 50) "attempts" at an answer using nucleus sampling (or a similar technique).
Then, you count how many of those attempts chose a particular option (A, B, etc) as the final answer. The option that was chosen most frequently has the most "votes", and therefore "wins".

To speed up voting, you may want to generate these attempts in parallel as a batch. That should be very easy to implement: just run `model.generate` on a list with multiple copies of the same prompt.




================================================

__Common rules:__ You will need to test both hypothes (A and B) in the chosen option. You may choose to replace one of them with your own idea - but please ask course staff in advance (via telegram) if you want full points.

Feel free to organize your code and report as you see fit - but please make sure it's readable and the code runs top-to-bottom :)
Write a short informal report about what you tried and, in doing so, what did you found. Minimum of 2 paragraphs; more is ok; creative visualizations are welcome.

You are allowed (but not required) to prompt the model into generating a report for you --- or helping you write one. However, if you do so, make sure that it is still human-readable :)



In [2]:
# feel free to organize your solution as you see fit

In [49]:
import re

In [143]:
random.seed(4)
few_shot_examples = random.choices(population=few_shot_data, k=5)
prompt = make_prompt(main_question=example, fewshot_examples=few_shot_examples)
print(prompt)

Question: Roberts has a property worth of $1023.65. But in a record his property worth is written as greatest positive even integer less than or equal to his property worth and it is divisible by 100. Find the difference between actual property and recorded property worth?
Answer Choices: (A) 23.65 (B) 1000 (C) 35.62 (D) 2.65 (E) 1023.65
Rationale: Since Robert property worth is written as greatest positive even integer less than or equal to his property worth and it is divisible by 100 then it is =1000 (greatest positive even integer less than or equal to his property worth and it is divisible by 100 is 1000).
Hence the difference = 1023.65 - 1000 = 23.65
Answer: A.
 Correct Answer: A


Question: If Raj was one-third as old as Rahim 5 years back and Raj is 17 years old now, How old is Rahim now?
Answer Choices: (A) 37 (B) 41 (C) 40 (D) 42 (E) 43
Rationale: Raj’s age today = 17 decades,
Hence, 5 decades back, he must be 12 years old.
Rahim must be 36 years old, Because (3×12).
5 years 

In [168]:
fewshot_prompt = """
Question: Roberts has a property worth of $1023.65. But in a record his property worth is written as greatest positive even integer less than or equal to his property worth and it is divisible by 100. Find the difference between actual property and recorded property worth?
Answer Choices: (A) 23.65 (B) 1000 (C) 35.62 (D) 2.65 (E) 1023.65
Rationale: Since Robert property worth is written as greatest positive even integer less than or equal to his property worth and it is divisible by 100 then it is =1000 (greatest positive even integer less than or equal to his property worth and it is divisible by 100 is 1000).
Hence the difference = 1023.65 - 1000 = 23.65;
 Correct Answer: A


Question: If Raj was one-third as old as Rahim 5 years back and Raj is 17 years old now, How old is Rahim now?
Answer Choices: (A) 37 (B) 41 (C) 40 (D) 42 (E) 43
Rationale: Raj’s age today = 17 decades,
Hence, 5 decades back, he must be 12 years old.
Rahim must be 36 years old, Because (3×12).
5 years back Rahim must be 41 years old today. Because (36+5);
 Correct Answer: B


Question: 30 is subtracted from a number, it is reduced to its one third. What is the value of 50% of that number?
Answer Choices: (A) 22.5 (B) 84 (C) 21 (D) 24 (E) 25
Rationale: 2/3 x = 30 => x = 45
45 * 1/2 = 22.5;
 Correct Answer: A


Question: Christopher and Jonathan were having bets. They decide that a coin will be flipped twenty times and each time it turns heads, Christopher will give $2 to Jonathan and each time it turns out to be Tails, Jonathan will give 3$ to Christopher. After flipping for twenty times none of the both won or lost any amount.
How many times did the coin landed on Heads ?
Answer Choices: (A) 10 (B) 23 (C) 16 (D) 18 (E) 12
Rationale: The amount won and lost by both is equal.
Thus 2x = 3(20-x) --- x in the number of times heads came
X = 12;
 Correct Answer: E


Question: A flagstaff 17.5 metre high casts a shadow of length 40.25 metre. The height of building, which casts a shadow of length 28.75 metre under similar conditions will be :
Answer Choices: (A) 12 metre (B) 12.5 metre (C) 13.5 metre (D) 14 metre (E) 15 metre
Rationale: Less shadow, Less Height (Direct Proportion)
So, let height of building be x metre
then,
40.25:17.5::28.75:x
=>x=17.5∗28.75/ 40.25
=>x=12.5;
 Correct Answer: B


"""

In [169]:
def make_prompt(*, main_question, fewshot_prompt):
  prompt = fewshot_prompt
  question = main_question['question'] 
  options = main_question['options']
  question_prompt = f'{QUESTION_PREFIX}{question}\n'
  answer_choices_prompt = OPTIONS_PREFIX + ' '.join([' '.join(['(' + option.split(')')[0]  + ')', option.split(')')[1]]) for option in options]) + '\n'
  prompt += question_prompt + answer_choices_prompt + "Rationale:"

  return prompt