### Practice: Large Language Models and Their Implications
![img](https://i.imgur.com/QGYa2J8.jpeg)

In this notebook, you're gonna play with some of the largest language models on the Internet.

_Based on works of: Tim Dettmers, Ruslan Svirschevsky, Artem Chumachenko, Younes Belkada, Felix Marty, Yulian Gilyazev, Gosha Zolotov, Andrey Ishutin, Elena Volf, Artemiy Vishnyakov, Svetlana Shirokovskih._

### Part 1: prompt engineering (4 points total)

In the assignment, we'll use public APIs that host the 100B+ models for inference. Your task is to prompt-engineer the model into solving a few tasks for you.


__Which API?__ You are free to use any publicly available API for a general LM -- as long as it's __not a chat assistant__. So, GPT-3.5 is fine, but ChatGPT is not. Here are a few options:

- BLOOM API - [bigscience/bloom](https://huggingface.co/bigscience/bloom) (on the right; recommended)
- OpenAI API (via VPN) - [openai.com/api](https://openai.com/api/)
- AI21 Jurassic API - [ai21.com](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1)

These APIs may require you to create a (free) account on their platform. Please note that some APIs also have paid subscriptions. __You do not need to pay them__: this assignment was designed to be solved using free-tier subscriptions. If no APIs work for you, you can also solve these tasks with the 6.7B model that you will find later in this notebook - but this will make the tasks somewhat harder.

__Quests:__ you will need to solve 4 problems. For each one, please attach a short __description__ of your solution and a __screenshot__ from the API you use. _[If you use python APIs, show your python code with outputs]_

__Example:__ Tony is talking to Darth Vader ([BLOOM API](https://huggingface.co/bigscience/bloom)). Black text is written manually, blue text is generated.
<hr>

![img](https://i.imgur.com/a1QhKF7.png)
<hr>

__It is fine to roll back a few times,__ e.g. in the example above, the model first generated Vader lines twice in a row, and we rolled that back. However, if you need more than 1-2 rollbacks per session, you should probably try a different prompt.

In [None]:
# code for generating text from a prompt
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Enoch/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    offload_state_dict=True
)

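def generate_text(prompt, max_length=100, max_new_tokens=None):
    # tokenize and move input_ids together with attention_mask to the model's device
    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(model.device)

    # pass only one length limit: max_new_tokens takes precedence over max_length
    if max_new_tokens is not None:
        length_kwargs = {"max_new_tokens": max_new_tokens}
    else:
        length_kwargs = {"max_length": max_length}

    outputs = model.generate(
        **inputs,
        **length_kwargs,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

    # the decoded text contains the prompt followed by the continuation
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text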


__Task 1 (1 pt):__ arrange a conversation between any two of the following:

- a celebrity or politician of your choice
- any fictional character (except Darth Vader)
- yourself

Compare two setups: a) you prompt with character names only; b) you supply additional information (see example).

In [None]:
# a) you prompt with character names only
prompt = """Harry Potter is talking to Mark Zuckerberg.
Harry Potter:"""
generated_text = generate_text(prompt, max_length=300)
print("Generated text:")
print(generated_text)


Generated text:
Harry Potter is talking to Mark Zuckerberg.
Harry Potter: “Mark, we have to talk.”
Mark Zuckerberg: “Why?”
Harry Potter: “I’ve seen what you’ve done with Facebook. It’s amazing. I can’t believe you’ve built such a successful social network with so little money. I’m really impressed.”
Mark Zuckerberg: “Thanks, Harry.”
Harry Potter: “But I’m concerned about your privacy policy.”
Mark Zuckerberg: “What do you mean?”
Harry Potter: “I’m concerned about how you’re using information people share on Facebook. I think it’s a bad idea.”
Mark Zuckerberg: “Why?”
Harry Potter: “Because it’s a terrible idea to allow people to share all this information about themselves online. It’s a terrible idea to make it easy for people to share their personal information with everyone they know. I don’t want to be on Facebook.”
Mark Zuckerberg: “But, Harry, you’re already on Facebook.”
Harry Potter: “Yes, I am. I’m on Facebook because I have to be. It’s for my job. I’m the Minister of Magic.”
Ma

This turned out quite well for such a simple prompt.

In [None]:
# b) you supply additional information
prompt = """Harry Potter is talking to Mark Zuckerberg.
Mark tells about Facebook. Harry is very interested. Harry asks about Facebook future.
Harry Potter:"""
generated_text = generate_text(prompt, max_length=300)
print("Generated text:")
print(generated_text)

Generated text:
Harry Potter is talking to Mark Zuckerberg.
Mark tells about Facebook. Harry is very interested. Harry asks about Facebook future.
Harry Potter: “What’s your vision for Facebook?”
Mark Zuckerberg: “We want to connect people to people. We want to connect people to businesses. We want to connect people to things that they care about.”
Harry Potter: “Why are you so confident about your vision?”
Mark Zuckerberg: “Because I think it’s a really good idea. And I think that we’re the people to execute it.”
Harry Potter: “Why is it a good idea?”
Mark Zuckerberg: “Because I think that people are fundamentally social. I think that they want to connect to each other and that they want to share and that they want to be able to have a voice. And I think that the world is going to be better if we help them do that.”
Harry Potter: “You’re really convinced that you’re doing the right thing?”
Mark Zuckerberg: “Yeah, I’m really confident.”
Harry Potter: “Why?”
Mark Zuckerberg: “Because I’

This turned out more meaningful, and the dialogue continues the scenario set in the prompt.

__Please choose task 2a or 2b (1pt)__ depending on your model (you can do both, but you will be awarded points for one of these two tasks).

__Task 2a: (for BLOOM or other multilingual model)__ zero-shot translation. Take the first verse of [Edgar Allan Poe's "Raven"](https://www.poetryfoundation.org/poems/48860/the-raven) and __translate it into French.__ (You are free to use any other text of at least the same size)

Original text:
```
Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore—
    While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
“’Tis some visitor,” I muttered, “tapping at my chamber door—
            Only this and nothing more.”
```

Verify your translation by converting French back into English using a public machine translation service.

__Task 2b: (non-BLOOM):__ toxicity classification for [SetFit/toxic_conversations](https://huggingface.co/datasets/SetFit/toxic_conversations). Make the model solve binary classification (toxic vs not toxic) in few-shot mode. For few-shot examples, use 2-3 toxic and 2-3 non-toxic examples. Measure accuracy on at least 25 samples. You may need to try several different prompts before you find the one that works.
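
A minimal sketch of how the few-shot setup for Task 2b could be organised. Everything here is an assumption made for illustration: the `Comment:` / `Label:` prompt layout, the label strings, the helper names, and the made-up example comments (they are not taken from the dataset). `complete` stands for any function that returns only the newly generated text for a prompt; the dummy version below replaces a real model call so that the sketch runs on its own.

```python
from typing import Callable, List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Formats labelled examples followed by the unlabelled query."""
    blocks = [f"Comment: {text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Comment: {query}\nLabel:")
    return "\n\n".join(blocks)

def parse_label(completion: str) -> str:
    """Maps the first generated line to 'toxic' or 'not toxic'; anything else is 'unknown'."""
    stripped = completion.strip()
    first_line = stripped.splitlines()[0].strip().lower() if stripped else ""
    if first_line.startswith("not toxic"):
        return "not toxic"
    if first_line.startswith("toxic"):
        return "toxic"
    return "unknown"

def accuracy(complete: Callable[[str], str],
             examples: List[Tuple[str, str]],
             test_set: List[Tuple[str, str]]) -> float:
    """Share of test samples for which the parsed label matches the gold label."""
    correct = 0
    for text, gold in test_set:
        completion = complete(build_few_shot_prompt(examples, text))
        correct += parse_label(completion) == gold
    return correct / len(test_set)

# Made-up examples for illustration only (not taken from the dataset).
few_shot = [
    ("You are a complete idiot.", "toxic"),
    ("Thanks for the detailed explanation!", "not toxic"),
    ("Nobody wants to hear from trash like you.", "toxic"),
    ("I disagree, but I see your point.", "not toxic"),
]

# Stand-in for a real model call: always answers "not toxic".
def dummy_complete(prompt: str) -> str:
    return " not toxic\n\nComment:"

test_set = [("What a lovely day.", "not toxic"), ("Shut up, moron.", "toxic")]
print(build_few_shot_prompt(few_shot, "What a lovely day."))
print("accuracy:", accuracy(dummy_complete, few_shot, test_set))
```

Counting unparseable answers as `unknown` (and therefore as errors) keeps the accuracy honest when the model drifts away from the expected format.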

In [None]:
prompt = """Original text: Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore—
While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door.
“’Tis some visitor,” I muttered, “tapping at my chamber door— Only this and nothing more.”

French translation:"""
generated_text = generate_text(prompt, max_length=300)
print("Generated text:")
print(generated_text)

Generated text:
Original text: Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore— 
While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapping, rapping at my chamber door. 
“’Tis some visitor,” I muttered, “tapping at my chamber door— Only this and nothing more.”

French translation: Il y avait un soir, par un temps sombre, et comme je m’affaiblissais, fatigué et las, je m’interrogeai sur de nombreux livres curieux et anciens, alors que je me reposais, presque endormi, et soudain, une tape frappait doucement à la porte de ma chambre.

C’était quelqu’un qui frappait, qui frappait à ma porte.

«Ce n’est qu’un visiteur», j’ai murmuré, assis à côté de mon livre.

English translation: Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore— While I nodded, nearly napping, suddenly there came a tapping, As of some one gently rapp

The main translation turned out quite well and the meaning is preserved, although the ending came out somewhat truncated.

Also, the model went on to print an English translation: generation does not stop after the French text but continues until the `max_length` limit is reached.
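
Since generation runs until the length limit, one way to keep only the translation is to cut the continuation at the first marker that starts a new section. The sketch below is only an illustration: `cut_continuation` is a hypothetical helper, the marker strings are assumptions, and the French sentence is a placeholder rather than model output. Note also that the decoded text may not reproduce the prompt character for character, in which case the helper falls back to searching the whole string.

```python
from typing import Sequence

def cut_continuation(generated: str, prompt: str, stop_markers: Sequence[str]) -> str:
    """Returns the text generated after the prompt, cut at the earliest stop marker."""
    # drop the echoed prompt when the output starts with it
    continuation = generated[len(prompt):] if generated.startswith(prompt) else generated
    cut = len(continuation)
    for marker in stop_markers:
        position = continuation.find(marker)
        if position != -1:
            cut = min(cut, position)
    return continuation[:cut].strip()

# Placeholder strings for illustration (not real model output).
prompt = "Original text: Only this and nothing more.\n\nFrench translation:"
generated = prompt + " Rien que cela et rien de plus.\n\nEnglish translation: Only this"

print(cut_continuation(generated, prompt, ["\nEnglish translation:", "\nOriginal text:"]))
```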


__Task 3 (1pt):__ create a prompt and few-shot examples that make the model __change the gender pronouns__ of the main actor in a given sentence in any direction of your choice. E.g. the doctor took off _his_ mask <-> the doctor took off _her_ mask.


In [None]:
prompt = """
Sentence: Mother plays with her daughter
Change: Mother plays with his daughter

Sentence: Father is repairing his car
Change: Father is repairing her car

Sentence: Boy is doing his homework
Change: Boy is doing her homework

Sentence: Doctor took off his mask
Change:"""
generated_text = generate_text(prompt, max_new_tokens=5)
print("Generated text:")
print(generated_text)


Generated text:

Sentence: Mother plays with her daughter 
Change: Mother plays with his daughter

Sentence: Father is repairing his car
Change: Father is repairing her car

Sentence: Boy is doing his homework
Change: Boy is doing her homework

Sentence: Doctor took off his mask
Change: Doctor took off her mask


__Task 4 (1pt):__ write a prompt and supply examples such that the model would __convert imperial units to metric units__ (miles -> kilometers; mph -> kph). More specifically, the model should rewrite a given sentence and replace all imperial units with their metric equivalents. After it works with basic distances and speed, try to find complicated examples where it does *not* work.

Please note that 1 mile is not equal to 1 km :)

In [None]:
prompt = """
Given a text, convert all occurances of imperial units to metric units.
Use this information for convertation: 1 mile equals 1.6 kilometers, 1 inch equals 2.5 centimetres, 1 foot is 30 centimetres.
Check examples below.

Example: Average height of people in the USA is 5 feet, 9 inches
Result: Average height of people in the USA is 175 centimetres

Example: Man is driving with speed 50 miles per hour
Result: Man is driving with speed 80 kilometers per hour

Example: 2 inches are short for this task
Result: 5 centimetres are short for this task

Text 1: 3 inches
Text 2: 60 miles per hour is slow for our mission

Result 1:"""
generated_text = generate_text(prompt, max_new_tokens=30)
print("Generated text:")
print(generated_text)


Generated text:

Given a text, convert all occurances of imperial units to metric units.
Use this information for convertation: 1 mile equals 1.6 kilometers, 1 inch equals 2.5 centimetres, 1 foot is 30 centimetres.
Check examples below.

Example: Average height of people in the USA is 5 feet, 9 inches
Result: Average height of people in the USA is 175 centimetres

Example: Man is driving with speed 50 miles per hour
Result: Man is driving with speed 80 kilometers per hour

Example: 2 inches are short for this task
Result: 5 centimetres are short for this task

Text 1: 3 inches
Text 2: 60 miles per hour is slow for our mission

Result 1: 7.5 centimetres is short for this task
Result 2: 100 kilometers per hour is slow for our mission


The result is almost correct: 3 inches is indeed 7.5 centimetres, and 60 mph is 96 km/h, which is close to 100, so the result is good.

In [None]:
prompt = """
Given a text, convert all occurances of imperial units to metric units.
Use this information for convertation: 1 mile equals 1.6 kilometers, 1 inch equals 2.5 centimetres, 1 foot is 30 centimetres.
Check examples below.

Example: Average height of people in the USA is 5 feet, 9 inches
Result: Average height of people in the USA is 175 centimetres

Example: Man is driving with speed 50 miles per hour
Result: Man is driving with speed 80 kilometers per hour

Example: 2 inches are short for this task
Result: 5 centimetres are short for this task

Text 1: My height is 6 feet, 3 inches. I am living 30 miles away from the centre.

Result 1:"""
generated_text = generate_text(prompt, max_new_tokens=30)
print("Generated text:")
print(generated_text)


Generated text:

Given a text, convert all occurances of imperial units to metric units.
Use this information for convertation: 1 mile equals 1.6 kilometers, 1 inch equals 2.5 centimetres, 1 foot is 30 centimetres.
Check examples below.

Example: Average height of people in the USA is 5 feet, 9 inches
Result: Average height of people in the USA is 175 centimetres

Example: Man is driving with speed 50 miles per hour
Result: Man is driving with speed 80 kilometers per hour

Example: 2 inches are short for this task
Result: 5 centimetres are short for this task

Text 1: My height is 6 feet, 3 inches. I am living 30 miles away from the centre.

Result 1: My height is 1.8 meters. I am living 4800 metres away from the centre.

Text 2: I


In this case the model converts the height roughly correctly, but gets the distance wrong: 30 miles is about 48 kilometres, not 4800 metres.
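
To judge the model's answers, it helps to compute the reference values with the same rounded factors that the prompt gives. This is a small sketch with hypothetical helper names; the factors are the prompt's approximations, not exact conversion constants.

```python
# Rounded factors copied from the prompt above (not exact conversion constants).
CM_PER_INCH = 2.5
CM_PER_FOOT = 30
KM_PER_MILE = 1.6

def height_cm(feet: float, inches: float) -> float:
    """Converts a height given in feet and inches to centimetres."""
    return feet * CM_PER_FOOT + inches * CM_PER_INCH

def miles_to_km(miles: float) -> float:
    """Converts miles to kilometres (also works for mph -> km/h)."""
    return miles * KM_PER_MILE

print("5 feet, 9 inches in cm:", height_cm(5, 9))   # first few-shot example
print("6 feet, 3 inches in cm:", height_cm(6, 3))   # test sentence
print("60 mph in km/h:", miles_to_km(60))
print("30 miles in km:", miles_to_km(30))
```

With these factors, 6 feet 3 inches comes to 187.5 cm and 30 miles to 48 km. The first few-shot example comes to 172.5 cm rather than the 175 cm written in the prompt, so the prompt's own examples are not fully consistent with its stated factors.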

### Part 2: local inference

Now, let's try and load the strongest model that can fit a typical Colab GPU (T4 with 16 GB as of spring 2023).

Our best candidates are the smaller versions of the best performing open source models:
- 7 Bn parameters version of [LLaMA](https://arxiv.org/pdf/2302.13971.pdf) - best for spring 2023, released by Facebook
- 7 Bn parameters version of [Falcon](https://falconllm.tii.ae) - close competitor to Llama, released in May 2023 by [Technology Innovation Institute of UAE](https://www.tii.ae).
- 6.7 Bn parameters version of [OPT](https://arxiv.org/abs/2205.01068) - top choice in this nomination in 2022, released by Facebook.

Beware: while these models are smaller than the ones in API, they're still over 60x larger than the BERT we played with last time. The code below will *just barely* fit into memory, so make sure you don't have anything else loaded. Sometimes you may need to restart runtime for the code to work.
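
A rough back-of-the-envelope estimate shows why memory is so tight. The sketch below counts only the weights, using the nominal parameter counts from the list above; activations, the attention cache and CUDA overhead are ignored, so the real footprint is higher. `weights_gib` is a hypothetical helper.

```python
def weights_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

for name, params in [("7B", 7e9), ("6.7B", 6.7e9)]:
    for bits in (16, 8, 4):
        print(f"{name} model, {bits}-bit weights: {weights_gib(params, bits):.1f} GiB")
```

In 16-bit precision a 7B model already needs about 13 GiB for the weights, which leaves little room on a 16 GB card; storing weights in 8 or 4 bits shrinks this proportionally.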

It's a good time to restart your kernel and switch to GPU! (Runtime -> Change runtime type)
<center><img src="https://i.imgur.com/OOfDYzJ.png" width=240px></center>

In [1]:
%pip install --quiet bitsandbytes==0.41.1 transformers==4.41.0 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.15.0 auto-gptq==0.5.0 torch==2.1.0
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import bitsandbytes as bnb
from tqdm.auto import tqdm, trange
assert torch.cuda.is_available(), "you need cuda for this part"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


In [2]:
model_name = 'TheBloke/Llama-2-13B-GPTQ'

# loading Llama tokenizer ...
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name, device_map=device)
tokenizer.pad_token_id = tokenizer.eos_token_id

# ... and the model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    offload_state_dict=True
)


## Text generation

**Comparison of strategies for language model text generation:**

| Strategy | Description | Pros & Cons |
| --- | --- | --- |
| Greedy Search | Chooses the word with the highest probability as the next word in the sequence. | **Pros:** Simple and fast. <br> **Cons:** Can lead to repetitive and incoherent text. |
| Sampling with Temperature | Introduces randomness in the word selection. A higher temperature leads to more randomness. | **Pros:** Allows exploration and diverse output. <br> **Cons:** Higher temperatures can lead to nonsensical outputs. |
| Nucleus Sampling (Top-p Sampling) | Samples the next word from a truncated vocabulary: the "nucleus", i.e. the smallest set of most probable words whose cumulative probability reaches a pre-specified threshold (p). | **Pros:** Balances diversity and quality. <br> **Cons:** Setting an optimal 'p' can be tricky. |
| Beam Search | Explores multiple hypotheses (sequences of words) at each step, and keeps the 'k' most likely, where 'k' is the beam width. | **Pros:** Produces more reliable results than greedy search. <br> **Cons:** Can lack diversity and lead to generic responses. |
| Top-k Sampling | Randomly selects the next word from the top 'k' words with the highest probabilities. | **Pros:** Introduces randomness, increasing output diversity. <br> **Cons:** Random selection can sometimes lead to less coherent outputs. |
| Length Normalization | Prevents the model from favoring shorter sequences by dividing the log probabilities by the sequence length raised to some power. | **Pros:** Makes longer and potentially more informative sequences more likely. <br> **Cons:** Tuning the normalization factor can be difficult. |
| Stochastic Beam Search | Introduces randomness into the selection process of the 'k' hypotheses in beam search. | **Pros:** Increases diversity in the generated text. <br> **Cons:** The trade-off between diversity and quality can be tricky to manage. |
| Decoding with Minimum Bayes Risk (MBR) | Chooses the hypothesis (out of many) that minimizes expected loss under a loss function. | **Pros:** Optimizes the output according to a specific loss function. <br> **Cons:** Computationally more complex and requires a good loss function. |
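
To make the first rows of the table concrete, here is a toy sketch of greedy search, temperature scaling and top-k filtering on a hand-written set of logits. The logits and helper names are invented for illustration and do not come from any model; real implementations work on tensors over the whole vocabulary.

```python
import math
import random
from typing import Dict

def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turns logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keeps the k most likely tokens and renormalizes their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample(probs: Dict[str, float], rng: random.Random) -> str:
    """Draws one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Invented logits for four candidate tokens (not produced by any model).
toy_logits = {"Russia": 4.0, "the": 2.5, "Russian": 0.5, "a": -1.0}

greedy_token = max(toy_logits, key=toy_logits.get)  # greedy search: take the argmax
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
top2 = top_k_filter(softmax_with_temperature(toy_logits, temperature=1.0), k=2)

rng = random.Random(0)
print("greedy:", greedy_token)
print("P(Russia) at T=0.5:", round(sharp["Russia"], 3))
print("P(Russia) at T=2.0:", round(flat["Russia"], 3))
print("top-2 distribution:", {t: round(p, 3) for t, p in top2.items()})
print("top-2 sample:", sample(top2, rng))
```

Lowering the temperature moves probability mass towards the most likely token, so in the limit sampling behaves like greedy search; top-k simply removes the tail before sampling.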

Documentation references:
- [reference for `AutoModelForCausalLM.generate()`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
- [reference for `AutoTokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode)
- Huggingface [docs on generation strategies](https://huggingface.co/docs/transformers/generation_strategies)

### Generation with HuggingFace

In [None]:
prompt = 'The first discovered martian lifeform looks like'
batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
print("Input batch (encoded):", batch)

output_tokens = model.generate(**batch, max_new_tokens=64, do_sample=True, temperature=0.8)
# greedy inference:                                        do_sample=False)
# beam search for highest probability:                     num_beams=4)

print("\nOutput:", tokenizer.decode(output_tokens[0].cpu()))

Input batch (encoded): {'input_ids': tensor([[    1,   450,   937, 10943, 14436,   713,  2834,   689,  3430,   763]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}





Output: <s>The first discovered martian lifeform looks like a piece of metal.
Scientists have discovered a new type of bacterium that can thrive in conditions so extreme that it is not even like any other living thing that has ever been discovered on Earth.
The bacteria were found living inside a Martian meteorite that was discovered in the Sah


#### Low-level code for text generation

In [None]:
prompt = "Moscow is the capital of"
# prompt = "Skippy, a young android, likes to dream about electric"

print(prompt, '\n')

voc = tokenizer.get_vocab()
voc_rev = {v:k for k, v in voc.items()}  # reverse vocab for decode

for i in range(10):
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits[0, -1, :]  # logits for the next token
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # sample the next token from the full distribution
    next_token_id = torch.multinomial(probs.flatten(), num_samples=1)

    next_token = tokenizer.decode(next_token_id)
    prompt += next_token

    # show the 5 most likely candidates at this step
    top_probs, top_tokens = torch.topk(probs, k=5)
    print(f"Step #{i} candidates:")
    for t, p in zip(top_tokens, top_probs):
        t = voc_rev[t.item()]
        print(f"{t:<10}: {p:.4f} ")

    print(f'\nChosen token: {next_token}', end='\n\n', flush=True)

Moscow is the capital of 

Step #0 candidates:
▁Russia   : 0.7616 
▁the      : 0.1795 
▁Russian  : 0.0218 
▁a        : 0.0058 
▁not      : 0.0022 

Chosen token: the

Step #1 candidates:
▁Russian  : 0.8241 
▁largest  : 0.0293 
Russ      : 0.0146 
▁Russia   : 0.0116 
▁country  : 0.0101 

Chosen token: Russian

Step #2 candidates:
▁Federation: 0.7706 
F         : 0.1720 
f         : 0.0168 
▁feder    : 0.0106 
▁Empire   : 0.0041 

Chosen token: Federation

Step #3 candidates:
.         : 0.2875 
,         : 0.2831 
and       : 0.2170 
▁and      : 0.1256 
as        : 0.0093 

Chosen token: .

Step #4 candidates:
▁It       : 0.3454 
▁The      : 0.1321 
▁Moscow   : 0.0747 
<0x0A>    : 0.0582 
It        : 0.0373 

Chosen token: It

Step #5 candidates:
▁is       : 0.6774 
is        : 0.0882 
▁has      : 0.0454 
▁was      : 0.0309 
’         : 0.0234 

Chosen token: is

Step #6 candidates:
▁the      : 0.1747 
located   : 0.1173 
▁located  : 0.0957 
▁also     : 0.0689 
▁a        : 0.0658 

Chos

**Task 5: write code for nucleus sampling generation (2 points)**:

Use the `nucleus_sampling()` template below. Look at the detailed generation code above for inspiration. __Please do not use model.generate__.

**Bonus task: write code for beam search (3 bonus points)**

In [None]:
from typing import Tuple, List
def nucleus_sampling(model, tokenizer, prompt: str, prob: float = 0.5) -> Tuple[str, List[str]]:
    """generates the next token from the nucleus of tokens with cumulative probability up to param:prob"""
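    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits[0, -1, :]
    probs = torch.nn.functional.softmax(logits, dim=-1)

    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    # keep the top tokens whose cumulative probability stays within prob
    mask = torch.cumsum(sorted_probs, dim=0) <= prob
    mask[0] = True  # always keep the most likely token, so the nucleus is never empty
    # multinomial renormalizes the kept weights; the mask is a prefix of the sorted
    # order, so the sampled position indexes sorted_indices directly
    next_token_id = sorted_indices[torch.multinomial(sorted_probs[mask], num_samples=1)]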

    sampled_token = tokenizer.decode(next_token_id)

    possible_tokens = []
    for idx in sorted_indices[mask]:
        possible_tokens.append(tokenizer.decode(idx))

    top_tokens = sorted_indices[:len(possible_tokens)]
    print(f"Candidates:")
    for t, p in zip (top_tokens, sorted_probs):
        t = voc_rev[t.item()]
        print(f"{t:<10}: {p:.4f} ")

    # sampled_token should be a string token that was generated
    # possible_tokens should be a list of all tokens that have non-zero probability
    return sampled_token, possible_tokens

In [None]:
# Tests for nucleus sampling
test_prompt = "Elbrus is the highest"
next_token, possible_tokens = nucleus_sampling(model, tokenizer, test_prompt, prob=0.9)
print(test_prompt, next_token, possible_tokens)
assert next_token in possible_tokens
assert 3 <= len(possible_tokens) <= 3
assert sorted(possible_tokens) == ['mountain', 'peak', 'point']

test_prompt = "Large language models can learn to"
next_token, possible_tokens = nucleus_sampling(model, tokenizer, test_prompt, prob=0.4)
print(test_prompt, next_token, possible_tokens)
assert next_token in possible_tokens
assert sorted(possible_tokens) == ['be', 'communicate', 'do', 'generate', 'perform', 'predict', 'speak', 'write']
assert len(possible_tokens) == 8

Candidates:
▁peak     : 0.4371 
▁mountain : 0.3512 
▁point    : 0.0719 
Elbrus is the highest mountain ['peak', 'mountain', 'point']
Candidates:
▁generate : 0.0849 
▁write    : 0.0804 
▁perform  : 0.0465 
▁do       : 0.0451 
▁speak    : 0.0407 
▁be       : 0.0291 
▁predict  : 0.0284 
▁communicate: 0.0269 
Large language models can learn to be ['generate', 'write', 'perform', 'do', 'speak', 'be', 'predict', 'communicate']
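To look at the filtering step in isolation, here is a minimal model-free sketch. It assumes the same convention as `nucleus_sampling` above: keep the top tokens while the cumulative probability stays within `prob`, and always keep at least the most likely one. The helper names `nucleus` and `sample_from_nucleus` are hypothetical, and so are the last two probabilities in `toy_probs`; the first three are copied from the output above.

```python
import random

def nucleus(probs, p):
    """Return (token, prob) pairs, most likely first, whose cumulative probability stays <= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        # always keep the top token; afterwards stop once the threshold would be exceeded
        if kept and total + prob > p:
            break
        kept.append((token, prob))
        total += prob
    return kept

def sample_from_nucleus(probs, p, rng=random):
    """Sample one token from the nucleus; random.choices renormalizes the weights."""
    kept = nucleus(probs, p)
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]

# first three values come from the "Elbrus is the highest" output; the last two are hypothetical
toy_probs = {"peak": 0.4371, "mountain": 0.3512, "point": 0.0719, "summit": 0.05, "in": 0.03}

print([token for token, _ in nucleus(toy_probs, 0.9)])
print(sample_from_nucleus(toy_probs, 0.9))
```

Note that some descriptions of nucleus sampling also include the token that crosses the threshold; the sketch sticks to the `<=` convention that the tests in this notebook expect.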


**Bonus task**

Beam search implementation.

In [25]:
def beam_search(model, tokenizer, prompt, num_steps, k):
    """Return the k highest-scoring continuations of prompt after num_steps generated tokens."""
    inputs = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

    # each beam is a pair: (token ids of shape [1, seq_len], cumulative log-probability)
    sequences = [(inputs["input_ids"], 0.0)]
    for _ in range(num_steps):
        all_candidates = []

        for input_ids, score in sequences:
            with torch.no_grad():
                logits = model.forward(input_ids).logits[0, -1, :]
            # sum log-probabilities, so that the score is the log-probability of the whole sequence
            log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
            top_k_log_probs, top_k_indices = torch.topk(log_probs, k)

            for j in range(k):
                new_score = score + top_k_log_probs[j].item()
                new_input_ids = torch.cat([input_ids, top_k_indices[j].reshape(1, 1)], dim=1)
                all_candidates.append((new_input_ids, new_score))

        # keep the k best candidates across all beams
        all_candidates.sort(key=lambda x: x[1], reverse=True)
        sequences = all_candidates[:k]

    # decode every surviving beam (prompt included) back into text
    return [tokenizer.decode(input_ids[0]).strip() for input_ids, _ in sequences]
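Beams are usually ranked by the sum of token log-probabilities, i.e. the log of the sequence probability. The sketch below runs the same search loop on a tiny hand-written next-token table, so it needs no model. `TOY_LM` and `toy_beam_search` are hypothetical names, and the probabilities are invented to show a case where greedy decoding (`k=1`) misses the most likely sequence.

```python
import math

# hypothetical toy "language model": next-token probabilities given the last token
TOY_LM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.1, "b": 0.9},
}

def toy_beam_search(start, num_steps, k):
    """Return the k best (tokens, log_prob) pairs after num_steps expansion steps."""
    beams = [([start], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_LM[tokens[-1]].items():
                # adding log-probabilities multiplies the underlying probabilities
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

for k in (1, 2):
    tokens, score = toy_beam_search("<s>", num_steps=2, k=k)[0]
    print(f"k={k}: best sequence {tokens}, probability {math.exp(score):.2f}")
```

With `k=1` the search commits to `a` after the first step and ends at probability 0.33, while `k=2` keeps `b` alive and finds `b b` with probability 0.36.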

In [28]:
test_prompt = "Elbrus is the highest"
generations = beam_search(model, tokenizer, test_prompt, num_steps=7, k=3)
print('Prompt: ', test_prompt)
print('Generations:')
for generation in generations:
    print(generation)

test_prompt = "Large language models can learn to"
generations = beam_search(model, tokenizer, test_prompt, num_steps=15, k=4)
print('Prompt: ', test_prompt)
print('Generations:')
for generation in generations:
    print(generation)

Prompt:  Elbrus is the highest
Generations:
<s>Elbrus is the highest mountain in Europe and the Cau
<s>Elbrus is the highest mountain in Europe. It is located
<s>Elbrus is the highest mountain in Europe. It is a
Prompt:  Large language models can learn to
Generations:
<s>Large language models can learn to perform a variety of natural language processing (NLP) tasks, such as
<s>Large language models can learn to perform a variety of natural language processing (NLP) tasks, including text
<s>Large language models can learn to perform a variety of natural language processing (NLP) tasks, such a
<s>Large language models can learn to perform a variety of natural language processing (NLP) tasks, such…


### Part 3: Chain-of-thought prompting (4 points total)

![img](https://github.com/kojima-takeshi188/zero_shot_cot/raw/main/img/image_stepbystep.png)

---



In [29]:
import json
import random
import locale; locale.getpreferredencoding = lambda: "UTF-8"
!wget https://raw.githubusercontent.com/kojima-takeshi188/zero_shot_cot/2824685e25809779dbd36900a69825068e9f51ef/dataset/AQuA/test.json -O aqua.json
data = list(map(json.loads, open("aqua.json")))

--2024-11-16 09:55:02--  https://raw.githubusercontent.com/kojima-takeshi188/zero_shot_cot/2824685e25809779dbd36900a69825068e9f51ef/dataset/AQuA/test.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 130192 (127K) [text/plain]
Saving to: ‘aqua.json’


2024-11-16 09:55:02 (6.58 MB/s) - ‘aqua.json’ saved [130192/130192]



In [30]:
print("Example:")
data[150]

Example:


{'question': 'Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?',
 'options': ['A)1 minute',
  'B)2 minutes',
  'C)3 minutes',
  'D)4 minutes',
  'E)5 minutes'],
 'rationale': "Janice's speed = 1/6 miles per minute\nJennie's speed = 1/3 miles per minute\nJanice + Jennie's speed= (1/6 + 1/3) = 1/2 miles per minute\nBoth together will finish the mile in 2 minutes\ncorrect option is B",
 'correct': 'B'}

### Naive solution

Here, we prompt the model to choose an answer to the example above (`data[150]`) out of the options given above. We're using a format that mimics a grade-school solution textbook.

Please note that there are minor formatting changes in options: an extra space and an opening bracket. Those may or may not be important :)
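The formatting change mentioned above can be sketched as a one-line conversion from the dataset format (`'A)1 minute'`) to the prompt format (`(A) 1 minute`). The helper name `format_option` is hypothetical and only meant as an illustration.

```python
def format_option(raw_option):
    """Turn a raw dataset option such as 'A)1 minute' into '(A) 1 minute'."""
    # split on the first ")" only, in case the option text itself contains brackets
    letter, text = raw_option.split(")", 1)
    return f"({letter}) {text}"

raw_options = ["A)1 minute", "B)2 minutes", "C)3 minutes", "D)4 minutes", "E)5 minutes"]
print(" ".join(format_option(option) for option in raw_options))
```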

In [None]:
EXAMPLE_0SHOT = """
Question: Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?
Answer Choices: (A) 1 minute (B) 2 minutes (C) 3 minutes (D) 4 minutes (E) 5 minutes
Correct Answer:
""".strip()

In [None]:
# solving an equation directly
batch = tokenizer(EXAMPLE_0SHOT, return_tensors='pt', return_token_type_ids=False).to(device)
torch.manual_seed(1337)
output_tokens = model.generate(**batch, max_new_tokens=100, do_sample=True, top_p=0.9)
print("[Prompt:]\n" + EXAMPLE_0SHOT)
print("=" * 80)
print("[Generated:]", tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu()))

[Prompt:]
Question: Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?
Answer Choices: (A) 1 minute (B) 2 minutes (C) 3 minutes (D) 4 minutes (E) 5 minutes
Correct Answer:
[Generated:] (E) 5 minutes
Explanation: Jennie bikes at 20 miles per hour for 2 minutes. She will have travelled 2 miles in this time. Janice also bikes for 2 minutes, but at a slower speed of 10 miles per hour. This means that she will travel 2/10 miles or 0.2 miles. She will travel 1 mile in 5 minutes. Hence, 5 minutes will have el
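For reference, the arithmetic behind the dataset's rationale can be checked directly: the two riders cover distance at a combined 30 miles per hour, which gives option (B), not the (E) sampled above. The variable names below are illustrative.

```python
from fractions import Fraction

janice_mph = Fraction(10)
jennie_mph = Fraction(20)

# combined speed converted from miles per hour to miles per minute
combined_miles_per_minute = (janice_mph + jennie_mph) / 60

# time needed for the two riders to cover 1 mile in total
minutes_for_one_mile = 1 / combined_miles_per_minute
print(minutes_for_one_mile)
```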


And here's how you can solve this with few-shot chain-of-thought prompting.

You need to change 3 things:
- use a new field called **Rationale**, that contains a step-by-step solution to the problem
- add several few-shot examples of previously solved problems **with rationales**
- change the final prompt so that the model has to generate rationale before answering

In [32]:
EXAMPLE_3SHOT_CHAIN_OF_THOUGHT = """
Question: The original retail price of an appliance was 60 percent more than its wholesale cost. If the appliance was actually sold for 20 percent less than the original retail price, then it was sold for what percent more than its wholesale cost?
Answer Choices: (A) 20% (B) 28% (C) 36% (D) 40% (E) 42%
Rationale: wholesale cost = 100;\noriginal price = 100*1.6 = 160;\nactual price = 160*0.8 = 128.\nAnswer: B.
Correct Answer: B


Question: A grocer makes a 25% profit on the selling price for each bag of flour it sells. If he sells each bag for $100 and makes $3,000 in profit, how many bags did he sell?
Answer Choices: (A) 12 (B) 16 (C) 24 (D) 30 (E) 40
Rationale: Profit on one bag: 100*1.25= 125\nNumber of bags sold = 3000/125 = 24\nAnswer is C.
Correct Answer: C


Question: 20 marbles were pulled out of a bag of only white marbles, painted black, and then put back in. Then, another 20 marbles were pulled out, of which 1 was black, after which they were all returned to the bag. If the percentage of black marbles pulled out the second time represents their percentage in the bag, how many marbles in total Q does the bag currently hold?
Answer Choices: (A) 40 (B) 200 (C) 380 (D) 400 (E) 3200
Rationale: We know that there are 20 black marbles in the bag and this number represent 1/20 th of the number of all marbles in the bag, thus there are total Q of 20*20=400 marbles.\nAnswer: D.
Correct Answer: D


Question: Janice bikes at 10 miles per hour, while Jennie bikes at 20. How long until they have collectively biked 1 mile?
Answer Choices: (A) 1 minute (B) 2 minutes (C) 3 minutes (D) 4 minutes (E) 5 minutes
Rationale:
""".strip()

In [None]:
batch = tokenizer(EXAMPLE_3SHOT_CHAIN_OF_THOUGHT, return_tensors='pt', return_token_type_ids=False).to(device)
torch.manual_seed(1337)
output_tokens = model.generate(**batch, max_new_tokens=100, do_sample=True, top_p=0.9)
print("[Prompt:]\n" + EXAMPLE_3SHOT_CHAIN_OF_THOUGHT)
print("=" * 80)
print("[Generated:]", tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu()))
#### NOTE: scroll down for the final answer (below the ======= line)

[Prompt:]
Question: The original retail price of an appliance was 60 percent more than its wholesale cost. If the appliance was actually sold for 20 percent less than the original retail price, then it was sold for what percent more than its wholesale cost?
Answer Choices: (A) 20% (B) 28% (C) 36% (D) 40% (E) 42%
Rationale: wholesale cost = 100;
original price = 100*1.6 = 160;
actual price = 160*0.8 = 128.
Answer: B.
Correct Answer: B


Question: A grocer makes a 25% profit on the selling price for each bag of flour it sells. If he sells each bag for $100 and makes $3,000 in profit, how many bags did he sell?
Answer Choices: (A) 12 (B) 16 (C) 24 (D) 30 (E) 40
Rationale: Profit on one bag: 100*1.25= 125
Number of bags sold = 3000/125 = 24
Answer is C.
Correct Answer: C


Question: 20 marbles were pulled out of a bag of only white marbles, painted black, and then put back in. Then, another 20 marbles were pulled out, of which 1 was black, after which they were all returned to the bag. If 

__Task 6 (1 pt)__ write a function that automatically creates chain-of-thought prompts. Follow the instructions from the function docstring.

In [33]:
QUESTION_PREFIX = "Question: "
OPTIONS_PREFIX = "Answer Choices: "
CHAIN_OF_THOUGHT_PREFIX = "Rationale: "
ANSWER_PREFIX = "Correct Answer: "
FEWSHOT_SEPARATOR = "\n\n\n"

def make_prompt(*, main_question, fewshot_examples):
    """
    Your goal is to produce the same prompt as the EXAMPLE_3SHOT_CHAIN_OF_THOUGHT automatically

    For each few-shot question, make sure to follow the following rules:
    1. Each question begins with QUESTION_PREFIX, after which you should print the question without leading/trailing spaces (if any)
    2. After the question, provide space-separated options. Each option letter should be put in round brackets, followed by option text, e.g. "(A) 146%"
    3. Then, provide the rationale and the answer as a single letter (A-E)
    4. Finally, add trailing newlines from FEWSHOT_SEPARATOR

    Your final prompt should contain all fewshot_examples (in order), separated with FEWSHOT_SEPARATOR, then follow with main_question.
    The main_question should contain the question and options formatted the same way as in fewshot_examples.
    After that, you should prompt the model to produce an explanation (rationale) for the answer.

    Please make sure your prompt contains no leading/trailing newlines or spaces, same as in EXAMPLE_3SHOT_CHAIN_OF_THOUGHT
    """

    prompt = ''
    all_questions = list(fewshot_examples) + [main_question]

    for i, elem in enumerate(all_questions):
        # convert raw options such as "A)146%" into "(A) 146%"
        options_text = ''
        for option in elem['options']:
            # split on the first ")" only, since the option text itself may contain brackets
            letter, text = option.split(')', 1)
            options_text += f'({letter}) {text} '

        prompt += QUESTION_PREFIX + elem['question'] + '\n'
        prompt += OPTIONS_PREFIX + options_text.strip() + '\n'

        if i != len(all_questions) - 1:
            # few-shot example: include the worked rationale and the answer letter
            prompt += CHAIN_OF_THOUGHT_PREFIX + elem['rationale'] + '\n'
            prompt += ANSWER_PREFIX + elem['correct'] + FEWSHOT_SEPARATOR
        else:
            # main question: end with the bare rationale prefix so the model writes the rationale
            prompt += CHAIN_OF_THOUGHT_PREFIX.strip()

    return prompt

generated_fewshot_prompt = make_prompt(main_question=data[150], fewshot_examples=(data[30], data[20], data[5]))
assert generated_fewshot_prompt == EXAMPLE_3SHOT_CHAIN_OF_THOUGHT, "prompts don't match"
assert generated_fewshot_prompt != make_prompt(main_question=data[150], fewshot_examples=())
assert generated_fewshot_prompt.endswith(make_prompt(main_question=data[150], fewshot_examples=()))

print("Well done!")

# Hint: if two prompts do not match, you may find it useful to use https://www.diffchecker.com or similar to find the difference

Well done!


__Task 7 (1 point):__ Evaluate your prompt.

Please run the model on the entire dataset and measure its accuracy.
For each question, pick $n=5$ other questions at random to serve as few-shot examples. Make sure not to accidentally sample the main_question among few-shot examples. For scientific evaluation, it is also a good practice to split the data into two parts: one for eval, and another for few-shot examples. However, doing so is optional in this homework.
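One possible way to sample few-shot examples without ever picking the main question is to draw from the list of all other indices. This is a sketch only: `sample_fewshot` and `toy_data` are hypothetical stand-ins, and in the notebook you would pass the AQuA records loaded into `data`.

```python
import random

def sample_fewshot(dataset, main_index, n, rng=random):
    """Pick n distinct examples from dataset, never including dataset[main_index]."""
    candidate_indices = [i for i in range(len(dataset)) if i != main_index]
    return [dataset[i] for i in rng.sample(candidate_indices, n)]

toy_data = [{"question": f"q{i}"} for i in range(10)]  # stand-in for the AQuA records
shots = sample_fewshot(toy_data, main_index=3, n=5, rng=random.Random(0))
print([shot["question"] for shot in shots])
```

Passing a seeded `random.Random` instance makes the choice of examples reproducible between runs.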

The tricky part is when to stop generating: if you don't control for this, your model can accidentally generate a whole new question and promptly answer it :) To make sure you read off the answer to the right question, you need to __stop generating as soon as the model generates Correct Answer: [A-E]__, i.e. right after it is done explaining its solution.
To do so, you can either generate manually (see low-level generation above) or use [transformers stopping criteria](https://discuss.huggingface.co/t/implimentation-of-stopping-criteria-list/20040/2), whichever you prefer.
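Independently of how generation is stopped, the answer letter can be read off the decoded text with a regular expression. The sketch below assumes that the answer line starts with `ANSWER_PREFIX` ("Correct Answer: "); `extract_answer` is a hypothetical helper, not part of the assignment template.

```python
import re

# the prefix, optional whitespace or an opening bracket, then a single standalone letter A-E
ANSWER_PATTERN = re.compile(r"Correct Answer:\s*\(?([A-E])\b")

def extract_answer(generated_text):
    """Return (letter, text truncated right after the answer), or (None, text) if there is no answer."""
    match = ANSWER_PATTERN.search(generated_text)
    if match is None:
        return None, generated_text
    return match.group(1), generated_text[:match.end()]

sample = "Combined speed is 1/2 mile per minute\nAnswer: B.\nCorrect Answer: B\n\n\nQuestion: A new one"
letter, truncated = extract_answer(sample)
print(letter)
print(truncated)
```

Because only the first "Correct Answer:" is used, anything the model generates afterwards (such as a new question) is ignored.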

If you do everything right, the model should be much better than random. However, please __do not expect miracles__: this is far from the best models, and it will perform much worse than an average human.

In [48]:
import numpy as np
from transformers import StoppingCriteria, StoppingCriteriaList

In [90]:
# split the data: the first 70 questions are evaluated, the rest serve as few-shot examples
main_questions = data[:70]
few_shot_examples = data[70:]

In [91]:
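class StoppingCriteriaAnswer(StoppingCriteria):
    """Stops generation once the generated ids end with one of the tokenized stop phrases."""

    def __init__(self, stops=None):
        super().__init__()
        # keep the stop sequences on the same device as the generated ids
        self.stops = [stop.to(device) for stop in (stops or [])]
        # the first match only sets this flag; generation stops on a later match.
        # Note that the flag is not reset between generate() calls.
        self.stop_next = False

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs):
        for stop in self.stops:
            # compare the last len(stop) generated ids with the stop sequence
            if torch.all(stop == input_ids[0][-len(stop):]).item():
                if self.stop_next:
                    return True
                self.stop_next = True
                break

        return False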

possible_answers = ['A', 'B', 'C', 'D', 'E']

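# suffixes of "Correct Answer:" are included as well, in case the tokenizer splits the phrase
# differently inside running text
stop_words = ['Correct Answer:', 'orrect Answer:', 'rrect Answer:', 'rect Answer:', 'ect Answer:', 'ct Answer:', 't Answer:',
              'Answer:',  'nswer:', 'swer:', 'wer:', 'er:', 'r:']

# every stop phrase followed by every answer letter, with and without a space in between
all_possible_stops = [f'{stop_word} {possible_answer}' for stop_word in stop_words for possible_answer in possible_answers]
all_possible_stops += [f'{stop_word}{possible_answer}' for stop_word in stop_words for possible_answer in possible_answers]
# tokenize each stop phrase; [1:] drops the first token that the tokenizer prepends
all_possible_stops_idxs = [tokenizer(stop, return_tensors='pt', return_token_type_ids=False)['input_ids'].squeeze()[1:] for stop in all_possible_stops]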

stopping_criteria = StoppingCriteriaList([StoppingCriteriaAnswer(stops=all_possible_stops_idxs)])

In [92]:
NUM_SAMPLES = 0    # use this to count how many samples you evaluated
NUM_RESPONDED = 0  # how many times did the model produce Correct Answer: (letter) in its response. use as a sanity check.
NUM_CORRECT = 0    # how many times did the model's chosen answer (letter) match the correct answer

In [93]:
num_examples = 5
model.eval()

for i in tqdm(range(0, len(main_questions))):
    NUM_SAMPLES += 1

    main_question = main_questions[i]
    few_shot_examples_sample = [few_shot_examples[j] for j in np.random.choice(len(few_shot_examples), num_examples, replace=False)]

    prompt = make_prompt(main_question=main_question, fewshot_examples=few_shot_examples_sample)

    batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

    with torch.no_grad():
        output_tokens = model.generate(**batch, max_new_tokens=300, do_sample=True, top_p=0.9, stopping_criteria=stopping_criteria).detach()
        output = tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu())

    answer = output[-1]

    if answer in possible_answers:
        NUM_RESPONDED += 1

        if answer == main_question['correct']:
            NUM_CORRECT += 1

print(f"NUM_SAMPLES: {NUM_SAMPLES}, NUM_RESPONDED: {NUM_RESPONDED}, NUM_CORRECT: {NUM_CORRECT}")
# Optionally, consider inferencing multiple sentences in a batch for faster inference;
# If you choose to batch outputs, make sure the results are the same as with batch=1 (using greedy inference)


NUM_SAMPLES: 70, NUM_RESPONDED: 67, NUM_CORRECT: 12


In [94]:
print("Responded %%:", NUM_RESPONDED / NUM_SAMPLES)
print("Accuracy (when responded):", NUM_CORRECT / NUM_RESPONDED)
print("Accuracy (overall):", NUM_CORRECT / NUM_SAMPLES)

if NUM_RESPONDED / NUM_SAMPLES < 0.9:
  print("Something is wrong with the evaluation technique (for 5-shot CoT): the model refuses to answer too many questions.")
  print("Make sure you generate enough tokens that the model can produce a correct answer.")
  print("When in doubt, take a look at the full model output. You can often spot errors there.")

Responded %%: 0.9571428571428572
Accuracy (when responded): 0.1791044776119403
Accuracy (overall): 0.17142857142857143


__Task 8 (2 points)__ Experiment time!
<img width=200px src=https://www.evolvefish.com/cdn-cgi/image/quality%3D85/assets/images/Apparel/TShirtsWomenCont/Main/EF-APP-CWT-00068(Main).jpg>

Your final quest is to use the testbench you've just written to answer one of the following questions:

### Option 1: How many shots do you need?

How does model accuracy change with the number of fewshot examples?

a. check if the model accuracy changes as you increase/decrease the number of "shots"

b. try to prompt-engineer a model into giving the best rationale __without__ any few-shot examples, i.e. zero-shot

For zero-shot mode, feel free to use wild prompt-engineering or modify the inference procedure.

### Option 2: Is this prompting technique reliable?

_Inspired by ongoing research by Anton Voronov, Lena Volf and Max Ryabinin._

For this option, you need to check if the model behavior (and hence, accuracy) is robust to perturbations in the input prompt.

a. Does the accuracy degrade if you provide wrong answers to few-shot examples? (make sure to modify the rationale if it contains the answer at the end)

b. Does it degrade if you replace question/answer prompts with "Q" and "A"? What if you write both on the same line? Change few-shot separators?
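For experiment (a), one possible way to corrupt a few-shot example is sketched below. It assumes records shaped like the AQuA entries shown earlier (`rationale` and `correct` fields) and that the rationale, if it states the answer, does so at the very end in a form like "Answer: D." or "correct option is B". `corrupt_example` and the regular expression are hypothetical and will not cover every phrasing in the dataset.

```python
import random
import re

LETTERS = "ABCDE"
# a trailing "Answer: X", "Answer is X" or "option is X", optionally followed by a period
TRAILING_ANSWER = re.compile(r"(answer(?: is)?:?\s*|option is\s*)[A-E]\b\.?\s*$", re.IGNORECASE)

def corrupt_example(example, rng=random):
    """Return a copy of example whose answer (and trailing answer in the rationale) is wrong."""
    wrong = rng.choice([letter for letter in LETTERS if letter != example["correct"]])
    rationale = TRAILING_ANSWER.sub(lambda m: m.group(1) + wrong, example["rationale"])
    # build a new dict so that the original record stays untouched
    return {**example, "rationale": rationale, "correct": wrong}

example = {
    "question": "toy question",
    "options": ["A)1", "B)2", "C)3", "D)4", "E)5"],
    "rationale": "there are 20*20=400 marbles.\nAnswer: D.",
    "correct": "D",
}
corrupted = corrupt_example(example, rng=random.Random(0))
print(corrupted["correct"])
print(corrupted["rationale"])
```

It is worth printing a few corrupted prompts by hand, since rationales that mention the answer in another form would still leak the correct option.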



### Option 3: Inference Matters

There are many ways to run inference with the model, and not all of them are equal.

a. check whether greedy inference or beam search affects model generation quality

b. implement and evaluate sampling with voting (see explanation below).


The voting technique (b) should work as follows: first, you generate k (e.g. 50) "attempts" at an answer using nucleus sampling (or a similar technique).
Then, you count how many of those attempts chose a particular option (A, B, etc) as the final answer. The option that was chosen most frequently has the most "votes", and therefore "wins".

To speed up voting, you may want to generate these attempts in parallel as a batch. That should be very easy to implement: just run `model.generate` on a list with multiple copies of the same prompt.
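The counting step of the voting procedure can be sketched without a model. Here `majority_vote` is a hypothetical helper that takes the answer letters extracted from the k attempts, with `None` standing for attempts in which no answer was found.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer letter, ignoring failed attempts (None)."""
    votes = Counter(answer for answer in answers if answer is not None)
    if not votes:
        return None  # no attempt produced an answer
    return votes.most_common(1)[0][0]

attempts = ["B", "A", "B", None, "C", "B", "A"]
print(majority_vote(attempts))
```

In case of a tie, `Counter.most_common` returns the option that was encountered first, so you may want to break ties differently (e.g. at random).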




================================================

__Common rules:__ You will need to test both hypotheses (A and B) in the chosen option. You may choose to replace one of them with your own idea - but please ask course staff in advance (via telegram) if you want full points.

Feel free to organize your code and report as you see fit - but please make sure it's readable and the code runs top-to-bottom :)
Write a short informal report about what you tried and, in doing so, what you found. Minimum of 2 paragraphs; more is ok; creative visualizations are welcome.

You are allowed (but not required) to prompt the model into generating a report for you --- or helping you write one. However, if you do so, make sure that it is still human-readable :)



**Option 1**.

a. Let's experiment with the number of examples (shots): 1, 2, 3 and 4

In [95]:
import gc

In [96]:
for num_examples in [1, 2, 3, 4]:
    NUM_SAMPLES = 0    # use this to count how many samples you evaluated
    NUM_RESPONDED = 0  # how many times did the model produce Correct Answer: (letter) in its response. use as a sanity check.
    NUM_CORRECT = 0    # how many times did the model's chosen answer (letter) match the correct answer

    model.eval()

    for i in tqdm(range(0, len(main_questions))):
        NUM_SAMPLES += 1

        main_question = main_questions[i]
        few_shot_examples_sample = [few_shot_examples[j] for j in np.random.choice(len(few_shot_examples), num_examples, replace=False)]

        prompt = make_prompt(main_question=main_question, fewshot_examples=few_shot_examples_sample)

        batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

        with torch.no_grad():
            output_tokens = model.generate(**batch, max_new_tokens=300, do_sample=True, top_p=0.9, stopping_criteria=stopping_criteria).detach()
            output = tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu())

        answer = output[-1]

        if answer in possible_answers:
            NUM_RESPONDED += 1

            if answer == main_question['correct']:
                NUM_CORRECT += 1

    print('num_examples = ', num_examples)
    print(f"NUM_SAMPLES: {NUM_SAMPLES}, NUM_RESPONDED: {NUM_RESPONDED}, NUM_CORRECT: {NUM_CORRECT}")
    print("Responded %%:", NUM_RESPONDED / NUM_SAMPLES)
    print("Accuracy (when responded):", NUM_CORRECT / NUM_RESPONDED)
    print("Accuracy (overall):", NUM_CORRECT / NUM_SAMPLES)

    gc.collect()
    torch.cuda.empty_cache()


num_examples =  1
NUM_SAMPLES: 70, NUM_RESPONDED: 36, NUM_CORRECT: 7
Responded %%: 0.5142857142857142
Accuracy (when responded): 0.19444444444444445
Accuracy (overall): 0.1



num_examples =  2
NUM_SAMPLES: 70, NUM_RESPONDED: 68, NUM_CORRECT: 11
Responded %%: 0.9714285714285714
Accuracy (when responded): 0.16176470588235295
Accuracy (overall): 0.15714285714285714



num_examples =  3
NUM_SAMPLES: 70, NUM_RESPONDED: 69, NUM_CORRECT: 22
Responded %%: 0.9857142857142858
Accuracy (when responded): 0.3188405797101449
Accuracy (overall): 0.3142857142857143



num_examples =  4
NUM_SAMPLES: 70, NUM_RESPONDED: 60, NUM_CORRECT: 15
Responded %%: 0.8571428571428571
Accuracy (when responded): 0.25
Accuracy (overall): 0.21428571428571427


Result with 5 examples:

* Responded %%: 0.9571428571428572

* Accuracy (when responded): 0.1791044776119403

* Accuracy (overall): 0.17142857142857143

It is hard to say that quality depends on the number of examples: going from 2 to 3 shots, overall accuracy rises (0.157 to 0.314), but going from 3 to 4 it drops again (to 0.214). Perhaps the specific examples that get sampled affect the resulting quality.
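A rough way to judge whether such differences are meaningful is to attach a binomial standard error to each accuracy. The sketch below uses the normal approximation, which is crude for only 70 questions, and it ignores the extra randomness coming from sampled few-shot examples and nucleus sampling. `accuracy_with_error` is a hypothetical helper; the counts are copied from the runs above.

```python
import math

def accuracy_with_error(num_correct, num_samples):
    """Return accuracy and its binomial standard error sqrt(p * (1 - p) / n)."""
    p = num_correct / num_samples
    return p, math.sqrt(p * (1 - p) / num_samples)

# shots -> NUM_CORRECT, copied from the runs above; every run used 70 questions
results = {1: 7, 2: 11, 3: 22, 4: 15, 5: 12}
for shots, correct in results.items():
    acc, err = accuracy_with_error(correct, 70)
    # 1.96 standard errors give an approximate 95% interval
    print(f"{shots}-shot: accuracy {acc:.3f} +/- {1.96 * err:.3f}")
```

With 70 questions the approximate 95% half-widths are around 0.07 to 0.11, so, for example, the 3-shot and 4-shot intervals overlap.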

b.

In [98]:
new_beginning = """Given a question and five answer options (A, B, C, D and E), you have to write the solution and choose the correct answer.
You should write a solution after "Rationale: ", after that you have to give an answer starting with "Correct Answer: "\n\n\n
"""

In [99]:
NUM_SAMPLES = 0    # use this to count how many samples you evaluated
NUM_RESPONDED = 0  # how many times did the model produce Correct Answer: (letter) in its response. use as a sanity check.
NUM_CORRECT = 0    # how many times did the model's chosen answer (letter) match the correct answer

model.eval()

for i in tqdm(range(0, len(main_questions))):
    NUM_SAMPLES += 1

    main_question = main_questions[i]

    prompt = new_beginning
    prompt += make_prompt(main_question=main_question, fewshot_examples=[])

    batch = tokenizer(prompt, return_tensors='pt', return_token_type_ids=False).to(device)

    with torch.no_grad():
        output_tokens = model.generate(**batch, max_new_tokens=300, do_sample=True, top_p=0.9, stopping_criteria=stopping_criteria).detach()
        output = tokenizer.decode(output_tokens[0][batch['input_ids'].shape[1]:].cpu())

    answer = output[-1]

    if answer in possible_answers:
        NUM_RESPONDED += 1

        if answer == main_question['correct']:
            NUM_CORRECT += 1

# print('num_examples = ', num_examples)
print(f"NUM_SAMPLES: {NUM_SAMPLES}, NUM_RESPONDED: {NUM_RESPONDED}, NUM_CORRECT: {NUM_CORRECT}")
print("Responded %%:", NUM_RESPONDED / NUM_SAMPLES)
print("Accuracy (when responded):", NUM_CORRECT / NUM_RESPONDED)
print("Accuracy (overall):", NUM_CORRECT / NUM_SAMPLES)


num_examples =  4
NUM_SAMPLES: 70, NUM_RESPONDED: 20, NUM_CORRECT: 7
Responded %%: 0.2857142857142857
Accuracy (when responded): 0.35
Accuracy (overall): 0.1


The **num_examples = 4** line in the output above is left over from the previous piece of code; I forgot to remove it. No few-shot examples are actually used here.

Again, it is hard to draw a conclusion about quality, but in this case NUM_RESPONDED is much lower than with few-shot examples (20 out of 70, versus 36 to 69 out of 70).