In [1]:
!pip install datasets



# 1. Tokenization (15 pts)

How does one represent textual input to language models? One strategy that we have seen is to split up words on spaces, e.g.,

> This is an example.

> [This, is, an, example],

but this fails when unseen words appear at test time, e.g.,

> We named our son nwonkun.

> [We, named, our, son, \<unk\>] (5 tokens).

One solution to this problem is to use character-level tokens

> [W, e, _, n, a, m, e, d, _, o, u, r, _, s, o, n, _, n, w, o, n, k, u, n]

(24 tokens, if I counted right), but now the number of tokens required to encode a sentence has increased a *lot*.

## 1.1 Byte-pair encodings and sub-word tokenization (5 pts)

[Byte-pair encodings (BPE)](https://en.wikipedia.org/wiki/Byte_pair_encoding) are a clever middle ground for the tokenization problem.
Starting from a character-level tokenization, iteratively combine the most common bigrams (token pairs) into their own tokens.
For example, the most common bigrams from the previous example are "_n" and "on". Breaking the tie arbitrarily and creating a new token "_n" we now have

> [W, e, _n, a, m, e, d, _, o, u, r, _, s, o, n, _n, w, o, n, k, u, n]

reducing the token count to 22. Iteratively applying this rule, we can further reduce it to 20 tokens by adding the token "on", and so on. Each step of this algorithm greedily reduces the token count by the maximum amount.

This tokenization scheme, known as "sub-word tokenization" takes the best of both worlds: since the vocabulary still contains tokens for every byte, we never have to use the \<unk\> token, while still reducing the number of required tokens to encode a sequence. The more tokens you add, the shorter your sequence gets.

To decide which tokens to add to the vocabulary, we have to *train* our BPE tokenizer on a corpus.
In this section you will do just that.

In [2]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
train: str = str.join(" ", dataset["train"]["text"])[:pow(10, 6)]
test: str = str.join(" ", dataset["test"]["text"])[:pow(10, 6)]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
from itertools import chain, pairwise
from collections import Counter
from tqdm import tqdm

class Tokenizer:
    # The lookup list contains *byte groups*, represented as a tuple of ints.
    # The token ID for a byte group is its index in the list.
    vocab: list[tuple[int, ...]]

    def __init__(self, training_seq: str, vocab_size: int) -> None:
        # Initialize a lookup with single-byte groups
        self.vocab = [(i,) for i in range(pow(2, 8))]
        byte_seq = list(bytes(training_seq, "utf-8"))

        for _ in tqdm(range(pow(2, 8), vocab_size)):
            """
            TODO: iteratively add the most common token pairs to the vocabulary.
            Advice: try using Counter and pairwise from the python std lib.
            """
            pairs = pairwise(byte_seq)
            pair_counts = Counter(pairs)

            if not pair_counts:
                break
            most_common_pair = max(pair_counts, key=pair_counts.get)
            most_common_pair_token = self.vocab[most_common_pair[0]] + self.vocab[most_common_pair[1]]

            self.vocab.append(most_common_pair_token)

            i = 0
            new_seq = []
            while i < len(byte_seq):
                if i < len(byte_seq) - 1 and (byte_seq[i], byte_seq[i + 1]) == most_common_pair:
                    new_seq.append(len(self.vocab)-1)
                    i += 2
                else:
                    new_seq.append(byte_seq[i])
                    i += 1
            byte_seq = new_seq


    def tokenize(self, seq: str) -> list[int]:
        byte_seq = list(bytes(seq, "utf-8"))
        token_seq = []
        i = 0

        while i < len(byte_seq):
            same, same_len = None, 0

            for j in range(1, min(10, len(byte_seq) - i + 1)):
                token_to_find = tuple(byte_seq[i:i + j])
                if token_to_find in self.vocab:
                    same = self.vocab.index(token_to_find)
                    same_len = j

            token_seq.append(same if same is not None else byte_seq[i])
            i += same_len if same is not None else 1

        return token_seq


    def detokenize(self, token_seq: list[int]) -> str:
        byte_seq = []
        for token in token_seq:
            byte_seq[len(byte_seq):] = self.vocab[token]
        return bytes(byte_seq).decode("utf-8")

train_data = train[:10000]
tokenizer = Tokenizer(train_data, vocab_size=500)

print("Some of our new tokens:")
for token in tokenizer.vocab[-10:]:
    print(repr(bytes(token).decode("utf-8")))

100%|██████████| 244/244 [00:02<00:00, 95.76it/s] 

Some of our new tokens:
'thro'
'throug'
'pro'
'se'
'diff'
'squad '
'batt'
'p '
'Ar'
'Arm'





As a sanity check, your implementation should be able to compress the training set to ~40-50% of its original size.
You should notice that the test set compression does not perform as well. This is because the distribution of bigrams in the test set does not exactly match the that of the train set. This gets worse the further your test set distribution is from your training set.

In [8]:
# Do not edit this code cell
test_data = test[:10000]
train_bytes_len = len(bytes(train_data, "utf-8"))
train_token_len = len(tokenizer.tokenize(train_data))
print(f"Compressed train set to {train_token_len / train_bytes_len * 100:.0f}% original size")
test_bytes_len = len(bytes(test_data, "utf-8"))
test_token_len = len(tokenizer.tokenize(test_data))
print(f"Compressed test set to {test_token_len / test_bytes_len * 100:.0f}% original size")

assert train_data == tokenizer.detokenize(tokenizer.tokenize(train_data))
assert test_data == tokenizer.detokenize(tokenizer.tokenize(test_data))

Compressed train set to 43% original size
Compressed test set to 52% original size


## 1.2 BPE performance on OOD text. (5 pts)

Explore how English-trained BPE performs on non-English text by downloading corpora from a few different languages and using your English-trained tokenizer. What do you find? Do the results match your expectations? For what langauges does the tokenizer struggle with the most? How might this impact society if everyone were to use your tokenizer?

Include your code, results, and discussion in new cells below.

Hint: we recommend you use `load_dataset` to fetch from HuggingFace with `streaming=True` to avoid huge downloads. You might want to take a look at the `oscar` dataset.

In [9]:
# TODO
from datasets import load_dataset
from collections import Counter

languages = ["zh", "ja", "de", "ar", "es"]
samples = {}

for lang in languages:
    dataset = load_dataset("oscar", f"unshuffled_deduplicated_{lang}", split="train", streaming=True)
    samples[lang] = " ".join([next(iter(dataset))["text"][:500]])

tokenizer = Tokenizer(training_seq=train_data, vocab_size=500)
for lang, text in samples.items():
    print(f"{lang}: {text[:100]}")
    tokens = tokenizer.tokenize(text)
    print(f"{lang}: Compressed to {len(tokens) / len(bytes(text, 'utf-8')) * 100:.0f}% original size")
    print()

100%|██████████| 244/244 [00:00<00:00, 250.65it/s]


zh: 时间可以被缩短，但过程不可以被省略，只有真正为社会创造价值的企业才能基业长青。大巧不工，重剑无锋，企业最终还是要用业绩和结果说话。一级a卡片在线观看通过产品打磨、团队搭建、市场营销、经营管理等过程，秉
zh: Compressed to 100% original size

ja: 神社などへ一緒に同行して、様々な角度のショットで家族写真やお子様の写真を撮影致します！お好みに合わせて様々な写真を取ることができますので、その場でカメラマンへのリクエストも可能です！お子様の晴れ姿を、
ja: Compressed to 100% original size

de: Dosierförderbänder Getriebe Entwässerungssiebmaschine USE 1400 x 3500 mm Eimerkettenbagger Entstaubu
de: Compressed to 79% original size

ar: مرحبا بك عزيز الزائر نتمنى لك أوقاتاً سعيدة معنا وأن نزداد شرفا بخدمتك ولا تنسى التسجيل معنا لتستفيد
ar: Compressed to 99% original size

es: Como se librará de la celulitis en el gimnasio La piel superflua en las manos después del adelgazami
es: Compressed to 62% original size



Observations:

- We can see that our tokenizer performs differently on each language just as we expected.
- Our tokenizer struggles the most for languages that have characters/alphabets different from the English language like Chinese, Japanese and Arabic. Tokens for these languages hardly get compressed by 1%.
- For languages like German and Spanish, which have a lot of similar alphabets when compared to english, the tokenizer performs decently.
- Thus, our tokenizer would only perform well if its used on languages that have similar alphabets to the english language.
- If we want a universal tokenizer for our society, it must be trained on multilingual corpora, ensuring better support for all languages in a globalized world

## 1.3 Pitfalls of and alternatives to BPE (5 pts)

BPE tokenization sufferes from other issues as well. Due to the implementation of our BPE tokenizer, detokenizing a sequence of tokens then re-tokenizing it does not always recover the original sequence:
```
vocab = {a, aa, b}
tokens = [0, 1, 2]
detokenized = aaab
retokenized = [1, 0, 2]
```

Another issue is that some tokens that may have been prevalent during BPE training may not be present during language model training, leading to funky situations where the language model has not been trained to represent or output some tokens. See this paper for more information: https://arxiv.org/pdf/2405.05417.

Some NLP researchers think that we should move away from sub-word tokenization to get rid of these problems. Engage with this discussion by either
- Finding a paper that points out an issue with tokenization and propose your own solution for how you would fix it, or
- Finding a paper that proposes an alternative tokenization scheme (or way of processing text) and discuss the drawbacks of the proposed method.

Your response should be about a paragraph in length and link to a paper.

The paper "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training" by Chizhov et al. (https://arxiv.org/html/2409.04599v1) explores challenges in BPE tokenization, especially the issue of "junk" tokens that clutter the vocabulary but rarely get used. To tackle this, the authors introduce "Picky BPE," a refined approach to BPE that filters out low-frequency intermediate tokens during training. This method makes the vocabulary more efficient and reduces the presence of under-trained tokens, which negatively impacts the language model performance. While Picky BPE offers a more compressed vocabulary, it is still based on the traditional subword tokenization technique, meaning it doesn’t completely solve issues like mismatches between detokenization and re-tokenization. Even so, tests show that Picky BPE generally maintains or even enhances model performance, making it a useful improvement over standard BPE.

# 2. Generation Algorithms (35 pts + 15 pts BONUS)

In this problem, we will implement several common decoding algorithms and test them with the GPT-2 Medium model.

Given the class below, we will fill in each of the method stubs. You may create additional helper methods as well to make components re-usable.

**You are not allowed to use the generate() function in the transformers library. You can only use the model's forward() method to retrieve final layer logits**

In addition to the methods we ask you to implement, which are:
- Greedy decoding
- Temperature Sampling
- Nucleus Sampling

You will choose ONE of the following sampling algorithms to implement as well (make sure to add your own method, since we do not provide one by default):
- Typical Sampling ([Meister et al. (2022)](https://arxiv.org/abs/2202.00666))
- Eta Sampling ([Hewitt et al. (2022)](https://arxiv.org/abs/2210.15191))

Points for this question will be distributed as follows:

- 5-10 points for implementing each decoding algorithm
- 5 points for implementing the generate() function (you will make this incrementally through each sub-part)
- 5 points for filling out the table with list of tokens (see instructions below)

In [61]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Optional
import numpy as np

class LM():
    def __init__(self, model_name: str = "openai-community/gpt2-medium"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.model.eval()

    def greedy_decoding(self, prompt: str, max_length: int = 64) -> str:
        """
        TODO:

        Implement greedy decoding, in which we use the highest
        probability token at each decoding step
        """
        tokens = self.tokenizer(prompt, return_tensors="pt")
        token_ids = tokens.input_ids
        logits = self.model(input_ids=token_ids).logits[:, -1, :]
        k_scores, k_idxs = torch.topk(logits, k=10, dim=-1)
        k_probs = torch.nn.functional.softmax(k_scores, dim=-1)
        k_tokens = [self.tokenizer.decode([i]) for i in k_idxs[0]]

        print(f"Top 10 tokens for temperature sampling:", k_tokens)

        for i in range(max_length):
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
            token_ids = torch.cat((token_ids, next_token), dim=1)
            if next_token.item() == self.tokenizer.eos_token_id:
                break
            logits = self.model(input_ids=token_ids).logits[:, -1, :]

        return self.tokenizer.decode(token_ids[0], skip_special_tokens=True)

    def temperature_sampling(self, prompt: str, temperature: float = 1.0, max_length: int = 64) -> str:
        """
        TODO:

        Implement temperature sampling, in which we sample
        from the output distribution at each decoding step,
        with a temperature parameter to control the "peakiness"
        of the output distribution
        """

        tokens = self.tokenizer(prompt, return_tensors="pt")
        token_ids = tokens.input_ids
        logits = self.model(input_ids=token_ids).logits[:, -1, :]
        logits = logits / temperature
        k_probs = torch.nn.functional.softmax(logits, dim=-1)
        k_scores, k_idxs = torch.topk(k_probs, k=10, dim=-1)

        k_tokens = [self.tokenizer.decode([i]) for i in k_idxs[0]]

        print(f"Top 10 tokens for temperature sampling:", k_tokens)

        for i in range(max_length):
            next_token = torch.multinomial(k_probs, num_samples=1)
            token_ids = torch.cat((token_ids, next_token), dim=1)
            if next_token.item() == self.tokenizer.eos_token_id:
                break
            logits = self.model(input_ids=token_ids).logits[:, -1, :] / temperature
            k_probs = torch.nn.functional.softmax(logits, dim=-1)

        return self.tokenizer.decode(token_ids[0], skip_special_tokens=True)

    def nucleus_sampling(self, prompt: str, p: float = 0.9, max_length: int = 64, temperature: float = 1.0) -> str:
        """
        TODO:
        Implement nucleus sampling, in which we
        sample from a subset of the vocabulary
        at each decoding step
        Note: There is also a temperature parameter here
        """
        tokens = self.tokenizer(prompt, return_tensors="pt")
        token_ids = tokens.input_ids
        generated_ids = token_ids.tolist()[0]

        for i in range(max_length):
            with torch.no_grad():
                logits = self.model(input_ids=token_ids).logits[:, -1, :]
                logits /= temperature
                all_probs = torch.nn.functional.softmax(logits, dim=-1)

                probs, idxs = torch.sort(all_probs, descending=True, dim=-1)
                cumulative_probs = torch.cumsum(probs, dim=-1)
                mask = cumulative_probs.gt(p)
                mask[..., 1:] = mask[..., :-1].clone()
                mask[..., 0] = False
                probs.masked_fill_(mask, 0.0)

                prob_sum = probs.sum(dim=-1, keepdim=True) + 1e-9
                normalized_probs = probs / prob_sum

                if i == 0:
                    top_k = min(10, normalized_probs.size(-1))
                    top_k_probs, top_k_idxs = torch.topk(normalized_probs, k=top_k, dim=-1)
                    top_k_tokens = [self.tokenizer.decode([idx]) for idx in top_k_idxs[0]]
                    print('Top 10 tokens for Nucleus Sampling:', top_k_tokens)

                pred_token = idxs[0, torch.multinomial(normalized_probs, num_samples=1).item()].item()
                generated_ids.append(pred_token)

                if pred_token == self.tokenizer.eos_token_id:
                    break

                token_ids = torch.tensor([generated_ids])

        return self.tokenizer.decode(generated_ids, skip_special_tokens=True)



    def typical_sampling(self, prompt: str, typical_threshold: float = 0.3, max_length: int = 64, epsilon: float = 0.0) -> str:
      token_ids = self.tokenizer.encode(prompt, return_tensors='pt')
      generated_ids = token_ids.tolist()[0]

      for i in range(max_length):
        logits = self.model(token_ids).logits[:, -1, :]

        prob = torch.nn.functional.softmax(logits, dim=-1)
        prob_log = -1 * torch.nn.functional.log_softmax(logits, dim=-1)

        entropy = -torch.sum(prob * prob_log, dim=-1).unsqueeze(-1)
        typicality = torch.abs(prob_log - entropy)


        idxs = torch.argsort(typicality, dim=-1, descending=False)

        threshold = (typical_threshold * idxs.size(-1))
        threshold = torch.tensor(threshold, dtype=torch.int64)

        threshold_indices = idxs.narrow(-1, 0, threshold).squeeze(0)

        threshold_probs = torch.gather(prob, -1, threshold_indices.unsqueeze(0))
        threshold_prob_sum = threshold_probs.sum(dim=-1, keepdim=True) + 1e-9
        threshold_probs = threshold_probs / threshold_prob_sum

        if i == 0:
            top_k = min(10, threshold_probs.size(-1))
            top_k_probs, top_k_idxs = torch.topk(threshold_probs, k=top_k, dim=-1)
            top_k_tokens = [self.tokenizer.decode([threshold_indices[idx].item()]).strip() for idx in top_k_idxs[0]]
            print("Top 10 tokens for Typical Sampling", top_k_tokens)


        sampled_index = torch.multinomial(threshold_probs, num_samples=1).item()
        pred_token_id = torch.index_select(threshold_indices, 0, torch.tensor([sampled_index])).item()
        generated_ids.append(pred_token_id)
        if pred_token_id == self.tokenizer.eos_token_id:
            break
        token_ids = torch.tensor([generated_ids])

      return self.tokenizer.decode(generated_ids, skip_special_tokens=True)


    def generate(self,
             prompt: str,
             temperature: float = 1.0,
             p: Optional[float] = None,
             max_len: int = 64,
             typical_threshold: Optional[float] = None) -> str:
        """
        TODO:

        Route to the appropriate generation function
        based on the arguments
        HINT: What parameter values should map to greedy decoding?
        """
        if temperature == 0:
            return self.greedy_decoding(prompt, max_length=max_len)

        elif typical_threshold:
            return self.typical_sampling(prompt, typical_threshold=typical_threshold, max_length=max_len)

        elif p:
            return self.nucleus_sampling(prompt, p=p, max_length=max_len, temperature=temperature)

        else:
            return self.temperature_sampling(prompt, temperature=temperature, max_length=max_len)

GPT2LM = LM("openai-community/gpt2-medium")

For each sampling algorithm you implement, fill out this table, in which you will list the top 10 highest probability tokens **at the first decoding step** in a comma separated list. For algorithms like nucleus sampling where you perform some kind of truncation/re-distribution of the output distribution, do the truncation/re-distribution first, and then sort the vocabulary by probability to complete the table.

For this and all questions below, use the following prompt:


**"Once upon a time in a land far far away, "**

Note: Use the default value for `max_length` for all questions below.

| **Decoding Algorithm** | **10 Highest Probability Tokens** |
|------------------------|-----------------------------------|
| Greedy                 | [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']                                 |
| Temperature (t=1.0)    | [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']                                 |
| Nucleus (p=0.9)        | ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']                                |
| Typical/Eta            | ['there', 'a', 'the', 'in', '', 'I', 'an', 'when', 'two', 'you']                                 |

In [35]:
prompt = "Once upon a time in a land far far away,"

In [57]:
GPT2LM.generate(prompt, temperature=1)

Top 10 tokens for temperature sampling: [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']


'Once upon a time in a land far far away,\n\n\nOur ancestors walked like Zhuangzi "Leap for joy,"\n\n\nI Have tried to put on a brave face\n\n\nAnd tell you that I\'m sorry I can\'t be here.\n\n\nI\'ve wandered all over the world and been\n\n\nA young fool of a screen writer\n\nAnd'

## 2.1 Greedy Decoding (5 points)

First, implement the most simple decoding method of greedy decoding. Here, at each decoding time step, simply use the highest probability token. Note that you'll need to adjust the generate function so that a specific temperature value will map to greedy decoding (what should that value be?).

Use the prompt given above to test your implementation. What do you notice?

In [58]:
print("Greedy Decoding Generation:", GPT2LM.generate(prompt, temperature=0))

Top 10 tokens for temperature sampling: [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']
Greedy Decoding Generation: Once upon a time in a land far far away, there lived a man named Simeon. He was a wise man, and he knew the secrets of the universe. He knew that the universe was made of many worlds, and that each world was made of a single substance. He knew that each substance was made of a single substance, and that each substance was made


- It an be seen that the output is very repetitive and lacks creativity in general.
- The generated text is highly deterministic, and the model consistently chooses the highest-probability token at each step.

##2.2 Temperature Sampling (10 pts)

Sometimes (a lot of the time?), we don't actually just want the highest probability token at each time step. Why might this be the case?

- Selecting the highest probability token at each time step can lead to repetitive outputs, particularly in creative tasks like story generation..
- Introducing some randomness can produce more varied and engaging results.

To adjust for this, we often use sampling algorithms instead of greedy decoding. However, there are many ways we can go about sampling.

First, implement temperature sampling. Recall that the temperature parameter adjusts the "randomness" of the output at each time step. Here, you'll need to think about how to adjust the output distribution which you will do multinomial sampling from. Be careful about how you will handle very low (close to 0) temperatures.

Given the same prompt as above, test your implementation with the following temperature values: [0.3, 0.5, 0.7, 0.9, 1.1]. For each value, sample 3 outputs. What do you notice in terms of the differences between output sets across different temperature values?  

In [10]:
#TODO
for i in [0.3, 0.5, 0.7, 0.9, 1.1]:
    print(f'For temperature = {i}')
    for j in range(3):
        print(f'Sample {j+1}: {GPT2LM.generate(prompt, temperature=i)}')
        print()

For temperature = 0.3
Top 10 tokens for temperature sampling: [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']
Sample 1: Once upon a time in a land far far away, a young man1996 was born. He was a boy of eleven. His mother was a woman of the highest rank. She was a very beautiful woman. She was tall and slender, with a beautiful face. She had a very beautiful figure. She was very beautiful. Her hair was long and silky. She had

Top 10 tokens for temperature sampling: [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']
Sample 2: Once upon a time in a land far far away, there lived a man who had a great desire to see the world. He was a man of great wisdom, and he had a great desire to see the world. He went to the city of his birth, and there he saw the world. And he said to the gods, "I want to see the world,

Top 10 tokens for temperature sampling: [' there', ' a', ' the', ' in', '\n', ' I', ' an', ' when', ' two', ' you']
Sample 3: O

| **Temperature** | **Output 1**                                                                                                                                                                                                                  | **Output 2**                                                                                                                                                                                                                  | **Output 3**                                                                                                                                                                                                                  |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **0.3**         | Once upon a time in a land far far away, a young man1996 was born. He was a boy of eleven. His mother was a woman of the highest rank. She was a very beautiful woman. She was tall and slender, with a beautiful face. She had a very beautiful figure. She was very beautiful. Her hair was long and silky. She had | Once upon a time in a land far far away, there lived a man who had a great desire to see the world. He was a man of great wisdom, and he had a great desire to see the world. He went to the city of his birth, and there he saw the world. And he said to the gods, "I want to see the world, | Once upon a time in a land far far away, there lived a man named Tengen. He was a very wise man, and he had a great many books. One day, he decided to write down all the knowledge he had learned. He began to write down the knowledge he had learned, and he began to write down the knowledge he had learned. He wrote |
| **0.5**         | Once upon a time in a land far far away, a warrior was summoned by a powerful wizard. The warrior was given a weapon and armor, and instructed to lead a charge against a terrible creature that had been summoned tominimum. The warrior was given the task of slaying the creature, but the wizard had other plans. The wizard was able to summon a number of creatures that | Once upon a time in a land far far away, there lived a man named Gulliver, who lived in a cave. In the cave he was surrounded by trees, and he was surrounded by trees. One day, he heard a tree falling. He ran from the cave and came to the tree. He fell into the tree and was saved by the tree. He | Once upon a time in a land far far away, the children of the earth were born to those who were called the sons of God. And those children were called the pure in heart, and the good in heart, and the faithful in word and deed. And those who were called sons of God did not commit adultery, nor steal, nor lie, nor covet. |
| **0.7**         | Once upon a time in a land far far away, there lived a good-natured man. His name was St. Bernard who was a bishop in the French church and had recently entered into interposition of the Holy See. The bishop (who had been made bishop by his own authority) received the part of being archbishop. St. | Once upon a time in a land far far away, there was a man named Hazmat. You know the very first person you ever saw wearing a red "Cancer" baseball cap? Well, Hazmat was there. Once upon a time in a land far far away, there was a man named Hazmat. You know the very first person you ever saw | Once upon a time in a land far far away, there was a shard of a man in a white robe who lived in a village. He was a very wise and powerful man, and he kept Flame Torches and stored them away in a dungeon deep within the village. One day, the village chief sensed that someone had broken into the locked dungeon and stole the treasure |
| **0.9**         | Once upon a time in a land far far away, there lived a wise schoolmaster. He was a close friend of mine, and for hundreds of years, they enjoyed martial arts together. One day, surely, the schoolmaster's life threatened to collapse like a sack of potatoes. He wanted to try out some of the mystic arts, but discovered they weren't as fun | Once upon a time in a land far far away, The beautiful, the divine, the exalted! East and West, The universal, the intangible! No hair on the tree's upper branch stirs, No echo echoes in the jungle… No gods, no gods! No heaven, no heaven sings in the | Once upon a time in a land far far away, there lived a young brat with a dream. He dreamed of writing songs. The land had no music. He had wandered long way from home, had ventured far in the cold, and unknowingly discovered a monastery dedicated to the healing arts. That young monk was a mage known as Quodam Oran. He |
| **1.1**         | Once upon a time in a land far far away, men were poised try doe craunce Isaac boding lain arnieix. color-rich and frhapes i'odegh tife regor bur poogs cancer rigor Hydra Stan chofose an loostkutihe rudo Kamfeaaeead foghdrll le l | Once upon a time in a land far far away, the dragon queen was murdered–the woman who protected Citizens of Hyzu from Zuko in the past…only to be rejuvenated by her…daughter Princess Na Lu Nsungaserkin. Understandably horrified, of course, Princess Na Lu figured that her sister would be around, and so word swept theDI! | Once upon a time in a land far far away, a humble band of moonwars rose up out of cold cockles stalagmite ejecta and glowed military sorcery until their encroaching glory caused Som swirly artillery bombardment across the continent in ship after ship on an impromptu multinational campaign of murderous techno-street vigilantism by frightening NASCAR driver huskies in ne |

- Lower Temperatures (e.g., 0.3, 0.5): Produce predictable, repetitive, and logically coherent outputs.
- Moderate Temperatures (e.g., 0.7): Shows good balance between creativity and coherence, generating diverse yet readable outputs.
- Higher Temperatures (e.g., 0.9, 1.1): Prioritizes diversity and creativity but can lead to chaotic, incoherent, or nonsensical outputs.

## 2.3 Nucleus Sampling (10 pts)

Originally published in [Holtzmann et al. (2021)](https://arxiv.org/abs/1904.09751), nucleus sampling was designed to address an issue that was especially prevalent in language models at the time.

This issue is the case of "neural text degeneration," where outputs from LMs would often degenerate into gibberish if a low probability token was ever decoded. To address this, nucleus (also known as top-p) sampling uses a hyperparameter, p, to control how big of a subset of the vocabulary we sample from at each step. For example, if p=0.9, we only sample from the subset of tokens that have a cumulative probability mass of 0.9 (after sorting by probability).

Implement nucleus sampling and then use the same prompt as above and test your implementation with the following p-values: [0.97, 0.95, 0.9, 0.8, 0.7]
What do you notice across outputs?

In [56]:
#TODO
for p in [0.97, 0.95, 0.9, 0.8, 0.7]:
    for j in range(3):
        print('Output for p =', p, ':', GPT2LM.generate(prompt, p=p))

Top 10 tokens for Nucleus Sampling: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
Output for p = 0.97 : Once upon a time in a land far far away, humans travel the land watching over it, watching over rule by perfect government, loyal citizens, and peaceful citizens. In times that have passed, the shadows go as far as to visit this ancient tale, this fear never ceases to infest this people. The voices, whispers, scars, and hallucinations that haunt the most may
Top 10 tokens for Nucleus Sampling: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
Output for p = 0.97 : Once upon a time in a land far far away, a few warriors off the farmland of England suddenly faced an enemy battalion - thirty Orks who were near impossible to kill and so hot on their heels with a war against Chaos at hand that no ordinary army would stand a chance. Suddenly they were prey to snipers, canteens full of contaminated supplies of life feed, stolen
Top 10 tokens for Nucleus Sampling: ['!', '"', '#', '$', 

| **p**  | **Output 1**                                                                                                                                                                                                                          | **Output 2**                                                                                                                                                                                                                          | **Output 3**                                                                                                                                                                                                                          |
|--------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0.97   | Once upon a time in a land far far away, humans travel the land watching over it, watching over rule by perfect government, loyal citizens, and peaceful citizens. In times that have passed, the shadows go as far as to visit this ancient tale, this fear never ceases to infest this people. The voices, whispers, scars, and hallucinations that haunt the most may | Once upon a time in a land far far away, a few warriors off the farmland of England suddenly faced an enemy battalion - thirty Orks who were near impossible to kill and so hot on their heels with a war against Chaos at hand that no ordinary army would stand a chance. Suddenly they were prey to snipers, canteens full of contaminated supplies of life feed, stolen | Once upon a time in a land far far away, men were violent about their women… During the nineteenth century, tales of female "aggression" through the objectification of women reached folkways through the popular media. At the time, domestic violence was uncommon enough that it seemed reasonable to expect "harmful sex" as a cause of domestic violence. Meanwhile,   |
| 0.95   | Once upon a time in a land far far away, he met Osiris — it was Horus [Hekelspower21] with whom he'd fallen in love before" (Varro). If Horus is now intimately associated with the Ennead (with whom he previously fell in love), his clearly canonical association with Osiris suggests another related cosmic marriage. This coupling and possibly inter | Once upon a time in a land far far away, The world had a star-shaped shape. As a king of armies began his ascension to paradise, The king brought man back with him upon a stool. The king of heaven spoke to man, saying "Ask the lord to let go of man, Tell him of | Once upon a time in a land far far away, there lived a wise man who lived in a village ruled by a bishop called Rubaiyan. A descendant of one of the sons of Rubaiyan, the rich farmer, lived in a village ruled by his sister named Neliairy. And in those days she married an only son, and we all were |
| 0.9    | Once upon a time in a land far far away, there lived a land called Mognirwyn . There, like many, were great swords. In many lands, like many nations, there lived men who made great swords. The people of Mognirwyn made great swords. They were called by their people those who rose to call themselves "Myclar Wires | Once upon a time in a land far far away, there was a king who had lived a long time. He didn't speak to the girl, but she used to keep tabs on him by eavesdropping on his conversations, as she figured he had liked her before. The queen had warned her about such an approach, but the girl had made up her | Once upon a time in a land far far away, there lived a man, called Hynek. He was a skilled wielder of swords. I do not recall the source of his knowledge or skills, though. He was brilliant. Amongst his skills, he had mastered flying. It was at Hynek's wedding night that Hynek |
| 0.8    | Once upon a time in a land far far away, there lived a proud and respected wizard and ruler who was dedicated to a simple, albeit dangerous, goal. One day, he was in an unlikely position to succeed as king of his kingdom and this new ruler found himself challenged with a small and troublesome militia of peasant warriors. The young King had already made a name | Once upon a time in a land far far away, I was a big boy. All alone and lost in space, I built my home out of junk I picked up when I was a kid. I was unable to tell my mom about it until she called me from the ship. I was scared. The first night she came, I went on tour with her | Once upon a time in a land far far away, here is where most things begin. Where there was never a coast or an ocean and seas were always wide, the city was formed. The warrior guild was established in this city to explore and even expand beyond their own lands. Its members would continue to craft the tools of war for decades to come, constantly hoping |
| 0.7    | Once upon a time in a land far far away, the distant moon is fully revealed to you, and with it, the enormous gate. The gate of the gate is locked and locked! The gate of the gate is still locked and locked! The gate of the gate is finally unlocked, and then you see the light of dawn, | Once upon a time in a land far far away, there was a race between a wizard and a human. The wizard won. Now that's a lot of magical chivalry. Sincerely, The Commander of the Iron Dragon One letter ago, an alien society attempted to destroy the human race. This is how you would call | Once upon a time in a land far far away, there lived a ruler called Anomen and his court. The king was a powerful warrior, a true warrior. He was of noble blood and, despite his great power, was often in need of help. One day he called upon a well-known scholar of his age, Alanna the Wise, to guide him through |

- For p = 0.97 and 0.95, outputs are highly creative but often incoherent with  sudden shifts in themes.  
- At p = 0.9, there’s a good balance between creativity and coherence, producing engaging yet structured outputs.  
- For p = 0.8 and 0.7, outputs are more predictable and structured, sacrificing creativity for consistency.  

## 2.4 More variations on decoding algorithms (10 pts)

Nucleus sampling was definitely not the end of the road in terms of new decoding algorithms. Even in the past few years, new decoding algorithms have been proposed to address some limitations of existing algorithms.

Two in particular are:
- Typical Sampling ([Meister et al. (2022)](https://arxiv.org/abs/2202.00666))
- Eta Sampling ([Hewitt et al. (2022)](https://arxiv.org/abs/2210.15191))

For this question, CHOOSE ONE of the two algorithms presented above. Below, please describe in a few sentences what your chosen algorithm does in a novel way and the broad motivation behind it. Along with this description, present 3 sampled outputs for the same prompt as above (you can use one hyperparameter value for all of these).



Typical Sampling:

Novelty:  

*   Typical Sampling adjusts the sampling threshold dynamically, focusing on tokens with high probability but also allowing diversity.
*   The key idea is to adjust the sampling range based on the probability of the tokens, ensuring that tokens from the "typical" region of the distribution are more likely to be selected.
*   This prevents the model from choosing outliers too frequently while maintaining coherence.



Motivation:

*   It aims to generate more meaningful and diverse text by focusing on the "typical" region of the token distribution.



In [62]:
#TODO
for i in range(3):
    print('Output for Typical Sampling', i, ':', GPT2LM.generate(prompt, typical_threshold=0.5, max_len=50))

Top 10 tokens for Typical Sampling ['there', 'a', 'the', 'in', '', 'I', 'an', 'when', 'two', 'you']
Output for Typical Sampling 0 : Once upon a time in a land far far away, teeming with strange vitality... " (Marvel Comics Single-Page Comic - Fall 1994/Spring/Summer 1995 10th Anniversary Issue) [14][15]

Tom LaPille and Roz Alanis (According to "Joe Quesada
Top 10 tokens for Typical Sampling ['there', 'a', 'the', 'in', '', 'I', 'an', 'when', 'two', 'you']
Output for Typical Sampling 1 : Once upon a time in a land far far away, the gods greeted the People of the West and the Elves. The Peoples thanked them, saying 'As they do for all things of ours, so shall angels enable the Promised Land to triumph over these woes!' Nevertheless there appeared devils on the Plain.
Top 10 tokens for Typical Sampling ['there', 'a', 'the', 'in', '', 'I', 'an', 'when', 'two', 'you']
Output for Typical Sampling 2 : Once upon a time in a land far far away, there lived a powerful town, in whose peaks we set 

| **Decoding Method**       | **Output 1**                                                                                                                                  | **Output 2**                                                                                                                                                      | **Output 3**                                                                                                                                                      |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Typical Sampling**       | Once upon a time in a land far far away, teeming with strange vitality... " (Marvel Comics Single-Page Comic - Fall 1994/Spring/Summer 1995 10th Anniversary Issue) [14][15]  | Once upon a time in a land far far away, the gods greeted the People of the West and the Elves. The Peoples thanked them, saying 'As they do for all things of ours, so shall angels enable the Promised Land to triumph over these woes!' Nevertheless there appeared devils on the Plain. | Once upon a time in a land far far away, there lived a powerful town, in whose peaks we set our tents. There lived Kinan Datin [sic] a folk-tale. Two youths crept up in the night sing [that] the King had held for three nights in the hidden town |

## 2.5 BONUS (Up to 15 pts)

Can you find a prompt where the continuations do not differ much across multiple sampling strategies, even when we use high temperatures or high p values? (Hint: Think about overfitting)

In [67]:
new_prompt = 'dont judge a book'

print(GPT2LM.generate(new_prompt, max_len = 3))
print(GPT2LM.generate(new_prompt, temperature=0.7,max_len = 3))
print(GPT2LM.generate(new_prompt, temperature=0.95, max_len = 3))
print(GPT2LM.generate(new_prompt, p=0.7, max_len = 3))
print(GPT2LM.generate(new_prompt, p=0.9, max_len = 3))

Top 10 tokens for temperature sampling: [' by', ' until', ' for', ' based', ' on', ',', ' that', '.', ' before', ' just']
dont judge a book of religion you
Top 10 tokens for temperature sampling: [' by', ' until', ' for', ' based', ' on', ',', ' that', '.', ' before', ' just']
dont judge a book by its cover
Top 10 tokens for temperature sampling: [' by', ' until', ' for', ' based', ' on', ',', ' that', '.', ' before', ' just']
dont judge a book by its cover
Top 10 tokens for Nucleus Sampling: ['!', '%', ')', '*', '(', '$', '"', '&', '#', "'"]
dont judge a book by its cover
Top 10 tokens for Nucleus Sampling: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
dont judge a book by its cover


# 3. Prompting (50 pts)

In this problem, we will try various prompting approaches and prompt an LLM for a Math Reasoning Benchmark called [GSM8K](https://github.com/openai/grade-school-math), which contains grade school math word problems. This is a very common _reasoning_ benchmark used to test various LLMs.

The LLM that we will be using is [Google Gemini](https://gemini.google.com/). We will be prompting Gemini by using an API call to the Gemini Model. Normally, you can also prompt Open Source LLMs via the HuggingFace Library, however due to compute constraints, we use Gemini in this problem.

## Setting up the GSM8K Dataset and Google Gemini

Follow the steps below to download the GSM8K Dataset and to setup Google Gemini on Colab. You will automatically get points for this subpr

In [10]:
from datasets import load_dataset

dataset = load_dataset("gsm8k", 'main')

In [11]:
len(dataset['train']), len(dataset['test'])

(7473, 1319)

In [12]:
# An example instance of this dataset

dataset['test'][6]

{'question': 'Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many sheep as Seattle. How many sheep do Toulouse, Charleston, and Seattle have together if Seattle has 20 sheep?',
 'answer': 'If Seattle has 20 sheep, Charleston has 4 * 20 sheep = <<20*4=80>>80 sheep\nToulouse has twice as many sheep as Charleston, which is 2 * 80 sheep = <<2*80=160>>160 sheep\nTogether, the three has 20 sheep + 160 sheep + 80 sheep = <<20+160+80=260>>260 sheep\n#### 260'}

In [13]:
dataset['train'][8]

{'question': 'Alexis is applying for a new job and bought a new set of business clothes to wear to the interview. She went to a department store with a budget of $200 and spent $30 on a button-up shirt, $46 on suit pants, $38 on a suit coat, $11 on socks, and $18 on a belt. She also purchased a pair of shoes, but lost the receipt for them. She has $16 left from her budget. How much did Alexis pay for the shoes?',
 'answer': 'Let S be the amount Alexis paid for the shoes.\nShe spent S + 30 + 46 + 38 + 11 + 18 = S + <<+30+46+38+11+18=143>>143.\nShe used all but $16 of her budget, so S + 143 = 200 - 16 = 184.\nThus, Alexis paid S = 184 - 143 = $<<184-143=41>>41 for the shoes.\n#### 41'}

### Gemini Setup (from the official [Gemini documentation](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemini-api/docs/get-started/python.ipynb))


Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

<a class="button button-primary" href="https://makersuite.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel.

---

Give it the name `GEMINI_API_KEY`.

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GEMINI_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`

In [14]:
# All imports for this question
from google.colab import userdata
import google.generativeai as genai
from datasets import Dataset
import random
import numpy as np
from typing import Callable, List, Any

In [15]:
GOOGLE_API_KEY = userdata.get("GEMINI_API_KEY")

genai.configure(api_key=GOOGLE_API_KEY)

In [16]:
# Test if your setup is working, do not change the model name
model = genai.GenerativeModel("gemini-1.0-pro")
response = model.generate_content("What is Natural Language Processing? Explain it to a five year old.")
print(response.text)

Imagine you have a very special friend called a "computer" who can understand what you say and write, just like your mom or dad. This special friend is always learning new words and trying to figure out what you mean by what you say. It's called Natural Language Processing, which is like a secret code that helps computers understand us humans!


## 3.1 Data and Prompting Setup (15 + 5 pts)

In this part, we will create some boilerplate code to process our dataset and generate prompts from the dataset.



### Processing the GSM8K Dataset

In [17]:
def process_gsm8k_answers(dataset: Dataset) -> Dataset:
    """
    Processes the GSM8K dataset to remove reasoning chains and retain only the numerical answers.
    Assumes answers are separated from reasoning by the '###' string.

    Args:
    dataset (Dataset): Huggingface Dataset object for GSM8K.

    Returns:
    Dataset: Processed Dataset object with numerical answers only.
    """

    def extract_answer(sample):
        # IMPLEMENT HERE
        # Split the answer using '###' and return a dictionary with the key 'processed_answer'

        return {'processed_answer': sample['answer'].split('####')[-1].strip()}

    return dataset.map(extract_answer)

### Building Prompts (15 pts)

We will be implementing FIVE (5) prompting methods. See their descriptions below -
1. **Zero-Shot Answer Only (2 pts)**: You prompt the model to only generate the answer to the question

2. **Zero-Shot Chain of Thought (CoT) (3 pts)**: Refer to the [Chain of Thought Paper](https://arxiv.org/abs/2201.11903). CoT refers to a reasoning chain that is generated by the model before generating the actual answer. This has shown to improve performance. In this setup, you will prompt the model to generate a reasoning chain before the answer.

3. **5-Shot Answer Only (2 pts)**: You provide some in-context examples to prompt the model with to generate the answer. This is analogous to Approach 1. Use the a random set of 5 examples from the training set to create the in-context examples.

4. **5-Shot CoT (3 pts)**: Combine Approaches 2 and 3 to do 5-shot CoT prompting.

5. **Your own prompt! (5 pts)**: Try something new. Think about how you solve Math problems and implement your own prompting method.

In [18]:
def prompt_generation_zero_shot(problem: str) -> str:
    """
    Zero-shot prompt.

    Returns:
    str: The generated prompt.
    """
    # IMPLEMENT HERE
    return f"Question: {problem}\nAnswer:"

In [19]:
def prompt_generation_zero_shot_cot(problem: str) -> str:
    """
    Zero-shot Chain of Thought (CoT) prompt.

    Returns:
    str: The generated prompt.
    """
    # IMPLEMENT HERE
    return f"Question: {problem}\nLet's think step by step:\nAnswer:"

In [20]:
def prompt_generation_5_shot(problem: str, training_set: Dataset) -> str:
    """
    5-shot prompt generation for GSM8K problems. Randomly selects 5 examples from the training set.

    Returns:
    str: The generated prompt with 5 in-context_examples.
    """
    # IMPLEMENT HERE
    examples = np.random.choice(list(training_set), 5, replace=False)


    prompt = ""
    for example in examples:
        prompt += f"Example Question: {example['question']}\Solution: {example['answer']}\n\n"

    prompt += f"Question: {problem}\nAnswer:"
    return prompt

In [21]:
def prompt_generation_5_shot_cot(problem: str, training_set: Dataset) -> str:
    """
    5-shot Chain of Thought (CoT) prompt generation. Randomly selects 5 examples
    from the training set and includes reasoning steps.

    Returns:
    str: The generated prompt with 5 CoT in-context examples.
    """
    # IMPLEMENT HERE
    examples = np.random.choice(list(training_set), 5, replace=False)

    prompt = ""
    for example in examples:
        prompt += f"Example Question: {example['question']}\Let's think step by step:\n{example['answer']} Answer: {example['processed_answer']}\n\n"

    prompt += f"Question: {problem}\nLet's think step by step:\nAnswer:"
    return prompt

In [22]:
import random
from datasets import Dataset

# # Feel free to change the method definition

def my_prompt(problem: str, training_set: Dataset) -> str:
    """
    Your own unique way of prompting an LLM for Math word problems.

    Returns:
    str: The generated prompt
    """
    examples = np.random.choice(list(training_set), 5, replace=False)

    prompt = ""
    for example in examples:
        prompt += f"Example Question: {example['question']}\Reason and Solution:\n{example['answer']}\nAnswer: {example['processed_answer']}\n\n"

    prompt += f"Problem: {problem}\nLet's solve this problem step-by-step to ensure accuracy:\n"
    prompt += "1. Identify the relevant formula.\n"
    prompt += "2. Substitute known values into the formula.\n"
    prompt += "3. Perform the necessary calculations.\n"
    prompt += "4. Double-check the result for accuracy.\nAnswer:"

    return prompt


## 3.2 Prompting Gemini and Implementing Self-Consistency (5 + 5 + 10 pts)

Here, you will help build the wrapper for prompting Gemini using the prompt methods you have designed above.

You will then also implement Self-Consistency based prompting. Refer to the [Self-Consistency Paper](https://arxiv.org/abs/2203.11171). In order to implement Self-Consistency, you generate multiple Zero-Shot CoT (Approach 2 in the prompting methods) candidates, and take a majority vote of the answers predicted by each candidate.

### First, write the function where you will process the answer generated by the model. (5 pts)

Note that answer processing changes for different prompt types, so this function also takes in the name of the method in its argument.

In [23]:
import re
from typing import Any, List

def answer_processing(prediction: str, prompt_function: Any) -> str:
    """
    Processes the model's generated output to extract the final answer.

    Returns:
    str: The processed numerical answer.
    """
    prediction = prediction.replace(',', '')
    answer = prediction.strip().split("Answer:")[-1].strip()

    try:
        answer = round(float(re.findall(r"[-+]?\d*\.?\d+", answer)[-1]))
    except:
        answer = 0

    return str(answer)

In [24]:
# Do not change, method to calculate accuracy from predictions and ground truth labels

def evaluate_accuracy(predictions: List[str], ground_truths: List[str]) -> float:
    correct = 0
    total = len(predictions)

    for pred, true in zip(predictions, ground_truths):
        if pred == true:
            correct += 1

    accuracy = correct / total
    return accuracy * 100

### Next, write the wrapper function where you use all the building blocks constructed above to prompt the Gemini model (5 + 10 pts)


On how to prompt Gemini, refer to the [Gemini Text Generation Handbook](https://ai.google.dev/gemini-api/docs/text-generation?lang=python).

Hint: Reading this will help you figure out how to generate multiple candidates to implement Self-Consistency.

In [32]:
from typing import Any, Callable, List
from datasets import Dataset
from collections import Counter
import time

def pipeline_generate(
    model_instance: Any,
    training_set: Dataset,
    test_set: Dataset,
    prompt_function: Callable[[str], str],
    process_answer_function: Callable[[str, Callable], str],
    evaluation_function: Callable[[List[str], List[str]], float],
    self_consistency: int,
) -> float:
    """
    Args:
    model_instance (Any): The Google Gemini model instance.
    test_set (Dataset): The GSM8K test set to evaluate on.
    prompt_function (Callable): Function to generate prompts for the test set.
    process_answer_function (Callable): Function to process the model's generated answers.
    evaluation_function (Callable): Function to evaluate model's answers against the ground truth.
    self_consistency: Number of samples to run self-consistency approach on.
    If negative, 0 or 1, this implies regular prompting

    Returns:
    float: The accuracy of the model on the test set.
    """
    predictions = []
    ground_truths = test_set['processed_answer']

    prompts_with_examples = ['prompt_generation_5_shot', 'prompt_generation_5_shot_cot', 'my_prompt']

    for problem in test_set['question']:

        if prompt_function.__name__ in prompts_with_examples:
            prompt = prompt_function(problem, training_set)
        else:
            prompt = prompt_function(problem)

        if self_consistency > 1:
            answers = []
            for i in range(self_consistency):
                response = model_instance.generate_content(prompt).text
                processed_answer = process_answer_function(response, prompt_function)
                answers.append(processed_answer)
                time.sleep(5)
            answer_counts = Counter(answers)
            final_answer = answer_counts.most_common(1)[0][0]

        else:
            response = model_instance.generate_content(prompt).text
            final_answer = process_answer_function(response, prompt_function)

        predictions.append(final_answer)
        time.sleep(5)

    accuracy = evaluation_function(predictions, ground_truths)
    return accuracy

In [82]:
gsm8k_train_processed = process_gsm8k_answers(dataset['train'])
gsm8k_test_processed = process_gsm8k_answers(dataset['test'])



# The following line is just to test your systems, comment this line out to report results on the entire test set in 3.3
gsm8k_train_processed = Dataset.from_dict(gsm8k_train_processed[:1000])
gsm8k_test_processed = Dataset.from_dict(gsm8k_test_processed[:50])

accuracy = pipeline_generate(
    model_instance=model,
    training_set = gsm8k_train_processed,
    test_set=gsm8k_test_processed,
    prompt_function=prompt_generation_zero_shot,
    process_answer_function=answer_processing,
    evaluation_function=evaluate_accuracy,
    self_consistency=1,
)

print(f"Accuracy: {accuracy}%")

Accuracy: 44.0%


In [83]:
gsm8k_train_processed = process_gsm8k_answers(dataset['train'])
gsm8k_test_processed = process_gsm8k_answers(dataset['test'])


# The following line is just to test your systems, comment this line out to report results on the entire test set in 3.3
gsm8k_train_processed = Dataset.from_dict(gsm8k_train_processed[:1000])
gsm8k_test_processed = Dataset.from_dict(gsm8k_test_processed[:50])

accuracy = pipeline_generate(
    model_instance=model,
    training_set = gsm8k_train_processed,
    test_set=gsm8k_test_processed,
    prompt_function=prompt_generation_zero_shot_cot,
    process_answer_function=answer_processing,
    evaluation_function=evaluate_accuracy,
    self_consistency=1,
)

print(f"Accuracy: {accuracy}%")

Accuracy: 70.0%


In [23]:
gsm8k_train_processed = process_gsm8k_answers(dataset['train'])
gsm8k_test_processed = process_gsm8k_answers(dataset['test'])


# The following line is just to test your systems, comment this line out to report results on the entire test set in 3.3
gsm8k_train_processed = Dataset.from_dict(gsm8k_train_processed[:1000])
gsm8k_test_processed = Dataset.from_dict(gsm8k_test_processed[:50])

accuracy = pipeline_generate(
    model_instance=model,
    training_set = gsm8k_train_processed,
    test_set=gsm8k_test_processed,
    prompt_function=prompt_generation_5_shot,
    process_answer_function=answer_processing,
    evaluation_function=evaluate_accuracy,
    self_consistency=1,
)

print(f"Accuracy: {accuracy}%")

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Map:   0%|          | 0/1319 [00:00<?, ? examples/s]

Accuracy: 70.0%


In [85]:
gsm8k_train_processed = process_gsm8k_answers(dataset['train'])
gsm8k_test_processed = process_gsm8k_answers(dataset['test'])


# The following line is just to test your systems, comment this line out to report results on the entire test set in 3.3
gsm8k_train_processed = Dataset.from_dict(gsm8k_train_processed[:1000])
gsm8k_test_processed = Dataset.from_dict(gsm8k_test_processed[:50])

accuracy = pipeline_generate(
    model_instance=model,
    training_set = gsm8k_train_processed,
    test_set=gsm8k_test_processed,
    prompt_function=prompt_generation_5_shot_cot,
    process_answer_function=answer_processing,
    evaluation_function=evaluate_accuracy,
    self_consistency=1,
)

print(f"Accuracy: {accuracy}%")

Accuracy: 72.0%


In [27]:
gsm8k_train_processed = process_gsm8k_answers(dataset['train'])
gsm8k_test_processed = process_gsm8k_answers(dataset['test'])


# The following line is just to test your systems, comment this line out to report results on the entire test set in 3.3
gsm8k_train_processed = Dataset.from_dict(gsm8k_train_processed[:1000])
gsm8k_test_processed = Dataset.from_dict(gsm8k_test_processed[:50])

accuracy = pipeline_generate(
    model_instance=model,
    training_set = gsm8k_train_processed,
    test_set=gsm8k_test_processed,
    prompt_function=my_prompt,
    process_answer_function=answer_processing,
    evaluation_function=evaluate_accuracy,
    self_consistency=1,
)

print(f"Accuracy: {accuracy}%")

Accuracy: 70.0%


In [29]:
gsm8k_train_processed = process_gsm8k_answers(dataset['train'])
gsm8k_test_processed = process_gsm8k_answers(dataset['test'])


# The following line is just to test your systems, comment this line out to report results on the entire test set in 3.3
gsm8k_train_processed = Dataset.from_dict(gsm8k_train_processed[:1000])
gsm8k_test_processed = Dataset.from_dict(gsm8k_test_processed[:50])

accuracy = pipeline_generate(
    model_instance=model,
    training_set = gsm8k_train_processed,
    test_set=gsm8k_test_processed,
    prompt_function=prompt_generation_zero_shot,
    process_answer_function=answer_processing,
    evaluation_function=evaluate_accuracy,
    self_consistency=5,
)

print(f"Accuracy: {accuracy}%")

Accuracy: 50.0%


## 3.3 Complete this table based on your implementation in 3.2 and answer the following questions (5 + 5 pts)

### Round each value up to two decimal points (5 pts)

Method|Accuracy
---|---|
0-shot| 44%
0-shot CoT| 70%
5-shot| 70%
5-shot CoT| 72%
My prompt| 70%
0-shot CoT Self-Consistency| 50%

### What was the intuition behind the prompt that you designed? (2 pts)

The prompt is designed to enhance accuracy by:

- Focusing on Formula Identification: By asking the model to state the formula, it ensures that the correct method is chosen before solving, leading to more accurate answers.

- Adding a Double-Check Step: The prompt instructs the model to verify its solution, which encourages careful calculation and helps catch errors, further improving reliability.

### What are the merits and demerits of using advanced prompting approaches like Chain of Thought or Self-Consistency? (3 pts)

Merits:

- Higher Accuracy: Techniques like CoT improve problem-solving accuracy by guiding the model through reasoning steps.
- Consistency: Self-Consistency uses multiple responses to reduce random errors and improve reliability.
- Interpretability: Advanced prompts make the model’s reasoning clearer, which is helpful for step-by-step explanations.

Demerits:

- Costly: Longer prompts may hit token limits and increase request costs.
- Higher Computation Cost: More computation due to multiple runs or lengthy prompts.
- Prompt Dependency: Models may overfit to complex prompts, reducing adaptability.