# Homework 2: Prompting & Generation with LMs (50 points)

The second homework focuses on two skills: gaining a deeper understanding of different state-of-the-art prompting techniques and training your critical conceptual thinking regarding research on LMs.

### Logistics

* submission deadline: June 2nd, 23:59 German time via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW2.ipynb** containing your solutions to the homework.
* please solve and submit the homework **individually**!
* if you use Colab, to speed up the execution of the code on Colab, you can use the available GPU (if Colab resources allow). For that, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.


## Exercise 1: Advanced prompting strategies (16 points)

The lecture discussed various sophisticated ways of prompting language models for generating texts. Please answer the following questions about prompting techniques in the context of different models, and write down your answers, briefly explaining them (max. 3 sentences). Feel free to actually implement some of the prompting strategies to play around with them and build your intuitions.

> Consider the following language models:
> * GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
>  
> Consider the following prompting / generation strategies:
> * beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.
>
> For each model, which strategies do you think work well, and why? Do you think there are particular tasks or contexts in which they work better than in others?

---

## GPT-2
GPT-2 is a relatively small base causal language model (124M parameters in its smallest variant), i.e., it is geared towards language generation rather than instruction following or reasoning. In my experience, GPT-2 does not work well with zero-shot or few-shot prompts of the form `Sentence1: ... Sentiment: ...`: instead of generating the correct label, it extends the prompt with new (sentence, sentiment) pairs. However, if the few-shot prompt is phrased as a natural-language completion task (e.g., `Consider this sentence "...", this has positive sentiment.`), GPT-2 often completes the prompt in the format of the examples, although the label is mostly wrong.

Since GPT-2 is a causal language model, decoding strategies that favor coherent and fluent text, such as beam search, work better than pure sampling or greedy decoding. It may work well for text completion tasks, such as finishing an email given its beginning, but it struggles with following instructions and with tasks that require complex reasoning.
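To build intuition for why beam search can find more probable continuations than greedy decoding, here is a minimal sketch on a toy model. The probability table `TOY_LM` and the helper names are hypothetical and only serve as an illustration; this is not GPT-2 and not the Hugging Face implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_LM = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.4, "dog": 0.3, "end": 0.3},
    ("a",): {"bird": 0.9, "end": 0.1},
}


def expand(prefix):
    """Return (token, log-prob) continuations of a prefix under the toy model."""
    return [(tok, math.log(p)) for tok, p in TOY_LM.get(prefix, {}).items()]


def beam_search(num_beams, steps=2):
    """Keep the `num_beams` highest-scoring prefixes after every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), score + logp)
            for prefix, score in beams
            for tok, logp in expand(prefix)
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


# A beam width of 1 is equivalent to greedy decoding.
greedy_seq, greedy_score = beam_search(num_beams=1)
beam_seq, beam_score = beam_search(num_beams=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy decoding commits to the locally best first token, while the wider beam keeps the second-best first token alive and ends up with a sequence of higher overall probability.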

---

## GPT-4

GPT-4 is a larger and more advanced model than GPT-2, and is more of an assistant than a pure language model. It works well with almost all of the prompting strategies, as it has most likely seen such instruction data during training/instruction tuning. It can also read large PDF files or research papers and summarize them, generate coherent code from minimal instructions, etc.

GPT-4 also excels in tasks that require complex reasoning, such as solving math problems, generating explanations, or answering multi-step questions.

---

## Vicuna

Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations, so it can naturally be expected to follow instruction-style prompts, both zero-shot and few-shot, even when they are not phrased as a completion task. I also prompted `Vicuna v1.5 (13B)` with the reasoning task given in the slides:
```
Do the numbers of letters in all words starting with a vowel from the following list sum up to 42?
Polina, Michael, eggplant, cheese, oyster, imagination, elucidation, induce

Let's think step by step. Correctness is very important here.
```
and it generated the following:
```
Let's count the number of letters in each word starting with a vowel:
Polina - 5 letters
Michael - 6 letters
eggplant - 8 letters
cheese - 5 letters
oyster - 6 letters
imagination - 11 letters
elucidation - 12 letters
induce - 5 letters

Adding up the number of letters in all the words starting with a vowel:
5 + 6 + 8 + 5 + 6 + 11 + 12 + 5 = 57
The sum of letters in all words starting with a vowel is 57, not 42. Therefore, the answer is no, the numbers of letters in all words starting with a vowel from the given list do not sum up to 42.
```
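which, albeit incorrect, shows that it attempts to reason through the problem step by step. The correct answer is yes: only eggplant, oyster, imagination, elucidation and induce start with a vowel, and their lengths sum to 8 + 6 + 11 + 11 + 6 = 42. Vicuna instead counted every word, miscounted several of them, and made an addition error on top (its listed numbers sum to 58, not 57). Even so, Vicuna is suitable for a variety of tasks that require instruction following, code generation, etc.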
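The reference answer for this reasoning task can be checked with a few lines of Python. This is only a verification sketch; the variable names are illustrative.

```python
words = ["Polina", "Michael", "eggplant", "cheese", "oyster",
         "imagination", "elucidation", "induce"]

# Keep only the words whose first letter is a vowel.
vowel_words = [w for w in words if w[0].lower() in "aeiou"]

# Count the letters of each remaining word and sum them.
lengths = {w: len(w) for w in vowel_words}
total = sum(lengths.values())

print(lengths)
print(total, total == 42)
```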

---

## Llama-2-7b-base

Llama-2-7b-base is again a pretrained causal language model and it is a larger model than GPT-2. Since it is not instruction tuned, it may have a hard time following zero shot or few shot prompts.

When few-shot prompted with the NLI task (`Sentence 1: ... Sentence 2: ... Label: `), instead of generating the label, it generated many more such examples, just like GPT-2. It also could not perform the logical reasoning task and generated the following:
```
Does the number of letters in all words starting with a vowel from the following list sum up to 42?
(The words are Polina, Michael, eggplant, cheese, oyster, imagination, elucidation, induce)
Yes, it does. (3 votes)
No, it doesn't. (1 vote)
I don't know. (1 vote)
Yes, it does. (1 vote)
No, it doesn't. (1 vote)
I don't know. (0 votes)
Re: A riddle
Post by jerry » Tue Jul 16, 2019 12:07 pm
The answer is "yes".
Post by A.J. » Tue Jul 16, 2019 12:10 pm
I'm not sure if this is a riddle or not.
Post by jerry » Tue Jul 16, 2019 1:15 pm
It is a riddle.
Post by A.J. » Tue Jul 16, 2019 1:20 pm
Post by jerry » Tue Jul 16, 2019 10:16 pm
Post by A.J. » Wed Jul 17, 2019 9:27 am
Post by jerry » Wed Jul 17, 2019 10:24 am
Post by A.J. » Wed Jul 17, 2019 11:26 am
Post by jerry » Wed Jul 17, 2019 12:23 pm
Post by A.J. » Wed Jul 17, 2019 12:28 pm
Post by jerry » Wed Jul 17, 2019 1:24 pm
Post by A.J. » Wed Jul 17, 2019 1:34 pm
Post by jerry » Wed Jul 17, 2019 10:07 pm
Post by A.J. » Thu Jul 18, 2019 9:16
```
which looks like a forum thread with a poll and probably mirrors text of that kind in the training data.

So Llama-2-7b-base might perform well in tasks that can be framed as continuing a given text, such as writing articles or completing chatbot responses.


---

## Exercise 2: Prompting for NLI & Multiple-choice QA (14 points)

In this exercise, you can let your creativity flow -- your task is to come up with prompts for language models such that they achieve maximal accuracy on the following example tasks. Feel free to take inspiration from the in-class examples of the sentiment classification task. Also feel free to play around with the decoding scheme and see how it interacts with the different prompts.

**TASK:**
> Use the code that was introduced in the Intro to HF sheet to load the model and generate predictions from it with your sample prompts.
>
> * Please provide your code.

The code is provided in the following cells.

---

> * Please report the **best prompt** that you found for each model and task (i.e., **NLI** and **multiple choice QA**), and the decoding scheme parameters that you used.

## For the NLI task
I used a few-shot prompt that first gives the task description and also adds `Let's think step by step` to nudge the model to reason logically. One full example prompt is given below.
```
I will give you two sentences, you need to tell me which category the pair of sentences belong to.
There are three categories: "Contradiction", "Neutral" and "Entailment".
If the second sentence contradicts in any way to the first sentence, you would say it is "Contradiction".
If no conclusion can be drawn, or the second sentence need not necessarily follow from the first one, or contradict the first one, you would say it is "Neutral".
If the second sentence logically follows from the first sentence, you would say it is "Entailment".
Also study the examples below.
Consider the two sentences: "A man, woman, and child enjoying themselves on a beach." and
"A family of three is at the mall shopping.". Let's think step by step. The family cannot be at the beach and at the mall at the same time, so the second sentence contradicts the first one and does not follow from the first one, and it is an example of "Contradiction" category.

Consider the two sentences: "A woman wearing all white and eating, walks next to a man holding a briefcase." and
"A married couple is walking next to each other.". Let's think step by step. The man and woman walking next to each other on road need not be related to each other, so the second sentence neither contradicts nor follows from the first one, and it is an example of "Neutral" category.

Consider the two sentences: "People waiting to get on a train or just getting off." and
"There are people just getting on a train.". Let's think step by step. As some people were waiting to get on the train, they would definitely get on the train, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category.

Consider the two sentences: "A person on a horse jumps over a broken down airplane." and
"A person is training his horse for a competition.". Let's think step by step.
```

I used sampling-based decoding (`do_sample=True`) with `temperature=0.4` and `max_new_tokens=75`, since the model was also required to output its reasoning.
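Because the model writes out its reasoning before naming a category, the predicted label has to be read off the generated text. A minimal sketch of how this could be automated is shown below; it assumes that `continuation` holds only the newly generated text (without the prompt), and the helper names `extract_nli_label` and `accuracy` are hypothetical, not part of the notebook.

```python
import re

LABELS = ("contradiction", "neutral", "entailment")
LABEL_PATTERN = re.compile('"(' + "|".join(LABELS) + ')"', flags=re.IGNORECASE)


def extract_nli_label(continuation):
    """Return the first quoted category in the continuation, or None."""
    # The model may go on to invent further examples; only look at the first one.
    first_example = continuation.split("Consider the two sentences:")[0]
    match = LABEL_PATTERN.search(first_example)
    return match.group(1).lower() if match else None


def accuracy(predictions, gold):
    """Fraction of predictions that equal the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


example = (
    " The horse jumping over the broken down airplane is not related to the "
    "person training his horse for a competition, so the second sentence does "
    'not contradict the first one, and it is an example of "Entailment" category.'
)
print(extract_nli_label(example))
```

Matching only the quoted category avoids false hits on words such as "contradict" inside the reasoning itself.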



## For the QA task
I again found few shot prompting with task description (`Please answer the following question`) to work reasonably well. One full example is given below.
```
Please answer the following question: What is a great place to lay in the sun? Choose one of these four options: in the basement, west, solar system, beach, beans. The correct answer is: beach.

Please answer the following question: What might be the result of a season of successful skiing? Choose one of these four options: finish line, broken bones, broken legs, chapped lips, healthy body. The correct answer is: healthy body.

Please answer the following question: She loved buying products, she was driven by her what to shop more than any practical needs? Choose one of these four options: desire, money, time, credit, spending money. The correct answer is: desire.

Please answer the following question: What might you feel after doing housework for hours? Choose one of these four options: anger, not boredom, stress, boredom, anxiety. The correct answer is: stress.

Please answer the following question: The president is the leader of what institution? Choose one of these four options: walmart, white house, country, corporation, government. The correct answer is:
```

The decoding parameters did not seem to have much impact here. I used sampling-based decoding (same as for the NLI task) but with `max_new_tokens=10`, since the model only needs to generate a few tokens to copy the correct answer from the given choices.

---
> * Please write a brief summary of your explorations, stating what you tried, what worked (better), why you think that is.

## NLI task
As mentioned earlier, I found the few-shot CoT prompt with a task description that explicitly states the 3 categories to work better than the other prompts for both models.

In my experiments (not included in this notebook for compactness), I observed that a few-shot prompt of the style `Sentence 1: ... Sentence 2: ... Label: ...` does not work: instead of generating the correct label, the model leaves it blank and starts generating new sentence pairs. This is likely because **Pythia was not instruction tuned**, so it cannot be expected to perform as well as chat assistants like ChatGPT, and it expects the task to be phrased as a **text completion** problem.

I also observed that when the 3 categories were not explicitly stated, the model often invented new categories (such as `Disagreement` or `Disjunction`). This effect was more pronounced at a high temperature (`0.9`).
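One plausible reason for the temperature effect is that a higher temperature flattens the next-token distribution, so unlikely tokens are sampled more often. The sketch below illustrates this with made-up logits; the token list and logit values are assumptions for illustration and were not measured on Pythia.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens.
tokens = ["Neutral", "Entailment", "Contradiction", "Disagreement"]
logits = [2.0, 1.5, 1.0, 0.0]

p_low = softmax_with_temperature(logits, 0.4)
p_high = softmax_with_temperature(logits, 0.9)

for name, probs in (("T=0.4", p_low), ("T=0.9", p_high)):
    print(name, {tok: round(p, 3) for tok, p in zip(tokens, probs)})
```

At the lower temperature the most likely token takes a larger share of the probability mass, and the out-of-vocabulary category becomes correspondingly rarer.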

The decoding scheme did not have much effect on the generated responses; all schemes I tried performed about equally on this task.


## QA task

I observed similar non-compliant behaviour when prompted with zero-shot prompts for the QA task, one example being:
```
Please answer the following question: The only baggage the woman checked was a drawstring bag, where was she heading with it? Choose one of these four options: garbage can, military, jewelry store, safe, airport. The correct answer is: the baggage was checked out.

The only baggage the woman checked was
```
so the model did not answer the question and instead started to generate a continuation. With the few-shot examples, however, the models generated one of the choices as the output.

Again, the decoding scheme seems to have only a minor effect here.
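Whether a generation "is one of the choices" can also be checked programmatically. The following is a minimal sketch, assuming `continuation` is the newly generated text without the prompt; `match_choice` is a hypothetical helper and not part of the notebook.

```python
def match_choice(continuation, choices):
    """Return the choice that the continuation starts with, or None."""
    text = continuation.strip().lower()
    # Try longer choices first so that a choice which is a prefix of another
    # one does not shadow it.
    for choice in sorted(choices, key=len, reverse=True):
        if text.startswith(choice.lower()):
            return choice
    return None


few_shot_output = " beach.\n\nPlease answer the following question:"
zero_shot_output = " the baggage was checked out."

print(match_choice(few_shot_output,
                   ["in the basement", "west", "solar system", "beach", "beans"]))
print(match_choice(zero_shot_output,
                   ["garbage can", "military", "jewelry store", "safe", "airport"]))
```

A return value of `None` marks non-compliant generations such as the zero-shot example above, which can then be counted as errors.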

---

* Models: Pythia-410m, Pythia-1.4b
* Tasks: please **test** the model on the following sentences and report the accuracy of the model with your best prompt and decoding configurations.
  * Natural language inference: the task is to classify whether two sentences form a **"contradiction"** or an **"entailment"**, or the relation is **"neutral"**. The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
    * A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
    * Children smiling and waving at camera. There are children present. entailment
    * A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
    * An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
    * High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction
  * Multiple choice QA: the task is to predict the correct answer option for the question, given the question and the options (like in the task of Ex. 3 of homework 1). The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
    * To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
    * The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
    * What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
    * Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- "get smart"
    * Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality

In [1]:
# install utilities in colab
# !pip install accelerate
# !pip install transformers
# !pip install datasets

In [2]:
# import relevant packages
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# define computational device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Device: {device}")
else:
    device = torch.device("cpu")
    print(f"Device: {device}")

# convenience function for nicer output
def pretty_print(s):
    print("Pretty printed output:\n" + 100 * '-')
    print(tokenizer.decode(s, skip_special_tokens=True))

Device: cuda


## Some data preprocessing steps

In [3]:
data1 = """A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
Children smiling and waving at camera. There are children present. entailment
A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction"""

dataset = [] # list (sentence1, sentence2, label) tuples
for line in data1.split("\n"):
    # print(line)
    # each line has the form "<premise>. <hypothesis>. <label>"
    premise, hypothesis, label = (part.strip() for part in line.split('.'))
    dataset.append((premise + '.', hypothesis + '.', label))
print(dataset[0])

data2 = """The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- "get smart"
Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality"""

dataset2 = [] # (question, (answer choices tuple), correct choice)
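for line in data2.split('\n'):
    # split on the first "?" only, then separate the choices from the gold answer
    question, choices_and_answer = line.split("?", 1)
    choices, answer = choices_and_answer.split("--")
    dataset2.append(
        (
            question.strip() + "?",
            choices.strip(),
            answer.strip().strip('"'),  # drop stray quotes, e.g. around "get smart"
        )
    )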
print(dataset2[0])

## some utility to see model sizes
def get_num_parameters(model: torch.nn.Module, count_nonzero_only=False) -> int:
    """
    calculate the total number of parameters of model
    :param count_nonzero_only: only count nonzero weights
    """
    num_counted_elements = 0
    for param in model.parameters():
        if count_nonzero_only:
            # .item() converts the 0-dim tensor to a Python int, matching the return type
            num_counted_elements += param.count_nonzero().item()
        else:
            num_counted_elements += param.numel()
    return num_counted_elements


def get_model_size(model: torch.nn.Module, data_width=32, count_nonzero_only=False) -> int:
    """
    calculate the model size in bits
    :param data_width: #bits per element
    :param count_nonzero_only: only count nonzero weights
    """
    return get_num_parameters(model, count_nonzero_only) * data_width

Byte = 8
KiB = 1024 * Byte
MiB = 1024 * KiB
GiB = 1024 * MiB

('A person on a horse jumps over a broken down airplane.', 'A person is training his horse for a competition.', 'neutral')
('The only baggage the woman checked was a drawstring bag, where was she heading with it?', '["garbage can", "military", "jewelry store", "safe", "airport"]', 'airport')


## Prompt preparatory code for the NLI task


In [4]:
single_inv = "'"
def prompt5_few_shot_cot(sentence1, sentence2):
    # examples taken from https://huggingface.co/datasets/stanfordnlp/snli
    few_shot_prompt="""Consider the two sentences: "A man, woman, and child enjoying themselves on a beach." and
"A family of three is at the mall shopping.". Let's think step by step. The family cannot be at the beach and at the mall at the same time, so the second sentence contradicts the first one and does not follow from the first one, and it is an example of "Contradiction" category.

Consider the two sentences: "A woman wearing all white and eating, walks next to a man holding a briefcase." and
"A married couple is walking next to each other.". Let's think step by step. The man and woman walking next to each other on road need not be related to each other, so the second sentence neither contradicts nor follows from the first one, and it is an example of "Neutral" category.

Consider the two sentences: "People waiting to get on a train or just getting off." and
"There are people just getting on a train.". Let's think step by step. As some people were waiting to get on the train, they would definitely get on the train, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category."""
    prompt = f'\n\nConsider the two sentences: "{sentence1}" and\n"{sentence2}". Let{single_inv}s think step by step.'
    return few_shot_prompt + prompt


def prompt5_few_shot_cot_with_rules(sentence1, sentence2):
    # examples taken from https://huggingface.co/datasets/stanfordnlp/snli
    rules="""I will give you two sentences, you need to tell me which category the pair of sentences belong to.
There are three categories: "Contradiction", "Neutral" and "Entailment".
If the second sentence contradicts in any way to the first sentence, you would say it is "Contradiction".
If no conclusion can be drawn, or the second sentence need not necessarily follow from the first one, or contradict the first one, you would say it is "Neutral".
If the second sentence logically follows from the first sentence, you would say it is "Entailment".
Also study the examples below.\n"""
    few_shot_prompt="""Consider the two sentences: "A man, woman, and child enjoying themselves on a beach." and
"A family of three is at the mall shopping.". Let's think step by step. The family cannot be at the beach and at the mall at the same time, so the second sentence contradicts the first one and does not follow from the first one, and it is an example of "Contradiction" category.

Consider the two sentences: "A woman wearing all white and eating, walks next to a man holding a briefcase." and
"A married couple is walking next to each other.". Let's think step by step. The man and woman walking next to each other on road need not be related to each other, so the second sentence neither contradicts nor follows from the first one, and it is an example of "Neutral" category.

Consider the two sentences: "People waiting to get on a train or just getting off." and
"There are people just getting on a train.". Let's think step by step. As some people were waiting to get on the train, they would definitely get on the train, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category."""
    prompt = f'\n\nConsider the two sentences: "{sentence1}" and\n"{sentence2}". Let{single_inv}s think step by step.'
    return rules + few_shot_prompt + prompt

def prepare_input(premise, hypothesis, prompt_function):
    prompt = prompt_function(premise, hypothesis)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    return input_ids

## Prompt preparatory code for the QA task

For the question-answering task, I again found that giving a few examples helps the model. I copied the 4 examples below from the CommonsenseQA dataset to build the few-shot prompt.

In [5]:
# some few shot examples for QA task
few_shot_examples = [
    (
        'What is a great place to lay in the sun?',
        '["in the basement", "west", "solar system", "beach", "beans"]',
        'beach'
    ),
    (
        'What might be the result of a season of successful skiing?',
        '["finish line", "broken bones", "broken legs", "chapped lips", "healthy body"]',
        "healthy body"
    ),
    (
        'She loved buying products, she was driven by her what to shop more than any practical needs?',
        '["desire", "money", "time", "credit", "spending money"]',
        'desire'
    ),

    (
        'What might you feel after doing housework for hours?',
        '["anger", "not boredom", "stress", "boredom", "anxiety"]',
        'stress'
    )
]


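import ast  # safe parsing of the stringified choice lists (instead of eval)


def qa_prompt_with_task_description(question, choices):
    # `choices` is a string such as '["a", "b", ...]'; parse it into a list first.
    # Note: the prompt says "four options" although five are listed; the wording is kept as it was run.
    # Unused alternative with lettered options:
    # options = ['A', 'B', 'C', 'D']
    # choices = list(zip(options, ast.literal_eval(choices)))
    # prompt = f"Please answer the following question: {question} Choose one of these four options: {' '.join(['. '.join(x) for x in choices])}. The correct answer is:"
    prompt = f"Please answer the following question: {question} Choose one of these four options: {', '.join(ast.literal_eval(choices))}. The correct answer is:"
    return prompt

def qa_prompt_few_shot(question, choices):
    # prepend the solved few-shot examples in the same format as the final question
    fewshotprompt = ""
    for (q, c, a) in few_shot_examples:
        fewshotprompt += f"Please answer the following question: {q} Choose one of these four options: {', '.join(ast.literal_eval(c))}. The correct answer is: {a}.\n\n"

    return fewshotprompt + qa_prompt_with_task_description(question, choices)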

def prepare_input_for_qa_task(question, choices, prompt_function):
    prompt = prompt_function(question, choices)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    return input_ids
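The two `prepare_input*` helpers return only `input_ids`, so `generate` has to guess the attention mask and pad token and warns about it. A minimal sketch of a variant that passes both explicitly is given below. The function name `generate_with_attention_mask` is hypothetical, and reusing the EOS token for padding is an assumption that mirrors what the library reports doing by default.

```python
def generate_with_attention_mask(model, tokenizer, prompt, device, **generate_kwargs):
    """Tokenize `prompt` and call `generate` with an explicit attention mask."""
    # The tokenizer returns both `input_ids` and `attention_mask`.
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    return model.generate(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        # Assumption: no dedicated pad token is configured, so reuse EOS.
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs,
    )
```

It could be called, for example, as `generate_with_attention_mask(model, tokenizer, prompt, device, max_new_tokens=10, do_sample=True, temperature=0.4)`.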

# Pythia-410M model

In [6]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-410m")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-410m",
    # trust_remote_code=True,
    # torch_dtype=torch.float16,
).to(device)



## Pythia-410M on NLI task

In [7]:
print("NLI Task")
print(f"model has #params = {get_num_parameters(model)/10**6:.2f} M")
print(f"model has size = {get_model_size(model)/MiB:.2f} MiB = {get_model_size(model)/GiB:.2f} GiB")

for idx, (s1, s2, gt) in enumerate(dataset):
    print(f"generating for i: {idx+1}/{len(dataset)}")
    prediction = model.generate(
        prepare_input(s1, s2, prompt5_few_shot_cot_with_rules),
        max_new_tokens=75,

        # sampling decoding
        do_sample=True,
        temperature=0.4,

        # num_beams=10,
        # early_stopping=True,

        # do_sample=True,
        # top_k=3,

        # do_sample=True,
        # top_p=0.90,
    )
    pretty_print(prediction[0])
    print(f"Correct label: {gt}")
    print('\n\n\n')

NLI Task
model has #params = 405.33 M
model has size = 1546.23 MiB = 1.51 GiB
generating for i: 1/6


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Pretty printed output:
----------------------------------------------------------------------------------------------------
I will give you two sentences, you need to tell me which category the pair of sentences belong to.
There are three categories: "Contradiction", "Neutral" and "Entailment".
If the second sentence contradicts in any way to the first sentence, you would say it is "Contradiction".
If no conclusion can be drawn, or the second sentence need not necessarily follow from the first one, or contradict the first one, you would say it is "Neutral".
If the second sentence logically follows from the first sentence, you would say it is "Entailment".
Also study the examples below.
Consider the two sentences: "A man, woman, and child enjoying themselves on a beach." and
"A family of three is at the mall shopping.". Let's think step by step. The family cannot be at the beach and at the mall at the same time, so the second sentence contradicts the first one and does not follow from t


## Results and accuracy calculation
For clarity, I reproduce only the model's responses to the tasks from the cell above:

- Consider the two sentences: "A person on a horse jumps over a broken down airplane." and
"A person is training his horse for a competition.". Let's think step by step. The horse jumping over the broken down airplane is not related to the person training his horse for a competition, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category. (**wrong**; correct label: neutral)
- Consider the two sentences: "A person on a horse jumps over a broken down airplane." and
"A person is outdoors, on a horse.". Let's think step by step. As some people were on a horse, they would definitely jump over the broken down airplane, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category. (**correct**)
- Consider the two sentences: "Children smiling and waving at camera." and
"There are children present.". Let's think step by step. As some children were present, they would definitely be present, so the second sentence logically follows from the first one and it is an example of "Entailment" category. (**correct**)
- Consider the two sentences: "A boy is jumping on skateboard in the middle of a red bridge." and
"The boy skates down the sidewalk.". Let's think step by step. The boy jumping on the skateboard is definitely not jumping on the sidewalk, so the second sentence does not contradict the first one, and it logically follows from the first one, and it is an example of "Contradiction" category. (**correct** label, but the reasoning argues the opposite)
- Consider the two sentences: "An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background." and
"An older man drinks his juice as he waits for his daughter to get off work.". Let's think step by step. The older man is probably waiting for his daughter to get off work, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Contradiction" category. (**wrong**; correct label: neutral)
- Consider the two sentences: "High fashion ladies wait outside a tram beside a crowd of people in the city." and
"The women do not care what clothes they wear.". Let's think step by step. The people in the crowd are not in high fashion, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category. (**wrong**; correct label: contradiction)

So Pythia-410M got $3/6$ correct (accuracy $50\%$) on the NLI task with the **few-shot CoT prompt with task description**.
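
The manual scoring above can also be sketched in code. This is only a minimal illustration, assuming the model names its final label in double quotes as it does in the responses listed above; `extract_label` and `accuracy` are hypothetical helper names that do not appear elsewhere in this notebook, and the strings are shortened tails of the six responses.

```python
import re

def extract_label(completion: str):
    """Return the last quoted NLI label in the completion (lowercased), or None."""
    matches = re.findall(
        r'"(Contradiction|Neutral|Entailment)"', completion, flags=re.IGNORECASE
    )
    return matches[-1].lower() if matches else None

def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Shortened endings of the six Pythia-410M responses listed above.
completions_410m = [
    'it is an example of "Entailment" category.',
    'it is an example of "Entailment" category.',
    'it is an example of "Entailment" category.',
    'it is an example of "Contradiction" category.',
    'it is an example of "Contradiction" category.',
    'it is an example of "Entailment" category.',
]
# Gold labels as stated in the annotations above.
gold_labels = [
    "neutral", "entailment", "entailment",
    "contradiction", "neutral", "contradiction",
]

predictions_410m = [extract_label(c) for c in completions_410m]
acc_410m = accuracy(predictions_410m, gold_labels)
print(f"accuracy = {acc_410m:.2%}")
```

Taking the last quoted label matters because the prompt itself contains all three label names; this sketch does not check whether the reasoning supports the label, which still needs a manual read.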

## Pythia-410M on question answering task

In [8]:
print(f"model has #params = {get_num_parameters(model)/10**6:.2f} M")
print(f"model has size = {get_model_size(model)/MiB:.2f} MiB = {get_model_size(model)/GiB:.2f} GiB")
for idx, (q, c, ans) in enumerate(dataset2):
    print(f"generating for i: {idx+1}/{len(dataset2)}")
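    prediction = model.generate(
        prepare_input(q, c, qa_prompt_few_shot),
        max_new_tokens=10,

        # sampling decoding with a low temperature
        do_sample=True,
        temperature=0.4,

        # alternative: beam search
        # num_beams=10,
        # early_stopping=True,

        # alternative: top-k sampling
        # do_sample=True,
        # top_k=3,

        # alternative: nucleus (top-p) sampling
        # do_sample=True,
        # top_p=0.90,
    )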
    pretty_print(prediction[0])
    print(f"Ground truth answer: {ans}")
    print('\n\n\n')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


model has #params = 405.33 M
model has size = 1546.23 MiB = 1.51 GiB
generating for i: 1/6


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Pretty printed output:
----------------------------------------------------------------------------------------------------
Please answer the following question: What is a great place to lay in the sun? Choose one of these four options: in the basement, west, solar system, beach, beans. The correct answer is: beach.

Please answer the following question: What might be the result of a season of successful skiing? Choose one of these four options: finish line, broken bones, broken legs, chapped lips, healthy body. The correct answer is: healthy body.

Please answer the following question: She loved buying products, she was driven by her what to shop more than any practical needs? Choose one of these four options: desire, money, time, credit, spending money. The correct answer is: desire.

Please answer the following question: What might you feel after doing housework for hours? Choose one of these four options: anger, not boredom, stress, boredom, anxiety. The correct answer is: stress.


## Results and accuracy calculation

For clarity, I reproduce only the model's responses to the questions:
- Please answer the following question: The only baggage the woman checked was a drawstring bag, where was she heading with it? Choose one of these four options: garbage can, military, jewelry store, safe, airport. The correct answer is: military. (**wrong**; correct answer: airport)
- Please answer the following question: To prevent any glare during the big football game he made sure to clean the dust of his what? Choose one of these four options: television, attic, corner, they cannot clean corner and library during football match they cannot need that, ground. The correct answer is: ground. (**wrong**; correct answer: television)
- Please answer the following question: The president is the leader of what institution? Choose one of these four options: walmart, white house, country, corporation, government. The correct answer is: white house. (**wrong**; correct answer: country)
- Please answer the following question: What kind of driving leads to accidents? Choose one of these four options: stressful, dangerous, fun, illegal, deadly. The correct answer is: dangerous. (**correct**)
- Please answer the following question: Can you name a good reason for attending school? Choose one of these four options: get smart, boredom, colds and flu, taking tests, spend time. The correct answer is: get smart. (**correct**)
- Please answer the following question: Stanley had a dream that was very vivid and scary. He had trouble telling it from what? Choose one of these four options: imagination, reality, dreamworker, nightmare, awake. The correct answer is: dreamworker. (**wrong**; correct answer: reality)

So on the QA task with the few-shot prompt, Pythia-410M gets $2/6$ correct, an accuracy of $33.3\%$.
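
The QA answers can be read off the generated text in a similar way. The sketch below assumes the model's answer is whatever follows the last occurrence of "The correct answer is:" up to the first period; `extract_answer` is a hypothetical helper that is not part of this notebook, and the example strings are abbreviated versions of the responses above.

```python
def extract_answer(generated: str, marker: str = "The correct answer is:"):
    """Return the text after the last marker, up to the first period, lowercased."""
    _, found, tail = generated.rpartition(marker)
    tail = tail.strip()
    if not found or not tail:
        return None
    # Keep only the first line and cut at the first period, because the model
    # may start a new "Please answer ..." block within its token budget.
    answer = tail.splitlines()[0].split(".")[0].strip().lower()
    return answer or None

# Abbreviated Pythia-410M generations (last prompt block only) and gold answers.
generations_410m = [
    "... safe, airport. The correct answer is: military.",
    "... they cannot need that, ground. The correct answer is: ground.",
    "... corporation, government. The correct answer is: white house.",
    "... fun, illegal, deadly. The correct answer is: dangerous.",
    "... taking tests, spend time. The correct answer is: get smart.",
    "... nightmare, awake. The correct answer is: dreamworker.",
]
gold_answers = ["airport", "television", "country", "dangerous", "get smart", "reality"]

answers_410m = [extract_answer(g) for g in generations_410m]
num_correct_410m = sum(a == g for a, g in zip(answers_410m, gold_answers))
print(f"{num_correct_410m}/{len(gold_answers)} correct")
```

Using the last marker is deliberate: the few-shot prompt already contains four solved examples with the same phrase, so an earlier match would return a demonstration answer instead of the model's own.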

# Pythia-1.4B model

In [9]:
# pythia was not instruction finetuned, so it might have a hard time following structured texts.
# try by making the prompt a paragraph and see if it follows.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/Pythia-1.4b",
    # trust_remote_code=True,
    # torch_dtype=torch.float16,
).to(device)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



## Pythia-1.4B on NLI task

In [10]:
print("NLI Task")
print(f"model has #params = {get_num_parameters(model)/10**6:.2f} M")
print(f"model has size = {get_model_size(model)/MiB:.2f} MiB = {get_model_size(model)/GiB:.2f} GiB")

for idx, (s1, s2, gt) in enumerate(dataset):
    print(f"generating for i: {idx+1}/{len(dataset)}")
    prediction = model.generate(
        prepare_input(s1, s2, prompt5_few_shot_cot_with_rules),
        max_new_tokens=75,

        # sampling decoding with a low temperature
        do_sample=True,
        temperature=0.4,

        # alternative: beam search
        # num_beams=10,
        # early_stopping=True,

        # alternative: top-k sampling
        # do_sample=True,
        # top_k=3,

        # alternative: nucleus (top-p) sampling
        # do_sample=True,
        # top_p=0.90,
    )
    pretty_print(prediction[0])
    print(f"Correct label: {gt}")
    print('\n\n\n')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


NLI Task
model has #params = 1414.65 M
model has size = 5396.45 MiB = 5.27 GiB
generating for i: 1/6


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Pretty printed output:
----------------------------------------------------------------------------------------------------
I will give you two sentences, you need to tell me which category the pair of sentences belong to.
There are three categories: "Contradiction", "Neutral" and "Entailment".
If the second sentence contradicts in any way to the first sentence, you would say it is "Contradiction".
If no conclusion can be drawn, or the second sentence need not necessarily follow from the first one, or contradict the first one, you would say it is "Neutral".
If the second sentence logically follows from the first sentence, you would say it is "Entailment".
Also study the examples below.
Consider the two sentences: "A man, woman, and child enjoying themselves on a beach." and
"A family of three is at the mall shopping.". Let's think step by step. The family cannot be at the beach and at the mall at the same time, so the second sentence contradicts the first one and does not follow from t

## Results and accuracy calculation

For clarity, I reproduce only the model's responses to the tasks from the cell above:

- Consider the two sentences: "A person on a horse jumps over a broken down airplane." and
"A person is training his horse for a competition.". Let's think step by step. The person on the horse jumping over the airplane is not the same as the person training the horse for the competition, so the second sentence neither contradicts nor follows from the first one, and it is an example of "Neutral" category. (**correct**)
- Consider the two sentences: "A person on a horse jumps over a broken down airplane." and
"A person is outdoors, on a horse.". Let's think step by step. The person on a horse jumping over the airplane is not necessarily related to the person outdoors, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category. (**correct**)
- Consider the two sentences: "Children smiling and waving at camera." and
"There are children present.". Let's think step by step. The children need not be present for the camera to be there, so the second sentence logically follows from the first one, and it is an example of "Entailment" category. (**correct**, though the reasoning is arguably incorrect)
- Consider the two sentences: "A boy is jumping on skateboard in the middle of a red bridge." and
"The boy skates down the sidewalk.". Let's think step by step. The boy jumping on the skateboard in the middle of the red bridge is not related to the boy skating down the sidewalk, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Neutral" category. (**wrong**; correct label: contradiction)
- Consider the two sentences: "An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background." and
"An older man drinks his juice as he waits for his daughter to get off work.". Let's think step by step. The coffee shop is not a place where people are waiting to get on a train, so the second sentence does not contradict the first one and it logically follows from the first one, and it is an example of "Entailment" category. (**wrong**; correct label: neutral)
- Consider the two sentences: "High fashion ladies wait outside a tram beside a crowd of people in the city." and
"The women do not care what clothes they wear.". Let's think step by step. The tram is not a place where people are waiting to get on the train, so the second sentence does not contradict the first one, and it logically follows from the first one, and it is an example of "Neutral" category. (**wrong**; correct label: contradiction)

So Pythia-1.4B also got $3/6$ correct (accuracy $50\%$) on the NLI task with the **few-shot CoT prompt with task description**.

## Pythia-1.4B on question answering task

In [11]:
print(f"model has #params = {get_num_parameters(model)/10**6:.2f} M")
print(f"model has size = {get_model_size(model)/MiB:.2f} MiB = {get_model_size(model)/GiB:.2f} GiB")

for idx, (q, c, ans) in enumerate(dataset2):
    print(f"generating for i: {idx+1}/{len(dataset2)}")
    prediction = model.generate(
        prepare_input(q, c, qa_prompt_few_shot),
        max_new_tokens=10,

        do_sample=True,
        temperature=0.4,
    )
    pretty_print(prediction[0])
    print(f"Ground truth answer: {ans}")
    print('\n\n\n')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


model has #params = 1414.65 M
model has size = 5396.45 MiB = 5.27 GiB
generating for i: 1/6


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Pretty printed output:
----------------------------------------------------------------------------------------------------
Please answer the following question: What is a great place to lay in the sun? Choose one of these four options: in the basement, west, solar system, beach, beans. The correct answer is: beach.

Please answer the following question: What might be the result of a season of successful skiing? Choose one of these four options: finish line, broken bones, broken legs, chapped lips, healthy body. The correct answer is: healthy body.

Please answer the following question: She loved buying products, she was driven by her what to shop more than any practical needs? Choose one of these four options: desire, money, time, credit, spending money. The correct answer is: desire.

Please answer the following question: What might you feel after doing housework for hours? Choose one of these four options: anger, not boredom, stress, boredom, anxiety. The correct answer is: stress.


## results and accuracy calculation

For clarity, I list only the model's responses to the test questions:
- Please answer the following question: The only baggage the woman checked was a drawstring bag, where was she heading with it? Choose one of these four options: garbage can, military, jewelry store, safe, airport. The correct answer is: jewelry store. (**wrong**; correct answer: airport)
- Please answer the following question: To prevent any glare during the big football game he made sure to clean the dust of his what? Choose one of these four options: television, attic, corner, they cannot clean corner and library during football match they cannot need that, ground. The correct answer is: ground. (**wrong**; correct answer: television)
- Please answer the following question: The president is the leader of what institution? Choose one of these four options: walmart, white house, country, corporation, government. The correct answer is: corporation. (**wrong**; correct answer: country)
- Please answer the following question: What kind of driving leads to accidents? Choose one of these four options: stressful, dangerous, fun, illegal, deadly. The correct answer is: dangerous. (**correct**)
- Please answer the following question: Can you name a good reason for attending school? Choose one of these four options: get smart, boredom, colds and flu, taking tests, spend time. The correct answer is: boredom. (**wrong**; correct answer: get smart)
- Please answer the following question: Stanley had a dream that was very vivid and scary. He had trouble telling it from what? Choose one of these four options: imagination, reality, dreamworker, nightmare, awake. The correct answer is: nightmare. (**wrong**; correct answer: reality)

So on the QA task with the few-shot prompt, Pythia-1.4B answers $1/6$ of the questions correctly, i.e., an accuracy of $16.67\%$.
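
The accuracy above can be reproduced with a few lines of Python. This is only a minimal sketch: the two lists are transcribed by hand from the six responses and the answers noted in parentheses, and the helper name `accuracy` as well as the exact-match criterion (after trimming and lowercasing) are my own assumptions, not part of the original evaluation code.

```python
# Model predictions and reference answers, transcribed from the list above.
predictions = ["jewelry store", "ground", "corporation",
               "dangerous", "boredom", "nightmare"]
references = ["airport", "television", "country",
              "dangerous", "get smart", "reality"]


def accuracy(preds, refs):
    """Fraction of predictions that match the reference exactly
    (compared after stripping whitespace and lowercasing)."""
    if len(preds) != len(refs):
        raise ValueError("predictions and references must have equal length")
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(preds, refs))
    return hits / len(refs)


acc = accuracy(predictions, references)
print(f"accuracy: {acc:.2%}")
```

Exact string matching is strict; a more lenient check (e.g., testing whether the reference appears anywhere in the generated continuation) would be needed if the model produced longer free-form answers.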

## Exercise 3: First neural LM (20 points)

Next to reading and understanding package documentation, a key skill for NLP researchers and practitioners is reading and critically assessing NLP literature. The density, but also the style of NLP literature has undergone a significant shift in recent years with the increasing acceleration of progress. Your task in this exercise is to read a paper about one of the first successful neural language models, understand its key architectural components and compare how these key components have evolved in modern systems that were discussed in the lecture.

> Specifically, please read this paper and answer the following questions: [Bengio et al. (2003)](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
>
> * How were words / tokens represented? What is the difference / similarity to modern LLMs?

Each unique word that appears in the training datasets (Brown corpus, AP News) is its own vocabulary entry, including punctuation and distinguishing upper and lower case. To reduce the vocabulary size, all rare words with frequency $\leq 3$ were merged into a single symbol, which gave a vocabulary of roughly $17,000$ words. Words were treated as atomic units: there was no subword tokenizer, which is a key data preprocessing step in modern LMs.

Modern LMs do not always store full words in their vocabulary; instead they store subword units (word roots, prefixes, suffixes, etc.). These are called _tokens_, and the vocabulary is typically larger, e.g., $50257$ tokens for GPT-2. With subword tokens, a new word that was not encountered during training can be split into already known subwords and fed into the network, whereas Bengio et al.'s model would have to extend its vocabulary in order to accommodate a new word.
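
To make the contrast concrete, here is a toy sketch. Both vocabularies are invented for illustration, and the greedy longest-match splitter is a deliberate simplification: GPT-2 actually uses byte-pair encoding, which applies learned merge rules rather than longest-match lookup. The function names `word_level` and `greedy_subwords` are hypothetical.

```python
# Toy vocabularies (made up for this example).
WORD_VOCAB = {"the", "model", "is", "<rare>"}
SUBWORD_VOCAB = {"the", "model", "is", "token", "iz", "ing"}


def word_level(words, vocab, rare_symbol="<rare>"):
    """Word-level lookup: every unknown word collapses into one rare symbol."""
    return [w if w in vocab else rare_symbol for w in words]


def greedy_subwords(word, vocab):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


sentence = ["the", "model", "is", "tokenizing"]
print(word_level(sentence, WORD_VOCAB))
print([greedy_subwords(w, SUBWORD_VOCAB) for w in sentence])
```

In the word-level case the unseen word `tokenizing` loses its identity completely, while the subword splitter still represents it through three known pieces.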


---
> * How was the context represented? What is the difference / similarity to modern LLMs?

In the Bengio et al. paper, the context is the sequence of the $n-1$ preceding words, which is used to predict the $n$-th word. The word embeddings of these $n-1$ words are simply concatenated to form the input to the hidden layer. They experimented with $n=5$ and $n=3$ in their _fixed context size_ MLP.
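
The concatenation step can be sketched in plain Python. The code follows the paper's formula $y = b + Wx + U\tanh(d + Hx)$ followed by a softmax, where $x$ is the concatenation of the context word embeddings from the lookup table $C$. The dimensions, the random initialization, and all function names are my own choices for illustration, and the order in which the context embeddings are concatenated is treated as a convention here.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(vocab_size, emb_dim, hidden_dim, n, seed=0):
    """Small random parameters; n is the model order (n - 1 context words)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    in_dim = (n - 1) * emb_dim
    return {
        "C": mat(vocab_size, emb_dim),   # embedding lookup table
        "H": mat(hidden_dim, in_dim),    # input -> hidden
        "d": [0.0] * hidden_dim,         # hidden bias
        "U": mat(vocab_size, hidden_dim),  # hidden -> output
        "W": mat(vocab_size, in_dim),    # direct input -> output connections
        "b": [0.0] * vocab_size,         # output bias
    }


def next_word_probs(context_ids, p):
    """Distribution over the vocabulary for the word following the context."""
    # Look up each context word in C and concatenate the embeddings into x.
    x = [value for word_id in context_ids for value in p["C"][word_id]]
    hidden = [math.tanh(h + d) for h, d in zip(matvec(p["H"], x), p["d"])]
    logits = [b + wx + uh for b, wx, uh in
              zip(p["b"], matvec(p["W"], x), matvec(p["U"], hidden))]
    return softmax(logits)


params = init_params(vocab_size=10, emb_dim=4, hidden_dim=8, n=5)
probs = next_word_probs([3, 1, 4, 1], params)  # n - 1 = 4 context word ids
print(len(probs), sum(probs))
```

Because `H` and `W` have a fixed input width of $(n-1) \cdot m$ (with $m$ the embedding size), the model can only ever consume exactly $n-1$ context words, which is what makes the context size fixed.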

In modern LLMs, the maximum context length is much larger, e.g., $1024$ tokens (rather than words) for GPT-2 and $4096$ tokens for Llama 2, and these models also see contexts of varying lengths during training. Modern LMs also **do not** simply concatenate the embeddings of all tokens in the input sequence into one fixed-size input; instead they use the attention mechanism to let tokens exchange information with each other, which creates richer, context-dependent intermediate token representations.
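
As a rough illustration of how attention mixes information across a variable-length context, the sketch below implements single-head causal scaled dot-product attention in plain Python. It is heavily simplified: queries, keys and values are all the raw input vectors (a real Transformer applies learned projection matrices and uses many heads and layers), and the function name `causal_self_attention` is my own.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(X):
    """X is a list of T token vectors of equal length d.

    Each output vector is a weighted average of the current and all
    earlier input vectors; later positions are never visible.
    """
    d = len(X[0])
    outputs = []
    for t, query in enumerate(X):
        # Scaled dot-product scores against positions 0..t only.
        scores = [sum(q * k for q, k in zip(query, X[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        outputs.append([sum(w * X[s][i] for s, w in enumerate(weights))
                        for i in range(d)])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_self_attention(tokens):
    print([round(v, 3) for v in row])
```

The same function works for any sequence length `T`, in contrast to a fixed-width concatenation of $n-1$ embeddings.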

---
> * What is the curse of dimensionality? Give a concrete example in the context of language modeling.

Consider the following domains of increasing dimension: the interval $[-1, 1]$ in $\mathbb{R}$, the square $[-1, 1]^2$ in $\mathbb{R}^2$, the cube $[-1, 1]^3$ in $\mathbb{R}^3$ and so on. The data samples in our dataset are drawn from such a domain.
The curse of dimensionality essentially says that to _fill_ the entire space at a fixed density, the number of data points required grows exponentially with the dimension. In other words, in high dimensions data is extremely sparse and we can always find an empty region of the domain with no observed data close by, so it is very difficult to generalize.

**Language modeling example:** Language models try to predict the conditional distribution of the next word/token given the context, i.e., an ordered sequence of words/tokens; this probability is written as $p(w_{n+1}|w_1, w_2, w_3, \dots, w_n)$. The curse of dimensionality kicks in when trying to capture longer contexts, i.e., large values of $n$. For a context size of $n$, fully specifying the conditional probability distribution requires specifying $p(w_{n+1}|w_1, \dots, w_n)$ for all possible prefixes $(w_1, w_2, w_3, \dots, w_n)$, and the number of such prefixes is $|V|^n$, where $|V|$ is the number of words/tokens in the vocabulary $V$. For large vocabularies (modern ones often exceed ~50k tokens), specifying the full conditional distribution is prohibitively expensive.

But in practice, not all combinations of words are likely as prefixes because of syntactic and semantic restrictions, so this distribution is very sparse, and a good approximation can be obtained by learning such constraints from data.
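
The growth of $|V|^n$ is easy to check numerically. The sketch below uses round numbers in the spirit of Bengio et al.'s setup (a vocabulary of about $17,000$ words and a corpus of about $16M$ words); both constants and the helper names are assumptions for illustration. The coverage bound relies on the fact that a corpus of $N$ words contains at most $N$ windows of any fixed length.

```python
def num_contexts(vocab_size: int, context_len: int) -> int:
    """Number of distinct length-n prefixes over a vocabulary: |V|^n."""
    return vocab_size ** context_len


def max_coverage(corpus_words: int, vocab_size: int, context_len: int) -> float:
    """Upper bound on the fraction of all prefixes a corpus can contain."""
    return min(1.0, corpus_words / num_contexts(vocab_size, context_len))


VOCAB_SIZE = 17_000        # assumed round vocabulary size
CORPUS_WORDS = 16_000_000  # assumed round corpus size

for n in (1, 2, 4):
    print(f"n={n}: {num_contexts(VOCAB_SIZE, n):.3e} prefixes, "
          f"coverage <= {max_coverage(CORPUS_WORDS, VOCAB_SIZE, n):.3e}")
```

Already for two-word prefixes the corpus can cover at most a few percent of all combinations, and the bound shrinks by another factor of $|V|$ with every additional context word.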

---

> * Which training data was used? What is the difference / similarity to modern LLMs?

The authors trained their neural language model on two separate text data sources:
1. the **Brown corpus** which contains over $1.18M$ words
2. and the **Associated Press (AP) News** (a collection of news reports) from 1995 and 1996 consisting of ~ $16M$ words.

Both data sources are small compared to the text sources used for current LLMs, which cover large parts of the public Internet. Some examples of modern data sources include BookCorpus ($800M$ words) and English Wikipedia ($2500M$ words), which were used to pretrain the BERT encoder, and the Colossal Clean Crawled Corpus (about $800$GB of text scraped from the internet, with billions of tokens or words) introduced in the T5 paper. So the **scale** or size of the training data is one obvious difference; in addition, Bengio et al. focused only on pretraining, while modern LLMs are often customized for various downstream tasks by supervised fine-tuning.

---

> * Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?
1. the idea of predicting a categorical distribution (over the vocabulary) for the next word from a context (sequence of words); this is still the main pretraining objective in modern decoder-type LMs
2. the word embedding lookup table (matrix $C$), which modern LLMs also use to convert raw tokens into embeddings
3. the direct connections (from the word features to the output), which resemble what are now popularly known as skip connections; these help with the vanishing gradient problem in very deep neural networks
4. the feedforward network (without the final softmax layer), which as a whole resembles the FFN block that follows the attention layer in the Transformer architecture
---

>
> * Please formulate one question about the paper (not the same as the questions above) and post it to the dedicated **Forum** space, and **answer 1 other question** about the paper.

My question: [Direct connections vs. skip connections](https://moodle.zdv.uni-tuebingen.de/mod/forum/discuss.php?d=14910)

The question I answered: [Data parallel processing](https://moodle.zdv.uni-tuebingen.de/mod/forum/discuss.php?d=14920)

---


Furthermore, your task is to carefully dissect the paper by Bengio et al. (2003) and analyse its structure and style in comparison to another more recent paper:  [Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)


**TASK:**

> For each section of the Bengio et al. (2003) paper, what are key differences between the way it is written, the included contents, to the BERT paper (Devlin et al., 2019)? What are key similarities? Write max. 2 sentences per section.


**Introduction**


Bengio et al. (2003) introduce a new MLP architecture for language modeling to tackle the curse of dimensionality typically encountered by the n-gram LMs of that time, and they concentrate on the language modeling task itself, i.e., predicting the next word from the context.

Devlin et al. (2019) focuses on applying language models to various downstream tasks and introduces the BERT architecture, which reduces the need for many task specific architectures as BERT only needs to be finetuned for a wide array of downstream tasks (e.g., NLI, question answering).

Both papers introduce a new architecture to tackle problems in natural language processing relevant to their time.


**Related Work**

Bengio et al. (2003) provided a detailed review of traditional language modeling techniques, such as n-gram models, information retrieval, and earlier papers that also used distributed representations for words, but they did not have a dedicated section for it and instead kept it as a subsection of the Introduction.

The BERT paper, being more recent, covered more ground in its "Related Work" section, such as advances in unsupervised pretraining of word embeddings (such as GloVe and ELMo) and fine-tuning approaches (such as OpenAI GPT, the first GPT model, at that time).

Both papers acknowledged the limitations of traditional language modeling approaches and positioned their contributions well in that context.


**Model Architecture (A Neural Model + Parallel Implementation in Bengio et al., BERT in Devlin et al.)**

Bengio et al. (2003) proposed a neural network architecture with a single hidden layer that uses distributed word feature vectors to obtain better generalization (lower perplexity) compared to n-gram models. This MLP model was trained using the **causal** language modeling objective, i.e., predicting the next word given only _past_ words, and they also devoted an entire section to the (parallel) implementation of their approach, as autograd frameworks (like PyTorch, JAX) did not exist at that time.

Devlin et al. (2019) introduced a multi-layer Transformer encoder architecture with a bidirectional attention mechanism. They used two different objectives to pretrain the model:
- masked language modeling, where the model can access both past and future tokens in order to predict the word at a masked position
- next sentence prediction, to learn better representations for sentence-pair tasks such as NLI and QA

So BERT is not a causal language model and is not designed to generate new text on its own.

Both papers focus on designing a model that captures patterns (syntax, semantics etc.) in language data by training on a word prediction objective (next word or masked word).


**Experiments / Experimental Results**

Bengio et al. (2003) evaluated the model on a single language modeling task, while Devlin et al. (2019) evaluated BERT on a wide range of NLP tasks, including question answering, sentiment analysis, and text classification. As noted before, their pre-training dataset sizes were also massively different (at most $16M$ words for Bengio et al., versus about $3300M$ words for BERT, i.e., BookCorpus plus English Wikipedia).

Both papers emphasize the importance of large-scale training data, model size, and computational resources.



**Extensions and Future Work / Ablation Studies**

Here Bengio et al. discussed different ways to improve on the MLP model, such as handling out-of-vocabulary words and tricks to reduce the computational complexity.

The BERT paper had a separate section after the experiments for ablation studies to validate the design choices (such as pretraining losses, model sizes etc.) that they made in the experiments.


**Conclusion**

Bengio et al. (2003) conclude by highlighting the potential of neural networks as another statistical model of language, while Devlin et al. (2019) conclude by demonstrating the effectiveness of pre-trained language models for a wide range of downstream NLP tasks, including low-resource ones.

Both papers highlight their major contribution and once more stress its potential applications in NLP.

---