# Homework 2: Prompting & Generation with LMs (50 points)

The second homework focuses on two skills: gaining a deeper understanding of different state-of-the-art prompting techniques, and training your critical conceptual thinking about research on LMs.

### Logistics

* submission deadline: June 2nd, 23:59 German time, via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW2.ipynb** containing your solutions to the homework.
* please solve and submit the homework **individually**! 
* if you use Colab, you can speed up the execution of your code by using the available GPU (if Colab resources allow). To do so, before executing your code, navigate to Runtime > Change runtime type > GPU > Save.


## Exercise 1: Advanced prompting strategies (16 points)

The lecture discussed various sophisticated ways of prompting language models to generate text. Please answer the following questions about prompting techniques in the context of different models, and write down your answers, briefly explaining them (max. 3 sentences). Feel free to actually implement some of the prompting strategies to play around with them and build your intuitions.

> Consider the following language models: 
> * GPT-2, GPT-4, Vicuna (an instruction-tuned version of Llama) and Llama-2-7b-base.
>  
> Consider the following prompting / generation strategies: 
> * beam search, tree-of-thought reasoning, zero-shot CoT prompting, few-shot CoT prompting, few-shot prompting.
> 
> For each model, which strategies do you think work well, and why? Do you think there are particular tasks or contexts in which they work better than in others?

## Exercise 2: Prompting for NLI & Multiple-choice QA (14 points)

In this exercise, you can let your creativity flow -- your task is to come up with prompts for language models such that they achieve maximal accuracy on the following example tasks. Feel free to take inspiration from the in-class examples of the sentiment classification task. Also feel free to play around with the decoding scheme and see how it interacts with the different prompts.

**TASK:**
> Use the code that was introduced in the Intro to HF sheet to load the model and generate predictions from it with your sample prompts.
> 
> * Please provide your code.
> * Please report the best prompt that you found for each model and task (i.e., NLI and multiple choice QA), and the decoding scheme parameters that you used. 
> * Please write a brief summary of your explorations, stating what you tried, what worked (better), and why you think that is.

* Models: Pythia-410m, Pythia-1.4b
* Tasks: please **test** the model on the following sentences and report the accuracy of the model with your best prompt and decoding configurations.
  * Natural language inference: the task is to classify whether two sentences form a "contradiction" or an "entailment", or the relation is "neutral". The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * A person on a horse jumps over a broken down airplane. A person is training his horse for a competition. neutral
    * A person on a horse jumps over a broken down airplane. A person is outdoors, on a horse. entailment
    * Children smiling and waving at camera. There are children present. entailment
    * A boy is jumping on skateboard in the middle of a red bridge. The boy skates down the sidewalk. contradiction
    * An older man sits with his orange juice at a small table in a coffee shop while employees in bright colored shirts smile in the background. An older man drinks his juice as he waits for his daughter to get off work. neutral
    * High fashion ladies wait outside a tram beside a crowd of people in the city. The women do not care what clothes they wear. contradiction
  * Multiple choice QA: the task is to predict the correct answer option for the question, given the question and the options (like in the task of Ex. 3 of homework 1). The gold labels are provided for reference here, but obviously shouldn't be given to the model at test time.
    * The only baggage the woman checked was a drawstring bag, where was she heading with it? ["garbage can", "military", "jewelry store", "safe", "airport"] -- airport
    * To prevent any glare during the big football game he made sure to clean the dust of his what? ["television", "attic", "corner", "they cannot clean corner and library during football match they cannot need that", "ground"] -- television
    * The president is the leader of what institution? ["walmart", "white house", "country", "corporation", "government"] -- country
    * What kind of driving leads to accidents? ["stressful", "dangerous", "fun", "illegal", "deadly"] -- dangerous
    * Can you name a good reason for attending school? ["get smart", "boredom", "colds and flu", "taking tests", "spend time"] -- get smart
    * Stanley had a dream that was very vivid and scary. He had trouble telling it from what? ["imagination", "reality", "dreamworker", "nightmare", "awake"] -- reality
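
One possible way to organize the evaluation is sketched below. It is not the code from the Intro to HF sheet: `build_nli_prompt`, `parse_label`, `accuracy`, and the stand-in `always_entailment` are hypothetical names, and the prompt format is only one of many you could try. The stand-in takes the place of the actual model call so that the sketch runs without downloading a model.

```python
from typing import Callable, List, Tuple

NLI_LABELS = ["entailment", "neutral", "contradiction"]

# Two of the NLI test items from the list above: (premise, hypothesis, gold label).
NLI_ITEMS: List[Tuple[str, str, str]] = [
    ("Children smiling and waving at camera.",
     "There are children present.",
     "entailment"),
    ("A boy is jumping on skateboard in the middle of a red bridge.",
     "The boy skates down the sidewalk.",
     "contradiction"),
]


def build_nli_prompt(premise: str, hypothesis: str) -> str:
    """Format one item as a prompt; the gold label is deliberately left out."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Relation (entailment, neutral, or contradiction):"
    )


def parse_label(generation: str) -> str:
    """Return the label that occurs earliest in the generated text, else 'unknown'."""
    text = generation.lower()
    found = {label: text.find(label) for label in NLI_LABELS if label in text}
    return min(found, key=found.get) if found else "unknown"


def accuracy(generate: Callable[[str], str],
             items: List[Tuple[str, str, str]]) -> float:
    """Share of items for which the parsed prediction equals the gold label."""
    correct = 0
    for premise, hypothesis, gold in items:
        prediction = parse_label(generate(build_nli_prompt(premise, hypothesis)))
        correct += prediction == gold
    return correct / len(items)


def always_entailment(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it ignores the prompt."""
    return " entailment"


print(accuracy(always_entailment, NLI_ITEMS))  # 0.5
```

To evaluate a real model, `always_entailment` would be replaced by a function that passes the prompt to the model and returns only the newly generated text.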

## Exercise 3: First neural LM (20 points)

Next to reading and understanding package documentation, a key skill for NLP researchers and practitioners is reading and critically assessing NLP literature. The density, but also the style of NLP literature has undergone a significant shift in recent years with the increasing acceleration of progress. Your task in this exercise is to read a paper about one of the first successful neural language models, understand its key architectural components and compare how these key components have evolved in modern systems that were discussed in the lecture.

> Specifically, please read this paper and answer the following questions: [Bengio et al. (2003)](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)
>
> * How were words / tokens represented? What is the difference / similarity to modern LLMs?

Each unique word that appears in the training data (Brown corpus, AP News) is a vocabulary entry. Punctuation marks are included, and upper and lower case are distinguished. To reduce the vocabulary size, all rare words with frequency $\leq 3$ were merged into a single symbol, which gave a vocabulary of almost $17,000$ entries. Whole words were used as the basic units; there was no subword tokenizer, which is a key preprocessing step in modern LMs.

Modern LMs do not always store full words in their vocabulary; they also store subword units such as word roots, prefixes, and suffixes. These units are called _tokens_, and the vocabulary is larger, e.g., $50257$ for GPT-2. With subword tokens, a new word that was not encountered during training can be split into already known subwords and fed into the network, whereas Bengio et al.'s model would have to map it to the rare-word symbol or extend its vocabulary to accommodate it.
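
The difference can be illustrated with a toy sketch. Both vocabularies below are made up for illustration, and the greedy longest-match splitting is a simplification; it is not the byte-pair encoding that GPT-2 actually uses.

```python
from typing import List, Set

# Hypothetical toy vocabularies, for illustration only.
WORD_VOCAB: Set[str] = {"the", "model", "token", "<unk>"}
SUBWORD_VOCAB: Set[str] = {"token", "iz", "ation", "model", "s"}


def word_level(word: str) -> str:
    """Word-level lookup: any unseen word collapses to a single rare-word symbol."""
    return word if word in WORD_VOCAB else "<unk>"


def greedy_subwords(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by repeatedly taking the longest known piece from the left."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(word_level("tokenization"))                      # <unk>
print(greedy_subwords("tokenization", SUBWORD_VOCAB))  # ['token', 'iz', 'ation']
```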


---
> * How was the context represented? What is the difference / similarity to modern LLMs?

In the Bengio et al. paper, the context is the sequence of $n-1$ contiguous words that precede the $n$-th word to be predicted. The word embeddings of these $n-1$ words are simply concatenated to form the input to the hidden layer. They experimented with $n=5$ or $n=3$ in their _fixed context size_ MLP.
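
As a minimal sketch of this fixed-window representation, the snippet below concatenates the embeddings of the $n-1$ preceding words. The embedding table and its values are invented for illustration (embedding size $2$); in the paper these vectors are learned.

```python
from typing import Dict, List

# Hypothetical embedding table with made-up values (embedding size m = 2).
EMBEDDINGS: Dict[str, List[float]] = {
    "the": [0.1, 0.2],
    "cat": [0.3, 0.4],
    "sat": [0.5, 0.6],
    "on": [0.7, 0.8],
}


def context_vector(words: List[str], position: int, n: int) -> List[float]:
    """Concatenate the embeddings of the n-1 words preceding words[position]."""
    if position < n - 1:
        raise ValueError("not enough preceding words for a full context window")
    vector: List[float] = []
    for word in words[position - (n - 1):position]:
        vector.extend(EMBEDDINGS[word])
    return vector


sentence = ["the", "cat", "sat", "on"]
x = context_vector(sentence, position=3, n=3)  # context for predicting "on"
print(x)  # [0.3, 0.4, 0.5, 0.6]
```

The input always has length $(n-1) \cdot m$, so words further back than the window cannot influence the prediction.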

In modern LLMs, the maximum context length is much larger, e.g., $1024$ tokens (instead of words) for GPT-2 and $4096$ tokens for Llama 2, and these models also see contexts of varying lengths during training. Modern LMs also do not simply concatenate all the tokens of the input sequence to form the context. Instead, they use the attention mechanism, which lets each token draw on information from the other tokens and yields richer intermediate token representations.
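
The following is a deliberately simplified sketch of causal self-attention. It assumes a single head, omits the learned query, key, and value projections that real Transformers use, and works on plain Python lists; the function names are chosen here for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def causal_self_attention(tokens: List[List[float]]) -> List[List[float]]:
    """Each token becomes a weighted average of itself and all earlier tokens."""
    dim = len(tokens[0])
    outputs: List[List[float]] = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]  # causal mask: no access to later tokens
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, visible))
             for j in range(dim)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = causal_self_attention(tokens)
print(contextual[0])  # [1.0, 0.0] (the first token can only attend to itself)
```

Unlike the fixed concatenation window, the same function handles sequences of any length.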

---
> * What is the curse of dimensionality? Give a concrete example in the context of language modeling.

ffff



---

> * Which training data was used? What is the difference / similarity to modern LLMs?

The authors trained their neural language model on two separate text data sources:
1. the **Brown corpus** which contains over $1.18M$ words
2. and the **Associated Press (AP) News** (a collection of news reports) from 1995 and 1996 consisting of ~ $16M$ words.

Both data sources are small compared to the text data used for current LLMs, which covers large portions of the public Internet. Some examples of modern data sources include BookCorpus ($800M$ words) and English Wikipedia ($2500M$ words), which were used to pretrain the BERT encoder, and the Colossal Clean Crawled Corpus (about $750$GB of text scraped from the internet, with billions of tokens), introduced in the T5 paper. So the **scale** of the training data is one obvious difference. In addition, Bengio et al. focused only on pretraining, while modern LLMs are often customized for various downstream tasks by supervised fine-tuning.

---

> * Which components of the Bengio et al. (2003) model (if any) can be found in modern LMs?

1. Predicting a categorical distribution (over the vocabulary) for the next word from a context (a sequence of words): this is still the main pretraining objective in modern decoder-type LMs.
2. The word embedding lookup table (matrix $C$): modern LLMs also use such a table to convert raw tokens into embeddings.
3. The direct connections, now popularly known as skip connections: they help with the vanishing gradient problem in very deep neural networks.
4. The feedforward network (without the final softmax layer): as a whole, it is used as one block (FFN) in the Transformer architecture after the attention layer.
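
To make these components concrete, here is a minimal sketch of the forward pass from the paper, $y = b + Wx + U \tanh(d + Hx)$ followed by a softmax, where $x$ is the concatenated context embedding. The dimensions and the random weights are toy values chosen for illustration; they are not the settings used by Bengio et al.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def nplm_forward(x: Vector, H: Matrix, d: Vector,
                 U: Matrix, W: Matrix, b: Vector) -> Vector:
    """Next-word distribution: softmax(b + Wx + U tanh(d + Hx))."""
    hidden = [math.tanh(a + bias) for a, bias in zip(matvec(H, x), d)]
    direct = matvec(W, x)             # direct (skip) connection from the input
    from_hidden = matvec(U, hidden)   # contribution of the hidden layer
    logits = [bi + wi + ui for bi, wi, ui in zip(b, direct, from_hidden)]
    return softmax(logits)


# Toy sizes (assumptions): context vector of length 4, 3 hidden units, 5 words.
rng = random.Random(0)


def random_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


context_dim, hidden_dim, vocab_size = 4, 3, 5
x = [0.3, 0.4, 0.5, 0.6]  # stands for the concatenated context embeddings
H = random_matrix(hidden_dim, context_dim)
U = random_matrix(vocab_size, hidden_dim)
W = random_matrix(vocab_size, context_dim)
d = [0.0] * hidden_dim
b = [0.0] * vocab_size

probs = nplm_forward(x, H, d, U, W, b)
print(len(probs), round(sum(probs), 6))  # 5 1.0
```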
---

> 
> * Please formulate one question about the paper (not the same as the questions above) and post it to the dedicated **Forum** space, and **answer 1 other question** about the paper.

My question: [Direct connections vs. skip connections](https://moodle.zdv.uni-tuebingen.de/mod/forum/discuss.php?d=14910)

The question I answered: [Data parallel processing](https://moodle.zdv.uni-tuebingen.de/mod/forum/discuss.php?d=14920)

---


Furthermore, your task is to carefully dissect the paper by Bengio et al. (2003) and analyse its structure and style in comparison to another more recent paper:  [Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)


**TASK:**

> For each section of the Bengio et al. (2003) paper, what are the key differences in the way it is written and in the included contents, compared to the BERT paper (Devlin et al., 2019)? What are the key similarities? Write max. 2 sentences per section.


| Section  | NLM Bengio et al. (2003) | BERT Devlin et al. (2019) |
| ------------- | ------------- | ------------- |
| Content Cell  | Content Cell  | Content Cell  |
| Content Cell  | Content Cell  | Content Cell  |