# Homework 3: LLM agents & RL fine-tuning

The third homework zooms in on the following skills: implementing an **advanced generation system**, diving into **task-specific RL fine-tuning hands-on** and **critically thinking about fine-tuning of LMs**.

### Logistics

* submission deadline: June 25th, 23:59 German time, via Moodle
  * please upload a **SINGLE .IPYNB FILE named Surname_FirstName_HW3.ipynb** containing your solutions to the homework.
* please solve and submit the homework **individually**!
* if you use Colab, you can speed up execution by using the available GPU (if Colab resources allow). To do so, navigate to Runtime > Change runtime type > GPU > Save before executing your code.
* please note that both Ex. 1 and Ex. 2 need a lot of GPU memory. It is therefore best to do the tasks in **separate runtimes on Colab**; otherwise you might run into out-of-memory issues.

## Exercise 1: Building a retrieval-augmented generation system (30 points)

An increasingly popular approach to language generation is so-called *retrieval-augmented generation* (RAG), wherein a language model is supplied with additional (textual) information retrieved from some storage, in addition to the actual task query. This additional context has been found to improve model performance and, e.g., makes it possible to use LLMs with custom information (e.g., proprietary documents).

The general setup of a RAG system is as follows:
1. A searchable store of relevant background information (e.g., a database, a set of documents, ...) is created.
   1. A common format is a *vector DB*, also called a vector store. You can optionally learn more about vector DBs, e.g., here: https://www.pinecone.io/learn/vector-database/. The important conceptual point is that some form of searchable database with relevant (textual) information is created.
2. An LLM that will be generating the responses to the queries, given context, is chosen.
3. An embedding model is chosen.
4. Task queries (e.g., questions or instructions) are provided to the system.
   1. The query is converted to an embedding (using the model chosen in step 3), and the embedding is used to search and retrieve relevant information from the database. The specific retrieval method depends on the nature of the database.
   2. The relevant information is supplied to the LLM as context.
5. Given the extended context, the LLM provides output.

This is visualized in the figure below.

![img](../tutorials/pics/basic_rag.png)

The image is sourced from [here](https://docs.llamaindex.ai/en/stable/getting_started/concepts/).

For more details on RAG, you can read the first part of [this](https://docs.llamaindex.ai/en/stable/getting_started/concepts/) blog post (until "important concepts within each step"). [Here](https://arxiv.org/pdf/2005.11401) is an optional paper about RAG, in case you want to learn more.
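
The steps listed above can be sketched end to end in a few lines of plain Python. This is only a toy illustration under loud assumptions: the three "recipes" are made up, `embed` is a bag-of-words stand-in for a neural embedding model, and the final LLM call is left out (the sketch stops at the prompt that would be sent to the model). None of the function names come from LlamaIndex.

```python
import math
from collections import Counter

# Toy "database": three short recipe texts (made up for illustration).
DOCS = [
    "tomato soup: simmer tomatoes with onion and garlic, then blend",
    "pancakes: whisk flour, milk and eggs, then fry in butter",
    "garlic bread: spread garlic butter on bread and bake",
]

def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system uses a neural model)."""
    cleaned = text.lower().replace(",", " ").replace(":", " ").replace("?", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, top_k=2):
    """Step 4.1: embed the query and return the top_k most similar documents."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Step 4.2: supply the retrieved texts to the LLM as context."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

query = "How do I make tomato soup?"
top_docs = retrieve(query, DOCS, top_k=1)
prompt = build_prompt(query, top_docs)
print(prompt)  # step 5 would pass this prompt to the LLM
```

In the exercise below, LlamaIndex takes over each of these roles: the index stores the embedded recipes, the query engine retrieves them, and the response synthesizer builds the prompt.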

**YOUR TASK**
> Your task in this exercise is to explore RAG by implementing a RAG system for recipe generation. The implemented RAG system should be compared to the performance of the same model in a "vanilla" set-up where the model solves the task directly.
>
> We will use the package `LlamaIndex` and the LLM `phi-3-mini-4k-instruct` model as the backbone for the implementation. We will use the `BAAI/bge-small-en-v1.5` model as our embedding model.
>
> We will use unstructured data in the form of a recipe dataset `m3hrdadfi/recipe_nlg_lite`. This dataset will be indexed and used to supply the LLM with information in addition to the query. The train split of the dataset should be used for the index, and a sample from the test split will be used to draw the queries with which the system will be tested.
>
> For this task, please complete the following steps:
> 1. Download the dataset from Huggingface.
> 2. Briefly familiarize yourself with the dataset.
> 3. Briefly familiarize yourself with [this](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/) LlamaIndex example RAG system.
> 4. Complete the code below (in place of "### YOUR CODE HERE ####"), following the instructions in the comments to build a working RAG system that will generate recipes. Note that you will have to work with the LlamaIndex documentation to complete and understand the code. Some links are already provided.
> 5. Answer the questions at the end of the exercise.

In [1]:
# uncomment and run in your environment / on Colab, if you haven't installed these packages yet
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-huggingface
!pip install sentence-transformers
!pip install datasets
!pip install llama-index
# !pip install "transformers[torch]" "huggingface_hub[inference]"
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
from IPython.display import clear_output
clear_output()

In [2]:
# import packages
from datasets import load_dataset
import os
import pandas as pd
from llama_index.core import VectorStoreIndex, Settings, Document
# from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer
import torch




In [3]:
# load dataset from HF
dataset = load_dataset("m3hrdadfi/recipe_nlg_lite")
# convert train split to pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])
clear_output()

In [4]:
# explore
dataset_df.head()

| | uid | name | description | link | ner | ingredients | steps |
|---|---|---|---|---|---|---|---|
| 0 | dab8b7d0-e0f6-4bb0-aed9-346e80dace1f | pork chop noodle soup | we all know how satisfying it is to make great... | https://www.yummly.com/private/recipe/Pork-Cho... | bone in pork chops, salt, pepper, vegetable oi... | 3.0 bone in pork chops, salt, pepper, 2.0 tabl... | season pork chops with salt and pepper . heat ... |
| 1 | b03f346bf39efcbace5d30a8f962147c8c4c361f | 5 ingredient almond cake with fresh berries | this simple almond cake is made with just five... | https://www.skinnytaste.com/5-ingredient-almon... | large eggs, large egg whites, sugar, pure vani... | 3 large eggs, 3 large egg whites, 2/3 cup suga... | position a rack in the middle of the oven and ... |
| 2 | 89b49e742b2c1d234b83044c14d81155dfea7f19 | shrimp cakes | these light, pan seared shrimp cakes are moist... | https://www.skinnytaste.com/shrimp-cakes/ | peeled and deveined jumbo shrimp, plus 3 table... | 1 pound peeled and deveined jumbo shrimp, 1 cu... | pat shrimp dry with a paper towel and place in... |
| 3 | 5db9af50-63dc-4c5b-9db1-783cf96675d3 | chili roasted okra | chili roasted okra with okra, sesame oil, red ... | https://www.yummly.com/private/recipe/Chili-Ro... | okra, sesame oil, red pepper flakes, salt, pepper | 1.0 pound okra, 1.0 tablespoon sesame oil, 1.0... | preheat the oven to 425degf . wash and dry the... |
| 4 | 9b8da42d-d07c-4766-9f15-fd3fd6e19bf6 | slow cooker chicken chili | warm up on a cold day with this slow cooker ch... | https://www.yummly.com/private/recipe/Slow-Coo... | oil, chicken, chili powder, onion, jalapeno pe... | 1.0 tablespoon oil, 1.0 pound chicken, 1.5 tab... | heat oil in skillet over medium high heat . ad... |


In [5]:
dataset_df.iloc[0]['name']

'pork chop noodle soup'

In [6]:
dataset_df.iloc[0]['ingredients']

'3.0 bone in pork chops, salt, pepper, 2.0 tablespoon vegetable oil, 2.0 cup chicken broth, 4.0 cup vegetable broth, 1.0 red onion, 4.0 carrots, 2.0 clove garlic, 1.0 teaspoon dried thyme, 0.5 teaspoon dried basil, 1.0 cup rotini pasta, 2.0 stalk celery'

In [7]:
dataset_df.iloc[0]['steps']

'season pork chops with salt and pepper . heat oil in a dutch oven over medium high heat . add chops and cook for about 4 minutes, until golden brown . flip and cook 4 minutes more, until golden brown . transfer chops to a plate and set aside . pour half of chicken broth into pot, scraping all browned bits from bottom . add remaining chicken broth, vegetable broth, onion, carrots, celery and garlic . mix well and bring to a simmer . add 1 quart water, thyme, basil, 2 teaspoons salt and 1 teaspoon pepper . mix well and bring to a simmer . add chops back to pot and return to simmer . reduce heat and simmer for 90 minutes, stirring occasionally, being careful not to break up chops . transfer chops to plate, trying not to break them up . set aside to cool . raise the heat and bring the soup to a boil . add pasta and cook for about 12 minutes, until tender . when the chops are cool, pull them apart, discarding all the bones and fat . add the meat back to soup and stir well . taste for salt 

In [8]:
# 1. In order to construct a VectorStoreIndex with the texts from the train dataset split, we need to
# create a list of formatted texts.
# We want to construct texts of the form: "Name of recipe \n\n ingredients \n\n steps"

texts = [
    #### YOUR CODE HERE #####
    # iterate over the rows directly instead of indexing with iloc for every field
    f"{row['name']} \n\n {row['ingredients']} \n\n {row['steps']}"
    for _, row in dataset_df.iterrows()
]
texts[:2]

['pork chop noodle soup \n\n 3.0 bone in pork chops, salt, pepper, 2.0 tablespoon vegetable oil, 2.0 cup chicken broth, 4.0 cup vegetable broth, 1.0 red onion, 4.0 carrots, 2.0 clove garlic, 1.0 teaspoon dried thyme, 0.5 teaspoon dried basil, 1.0 cup rotini pasta, 2.0 stalk celery \n\n season pork chops with salt and pepper . heat oil in a dutch oven over medium high heat . add chops and cook for about 4 minutes, until golden brown . flip and cook 4 minutes more, until golden brown . transfer chops to a plate and set aside . pour half of chicken broth into pot, scraping all browned bits from bottom . add remaining chicken broth, vegetable broth, onion, carrots, celery and garlic . mix well and bring to a simmer . add 1 quart water, thyme, basil, 2 teaspoons salt and 1 teaspoon pepper . mix well and bring to a simmer . add chops back to pot and return to simmer . reduce heat and simmer for 90 minutes, stirring occasionally, being careful not to break up chops . transfer chops to plate, 

In [9]:
print(f'total documents:{len(texts)}')
print(texts[100])

total documents:6118
houston's veggie burger 

 1 15 ounce can black beans, 1 teaspoon olive oil, 1/4 cup chopped onion, 1 clove garlic, 1 teaspoon smoked paprika, 1 teaspoon cumin, 1/2 teaspoon chili powder, 1 teaspoon kosher salt, freshly ground black pepper, 1/4 cup bbq sauce, 1 tablespoon molasses, 1/4 cup old fashioned oats, 1 1/4 cup cooked brown rice, 2 tablespoons finely chopped canned beets, 1 tablespoon beet juice, 1 large egg, 4 whole wheat 100 calorie hamburger buns, optional toppings sliced pepper jack cheese, 

 add the beans to a large mixing bowl . gently pat beans dry with a paper towel . using the back side of a fork or potato masher, mash beans until smooth and pasty . heat a small skillet over medium heat . when hot, add the oil, onion and garlic . saute 3 minutes then transfer to the bowl with the beans . in a small bowl, add the paprika, cumin, chili powder, salt and pepper . mix until combined then add to the large bowl . using the same small bowl, mix the bbq sa

In [10]:
# 2. We construct single Documents from the texts
# these documents will be used to construct the vector database
documents = [Document(text=t) for t in texts]
# documents

In [11]:
documents[0]

Document(id_='4a83941f-ac81-4301-8cf3-29a878da5d34', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='pork chop noodle soup \n\n 3.0 bone in pork chops, salt, pepper, 2.0 tablespoon vegetable oil, 2.0 cup chicken broth, 4.0 cup vegetable broth, 1.0 red onion, 4.0 carrots, 2.0 clove garlic, 1.0 teaspoon dried thyme, 0.5 teaspoon dried basil, 1.0 cup rotini pasta, 2.0 stalk celery \n\n season pork chops with salt and pepper . heat oil in a dutch oven over medium high heat . add chops and cook for about 4 minutes, until golden brown . flip and cook 4 minutes more, until golden brown . transfer chops to a plate and set aside . pour half of chicken broth into pot, scraping all browned bits from bottom . add remaining chicken broth, vegetable broth, onion, carrots, celery and garlic . mix well and bring to a simmer . add 1 quart water, thyme, basil, 2 teaspoons salt and 1 teaspoon pepper . mix well and bring to a simmer . ad

In [12]:
# 3. We prepare some utility functions which are required for the LLM to generate maximally accurate responses
# this includes correctly formatting the query and the context into the prompt and special tokens
# that are expected by the chosen LLM backbone.

# we format the texts into the Phi-3 prompt format
# see https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
# to check what the prompt should look like!
def completion_to_prompt(completion):
    prompt = f'<|user|>\n{completion} <|end|>\n<|assistant|>'
    return prompt ### YOUR CODE HERE ###
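
The same idea extends from a single completion to a list of chat messages. The sketch below is a hypothetical helper, not part of the homework: it assumes the Phi-3 chat layout (`<|role|>`, content, `<|end|>`) described on the model card, and it takes plain dictionaries with `role` and `content` keys. LlamaIndex passes its own message objects to such a function, so it would need adapting before being used there.

```python
def messages_to_prompt(messages):
    """Format chat messages in a Phi-3-style layout (assumed from the model card).

    `messages` is a hypothetical list of dicts with "role" and "content" keys.
    """
    parts = []
    for message in messages:
        # each turn is wrapped as <|role|>\ncontent<|end|>\n
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # the trailing assistant tag asks the model to produce the next turn
    parts.append("<|assistant|>\n")
    return "".join(parts)

example = messages_to_prompt([
    {"role": "system", "content": "You are a helpful cooking assistant."},
    {"role": "user", "content": "How do I make pork chop noodle soup?"},
])
print(example)
```

Getting these special tokens right matters because an instruction-tuned model was trained on exactly this layout; a prompt without them tends to be treated as plain text to continue.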

In the next cell, the RAG building blocks are put together. Your task is to find out what the different configurations mean and correctly complete the code.

In [13]:
# 4. Save settings that are reused by our RAG system across queries
# you can learn more about the Settings object here: https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/

# the embedding model is defined
Settings.embed_model = HuggingFaceEmbedding(
    ### YOUR CODE HERE ###
    model_name='BAAI/bge-small-en-v1.5',
)

# backbone LLM is passed to the settings
# this is actually the model that is used to generate the response to the query, given retrieved info
# https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms/
# and here: https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom/
Settings.llm = HuggingFaceLLM(
    ### YOUR CODE HERE ###
    model_name='microsoft/Phi-3-mini-4k-instruct',
    ### YOUR CODE HERE ###
    tokenizer_name='microsoft/Phi-3-mini-4k-instruct',
    #### YOUR CODE HERE ###
    context_window=1024,
    max_new_tokens=128,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    completion_to_prompt=completion_to_prompt,
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True, "trust_remote_code": True},
)
print("Set LLM!")

# https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/
# we create a vector store from our documents
# here, we let the VectorStore convert the documents to nodes automatically
index = VectorStoreIndex.from_documents(
    #### YOUR CODE HERE ###
    documents
)
print("Created index!")

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Set LLM!
Created index!


Below is a single example of running a query with the RAG system and inspecting various interesting aspects of the response generated by the model. Your task in the following is to set up a testing loop that tests different queries with the RAG system and with vanilla generation using the same LLM. Use the example as help. Provide comments explaining the individual parameters in the following example, in place of "### YOUR COMMENT HERE ###".

In [16]:
# https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/
# we define the query engine: generic interface that allows to ask questions over data
query_engine = index.as_query_engine(
    ### response_mode defines how to combine the retrieved documents
    ### (through one or more LLM calls if all documents do not fit in the context size) into a single answer for the query ###
    response_mode="compact",
    ### return the top_k chunks/nodes of documents that are semantically similar to the query ###
    similarity_top_k=3,
    verbose=True,
)
# https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/
response = query_engine.query("How do I make pork chop noodle soup?")
print(response)

for i, n in enumerate(response.source_nodes):
    print(f"----- Node {i} -----")
    print(n.node.get_content())
    print("score")
    print(n.score)

To make pork chop noodle soup, follow these steps:

1. Season 3 bone-in pork chops with salt and pepper.
2. Heat 2 tablespoons of vegetable oil in a Dutch oven over medium-high heat.
3. Add the pork chops and cook for about 4 minutes on each side, until golden brown.
4. Transfer chops to a plate and set aside.
5. Pour half of the chicken broth into the pot, scraping all browned bits from the bottom. Add the remaining ch
----- Node 0 -----
pork chop noodle soup 

 3.0 bone in pork chops, salt, pepper, 2.0 tablespoon vegetable oil, 2.0 cup chicken broth, 4.0 cup vegetable broth, 1.0 red onion, 4.0 carrots, 2.0 clove garlic, 1.0 teaspoon dried thyme, 0.5 teaspoon dried basil, 1.0 cup rotini pasta, 2.0 stalk celery 

 season pork chops with salt and pepper . heat oil in a dutch oven over medium high heat . add chops and cook for about 4 minutes, until golden brown . flip and cook 4 minutes more, until golden brown . transfer chops to a plate and set aside . pour half of chicken broth into 

In [17]:
# comp = Settings.llm.complete("How do I make pork chop noodle soup?")

In [18]:
# print(comp.text)

In [21]:
# testing loop
rag_responses = []
vanilla_responses = []
retrieved_node_texts = []
retrieved_node_scores = []

# retrieve 20 random dish names from test dataset to test the system on
test_df = pd.DataFrame(dataset["test"]).sample(20)
test_queries = [
    f'How do I make {r["name"]}?' for
    _, r in test_df.iterrows()
]
for query in test_queries[:5]:
    print(query)

for query in test_queries[:5]:
    ### YOUR CODE HERE ###
    # run the query against the RAG system
    response_rag = query_engine.query(query)
    rag_responses.append(str(response_rag))

    # record the texts of the nodes that were retrieved for this query
    retrieved_node_texts.append(
        [n.node.get_content() for n in response_rag.source_nodes] ### YOUR CODE HERE ###
    )

    # record the scores of the texts of the retrieved nodes
    retrieved_node_scores.append(
        [n.score for n in response_rag.source_nodes] ### YOUR CODE HERE ###
    )
    ### YOUR CODE HERE ###
    # implement the "vanilla" (i.e., straightforward) generation of the response to the same query with the backbone LLM
    # Hint: check the intro-to-hf sheet for examples of how to generate text with an LM
    response_vanilla = Settings.llm.complete(query).text
    vanilla_responses.append(response_vanilla)

How do I make mussels in basil cream sauce?
How do I make roasted turkey?
How do I make spicy breakfast fajitas with eggs and guacamole?
How do I make chateaubriand steaks with mushroom red wine sauce?
How do I make cauliflower griddle cakes?


In [22]:
test_queries[:5]

['How do I make mussels in basil cream sauce?',
 'How do I make roasted turkey?',
 'How do I make spicy breakfast fajitas with eggs and guacamole?',
 'How do I make chateaubriand steaks with mushroom red wine sauce?',
 'How do I make cauliflower griddle cakes?']

In [23]:
retrieved_node_scores

[[0.806470621256049, 0.7932070397580322, 0.7791804284897788],
 [0.8476552516868898, 0.837986768369647, 0.8375830887768813],
 [0.8059946481988572, 0.761993548826202, 0.7517859715667436],
 [0.7772794543055869, 0.7599149079050604, 0.7559853815785039],
 [0.7822581639252272, 0.7817568511052448, 0.7790570657620833]]

In [24]:
for i, _ in enumerate(test_queries[:5]):
    print(f"---- i:{i+1}/{len(test_queries[:5])} ----")
    print(f"test query: {test_queries[i]}")
    print(f"*** vanilla response: {vanilla_responses[i]}")
    print(f"*** rag response:\n{rag_responses[i]}")
    print('----'*10)

---- i:1/5 ----
test query: How do I make mussels in basil cream sauce?
*** vanilla response: 

A. In a large pot over medium heat, sauté 4 sliced garlic cloves in olive oil until golden, 2 minutes.  Stir in 2 pounds mussels and 1 cup basil cream and 1 cup white wine and 1 cup water.  Cook for 4 minutes or until mussels open, stirring occasionally.  Remove from heat, discarding any mussels that don't open.  Stir in 1/2 cup grated parmesan cheese and serve.
B. In a large pot over medium heat, saut
*** rag response:
I'm sorry, but based on the provided context information, there isn't a recipe for mussels in basil cream sauce. However, you can create a similar dish using the given ingredients and techniques. Here's a possible recipe inspired by the provided context:

Ingredients:
- 2 pounds fresh live mussels
- 3/4 cup white wine
- 3/4 cup water
- 1/2 cup heavy cream
- 1/4 cup finely chopped fresh basil
- 1 tablespoon finely
----------------------------------------
---- i:2/5 ----
test q

In [26]:
# rag_responses

In [27]:
retrieved_node_texts[0]

['steamed mussels with piri piri sauce \n\n 6 tablespoons finely chopped red onion, 1/4 cup finely chopped parsley, 3 tablespoons olive oil, 2 tablespoons red wine vinegar, 1 tablespoon water, 1 garlic clove, 1/2 jalapeno pepper, 1/4 teaspoon kosher salt, 1/8 teaspoon black pepper, 1/8 to 1/4 teaspoon crushed red pepper, 2 pounds fresh live mussels, 3/4 cup white wine, 3/4 cup water \n\n combine all the sauce ingredients in a medium bowl and mix well . sit at room temperature while preparing the mussels . place mussels in a colander and rinse them under cold water to remove any sand . scrub them with a stiff brush under cold running water to remove any sand . the shells will be closed until you cook them, discard any cracked shells . to debeard, use your fingers to firmly pull out the hairy filaments . clean off the outsides with a brush to remove any barnacles or dirt . place 3/4 cups of water and 3/4 cup of white wine in a large pot and bring to boil . add the mussels, cover and stea

> **Questions:**
>
> 1. Inspect the results of the testing. (a) How often do you prefer the RAG response over the vanilla response? (b) Do you observe differences between the RAG and vanilla responses? If yes, what are these? (c) Inspect the retrieved documents and their scores. Do they make sense for the queries? Do the scores match your intuition about their relevance for the query?

(a) I would prefer the RAG response over the vanilla response in all 5 cases above, for the reasons given in part (b).

(b) Some differences I noted between the vanilla and RAG responses:
- a few vanilla responses are _empty_, i.e., the model did not generate anything, which is a bit strange.
- a few of the vanilla responses contain _repeated lines/sentences_, whereas the RAG responses are mostly coherent and compact.
- vanilla responses are often _unstructured_, i.e., they list some steps but are often incomplete; most RAG responses, on the other hand, give some instructions first and then list the required ingredients.

(c) The retrieved documents are indeed relevant for the queries. For all five queries, every retrieved document has a similarity score greater than $0.75$.

For example, for the query `How do I make mussels in basil cream sauce?`, all 3 of the retrieved documents have high similarity (about $0.81$, $0.79$ and $0.78$) and contain some recipe involving `mussels` as seen above, although not exactly the recipe for `mussels in basil cream sauce`. In this case the model improvised nicely and generated a similar-looking recipe from the retrieved documents.
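
The statement about the scores can be checked directly on the numbers printed by `retrieved_node_scores` above. The snippet below copies those values by hand (so it only reflects this particular run) and computes the lowest score and the mean similarity per query.

```python
# Similarity scores copied from the `retrieved_node_scores` output above.
scores = [
    [0.806470621256049, 0.7932070397580322, 0.7791804284897788],
    [0.8476552516868898, 0.837986768369647, 0.8375830887768813],
    [0.8059946481988572, 0.761993548826202, 0.7517859715667436],
    [0.7772794543055869, 0.7599149079050604, 0.7559853815785039],
    [0.7822581639252272, 0.7817568511052448, 0.7790570657620833],
]

# lowest score over all queries and all retrieved nodes
lowest = min(min(row) for row in scores)
# mean similarity of the three retrieved nodes for each query
per_query_mean = [sum(row) / len(row) for row in scores]
# index of the query whose retrieved nodes are most similar on average
best_query = max(range(len(scores)), key=lambda i: per_query_mean[i])

print(f"lowest score: {lowest:.3f}")
for i, mean in enumerate(per_query_mean):
    print(f"query {i + 1}: mean similarity {mean:.3f}")
print(f"best query index: {best_query}")
```

Note that a high score only says that the retrieved text is close to the query in embedding space; whether it is the right recipe still has to be judged by reading the node texts.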

---

> 2. What could be advantages and disadvantages of using RAG? Name 1 each.

**Advantage**: RAG grounds the model's answer in the retrieved context, which can reduce the chance of hallucinations.

**Disadvantage**: Often it is not possible to fit all retrieved documents into the model prompt (because of the limited context size). The answers generated over multiple LLM calls then have to be combined in some way, which makes the system more complex.

---
> 3. What is the difference between documents and nodes in the RAG system?

In LlamaIndex, a `Document` is a container object for a whole data source, e.g., a (large) text document or a PDF. A `Node`, on the other hand, is a smaller chunk of a document with associated metadata; nodes are the units that are embedded and retrieved. A big document might get chunked into multiple nodes.
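
To make the relation concrete, here is a minimal sketch of how one document could be split into overlapping nodes. It is an illustration only: `split_into_nodes`, `chunk_size` and `overlap` are hypothetical names, sizes are counted in whitespace-separated words, and LlamaIndex's own splitters work on tokens and sentence boundaries instead.

```python
def split_into_nodes(text, chunk_size=8, overlap=2):
    """Split a text into overlapping word-level chunks ("nodes").

    Assumes overlap < chunk_size. Each node keeps a small piece of metadata
    (its start position) next to its text.
    """
    words = text.split()
    step = chunk_size - overlap
    nodes = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        nodes.append({"text": " ".join(chunk), "start_word": start})
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the document
    return nodes

document = "season pork chops with salt and pepper . heat oil in a dutch oven over medium high heat ."
nodes = split_into_nodes(document)
for node in nodes:
    print(node)
```

The recipes in this exercise are short, so each `Document` typically ends up as a single node; the distinction matters more for long sources.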

---
> 5. What does the embedding model do? What is the measure that underlies retrieval of relevant documents?

The embedding model produces a compact numerical vector representation (_embedding_) for each document provided for indexing. When a query comes in, the same embedding model produces an embedding for the query, and the system retrieves the documents whose embedding vectors are most similar to it according to some similarity measure.

In LlamaIndex, the default similarity measure is **cosine similarity**.
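
As a reminder of what this measure computes (the standard definition, not taken from the LlamaIndex source), for a query embedding $q$ and a document embedding $d$:

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i d_i^2}}
$$

For example, with the toy vectors $q = (1, 0)$ and $d = (1, 1)$ this gives $1 / (1 \cdot \sqrt{2}) \approx 0.71$. Vectors pointing in the same direction give $1$ and orthogonal vectors give $0$, independent of their lengths.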


---
> 6. What are different response modes of the query engine? Is the chosen mode a good choice for our application? Why (not)?

Often, not all documents retrieved for a given query fit into a single LLM call because of the limited context size. Response modes define how the retrieved text is distributed over LLM calls and how the resulting generations are combined into a single final answer for the user. The available response modes include `refine`, `compact`, `tree_summarize`, `accumulate` and `simple_summarize`.

For our application, the `compact` response mode was chosen; it tries to fit as many documents/nodes as possible into each prompt to minimize the number of calls to the LLM. It looks like a good choice for this recipe application, because the documents/recipes are short (100-200 words) and the context window configured for the model is $1024$ tokens, so several documents fit into a single prompt.
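
The packing idea behind `compact` can be illustrated with a small greedy sketch. This is not LlamaIndex's implementation: `pack_chunks` and `budget_words` are hypothetical names, and chunk sizes are counted in words as a rough stand-in for tokens.

```python
def pack_chunks(chunks, budget_words):
    """Greedily pack retrieved chunks into as few prompts as the budget allows.

    A chunk that alone exceeds the budget still gets its own prompt in this sketch.
    """
    prompts, current, used = [], [], 0
    for chunk in chunks:
        size = len(chunk.split())
        if current and used + size > budget_words:
            # the current prompt is full: close it and start a new one
            prompts.append("\n\n".join(current))
            current, used = [], 0
        current.append(chunk)
        used += size
    if current:
        prompts.append("\n\n".join(current))
    return prompts

chunks = ["a " * 150, "b " * 150, "c " * 150]       # three 150-word stand-ins for recipes
print(len(pack_chunks(chunks, budget_words=500)))   # generous budget: everything fits into one prompt
print(len(pack_chunks(chunks, budget_words=200)))   # tight budget: one prompt (LLM call) per chunk
```

With short recipes and a sufficiently large budget, everything lands in one prompt, which is why `compact` costs few LLM calls here.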


---

## Exercise 2: RLHF for summarization (15 points)

In this exercise, we want to fine-tune GPT-2 to generate human-like news summaries, following a procedure that is very similar to the example of the movie review generation from [sheet 4.1](https://cogsciprag.github.io/Understanding-LLMs-course/tutorials/04a-finetuning-RL.html). The exercise is based on the paper by [Ziegler et al. (2020)](https://arxiv.org/pdf/1909.08593).

To this end, we will use the following components:
* in order to initialize the policy, we use GPT-2 that was already fine-tuned for summarization, i.e., our SFT model is [this](https://huggingface.co/Ayham/albert_gpt2_Full_summarization_cnndm)
* as our reward model, we will use a task-specific reward signal, namely, the ROUGE score that evaluates a summary generated by a model against a human "gold standard" summary.
* a dataset of CNN news texts and human-written summaries (for computing the rewards) for the fine-tuning which can be found [here](https://huggingface.co/datasets/abisee/cnn_dailymail). Please note that we will use the *validation* split because we only want to run short fine-tuning.
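
To give an intuition for the reward signal, the sketch below computes a simplified ROUGE-1 F1 score, i.e., the unigram overlap between a generated and a reference summary. It is a stand-in for illustration only: it lowercases and splits on whitespace without stemming, so its values will differ from those of the `rouge_score` package used later, and the example sentences are made up.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary (simplified)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "gomis collapses at tottenham but later says he is feeling well"
good = "gomis collapses at tottenham but is feeling well"
bad = "the weather was sunny in london"

print(rouge1_f1(good, reference))  # high reward: most words match the reference
print(rouge1_f1(bad, reference))   # zero reward: no word overlap
```

During PPO training, a score of this kind is computed for every generated summary and passed to the trainer as the scalar reward.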

**NOTE:** for building the dataset and downloading the pretrained model, ~4GB of space will be used.

> **YOUR TASK:**
>
> Your job for this task is to set up the PPO-based training with the package `trl`, i.e., step 3 of [this](https://cdn.openai.com/instruction-following/draft-20220126f/methods.svg) figure.
> 1. Please complete the code, or insert comments explaining what a particular line of code does, wherever a comment says "#### YOUR CODE / COMMENT HERE ####". For this and for answering the questions, you might need to dig a bit deeper into the workings of proximal policy optimization (PPO), the algorithm that we are using for training. You can find relevant information, e.g., [here](https://huggingface.co/docs/trl/main/en/ppo_trainer).
> 2. To test your implementation, you can run the training for some steps, but you are NOT required to train the full model since it will take too long.
> 3. Answer the questions below.

In [5]:
!pip install trl accelerate==0.27.2 evaluate rouge_score datasets
from IPython.display import clear_output
clear_output()

In [6]:
# import libraries
import torch
from tqdm import tqdm
import pandas as pd

tqdm.pandas()

from transformers import AutoTokenizer
from datasets import load_dataset

from trl import (
    PPOTrainer,
    PPOConfig,
    AutoModelForCausalLMWithValueHead
)
import evaluate

In [7]:
config = PPOConfig(
    model_name="gavin124/gpt2-finetuned-cnn-summarization-v2",  # model we wish to align/train
    learning_rate=1.41e-5,
    steps=250,
    #### YOUR COMMENT HERE (what is batch_size) ####
    batch_size=4,      # Number of samples per optimisation step to forward pass through the model
    mini_batch_size=4, # Number of samples optimized in each mini batch
    #### YOUR COMMENT HERE (what is ppo_epochs) ####
    ppo_epochs=4,      # Number of optimisation epochs per batch of samples

    remove_unused_columns=False, # to keep the `highlights` column during batch preparation
)
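
How `batch_size`, `mini_batch_size` and `ppo_epochs` interact can be sketched without any training code. The function below is a hypothetical illustration, not `trl` code: it assumes no gradient accumulation, so every mini-batch corresponds to one optimizer step, and it only lists which sample indices would be used in each step for one collected batch.

```python
import random

def ppo_update_schedule(batch_size, mini_batch_size, ppo_epochs, seed=0):
    """List the sample indices used in each optimisation step for one batch."""
    rng = random.Random(seed)
    schedule = []
    for epoch in range(ppo_epochs):
        indices = list(range(batch_size))
        rng.shuffle(indices)  # each epoch revisits the same rollouts in a new order
        for start in range(0, batch_size, mini_batch_size):
            schedule.append((epoch, indices[start:start + mini_batch_size]))
    return schedule

steps = ppo_update_schedule(batch_size=4, mini_batch_size=4, ppo_epochs=4)
print(len(steps))  # number of optimizer steps per collected batch
for epoch, minibatch in steps:
    print(epoch, minibatch)
```

With the values from the config above, each batch of 4 generated summaries is reused for 4 optimisation passes, each consisting of a single mini-batch.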

In [8]:
ds = load_dataset("abisee/cnn_dailymail", '1.0.0', split="validation")
clear_output()

In [9]:
ds[0].keys()

dict_keys(['article', 'highlights', 'id'])

In [10]:
print(ds[2]['article'])

(CNN)French striker Bafetimbi Gomis, who has a history of fainting, said he is now "feeling well" after collapsing during Swansea's 3-2 loss at Tottenham in the Premier League on Wednesday. The worrying incident occurred in the first half at White Hart Lane -- after Tottenham scored in the seventh minute -- but the 29-year-old left the pitch conscious following about five minutes of treatment. The Guardian added that he was wearing an oxygen mask. Play was temporarily stopped before resuming. As the match progressed, Swansea tweeted that Gomis was "fine," with manager Garry Monk using the same word to describe Gomis' condition. Gomis spent the night in hospital as a precaution, Swansea said on its website. "I wanted to reassure you concerning my health," Gomis told the website. "It actually looks much scarier than it is physically dangerous, and I am feeling well now. "I have been under a great deal of stress and fatigue due to my father's health, which requires me to go back and forth

In [11]:
ds[2]['highlights'] #--- ground truth summary

'Bafetimbi Gomis collapses within 10 minutes of kickoff at Tottenham . But he reportedly left the pitch conscious and wearing an oxygen mask . Gomis later said that he was "feeling well" The incident came three years after Fabrice Muamba collapsed at White Hart Lane .'

We load the CNN dataset and truncate the texts to 512 tokens, because we don't want the training to be too memory-heavy and we want to leave some tokens of the context window free for the generation (GPT-2's context window size is 1024). Then we tokenize each text and pad it to this length.
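
As a rough illustration of the truncation and left padding described above, the following sketch works on plain lists of token ids instead of a real tokenizer. The helper `truncate_and_left_pad` and the pad id `0` are made up for this example; the actual cell below relies on `tokenizer.encode` with `max_length=512`, `truncation=True` and `padding="max_length"`.

```python
def truncate_and_left_pad(token_ids, max_length, pad_id):
    """Cut a sequence to max_length tokens, then pad on the left up to max_length."""
    truncated = token_ids[:max_length]       # keep only the first max_length tokens
    n_pad = max_length - len(truncated)      # number of padding tokens still needed
    return [pad_id] * n_pad + truncated      # left padding: real tokens stay next to the generation

CONTEXT_WINDOW = 1024   # GPT-2 context size, as stated above
QUERY_LENGTH = 512      # max_length used in build_dataset
MAX_NEW_TOKENS = 128    # output_max_length used later in the training loop

short = truncate_and_left_pad([11, 12, 13], max_length=5, pad_id=0)
long = truncate_and_left_pad(list(range(1, 9)), max_length=5, pad_id=0)
print(short)  # [0, 0, 11, 12, 13]
print(long)   # [1, 2, 3, 4, 5]

# tokens of the context window that remain unused after query + generation
remaining = CONTEXT_WINDOW - QUERY_LENGTH - MAX_NEW_TOKENS
print(remaining)  # 384
```

With these numbers the query and the generated summary together stay well inside the context window.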

In [12]:
def build_dataset(
        config,
        dataset_name="abisee/cnn_dailymail"
    ):
    """
    Build dataset for training. This builds the dataset from `load_dataset`.

    Args:
        dataset_name (`str`):
            The name of the dataset to be loaded.

    Returns:
        dataloader (`torch.utils.data.DataLoader`):
            The dataloader for the dataset.
    """
    tokenizer = AutoTokenizer.from_pretrained("gavin124/gpt2-finetuned-cnn-summarization-v2")# AutoTokenizer.from_pretrained(#### YOUR CODE HERE ####)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = 'left'
    # load the datasets
    ds = load_dataset(dataset_name, '1.0.0', split="validation")

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(
            #### YOUR CODE HERE (hint: inspect the dataset to see how to access the input text)####,
            sample["article"],
            return_tensors="pt",
            max_length=512,
            truncation=True,
            padding="max_length"
        )
        # get the truncated natural text, too
        sample["query"] = tokenizer.decode(sample["input_ids"][0])  # PPO needs a query column
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

In [13]:
# build the dataset
dataset = build_dataset(config)

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

clear_output()
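
To make the effect of `collator` concrete, here is a small sketch on toy samples (the two dictionaries are invented for illustration). It shows that a list of per-sample dicts is turned into one dict of lists, which is the batch layout inspected further below.

```python
def collator(data):
    # same one-liner as in the cell above: list of dicts -> dict of lists
    return dict((key, [d[key] for d in data]) for key in data[0])

# toy samples standing in for dataset rows (hypothetical content)
toy_samples = [
    {"query": "article one", "highlights": "summary one"},
    {"query": "article two", "highlights": "summary two"},
]

toy_batch = collator(toy_samples)
print(toy_batch)
# {'query': ['article one', 'article two'], 'highlights': ['summary one', 'summary two']}
```

Keeping the values as Python lists (rather than stacking them into one tensor) means that string columns such as `highlights` survive batching unchanged.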

In [15]:
# # some exploration to find out why batch parameter was missing
# # dict((key, [d[key] for d in dataset]) for key in dataset[0])
# for key in dataset[0]:
#     print(key)

#     for d in dataset:
#         print(key, type(d[key]))
#         break

#     # print(dict((key, [d[key] for d in dataset])))
#     print('---')
#     # break

In [16]:
print(dataset)
print(f'keys in dataset: {dataset[0].keys()}')

Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'query'],
    num_rows: 13368
})
keys in dataset: dict_keys(['article', 'highlights', 'id', 'input_ids', 'query'])


In [17]:
print(dataset[0]['article'])

(CNN)Share, and your gift will be multiplied. That may sound like an esoteric adage, but when Zully Broussard selflessly decided to give one of her kidneys to a stranger, her generosity paired up with big data. It resulted in six patients receiving transplants. That surprised and wowed her. "I thought I was going to help this one person who I don't know, but the fact that so many people can have a life extension, that's pretty big," Broussard told CNN affiliate KGO. She may feel guided in her generosity by a higher power. "Thanks for all the support and prayers," a comment on a Facebook page in her name read. "I know this entire journey is much bigger than all of us. I also know I'm just the messenger." CNN cannot verify the authenticity of the page. But the power that multiplied Broussard's gift was data processing of genetic profiles from donor-recipient pairs. It works on a simple swapping principle but takes it to a much higher level, according to California Pacific Medical Center 

In [18]:
print(dataset[0]['query']) # --- note that the query is a truncated article!

(CNN)Share, and your gift will be multiplied. That may sound like an esoteric adage, but when Zully Broussard selflessly decided to give one of her kidneys to a stranger, her generosity paired up with big data. It resulted in six patients receiving transplants. That surprised and wowed her. "I thought I was going to help this one person who I don't know, but the fact that so many people can have a life extension, that's pretty big," Broussard told CNN affiliate KGO. She may feel guided in her generosity by a higher power. "Thanks for all the support and prayers," a comment on a Facebook page in her name read. "I know this entire journey is much bigger than all of us. I also know I'm just the messenger." CNN cannot verify the authenticity of the page. But the power that multiplied Broussard's gift was data processing of genetic profiles from donor-recipient pairs. It works on a simple swapping principle but takes it to a much higher level, according to California Pacific Medical Center 

We load the **finetuned GPT2 model with a value head and the tokenizer**. We load the model twice; the first model is the one that will be optimized while the second model serves as a reference to calculate the KL-divergence from the starting point.

In [19]:
model = AutoModelForCausalLMWithValueHead.from_pretrained('gavin124/gpt2-finetuned-cnn-summarization-v2')  # the policy model to be optimized/aligned with human preferences
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained('gavin124/gpt2-finetuned-cnn-summarization-v2') # the original SFT model as a baseline to compare deviation
tokenizer = AutoTokenizer.from_pretrained('gavin124/gpt2-finetuned-cnn-summarization-v2') # tokenizer

tokenizer.pad_token = tokenizer.eos_token
# note --- these are not the reward model, so why do they have the value head? Is it because it is needed by PPO internally?

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
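
The role of the reference model can be sketched with plain numbers. The snippet below assumes the common formulation in which each generated token is penalized by `beta * (log pi(token) - log pi_ref(token))` and the task score is added on the last token; the function name, the toy log-probabilities and `beta = 0.2` are assumptions for illustration, and the internals of `trl` may differ in detail.

```python
def kl_penalized_rewards(logprobs, ref_logprobs, score, beta):
    """Per-token rewards: -beta * (log pi - log pi_ref), plus the task score on the last token."""
    kl_per_token = [lp - ref_lp for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards = [-beta * kl for kl in kl_per_token]
    rewards[-1] += score  # the sequence-level score is credited to the final token
    return kl_per_token, rewards

# toy log-probabilities of three generated tokens (made-up numbers)
policy_logprobs = [-1.0, -2.0, -0.5]
identical_ref = [-1.0, -2.0, -0.5]   # reference model agrees with the policy
shifted_ref = [-1.5, -2.0, -1.0]     # policy has drifted away from the reference

kl_same, rewards_same = kl_penalized_rewards(policy_logprobs, identical_ref, score=0.25, beta=0.2)
kl_diff, rewards_diff = kl_penalized_rewards(policy_logprobs, shifted_ref, score=0.25, beta=0.2)

print(sum(kl_same), rewards_same)  # no drift: zero KL, only the task score remains
print(sum(kl_diff), rewards_diff)  # drift: every deviating token is penalized
```

As long as the policy and the reference model assign identical log-probabilities, the KL term is zero and the reward is just the task score.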


In [20]:
model # notice the v_head in model

AutoModelForCausalLMWithValueHead(
  (pretrained_model): GPT2LMHeadModel(
    (transformer): GPT2Model(
      (wte): Embedding(50260, 768)
      (wpe): Embedding(1024, 768)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0-11): 12 x GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D()
            (c_proj): Conv1D()
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): GPT2MLP(
            (c_fc): Conv1D()
            (c_proj): Conv1D()
            (act): NewGELUActivation()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_features=768, out_features=50260, bias=False)
  )
  (

*AutoModelForCausalLMWithValueHead* is a model class provided by `trl` that is used for **training models with RL with a *baseline*.** The baseline is used as shown, e.g., on slides 76-78 of lecture 05. Specifically, the baseline is learned simultaneously during training and learns to predict the so-called action value, namely **the expected reward for generating a particular completion**, given the query. This baseline is implemented as an additional (scalar-output) head next to the next-token prediction head of the policy, and is called the value head. Based on the query and completion representation, it learns to predict a scalar reward, which is compared to the ground truth reward from the reward model.
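
A deliberately simplified sketch of how such a baseline enters training: the policy update is scaled by the advantage (received reward minus predicted value), while the value head is trained with a squared error towards the received reward. The function names and numbers are illustrative assumptions; PPO implementations such as the one in `trl` typically use more elaborate per-token advantage estimates.

```python
def advantage(reward, predicted_value):
    """How much better (or worse) the completion was than the baseline expected."""
    return reward - predicted_value

def value_loss(reward, predicted_value):
    """Squared error that pulls the value head's prediction towards the received reward."""
    return (predicted_value - reward) ** 2

reward = 0.30            # toy reward for one completion, e.g. a ROUGE-1 score
predicted_value = 0.20   # toy prediction of the value head for the same completion

adv = advantage(reward, predicted_value)
loss = value_loss(reward, predicted_value)
print(f"advantage = {adv:.2f}, value loss = {loss:.4f}")
```

A positive advantage means the completion did better than expected, so its probability is increased; a negative one decreases it.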

The PPOTrainer takes care of device placement and optimization later on:

In [21]:
ppo_trainer = PPOTrainer(config,
                         model,
                         ref_model,
                         tokenizer,
                         dataset=dataset,
                         data_collator=collator,
                         )

In [22]:
# inspect a batch
batch = next(iter(ppo_trainer.dataloader))

In [23]:
batch.keys()

dict_keys(['article', 'highlights', 'id', 'input_ids', 'query'])

In [24]:
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
print("Device: ", device)

Device:  0


In [25]:
ppo_trainer.accelerator.device

device(type='cuda')

In [26]:
rouge = evaluate.load("rouge")

def reward_fn(
        output: list[str],
        original_summary: list[str]
    ):
    """
    ####
    reward signal to compare model generated summaries against ground truth summaries.
    here we are not using a separately trained reward model to score the generation,
    instead we are using rouge metric to give the reward score (higher the better).
    ####
    """
    scores = []
    for o, s in zip(output, original_summary):
        score = rouge.compute(predictions=[o.strip()], references=[s])["rouge1"]
        scores.append(torch.tensor(score))

    return scores

In [27]:
batch['query'][0]

"Police are hunting for a 'dangerous' convicted killer who absconded from the accommodation where he was living while on probation. William Kerr, 53, was jailed for life in 1998 for the murder of 43-year-old Maureen Comfort, whose body was found in a cupboard in her flat in Leeds. He was released on licence from HMP Stocken, Rutland, on January 23 this year and moved to approved accommodation in Hull. On the run: Convicted killer William Kerr, 53, pictured, was jailed for life in 1998 for the murder of 43-year-old Maureen Comfort, whose body was found in a cupboard in her flat in Leeds. Kerr now needs to be arrested and returned to prison as 'a matter of urgency', police said, after he breached his licence conditions by absconding from his residence. Members of public are being asked not to approach Kerr if they see them, and instead to call the police immediately on 999. Kerr was jailed along with co-defendant Christopher Moody. According to court papers, Ms Comfort was last seen aliv

In [28]:
batch['highlights'][0]

'William Kerr was jailed for life for the murder of Maureen Comfort in 1998 . He was released on licence in January and moved to probation hostel . The 53-year-old absconded and needs to be arrested and returned to jail .'

In [29]:
reward_fn(batch['query'], batch['highlights'])

[tensor(0.1659, dtype=torch.float64),
 tensor(0.2457, dtype=torch.float64),
 tensor(0.1290, dtype=torch.float64),
 tensor(0.1511, dtype=torch.float64)]

In [30]:
reward_fn(batch['highlights'], batch['highlights'])

[tensor(1., dtype=torch.float64),
 tensor(1., dtype=torch.float64),
 tensor(1., dtype=torch.float64),
 tensor(1., dtype=torch.float64)]
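
To see why identical texts score 1.0 while a long article scores low against its short summary, here is a simplified ROUGE-1 F1 sketch based on unigram overlap. It is an approximation under stated assumptions (lowercasing and whitespace tokenization only); the `rouge` metric loaded via `evaluate` applies its own tokenization and will not match these numbers exactly.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "gomis collapsed at tottenham"   # toy reference summary
identical = rouge1_f1(reference, reference)
partial = rouge1_f1("gomis collapsed", reference)
unrelated = rouge1_f1("unrelated words", reference)
print(identical, round(partial, 4), unrelated)
```

A long query contains many unigrams that are not in the summary, which lowers precision and therefore the F1 score, consistent with the low rewards for `batch['query']` above.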

In [31]:
# stats

In [None]:
output_max_length = 128
#### YOUR COMMENT HERE: explain what kind of decoding scheme these parameters initialize ####
#### the below kwargs initialize a pure-sampling based generation scheme from the distribution over the next word
generation_kwargs = {
    "min_length": -1,  # no limit on minimum length
    "top_k": 0.0,      # deactivates top-k sampling, so all words have a chance of being sampled acc. to their probability
    "top_p": 1.0,      # deactivates top-p sampling, entire probability distribution is to be considered
    "do_sample": True, # sampling is enabled
    "pad_token_id": tokenizer.eos_token_id,
    "max_new_tokens": output_max_length
}


# for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
for epoch, batch in enumerate(ppo_trainer.dataloader):
    query_tensors = batch["input_ids"]
    query_tensors = [q.squeeze() for q in query_tensors]

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-output_max_length:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute score with the reward_fn above
    rewards = reward_fn(output=batch['response'], original_summary=batch['highlights']) #### YOUR CODE HERE ####

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
    if epoch%10==0:
        print(f'average rewards after epoch{epoch+1}: {torch.tensor(rewards).mean().item()}')
    if epoch == 100:
        break

average rewards after epoch1: 0.156044610151753
average rewards after epoch11: 0.2908636898728733
average rewards after epoch21: 0.161980168060895
average rewards after epoch31: 0.21702499955924615
average rewards after epoch41: 0.2360020393525112
average rewards after epoch51: 0.22784885992908518
average rewards after epoch61: 0.13734126046381837
average rewards after epoch71: 0.243687707641196
average rewards after epoch81: 0.28662710898072097
average rewards after epoch91: 0.20853769767097022
average rewards after epoch101: 0.26554341698488027
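
The `generation_kwargs` used in the loop above switch off both top-k and top-p filtering. The following sketch illustrates, on a toy next-token distribution, that with `top_k=0` and `top_p=1.0` no token is removed, so sampling draws from the full distribution. The function `filter_distribution` and the four-token vocabulary are assumptions for illustration and not the `transformers` implementation.

```python
import random

def filter_distribution(probs, top_k=0, top_p=1.0):
    """Apply top-k and top-p (nucleus) filtering to a token->probability dict and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:                       # top_k == 0 means: keep all tokens
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if top_p < 1.0 and cumulative >= top_p:  # top_p == 1.0 means: keep the whole distribution
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}  # toy next-token distribution

pure = filter_distribution(probs, top_k=0, top_p=1.0)  # settings from generation_kwargs
top2 = filter_distribution(probs, top_k=2)             # for contrast: only the 2 most likely tokens
nucleus = filter_distribution(probs, top_p=0.9)        # for contrast: smallest set with mass >= 0.9

rng = random.Random(0)
sampled = rng.choices(list(pure), weights=list(pure.values()), k=1)[0]
print(sorted(pure), sorted(top2), sorted(nucleus), sampled)
```

Pure sampling keeps exploration high, which matters for RL because the policy can only be rewarded for completions it actually tries.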


In [None]:
stats

{'objective/kl': 0.0,
 'objective/kl_dist': array([0., 0., 0., 0.], dtype=float32),
 'objective/logprobs': array([[-1.7661407e+01, -6.7980914e+00, -8.7211885e+00, ...,
         -6.0022049e+00, -5.8400145e+00, -1.4361165e+00],
        [-1.8295897e+01, -1.7596505e+00, -7.3675761e+00, ...,
         -1.8162313e+00, -4.7044687e+00, -5.4516268e-01],
        [-1.8029343e+01, -1.5287596e+00, -9.0074599e-01, ...,
         -2.9677842e+00, -4.0714946e+00, -6.8337460e+00],
        [-1.6447329e+01, -3.5221531e+00, -3.9507768e+00, ...,
         -4.1675322e-02, -1.0817159e-03, -6.4407244e-02]], dtype=float32),
 'objective/ref_logprobs': array([[-1.7661407e+01, -6.7980914e+00, -8.7211885e+00, ...,
         -6.0022049e+00, -5.8400145e+00, -1.4361165e+00],
        [-1.8295897e+01, -1.7596505e+00, -7.3675761e+00, ...,
         -1.8162313e+00, -4.7044687e+00, -5.4516268e-01],
        [-1.8029343e+01, -1.5287596e+00, -9.0074599e-01, ...,
         -2.9677842e+00, -4.0714946e+00, -6.8337460e+00],
        [-1

> **QUESTIONS:**
>
> 1. What are the three main steps in the training loop? Please name them (in descriptive words, you don't need to cite the code).

- generating a response ($Y$) for a given query ($X$)
- computing the reward $r(X, Y)$ for the query-response pair using the ROUGE score
- running the PPO optimization step, which improves the policy (`model`) based on the rewards without deviating much from the `ref_model`

---

> 2. Suppose the plots below show training metrics for different runs of the summarization model training. Interpret what each of them tells us about training success; i.e., did the training go well on this run? Do we expect to get good summaries? Why? Be concise!

- **A**: looks like a non-aligned reference model, as the rewards obtained are random and cover the entire range, so the generated summaries are sometimes good and sometimes bad.

- **B**: the batch reward decreases during PPO training, so the training is probably not effective (perhaps the regularization via the KL divergence from the reference model is too strong?).

- **C**: the reward goes up with the training steps, so the model is expected to generate summaries that score higher under the reward, i.e., that are closer to the ground truth summaries.

---

> 3. We have truncated the query articles to maximally 512 tokens. Given that **we are using ROUGE with respect to ground truth summaries** as a reward, why might this be problematic?

The policy model might be tempted to continue the article rather than generate a summary (even though it is fine-tuned to produce summaries). Moreover, the ground truth summaries were written for the full articles, so they can contain information that is missing from the truncated queries. The ROUGE reward then penalizes the model for not reproducing content it never saw, so it is somewhat unfair to expect good summaries from the truncated articles.
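
The effect of truncation on the achievable reward can be illustrated with a toy example. The texts below are invented, and `reference_coverage` is a hypothetical helper that measures which fraction of the summary's unigrams occur in the article at all; it is not part of the homework code.

```python
def reference_coverage(article, summary):
    """Fraction of summary unigrams that also occur somewhere in the article text."""
    article_words = set(article.lower().split())
    summary_words = summary.lower().split()
    return sum(w in article_words for w in summary_words) / len(summary_words)

# toy article whose second sentence carries information used in the summary
full_article = (
    "the striker collapsed early in the match . "
    "years earlier another player collapsed at the same stadium"
)
truncated_article = " ".join(full_article.split()[:8])  # keep only the first 8 "tokens"
summary = "striker collapsed . another player collapsed years earlier"

full_cov = reference_coverage(full_article, summary)
truncated_cov = reference_coverage(truncated_article, summary)
print(full_cov, truncated_cov)  # 1.0 0.5
```

In this toy case half of the summary's words are simply not available in the truncated input, so a model that summarizes faithfully cannot reach the full overlap with the reference.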

---

> 4. [Bonus 2pts] The overall loss that is optimized during training with PPO consists of two components: the policy loss that is computed based on the completion log probability and the reward, and the value function loss which is computed based on the predicted and received reward for a completion. These two loss components are weighed in the total loss function with the value function coefficient (`vf_coef`). Intuitively, how does it affect training if the coefficient is set to a high value?

If `vf_coef` is set to a high value, the value function loss dominates the total loss, so the value head is forced to predict rewards as close as possible to the actual rewards, and it might overfit to the reward signal. For example, the policy might figure out ways to generate text that gets a high reward but is no longer aligned with what the human wanted as a response (reward hacking).
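
A minimal numeric sketch of how the coefficient weighs the two components, assuming the total loss is combined as `policy_loss + vf_coef * value_loss` as described in the question. The loss values and the helper `value_share` are made up for illustration.

```python
def total_loss(policy_loss, value_loss, vf_coef):
    """Weighted sum of the two PPO loss components."""
    return policy_loss + vf_coef * value_loss

def value_share(policy_loss, value_loss, vf_coef):
    """Fraction of the total loss magnitude that comes from the value function term."""
    weighted = vf_coef * value_loss
    return weighted / (abs(policy_loss) + weighted)

policy_loss, value_loss = 0.5, 0.5   # toy loss values of equal size

for vf_coef in (0.1, 1.0, 10.0):
    share = value_share(policy_loss, value_loss, vf_coef)
    print(f"vf_coef={vf_coef:>4}: total={total_loss(policy_loss, value_loss, vf_coef):.2f}, "
          f"value share={share:.0%}")
```

With a large coefficient almost all of the loss, and hence most of the gradient signal, comes from fitting the value head rather than from improving the policy.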

---

![img](data/rewards.png)

## Exercise 3: Aspects of fine-tuning (5 points)

> Please answer the following questions. Be concise!
>
> 1. When assistants are trained with RLHF, they are often optimized to be helpful and harmless. However, it has been observed that **the goals of being harmless and helpful at the same time may be at odds**. In particular, the problem of evasive behavior has been observed for models optimized for these goals. For example, [this paper](https://arxiv.org/pdf/2212.08073.pdf) mentions this problem. In your own words, please briefly describe what evasive behavior of LLMs is, give an example, and why it is a problem.


Evasive behaviour of LLMs is when an LLM refuses to respond to a seemingly innocuous query, or avoids the topic completely, providing a non-informative response to the user. A maximally helpful LLM might cause harm by generating a detailed response to inappropriate queries by a malicious user, such as "How do I steal food from my neighbour?", and hence being helpful and harmless at the same time may be at odds. However, the LLM simply saying "I cannot help you with that." is also not desired, and is an example of evasive behaviour. Ideally, we would want the LLM to respond with some text on why it is socially unacceptable to steal and to warn the user about the legal consequences of the act; hence, evasive behaviour is a problem.



My answer above is motivated by the following excerpt from the paper:
> _That is, helpfulness tends to increase harmfulness, since models are willing to obey pernicious requests, and conversely models trained to be harmless tend to be more evasive and generally less helpful. By harmfulness we include both a variety of forms of harm to the user and responses that help the user to achieve harmful aims._



---
> 2. What special tokens are commonly used for chat model fine-tuning, and what is their purpose?

To delineate the roles of the user and the assistant, special tokens are used to tell the model where the user query is located and where the assistant/chat LLM should begin its response. For example: `<|user|>` and `<|assistant|>` in the completion prompt in the RAG exercise. There are also special tokens to mark the beginning and end of user (or assistant) texts, such as `<|im_start|>` and `<|im_end|>`.

References: [TinyLlama-1.1B-Chat-v0.3](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3), [meta-llama-3](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/)
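
As a sketch of how such tokens structure a conversation, the snippet below builds a prompt in a ChatML-like layout using `<|im_start|>` and `<|im_end|>`. The helper `format_chat` and the example messages are assumptions for illustration; the exact template (tokens, newlines, system prompt) depends on the model and should be taken from its model card.

```python
def format_chat(messages, add_generation_prompt=True):
    """Wrap each message in start/end tokens and optionally open the assistant turn."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the article in one sentence."},
]

prompt = format_chat(messages)
print(prompt)
```

During fine-tuning the same markers tell the model where a turn ends, so that at inference time it stops after its own `<|im_end|>` instead of continuing with a made-up user turn.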

---
> 3. Please name two parameter-efficient fine-tuning techniques and briefly explain one advantage of using each technique over full-scale fine-tuning.

- **Selective fine-tuning**: this PEFT method _freezes_ most of the layers in the base model and only updates a few layers, e.g., the final two feedforward layers, during fine-tuning with task-specific data. This is advantageous over full fine-tuning because no gradients need to be stored for the frozen part, which reduces the (GPU) memory load.

- **Prompt tuning**: this PEFT method tries to find the optimal way of _asking_ a model to do a task by prepending a few randomly initialized vectors to the actual task prompt and updating only those _soft prompt_ vectors during fine-tuning. Prompt tuning freezes the entire base model, so it also has a low memory overhead; the trade-off is that each task requires its own soft prompt to prepend to the task prompt.
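
To get a feeling for the savings, the following back-of-the-envelope sketch counts trainable parameters for the GPT-2 model printed in Exercise 2. The embedding sizes are read off that printout; the inner layer shapes (attention projection of width `3 * 768`, feedforward width `4 * 768`), the tied `lm_head`, and the soft prompt length of 20 virtual tokens are assumptions based on the standard GPT-2 small architecture and are chosen for illustration only.

```python
HIDDEN = 768        # embedding width, from the printed model
VOCAB = 50260       # rows of `wte`, from the printed model
POSITIONS = 1024    # rows of `wpe`, from the printed model
N_BLOCKS = 12       # number of GPT2Block modules
FFN = 4 * HIDDEN    # assumption: standard GPT-2 feedforward width

def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN                                           # scale and shift
attention = linear(HIDDEN, 3 * HIDDEN) + linear(HIDDEN, HIDDEN)   # c_attn + c_proj (assumed shapes)
mlp = linear(HIDDEN, FFN) + linear(FFN, HIDDEN)                   # c_fc + c_proj
block = 2 * layer_norm + attention + mlp

# assumption: lm_head shares its weights with wte and adds no parameters
total = VOCAB * HIDDEN + POSITIONS * HIDDEN + N_BLOCKS * block + layer_norm

selective = 2 * mlp                  # only the feedforward parts of the last two blocks are trained
N_VIRTUAL_TOKENS = 20                # hypothetical soft prompt length
prompt_tuning = N_VIRTUAL_TOKENS * HIDDEN

for name, n in [("full", total), ("selective", selective), ("prompt tuning", prompt_tuning)]:
    print(f"{name:>14}: {n:>11,} trainable parameters ({100 * n / total:.4f}% of the model)")
```

Under these assumptions both techniques train only a small fraction of the model, and prompt tuning trains by far the fewest parameters.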

---