## CA 3, LLMs Spring 2024

- **Name: Sina Tabassi**
- **Student ID: 810199554**

---
### This is due on **May 11th, 2024**, submitted via [elearn](https://elearn.ut.ac.ir/).
#### Your submission should be named using the following format: `CA3_LASTNAME_STUDENTID.ipynb`.

---

##### *How to do this problem set:*

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks that say `WRITE YOUR CODE HERE`.

- For text-based answers, you should replace the text that says "Write your answer here..." with your actual answer.

- There is no penalty for using AI assistance on this homework as long as you fully disclose it in the final cell of this notebook (this includes storing any prompts that you feed to large language models). That said, anyone caught using AI assistance without proper disclosure will receive a zero on the assignment (we have several automatic tools to detect such cases). We're literally allowing you to use it with no limitations, so there is no reason to lie!

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your notebook. If you turn in correct answers on your notebook without code that actually generates those answers, we will consider this a serious case of cheating.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---

Colab Notebook Link: https://colab.research.google.com/drive/1BMSD2ehI5Yd1f4KIaO5vI-Vrx3x4nGQT?usp=sharing

# Chain-of-Thought (CoT) (20 points)

If you have any further questions or concerns, contact the TA via email: mehdimohajeri@ut.ac.ir

LLMs have demonstrated good reasoning abilities. Furthermore, their capabilities can be further improved by incorporating reasoning techniques. One of the most notable developments in this area is the [Chain-of-Thought (CoT)](https://arxiv.org/abs/2201.11903), which was introduced by Google. This approach has shown promising results in improving the reasoning capabilities of language models across a variety of tasks. Can you explain what CoT is and how it works? (2.5 Points)

*Answer:*

The CoT refers to the interconnected series of ideas, concepts, and words that the model generates or processes to understand and respond to a given prompt or input. It's essentially the flow of reasoning or logic that the model follows to produce coherent outputs.

The CoT, as we can read in the `Chain-of-Thought (CoT)` paper, has 4 general steps:
- First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.

- Second, a chain of thought provides an interpretable window into the behavior of the model,suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (althoughfully characterizing a model's computations that support an answer remains an open question).

- Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language.

- Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences into the exemplars of few-shot prompting.


In this section, you should use the CoT technique. firstly you need to load the [Phi-2 model](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/). This model has been introduced by Microsoft as a small LLM

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Use Phi-2 to answer the questions below with and without CoT. Compare results and explain their difference. (4 Points)

In [None]:
def generate_output(model, input, max_length=500,temp=False ):

  input = f"Question: {input}\nOutput:"
  input = tokenizer(input, return_tensors="pt", return_attention_mask=True)
  tokenizer.pad_token = tokenizer.eos_token

  if temp:
    outputs = model.generate(**input, max_length=max_length, temperature=temp, do_sample=True)
  else:
    outputs = model.generate(**input, max_length=max_length)
  text = tokenizer.batch_decode(outputs)[0]

  return text

In [None]:
questions = ["Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
"Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?",
"John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?",
"There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?",
"Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?"
]

## Without CoT

In [None]:
without_cot_results = []
for question in questions:
  without_cot_results.append(generate_output(model, question))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## With CoT

In [None]:
with_cot_results = []
cot_steps = []
cot_prompt = "{question}\nlet's think step by step."
conc_prompt = "{question}\nContext: {cot}\nNow, what is the final result?"

for question in questions:
  cot = generate_output(model, cot_prompt.format(question=question))
  cot_steps.append(cot)
  result = generate_output(model, conc_prompt.format(question=question, cot=cot))
  with_cot_results.append(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## Compare Results

In [None]:
print("Results without context of thought:\n")
for i, result in enumerate(without_cot_results):
    print(f"Question {i+1}:\n {result}")
    print("#########################################################")

Results without context of thought:

Question 1:
 Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Output: Weng earned $9 for babysitting.
<|endoftext|>
#########################################################
Question 2:
 Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?
Output: To find the amount of salt in the seawater, we need to multiply the volume of water by the percentage of salt. 

2 liters x 20% = 0.4 liters

To convert liters to milliliters, we need to multiply by 1000.

0.4 liters x 1000 = 400 ml

Therefore, Jack will get 400 ml of salt when all the water evaporates.
<|endoftext|>
#########################################################
Question 3:
 Question: John volunteers at a shelter twice a month for 3 hours at a time.

In [None]:
print("\nResults with context of thought:\n")
for i, result in enumerate(with_cot_results):
    print(f"Question {i+1}:\n {result}")
    print("#########################################################")


Results with context of thought:

Question 1:
 Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Context: Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
let's think step by step.
Output: Let's convert 50 minutes to hours. Since there are 60 minutes in an hour, 50 minutes is equal to 50/60 = 5/6 hours.
To find out how much Weng earned, we can multiply her hourly rate of $12 by the number of hours she babysat, which is 5/6.
So, Weng earned 12 * (5/6) = $10.
Therefore, Weng earned $10.
<|endoftext|>
Now, what is the final result?
Output:
The final result is:
```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
<|endoftext|>
#########################################################
Question 2:
 Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt

The following chart is the result of each approach:

| Question | Without CoT  | With CoT
|----------|--------------|---------|
| 1        | Wrong        | Correct
| 2        | Correct      | Correct
| 3        | Wrong        | Wrong
| 4        | Wrong        | Correct
| 5        | Correct      | Correct

In the chart depicted above, it is evident that the approach utilizing CoT yields superior results. Specifically, the approach incorporating CoT achieved correct answers for questions 1 and 4, surpassing the approach lacking CoT.

Consequently, we deduce that the CoT-enhanced approach demonstrates higher accuracy compared to its counterpart without CoT. Therefore, it is recommended to employ the CoT approach in reasoning tasks.

## Other Methods for Reasoning

There are many other approaches to utilize the reasoning abilities of LLMs. Describe the [Tree-of-Thought (ToT)](https://arxiv.org/abs/2305.10601) and [Self-Consistency](https://arxiv.org/abs/2203.11171) within these approaches. (3.5 Points)

*Answer:*


**The Tree-of-Thought:** ToT methodology in reasoning tasks involves systematically branching out from initial prompts or questions, akin to the growth of a tree, to explore various lines of thought and potential conclusions. This approach encompasses a structured framework where each step in the thought process leads to further elaboration or refinement, allowing for comprehensive examination and analysis of the underlying concepts or problems. By delineating the progression of thoughts into distinct branches, the ToT method facilitates a more organized and thorough exploration of reasoning pathways, ultimately aiding in the generation of well-founded conclusions or solutions.


**Self-Consistency:**denotes the coherence and logical integrity maintained throughout the sequence of connected ideas or steps. Within this approach, each subsequent thought or conclusion is built upon the foundation laid by preceding ones, ensuring a seamless progression of reasoning. By adhering to principles of internal consistency, where each step aligns with and supports the overarching line of reasoning, the CoT method fosters robust and dependable chains of thought. This commitment to self-consistency enhances the reliability and persuasiveness of the argumentation or problem-solving process, contributing to more cogent and defensible outcomes in reasoning tasks. Moreover, CoT-SC involves sampling independent chains of thought and then selecting the most frequent output, further reinforcing the coherence and reliability of the reasoning process by integrating diverse perspectives and ensuring consensus in the final output.

Now, implement Self-Consistency to answer the questions of the previous section. (6 Points)

In [None]:
import re
from collections import Counter

def get_last_digits(s):
  return re.findall(r"[-+]?(?:\d*\.*\d+)", s)[-1]

def most_common(lst):
    return Counter(lst).most_common(1)[0][0]

number_of_samples = 4

for question_num, question in enumerate(questions, start=1):
  outputs = []
  for _ in range(number_of_samples):
    output = generate_output(model, cot_prompt.format(question=question), temp=0.7)
    outputs.append(output)

  votes = []
  for s in outputs:
      last_digits = get_last_digits(s)
      votes.append(last_digits)

  most_voted = most_common(votes)


  print("Question", question_num, ":\n", question)
  print()
  print("Self-consistency Result:\n", most_voted)
  print()
  for index, output in enumerate(outputs, start=1):
      print(f"Self-consistency path {index}:\n {output}")
      print()
  print("#############################################################")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question 1 :
 Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?

Self-consistency Result:
 10

Self-consistency path 1:
 Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
let's think step by step.
Output: Let's convert 50 minutes to hours. Since there are 60 minutes in an hour, 50 minutes is equal to 50/60 = 5/6 hours.
Now, we can calculate how much Weng earned by multiplying her hourly rate by the number of hours she babysat.
Weng earned $12/hour * 5/6 hours = $10.
Therefore, Weng earned $10 for babysitting.
<|endoftext|>

Self-consistency path 2:
 Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
let's think step by step.
Output: One hour is equivalent to 60 minutes, so 50 minutes is equivalent to 50/60 = 5/6 hours.
To find out how much Weng earned, we can multiply the number

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question 2 :
 Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?

Self-consistency Result:
 400

Self-consistency path 1:
 Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?
let's think step by step.
Output: 
Step 1: Calculate the amount of salt in 2 liters of seawater.
Given that the seawater is 20% salt, we can calculate the amount of salt as follows:
Amount of salt = 2 liters * 0.20
Amount of salt = 0.4 liters

Step 2: Convert liters to milliliters.
Since 1 liter is equal to 1000 milliliters, we can convert 0.4 liters to milliliters as follows:
Amount of salt = 0.4 liters * 1000 milliliters/liter
Amount of salt = 400 milliliters

Therefore, Jack

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question 3 :
 John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?

Self-consistency Result:
 72

Self-consistency path 1:
 Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?
let's think step by step.
Output: John volunteers at a shelter twice a month for 3 hours at a time. 
To calculate the total hours he volunteers per year, we can multiply the number of hours per visit (3 hours) by the number of visits per month (2 visits) and then multiply that by the number of months in a year (12 months).
(3 hours/visit) x (2 visits/month) x (12 months/year) = 72 hours/year.
Therefore, John volunteers for a total of 72 hours per year.
<|endoftext|>

Self-consistency path 2:
 Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?
let's think step by step.
Output: Let's break down the information given in the ques

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question 4 :
 There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?

Self-consistency Result:
 77

Self-consistency path 1:
 Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?
let's think step by step.
Output: Let's start by finding the total number of chairs in the first two types of tables:

Half the tables have 2 chairs each, so there are 32 / 2 = 16 tables with 2 chairs each.

These tables have a total of 16 * 2 = 32 chairs.

The other 5 tables have 3 chairs each, so there are 5 * 3 = 15 tables with 3 chairs each.

These tables have a total of 15 * 3 = 45 chairs.

To find the total number of chairs in the hall, we add up the number of chairs from each type of table:

Total number of chairs = number of chairs from tables with 2 chairs + number of chairs fr

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question 5 :
 Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?

Self-consistency Result:
 5470

Self-consistency path 1:
 Question: Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?
let's think step by step.
Output: Let x be the number of words in each crossword puzzle.
The number of pencils Bert uses in a week is 7/2 = 3.5 pencils.
The number of pencils Bert uses in a year is 3.5 * 52 = 182 pencils.
The number of words Bert writes in a year is 182 * 1050 = 190050 words.
The number of words in each crossword puzzle is 190050/365 ≈ 5470 words.
<|endoftext|>

Self-consistency path 2:
 Question: Bert fills ou

Consider LLMs' features and propose a new approach based on them to enhance LLMs' reasoning abilities. Why do you believe this approach could enhance LLMs' reasoning abilities? (4 Points)

*Answer:*

We propose a better approach in language model (LLM) that incorporates several advanced features aimed at enhancing its reasoning capabilities. These include the integration of Chain of Thoughts (CoT) for facilitating coherent reasoning processes, the implementation of Self Consistency to ensure logical coherence in generated outputs, and the adoption of a zero-temperature setting to maximize confidence in predictions. Additionally, the model introduces a role-playing mechanism to enhance task diversity, allowing it to adapt its reasoning strategies based on contextual cues provided by designated roles. For instance, when tasked with generating mathematics questions, the model can be guided by a designated role, such as a mathematics teacher, to tailor its responses accordingly. This comprehensive approach empowers the model to reason more effectively across various tasks and contexts.

the main features of this models are as follows:

- CoT enables the model to maintain a coherent flow of reasoning across multiple steps or iterations. Instead of generating responses in isolation, the model can generate a sequence of interconnected thoughts or actions, mimicking human-like reasoning processes. This allows the model to reason through complex scenarios more effectively and generate more coherent and contextually relevant outputs.

- Self Consistency ensures that the model's generated outputs are internally consistent and logically coherent. By enforcing self-consistency constraints during generation, the model can avoid generating contradictory or nonsensical responses. This improves the overall quality and reliability of the model's outputs, enhancing its reasoning abilities.

- Setting the temperature of the model to zero essentially means maximizing the confidence of its predictions. Instead of generating probabilistic outputs, the model produces deterministic outputs with high confidence levels. This enables the model to make more confident and assertive predictions, which is particularly useful in tasks where precision and accuracy are crucial.

- Role-playing provides the model with contextual cues or personas to adapt its behavior and reasoning strategies based on the specific task at hand. For example, if the model is tasked with generating mathematics questions, providing it with a "mathematics teacher" role-play prompt informs the model about the context and expectations of the task. This helps the model tailor its responses and reasoning approaches to suit the given role or scenario.

Advantages of this model and how it enhances LLM's reasoning abilities:

- **Enhanced Coherence:** By incorporating CoT and self-consistency, the model can produce more coherent and logically consistent outputs, improving its overall reasoning abilities.

- **Increased Confidence:** Zero temperature ensures that the model's predictions are made with high confidence levels, reducing uncertainty and improving the reliability of its reasoning.

- **Task Adaptability:** Role-playing enables the model to adapt its reasoning strategies based on the specific task or context, allowing for greater flexibility and versatility in its responses.

- **Improved Generalization:** The combination of these features enables the model to generalize its reasoning abilities across a wide range of tasks and scenarios, making it more robust and capable of handling diverse challenges.



# PEFT (30 + 5 points)

If you have any further questions or concerns, contact the TA via email: pedram.rostami@ut.ac.ir

## Why We Are Using PEFT (5 points)

In this question, we're delving into PEFT. First, let's start by exploring why PEFT is crucial when training LLMs. For instance, let's consider the scenario where we want to train the [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) model. To get started, take a look at the Huggingface blog post on [model memory anatomy](https://huggingface.co/docs/transformers/en/model_memory_anatomy) to estimate how much memory we'll require. Just assume we're sticking to pure fp16 with Adam optimizer and a batch size of 1. (4 points)

*Answer:*

Using fp32 we have following results:
- **Gradients:** 10.8 GB
- **Optimizer States:**  21.6 GB (Using normal AdamW)
- **Model Weights:** 10.8 GB
- **Total:** 10.8 + 21.6 + 10.8 = 43.2 GB

Now by using fp16, we have following results:
-  **Total:** 5.4 + 10.8 + 10.8 = 27 GB


Compare your estimation with the memory estimation provided by the [Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage). (1 point)

*Answer:*

Using fp32 we have following results:
- **Gradients:** 9.9 GB
- **Model Weights:** 9.9 GB
- **Training Using Adam:**  40 GB (Using normal AdamW)

Now by using fp16, we have following results:
- **Gradients:** 14.7 GB
- **Model Weights:** 9.9 GB
- **Training Using Adam:**  20 GB (Using normal AdamW)

## Preparing Dataset (5 points)

We're going to train the phi-2 model for a question generation task based on passages. For this purpose, we're using the Super-NaturalInstruction dataset, which comprises instruction tuning datasets for over 1600 tasks across different languages. While the dataset is available on the [Huggingface Hub](https://huggingface.co/datasets/Muennighoff/natural-instructions), downloading all its components consumes considerable time. Consequently, we're opting to download only the English Question Generation segment.

In [1]:
!wget https://huggingface.co/datasets/Muennighoff/natural-instructions/resolve/main/train/task001_quoref_question_generation_train.jsonl
!pip install datasets

--2024-05-15 15:36:00--  https://huggingface.co/datasets/Muennighoff/natural-instructions/resolve/main/train/task001_quoref_question_generation_train.jsonl
Resolving huggingface.co (huggingface.co)... 18.172.134.24, 18.172.134.88, 18.172.134.4, ...
Connecting to huggingface.co (huggingface.co)|18.172.134.24|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/a1/fe/a1fedd93d2c00f67a096c36747356c03b6f01649bae4b4be932e6531a496022a/89ad3018bdb2cec45afea661fbe2fc8df9593243f58531d381c19b5fb13ce581?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27task001_quoref_question_generation_train.jsonl%3B+filename%3D%22task001_quoref_question_generation_train.jsonl%22%3B&Expires=1716046560&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxNjA0NjU2MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9hMS9mZS9hMWZlZGQ5M2QyYzAwZjY3YTA5NmMzNjc0NzM1NmMwM2I2ZjAxNjQ5YmFlNG

Read the dataset file and convert it into a `dataset` object. Then, split the dataset, selecting 95% for the training set and 5% for the test set. (5 points)

In [2]:
def generate_output(model, input, max_length=500,temp=False):

  input = f"Question: {input}\nOutput:"
  input = tokenizer(input, return_tensors="pt", return_attention_mask=True)
  tokenizer.pad_token = tokenizer.eos_token

  if temp:
    outputs = model.generate(**input, max_length=max_length, temperature=temp, do_sample=True)
  else:
    outputs = model.generate(**input, max_length=max_length)
  text = tokenizer.batch_decode(outputs)[0]

  return text


def apply_alpaca_template(instruction, input=None):
  if input is not None:
    return f"""\
### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

In [3]:
from datasets import load_dataset

dataset = load_dataset('json', data_files='task001_quoref_question_generation_train.jsonl')

print(dataset)

train_dataset = dataset['train'].train_test_split(test_size=0.05, seed=42)['train']
test_dataset = dataset['train'].train_test_split(test_size=0.05, seed=42)['test']

print("Training set size:", len(train_dataset))
print("Test set size:", len(test_dataset))

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['task_name', 'id', 'definition', 'inputs', 'targets'],
        num_rows: 21817
    })
})
Training set size: 20726
Test set size: 1091


## Pretrained Model (5 points)

Choose random samples from the test set, apply the [Alpaca template](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) to them, and obtain the model outputs (If you are using the [sample code](https://huggingface.co/microsoft/phi-2#sample-code) provided by Microsoft for using the model, please comment out the `torch.set_default_device("cuda")` line to conserve memory. Instead, you can move the model to the GPU using the `.to` function after loading it.). (5 points)

In [None]:
import torch
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
number_of_samples = 5

test_samples = random.sample(list(test_dataset), number_of_samples)

for i, sample in enumerate(test_samples, start=1):
  instructions = sample["definition"]
  input = sample["inputs"]
  prompt = apply_alpaca_template(instructions, input)
  output = generate_output(model, prompt, max_length=1000)

  print(f"Sample {i}:\n")
  print("Generated Output:", output)
  print("#########################################################\n")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 1:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: Martin Luther King Jr. was held in the Birmingham jail and was denied a consultation with an attorney from the NAACP without guards present. When historian Jonathan Bass wrote of the incident in 2001, he noted that news 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 2:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: The remaining vampire covens are on the verge of annihilation by the Lycans. Both species are searching for Selene: the vampires seek justice for the death of Viktor, while the Lycans, led by Marius, intend to use her to

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 3:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: When Stephen Herrick, a sedate, mild-mannered shipping magnate, loses his opera tickets, Mrs. Grange, the aggressive mother of his fiancée Cecilia, insists upon being seated in the Herrick box anyway. Upon finding the Du

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 4:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: In 1792 during the Reign of Terror of the French Revolution, a secret league of brave Englishmen are rescuing French aristocrats from the guillotine. The leader of this secret society is a mysterious English nobleman kno

## Fine-tuning with LoRA (15 + 5 points)

In this phase, we're fine-tuning the phi-2 model on a question generation dataset. To begin, we need to format our dataset into the instruction tuning format. For this task, we can employ `DataCollatorForCompletionOnlyLM`. Look at the [example](https://huggingface.co/docs/trl/en/sft_trainer#train-on-completions-only) in the HuggingFace documentation and instantiate the data collator using the Alpaca template. (3 points)

In [4]:
!pip install -qU trl peft accelerate

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m245.2/245.2 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m14.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m25.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m102.4/102.4 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [5]:
import torch
from trl import DataCollatorForCompletionOnlyLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
tokenizer.padding_side = "right"

def formatting_prompts_func(sample):
    output_texts = []
    for i in range(len(sample['definition'])):
        text = f"""\
### Instruction:
{sample["definition"][i]}

### Input:
{sample["inputs"][i]}

### Response:
{sample["targets"][i]}
"""
        output_texts.append(text)
    return output_texts


response_template = "### Response:"

collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/7.34k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Refer to the HuggingFace [documentation](https://huggingface.co/docs/trl/en/sft_trainer#training-adapters) and instantiate the Lora config. (3 points)

In [28]:
#!pip install -i https://pypi.org/simple/ bitsandbytes
#!pip install -U accelerate
#!pip install -U transformers

Looking in indexes: https://pypi.org/simple/
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.43.1


In [6]:
from peft import LoraConfig, PeftConfig

peft_config = LoraConfig(
    r=4,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

train_dataset_samples = train_dataset.train_test_split(train_size=0.1, shuffle=True, seed=42)["train"]
print("Splited Samples:", len(train_dataset_samples))

Configure other training arguments. [Here](https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/trainer#transformers.TrainingArguments) is a list of available options. Consider using a small batch size to prevent CUDA out of memory errors. You can augment batch size artificially through gradient accumulation. Enabling gradient checkpointing can further save memory. You may train the model for tens of steps. (3 points)

In [7]:
from transformers import TrainingArguments, BitsAndBytesConfig

training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=1,
    per_device_train_batch_size=5,
    save_total_limit=0,
    report_to="none",
    auto_find_batch_size=True,
)

Take a look at the HuggingFace [documentation](https://huggingface.co/docs/trl/en/sft_trainer) on supervised fine-tuning trainers. Instantiate the trainer and train the model ( Note that you should initialize the phi-2 model with `bfloat16` or `float16` dtype to avoid encountering Cuda out of memory errors.). (3 points)

In [8]:
from transformers import AutoModelForCausalLM

model = (AutoModelForCausalLM.from_pretrained("microsoft/phi-2",torch_dtype=torch.float16,trust_remote_code=True).to("cuda" if torch.cuda.is_available() else "cpu"))



config.json:   0%|          | 0.00/735 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/35.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/564M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Get the final model from the trainer and merge the Lora weights with it. Then, test the model with the inputs you gave to the pretrained model and compare the results. (3 points)

In [13]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_samples,
    formatting_func=formatting_prompts_func,
    peft_config=peft_config,
    tokenizer= tokenizer,
    max_seq_length=None,
    args=training_args
)



Map:   0%|          | 0/2072 [00:00<?, ? examples/s]

In [14]:
trainer.train()

Step,Training Loss


Step,Training Loss
500,0.0
1000,0.0
1500,0.0
2000,0.0




TrainOutput(global_step=2072, training_loss=0.0, metrics={'train_runtime': 1097.4418, 'train_samples_per_second': 1.888, 'train_steps_per_second': 1.888, 'total_flos': 2.057511529930752e+16, 'train_loss': 0.0, 'epoch': 1.0})

In [15]:
trainer.model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiSdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=4, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(in_features

In [27]:
import random

torch.set_default_device("cuda")

model = trainer.model.merge_and_unload()

test_samples = random.sample(list(test_dataset), 5)

for i, sample in enumerate(test_samples, start=1):
  instructions = sample["definition"]
  input = sample["inputs"]
  prompt = apply_alpaca_template(instructions, input)
  output = generate_output(model, prompt, max_length=1000)

  print(f"Sample {i}:\n")
  print("Generated Output:", output)
  print("#########################################################\n")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 1:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: There is considerable uncertainty about the identities of the Grand Dukes of Lithuania between Traidenis' death in 1282 and Vytenis' assumption of power in 1295. This is in part because the two main sources for Lithuania

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 2:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: Pierre Benjamin Monteux (pronounced [pjɛʁ mɔ̃.tø]; 4 April 1875 – 1 July 1964) was a French (later American) conductor. After violin and viola studies, and a decade as an orchestral player and occasional conductor, he be

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 3:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: Piper reported that as he was leaving Exchange Buildings to return to Houndsditch he saw a man acting suspiciously in the shadows of the cul-de-sac. As the policeman approached him, the man walked away; Piper later descr

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample 4:

Generated Output: Question: ### Instruction:
In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:
Passage: The Annunciation draws heavily on van der Weyden's 1430s Louvre Annunciation, his c. 1455  Saint Columba altarpiece, and the Clugny Annunciation (c. 1465–75), which is attributed to either van der Weyden or Memling. Meml

*Answer:*

Sorry but it didn't train enough :) . It would get many hours to train on this train dataset. So we get divided some samples for training in lower time.

We know that fine-tuning LLMs on Colab or Kaggle notebooks can be a bit tricky, and fine-tuning phi-2 for this task may require more GPU hours. The main point of this question is to teach you how to train your model using HuggingFace packages. So, it's okay if your model doesn't produce optimal results. However, there are 5 additional points available if it can generate better results :)

# RAG (50 points)

If you have any further questions or concerns, contact the TA via email: alisalemi@ut.ac.ir

## Install Requirements

In [1]:
%pip install -q langchain
%pip install -q ctransformers
%pip install -q sentence_transformers
%pip install -q datasets
%pip install -q rank_bm25
%pip install -q faiss-gpu
%pip install -q arxiv
%pip install -q pymupdf

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m41.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.9/302.9 kB[0m [31m29.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m121.0/121.0 kB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.3/49.3 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.0/53.0 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m142.5/142.5 kB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.9/9.9 MB[0m [31m27.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━

## 1. An Overview of LangChain (10 pt)

LangChain is an open-source framework designed to simplify the creation of applications using LLMs. It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

In this overview, we will provide a step-by-step guide on how to construct a basic application using LangChain. This application will fetch country-related information from a Large Language Model. For this purpose, we will be utilizing the LLaMa 2 chat 7B as our base model.

In [2]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50
  }
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

llama-2-7b-chat.Q8_0.gguf:   0%|          | 0.00/7.16G [00:00<?, ?B/s]

### 1.1 GGUF Format (3 pt)

Write a brief paragraph discussing the GGUF format and its benefits. Compare it with transformers library.

*Answer:*


The GGUF format in LLM is a novel approach designed to enhance the interoperability and efficiency of generative language models. Unlike the Transformers library, which primarily focuses on providing pre-trained models and tools for natural language processing tasks, GGUF serves as a standardized format for input and output across various language generation tasks. By standardizing the input format, GGUF simplifies the integration of different models and enables smoother transitions between tasks, facilitating seamless model chaining and ensemble methods. Additionally, GGUF promotes model interpretability and reproducibility by establishing clear guidelines for input data formatting and output generation, thereby fostering collaboration and innovation within the research community.

### 1.2 Simple Chain (2 pt)

Complete the next cell to create a simple chain that takes the name of a country as input and outputs its capital. To accomplish this, you should utilize the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes to formulate an effective prompt.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
  HumanMessagePromptTemplate.from_template("What is the capital of {country}?"),
  AIMessagePromptTemplate.from_template("")
])

output_parser = StrOutputParser()

simple_chain = prompt | model | output_parser

answer = simple_chain.invoke({"country": "Iran"})

print(answer)


 The capital of Iran is Tehran. Located in the north of the country, it has a population of over 8 million people and is home to many cultural and historical landmarks, including the Golestan Palace, the National Museum of Iran, and the Azadi Tower.


Write about the objectives behind the creation of `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes. What they actually do? Write a brief description.

*Answer:*

- **HumanMessagePromptTemplate:** This class is designed to encapsulate the structure of messages or prompts that a human would typically provide as input to a language model. It allows for the creation of templates that mimic the natural language queries or instructions a human might give when interacting with the model. By capturing human-like prompts, this class helps ensure that the model's responses are coherent and relevant to the user's input. It enhances the usability and user experience of the language model by enabling more intuitive interactions.

- **AIMessagePromptTemplate:** In contrast, the AIMessagePromptTemplate class is tailored to represent the expected responses or messages generated by the language model itself. It allows for the specification of templates that outline the structure or format of the model's output. This class is essential for guiding the model's behavior and ensuring that its responses align with the desired format or style. By providing templates for expected model output, it helps maintain consistency and coherence in the generated text, enhancing the model's usability and reliability for specific applications.

What is the purpose of adding an empty `AIMessagePromptTemplate` at the end of prompt? What is the consequences of omitting it?

*Answer:*

What is the purpose of adding an empty `AIMessagePromptTemplate` at the end of prompt?

- Adding an empty AIMessagePromptTemplate at the end of the prompt serves as a placeholder for the model's response. Its purpose is to guide the model by indicating the expected format or structure of its output. By including this template, you provide a clear signal to the model regarding the context and style of the response it should generate. This helps improve the coherence and relevance of the model's output, aligning it more closely with the user's expectations.

What is the consequences of omitting it?
- **Uncertain Expectations:** The model may struggle to understand the desired format or structure of its output. This ambiguity can lead to inconsistent or irrelevant responses, reducing the usability and reliability of the model.

- **Unpredictable Output:** makes it challenging for users to anticipate the model's behavior. The model might generate responses that deviate significantly from what the user intended or expects, resulting in a less satisfactory user experience.

- **Difficulty in Interpretation:** The absence of a clear indication of the expected model output makes it harder for developers and users to interpret the generated responses. This can hinder efforts to analyze and evaluate the model's performance accurately.

### 1.3 JSON Chain (5 pt)

Now we want to improve the chain to extract data from the model response. Modify the existing prompt to request information about a country's name, population, and major cities in addition to the capital. Additionally, incorporate a `SystemMessagePromptTemplate` to ensure the model's response is structured in JSON format. Keep in mind that a distinct parser is required to parse the JSON output.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser

resp_template = '{{"country": "{country}", "capital": "", "population": "", "cities": []}}'
prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template(f"You are a helpful assistant and you only can answers in JSON format. Please response in the format shown below, and do not generate any more tokens: \n{resp_template}"),
    AIMessagePromptTemplate.from_template("what is the capital, population and major cities of {country}?"),
    SystemMessagePromptTemplate.from_template("")
])

output_parser = JsonOutputParser()

json_chain = prompt | model | output_parser


answers = json_chain.batch([
  {"country": "Iran"},
  {"country": "USA"},
  {"country": "Japan"},
  {"country": "Nigeria"}
])


for ans in answers:
    print(f"{ans['country']}:")
    print(f"  capital: {ans['capital']}")
    print(f"  population: {ans['population']}")
    print(f"  important cities: {ans['cities']}")


Iran:
  capital: Tehran
  population: 83152469 (estimated in 2020)
  important cities: [{'name': 'Tehran', 'population': '153772000'}, {'name': 'Mashhad', 'population': '28321128'}, {'name': 'Isfahan', 'population': '24693745'}]
USA:
  capital: Washington D.C.
  population: 331057862
  important cities: ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia']
Japan:
  capital: Tokyo
  population: 127450000
  important cities: [{'name': 'Tokyo', 'population': '38129000'}, {'name': 'Osaka', 'population': '21146000'}, {'name': 'Nagoya', 'population': '21120000'}]
Nigeria:
  capital: Abuja
  population: 258168689
  important cities: [{'name': 'Lagos', 'population': '20315349'}, {'name': 'Kano', 'population': '31558847'}, {'name': 'Ibadan', 'population': '32105519'}]


## 2. Different Types of Retrievers (15 pt)

In this section, We use mini-bioasq dataset to evalute different types of retrivers.

In [None]:
import json
from datasets import load_dataset

corpus = load_dataset("rag-datasets/mini-bioasq", "text-corpus", split="passages")
qa_dataset = load_dataset("rag-datasets/mini-bioasq", "question-answer-passages", split="test[:100]")

qa_dataset = qa_dataset.map(lambda data: {
  "relevant_passage_ids": json.loads(data["relevant_passage_ids"])
})

print(corpus)
print(qa_dataset)


Downloading data:   0%|          | 0.00/24.5M [00:00<?, ?B/s]

Generating passages split:   0%|          | 0/40221 [00:00<?, ? examples/s]

Downloading data:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4719 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Dataset({
    features: ['passage', 'id'],
    num_rows: 40221
})
Dataset({
    features: ['question', 'answer', 'relevant_passage_ids', 'id'],
    num_rows: 100
})


### 2.1 Evaluate Retriever (4 pt)

To effectively compare various retrieval systems, we must define a metric. Complete the `evaluate_retriever` function to measure the accuracy of the retrieved documents. Consider the `relevant_passage_ids` column as the expected documents to be retrieved.

In [None]:
def evaluate_retriever(retriever):
    correct = 0
    total = 0

    for data in qa_dataset:
        relevant_passage_ids = data["relevant_passage_ids"]

        question = data["question"]

        retrieved_passage_ids = [d.metadata['id'] for d in retriever.invoke(question)]

        correct += len(set(relevant_passage_ids) & set(retrieved_passage_ids))
        total += len(relevant_passage_ids)

        acc = correct / total

    return acc

### 2.2 TF-IDF Retriever (3 pt)

Create a TF-IDF retriever and configure it to returns the top 5 relevant documents.

In [None]:
from langchain_core.documents import Document
from langchain_community.retrievers import TFIDFRetriever

docs = []
for doc in corpus:
    docs.append(Document(page_content=doc["passage"], metadata={"id": doc["id"]}))

tfidf_retriever = TFIDFRetriever.from_documents(docs, k=5)

### 2.3 Semantic Retriever (5 pt)

Semantic retrievers operate by retrieving documents through embeddings. These systems require an embedding model to convert documents into a vector space, and a vector database to find the closest documents to a query. Construct a semantic retriever that utilizes [`intfloat/e5-base`](https://huggingface.co/intfloat/e5-base) as the embedding model and FAISS for the vector database.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import torch

embedding_model = HuggingFaceEmbeddings(model_name="intfloat/e5-base",model_kwargs={"device": 'cuda' if torch.cuda.is_available() else 'cpu'},encode_kwargs={'normalize_embeddings': True})
semantic_retriever = (FAISS.from_documents(documents=docs, embedding=embedding_model).as_retriever(search_kwargs={"k":5}))



### 2.4 Compare Retrivers (3 pt)

Calculate the score for each retriever using `evaluate_retriever` you previously writed. In this question, which one outperforms the other? Illustrate a scenario for each retriver that it outperforms the other.

*Answer:*

### TF-IDF Based Retrieval

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. In TF-IDF based retrieval, documents are represented as vectors where each dimension corresponds to a term, and the value of each dimension represents the TF-IDF score of that term in the document. When a query is entered, documents are ranked based on their similarity to the query using cosine similarity or other distance metrics.

The advantages of this approach:
- Simple and easy to implement.
- Works well for short and medium-sized documents.
- Generally faster computation compared to more complex models.

The disadvantages of this approach:
- Ignores the semantic meaning of words and focuses solely on term frequency and document frequency
- May struggle with synonymy and polysemy, where different words with similar or multiple meanings are not properly captured.
- Less effective for longer documents or those with complex structures.


### Semantic Based Retrieval

Semantic-based retrieval involves understanding the meaning of words, phrases, and documents. This can be achieved through various techniques such as word embeddings, semantic similarity measures, or neural network-based models like BERT (Bidirectional Encoder Representations from Transformers). In this approach, documents and queries are embedded into a continuous vector space where semantic similarity is computed.

The advantages of this approach:
- Captures semantic meaning and context of words and documents.
- Can handle synonymy and polysemy more effectively.
- Better suited for longer documents and those with complex structures.

The disadvantages of this approach:
- Often requires large amounts of data and computational resources for training complex models.
- May struggle with rare or domain-specific terms if not properly trained on relevant data.
- More complex implementation and tuning compared to TF-IDF.


### TF-IDF Outperforms Semantic Retrieval

- When dealing with short documents or queries where term frequency plays a crucial role in determining relevance.
- In scenarios where computational resources are limited, and a simpler model is preferred.
- When the dataset lacks sufficient training data for semantic models to learn meaningful representations.

### Semantic Retrieval Outperforms TF-IDF

- When dealing with long documents or queries where understanding semantic meaning is essential for relevance.
- In cases where synonymy and polysemy are prevalent, and capturing word meaning is critical.
- When the dataset is large and diverse enough to train sophisticated semantic models effectively.

In [None]:
tfidf_acc = evaluate_retriever(tfidf_retriever)
semantic_acc = evaluate_retriever(semantic_retriever)

print(f"TF-IDF accuracy: {tfidf_acc:.2f}")
print(f"semantic accuracy: {semantic_acc:.2f}")


TF-IDF accuracy: 0.20
semantic accuracy: 0.24


## 3. RAG (25 pt)

In this section, you should use all the concepts you've learned until now to create a complete RAG chain.

In [1]:
from langchain_community.llms import CTransformers

model = CTransformers(model="TheBloke/Llama-2-7B-Chat-GGUF",model_file="llama-2-7b-chat.Q8_0.gguf",model_type="llama", config={"gpu_layers": 50,'context_length' : 2048})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

### 3.1 Load Documents (2 pt)

Load [RAFT](https://arxiv.org/abs/2403.10131) and [DSPy](https://arxiv.org/abs/2401.12178) papers. You can use `ArxivLoader` to get documents from arXiv.


In [6]:
from langchain.document_loaders import ArxivLoader
import pandas as pd

loader1 = ArxivLoader("2403.10131").load()
loader2 = ArxivLoader("2401.12178").load()

docs =loader1 + loader2
docs

[Document(page_content='RAFT: Adapting Language Model to Domain Specific RAG\nTianjun Zhang Shishir G. Patil Naman Jain Sheng Shen\nMatei Zaharia Ion Stoica Joseph E. Gonzalez\ntianjunz@berkeley.edu, shishirpatil@berkeley.edu\nUC Berkeley\nAbstract\nPretraining Large Language Models (LLMs) on\nlarge corpora of textual data is now a standard\nparadigm. When using these LLMs for many\ndownstream applications, it is common to ad-\nditionally bake in new knowledge (e.g., time-\ncritical news, or private domain knowledge) into\nthe pretrained model either through RAG-based-\nprompting, or finetuning. However, the optimal\nmethodology for the model to gain such new\nknowledge remains an open question. In this pa-\nper, we present Retrieval Augmented Fine Tun-\ning (RAFT), a training recipe that improves the\nmodel’s ability to answer questions in an "open-\nbook" in-domain setting. In RAFT, given a ques-\ntion, and a set of retrieved documents, we train\nthe model to ignore those documents t

### 3.2 Split Documents into Chunks (4 pt)

Usually, each document is constructed from multiple sections, each with a separate topic. It is better to split each document into smaller parts named chunks and search among them instead of actual documents. Write a splitter to create chunks from loaded documents.

In [11]:
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)

chunks = text_splitter.split_documents(docs)

### 3.3 Retriever (3 pt)

Create a retriever of your choice.

In [12]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import torch

embedding_model = HuggingFaceEmbeddings(model_name="intfloat/e5-base",model_kwargs={"device": 'cuda' if torch.cuda.is_available() else 'cpu'},encode_kwargs={'normalize_embeddings': True},)
retriever = (FAISS.from_documents(documents=chunks, embedding=embedding_model).as_retriever(search_kwargs={"k":3}))

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/67.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/356 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

### 3.4 Design Prompt (2 pt)

Design a suitable prompt for RAG.

In [18]:
from langchain_core.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("You are a helpful assistants and you should answer questions based on contexts. You will be given a context and a question and you have to answer based on the provided context. You just have the knowledge of the provided context and if the necessary information to answer the question is not provided, you should say \"I don't know\"."),
    HumanMessagePromptTemplate.from_template("Provided context to answer the question: \n{context}\n\nQuestion to be answered: \n{question}"),
    AIMessagePromptTemplate.from_template(""),
])

### 3.5 RAG Chain (3 pt)

Design a question from the documents and get the retriever and RAG output for that question.

In [14]:
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "what is Retrieval Augmented Fine Tuning model?"
retrieved_doc = retriever.invoke(question)
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='In this paper, we study how to combine supervised\nfine-tuning (SFT) with retrieval augmented generation\n(RAG). We propose a novel adaptation strategy – Retrieval-\nAugmented Fine Tuning (RAFT). RAFT specifically ad-\ndresses the challenge of fine-tuning LLMs to incorporate', metadata={'Published': '2024-03-15', 'Title': 'RAFT: Adapting Language Model to Domain Specific RAG', 'Authors': 'Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, Joseph E. Gonzalez', 'Summary': 'Pretraining Large Language Models (LLMs) on large corpora of textual data is\nnow a standard paradigm. When using these LLMs for many downstream\napplications, it is common to additionally bake in new knowledge (e.g.,\ntime-critical news, or private domain knowledge) into the pretrained model\neither through RAG-based-prompting, or fine-tuning. However, the optimal\nmethodology for the model to gain such new knowledge remains an open question.

### 3.6 Out of Domain Question (4 pt)

Ask a question that is not related to documents. Does model answer it? Change your prompt to force model say "I don't know" when some one asks out of domains questions.

*Answer:*

My question is `Which animal has eight legs?` and the model couldn't answer my made up question.

At first it didn't say `I don't know` about a question that the model can't answer. But after modifing the system pormpt to force the model to say `I don't know` when it doesn't have any knowledge about that filed.

In [20]:
rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "Which animal has eight legs?"
retrieved_doc = retriever.invoke(question)
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='Overall, the LLaMA-7B model, both with and without the\nRAG, performs poorly due to its answering style not align-\ning with the ground truth. By applying domain specific\n4\nRAFT: Adapting Language Model to Domain Specific RAG', metadata={'Published': '2024-03-15', 'Title': 'RAFT: Adapting Language Model to Domain Specific RAG', 'Authors': 'Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, Joseph E. Gonzalez', 'Summary': 'Pretraining Large Language Models (LLMs) on large corpora of textual data is\nnow a standard paradigm. When using these LLMs for many downstream\napplications, it is common to additionally bake in new knowledge (e.g.,\ntime-critical news, or private domain knowledge) into the pretrained model\neither through RAG-based-prompting, or fine-tuning. However, the optimal\nmethodology for the model to gain such new knowledge remains an open question.\nIn this paper, we present Retrieval Augmented 

### 3.7 The Effect of Temperature (7 pt)

RAG performance is highly dependent on model temperature. Explain that low temperature is better or high temperature? For the same prompt, compare the output of the model with low and high temperature.

*Answer:*

The performance of RAG (Retriever-Reader model for open-domain Question Answering) is significantly influenced by the temperature setting. When operating at a low temperature, the model tends to prioritize responses that are closely aligned with its training data and internal probabilities. This often results in concise, factually accurate answers that directly address the given question. Low temperature settings generally lead to more predictable outputs, which can be advantageous when seeking reliable and precise responses. Additionally, these responses are typically well-grounded in the information provided by the model's training data, enhancing their credibility.

On the other hand, employing a high temperature setting encourages the model to explore a broader range of possibilities, potentially generating more diverse and creative responses. While this can lead to more interesting and novel answers, it may also result in less coherent and relevant outputs. High temperature settings can introduce more variability into the responses, which might be beneficial in certain contexts where creativity and exploration are valued over precision.

Ultimately, the choice between low and high temperature settings depends on the specific requirements of the task at hand. Low temperatures are generally preferred when accuracy and reliability are paramount, while high temperatures may be more suitable for tasks where creativity and exploration are desired, even at the expense of precision.

## Using Other LLMs in Project



In this project I used the help of Chatgpt 3.5 for questions that has been asked in the notebook.

Also I used the help of Chatgpt 3.5 for knowing the functionality of each code boxes and also for generating codes that help me compelte demanded tasks.

My Usage of Chatgpt 3.5 was for two purposes:

1- For generating answers similar to what questions want

2- For checking my answers or fix my gramatical errors or unkown syntax errors.

You can consider that I used the help of Chatgpt 3.5 in most parts of this project. So I don't bring all my prompts in this notebook because it gets too messy.