**LLM Workshop 2024 by Sebastian Raschka**

This code is based on *Build a Large Language Model (From Scratch)*, [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)

- Instruction finetuning from scratch: [ch07.ipynb](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb)

# Setup

In [1]:
# Requirements from: https://github.com/rasbt/LLM-workshop-2024/blob/main/requirements.txt
requirements = """
# torch >= 2.0.1
# tiktoken >= 0.5.1
# matplotlib >= 3.7.1
# numpy >= 1.24.3
# tensorflow >= 2.15.0
# tqdm >= 4.66.1
# numpy >= 1.25, < 2.0
# pandas >= 2.2.1
# psutil >= 5.9.5
litgpt[all] >= 0.4.1
"""

with open("requirements.txt", mode="wt") as f:
    f.write(requirements)

%pip install -r requirements.txt --quiet


Add dataset and image files from Sebastian Raschka's training repository

In [1]:
from pathlib import Path

import requests

session = requests.Session()
with open("instruction-data.json", "wt", encoding="utf-8") as f:
    response = session.get("https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/instruction-data.json")
    f.write(response.text)

for img_num in range(1, 16):
    filepath = Path(f"figures/{img_num:02d}.png")
    if not filepath.parent.exists():
        filepath.parent.mkdir(parents=True, exist_ok=True)
    with open(filepath, mode="wb") as img_file:
        img_file.write(session.get(f"https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/{img_num:02d}.png").content)


<br>
<br>
<br>
<br>

# 6) Instruction finetuning (part 1; intro)

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/01.png?raw=1" width=1000px>

---

# 6.1 Introduction to instruction finetuning

- We saw that pretraining an LLM involves a training procedure where it learns to generate one word at a time
- Hence, a pretrained LLM is good at text completion, but it is not good at following instructions
- In this last part of the workshop, we teach the LLM to follow instructions better

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/02.png?raw=1" width=800px>

<br>
<br>
<br>
<br>

# 6.2 Preparing a dataset for supervised instruction finetuning

- We will work with a simple instruction dataset that I prepared for this workshop

In [2]:
import json


file_path = "instruction-data.json"

with open(file_path, "r") as file:
    data = json.load(file)
print("Number of entries:", len(data))

Number of entries: 1100


- Each item in the `data` list we loaded from the JSON file above is a dictionary in the following form

In [3]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


- Note that the `'input'` field can be empty:

In [4]:
print("Another example entry:\n", data[999])

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


- Instruction finetuning is often referred to as "supervised instruction finetuning" because it involves training a model on a dataset where the input-output pairs are explicitly provided
- There are different ways to format the entries as inputs to the LLM; the figure below illustrates two example formats that were used for training the Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) and Phi-3 (https://arxiv.org/abs/2404.14219) LLMs, respectively
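
As a rough sketch of what a second, chat-style format could look like in code, the function below wraps an entry in `<|user|>` and `<|assistant|>` markers. The function name `format_input_phi3`, the way the instruction and input are joined, and the exact marker strings are assumptions made for illustration; consult the Phi-3 report for the template the model was actually trained with.

```python
def format_input_phi3(entry):
    """Format an entry in a Phi-3-like chat style (assumed template)."""
    # Assumption: instruction and optional input are merged into one user turn
    user_text = entry["instruction"]
    if entry["input"]:
        user_text += f"\n{entry['input']}"
    # Assumption: these marker strings mirror the chat-style format in the figure
    return f"<|user|>\n{user_text}\n\n<|assistant|>\n"


entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

# The desired response is appended after the assistant marker
print(format_input_phi3(entry) + entry["output"])
```

Compared with the Alpaca style, a template like this is shorter, which reduces the number of tokens spent on boilerplate in every training example.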

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/03.png?raw=1" width=900px>

- Suppose we use Alpaca-style prompt formatting, which was the original prompt template for instruction finetuning
- The function below shows how we format the text that we would pass as input to the LLM

In [5]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

- A formatted response with an input field looks as follows

In [6]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'


- Below is a formatted response without an input field

In [7]:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.


- Tokenized, this looks as follows

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/04.png?raw=1" width=1000px>

- To make it work with batches, we add "padding" tokens

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/05.png?raw=1" width=1000px>

- Above, only the inputs are shown for simplicity; however, similar to pretraining, the target tokens are shifted by 1 position:

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/06.png?raw=1" width=700px>

- In addition, it is common to mask parts of the target sequence so that they do not contribute to the loss
- By default, PyTorch's `cross_entropy(..., ignore_index=-100)` skips all target positions whose label is -100 when computing the loss
- Using this -100 `ignore_index`, we can ignore the additional end-of-text (padding) tokens in the batches that we used to pad the training examples to equal length
- However, we don't want to ignore the first instance of the end-of-text (padding) token (50256) because it can help signal to the LLM when the response is complete
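
The padding, shifting, and masking steps described above can be sketched in plain Python. This is a minimal illustration rather than the collate function used by the workshop or by LitGPT: the name `collate_batch` and the toy token IDs are hypothetical, and the sketch assumes that token 50256 never occurs inside an actual sequence.

```python
PAD_TOKEN_ID = 50256   # end-of-text token, also used for padding
IGNORE_INDEX = -100    # default ignore_index of PyTorch's cross_entropy


def collate_batch(batch, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Pad token-ID lists to equal length and build shifted, masked targets."""
    # +1 so that even the longest sequence ends with one end-of-text token
    max_len = max(len(seq) for seq in batch) + 1
    inputs, targets = [], []
    for seq in batch:
        padded = seq + [pad_token_id] * (max_len - len(seq))
        inp = padded[:-1]   # inputs: everything except the last token
        tgt = padded[1:]    # targets: the same tokens shifted by one position
        # Keep the first padding token in the targets, mask all later ones
        seen_pad = False
        for i, token in enumerate(tgt):
            if token == pad_token_id:
                if seen_pad:
                    tgt[i] = ignore_index
                seen_pad = True
        inputs.append(inp)
        targets.append(tgt)
    return inputs, targets


# Toy token IDs (hypothetical), three sequences of different lengths
batch = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]
inputs, targets = collate_batch(batch)

for inp, tgt in zip(inputs, targets):
    print(inp, "->", tgt)
```

In each target row, exactly one 50256 token survives, so the model still receives a training signal for ending its response, while the remaining padding positions are skipped by the loss.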

<img src="https://github.com/rasbt/LLM-workshop-2024/blob/main/06_finetuning/figures/07.png?raw=1" width=1000px>

---

# 6) Instruction finetuning (part 2; finetuning)

- In this notebook, we get to the actual finetuning part
- But first, let's briefly introduce a technique, called LoRA, that makes the finetuning more efficient
- It's not required to use LoRA, but it can result in noticeable memory savings while still resulting in good modeling performance


# 6.1 Introduction to LoRA

- Low-rank adaptation (LoRA) is a machine learning technique that adapts a pretrained model to a specific, often smaller, dataset by training only a small number of additional low-rank parameters instead of updating all of the model's weights
- This approach is important because it allows for efficient finetuning of large models on task-specific data, significantly reducing the computational cost and time required for finetuning

- Suppose we have a large weight matrix $W$ for a given layer
- During backpropagation, we learn a $\Delta W$ matrix, which contains information on how much we want to update the original weights to minimize the loss function during training
- In regular training and finetuning, the weight update is defined as follows:

$$W_{\text{updated}} = W + \Delta W$$

- The LoRA method proposed by [Hu et al.](https://arxiv.org/abs/2106.09685) offers a more efficient alternative to computing the weight updates $\Delta W$ by learning an approximation of it, $\Delta W \approx AB$.
- In other words, in LoRA, we have the following, where $A$ and $B$ are two small weight matrices:

$$W_{\text{updated}} = W + AB$$
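
To get a feeling for how small $A$ and $B$ are, the sketch below counts the trainable parameters for one layer. It assumes that $W$ has shape `d_in x d_out`, $A$ has shape `d_in x rank`, and $B$ has shape `rank x d_out`, so that the product $AB$ matches the shape of $W$. The layer size of 2048 and the rank of 8 are illustrative values, and the function names are hypothetical.

```python
def full_update_params(d_in, d_out):
    """Number of entries in a full weight update (Delta W)."""
    return d_in * d_out


def lora_update_params(d_in, d_out, rank):
    """Number of entries in A (d_in x rank) plus B (rank x d_out)."""
    return d_in * rank + rank * d_out


# Illustrative sizes (assumptions): a 2048 x 2048 layer and rank 8
d_in, d_out, rank = 2048, 2048, 8

full = full_update_params(d_in, d_out)
lora = lora_update_params(d_in, d_out, rank)

print("Full update parameters:", full)
print("LoRA update parameters:", lora)
print("Reduction factor:", full / lora)
```

For these sizes, the LoRA update has 128 times fewer trainable parameters than the full update for the same layer.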

- The figure below illustrates these formulas for full finetuning and LoRA side by side

<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/08.png" width="1100px">

- If you paid close attention, you may have noticed that the full finetuning and LoRA depictions in the figure above look slightly different from the formulas shown earlier
- That's due to the distributive law of matrix multiplication: we don't have to add the weight update to the original weights but can keep the two separate
- For instance, if $x$ is the input data, then we can write the following for regular finetuning:

$$x (W+\Delta W) = x W + x \Delta W$$

- Similarly, we can write the following for LoRA:

$$x (W+A B) = x W + x A B$$

- The fact that we can keep the LoRA weight matrices separate makes LoRA especially attractive
- In practice, this means that we don't have to modify the weights of the pretrained model at all, as we can apply the LoRA matrices on the fly
- After setting up the dataset and loading the model, we will implement LoRA in the code to make these concepts less abstract
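
As a small numeric check of the identity $x (W + AB) = x W + x A B$, the sketch below uses plain Python lists with tiny integer matrices. It is not how LitGPT implements LoRA; the helper names `matmul` and `add` and all matrix values are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Add two matrices of the same shape elementwise."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


x = [[1, 2, 3]]                   # one input row, shape 1 x 3
W = [[1, 0], [0, 1], [2, 2]]      # pretrained weights, shape 3 x 2
A = [[1], [0], [2]]               # LoRA matrix A, shape 3 x 1 (rank 1)
B = [[3, -1]]                     # LoRA matrix B, shape 1 x 2

# Option 1: merge the LoRA update into the weights, then apply them
merged = matmul(x, add(W, matmul(A, B)))

# Option 2: keep W untouched and add the LoRA branch on the fly
separate = add(matmul(x, W), matmul(matmul(x, A), B))

print(merged, separate)
```

Both options produce the same output, which is why the pretrained weights can stay frozen while only the small matrices are trained.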

<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/09.png" width="800px">



# 6.2 Creating training and test sets

- There's one more thing before we can start finetuning: creating the training and test subsets
- We will use 85% of the data for training and the remaining 15% for testing

In [2]:
import json


file_path = "instruction-data.json"

with open(file_path, "r") as file:
    data = json.load(file)
print("Number of entries:", len(data))

Number of entries: 1100


In [3]:
train_portion = int(len(data) * 0.85)  # 85% for training
test_portion = int(len(data) * 0.15)    # 15% for testing

train_data = data[:train_portion]
test_data = data[train_portion:]

In [4]:
print("Training set length:", len(train_data))
print("Test set length:", len(test_data))

Training set length: 935
Test set length: 165
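
The split above slices the data in file order. If the entries in a dataset were grouped by topic, shuffling before splitting would be a reasonable precaution. The helper below is a hypothetical variant (`split_data` is not part of the workshop code) that keeps the same 85/15 arithmetic and shuffles only when a seed is given.

```python
import random


def split_data(data, train_fraction=0.85, seed=None):
    """Split a sequence into train and test lists; shuffle only if seed is set."""
    items = list(data)
    if seed is not None:
        # A dedicated Random instance keeps the global random state untouched
        random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_fraction)
    # Slicing at n_train guarantees that no entry is dropped or duplicated
    return items[:n_train], items[n_train:]


# Stand-in data (assumption): integers instead of the 1100 JSON entries
train_part, test_part = split_data(range(1100))
print(len(train_part), len(test_part))

shuffled_train, shuffled_test = split_data(range(1100), seed=123)
```

Taking the test set as "everything after the training slice", as the notebook does, avoids losing entries to rounding when the two fractions are converted to integers separately.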


In [5]:
with open("train.json", "w") as json_file:
    json.dump(train_data, json_file, indent=4)

with open("test.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)


# 6.3 Instruction finetuning

- Using LitGPT, we can finetune the model via `litgpt finetune model_dir`
- However, here we will use LoRA finetuning via `litgpt finetune_lora model_dir`, since it is quicker and less resource-intensive

In [6]:
!litgpt download list

repo_id: list
Please specify --repo_id <repo_id>. Available values:
codellama/CodeLlama-13b-hf
codellama/CodeLlama-13b-Instruct-hf
codellama/CodeLlama-13b-Python-hf
codellama/CodeLlama-34b-hf
codellama/CodeLlama-34b-Instruct-hf
codellama/CodeLlama-34b-Python-hf
codellama/CodeLlama-70b-hf
codellama/CodeLlama-70b-Instruct-hf
codellama/CodeLlama-70b-Python-hf
codellama/CodeLlama-7b-hf
codellama/CodeLlama-7b-Instruct-hf
codellama/CodeLlama-7b-Python-hf
databricks/dolly-v2-12b
databricks/dolly-v2-3b
databricks/dolly-v2-7b
EleutherAI/pythia-1.4b
EleutherAI/pythia-1.4b-deduped
EleutherAI/pythia-12b
EleutherAI/pythia-12b-deduped
EleutherAI/pythia-14m
EleutherAI/pythia-160m
EleutherAI/pythia-160m-deduped
EleutherAI/pythia-1b
EleutherAI/pythia-1b-deduped
EleutherAI/pythia-2.8b
EleutherAI/pythia-2.8b-deduped
EleutherAI/pythia-31m
EleutherAI/pythia-410m
EleutherAI/pythia-410m-deduped
EleutherAI/pythia-6.9b
EleutherAI/pythia-6.9b-deduped
EleutherAI/pythia-70m
EleutherAI/pythia-70m-deduped
garage-bA

In [24]:
!litgpt download TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

repo_id: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
Setting HF_HUB_ENABLE_HF_TRANSFER=1
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.
config.json: 100%|█████████████████████████████| 560/560 [00:00<00:00, 7.65MB/s]
generation_config.json: 100%|██████████████████| 129/129 [00:00<00:00, 1.17MB/s]
pytorch_model.bin: 100%|████████████████████| 4.40G/4.40G [00:24<00:00, 181MB/s]
tokenizer.json: 100%|██████████████████████| 1.84M/1.84M [00:00<00:00, 62.1MB/s]
tokenizer.model: 100%|████████████████████████| 500k/500k [00:00<00:00, 261MB/s]
tokenizer_config.json: 100%|███████████████████| 776/776 [00:00<00:00, 12.1MB/s]
Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}
Loading weights: pytorch_model.bin: 100%|███████████████| 00:06<00:00, 14.56it/s
Saving

In [37]:
!litgpt finetune_lora TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
--data JSON \
--data.val_split_fraction 0.1 \
--data.json_path train.json \
--train.epochs 3 \
--train.log_interval 100

{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'data': JSON(json_path=PosixPath('train.json'),
              mask_prompt=False,
              val_split_fraction=0.1,
              prompt_style=<litgpt.prompts.Alpaca object at 0x7fba8ee8a6b0>,
              ignore_index=-100,
              seed=42,
              num_workers=4),
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
                    log_interval=100,
        


# Exercise 1: Generate and save the test set model responses of the base model

- In this exercise, we collect the model responses on the test dataset so that we can evaluate them later


- Starting with the original model before finetuning, load the model using the LitGPT Python API (`LLM.load` ...)
- Then use the `LLM.generate` function to generate the responses for the test data
- The following utility function will help you to format the test set entries as input text for the LLM

In [38]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

print(format_input(test_data[0]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.


In [42]:
test_data[1]

{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.'}

In [14]:
print(format_input(test_data[1]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?


- Using this utility function, generate the model responses for all test set entries and add them to `test_data`
- For example, suppose the `test_data[0]` entry initially looks as follows:
    
```
{'instruction': 'Rewrite the sentence using a simile.',
 'input': 'The car is very fast.',
 'output': 'The car is as fast as lightning.'}
```

- Modify the `test_data` entry so that it contains the model response:
    
```
{'instruction': 'Rewrite the sentence using a simile.',
 'input': 'The car is very fast.',
 'output': 'The car is as fast as lightning.',
 'base_model': 'The car is as fast as a cheetah sprinting across the savannah.'
}
```

- Do this for all test set entries, and then save the modified `test_data` list as `test_base_model.json`


In [43]:
from litgpt import LLM

llm = LLM.load("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")

In [44]:
from tqdm import tqdm

for i in tqdm(range(len(test_data))):
    response = llm.generate(format_input(test_data[i]))
    test_data[i]["base_model"] = response

with open("test_base_model.json", mode="wt", encoding="utf-8") as f:
    json.dump(test_data, f)

100%|██████████| 165/165 [01:57<00:00,  1.41it/s]


In [45]:
test_data[1]

{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.',
 'base_model': '\n\n### Instruction:\nWhich provides power and efficiency for a huge amount of devices? \n\n### Instruction:\nWhat is the most common way people use solar power?\n\n### Answer:\nThe website'}

<br>
<br>
<br>
<br>

# Exercise 2: Generate and save the test set model responses of the finetuned model

In [46]:
!litgpt generate TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --max_new_tokens 256

{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T'),
 'compile': False,
 'max_new_tokens': 256,
 'num_samples': 1,
 'precision': None,
 'prompt': 'What food do llamas eat?',
 'quantize': None,
 'temperature': 0.8,
 'top_k': 50,
 'top_p': 1.0}
Loading model 'checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T/lit_model.pth' with {'name': 'tiny-llama-1.1b', 'hf_config': {'name': 'TinyLlama-1.1B-intermediate-step-1431k-3T', 'org': 'TinyLlama'}, 'scale_embeddings': False, 'block_size': 2048, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 22, 'n_head': 32, 'head_size': 64, 'n_embd': 2048, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 4, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 5632, 'rope_condense_ratio': 1, 'rope_base'

- Repeat the steps from the previous exercise but this time collect the responses of the finetuned model
- Save the resulting `test_data` list as `test_base_and_finetuned_model.json`

In [47]:
from litgpt import LLM

# Save memory by removing the previous model
# del llm

llm_lora = LLM.load("/teamspace/studios/this_studio/out/finetune/lora/final")

In [48]:
from tqdm import tqdm

for i in tqdm(range(len(test_data))):
    response = llm_lora.generate(format_input(test_data[i]))
    test_data[i]["finetuned_model"] = response

with open("test_base_and_finetuned_model.json", mode="wt", encoding="utf-8") as f:
    json.dump(test_data, f)

100%|██████████| 165/165 [00:39<00:00,  4.13it/s]


In [49]:
test_data[1]

{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.',
 'base_model': '\n\n### Instruction:\nWhich provides power and efficiency for a huge amount of devices? \n\n### Instruction:\nWhat is the most common way people use solar power?\n\n### Answer:\nThe website',
 'finetuned_model': 'Clouds that form when rain and thunderstorms combine are called wet-lift clouds.'}

---

# 6) Instruction finetuning (part 3; benchmark evaluation)

- In the previous notebook, we finetuned the LLM; in this notebook, we evaluate it using popular benchmark methods

- There are 3 main types of model evaluation

  1. MMLU-style Q&A
  2. LLM-based automatic scoring
  3. Human ratings by relative preference
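
To make the first and third evaluation types more concrete, the sketch below computes a multiple-choice accuracy (as in MMLU-style Q&A) and a simple win rate from pairwise preferences. Both functions and all example values are hypothetical simplifications; real benchmarks and leaderboards use more elaborate protocols.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of predicted answer letters that match the reference letters."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equally long")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def win_rate(preferences, model="A"):
    """Fraction of pairwise comparisons in which raters preferred `model`."""
    if not preferences:
        raise ValueError("preferences must not be empty")
    return sum(choice == model for choice in preferences) / len(preferences)


# Made-up example values
accuracy = multiple_choice_accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"])
rate = win_rate(["A", "B", "A", "A"], model="A")
print(accuracy, rate)
```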
  
  


<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/10.png" width=800px>

<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/11.png" width=800px>


<br>
<br>
<br>


<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/13.png" width=800px>



<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/14.png" width=800px>



## https://tatsu-lab.github.io/alpaca_eval/

<img src="https://raw.githubusercontent.com/rasbt/LLM-workshop-2024/main/06_finetuning/figures/15.png" width=800px>

## https://chat.lmsys.org

# 6.2 Evaluation

- In this notebook, we do an MMLU-style evaluation in LitGPT, which is based on the [EleutherAI LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- There are hundreds if not thousands of benchmarks; using the command below, we filter for MMLU subsets, because running the evaluation on the whole MMLU dataset would take a very long time

- Let's say we are interested in the `mmlu_philosophy` subset; we can then evaluate the LLM on it as follows

# Exercise 3: Evaluate the finetuned LLM

In [53]:
!litgpt evaluate out/finetune/lora/final \
    --batch_size 4 \
    --tasks "mmlu_philosophy" \
    --out_dir "eval_finetuned"

{'batch_size': 4,
 'checkpoint_dir': PosixPath('out/finetune/lora/final'),
 'device': None,
 'dtype': None,
 'force_conversion': False,
 'limit': None,
 'num_fewshot': None,
 'out_dir': PosixPath('eval_finetuned'),
 'save_filepath': None,
 'seed': 1234,
 'tasks': 'mmlu_philosophy'}
2024-09-14 05:15:50.484328: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-14 05:15:50.500770: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-14 05:15:50.500808: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-14 05:15:50.511345: I tensorflow/core/platform/cpu_feature_guard.cc:210] 

---

# 6) Instruction finetuning (part 4; evaluating instruction responses locally using a Llama 3 model)

- This notebook uses an 8 billion parameter Llama 3 model through LitGPT to evaluate responses of instruction finetuned LLMs based on a dataset in JSON format that includes the generated model responses, for example:



```python
{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
    "base_model": "\nThe atomic number of helium is 3.0", # <-- Response by an LLM
    "finetuned_model": "\nThe atomic number of helium is 2."    # <-- Response by a 2nd LLM
},
```

- The code doesn't require a GPU and runs on a laptop (it was tested on an M3 MacBook Air)

In [54]:
from importlib.metadata import version

pkgs = [
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.4


## 6.1 Load JSON Entries

- Now, let's get to the data evaluation part
- Here, we assume that we saved the test dataset and the model responses as a JSON file that we can load as follows:

In [1]:
import json

json_file = "test_base_and_finetuned_model.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))

Number of entries: 165


- The structure of this file is as follows, where we have the given response in the test dataset (`'output'`) and responses by two different models (`'base_model'` and `'finetuned_model'`):

In [2]:
json_data[1]

{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.',
 'base_model': '\n\n### Instruction:\nWhich provides power and efficiency for a huge amount of devices? \n\n### Instruction:\nWhat is the most common way people use solar power?\n\n### Answer:\nThe website',
 'finetuned_model': 'Clouds that form when rain and thunderstorms combine are called wet-lift clouds.'}

- Below is a small utility function that formats the input for visualization purposes later:

In [3]:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

print(format_input(json_data[0])) # input

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.


In [4]:
json_data[0]["output"]

'The car is as fast as lightning.'

In [5]:
json_data[0]["base_model"]

'  He is more aloof than Bronco.\n\n### Output:\n> The car is very fast and is more aloof than Bronco.\n\n### Programming Explanation:\n\n### [Le'

In [6]:
json_data[1]["finetuned_model"]

'Clouds that form when rain and thunderstorms combine are called wet-lift clouds.'

- Now, let's try LitGPT to compare the model responses (we only evaluate the first 5 responses for a visual comparison):

In [7]:
from litgpt import LLM

llm = LLM.load("meta-llama/Meta-Llama-3-8B-Instruct")

In [10]:
from tqdm import tqdm


def generate_model_scores(json_data, json_key):
    scores = []
    for entry in tqdm(json_data, desc=f"Scoring entries ({json_key})"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = llm.generate(prompt, max_new_tokens=50)
        try:
            scores.append(int(score))
        except ValueError:
            # Skip responses that cannot be parsed as an integer
            continue

    return scores
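
The function above drops every response that is not a bare integer, for example when the judge model answers with extra words around the number. One possible way to recover more scores is to extract the first integer from the text. The helper `parse_score` below is a hypothetical sketch and a heuristic: it can pick the wrong number if the judge mentions other numbers before the score.

```python
import re


def parse_score(text, low=0, high=100):
    """Return the first integer found in `text` if it lies in [low, high], else None."""
    match = re.search(r"\d+", text)
    if match is None:
        return None
    value = int(match.group())
    # Reject numbers outside the requested scale instead of clamping them
    return value if low <= value <= high else None


print(parse_score("85"))
print(parse_score("Score: 70/100"))
print(parse_score("I cannot rate this response."))
```

Returning `None` instead of skipping the entry also makes it possible to keep track of which entries failed to parse.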

# Exercise: Evaluate the LLMs

- Now, using the `generate_model_scores` function above, evaluate the non-finetuned (`base_model`) and finetuned (`finetuned_model`) models
- Apply this evaluation to the whole dataset and compute the average score of each model

In [11]:
model_scores = {
    "base_model": [],
    "finetuned_model": []
}

for model in ["base_model", "finetuned_model"]:
    model_scores[model] = generate_model_scores(json_data, model)
    print(
        f"Model: {model} | "
        f"score count: {len(model_scores[model])}/{len(json_data)} | "
        f"average score: {sum(model_scores[model]) / len(model_scores[model])}"
    )

Scoring entries (base_model): 100%|██████████| 165/165 [02:23<00:00,  1.15it/s]


Model: base_model | score count: 124/165 | average score: 75.35483870967742


Scoring entries (finetuned_model): 100%|██████████| 165/165 [00:49<00:00,  3.32it/s]

Model: finetuned_model | score count: 157/165 | average score: 76.828025477707
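
Note that the two averages above are computed over different subsets (124 versus 157 of 165 entries), because unparsable responses are dropped. A sketch of a like-for-like comparison is shown below. It assumes a hypothetical variant of the scoring function that returns one value per entry and uses `None` where no score could be parsed; the function name and the example scores are made up.

```python
def paired_averages(scores_a, scores_b):
    """Average both score lists over the entries where both have a score."""
    pairs = [
        (a, b)
        for a, b in zip(scores_a, scores_b)
        if a is not None and b is not None
    ]
    if not pairs:
        return None
    n = len(pairs)
    avg_a = sum(a for a, _ in pairs) / n
    avg_b = sum(b for _, b in pairs) / n
    return avg_a, avg_b, n


# Made-up per-entry scores; None marks an entry without a parsable score
base_scores = [80, None, 60, 40]
finetuned_scores = [90, 70, None, 50]

result = paired_averages(base_scores, finetuned_scores)
print(result)
```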





<br>
<br>
<br>
<br>

# Solution

In [None]:
for model in ("base_model", "finetuned_model"):

    scores = generate_model_scores(json_data, model)
    print(f"\n{model}")
    print(f"Number of scores: {len(scores)} of {len(json_data)}")
    print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries: 100%|██████████| 165/165 [00:30<00:00,  5.50it/s]



response_before
Number of scores: 161 of 165
Average score: 84.02



Scoring entries: 100%|██████████| 165/165 [00:29<00:00,  5.58it/s]


response_after
Number of scores: 160 of 165
Average score: 81.88




