## Instruction Finetuning 

![Preview of the finetuning dataset](dataset_finetune_prev.jpg)
Dataset from: https://huggingface.co/datasets/axiong/pmc_llama_instructions

#### Setup and Imports

In [1]:
import torch
import tiktoken
from matplotlib import pyplot as plt
from litgpt import LLM
import json
from tqdm import tqdm

from importlib.metadata import version
pkgs = ["matplotlib", "numpy", "tiktoken", "torch"]
for p in pkgs:
    print(f"{p} version: {version(p)}")

matplotlib version: 3.8.2
numpy version: 1.26.4
tiktoken version: 0.7.0
torch version: 2.2.1+cu121


#### Load and Preprocess Dataset

In [2]:

with open("instruction_data.json", "r") as file:
    data = json.load(file)

processed_data = [
    {
        "instruction": item["instruction"],
        "input": item["input"],
        "output": item["output"]
    }
    for item in data
]

print("Number of entries:", len(processed_data))

Number of entries: 1100
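The list comprehension above raises a `KeyError` on the first record that lacks one of the three fields. A minimal sketch of a pre-check follows, assuming every record should be a dict with string values for `instruction`, `input`, and `output`, where only `input` may be empty. The helper name `find_invalid_entries` and the sample records are hypothetical, not part of the dataset.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def find_invalid_entries(entries):
    """Return indices of entries missing a required key or with an empty instruction/output."""
    bad = []
    for idx, item in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        # Short-circuit: the field checks only run when no key is missing.
        if missing or not item["instruction"].strip() or not item["output"].strip():
            bad.append(idx)
    return bad


# Hypothetical records for illustration only.
sample = [
    {"instruction": "Translate 'hello' into French.", "input": "", "output": "Bonjour"},
    {"instruction": "Summarize the text.", "input": "Some text."},  # no "output" key
]
print(find_invalid_entries(sample))
```

Running the same check on `data` before building `processed_data` would point to the offending records instead of stopping at the first one.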


#### Create Training and Test Sets

In [3]:
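import random

# Fix the seed so the train/test split is the same on every run
# (the value 123 is an arbitrary choice).
random.seed(123)
random.shuffle(processed_data)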

train_ratio = 0.85
train_size = int(len(processed_data) * train_ratio)

train_data = processed_data[:train_size]
test_data = processed_data[train_size:]

print("Training set length:", len(train_data))
print("Test set length:", len(test_data))


with open("train.json", "w") as json_file:
    json.dump(train_data, json_file, indent=4)
    
with open("test.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)

Training set length: 935
Test set length: 165
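The split can also be wrapped in a small function that uses its own seeded `random.Random` instance, so the result does not depend on global random state and the input list is left untouched. This is a sketch under assumptions: the function name `split_dataset`, the seed value, and the placeholder records are illustrative and not part of the notebook.

```python
import random


def split_dataset(entries, train_ratio=0.85, seed=123):
    """Shuffle a copy of `entries` with a private RNG and split it into train/test lists."""
    rng = random.Random(seed)   # private generator, global random state is not modified
    shuffled = list(entries)    # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * train_ratio)
    return shuffled[:train_size], shuffled[train_size:]


# Placeholder records standing in for the 1100 instruction entries.
records = [{"id": i} for i in range(1100)]
train_a, test_a = split_dataset(records)
train_b, test_b = split_dataset(records)

print(len(train_a), len(test_a))
print("Same split on repeat:", train_a == train_b)
```

Because the same seed always yields the same permutation, the test set stays fixed across notebook restarts, which keeps later base-versus-finetuned comparisons on identical examples.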


#### Finetuning

In [4]:

!litgpt finetune_lora microsoft/phi-2 \
--data JSON \
--data.val_split_fraction 0.1 \
--data.json_path train.json \
--train.epochs 3 \
--train.log_interval 100

{'checkpoint_dir': PosixPath('checkpoints/microsoft/phi-2'),
 'data': JSON(json_path=PosixPath('train.json'),
              mask_prompt=False,
              val_split_fraction=0.1,
              prompt_style=<litgpt.prompts.Alpaca object at 0x7f88cb5a5a20>,
              ignore_index=-100,
              seed=42,
              num_workers=4),
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
                    log_interval=100,
                    global_batch_size=16,
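
The printed configuration shows `lora_r=8`, `lora_alpha=16`, and adapters enabled only for the query and value projections. In LoRA, a frozen weight matrix is augmented with a low-rank update `(alpha / r) * B @ A`, where `A` has shape `(r, d_in)` and `B` has shape `(d_out, r)`. The sketch below works out the scaling factor and the adapter size for one projection. The width of 2560 is an assumption used for illustration; it does not appear in the log above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    return alpha / r


# Values taken from the printed configuration.
LORA_R = 8
LORA_ALPHA = 16

# Assumption: a square 2560 x 2560 projection (illustrative width, not shown in the log).
HIDDEN = 2560

adapter_params = lora_param_count(HIDDEN, HIDDEN, LORA_R)
full_params = HIDDEN * HIDDEN

print("scaling:", lora_scaling(LORA_ALPHA, LORA_R))
print("adapter params per projection:", adapter_params)
print("fraction of full matrix:", adapter_params / full_params)
```

Under these assumptions each adapted projection trains well under one percent of the parameters of the full matrix, which is why LoRA finetuning fits on a single device.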
  

#### Generate Base Model Responses

In [5]:
def format_input(entry):
    """Build an Alpaca-style prompt from an instruction entry."""
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Append the input section only when the entry provides a non-empty input.
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

llm = LLM.load("microsoft/phi-2")

# Generate a response from the base model for every test entry.
for entry in tqdm(test_data):
    entry["base_model"] = llm.generate(format_input(entry))

with open("test_base_model.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)

100%|██████████| 165/165 [00:51<00:00,  3.20it/s]
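
To make the prompt layout concrete, the sketch below calls `format_input` on two entries and prints the resulting strings. The function body is repeated so the example runs on its own. The first instruction is taken from the test samples in this notebook; the `input` value `"begin"` in the second entry is a hypothetical placeholder.

```python
def format_input(entry):
    """Build an Alpaca-style prompt from an instruction entry."""
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Append the input section only when the entry provides a non-empty input.
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


no_input = {
    "instruction": "Convert the following verb to its gerund form: 'eat'",
    "input": "",
}
with_input = {
    "instruction": "Find a synonym for the given verb.",
    "input": "begin",  # hypothetical input value, for illustration only
}

print(format_input(no_input))
print("-" * 40)
print(format_input(with_input))
```

The returned string ends after the instruction section, or after the input section when one is present.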


#### Generate Finetuned Model Responses

In [12]:
# Optionally free the base model before loading the finetuned one:
# del llm
llm_finetuned = LLM.load("/teamspace/studios/this_studio/out/finetune/lora/final/")

# Generate a response from the finetuned model for every test entry.
for entry in tqdm(test_data):
    entry["finetuned_model"] = llm_finetuned.generate(format_input(entry))


with open("test_base_and_finetuned_model.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)

 72%|███████▏  | 118/165 [00:34<00:12,  3.81it/s]

#### Compare Outputs

In [13]:
for i in range(5):
    print(f"Sample {i+1}:")
    print("Instruction:", test_data[i]["instruction"])
    print("Base model output:", test_data[i]["base_model"])
    print("Finetuned model output:", test_data[i]["finetuned_model"])
    print("\n")

Sample 1:
Instruction: Arrange the following events in chronological order: Invention of the airplane, Fall of the Berlin Wall, Discovery of America.
Base model output:  The correct order of these events is Discovery of America, Invention of the airplane, Fall of the Berlin Wall.

Finetuned model output: Italicize the correct order in the following sentence.

### Response:
Concentrate on providing a good answer that appropriately completes the request.


Sample 2:
Instruction: Find a synonym for the given verb.
Base model output:  Start

Finetuned model output: Prologue


Sample 3:
Instruction: Translate the phrase 'Life is beautiful' into Italian.
Base model output:  L'aiei sono belli.

Finetuned model output: Lifo semplice


Sample 4:
Instruction: Convert the following verb to its gerund form: 'eat'
Base model output:  Eating

Finetuned model output: Eating



Sample 5:
Instruction: Look up the freezing point of water.
Base model output:  The freezing point of water is 0 degrees Cels