# Fine-tuning to follow instructions

For an LLM to follow instructions, we need a dataset of instruction-response examples that can be used to fine-tune the model.

Stage 1 _preparing the dataset_:
- Download or find a dataset and format the data
- Batch the dataset
- Create data loaders

Stage 2 _fine-tuning the LLM_:
- Load the pretrained LLM
- Run the fine-tuning (instruction or classification)
- Inspect the modeling loss

Stage 3 _evaluation_:
- Extract responses
- Evaluate qualitatively
- Score responses

In [6]:
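# Download the instruction dataset (from the book)

import json
import os
import urllib.request

def download_and_load_file(file_path, url):
    # Download only if the file is not already cached locally
    if not os.path.exists(file_path):
        # Create the target directory if needed (no-op for a bare file name)
        os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    # Parse the cached JSON file into a list of dicts
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data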


file_path = "data/instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)


data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))
print("\nexample entry: \n", data[50])

Number of entries: 1100

example entry: 
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


In [7]:
def format_input(entry):
    # Fixed preamble followed by the task instruction
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # The Input section is added only when the entry has a non-empty input
    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    )

    return instruction_text + input_text

In [10]:
inst = data[50]
model_input = format_input(inst)
desired_response = f"\n\n### Response:\n{inst['output']}"

print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
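
The example above has a non-empty `input` field. The sketch below shows the other branch of the template, where the `### Input:` section is left out. It is self-contained, so it restates the template under the hypothetical name `format_prompt`, and the entry is made up for illustration rather than taken from the dataset.

```python
def format_prompt(entry):
    """Compact restatement of the prompt template (hypothetical helper name)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Append the Input section only when the entry provides one
    if entry["input"]:
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt


# Made-up entry with an empty "input" field (not from the dataset)
entry = {
    "instruction": "Name a synonym for 'happy'.",
    "input": "",
    "output": "A synonym for 'happy' is 'joyful'.",
}

prompt = format_prompt(entry)
print(prompt + f"\n\n### Response:\n{entry['output']}")
```

With an empty `input`, the prompt goes straight from the instruction to the response section.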


In [13]:
# Split the data into training, test, and validation sets
train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.1)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

print("Train size: ", len(train_data))
print("Test size: ", len(test_data))
print("Val size: ", len(val_data))

Train size:  935
Test size:  110
Val size:  55


## Batching for instruction fine-tuning

We need to implement our own batching process, because instruction fine-tuning requires the extra steps listed below.

Batching process for instruction fine-tuning:
- Apply the prompt template
- Tokenize the text
- Pad the sequences to a common length
- Create target token IDs
- Replace padding tokens with -100 to mask them in the loss function
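
The last three steps can be sketched in plain Python on already-tokenized entries. This is a minimal sketch under several assumptions that the text above does not state: the targets are the inputs shifted by one position, the first padding token in each target is kept so the model still learns to emit an end-of-text token, and the padding ID is 50256 (the GPT-2 `<|endoftext|>` token). The name `collate_sketch` is hypothetical, and a real collate function would return tensors rather than lists. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def collate_sketch(batch, pad_token_id=50256, ignore_index=-100):
    """Pad a batch of token-ID lists and build shifted, masked targets."""
    # Pad to the longest sequence plus one extra slot, so that shifting
    # by one position still leaves equal-length inputs and targets
    max_len = max(len(item) for item in batch) + 1

    inputs_batch, targets_batch = [], []
    for item in batch:
        padded = item + [pad_token_id] * (max_len - len(item))
        inputs = padded[:-1]   # everything except the last token
        targets = padded[1:]   # the same sequence shifted left by one

        # Keep the first padding token as an end-of-text target and
        # replace every later padding token with ignore_index
        pad_positions = [i for i, tok in enumerate(targets) if tok == pad_token_id]
        for i in pad_positions[1:]:
            targets[i] = ignore_index

        inputs_batch.append(inputs)
        targets_batch.append(targets)

    return inputs_batch, targets_batch


# Toy token IDs standing in for tokenized, templated entries
batch = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]
inputs, targets = collate_sketch(batch)
print(inputs)
print(targets)
```

Positions set to -100 contribute nothing to the loss, so the model is not trained to predict padding.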
