# Chapter 7: Fine-tuning to follow instructions

## 7.1 Introduction to instruction fine-tuning

#### We know that a pretrained LLM is capable of text completion, meaning it can finish sentences or write text paragraphs given a fragment as input. However, pretrained LLMs often struggle with specific instructions, such as "Fix the grammar in this text" or "Convert this text to passive voice".

#### We will focus on improving the LLM's ability to follow such instructions and generate a desired response.

<div style="max-width:700px">
    
![](images/7.1_1.png)

</div>

#### We will begin with the first stage of the three-stage instruction fine-tuning process.

<div style="max-width:700px">
    
![](images/7.1_2.png)

</div>


## 7.2 Preparing a dataset for supervised instruction fine-tuning

#### We will download and format the instruction dataset used to fine-tune a pretrained LLM. This dataset contains 1,100 instruction-response pairs similar to those in the first figure.

In [1]:
import json
import os
import urllib.request

def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return data

file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)


data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


#### This code implements and executes a function that downloads the dataset, saves it locally as a JSON file, and loads it into a Python list of dictionaries.

In [2]:
print("Example entry:\n", data[50])

Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}


#### As we can see, each example is a dictionary with the keys 'instruction', 'input', and 'output'.

In [3]:
print("Another example entry:\n", data[999])

Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}


#### As shown above, the 'input' field may occasionally be empty.

#### Instruction fine-tuning involves training a model on explicit input-output pairs. There are various ways to format these entries for LLMs. The figure below showcases two different example formats, known as 'prompt styles', used in the training of LLMs such as Alpaca and Phi-3.

<div style="max-width:800px">
    
![](images/7.2_1.png)

</div>

#### We will use the Alpaca prompt style because it is one of the most popular ones.

#### Let's define a 'format_input' function that we can use to convert the entries in the data list into the Alpaca-style input format.
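
#### For comparison, the Phi-3 style could be produced by a helper along the following lines. This is only a sketch: the function name `format_input_phi3` is hypothetical, and the exact placement of the `<|user|>` and `<|assistant|>` markers is an assumption about the Phi-3 template rather than something defined in this chapter.

```python
def format_input_phi3(entry):
    # Hypothetical helper, not part of the chapter's code.
    # Assumed layout: a <|user|> marker, the instruction (plus the optional
    # input on its own line), then an <|assistant|> marker for the response.
    user_text = entry["instruction"]
    if entry["input"]:
        user_text += f"\n{entry['input']}"
    return f"<|user|>\n{user_text}\n\n<|assistant|>\n"


example = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

prompt = format_input_phi3(example)
print(prompt + example["output"])
```

#### Compared with the Alpaca style, this layout has no fixed preamble, so each formatted example uses fewer tokens.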

In [6]:
def format_input(entry):
    # Alpaca-style preamble followed by the task instruction
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Add the input section only when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text

In [9]:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
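
#### When the 'input' field is empty, the formatted prompt should contain no '### Input:' section at all. The sketch below illustrates this with the second example entry shown earlier. It repeats a compact copy of format_input, assuming the conventional '### Instruction:' spelling with a space, so that it can run on its own.

```python
def format_input(entry):
    # Compact stand-alone copy of the Alpaca-style formatter
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # An empty input adds nothing to the prompt
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


entry = {
    "instruction": "What is an antonym of 'complicated'?",
    "input": "",
    "output": "An antonym of 'complicated' is 'simple'.",
}

prompt = format_input(entry) + f"\n\n### Response:\n{entry['output']}"
print(prompt)
```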


#### Before we set up the data loaders, let's partition the dataset into training, validation, and test sets.

In [10]:
train_portion = int(len(data) * 0.85)
test_portion = int(len(data) * 0.1)
val_portion = len(data) - train_portion - test_portion

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110
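
#### The split sizes follow directly from the arithmetic: 85% of 1,100 is 935, 10% is 110, and the remaining 55 entries form the validation set. A minimal sketch that wraps the same slicing in a reusable helper is shown below. The name `split_dataset` is hypothetical, and a list of integers stands in for the real entries.

```python
def split_dataset(data, train_frac=0.85, test_frac=0.1):
    # Hypothetical helper mirroring the slicing above:
    # training entries first, then test entries, then validation entries
    train_end = int(len(data) * train_frac)
    test_end = train_end + int(len(data) * test_frac)
    train_split = data[:train_end]
    test_split = data[train_end:test_end]
    val_split = data[test_end:]
    return train_split, val_split, test_split


dummy_data = list(range(1100))  # stand-in for the 1,100 dataset entries
train_split, val_split, test_split = split_dataset(dummy_data)
print(len(train_split), len(val_split), len(test_split))
```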


#### Now we can focus on developing the method for constructing training batches for fine-tuning the LLM.

## 7.3 Organizing data into training batches

<div style="max-width:700px">
    
![](images/7.3_1.png)

</div>

#### Previously, the training batches were created automatically by the PyTorch DataLoader class, which employs a 'collate' function to combine lists of samples into batches. A collate function takes a list of individual samples and merges them into a single batch.

#### However, for instruction fine-tuning, we need to create our own collate function that plugs into the DataLoader. We implement this custom collate function to handle the specific requirements and formatting of our instruction fine-tuning dataset.
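
#### As a rough illustration of how a collate function plugs into the DataLoader, the sketch below passes a hypothetical `pad_collate` function through the `collate_fn` argument. The toy token ID lists and the padding value 0 are assumptions chosen only for the demonstration. The collate function developed in this section is more involved.

```python
import torch
from torch.utils.data import DataLoader

# Toy "dataset": a plain list of variable-length token ID lists
samples = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]


def pad_collate(batch, pad_id=0):
    # Hypothetical collate function: pad every sample to the length
    # of the longest sample in this particular batch
    max_len = max(len(item) for item in batch)
    padded = [item + [pad_id] * (max_len - len(item)) for item in batch]
    return torch.tensor(padded)


# The DataLoader calls pad_collate on each list of samples it draws
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)

batches = list(loader)
for b in batches:
    print(b.shape)
```

#### Because the samples have different lengths, the default collate function would not be able to stack them into a single tensor, which is why a custom function is needed here.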

<div style="max-width:700px">
    
![](images/7.3_2.png)

</div>

#### The figure above breaks down the batching process into several steps. 

#### First, to implement steps 2.1 and 2.2, we create an InstructionDataset class that applies format_input and pretokenizes all inputs in the dataset. This two-step process is detailed below and is implemented in the `__init__` constructor method of the InstructionDataset class.

<div style="max-width:800px">
    
![](images/7.3_3.png)

</div>

In [11]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pretokenize every formatted entry once, up front
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, index):
        # Return the token IDs of the entry at the given index
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

#### Similar to the approach used for classification fine-tuning, we want to collect multiple training examples in a batch, which requires padding all inputs to the same length. We do this by appending the token ID of the <|endoftext|> token directly to the pretokenized inputs.

#### Moving on to step 2.3 of the process, we develop a custom collate function that we can pass to the data loader. This custom collate function pads the training examples in each batch to the same length while allowing different batches to have different lengths. This way, we avoid unnecessary padding by extending sequences only to the longest one in the batch, not the longest one in the whole dataset.
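
#### To see why per-batch padding helps, the following sketch counts the padding tokens needed when sequences are padded to the longest sequence in their own batch versus the longest sequence in the whole dataset. The sequence lengths are made-up numbers for illustration.

```python
# Made-up sequence lengths, grouped into batches of three
lengths = [5, 2, 3, 12, 4, 6]
batch_size = 3
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

# Padding every sequence to the longest one in the whole dataset
global_max = max(lengths)
global_padding = sum(global_max - n for n in lengths)

# Padding every sequence only to the longest one in its own batch
per_batch_padding = sum(max(b) - n for b in batches for n in b)

print("Padding tokens (whole-dataset max):", global_padding)
print("Padding tokens (per-batch max):", per_batch_padding)
```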

<div style="max-width:700px">
    
![](images/7.3_4.png)

</div>

In [16]:
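def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):
    # Length of the longest sequence in the batch, plus one for the
    # end-of-text token appended to every sequence below
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst = []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]  # append an <|endoftext|> token

        # Pad the sequence to batch_max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        # Drop the last token, which was added only as extra padding
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)

    # Stack the padded sequences into one tensor and move it to the device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor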

In [17]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (
    inputs_1,
    inputs_2,
    inputs_3
)
print(custom_collate_draft_1(batch))

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


#### We have implemented our first custom collate function to create batches from lists of inputs. Now, we must modify it to also create the target token IDs corresponding to the batch of input IDs.

<div style="max-width:700px">
    
![](images/7.3_5.png)

</div>

In [19]:
def custom_collate_draft_2(batch, pad_token_id=50256, device="cpu"):
    # Length of the longest sequence in the batch, plus one for the
    # end-of-text token appended to every sequence below
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, target_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]

        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])   # all tokens except the last
        targets = torch.tensor(padded[1:])   # inputs shifted one position to the right
        inputs_lst.append(inputs)
        target_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(target_lst).to(device)
    return inputs_tensor, targets_tensor

inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])
