# Chapter 6: Fine-tuning for classification

<h4>
    
Now we will fine-tune our LLM on a specific target task, such as classifying text.

<div style="max-width:600px">
    
![](images/6.0_1.png)

</div>

This figure highlights the two main ways of fine-tuning an LLM.

</h4>

## 6.1 Different categories of fine-tuning

<h4>

The two most common ways to fine-tune language models are:

1. Instruction fine-tuning
2. Classification fine-tuning

<div style="max-width:600px">
    
![](images/6.1_1.png)

</div>

Instruction fine-tuning involves training a language model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described in natural language prompts, as illustrated above.

<div style="max-width:600px">
    
![](images/6.1_2.png)

</div>

In classification fine-tuning, the model is trained to recognize a specific set of class labels, such as "spam" and "not spam". Classification tasks can include things like identifying different species of plants from images, categorizing news articles into topics like sports, politics, and technology, and distinguishing between benign and malignant tumors in medical imaging. 

The key point is that a classification fine-tuned model is restricted to predicting the classes it encountered during training. For example, in the image above, the model can determine whether something is "spam" or "not spam", but it cannot say anything else about the input.
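
A minimal sketch of this restriction in plain Python. The `CLASS_NAMES` list, the `predict_label` helper, and the score values are illustrative assumptions, not part of the chapter's code. Whatever the input, the prediction is always one of the labels fixed at training time.

```python
# Hypothetical label set fixed at training time.
CLASS_NAMES = ["not spam", "spam"]

def predict_label(logits):
    """Return the class name with the highest score (argmax over the logits)."""
    if len(logits) != len(CLASS_NAMES):
        raise ValueError("expected one score per known class")
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return CLASS_NAMES[best_index]

# Made-up scores for two messages; the result can only ever be one of CLASS_NAMES.
print(predict_label([2.3, -1.1]))
print(predict_label([-0.4, 3.7]))
```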


Advantages and disadvantages of each approach:

- Classification fine-tuned models require less data and compute power, but their use is confined to the specific classes on which they have been trained.
- Classification fine-tuned models are easier to develop.
- Instruction fine-tuned models can undertake a broader range of tasks, but are harder to develop.
- Instruction fine-tuning requires a larger dataset and greater computational resources to develop models proficient in various tasks.
  
    
</h4>

## 6.2 Preparing the dataset

<h4>

We will fine-tune the GPT model we previously implemented and pretrained so that it can classify text. We begin by downloading a text message dataset that consists of spam and non-spam messages.


<div style="max-width:600px">
    
![](images/6.2_1.png)

</div>

The three-stage process for classification fine-tuning:

1. Dataset preparation
2. Model setup
3. Model fine-tuning and evaluation

</h4>

In [2]:
import urllib.request
import zipfile
import os
from pathlib import Path

url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

def download_and_unzip_spam_dataset(url, zip_path, extracted_path, data_file_path):
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction")
        return
        
    with urllib.request.urlopen(url) as response: # downloads the file
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    with zipfile.ZipFile(zip_path, "r") as zip_ref: # unzips file
        zip_ref.extractall(extracted_path)

    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")

download_and_unzip_spam_dataset(url, zip_path, extracted_path, data_file_path)

sms_spam_collection/SMSSpamCollection.tsv already exists. Skipping download and extraction


In [3]:
import pandas as pd
df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
df

| | Label | Text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


<h4>
Let's examine the class label distribution.
</h4>

In [4]:
print(df["Label"].value_counts())

Label
ham     4825
spam     747
Name: count, dtype: int64


<h4>
Because the dataset contains far more "ham" (not spam) messages than "spam" messages, we will undersample it to include 747 instances of each class.
</h4>
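
<h4>

To make the imbalance concrete, the short calculation below uses only the two counts printed above. The variable names are illustrative, and the sketch does not touch the actual DataFrame.

</h4>

```python
# Class counts taken from the value_counts() output above.
num_ham = 4825
num_spam = 747

total = num_ham + num_spam
spam_share = num_spam / total

# Undersampling keeps every spam message and an equally sized random ham subset.
balanced_total = 2 * num_spam

print(f"spam share before balancing: {spam_share:.1%}")
print(f"rows after balancing: {balanced_total}")
```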

In [5]:
def create_balanced_dataset(df):
    num_spam = df[df["Label"] == "spam"].shape[0] # Counts instances of spam 
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123) # randomly samples "ham" instances to match number of spam instances
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]]) # combines ham subset with "spam"
    return balanced_df

balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())

Label
ham     747
spam    747
Name: count, dtype: int64


<h4>
Now we convert the string class labels to integers:
    
- "ham" -> 0
- "spam" -> 1
</h4>

In [6]:
balanced_df["Label"] = balanced_df["Label"].map({"ham":0, "spam":1})

<h4>
Now we create a random_split function to split our dataset into train, validation, and test portions
</h4>

In [7]:
def random_split(df, train_frac, validation_frac):
    df = df.sample(frac=1, random_state=123).reset_index(drop=True) # shuffles entire dataframe

    # calculates split indices
    train_end = int(len(df) * train_frac)            
    validation_end = train_end + int(len(df) * validation_frac) 

    # splits df
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]

    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
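
<h4>

As a sanity check, the split sizes can be worked out from the index arithmetic alone. The helper `split_sizes` is a hypothetical stand-in that mirrors the calculation in random_split without shuffling or slicing any data. The row count 1494 comes from the balanced dataset (747 + 747).

</h4>

```python
def split_sizes(num_rows, train_frac, validation_frac):
    """Mirror the index arithmetic of random_split without touching any data."""
    train_end = int(num_rows * train_frac)
    validation_end = train_end + int(num_rows * validation_frac)
    # The test portion is whatever remains after the first two slices.
    return train_end, validation_end - train_end, num_rows - validation_end

# 1494 rows = 747 ham + 747 spam in the balanced dataset.
train_size, validation_size, test_size = split_sizes(1494, 0.7, 0.1)
print(train_size, validation_size, test_size)
```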

In [8]:
# save df's to CSV files to be used later
train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)

<h4>
So far, we have downloaded the dataset, balanced it, and split it into training, validation, and testing subsets. Now we set up the PyTorch dataloaders that will be used to train the model.
</h4>

## 6.3 Creating data loaders

<h4>

Previously, we utilized a sliding window to generate uniformly sized text chunks, which were grouped into batches for more efficient model training. Each chunk functioned as an individual training instance. 

However, we are now working with a dataset that contains text messages of varying lengths. To batch these messages as we did with the text chunks, we have two options:

1. Truncate all messages to the shortest message in the dataset or batch
2. Pad all messages to the length of the longest message in the dataset or batch

The first option is computationally cheaper, but it can discard significant information if the shortest message is much shorter than the average or longest messages, which can reduce model performance. Therefore, we opt for the second option, which preserves the entire content of all messages.

We can add padding tokens to all shorter messages until their length matches that of the longest message in the dataset. We can use "<|endoftext|>" as a padding token.

<div style="max-width:800px">
    
![](images/6.3_1.png)

</div>

However, instead of appending the string "<|endoftext|>" to each of the text messages directly, we can add the token ID corresponding to "<|endoftext|>" to the encoded messages, as shown in the figure above.

</h4>

In [9]:
# Check to see what the encoding of <|endoftext|> is 

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

[50256]
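
<h4>

The padding idea can be sketched in plain Python before wrapping it in a dataset class. The helper `pad_to_longest` and the toy token IDs are assumptions made for illustration. Only the padding ID 50256 comes from the tokenizer output above.

</h4>

```python
PAD_TOKEN_ID = 50256  # token ID of "<|endoftext|>" in the GPT-2 tokenizer

def pad_to_longest(encoded_texts, pad_token_id=PAD_TOKEN_ID):
    """Right-pad every list of token IDs to the length of the longest one."""
    max_length = max(len(ids) for ids in encoded_texts)
    return [ids + [pad_token_id] * (max_length - len(ids)) for ids in encoded_texts]

# Made-up token IDs standing in for three encoded messages of different lengths.
toy_batch = [[10, 11, 12, 13], [20, 21], [30]]
for row in pad_to_longest(toy_batch):
    print(row)
```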


<h4>

Before we can instantiate the data loaders, we need to implement a PyTorch dataset, which specifies how the data is loaded and processed.

For this purpose, we define the SpamDataset class, which implements the concept in the figure above. It handles several key tasks:

- identifies the longest sequence in the training dataset
- encodes the text messages
- ensures that all other sequences are padded with a padding token to match the length of the longest sequence
</h4>

In [10]:
import torch
from torch.utils.data import Dataset

In [11]:
class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)
        
        self.encoded_texts = [tokenizer.encode(text) for text in self.data["Text"]] # tokenizes the texts

        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            self.encoded_texts = [encoded_text[:self.max_length] for encoded_text in self.encoded_texts] # truncates text sequence if longer than max_length

        self.encoded_texts = [encoded_text + [pad_token_id] * (self.max_length - len(encoded_text)) for encoded_text in self.encoded_texts] # adds padding

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (torch.tensor(encoded, dtype=torch.long), torch.tensor(label, dtype=torch.long))

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            max_length = max(max_length, len(encoded_text))
        return max_length

<h4>
    
This class loads the data from the CSV file, tokenizes the text messages, and pads or truncates the sequences to a uniform length determined by either the longest sequence or a maximum length parameter.

This ensures each tensor is of the same size, which is necessary to create batches in the training data loader we implement next.

</h4>

In [12]:
train_dataset = SpamDataset(
    csv_file="train.csv",
    max_length=None,
    tokenizer=tokenizer
)

print(train_dataset.max_length)

120


<h4>

The output 120 shows that the longest text message in the training set is only 120 tokens long. The model can handle sequences of up to 1024 tokens, given its context_length limit. If the dataset included texts exceeding that length, you could pass max_length=1024 when constructing the dataset to ensure that the data does not exceed the model's supported input length.
    
</h4>
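
<h4>

The truncation safeguard amounts to a single slice, in the same way SpamDataset truncates when max_length is given. The helper `clip_to_context` and the made-up sequences below are assumptions for illustration only.

</h4>

```python
CONTEXT_LENGTH = 1024  # maximum sequence length supported by the model

def clip_to_context(encoded_text, context_length=CONTEXT_LENGTH):
    """Keep at most context_length token IDs, dropping any that follow."""
    return encoded_text[:context_length]

# Made-up sequences: one short message and one that exceeds the context length.
short_message = list(range(120))
long_message = list(range(1500))

print(len(clip_to_context(short_message)))
print(len(clip_to_context(long_message)))
```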

In [13]:
val_dataset = SpamDataset(
    csv_file="validation.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)

test_dataset = SpamDataset(
    csv_file="test.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)

<h4>

Using the datasets as inputs, we can instantiate the data loaders much as we did when working with text data. However, in this case the input is a text sequence and the target is a class label (rather than the next token in the text).

For instance, if we choose a batch size of 8, each batch will consist of eight training examples of length 120 and the corresponding class label of each example.

<div style="max-width:700px">
    
![](images/6.3_2.png)

</div>

</h4>

In [14]:
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8
torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False
)

In [16]:
# for input_batch, target_batch in train_loader:
#     pass
# print("Input batch dimensions:",input_batch.shape)
# print("Target batch dimensions:", target_batch.shape)

it = iter(train_loader)
inputs, label = next(it)

print("The first training sample out of the eight in the batch:\n",inputs[0, :])
print("\nShape of inputs in whole train batch:\n", inputs.shape)
print("\nThe first target label out of the eight in the batch:\n", label[0])
print("\nShape of target labels in whole train batch:\n", label.shape)

The first training sample out of the eight in the batch:
 tensor([ 4805,  3824,  6158,     0,  3406,  5816, 10781, 21983,   329,   657,
         3695, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256])

Shap

<h4>

You can see the input batch consists of eight training examples with 120 tokens each, as expected. The label tensor stores the class labels corresponding to the eight training examples.

Lastly, to get an idea of the dataset size, let's print the total number of batches in each dataset.

</h4>

In [17]:
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} testing batches")

130 training batches
19 validation batches
38 testing batches
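
<h4>

These batch counts follow from the split sizes and the drop_last settings. The sketch below reproduces the arithmetic with a hypothetical helper `num_batches`. The split sizes 1045, 149, and 300 are derived here from applying the 0.7 and 0.1 fractions to the 1494 balanced rows. They are not printed anywhere in the notebook.

</h4>

```python
import math

def num_batches(num_examples, batch_size, drop_last):
    """Number of batches a data loader yields for a dataset of num_examples."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_examples // batch_size
    # An incomplete final batch is kept.
    return math.ceil(num_examples / batch_size)

# Split sizes implied by random_split(balanced_df, 0.7, 0.1) on 1494 rows.
print(num_batches(1045, 8, drop_last=True), "training batches")
print(num_batches(149, 8, drop_last=False), "validation batches")
print(num_batches(300, 8, drop_last=False), "testing batches")
```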


## 6.4 Initializing a model with pretrained weights

<h4>

We start the process of preparing the model for classification fine-tuning by initializing our pretrained model.

<div style="max-width:600px">
    
![](images/6.4_1.png)

</div>
    
</h4>

In [18]:
BASE_CONFIG = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "drop_rate": 0.0,       # Dropout rate
    "qkv_bias": True        # Query-key-value bias
}

model_configs = {
    "gpt2-small-124M": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium-355M": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large-774M": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl-1558M": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-small-124M"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

In [19]:
import os
from previous_chapters import GPTModel, generate_text_simple, text_to_token_ids, token_ids_to_text

file_name = f"{CHOOSE_MODEL}.pth"
url = f"https://huggingface.co/rasbt/gpt2-from-scratch-pytorch/resolve/main/{file_name}"

if not os.path.exists(file_name):
    urllib.request.urlretrieve(url, file_name)
    print(f"Downloaded to {file_name}")


model = GPTModel(BASE_CONFIG)
model.load_state_dict(torch.load(file_name, weights_only=True))
model.eval()
model.to("cpu")
print("")




In [20]:
text_1 = "Every effort moves you"
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))

Every effort moves you forward.

The first step is to understand the importance of your work


<h4>
    
The output shows the model generates coherent text, which indicates that the model weights have been loaded correctly.

Before we start fine-tuning the model as a spam classifier, let's see whether it can already classify spam messages when prompted with instructions.
    
</h4>

In [21]:
text_2 = (
    "Is the following text 'spam'? Answer with a 'yes' or 'no':'You are a winner you have been specially selected to recieve $1000 cash or a $2000 award.'"
)

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_2, tokenizer),
    max_new_tokens=23,
    context_size=BASE_CONFIG["context_length"]
)

print(token_ids_to_text(token_ids, tokenizer))

Is the following text 'spam'? Answer with a 'yes' or 'no':'You are a winner you have been specially selected to recieve $1000 cash or a $2000 award.'

The following text 'spam'? Answer with a 'yes' or 'no':'You are a


<h4>

Based on the output, the model clearly struggles to follow instructions. This is expected, as it has only undergone pretraining and lacks instruction fine-tuning. Now, let's prepare the model for classification fine-tuning.
  
</h4>

## 6.5 Adding a classification head

<h4>

Now we must modify the pretrained LLM to prepare it for classification fine-tuning. To do this, we replace the original output layer (which maps the hidden representation to a vocabulary of 50,257 tokens) with a smaller output layer that maps to two classes (0 and 1).

<div style="max-width:800px">
    
![](images/6.5_1.png)

</div>
    
</h4>

In [26]:
def print_gpt_model_summary(model):
    print("GPTModel(")
    print(f"  (tok_emb): {model.tok_emb}")
    print(f"  (pos_emb): {model.pos_emb}")
    print(f"  (drop_emb): {model.drop_emb}")
    print(f"  (trf_blocks): Sequential(")
    print(f"    (0..11): TransformerBlock × 12")
    print("      (att): MultiHeadAttention(")
    print(f"        (W_query): {model.trf_blocks[0].att.W_query}")
    print(f"        (W_key): {model.trf_blocks[0].att.W_key}")
    print(f"        (W_value): {model.trf_blocks[0].att.W_value}")
    print(f"        (out_proj): {model.trf_blocks[0].att.out_proj}")
    print(f"        (dropout): {model.trf_blocks[0].att.dropout}")
    print("      )")
    print("      (ff): FeedForward(")
    print(f"        {model.trf_blocks[0].ff.layers}")
    print("      )")
    print(f"      (norm1): {model.trf_blocks[0].norm1}")
    print(f"      (norm2): {model.trf_blocks[0].norm2}")
    print(f"      (drop_resid): {model.trf_blocks[0].drop_resid}")
    print("  )")
    print(f"  (final_norm): {model.final_norm}")
    print(f"  (out_head): {model.out_head}")
    print(")")

print_gpt_model_summary(model)

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.0, inplace=False)
  (trf_blocks): Sequential(
    (0..11): TransformerBlock × 12
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=True)
        (W_key): Linear(in_features=768, out_features=768, bias=True)
        (W_value): Linear(in_features=768, out_features=768, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (ff): FeedForward(
        Sequential(
  (0): Linear(in_features=768, out_features=3072, bias=True)
  (1): GELU()
  (2): Linear(in_features=3072, out_features=768, bias=True)
)
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_resid): Dropout(p=0.0, inplace=False)
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)
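
<h4>

The summary shows that out_head maps 768 features to 50,257 outputs without a bias term. To get a feel for how much smaller a two-class head is, the parameter counts can be worked out by hand. The variable names are illustrative, and the count for the new head assumes a bias term is included, which is the torch.nn.Linear default.

</h4>

```python
emb_dim = 768        # embedding size of the 124M-parameter model
vocab_size = 50257   # size of the GPT-2 vocabulary
num_classes = 2      # "not spam" and "spam"

# Original head: a linear layer without bias, as shown in the model summary.
original_head_params = emb_dim * vocab_size

# Replacement head: weights plus one bias per class
# (assumes the bias is enabled, which is the torch.nn.Linear default).
new_head_params = emb_dim * num_classes + num_classes

print(f"original head: {original_head_params:,} parameters")
print(f"new head: {new_head_params:,} parameters")
```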


<h4>
    
We want to replace "out_head" with a new output layer that we will fine-tune.

To get the model ready for classification fine-tuning, we first freeze the model, meaning we make all layers nontrainable.

</h4>

In [27]:
for param in model.parameters():
    param.requires_grad = False

<h4>
After freezing the model, we replace the output layer (model.out_head).
</h4>

In [28]:
torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes
)


#### The parameters of this new output layer have requires_grad set to True by default, so it is the only layer in the model that will be updated during training.

#### Technically, this can be sufficient for fine-tuning, but the performance is increased when you also include the last transformer block and the final 