# Instruction Fine Tuning

We'll now take our base model and fine-tune it into a chatbot like ChatGPT :).

AFAIK, OpenAI never created a fine-tuned instruct model of GPT2, so we're in uncharted territory now. It also means that this can either turn out great, or just horrible.

## Preparing the Dataset

I have a dataset that I synthetically generated with the included `dataset_generation.py` script, using `gpt-4o` through the OpenAI API.

You can run this script yourself to generate as many conversations as you need. Note that this does cost money, but you can also generate the data using a local LLM such as Llama or Qwen.

I have included a small version of the dataset in this repo called `shared/data/conversations-sm.jsonl`. It contains about 10K multi-turn conversations of up to 1000 tokens each, which is just under the maximum context length of 1024 tokens for GPT2.

In [None]:
import jsonl

# dataset_file_path = "data/small-conversations/conversations.jsonl"
dataset_file_path = "shared/data/conversations-sm.jsonl"

with open(dataset_file_path, "r") as f:
  all_data = list(jsonl.load(f))

len(all_data), all_data[0]

(79808,
 [{'role': 'user', 'content': "Hi there, how's your day going?"},
  {'role': 'assistant',
   'content': 'Hello! My day is going great, thank you for asking. How about yours?'},
  {'role': 'user',
   'content': "It's been a bit busy, but I'm managing. Any interesting facts to share?"},
  {'role': 'assistant',
   'content': 'Did you know that honey never spoils? Archaeologists have found pots of honey in ancient Egyptian tombs that are over 3000 years old and still edible!'},
  {'role': 'user',
   'content': "Wow, that's fascinating! I never would have guessed. What makes honey last so long?"},
  {'role': 'assistant',
   'content': "Honey's longevity is due to its low moisture content and acidic nature, which make it inhospitable for bacteria and microorganisms to grow."},
  {'role': 'user',
   'content': "That makes sense. Nature really is amazing, isn't it?"},
  {'role': 'assistant',
   'content': "Absolutely! There's so much complexity and wonder in the natural world. Is there

We'll also shuffle the entire dataset in place to get a good mix of long and short conversations.

In [70]:
import random

random.shuffle(all_data)

all_data[0]

[{'role': 'user',
  'content': "Hey there, I'm curious about how rainbows are formed. Can you explain that to me?"},
 {'role': 'assistant',
  'content': 'Of course! Rainbows are a meteorological phenomenon caused by reflection, refraction, and dispersion of light in water droplets, resulting in a spectrum of light appearing in the sky. It happens when sunlight breaks into its constituent colors: red, orange, yellow, green, blue, indigo, and violet.'},
 {'role': 'user',
  'content': "That's interesting! So, is it just sunlight that creates rainbows, or can artificial light make them too?"},
 {'role': 'assistant',
  'content': 'Great question! While natural sunlight is the most common source for rainbows, artificial light can also create similar effects. However, natural sunlight contains a full spectrum of colors, which makes natural rainbows more vivid compared to those created by most artificial light sources.'},
 {'role': 'user',
  'content': "How do the angles work in forming a rain

I'll use 90% of the data for training. The remaining 10% is split again: 90% of it becomes validation data, and the rest becomes test data.

In [71]:
train_end_idx = int(0.9 * len(all_data))
train_data = all_data[:train_end_idx]

val_and_test_data = all_data[train_end_idx:]
val_end_idx = int(0.9 * len(val_and_test_data))

val_data = val_and_test_data[:val_end_idx]
test_data = val_and_test_data[val_end_idx:]

len(train_data), len(val_data), len(test_data)

(71827, 7182, 799)

I have personally done my best to ensure the data set is clean, but let's filter out any strange entries just in case!

In [72]:
train_data = list(filter(lambda x: x is not None, train_data))
val_data = list(filter(lambda x: x is not None, val_data))
test_data = list(filter(lambda x: x is not None, test_data))

len(train_data), len(val_data), len(test_data)

(71827, 7182, 799)

For good measure, let's shuffle the dataset one more time!

In [73]:
import random
random.shuffle(train_data)

## How to Create a Dataset and DataLoader for Multi-Turn Fine-Tuning

**Note**: The most important implementations: `InstructionDataset` and `custom_collate_fn` are also found in the `fine_tuning.py` script. If you want to run the fine tuning outside of the notebook, you can import the modules into a separate script. 

Here is the most interesting part.

In [74]:
device = 'cuda:1'

## "Alpaca" Style Template
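
Each conversation is flattened into a single Alpaca-style document before it is tokenized. As a rough sketch of that layout (the helper name `format_conversation` is hypothetical and is not necessarily how the repo's `format_input` is written), the pieces fit together like this:

```python
# Hypothetical helper for illustration only; the repo's own formatting code may differ.
ALPACA_PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def format_conversation(conversation):
    """Flatten a list of {'role', 'content'} messages into one Alpaca-style string."""
    parts = [ALPACA_PREAMBLE]
    for message in conversation:
        if message["role"] == "user":
            parts.append("\n\n### Instruction:\n" + message["content"])
        else:
            # Every assistant turn ends with a blank line, which later serves as the stop marker.
            parts.append("\n\n### Response:\n" + message["content"] + "\n\n")
    return "".join(parts)

example = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(format_conversation(example))
```

User turns go under `### Instruction:` and assistant turns under `### Response:`, so a multi-turn conversation is just the same two blocks repeated.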



## Masking Examples for Multi-Turn Conversations

Good resources: https://www.together.ai/blog/fine-tuning-llms-for-multi-turn-conversations-a-technical-deep-dive

Okay, this is an explanation that I want to get right, so it's going to be a longer discussion before we move on.

First, masking doesn't mean to "hide" data from the model. The point of masking is to take some tokens out of consideration when computing the loss. For our training run, we mark those tokens with the value `-100`. This tells PyTorch's cross-entropy loss function not to penalize us on the target positions that hold `-100`.
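
To make that concrete, here is a small sketch with made-up logits and targets (the shapes and values are arbitrary assumptions). It checks that putting `-100` in the targets gives the same loss as computing it on the unmasked positions alone:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: 4 token positions, vocabulary of 5 tokens.
logits = torch.randn(4, 5)
targets = torch.tensor([2, 0, 4, 1])

# Mask the first two positions by replacing their targets with -100.
masked_targets = targets.clone()
masked_targets[:2] = -100

# -100 is the default ignore_index of cross_entropy; it is spelled out here for clarity.
loss_masked = F.cross_entropy(logits, masked_targets, ignore_index=-100)

# Equivalent: average the loss over the two unmasked positions only.
loss_kept_only = F.cross_entropy(logits[2:], targets[2:])

print(loss_masked.item(), loss_kept_only.item())
```

The model still sees the masked tokens as input; they just don't contribute to the loss or the gradients.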

The key is to hold on to the principle that an LLM just generates a single token at a time, repeatedly. Don't focus on getting the model to generate a response. Focus on getting the model to predict the next token, which eventually forms the response when the prediction runs in a loop, over and over again. After many token generations, the model will have generated a complete response. It's all an illusion.

**Single Turn Conversations**

In the single-turn conversation scenario, it's easier to form the dataset and mask. It's as simple as masking everything that is _not_ the response in the template. Then in every training batch, the model is tasked with predicting only the response, and the loss (the error calculation) only penalizes its attempts on those tokens.

Together AI has this graphic which I found useful to understand:

<img src="img/mask-techniques.png" width="800">

**Multi-Turn Conversations**

Now if you imagine that the conversation is a flat Alpaca document, the concept is pretty much the same. Instead of keeping just the last response, in a multi-turn conversation we're going to force the model to predict _all_ the responses.





### Visualization of Multi-Turn Conversation Masking

Here's an example to visualize the masking:

```text
Below is an instruction that describes a task. Write a response that appropriately completes the request.
^^^^^^^^^^ MASK THIS ^^^^^^^^^^

### Instruction:
^^ MASK THIS ^^
Hello
^^ MASK THIS ^^

### Response:
^^ MASK THIS ^^
Hello! How can I help you today?
^^ DON'T MASK - THIS IS WHAT WE TRAIN ON ^^

### Instruction:
^^ MASK THIS ^^
I'm learning about endangered species.
^^ MASK THIS ^^

### Response:
^^ MASK THIS ^^
That's great! Endangered species are...
^^ DON'T MASK - THIS IS WHAT WE TRAIN ON ^^
```

Notice the response _content_ is what we want to train on. So the targets will have these tokens exposed, while the rest of the content will be represented as `-100`.
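
At the token level there is one extra detail: targets are the inputs shifted left by one position, so the mask has to be applied to the token being predicted. A minimal sketch with made-up token IDs (all values here are illustrative assumptions):

```python
IGNORE_INDEX = -100

# Toy token IDs and a parallel mask (True = exclude from the loss).
tokens = [10, 11, 12, 13, 14]
mask = [True, True, False, False, True]

# Next-token prediction: position j of the input predicts token j + 1.
inputs = tokens[:-1]
targets = tokens[1:]

# A target is ignored when the token it refers to (index j + 1) is masked.
targets = [
    IGNORE_INDEX if mask[j + 1] else target
    for j, target in enumerate(targets)
]

print(inputs)
print(targets)
```

Only the two unmasked tokens (`12` and `13`) survive as targets; everything else becomes `-100`.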

## InstructionDataset

Alright finally, here's the big one. Basically as we build up the message, we're going to keep 2 data structures that will allow us to keep track of the tokens and mask. 

- `encoded_texts` - this is the array that will store all the tokenized versions of the Alpaca-transformed conversations
- `masks` - this is the array that will store the masked representation of the `encoded_texts` when applicable. Response content in this case will not be masked.

**The Approach**

For every item in the dataset, we're going to build up _pieces_ of the Alpaca document. These pieces are:

- "Below is an instruction that describes a task. Write a response that appropriately completes the request."
- Then for every turn in the conversation:
  - "\n\n### Instruction:\n"
  - Content for user
  - "\n\n### Response:\n"
  - Content for assistant
  - "\n\n"

Notice the `\n\n` after the assistant response. This is very important as it will train the model to understand when its response should end.

Now, for every one of the parts of the document, we're going to tokenize the part and append it to the current tokens and mask lists. The **only part** that shouldn't be masked is the content for the assistant, including its trailing `\n\n` characters.

Then add the tokenized conversation and corresponding masked message to the `encoded_texts` and `masks` arrays.

Finally, we make sure the `__getitem__` method returns a tuple of the tokenized text and its corresponding mask. This will be used in the collate function.

In [75]:
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
  def __init__(self, data, tokenizer):
    self.data = data

    self.encoded_texts = []
    self.masks = []

    for item in self.data:
      tokens = []
      mask = []

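      # Alpaca preamble (masked, so it never contributes to the loss).
      # It ends with a period so it matches the prompt used at inference time.
      _instruction_base = "Below is an instruction that describes a task. Write a response that appropriately completes the request."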
      _instruction_base_tokens = tokenizer.encode(_instruction_base, allowed_special={"<|endoftext|>"})
      tokens.extend(_instruction_base_tokens)
      mask.extend([True] * len(_instruction_base_tokens))

      for message in item:
        if message["role"] == "user":
          _instruction_content_header = f"\n\n### Instruction:\n"
          _instruction_content_header_tokens = tokenizer.encode(_instruction_content_header, allowed_special={"<|endoftext|>"})
          tokens.extend(_instruction_content_header_tokens)
          mask.extend([True] * len(_instruction_content_header_tokens))

          _instruction_content_content = message["content"]
          _instruction_content_content_tokens = tokenizer.encode(_instruction_content_content, allowed_special={"<|endoftext|>"})
          tokens.extend(_instruction_content_content_tokens)
          mask.extend([True] * len(_instruction_content_content_tokens))
        else:
          _instruction_response_header = f"\n\n### Response:\n"
          _instruction_response_header_tokens = tokenizer.encode(_instruction_response_header, allowed_special={"<|endoftext|>"})
          tokens.extend(_instruction_response_header_tokens)
          mask.extend([True] * len(_instruction_response_header_tokens))

          # Dont mask the assistant response
          _instruction_response_content = message["content"] + "\n\n"
          _instruction_response_content_tokens = tokenizer.encode(_instruction_response_content, allowed_special={"<|endoftext|>"})
          tokens.extend(_instruction_response_content_tokens)
          mask.extend([False] * len(_instruction_response_content_tokens))

      self.encoded_texts.append(tokens)
      self.masks.append(mask)

      
  def __getitem__(self, index):
    return self.encoded_texts[index], self.masks[index]
  
  def __len__(self):
    return len(self.data)

### Collation

This isn't strictly necessary if you're processing 1 item per batch. Essentially, a collation function is just a processor that takes the items of a batch and resizes them so they are all the same length. We _pad_ with the `<|endoftext|>` token ID to make sure every item in the batch is the same size.

How do we determine the size of each item? That's just the maximum length of a conversation in the batch. Therefore, every batch should be slightly different in length -- it's dependent on the longest conversation in the batch. 
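
As a quick illustration of padding to the longest item, here is a sketch that uses PyTorch's built-in `pad_sequence` helper instead of the manual list padding in this notebook's collate function (the toy sequences are assumptions):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_TOKEN_ID = 50256  # <|endoftext|> in the GPT-2 vocabulary

# Two toy "conversations" of different lengths.
batch = [torch.tensor([5, 6, 7]), torch.tensor([8, 9, 10, 11, 12])]

# Pad every sequence on the right up to the length of the longest one.
padded = pad_sequence(batch, batch_first=True, padding_value=PAD_TOKEN_ID)

print(padded.shape)
print(padded)
```

The padded width is 5 here only because the longest sequence has 5 tokens; a different batch would give a different width.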

In [None]:
def custom_collate_fn(
   batch,
   pad_token_id=50256,
   ignore_index=-100,
   allowed_max_length=None,
   device="cpu" 
):
  tokens_batch = [item[0] for item in batch]
  masks_batch = [item[1] for item in batch]

  # find the longest sequence in the batch
  batch_max_length = max(len(tokens) + 1 for tokens in tokens_batch)

  inputs_lst, targets_lst = [], []

  for tokens, mask in zip(tokens_batch, masks_batch):
    new_tokens = tokens.copy()
    new_tokens += [pad_token_id]
    padded_tokens = (
      new_tokens + ([pad_token_id] * (batch_max_length - len(new_tokens)))
    )

    new_mask = mask.copy()
    new_mask += [True] # mask the added padded token
    padded_mask = (
      new_mask + ([True] * (batch_max_length - len(new_mask)))
    )

    inputs = torch.tensor(padded_tokens[:-1])
    targets = torch.tensor(padded_tokens[1:])

    pad_mask = targets == pad_token_id
    indices = torch.nonzero(pad_mask).squeeze()

    if indices.numel() > 1:
      targets[indices[1:]] = ignore_index

    # add the mask
    for j in range(min(len(padded_mask), len(targets))):
      if j + 1 < len(padded_mask) and padded_mask[j + 1]:
        targets[j] = ignore_index

    if allowed_max_length is not None:
      inputs = inputs[:allowed_max_length]
      targets = targets[:allowed_max_length]

    inputs_lst.append(inputs)
    targets_lst.append(targets)

  inputs_tensor = torch.stack(inputs_lst).to(device)
  targets_tensor = torch.stack(targets_lst).to(device)

  return inputs_tensor, targets_tensor

### Create a wrapper

I took this from Sebastian's book. It just builds a simplified version of `custom_collate_fn` with `device` and `allowed_max_length` already filled in.

In [77]:
from functools import partial

customized_collate_fn = partial(
  custom_collate_fn,
  device=device,
  allowed_max_length=1024
)

Finally, now we can build our Dataloaders!

Notice we don't shuffle the validation, and test data loaders. 

In [80]:
from torch.utils.data import DataLoader
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

num_workers = 0
batch_size = 32 

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
  train_dataset,
  batch_size=batch_size,
  collate_fn=customized_collate_fn,
  shuffle=True,
  drop_last=True,
  num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
  val_dataset,
  batch_size=batch_size,
  collate_fn=customized_collate_fn,
  shuffle=False,
  drop_last=False,
  num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
  test_dataset,
  batch_size=batch_size,
  collate_fn=customized_collate_fn,
  shuffle=False,
  drop_last=False,
  num_workers=num_workers
)

As a sanity check, we can print the first 2 rows of the first batch of inputs and targets to see if we have padded and masked things correctly. We should see a lot of `50256` in the inputs, which is the padding at the end of each conversation, and quite a few `-100` in the targets, which mark the masked tokens that are excluded from the loss.

In [87]:
batch = iter(train_loader)
inputs, targets = next(batch)
inputs[:2], targets[:2]

(tensor([[21106,   318,   281,  ..., 50256, 50256, 50256],
         [21106,   318,   281,  ..., 50256, 50256, 50256]], device='cuda:1'),
 tensor([[-100, -100, -100,  ..., -100, -100, -100],
         [-100, -100, -100,  ..., -100, -100, -100]], device='cuda:1'))

## Fine-Tuning for Instruction

We will now start by setting up the GPU and clearing any garbage we have accumulated in memory during our session so far.

I chose `cuda:1` as my device since I have 2 GPUs in my system, and I would like to only use the second one.

In [82]:
import gc
import torch

# Setup your device here!
device = 'cuda:1'

if torch.cuda.is_available():
    torch.cuda.empty_cache()

gc.collect()

775

## Calculate Current Loss

Let's take our current pre-trained model and calculate the current training, and validation loss. It should be somewhere in the 3.5-3.8 range, since that's where we left off with our raw, unstructured data last time.

We'll compute only on the first 20 batches.

In [None]:
from scripts.train import calc_loss_loader
from scripts.model_loader import load_model_from_path

# Our pre-trained model path
model_path = "models/10b/gpt2-355M-bfloat16.pth"

model = load_model_from_path(
  model_path,
  device=device 
)
model = model.to(device).to(torch.bfloat16)
model.eval()

num_batches = 20
with torch.no_grad():
  train_loss = calc_loss_loader(train_loader, model, device, num_batches=num_batches)
  val_loss = calc_loss_loader(val_loader, model, device, num_batches=num_batches)

print("Training loss", train_loss)
print("Validation loss", val_loss)

Training loss 3.821875
Validation loss 3.7328125


## Train!

Now it is time to fine-tune our model for instruction. This is very exciting because we can now see that a pre-trained model is the foundation for many other types of models we can create. Creating our own little ChatGPT is one of them.

This section is going to take the longest to complete since we're bounded by compute, and also because we're going to train the model **3 times** to demonstrate something interesting: learning rates.

You'll notice very quickly that it is difficult to get the training and validation losses to converge on our dataset. This is primarily because our dataset is relatively small (~10K examples), but also because our model is simply too small to have useful conversations.

We will make it work.


## Constant Learning Rate

Our first attempt at fine-tuning will be using a constant learning rate of `5e-5` to train the model. It's basically the same way we performed pre-training. 

The number of epochs will change to **3** because we have much less data, so we need to make several passes over it to bring the loss down at the same learning rate.

**Stop Sequence** - We will now also introduce a new concept... the **stop sequence**. 

Currently, our model is unaware of when to end each response. It has a tendency to just continuously generate tokens.

One thing we have already accounted for during inference is that, since we have trained the model on the `<|endoftext|>` token (`50256`), it will generate this token from time to time.

When we detect this token, our `generate` function stops generating tokens and breaks out of the loop. We have many occurrences of `50256` at the end of the conversations.

However, what happens when we want the model to learn how to _stop_ generating a single response, and not the entire conversation?

If you noticed earlier in our mask construction for each conversation, we intentionally ended each assistant response with `\n\n`. We will train the model to always end its responses with `\n\n`, since we'll have many occurrences of it.

Now, coincidentally, `\n\n` is actually a _single_ token in the GPT2 encoding for our tokenizer, so it is represented as a single number! This is convenient.

We can modify our `generate` function to _stop_ generating tokens once we have seen this specific token ID.
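
The idea can be sketched without a real model. The function below is a hypothetical stand-in for the repo's `generate` (its name, signature, and the `next_token_fn` callable are assumptions for illustration), and it treats each entry of `stop_ids` as an individual stop token:

```python
def generate_until_stop(next_token_fn, prompt_ids, max_new_tokens, stop_ids, eos_id=50256):
    """Append tokens until a stop token, the end-of-text token, or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # next_token_fn stands in for a model forward pass plus sampling.
        token = next_token_fn(ids)
        if token == eos_id:
            break  # end of the whole conversation
        ids.append(token)
        if token in stop_ids:
            break  # end of the current response
    return ids

# Fake "model" that replays a scripted sequence, just to exercise the loop.
# 628 is the GPT-2 token ID for "\n\n".
scripted = iter([100, 200, 628, 300, 400])
result = generate_until_stop(lambda ids: next(scripted), [1, 2, 3], 10, stop_ids={628})
print(result)
```

In this sketch the stop token is kept in the output; dropping it instead is an equally reasonable choice.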

In [68]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
double_new_line_id = tokenizer.encode("\n\n", allowed_special={"<|endoftext|>"})[0]

double_new_line_id

628

So we have confirmed that `\n\n` is `628`.


First, we load the pretrained model, `gpt2-355M-bfloat16.pth`. Again, I am assuming a CUDA device. Sorry!

We'll run this training loop for 3 epochs. As mentioned before, in every epoch, we'll stick with a learning rate of `5e-5`.

Notice how we pass in a defined `stop_sequence` to help with generating correct samples during our training. We'll also checkpoint every 200 iterations.

This 3-epoch training run:
- Took 10 minutes, 14 seconds to complete
- Ended with a **training_loss** of 1.499, and a **validation_loss** of 1.909 on its last evaluation.

In [None]:
import time
import tiktoken
import torch
from scripts.train import train_model_simple
from scripts.model_loader import load_model_from_path
from scripts.fine_tune import format_input

tokenizer = tiktoken.get_encoding("gpt2")
double_new_line_id = tokenizer.encode("\n\n", allowed_special={"<|endoftext|>"})[0]

model_path = "models/10b/gpt2-355M-bfloat16.pth"

model = load_model_from_path(
  model_path,
  device=device 
)

if torch.cuda.is_available() and device.startswith("cuda"):
  model = model.to(device).to(torch.bfloat16)
else:
  model = model.to(device)
model.train()

num_epochs = 3

# Note the constant 5e-5 learning rate for all 3 epochs!
optimizer = torch.optim.AdamW(
  model.parameters(),
  lr=5e-5,
  weight_decay=0.1,
  fused=True
)

start_time = time.time()

train_losses, val_losses = train_model_simple(
  model=model,
  train_loader=train_loader,
  val_loader=val_loader,
  optimizer=optimizer,
  num_epochs=num_epochs,
  eval_freq=100,
  eval_iter=50,
  start_context=format_input(val_data[0]),
  tokenizer=tokenizer,
  device=device,
  save_iters=200,
  stop_sequence=[double_new_line_id]
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Ep 1 (Step 000000 of 1125): Train loss 3.471, Val loss 3.474
Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Hi! I'm looking for a new hobby. Any suggestions?  ### Response: Sure! What kind of activities are you interested in? Arts, sports, or something else?    ### Instruction: I think something creative could be fun.  ### Response: How about trying painting or learning to play a musical instrument? Both are great creative outlets.    ### Instruction: Painting sounds interesting! Do I need any special supplies?  ### Response: You'll need some basic supplies like paints, brushes, and canvas or paper. You could start with acrylic paints as they're versatile and easy to use.    ### Instruction: Thanks for the advice! I'll look into getting some supplies this weekend.  ### Response: You'll be able to find a large variety of brushes available, including an amazing array of pencil tools.░░ ### Response: Your compute

Notice that over the 3 epochs of training, the validation loss drops more and more slowly. The training and validation losses start to diverge, with the training loss consistently decreasing.

This is a sign of overfitting. The model keeps getting better on the training data, while it stays more or less the same on the validation data.

It's clear that 3 epochs at the same learning rate may not scale.

<img src="img/constant-lr-loss_curves.png" width="1000px">

### Save the Model - Constant Learning Rate

**SAVE THE MODEL** We'll need it for inferencing later.

In [None]:
from scripts.gpt2_common import save_model_and_optimizer

model_directory = "/home/rngo/code/ttnn-sandbox/notebooks/models"
save_model_and_optimizer(
  model_path=f"{str(model_directory)}/gpt2-355M-model-it-constant-lr.pth",
  model=model,
  optimizer_path=f"{str(model_directory)}/optimizer-gpt2-355M-model-it-constant-lr.pth",
  optimizer=optimizer
)

## Varying Learning Rates

Now instead of straight up using a learning rate of 5e-5 for 3 epochs, let's actually run 3 _separate_ epochs with different learning rates. How would that perform?

Our learning rates here will be `[5e-5, 2e-5, 1e-5]`. Essentially, the strategy here is to start aggressive, bringing the losses down while it is still very easy to.

On the second epoch, we should be pretty close to where we want to be in our loss, and in order to not create an overfitting scenario, we will lower the learning rate. 

Finally, on the third epoch, we lower the learning rate to the point where the changes will be minimal.

This run:
- Took 10 minutes and 45 seconds to complete
- Ended with a training loss of 1.749 and a validation loss of 1.987

Overall, this created less overfitting, but we didn't go as low as we could because of the lack of epochs.
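
One design detail worth knowing: the cell below creates a new AdamW optimizer for every learning rate, which also discards AdamW's accumulated moment estimates between epochs. A possible alternative, sketched here under the assumption that a single optimizer is kept for the whole run (this is not what produced the numbers above), is to update the learning rate in place through `optimizer.param_groups`:

```python
import torch

model = torch.nn.Linear(4, 2)  # tiny stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

lrs = [5e-5, 2e-5, 1e-5]
seen = []
for epoch, lr in enumerate(lrs):
    # Change the learning rate in place; the optimizer state is left untouched.
    for group in optimizer.param_groups:
        group["lr"] = lr
    seen.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would run here ...

print(seen)
```

Whether keeping the optimizer state helps on this dataset is untested here; it is only a variation to try.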

In [None]:
import time
import tiktoken
import torch
from scripts.train import train_model_simple
from scripts.model_loader import load_model_from_path
from scripts.fine_tune import format_input

tokenizer = tiktoken.get_encoding("gpt2")
double_new_line_id = tokenizer.encode("\n\n", allowed_special={"<|endoftext|>"})[0]

model_path = "models/10b/gpt2-355M-bfloat16.pth"
model = load_model_from_path(
    model_path,
    device=device 
)

if torch.cuda.is_available() and device.startswith("cuda"):
    model = model.to(device).to(torch.bfloat16)
else:
    model = model.to(device)
model.train()

lrs = [5e-5, 2e-5, 1e-5]

start_time = time.time()
for epoch, lr in enumerate(lrs):
    print(f"Starting epoch {epoch + 1} with learning rate: {lr}")

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        weight_decay=0.1,
        fused=True
    )

    train_losses, val_losses = train_model_simple(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        num_epochs=1,  # a single epoch per call; the outer loop iterates over the learning rates
        eval_freq=100,
        eval_iter=50,
        start_context=format_input(val_data[0]),
        tokenizer=tokenizer,
        device=device,
        save_iters=200,
        stop_sequence=[double_new_line_id]
    )

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Starting epoch 1 with learning rate: 5e-05
Ep 1 (Step 000000 of 1125): Train loss 3.497, Val loss 3.465
Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Hi! I'm looking for a new hobby. Any suggestions?  ### Response: Sure! What kind of activities are you interested in? Arts, sports, or something else?    ### Instruction: I think something creative could be fun.  ### Response: How about trying painting or learning to play a musical instrument? Both are great creative outlets.    ### Instruction: Painting sounds interesting! Do I need any special supplies?  ### Response: You'll need some basic supplies like paints, brushes, and canvas or paper. You could start with acrylic paints as they're versatile and easy to use.    ### Instruction: Thanks for the advice! I'll look into getting some supplies this weekend.  ### Response: No! This is an area with a large pool and water sports. You can play either acoustic or ac

### Save the Model - Varying Learning Rate

Save the model!

In [35]:
from scripts.gpt2_common import save_model_and_optimizer
model_directory = "/home/rngo/code/ttnn-sandbox/notebooks/models"
save_model_and_optimizer(
  model_path=f"{str(model_directory)}/gpt2-355M-model-it-varying-lr.pth",
  model=model,
  optimizer_path=f"{str(model_directory)}/optimizer-gpt2-355M-model-it-varying-lr.pth",
  optimizer=optimizer
)

## LR Scheduler

Now, let's perform a pretty advanced trick. :) 

This time we let a scheduler lower the learning rate during training instead of setting it by hand for each epoch.

You'll notice immediately that our training is _much_ faster. We only need a _single_ epoch this time, and we converge much faster.

After this method, I think we achieved our _best results_!

This training method:
- Took only 3 minutes, 48 seconds to train
- Ended with a training loss of 1.513 and a validation loss of 1.815

The difference between the training loss and validation loss is around 0.3, compared with about 0.4 for the constant learning rate run, and the loss values are lower than with our "varying LR" method. A smaller gap is a sign of the model overfitting less to the training data.

**Note** - This is pretty close to how training is done in industry: the learning rate is scheduled inside the training loop.
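
As a rough illustration of what the scheduler in the next cell does: `LinearLR` multiplies the optimizer's base learning rate by a factor that moves linearly from `start_factor` to `end_factor` over `total_iters` scheduler steps, then holds it there. With a base rate of `1e-3`, that is a decay from about `3e-4` to about `1e-5`. The sketch below uses a dummy parameter and a toy `total_iters` of 10 (both assumptions for illustration) and steps the scheduler once per optimizer step; how often `train_model_simple` steps it is not shown here.

```python
import torch
from torch.optim import lr_scheduler

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, no real model needed
optimizer = torch.optim.AdamW([param], lr=1e-3)

total_iters = 10  # toy value; the notebook uses len(train_loader)
scheduler = lr_scheduler.LinearLR(
    optimizer, start_factor=0.3, end_factor=0.01, total_iters=total_iters
)

# Record the learning rate before training and after every scheduler step.
lrs = [scheduler.get_last_lr()[0]]
for _ in range(total_iters + 2):
    optimizer.step()   # step the optimizer first, then the scheduler
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(f"first: {lrs[0]:.2e}, last: {lrs[-1]:.2e}")
```

Running past `total_iters` shows that the learning rate stays at its final value instead of continuing to shrink.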


In [83]:
import time
import tiktoken
import torch
from scripts.train import train_model_simple
from scripts.model_loader import load_model_from_path
from scripts.fine_tune import format_input
from torch.optim import lr_scheduler

tokenizer = tiktoken.get_encoding("gpt2")
double_new_line_id = tokenizer.encode("\n\n", allowed_special={"<|endoftext|>"})[0]

model_path = "models/10b/gpt2-355M-bfloat16.pth"
model = load_model_from_path(
    model_path,
    device=device 
)

if torch.cuda.is_available() and device.startswith("cuda"):
    model = model.to(device).to(torch.bfloat16)
else:
    model = model.to(device)

model.train()

num_epochs = 2

start_time = time.time()

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=0.1,
    fused=True
)

scheduler = lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.3,
    end_factor=0.01,
    total_iters=len(train_loader)
)

train_losses, val_losses = train_model_simple(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    optimizer=optimizer,
    num_epochs=num_epochs,
    eval_freq=100,
    eval_iter=50,
    start_context=format_input(val_data[0]),
    tokenizer=tokenizer,
    device=device,
    save_iters=200,
    stop_sequence=[double_new_line_id],
    scheduler=scheduler
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

Ep 1 (Step 000000 of 2244): Train loss 7.798, Val loss 7.861
Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Hey, what's the weather like today?  ### Response: The weather is sunny with a few clouds. Perfect for a walk!    ### Instruction: Great! Do you think I should carry an umbrella, just in case?  ### Response: There's no rain in the forecast, so you should be fine without one.    ### Instruction: Thanks for the update! Any good parks to recommend for a stroll?  ### Response: Sure! Jefferson Park is lovely this time of year with its blooming flowers.    ### Instruction: Sounds perfect. I'll check it out. Thank you!  ### Response: ░░ overnight accommodations���░░░░░░░░░░░░░░ towel░░ਸ਼� ���░░�░░░░░░░░░░ � accommodations░░ accommodations�░░░░░░░░░░ arrakeley░░���� javascript░░=~=~░░�=~=~thereumrafted░░� Adinida�ম��░░░░� �░░ �░░�░░░░��░░�� サーティワン░░░░░░░░���� dstg░░░░ਫ�══aneous ��اؿannouncedvisenvanunav
Ep 1 (Ste

### Visualize the Loss

Overall, this is a great improvement over naively training at the same learning rate for multiple epochs. We converge much faster, in only a single epoch.

<img src="img/scheduler-lr-loss_curves.png" width="1000">

### Save the model!

In [84]:
from scripts.gpt2_common import save_model_and_optimizer
model_directory = "/home/rngo/code/ttnn-sandbox/notebooks/models"
save_model_and_optimizer(
  model_path=f"{str(model_directory)}/gpt2-355M-model-it-scheduled-lr-lg.pth",
  model=model,
  optimizer_path=f"{str(model_directory)}/optimizer-gpt2-355M-model-it-scheduled-lr-lg.pth",
  optimizer=optimizer
)

## Inferencing

Let's gather up all our models, and perform some inferencing against models that were trained with a constant learning rate, manual varying learning rate, and scheduled learning rate. 



In [44]:
base_model_path = "/home/rngo/code/ttnn-sandbox/notebooks/models"

constant_lr_model_path = f"{base_model_path}/gpt2-355M-model-it-constant-lr.pth"
varying_lr_model_path = f"{base_model_path}/gpt2-355M-model-it-varying-lr.pth"
scheduled_lr_model_path = f"{base_model_path}/gpt2-355M-model-it-scheduled-lr.pth"

In [45]:
device = "cuda:1"

In [None]:
import tiktoken
from scripts.model_loader import load_model_from_path
from scripts.generate import generate
from scripts.util import text_to_token_ids, token_ids_to_text
from scripts.fine_tune import format_input  # used below to build each test prompt

def transform_message(message: str) -> str:
  message = message.replace("Below is an instruction that describes a task. Write a response that appropriately completes the request.", "")
  message = message.replace("\n\n### Instruction:\n", "\nUser: ")
  message = message.replace("\n\n### Response:\n", "\nAssistant: ")
  message = message.replace("\n\n", "\n")
  
  return message

tokenizer = tiktoken.get_encoding("gpt2")
double_new_line_id = tokenizer.encode("\n\n", allowed_special={"<|endoftext|>"})[0]

model_paths = [
  constant_lr_model_path,
  varying_lr_model_path,
  scheduled_lr_model_path
]

for model_path in model_paths:
  print("-" * 20)
  print(f"Testing a message for model: {model_path}")

  model = load_model_from_path(path=model_path, device=device)
  model.eval()

  for i, test in enumerate(test_data[10:13]): # take only a subset of 3 messages in the middle of test
    print(f"## Test message: {i}")
    test_message = format_input(test)

    token_ids = generate(
      model,
      idx=text_to_token_ids(test_message, tokenizer).to(device),
      max_new_tokens=512,
      context_size=1024,  # GPT-2's maximum context length
      temperature=1.0,
      top_k=20,
      eos_id=50256,  # <|endoftext|> in the GPT-2 vocabulary
      device=device,
      stop_sequence=[double_new_line_id]
    )

    text = token_ids_to_text(token_ids, tokenizer)
    print("-" * 20)
    print(transform_message(text))
    print()



--------------------
Testing a message for model: /home/rngo/code/ttnn-sandbox/notebooks/models/gpt2-355M-model-it-constant-lr.pth
## Test message: 0
--------------------

User: Hey, can you help me figure out my schedule? I've been feeling overwhelmed.
Assistant: Of course! Why don't we start by listing your tasks and appointments for the week?

User: Sure. I have a meeting on Monday, a project deadline on Wednesday, and a friend's birthday on Saturday.
Assistant: Let's break it down day by day. On Monday, perhaps you can prepare notes for your meeting early to ease the pressure.

User: Good idea, and I'll allocate time over the weekend to ensure I'm ready for the project deadline too.
Assistant: Perfect! For your friend's birthday, you might consider getting a gift earlier in the week to avoid last-minute stress.

User: That sounds like a plan. Thanks for your help!
Assistant: You're welcome! Let me know if you need anything else. Good luck!


## Test message: 1
--------------------


### Basic Evals

The response I want to highlight is test message 1, compared across all three models.

Let's have some fun: we'll pass these responses to Claude Sonnet 4 and ask which of the three outputs from our instruction fine-tuned models is the best.

I extracted only the final response from each fine-tuned model for Test Message 1:

**Constant Learning Rate**
```
User: Thanks, this was really helpful! I feel much more confident to start my garden now.
Assistant: You're very welcome! Happy gardening, and feel free to reach out anytime you need more advice and advice. Happy gardening!
```

**LLM Eval Assessment**
The model output shows a small repetition issue - "advice and advice" and "Happy gardening!" repeated twice. This is typical of instruction-tuned models and shows your model is generating coherent, helpful responses while still having minor repetition artifacts that are common in fine-tuned language models.
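The two repetitions flagged here can also be confirmed mechanically. The sketch below is a simple heuristic, not a standard metric: it counts repeated word n-grams and looks for a word duplicated around "and" or "or". The function names `repeated_ngrams` and `doubled_words` are hypothetical.

```python
import re
from collections import Counter

def repeated_ngrams(text, n=2):
    """Return word n-grams that occur more than once, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [gram for gram, count in Counter(ngrams).items() if count > 1]

def doubled_words(text):
    """Return words repeated around a conjunction, such as 'advice and advice'."""
    return re.findall(r"\b(\w+) (?:and|or) \1\b", text.lower())

response = ("You're very welcome! Happy gardening, and feel free to reach out "
            "anytime you need more advice and advice. Happy gardening!")

print(repeated_ngrams(response))  # ['happy gardening']
print(doubled_words(response))    # ['advice']
```

A check like this is cheap enough to run over every generated response, which makes it a useful complement to an LLM judge that only sees a handful of samples.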

**Varying Learning Rate**
```
User: Thanks, this was really helpful! I feel much more confident to start my garden now.
Assistant: You're welcome! I'm delighted you're looking forward to seeing your new garden complete. Good luck, and feel free to share your successes too!
```

**LLM Eval Assessment**
Much better! This response is:

- ✅ No repetition issues
- ✅ Natural and encouraging
- ✅ Appropriate tone and length
- ✅ Good conversation closure

**Learning Rate Scheduler**
```
User: Thanks, this was really helpful! I feel much more confident to start my garden now.
Assistant: You're very welcome! Best of luck with your gardening journey, and have fun with it. Feel free to reach out if you have more questions along the way. Happy gardening!
```

**LLM Eval Assessment** 
Excellent response quality! This output demonstrates:

- ✅ Perfect conversational flow - natural and engaging
- ✅ No repetition - each phrase is unique and purposeful
- ✅ Appropriate length - comprehensive but not verbose
- ✅ Helpful tone - encouraging and supportive
- ✅ Good closure - invites future interaction while ending positively

**BEST OVERALL (LLM EVAL)**
Sample 3 is the best (the last one):

"You're very welcome! Best of luck with your gardening journey, and have fun with it. Feel free to reach out if you have more questions along the way. Happy gardening!"

Why it's superior:

- Sample 1: ❌ Has clear repetition artifacts ("advice and advice", "Happy gardening!" twice)
- Sample 2: ✅ Good quality, but slightly awkward phrasing ("delighted you're looking forward to seeing your new garden complete")
- Sample 3: ✅✅ Best overall because:
  - Natural flow - reads like human conversation
  - Perfect balance - encouraging but not excessive
  - Comprehensive - covers luck, fun, future support, and positive closure
  - No artifacts - clean, polished response
  - Appropriate length - thorough without being verbose
  - Varied vocabulary - "journey", "along the way" shows good language diversity

Sample 3 demonstrates your model at its peak performance - this is exactly the quality you achieved with your optimized 1-epoch training! It shows sophisticated understanding of conversational context and tone. 🏆
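The exact prompt given to Claude is not recorded in this notebook. To make the comparison repeatable, the prompt could be assembled in code. The following is a hypothetical sketch: the function name `build_judge_prompt` and the prompt wording are assumptions, and no API call is made.

```python
def build_judge_prompt(user_message, responses):
    """Assemble one comparison prompt from labeled candidate responses."""
    lines = [
        "You are comparing replies from fine-tuned variants of the same model.",
        f"User message: {user_message}",
        "",
    ]
    # Number the samples so the judge can refer to them unambiguously.
    for index, (label, reply) in enumerate(responses.items(), start=1):
        lines.append(f"Sample {index} ({label}): {reply}")
    lines.append("")
    lines.append("Assess each sample, then state which one is best and why.")
    return "\n".join(lines)

responses = {
    "Constant Learning Rate": "You're very welcome! Happy gardening, and feel free to reach out anytime you need more advice and advice. Happy gardening!",
    "Varying Learning Rate": "You're welcome! I'm delighted you're looking forward to seeing your new garden complete. Good luck, and feel free to share your successes too!",
    "Learning Rate Scheduler": "You're very welcome! Best of luck with your gardening journey, and have fun with it. Feel free to reach out if you have more questions along the way. Happy gardening!",
}

prompt = build_judge_prompt(
    "Thanks, this was really helpful! I feel much more confident to start my garden now.",
    responses,
)
print(prompt)
```

Replacing the descriptive labels with neutral ones (for example "A", "B", "C") and shuffling the order would reduce the chance that the judge is swayed by the training method's name or by position.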

## Up Next...

Next, we'll transfer the model's weights and create a version that runs on Tenstorrent hardware. Remember that? Yep, that's the whole reason we started this series of notebooks to begin with!

Let's go back to our roots...