# Prepare your Instruction Dataset

An Instruction dataset is a list of instructions/outputs pairs that are relevant to your own domain. For instance it could be question and answers from an specific domain, problems and solution for a technical domain, or just instruction and outputs. A typical example is "Write me a Python script to read a jsonL file and print the first 5 lines" and the model would output something like:

```python
import json

fname = "my_file.json"

# read file from fname
with open(fname, "r") as f:
    data = json.load(f)

print(data[0:5])
```

So let's explore how one could do this?

After grabbing a finetuned model and curated your own dataset, how do I create a dataset that has the right format to fine tune a model?

Let's grab the Alpaca (GPT-4 curated instructions and outputs) dataset:

In [11]:
# !wget https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json

In [17]:
import json

dataset_file = "alpaca_gpt4_data.json"

with open(dataset_file, "r") as f:
    alpaca = json.load(f)

In [18]:
type(alpaca), alpaca[0:3], len(alpaca)

(list,
 [{'instruction': 'Give three tips for staying healthy.',
   'input': '',
   'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'},
  {'instruction': 'What are the three primary colors?',
   'input': '',
   'output': 'The three primary colors are red, blue, and yellow. These

So the dataset has instruction and outputs. The model is trained to predict the next token, so one option would be just to concat both, and train on that. We ideally format the prompt in a way that we make explicit where is the input and output.

In [19]:
import wandb

# log to wandb
with wandb.init(project="alpaca_ft"):
    at = wandb.Artifact(
        name="alpaca_gpt4", 
        type="dataset",
        description="A GPT4 generated Alpaca like dataset for instruction finetunning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    at.add_file(dataset_file)

    # log as a table
    table = wandb.Table(columns=list(alpaca[0].keys()))
    for row in alpaca:
        table.add_data(*row.values())
    wandb.log({"alpaca_gpt4_table": table})

Let's log the dataset also as a table so we can inspect it on the workspace.

In [20]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(row)

In [21]:
row = alpaca[0]
print(prompt_no_input(row))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:



Some other instruction have some context in the `input` variable`

In [22]:
row

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'}

In [23]:
def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)

In [24]:
row = alpaca[232]
print(prompt_input(row))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sort the following list in alphabetical order.

### Input:
Camouflage, Furniture, Plaster

### Response:



> But you are leaving the output out!!! Yes, but we can just concat that afterwards. Let's deal with the prompt now, we can add the output later with the right amount of padding.

And the refactored function

In [25]:
def create_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)

## Why are we doing all this?

Because we need to tokenize this dataset in a very particular way, if we want the model to learn to predict the output.

In [26]:
prompts = [create_prompt(row) for row in alpaca]

In [27]:
print(prompts[0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:



We need to process the targets and add the End Of String token (EOS) to the results. For LLama this is: `"</s>"`

In [28]:
EOS_TOKEN = "</s>"
outputs = [f"{row['output']}{EOS_TOKEN}" for row in alpaca]

In [29]:
outputs[0]

'1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>'

Cool! but why we have everything separated? Let's sore the "final" version on a variable called `examples`

In [30]:
dataset = [{"prompt":s, "output":t, "example": s + t} for s, t in zip(prompts, outputs)]

This is what the model need to see and learn =)

In [31]:
print(dataset[0]["example"])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


## We can actually already train the model! Lets train a baseline

In [32]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [33]:
model_id = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

Downloading (…)okenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

we will sort them by lenght, so we get as little padding as possible.

In [34]:
tokenizer.encode("My experiments are going strong!")

[1, 1619, 15729, 526, 2675, 4549, 29991]

In [35]:
tokenizer.encode("My experiments are going strong!", padding='max_length', max_length=10)

[1, 1619, 15729, 526, 2675, 4549, 29991, 2, 2, 2]

In [36]:
tokenizer.encode("My experiments are going strong!", 
                 padding='max_length', 
                 max_length=10,
                 return_tensors="pt")

tensor([[    1,  1619, 15729,   526,  2675,  4549, 29991,     2,     2,     2]])

In [39]:
tokenizer(["My experiments are going strong!", 
           "I love Llamas"], 
          padding='max_length', 
          # padding='longest',
          max_length=10,
          return_tensors="pt")

{'input_ids': tensor([[    1,  1619, 15729,   526,  2675,  4549, 29991,     2,     2,     2],
        [    1,   306,  5360,   365,  5288,   294,     2,     2,     2,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

In [42]:
import random
random.shuffle(dataset)

In [43]:
dataset.sort(key=lambda x: len(x["example"]))

In [44]:
train_dataset = dataset[:-10]
eval_dataset = dataset[-10:]

### Packing

We will pack multiple short examples into a longer chunk, so we can train more efficiently!

In [60]:
max_seq_len = 1024

In [61]:
tkds_ids = tokenizer([s["example"] for s in train_dataset])["input_ids"]

In [62]:
all_token_ids = []
for tokenized_input in tkds_ids:
    all_token_ids.extend(tokenized_input)

In [72]:
packed_ds = []
for i in range(0, len(all_token_ids), max_seq_len):
    input_ids = all_token_ids[i : i + max_seq_len]
    if len(input_ids) == max_seq_len:
        packed_ds.append({"input_ids": input_ids[:-1], "labels": input_ids[1:]})

In [73]:
len(packed_ds)

11435

In [67]:
from transformers import default_data_collator

# Collate function for DataLoaders
# def collate_fn(examples):
#     batch_size = len(examples)
#     batch = {'input_ids': input_ids[:, :-1], 'labels': input_ids[:, 1:]}
#     return batch

In [74]:
from torch.utils.data import DataLoader

batch_size = 16

train_dataloader = DataLoader(
    packed_ds,
    batch_size=batch_size,
    collate_fn=default_data_collator,
    shuffle=False, #this way we keep the lenght together
)

In [75]:
b = next(iter(train_dataloader))
# b.keys(), b["input_ids"][0][:25], b["labels"][0][:25]

In [76]:
b

{'input_ids': tensor([[    1, 13866,   338,  ...,  3414, 29889, 14350],
         [ 2933,   393,  7128,  ..., 29908,    13,    13],
         [29937, 13291, 29901,  ..., 13291, 29901,    13],
         ...,
         [ 2009, 29889,    13,  ...,    13,  2277, 29937],
         [29901,    13, 29953,  ...,  4080, 29901,    13],
         [  403,   263,   716,  ..., 29889, 14350,   263]]),
 'labels': tensor([[13866,   338,   385,  ..., 29889, 14350,   263],
         [  393,  7128,  2486,  ...,    13,    13,  2277],
         [13291, 29901,    13,  ..., 29901,    13,  6716],
         ...,
         [29889,    13,    13,  ...,  2277, 29937, 13291],
         [   13, 29953,     2,  ..., 29901,    13,  5631],
         [  263,   716,  1024,  ..., 14350,   263,  2933]])}

In [80]:
tokenizer.decode(b["input_ids"][0])[:250]

'<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nProduce a random noun.\n\n### Response:\nLamp</s><s> Below is an instruction that describes a task. Write a response that app'

In [81]:
tokenizer.decode(b["labels"][0])[:250]

'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nProduce a random noun.\n\n### Response:\nLamp</s><s> Below is an instruction that describes a task. Write a response that appropr'

## Training

In [82]:
def save_model(model_name, log=False):
    "Save pytorch model todisk and wandb"
    model_name = f"{wandb.run.id}_{self.model_name}"
    torch.save(learn.model.state_dict(), f"models/{self.model_name}.pth")
    if log:
        at = wandb.Artifact(model_name, type="model")
        at.add_file(f"models/{self.model_name}.pth")
        wandb.log_artifact(at)

## Train

In [83]:
from types import SimpleNamespace

config = SimpleNamespace(
    model_id='meta-llama/Llama-2-7b-hf',
    dataset_name="alpaca-gpt4",
    precision="bf16",
    n_freeze=24,
    lr=1e-3,
    epochs=1,
    gradient_accumulation_steps=4,
    batch_size=batch_size,
    epoch_sz=len(train_dataloader),
    log_every=len(train_dataloader)/5,
    save_model=False,
    mom=0.9,
    gradient_checkpointing = True,
    freeze_embed = True,
)

In [84]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    # use_flash_attention_2=True,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16 if config.precision == "bf16" else torch.float32,
)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

AssertionError: Torch not compiled with CUDA enabled

In [85]:
def param_count(m):
    params = sum([p.numel() for p in m.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in m.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M")
    return params, trainable_params

params, trainable_params = param_count(model)

NameError: name 'model' is not defined

Let's just train the last 8 layers of the model (Llama2-7B has 32)

In [70]:
n_freeze = 24

# freeze layers (disable gradients)
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters(): param.requires_grad = True

In [71]:
params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 1750.14M


In [72]:
# Just freeze embeddings for small memory decrease
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False);

In [73]:
# save more memory
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable()

## Testing

Let's compute some generations during training, we can sample form the validation dataset

In [74]:
from types import SimpleNamespace
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)
test_config = SimpleNamespace(
    max_new_tokens=90,
    gen_config=gen_config)

In [75]:
def generate(prompt, max_new_tokens=test_config.max_new_tokens, gen_config=gen_config):
    with torch.inference_mode():
        tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
        output = model.generate(tokenized_prompt, 
                            max_new_tokens=max_new_tokens, 
                            generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)

LoL 🤷

In [76]:
prompt = eval_dataset[0]["prompt"]
print(prompt + generate(prompt, 128))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a 500 word article about the benefits of organic food.

### Response:
Organic food is produced in a way that does not involve modern synthetic chemicals and pesticides. Organic food is produced without the use of genetic engineering, sewage sludge, irradiation, or synthetic fertilizers. Organic farming practices are designed to maintain and improve the natural resources and environment.

Organic food is grown in a way that promotes biodiversity and conserves water and soil fertility. It is produced without the use of synthetic pesticides, herbicides, or fertilizers. Organic food is grown without the use of gen


We can log a Table with those results to the project every X steps

In [77]:
import wandb
from fastprogress import progress_bar

def prompt_table(examples, log=False):
    table = wandb.Table(columns=["prompt", "generation", "concat", "output", "max_new_tokens", "temperature", "top_p"])
    for example in progress_bar(examples):
        prompt, gpt4_output = example["prompt"], example["output"]
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output, test_config.max_new_tokens, test_config.gen_config.temperature, test_config.gen_config.top_p)
    if log:
        wandb.log({"predictions":table})
    return table

def to_gpu(tensor_dict):
    for key in tensor_dict.keys():
        if torch.is_tensor(tensor_dict[key]):
            tensor_dict[key] = tensor_dict[key].to('cuda')
    return tensor_dict

class TokenAccuracy:
    "A simple Accuracy function compatible with HF models"
    def __init__(self):
        self.count = 0
        self.tp = 0.
    def update(self, logits, labels):
        logits, labels = logits.argmax(dim=-1).view(-1), labels.view(-1)
        tp = (logits == labels).sum()
        self.count += len(logits)
        self.tp += tp
        return tp / len(logits)
    def compute(self):
        return self.tp / self.count

Setup optimizer and else =)

In [78]:
from transformers import get_cosine_schedule_with_warmup

optim = torch.optim.Adam(model.parameters(), betas=(0.9,0.99), eps=1e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(config.precision == "fp16")) # no-op if enabled=False
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.epoch_sz,
    num_warmup_steps=100,
)

In [None]:
wandb.init(project="alpaca_ft", # the project I am working on
           tags=["baseline"],
           config=config) # the Hyperparameters I want to keep track of

# Training
acc = TokenAccuracy()

model.train()
for step, batch in enumerate(progress_bar(train_dataloader)):
    optim.zero_grad(set_to_none=True)
    for micro_step in range(config.gradient_accumulation_steps):
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**to_gpu(batch))
            loss = out.loss / config.gradient_accumulation_steps
    scaler.scale(loss).backward()
    scaler.step(optim)
    scaler.update()
    scheduler.step()

    # we can log the metrics to W&B
    wandb.log({"loss": loss.item() * config.gradient_accumulation_steps,
               "accuracy": acc.update(out.logits, batch["labels"])})

    if step%config.log_every==0 or step%config.epoch_sz==0:
        prompt_table(eval_dataset, log=True)
    
# we save the model checkpoint at the end
if config.save_model:
    save_model(model, model_name=config.model_id.replace("/", "_"), models_folder="models/")
    
wandb.finish()

0,1
accuracy,▁
loss,▁

0,1
accuracy,0.38625
loss,10.26825


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011112440788888813, max=1.0…



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


In [None]:
# let's free the GPU
model.to("cpu")
del model
torch.cuda.empty_cache()

## The Right way now!

We actually need to feed the model in a different way, with the right attention mask and penalising only the tokens on the answer:
- We will mask the question tokens
- We will put a large negative value on the labels we want to penalise
- Pre-tokenize the dataset for speed

In [None]:
example = dataset[0]
example

Let's tokenize the prompt and the output separately:

In [None]:
tokenized_prompt = tokenizer.encode(example["prompt"]) 
tokenized_output = tokenizer.encode(example["output"])

Now, let's set `input_ids` as the concatenation and `labels` as a copy of `input_ids` but masking the `prompt` 

In [None]:
import copy
input_ids = tokenized_prompt + tokenized_output
labels = copy.deepcopy(input_ids)
labels[:len(tokenized_prompt)] = [-100] * len(tokenized_prompt)  # we mask the tokens from the prompt

So we will feed the model this `input_ids` and make it predict the `labels`, we also have to pass the corresponding attention mask so we don't attend to this tokens

Let's process the dataset now:

In [None]:
def process_dataset(dataset):
    tokenized_dataset = []
    for example in progress_bar(dataset):
        tokenized_prompt = tokenizer.encode(example["prompt"]) 
        tokenized_output = tokenizer.encode(example["output"])
    
        input_ids = tokenized_prompt + tokenized_output
        labels = copy.deepcopy(input_ids)
        labels[:len(tokenized_prompt)] = [-100] * len(tokenized_prompt)
        tokenized_dataset.append({"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)})
    return tokenized_dataset

In [None]:
tokenized_dataset = process_dataset(dataset)

> You should save this tokenized dataset to W&B and reload from the artifact to save some time!

We now need to form batches, but the tokenized inputs have different sizes, so we will pad them

In [None]:
tokenized_dataset[0]

In [None]:
def collate_fn_tokenized(examples, pad_token=tokenizer.pad_token_id):
    input_ids, labels = tuple([example[key] for example in examples] for key in ("input_ids", "labels"))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=pad_token
    )
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)
    return dict(
        input_ids=input_ids,
        labels=labels,
        attention_mask=input_ids.ne(pad_token),
    )

So we pad labels with `-100` and the input_ids with the EOS token

In [None]:
dummy_sample1 = {"input_ids":torch.tensor([1,2,3]), "labels":torch.tensor([-100,2,3])}
dummy_sample2 = {"input_ids":torch.tensor([5,6,7,8]), "labels":torch.tensor([-100,-100,7,8])}

collate_fn_tokenized([dummy_sample1, dummy_sample2], pad_token=-1)

In [None]:
config.batch_size = 16

train_dataloader = DataLoader(
    tokenized_dataset,
    batch_size=config.batch_size,
    collate_fn=collate_fn_tokenized,
    shuffle=True,
)

Let's create a fresh model =)

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    # use_flash_attention_2=True,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16 if config.precision == "bf16" else torch.float32,
    use_cache=False
)


In [None]:
n_freeze = 24

# freeze layers (disable gradients)
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[n_freeze:].parameters(): param.requires_grad = True

In [None]:
params, trainable_params = param_count(model)

In [None]:
# Just freeze embeddings for small memory decrease
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False);

In [None]:
# save more memory
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable()

In [None]:
optim = torch.optim.Adam(model.parameters(), betas=(0.9,0.99), eps=1e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(config.precision == "fp16")) # no-op if enabled=False
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.epoch_sz,
    num_warmup_steps=100,
)

In [None]:
wandb.init(project="alpaca_ft", # the project I am working on
           tags=["instruct"],
           config=config) # the Hyperparameters I want to keep track of

# Training
acc = TokenAccuracy()

model.train()
for step, batch in enumerate(progress_bar(train_dataloader)):
    optim.zero_grad(set_to_none=True)
    for micro_step in range(config.gradient_accumulation_steps):
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**to_gpu(batch))
            loss = out.loss / config.gradient_accumulation_steps
    scaler.scale(loss).backward()
    scaler.step(optim)
    scaler.update()
    scheduler.step()

    # we can log the metrics to W&B
    wandb.log({"loss": loss.item() * config.gradient_accumulation_steps,
               "accuracy": acc.update(out.logits, batch["labels"])})

    if step%config.log_every==0 or step%config.epoch_sz==0:
        prompt_table(eval_prompts, log=True)
    
# we save the model checkpoint at the end
if config.save_model:
    save_model(model, model_name=config.model_id.replace("/", "_"), models_folder="models/")
    
wandb.finish()

In [None]:
!sudo poweroff