<h1> From Llama to Alpaca: Fine-tuning an LLM with Weights & Biases </h1>
In this notebook you will learn how to fine-tune a model on an instruction dataset. We will use an updated version of the Alpaca dataset whose outputs were generated with GPT-4 instead of davinci-003 (GPT-3), which gives an even better instruction dataset!

<a href="https://colab.research.google.com/drive/1bprbJ4HAKEg_1AGse6cmK7-xkMauuNoh?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{mmdetection-wandb-colab} -->

Original GitHub repository: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
!pip install -q wandb
!pip install -q ctranslate2
!pip install -q bitsandbytes datasets accelerate loralib

In [None]:
import bitsandbytes as bnb
import copy
import glob
import os
import wandb
import json
from tqdm import tqdm
from types import SimpleNamespace
import datasets
from datasets import Dataset
import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, Trainer, TrainingArguments, default_data_collator,GenerationConfig
import torch
import torch.nn as nn
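# Note: this shadows datasets.Dataset imported above; InstructDataset below subclasses the torch class
from torch.utils.data import Dataset, DataLoader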
from torch.nn.utils.rnn import pad_sequence
import pandas as pd
from peft import PeftModel, PeftConfig, LoraConfig, get_peft_model

os.environ["WANDB_LOG_MODEL"] = "checkpoint"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Prepare your Instruction Dataset

An instruction dataset is a list of instruction/output pairs relevant to your own domain. For instance, it could be questions and answers from a specific domain, problems and solutions for a technical domain, or just instructions and outputs.
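
As a minimal illustration, each record in an Alpaca-style file is a dictionary with `instruction`, `input`, and `output` keys, where `input` may be empty. The two records and the `has_input` helper below are made up for this sketch; they are not rows from the actual dataset.

```python
# Two hypothetical records in the Alpaca layout (illustrative, not real rows).
records = [
    {
        "instruction": "Translate the sentence to French.",
        "input": "The cat sleeps.",
        "output": "Le chat dort.",
    },
    {
        "instruction": "Name three primary colors.",
        "input": "",  # many instructions need no extra context
        "output": "Red, yellow, and blue.",
    },
]


def has_input(record):
    """Return True when the record carries a non-empty `input` field."""
    return bool(record.get("input", "").strip())


with_input = [r for r in records if has_input(r)]
print(f"{len(with_input)} of {len(records)} records have an input")
```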


https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data

So let's explore how one could do this.
After grabbing a model and curating your own dataset, how do you put the data into the right format to fine-tune the model?

Let's grab the Alpaca (GPT-4 curated instructions and outputs) dataset:

In [None]:
!wget https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json

In [None]:
dataset_file = "alpaca_gpt4_data.json"
with open(dataset_file, "r") as f:
    alpaca = json.load(f)
print(alpaca[0])

So the dataset has instructions and outputs. The model is trained to predict the next token, so one option would be to concatenate both and train on that. Ideally, we format the prompt so that it is explicit where the input ends and the output begins. Let's log the dataset to W&B so we keep everything organised.
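
One way to make the boundary explicit is a fixed template with a response marker, so that everything after the marker is the target. The template and the `build_example` helper below are illustrative assumptions, not the exact prompt this notebook defines later.

```python
# Illustrative template; the notebook defines its own prompt further down.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_example(record, eos_token="</s>"):
    """Return (prompt, full_text); the model is trained on full_text."""
    prompt = TEMPLATE.format(instruction=record["instruction"])
    # The EOS token teaches the model where a response ends.
    full_text = prompt + record["output"] + eos_token
    return prompt, full_text


prompt, full_text = build_example({"instruction": "Add 2 and 3.", "output": "5"})
print(full_text)
```

Keeping the prompt separate from the full text also makes it easy to work out later which tokens belong to the response.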

In [None]:
os.environ["WANDB_ENTITY"]="keisuke-kamata"  # replace with your own W&B entity (user or team)
os.environ["WANDB_PROJECT"]="alpaca_finetuning_with_wandb"
wandb.login()

In [None]:
# log to wandb
with wandb.init():
    # log as a table
    table = wandb.Table(columns=list(alpaca[0].keys()))
    for row in alpaca:
        table.add_data(*row.values())
    wandb.log({"alpaca_gpt4_table": table})

    # log file with artifact
    artifact = wandb.Artifact(
        name="alpaca_gpt4",
        type="dataset",
        description="A GPT-4 generated, Alpaca-like dataset for instruction fine-tuning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    artifact.add_file(dataset_file)
    wandb.log_artifact(artifact)
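
`table.add_data(*row.values())` relies on every record listing its keys in the same order as the first one. A slightly more defensive variant, sketched here in plain Python with a hypothetical `to_rows` helper and toy records, looks each column up by name before the rows are handed to `wandb.Table`.

```python
def to_rows(records, columns):
    """Build table rows by column name so key order in each record is irrelevant."""
    return [[record.get(col, "") for col in columns] for record in records]


# Toy records whose keys appear in different orders (illustrative data).
records = [
    {"instruction": "Say hi", "input": "", "output": "Hi!"},
    {"output": "4", "instruction": "Add", "input": "2+2"},
]
columns = ["instruction", "input", "output"]
rows = to_rows(records, columns)
print(rows[1])
```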

Data split
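
The split below shuffles once with a fixed seed, then takes the head of the list for training and the tail for validation. As a rough sketch on toy data (the `split_head_tail` helper is hypothetical), the two slices stay disjoint only while the list holds at least `n_train + n_val` items:

```python
import random


def split_head_tail(items, n_train, n_val, seed=42):
    """Shuffle a copy with a fixed seed, then take head and tail slices."""
    if n_train + n_val > len(items):
        raise ValueError("train and validation slices would overlap")
    shuffled = list(items)  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[len(shuffled) - n_val:]


train, val = split_head_tail(range(100), n_train=80, n_val=10)
print(len(train), len(val), set(train).isdisjoint(val))
```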

In [None]:
import random
import pandas as pd

seed = 42
random.seed(seed)
random.shuffle(alpaca)  # this could also be a parameter
train_dataset_alpaca = alpaca[:10000]
val_dataset_alpaca = alpaca[-1000:]

We should save the split to W&B

In [None]:
artifact_path = f'{os.environ["WANDB_ENTITY"]}/{os.environ["WANDB_PROJECT"]}/alpaca_gpt4:latest'

In [None]:
with wandb.init(job_type="split_data") as run:
    artifact = run.use_artifact(artifact_path, type='dataset')
    #artifact_folder = artifact.download()

    train_df = pd.DataFrame(train_dataset_alpaca)
    eval_df = pd.DataFrame(val_dataset_alpaca)

    train_df.to_json("alpaca_gpt4_train.jsonl", orient='records', lines=True)
    eval_df.to_json("alpaca_gpt4_eval.jsonl", orient='records', lines=True)


    at = wandb.Artifact(
        name="alpaca_gpt4_splitted",
        type="dataset",
        description="Train/eval split of a GPT-4 generated, Alpaca-like dataset for instruction fine-tuning",
        metadata={"url":"https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data"},
    )
    at.add_file("alpaca_gpt4_train.jsonl")
    at.add_file("alpaca_gpt4_eval.jsonl")
    train_table = wandb.Table(dataframe=train_df)
    eval_table  = wandb.Table(dataframe=eval_df)
    run.log_artifact(at)
    run.log({"train_dataset":train_table, "eval_dataset":eval_table})

The split is also logged as two tables (`train_dataset` and `eval_dataset`), so we can inspect it in the workspace.
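
`to_json(..., orient='records', lines=True)` writes JSON Lines: one JSON object per line. A minimal sketch with only the standard library (the file name and records are made up) shows the format and how to read it back:

```python
import json
import os
import tempfile

records = [
    {"instruction": "Say hi", "input": "", "output": "Hi!"},
    {"instruction": "Add", "input": "2+2", "output": "4"},
]

path = os.path.join(tempfile.mkdtemp(), "toy_split.jsonl")

# Write: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read back: parse each non-empty line on its own.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```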

## Train
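
LoRA keeps the original weights frozen and trains two small matrices per adapted layer, so the rank `r` in the `lora_config` of the next cell directly controls how many parameters are trained. As a back-of-the-envelope sketch, assuming every adapted projection is a square `hidden_size x hidden_size` linear layer and using hypothetical sizes (check your model's config for the real ones):

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def total_lora_params(hidden_size, num_layers, num_projections, r):
    """Total when `num_projections` square projections are adapted in every layer."""
    per_matrix = lora_param_count(hidden_size, hidden_size, r)
    return per_matrix * num_layers * num_projections


# Hypothetical sizes for illustration: 12 layers, hidden size 768, q/k/v adapted.
for r in (2, 8, 32):
    print(r, total_lora_params(hidden_size=768, num_layers=12, num_projections=3, r=r))
```

The count grows linearly with `r`, which is why the rank is a natural hyperparameter to search over.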

In [None]:
config = {
    "BASE_MODEL":"facebook/opt-125m",
    "lora_config":{
        "r":32,
        "lora_alpha":16,
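        # PEFT matches these names against the end of each module name, so this
        # covers the q/k/v projections of every decoder layer, whatever the depth.
        "target_modules": ["q_proj", "k_proj", "v_proj"],
        "lora_dropout":.1,
        "bias":"none",
        "task_type":"CAUSAL_LM"
    },
    "training_args":{
        "dataloader_num_workers":16,
        "evaluation_strategy":"steps",  # called "eval_strategy" in newer transformers releases
        "per_device_train_batch_size":8,
        "max_steps": 50,  # takes precedence over num_train_epochs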
        "gradient_accumulation_steps":2,
        "report_to":"wandb",#wandb integration
        "warmup_steps":10,
        "num_train_epochs":1,
        "learning_rate":2e-4,
        "fp16":True,
        "logging_steps":10,
        "save_steps":10,
        "output_dir":'./outputs'
    }
}
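
With these settings `max_steps` takes precedence over `num_train_epochs`, so the run is short. A quick sketch of the arithmetic, assuming a single GPU (the helper name is made up):

```python
def examples_seen(per_device_batch_size, grad_accum_steps, max_steps, num_devices=1):
    """Return (effective batch size, examples consumed by `max_steps` optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, effective_batch * max_steps


effective_batch, seen = examples_seen(
    per_device_batch_size=8, grad_accum_steps=2, max_steps=50
)
print(effective_batch, seen)  # 16 800
```

So this configuration touches only a fraction of the 10,000 training examples; raise `max_steps` for a longer run.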

In [None]:
torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(
    config["BASE_MODEL"],
    #load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config["BASE_MODEL"])
tokenizer.pad_token = tokenizer.eos_token

In [None]:
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task. Write a response that appropriately completes the request."
        "### Instruction:{instruction} \n\n Input:{input} \n\n ###Response"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a response that appropriately completes the request."
        "### Instruction:{instruction} \n\n ###Response"
    )
}

class InstructDataset(Dataset):
    def __init__(self, json_list, tokenizer, ignore_index=-100):
        self.tokenizer = tokenizer
        self.ignore_index = ignore_index
        self.features = []

        for j in tqdm(json_list):
            # Alpaca records carry an 'input' key even when it is empty,
            # so test the value rather than the key's presence.
            if j.get('input'):
                source_text = PROMPT_DICT['prompt_input'].format_map(j)
            else:
                source_text = PROMPT_DICT['prompt_no_input'].format_map(j)
            example_text = source_text + j['output'] + self.tokenizer.eos_token

            source_tokenized = self.tokenizer(
                source_text,
                padding='longest',
                truncation=True,
                max_length=512,
                return_length=True,
                return_tensors='pt'
            )

            example_tokenized = self.tokenizer(
                example_text,
                padding='longest',
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )

            input_ids = example_tokenized['input_ids'][0]
            labels = input_ids.clone()
            # Mask the prompt tokens so the loss is computed on the response only
            source_len = source_tokenized['length'][0]
            labels[:source_len] = self.ignore_index

            self.features.append({
                'input_ids': input_ids,
                'labels': labels
            })

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx]


class InstructCollator():
    def __init__(self, tokenizer, ignore_index=-100):
        self.tokenizer = tokenizer
        self.ignore_index = ignore_index

    def __call__(self, examples):
        input_batch = []
        label_batch = []
        for example in examples:
            input_batch.append(example['input_ids'])
            label_batch.append(example['labels'])
        input_ids = pad_sequence(
            input_batch, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = pad_sequence(
            label_batch, batch_first=True, padding_value=self.ignore_index
        )
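        # Build the mask from the true lengths: pad_token is set to eos_token above,
        # so comparing against pad_token_id would also mask real EOS tokens.
        attention_mask = pad_sequence(
            [torch.ones_like(ids) for ids in input_batch], batch_first=True, padding_value=0
        )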
        return {
            'input_ids': input_ids,
            'labels': labels,
            'attention_mask': attention_mask
        }


train_dataset = InstructDataset(train_dataset_alpaca, tokenizer)
val_dataset = InstructDataset(val_dataset_alpaca , tokenizer)

# Create the collator that pads each batch to its longest sequence
collator = InstructCollator(tokenizer, ignore_index=-100)

In [None]:
for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()


class CastOutputToFloat(nn.Sequential):
    """Return the wrapped module's output in fp32."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)


model.lm_head = CastOutputToFloat(model.lm_head)

In [None]:
path_dataset_for_trainig = f'{os.environ["WANDB_ENTITY"]}/{os.environ["WANDB_PROJECT"]}/alpaca_gpt4_splitted:latest'

In [None]:
with wandb.init(config=config, job_type="training") as run:
    # track the dataset artifact used by this run
    run.use_artifact(path_dataset_for_trainig)
    # Set up LoRA
    lora_config = LoraConfig(**wandb.config["lora_config"])
    model_peft = get_peft_model(model, lora_config)
    model_peft.print_trainable_parameters()
    model_peft.config.use_cache = False

    trainer = transformers.Trainer(
        model=model_peft,
        data_collator= collator,
        args=transformers.TrainingArguments(**wandb.config["training_args"]),
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )


    trainer.train()
    run.log_code()

## Full Eval Dataset evaluation

Let's log a table with model predictions on the eval dataset (or at least on its first 10 samples).

In [None]:
def create_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)

def prompt_no_input(row):
    return ("Below is an instruction that describes a task. Write a response that appropriately completes the request."
        "### Instruction:{instruction} \n\n ###Response").format_map(row)
def prompt_input(row):
    return ("Below is an instruction that describes a task. Write a response that appropriately completes the request."
        "### Instruction:{instruction} \n\n Input:{input} \n\n ###Response").format_map(row)

def pad_eos(ds):
    EOS_TOKEN = "</s>"
    return [f"{row['output']}{EOS_TOKEN}" for row in ds]

eval_prompts = [create_prompt(row) for row in val_dataset_alpaca]
eval_outputs = pad_eos(val_dataset_alpaca)
eval_dataset = [{"prompt":s, "output":t, "example": s + t} for s, t in zip(eval_prompts, eval_outputs)]

In [None]:
model_artifact_path ='' # change here!

In [None]:
gen_config = GenerationConfig.from_pretrained(config["BASE_MODEL"])
test_config = SimpleNamespace(
    max_new_tokens=256,
    gen_config=gen_config)

def prompt_table(model, examples, log=False, table_name="predictions"):
    table = wandb.Table(columns=["prompt", "generation", "concat", "GPT-4 output"])
    for example in tqdm(examples, leave=False):
        prompt, gpt4_output = example["prompt"], example["output"]
        out = generate(model, prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output)
    if log:
        wandb.log({table_name:table})
    return table

def generate(model, prompt, max_new_tokens=test_config.max_new_tokens, gen_config=gen_config):
    # Move the prompt to the same device as the model
    tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].to(model.device)
    with torch.inference_mode():
        output = model.generate(tokenized_prompt,
                                max_new_tokens=max_new_tokens,
                                generation_config=gen_config,
                                temperature=0.9,
                                top_k=40,
                                top_p=0.70,
                                do_sample=True)
    # Drop the prompt tokens and decode only the newly generated ones
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)


with wandb.init(entity=os.environ["WANDB_ENTITY"],
           project=os.environ["WANDB_PROJECT"],
           job_type="eval",
           config=config):

    artifact = wandb.use_artifact(model_artifact_path)
    artifact_dir = artifact.download()

    # Load the adapter onto a fresh copy of the base model: `model` was already
    # modified in place by get_peft_model during training.
    base_model = AutoModelForCausalLM.from_pretrained(config["BASE_MODEL"], device_map="auto")
    merged_model = PeftModel.from_pretrained(base_model, artifact_dir)
    merged_model = merged_model.merge_and_unload()

    merged_model.eval()
    # Generate with the merged model rather than the global `model`
    prompt_table(merged_model, eval_dataset[:10], log=True, table_name="eval_predictions")
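
The sampling arguments passed to `generate` narrow the set of candidate tokens before one is drawn. The sketch below is a simplified, pure-Python illustration of temperature scaling followed by top-k and top-p (nucleus) filtering on a toy distribution. It is not the `transformers` implementation, and the function name and logits are made up.

```python
import math


def filter_candidates(logits, temperature=0.9, top_k=40, top_p=0.70):
    """Return {token: probability} for the tokens left after top-k and top-p."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}

    # Top-k: keep only the k highest-scoring tokens, then renormalize.
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)[:top_k]
    total = sum(w for _, w in ranked)
    ranked = [(tok, w / total) for tok, w in ranked]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1 before sampling.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.2, "rock": -1.0}
candidates = filter_candidates(toy_logits)
print(sorted(candidates))
```

Lower `top_p` or `top_k` values make generations more conservative; higher values allow more variety.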

## (Advanced) Sweep

In [None]:
#wandb.sdk.wandb_setup._setup(_reset=True)

In [None]:
sweep_configuration= {
    "method": "random",
    "metric": {"goal": "minimize", "name": "eval/loss"},
    "parameters": {
        "r":{"values": [2,4,8,16,32]},
        "lora_alpha":{"values": [2,4,8,16]},
        "learning_rate":{'max': 2e-3, 'min': 2e-4}
    }
}

default_config = {
    "BASE_MODEL":"facebook/opt-125m",
    "lora_config":{
        "r":32,
        "lora_alpha":16,
        # PEFT matches these names against the end of each module name, so this
        # covers the q/k/v projections of every decoder layer, whatever the depth.
        "target_modules":["q_proj", "k_proj", "v_proj"],
        "lora_dropout":.1,
        "bias":"none",
        "task_type":"CAUSAL_LM"
    },
    "training_args":{
        "dataloader_num_workers":16,
        "evaluation_strategy":"steps",
        "per_device_train_batch_size":8,
        "max_steps": 50,
        "gradient_accumulation_steps":2,
        "report_to":"wandb",#wandb integration
        "warmup_steps":10,
        "num_train_epochs":1,
        "learning_rate":2e-4,
        "fp16":True,
        "logging_steps":10,
        "save_steps":10,
        "output_dir":'./outputs'
    }
}


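def train_func():
    # Project and entity come from the WANDB_PROJECT / WANDB_ENTITY environment variables set earlier
    with wandb.init(config=default_config, job_type="training") as run:
        run.use_artifact(path_dataset_for_trainig)

        # Copy the defaults and overwrite them with the values sampled by the sweep
        run_config = copy.deepcopy(default_config)
        run_config["lora_config"]["r"] = wandb.config["r"]
        run_config["lora_config"]["lora_alpha"] = wandb.config["lora_alpha"]
        run_config["training_args"]["learning_rate"] = wandb.config["learning_rate"]

        # Start every run from a fresh base model so adapters from earlier runs do not pile up
        base_model = AutoModelForCausalLM.from_pretrained(run_config["BASE_MODEL"], device_map="auto")
        base_model.gradient_checkpointing_enable()
        base_model.enable_input_require_grads()

        # Set up LoRA
        lora_config = LoraConfig(**run_config["lora_config"])
        model_peft = get_peft_model(base_model, lora_config)
        model_peft.print_trainable_parameters()
        model_peft.config.use_cache = False

        trainer = transformers.Trainer(
            model=model_peft,
            data_collator=collator,
            args=transformers.TrainingArguments(**run_config["training_args"]),
            train_dataset=train_dataset,
            eval_dataset=val_dataset
        )
        trainer.train()
        run.log_code()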

sweep_id = wandb.sweep(sweep=sweep_configuration)
wandb.agent(sweep_id, function=train_func, count=20)
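
With `method: random`, each run gets an independent draw from the search space: one of the listed `values`, or a number between `min` and `max`. The sketch below imitates that idea with the standard library and folds the flat draw into a nested config, much as `train_func` does. The helper names are hypothetical, and this is not how W&B implements sampling internally.

```python
import copy
import random

search_space = {
    "r": {"values": [2, 4, 8, 16, 32]},
    "lora_alpha": {"values": [2, 4, 8, 16]},
    "learning_rate": {"min": 2e-4, "max": 2e-3},
}


def sample_params(space, rng):
    """Draw one value per parameter: a choice from `values` or a uniform float."""
    sampled = {}
    for name, spec in space.items():
        if "values" in spec:
            sampled[name] = rng.choice(spec["values"])
        else:
            sampled[name] = rng.uniform(spec["min"], spec["max"])
    return sampled


def apply_params(base_config, sampled):
    """Return a copy of the nested config with the sampled values filled in."""
    cfg = copy.deepcopy(base_config)  # never mutate the shared defaults
    cfg["lora_config"]["r"] = sampled["r"]
    cfg["lora_config"]["lora_alpha"] = sampled["lora_alpha"]
    cfg["training_args"]["learning_rate"] = sampled["learning_rate"]
    return cfg


base = {
    "lora_config": {"r": 32, "lora_alpha": 16},
    "training_args": {"learning_rate": 2e-4},
}
rng = random.Random(0)
run_config = apply_params(base, sample_params(search_space, rng))
print(run_config)
```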