## Fine-tune Llama-2-7b on Google Colab

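This Jupyter notebook aims to fine-tune the Llama-2 7B model on 122k Alpaca-style code instructions. Note that the cells as saved here load `stabilityai/stablecode-instruct-alpha-3b` and the `TokenBender/python_evol_instruct_51k` dataset instead.
If you face any issues, DM me: https://twitter.com/4evaBehindSOTA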

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets`, `scipy` and `trl` to leverage [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model to 4-bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes), and `wandb` for logging the training run. We also install `einops`, but it was mainly needed for loading Falcon, so I will remove it in later versions.
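Because the install commands use `-U` and pull `peft` from its main branch, the exact versions can differ between runs. The sketch below is one way to record what is actually installed once the install cell has finished; the helper name `installed_versions` is made up for this example, and it relies only on the standard library.

```python
from importlib import metadata

# Packages named in the install cell of this notebook.
REQUIRED = ["trl", "transformers", "accelerate", "peft",
            "datasets", "bitsandbytes", "einops", "scipy", "wandb"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:>14}: {version or 'NOT INSTALLED'}")
```

Saving this output next to the results makes it easier to reproduce a run later.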

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q -U datasets bitsandbytes einops scipy wandb

## Dataset



In [2]:
!git config --global credential.helper store

In [3]:
from huggingface_hub import notebook_login
notebook_login()

In [4]:
from datasets import load_dataset

dataset_name = 'TokenBender/python_evol_instruct_51k'
dataset = load_dataset(dataset_name, split="train")

## Loading the model

In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "stabilityai/stablecode-instruct-alpha-3b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False
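To see why 4-bit loading matters on a Colab GPU, here is a rough back-of-the-envelope sketch of the memory taken by the weights alone. The parameter count of 3 billion is an assumption read off the "3b" in the model name, and the estimate ignores quantization constants, activations, optimizer state and the LoRA adapters.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 3B parameters, taken from the "3b" in the model name.
n_params = 3e9

for label, bits in [("fp16/bf16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 6 GB to about 1.5 GB, which leaves room for activations and adapter gradients.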

Let's also load the tokenizer below. The EOS token is reused as the padding token, and padding is applied on the right.

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [7]:
from peft import LoraConfig, get_peft_model

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
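As a rough illustration of what `r` and `lora_alpha` mean, the sketch below computes the adapter scaling factor and the number of extra parameters LoRA adds to one weight matrix. It assumes the standard LoRA formulation, where the update `B @ A` is scaled by `lora_alpha / r`. The hidden size of 4096 is a hypothetical value chosen for the example, not read from the model config.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r) for one adapted matrix."""
    return r * (d_in + d_out)


lora_alpha, lora_r = 16, 64   # values from the cell above
hidden = 4096                 # hypothetical hidden size, for illustration only

extra = lora_params_per_matrix(hidden, hidden, lora_r)
full = hidden * hidden
print(f"scaling = {lora_scaling(lora_alpha, lora_r)}")
print(f"adapter params per matrix = {extra} ({100 * extra / full:.3f}% of the full matrix)")
```

With `lora_alpha=16` and `r=64` the scaling is 0.25, so the adapter's contribution is damped relative to a setup where `lora_alpha` equals `r`.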

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), a wrapper around the transformers `Trainer` that makes it easy to fine-tune models on instruction-based datasets using PEFT adapters. Let's first define the training arguments below.

In [8]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 12
gradient_accumulation_steps = 12
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 4e-3
max_grad_norm = 0.3
max_steps = -1
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    num_train_epochs=1,
)
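The batch size and gradient accumulation settings together determine how many optimizer updates one epoch contains. The sketch below works this out under three assumptions: a single GPU, 51,320 training examples, and a trainer that drops the incomplete final accumulation group (floor division). The last point depends on the installed transformers version.

```python
import math


def update_steps_per_epoch(n_examples, per_device_batch, grad_accum, n_devices=1):
    """Optimizer updates in one epoch, flooring the last partial accumulation group."""
    batches = math.ceil(n_examples / (per_device_batch * n_devices))
    return max(batches // grad_accum, 1)


n_examples = 51320                 # assumption: size of the train split
per_device_train_batch_size = 12   # from the cell above
gradient_accumulation_steps = 12   # from the cell above

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps = update_steps_per_epoch(
    n_examples, per_device_train_batch_size, gradient_accumulation_steps
)
print(f"effective batch size: {effective_batch}")
print(f"update steps per epoch: {steps}")
```

This gives an effective batch of 144 sequences and 356 update steps, which agrees with the `global_step=356` reported in the training output further down.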

Then finally pass everything to the trainer.

In [9]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)



Map:   0%|          | 0/51320 [00:00<?, ? examples/s]
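With `dataset_text_field="text"`, the trainer reads one complete training string per example from the `text` column. As a sketch of what such a string might look like, the helper below follows the `###Instruction` / `###Response` layout used in the inference prompt later in this notebook. The column names `instruction` and `output` are hypothetical, and the dataset may already ship a ready-made `text` column.

```python
def format_example(instruction: str, response: str) -> str:
    """Join an instruction and its answer in the layout used at inference time."""
    return f"###Instruction\n{instruction} ###Response\n{response}"


def to_text(row: dict) -> dict:
    """Build the `text` field from hypothetical `instruction`/`output` columns."""
    return {"text": format_example(row["instruction"], row["output"])}


example = {"instruction": "Write a function that adds two numbers.",
           "output": "def add(a, b):\n    return a + b"}
print(to_text(example)["text"])

# With a Hugging Face dataset that has these columns, the same function
# could be applied as: dataset = dataset.map(to_text)
```

Keeping the training layout identical to the inference prompt helps avoid a mismatch between what the model saw and what it is asked later.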

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [10]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() casts the parameters in place, so no reassignment is needed
        module.to(torch.float32)

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [11]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10   | 1.3381        |
| 20   | 1.3319        |
| 30   | 1.898         |
| 40   | 1.1545        |
| 50   | 1.0489        |
| 60   | 1.1739        |
| 70   | 1.2021        |
| 80   | 1.1441        |
| 90   | 1.0673        |
| 100  | 1.0084        |


TrainOutput(global_step=356, training_loss=1.1291942221395086, metrics={'train_runtime': 12807.5095, 'train_samples_per_second': 4.007, 'train_steps_per_second': 0.028, 'total_flos': 2.3501700511617024e+17, 'train_loss': 1.1291942221395086, 'epoch': 1.0})

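During QLoRA training, the training loss can spike and fall sharply (see step 30 in the log above).
This script uses a `bfloat16` compute dtype for the 4-bit layers and right-side padding with the EOS token so that the training loss does not drop to zero. Note that `TrainingArguments` above sets `fp16=True` rather than `bf16=True`.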
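A quick way to spot such spikes is to compare each logged loss with the median of the log. The sketch below uses the ten values reported in the training log above; the threshold of 1.3 times the median is an arbitrary choice made for this example.

```python
from statistics import median

# Step -> training loss, copied from the training log above.
logged = {10: 1.3381, 20: 1.3319, 30: 1.898, 40: 1.1545, 50: 1.0489,
          60: 1.1739, 70: 1.2021, 80: 1.1441, 90: 1.0673, 100: 1.0084}


def find_spikes(losses, ratio=1.3):
    """Return the steps whose loss exceeds `ratio` times the median logged loss."""
    mid = median(losses.values())
    return [step for step, loss in losses.items() if loss > ratio * mid]


print(find_spikes(logged))  # -> [30]
```

Only step 30 stands out here; the remaining values drift downward from about 1.34 to about 1.01.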

The `SFTTrainer` will take care of properly saving only the adapters during training instead of saving the entire model.

In [11]:
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")

In [12]:
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)
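Note that `get_peft_model` builds adapter layers from a config; it does not read saved adapter weights from disk. For a fresh session, one possible way to reload the fine-tuned model is to load the quantized base model again and attach the saved adapter with `PeftModel.from_pretrained`. The function name `load_finetuned_model` is made up for this sketch, and the quantization settings simply mirror the ones used earlier in this notebook.

```python
def load_finetuned_model(base_model_name: str, adapter_dir: str = "outputs"):
    """Reload the 4-bit base model and attach the LoRA weights saved in `adapter_dir`."""
    # Imports are kept inside the function so that defining it has no side effects.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        trust_remote_code=True,
    )
    # Reads the adapter config and weights from adapter_dir and wraps the base model.
    return PeftModel.from_pretrained(base, adapter_dir)
```

It would be called as `model = load_finetuned_model(model_name)`, which downloads the base model and needs a GPU with `bitsandbytes` support.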

In [18]:
text = '''###Instruction\nGenerate a python function to print fibonacci sequence iteratively. ###Response\n'''
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt", return_token_type_ids=False).to(device)
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


###Instruction
Generate a python function to print fibonacci sequence iteratively. ###Response

def fibonacci_iterative(n):
    a = 0
    b = 1
    if n < 0:
        print("Incorrect input")
    elif n == 0:
        return a
    elif n == 1:
        return b
    else:
        for i in range(2,n):
            c = a + b
            a = b
            b = c
        return b
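The generated function can be sanity-checked rather than trusted by eye. The sketch below copies it verbatim and compares it with a small reference written for this check, assuming the 0-indexed convention F(0) = 0, F(1) = 1 (the prompt does not state one).

```python
def fibonacci_iterative(n):
    # Copied verbatim from the model output above.
    a = 0
    b = 1
    if n < 0:
        print("Incorrect input")
    elif n == 0:
        return a
    elif n == 1:
        return b
    else:
        for i in range(2,n):
            c = a + b
            a = b
            b = c
        return b


def fibonacci_reference(n):
    """Reference written for this check: F(0) = 0, F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


mismatches = [n for n in range(10)
              if fibonacci_iterative(n) != fibonacci_reference(n)]
print(mismatches)  # -> [3, 4, 5, 6, 7, 8, 9]
```

Under that convention the generated code is off by one from n = 3 onward, because `range(2, n)` runs one iteration too few. It also returns a single number instead of printing the sequence the prompt asked for.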


In [None]:
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoderbase-7b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def binary_search(arr, target):", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

In [16]:
text = '''<s>[INST]<<SYS>> You are a helpful coding assistant, read user instruction carefully, identify all caveats of the request, plan the task and answer.
<</SYS>>
Implement a function that returns the factorial of a number using a for loop.[/INST]'''
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

[INST]<<SYS>> You are a helpful coding assistant, read user instruction carefully, identify all caveats of the request, plan the task and answer.
<</SYS>>
Implement a function that returns the factorial of a number using a for loop.[/INST]  Great! I'm happy to help you with that.
To implement a function that returns the factorial of a number using a for loop, you will need to follow these steps:
1. Read the user instruction carefully: The user has asked you to implement a function that returns the factorial of a number using a for loop.
2. Identify all caveats of the request: The user has not provided any specific requirements or constraints for the function, so you will need to determine the following:
* The type of number the function will operate on (e.g. integer, floating-point number)
* The range of values the function will handle (e.g. will it handle negative numbers, large integers)
* Any specific edge cases or corner cases that the function should handle (e.g. zero, very large 

In [19]:
model.push_to_hub("TokenBender-stableCodePy-3B-chat")

CommitInfo(commit_url='https://huggingface.co/TokenBender/TokenBender-stableCodePy-3B-chat/commit/a43cea3c6b5b61c0904ad697b7120b779f7edc52', commit_message='Upload model', commit_description='', oid='a43cea3c6b5b61c0904ad697b7120b779f7edc52', pr_url=None, pr_revision=None, pr_num=None)

In [13]:
!pip install -q -U numba


In [14]:
from numba import cuda
device = cuda.get_current_device()
device.reset()
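If the goal is only to release cached GPU memory without resetting the device, a lighter alternative is sketched below. It assumes that all references to the model and trainer have already been deleted (for example with `del model, trainer`). The helper name `free_gpu_memory` is made up for this example.

```python
import gc


def free_gpu_memory() -> bool:
    """Collect garbage and release PyTorch's cached CUDA memory.

    Returns True if the CUDA cache was emptied, False if PyTorch or a GPU
    is not available in this environment.
    """
    gc.collect()  # drop unreachable Python objects that still hold tensors
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    return True


print(free_gpu_memory())
```

Memory still referenced by live tensors is not released by this call, so deleting the large objects first is what matters most.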