# Finetuning Falcon 7B, the first open-source LLM licensed for commercial use, on an open-source instruction dataset

Recent papers show that smaller models can be trained to outperform GPT-4 on specific tasks. Notably, they demonstrate this by finetuning models with around 7B parameters. Some highlights:

*   **Gorilla** - an LLM connected with massive APIs. https://arxiv.org/abs/2305.15334
*   **Goat** - able to surpass GPT-4 on arithmetic tasks. https://arxiv.org/pdf/2305.14201v1.pdf

Other recent papers also suggest that a quality small dataset can help smaller language models shine.

*   **LIMA** - less is more for alignment. https://arxiv.org/pdf/2305.11206.pdf

With the advancements announced in the QLoRA paper, it is now possible to finetune smaller models on a single GPU.

*   **QLoRA** https://arxiv.org/pdf/2305.14314v1.pdf
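
To get a feel for why 4-bit quantization makes single-GPU finetuning practical, here is a rough back-of-the-envelope sketch. It only counts the memory needed to hold the weights and ignores activations, gradients, optimizer state and quantization constants. The round figure of 7 billion parameters and the helper name `weight_memory_gib` are my own assumptions for illustration.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7_000_000_000  # assumed round figure for a "7B" model

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

The 4-bit copy needs a quarter of the memory of the 16-bit one, which leaves room on a small GPU for the LoRA adapters and the training state.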

# Below is code showing how you can finetune an LLM (in this case Falcon, released by the UAE for free commercial use) on instruction data to achieve state-of-the-art results on specific tasks. This code implements QLoRA, so it can run on a single GPU with less than 24 GB of VRAM.

**As recent papers show, the most important factor is not the size of the model but the dataset you train it on.**

In [None]:
!pip install -q -U bitsandbytes
!pip install einops
!pip install -q -U git+https://github.com/huggingface/transformers.git 
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

In [5]:
!nvidia-smi

Fri Jun  2 09:24:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    27W /  70W |   6793MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b"

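bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base model weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for computation after dequantizing
)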

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, trust_remote_code=True, device_map={"":0})
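
The config above asks bitsandbytes to store the weights in 4 bits. As a toy illustration of the general idea, the sketch below quantizes one block of weights to 16 integer codes using the block's absolute maximum as the scale. This is not what bitsandbytes does internally: the real NF4 type uses non-uniform levels suited to normally distributed weights, and double quantization additionally compresses the per-block scales. The function names and sample weights are made up for this example.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes in [0, levels - 1] using the block's absolute maximum."""
    absmax = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    half = (levels - 1) / 2
    codes = [round((v / absmax) * half + half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Invert quantize_block, recovering approximate float values."""
    half = (levels - 1) / 2
    return [((c - half) / half) * absmax for c in codes]


weights = [0.12, -0.50, 0.03, 0.49, -0.27, 0.00, 0.31, -0.08]  # made-up block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"largest round-trip error: {max_error:.4f}")
```

Each weight now costs 4 bits plus a share of one scale per block, at the price of a small rounding error.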

# Prepare for PEFT training (parameter-efficient fine-tuning)

In [6]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [49]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [50]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 2359296 || all params: 3611104128 || trainable%: 0.06533447711203746
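
The trainable parameter count can be checked by hand. LoRA adds two small matrices to each targeted layer: `A` with shape `r x in_features` and `B` with shape `out_features x r`. The sketch below assumes Falcon-7B's dimensions (hidden size 4544, 71 attention heads, 32 layers, multi-query attention with a single shared key/value head); these figures are not stated in this notebook, so treat them as assumptions.

```python
# Assumed Falcon-7B dimensions (not stated in this notebook)
HIDDEN_SIZE = 4544
N_HEADS = 71
N_LAYERS = 32
HEAD_DIM = HIDDEN_SIZE // N_HEADS  # 64

# Multi-query attention: one query per head, plus a single shared key and value
QKV_OUT = HIDDEN_SIZE + 2 * HEAD_DIM

# Values taken from the LoraConfig above
R = 8
LORA_ALPHA = 32

# Per targeted layer: A is (r x in_features), B is (out_features x r)
params_per_layer = R * HIDDEN_SIZE + QKV_OUT * R
lora_params = params_per_layer * N_LAYERS

# LoRA scales its update by alpha / r before adding it to the frozen layer's output
scaling = LORA_ALPHA / R

print(f"trainable LoRA params: {lora_params}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the total matches the `trainable params` figure printed above. The `all params` figure is well below 7 billion, most likely because the packed 4-bit weights hold two values per stored element, so `numel()` undercounts them.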


# Load our open-source, commercially usable instruction dataset (Dolly 15k)

In [None]:
from datasets import load_dataset

data = load_dataset("databricks/databricks-dolly-15k")

In [53]:
data

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 15011
    })
})

# Prepare the dataset by tokenizing and also joining some of the columns

In [None]:
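def tokenize_function(examples):
    # Build one sequence per example. A causal LM predicts each token from the
    # tokens before it, so prompt and response must share the same input_ids.
    texts = [
        f"{instruction} {context}\n{response}{tokenizer.eos_token}"
        for instruction, context, response in zip(
            examples["instruction"], examples["context"], examples["response"]
        )
    ]
    tokens = tokenizer(texts, padding="max_length", truncation=True, max_length=512)

    # Labels copy input_ids; padded positions become -100 so the loss skips them
    labels = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"], "labels": labels}

# If the tokenizer has no pad token, fall back to EOS so padding works
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dataset = data.map(tokenize_function, batched=True, remove_columns=["instruction", "context", "response"])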
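
For instruction tuning, a common convention is to compute the loss only on the response tokens: prompt and response go into one sequence, and every label position that belongs to the prompt or to padding is set to -100, which the loss ignores. Here is a minimal sketch of that layout using made-up token ids instead of a real tokenizer. The helper `build_example` and the `pad_id` value are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(prompt_ids, response_ids, max_length, pad_id):
    """Concatenate prompt and response, pad, and mask everything except the response."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad all three lists to the same fixed length
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    labels += [IGNORE_INDEX] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], max_length=8, pad_id=0)
for name, values in example.items():
    print(name, values)
```

The model still sees the prompt through `input_ids`, but only the response positions contribute to the loss.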

In [57]:
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

In [58]:
dataset

DatasetDict({
    train: Dataset({
        features: ['category', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 15011
    })
})

# Split the dataset up into train and test

In [59]:
from datasets import DatasetDict

# Split the data into 80% for training and 20% for evaluation
split_data = dataset['train'].train_test_split(test_size=0.2)

# Now we update the dataset with the new split data
dataset = DatasetDict({
    'train': split_data['train'],
    'eval': split_data['test']
})
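
Because the split shuffles the rows, it changes from run to run unless you fix the random seed (`train_test_split` accepts a `seed` argument for this). The sketch below shows the idea in plain Python: shuffle the row indices with a seeded generator, then cut them into two parts. The helper name and the rounding of the evaluation set size are assumptions made for illustration, not the library's implementation.

```python
import math
import random


def train_test_split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed; the first n_test become the evaluation set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, eval_idx = train_test_split_indices(n_rows=15011, test_size=0.2, seed=42)
print(len(train_idx), len(eval_idx))
```

Calling the function again with the same seed returns the same split, which makes evaluation numbers comparable across runs.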


# Train the model

Following the bitsandbytes example, I have set this to run for only 10 steps as a test. For a full training run, uncomment **num_train_epochs** and remove **max_steps** (a positive max_steps overrides num_train_epochs). Based on the ETA, finetuning would take about 24 hours.
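
The rough ETA can be reproduced with some arithmetic. The sketch below uses the batch settings from the training cell and the throughput of about 0.097 steps per second that the 10-step run recorded in this notebook reports. The size of the training split and the way steps per epoch are rounded are my assumptions.

```python
import math

N_ROWS = 15011            # rows in Dolly 15k, from the dataset printout
TEST_SIZE = 0.2           # fraction held out for evaluation
PER_DEVICE_BATCH = 1      # per_device_train_batch_size
GRAD_ACCUM = 4            # gradient_accumulation_steps
EPOCHS = 3                # the commented-out num_train_epochs
STEPS_PER_SECOND = 0.097  # train_steps_per_second from the 10-step run

n_train = N_ROWS - math.ceil(N_ROWS * TEST_SIZE)  # assumed split rounding
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # samples per optimizer step
steps_per_epoch = n_train // effective_batch
total_steps = steps_per_epoch * EPOCHS
eta_hours = total_steps / STEPS_PER_SECOND / 3600

print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {total_steps}")
print(f"estimated training time: {eta_hours:.1f} hours")
```

On this hardware the estimate lands close to the roughly one day quoted above. A faster GPU or a shorter `max_length` would bring it down.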

In [64]:
import transformers

trainer = transformers.Trainer(
    model=model,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
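        # num_train_epochs=3,  # for a full run, enable this and remove max_steps
        max_steps=10,  # short test run; a positive max_steps overrides num_train_epochs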
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    train_dataset=dataset['train'],
    eval_dataset=dataset['eval'],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Step,Training Loss
1,1.025
2,2.1154
3,1.3438
4,2.5447
5,0.5643
6,3.4959
7,1.9272
8,0.6405
9,0.7592
10,1.6569


TrainOutput(global_step=10, training_loss=1.6072865188121797, metrics={'train_runtime': 103.5639, 'train_samples_per_second': 0.386, 'train_steps_per_second': 0.097, 'total_flos': 407425237647360.0, 'train_loss': 1.6072865188121797, 'epoch': 0.0})
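
With only 10 steps and an effective batch of 4 samples, the loss above is too noisy to judge the run, and the learning rate is changing on every step as well. Assuming the Trainer's default linear schedule (warm up to the peak rate, then decay linearly to zero), the rate applied at each step can be sketched as follows. The function name is hypothetical; the defaults mirror `learning_rate`, `warmup_steps` and `max_steps` from the training cell.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, total_steps=10):
    """Learning rate after `step` optimizer steps under linear warmup and linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


schedule = [linear_schedule_lr(step) for step in range(11)]
for step, lr in enumerate(schedule):
    print(f"step {step:2d}: lr = {lr:.2e}")
```

In a 10-step test the rate peaks at step 2 and then falls quickly, so a full run with thousands of steps will behave quite differently.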