<a href="https://colab.research.google.com/github/sudhang/css-nlp/blob/master/falcon/Falcon_7B_QLORA_Train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we fine-tune Falcon-7B from TII UAE using QLoRA. The model was released a couple of months ago and is freely available, including for commercial use. We use 4-bit quantization, which allows this 7-billion-parameter model to be trained on a free Google Colab instance.
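
As a rough check on why 4-bit quantization matters here, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes exactly 7 billion parameters and ignores activations, optimizer state, and quantization constants, so the numbers are lower bounds rather than measurements. The helper name `weight_gib` is illustrative only.

```python
# Rough weight-memory estimate for a 7-billion-parameter model.
# Assumption: exactly 7e9 parameters; the real checkpoint differs slightly.
N_PARAMS = 7_000_000_000

BITS_PER_PARAM = {
    "float32": 32,
    "float16 / bfloat16": 16,
    "int8": 8,
    "4-bit": 4,
}

def weight_gib(n_params: int, bits: int) -> float:
    """Memory for the weights alone, in GiB (no activations, no optimizer state)."""
    return n_params * bits / 8 / 2**30

for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>20}: {weight_gib(N_PARAMS, bits):5.1f} GiB")
```

At 4 bits the weights need roughly a quarter of the half-precision footprint, which is what leaves room for training state on a single small GPU.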

We rely heavily on the Google Colab notebooks and tutorials provided by Hugging Face: https://huggingface.co/blog/4bit-transformers-bitsandbytes

Apart from that, we used a number of papers, tutorial blogs, and even YouTube videos:



1.   [Fine-tuning Alpaca and LLaMA: Training on a Custom Dataset](https://www.mlexpert.io/machine-learning/tutorials/alpaca-fine-tuning#user-content-fn-6)
2.   [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685.pdf)
3.   [How to Fine-Tune Open-Source LLMs Locally Using QLoRA!](https://youtu.be/2bkrL2ZcOiM)
4.   [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf)
5.   [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
6.   [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)







## Installations

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install einops



## Imports

We log in to the Hugging Face Hub with an access token. Falcon-7B itself can be downloaded without one, but the token is needed later to push the trained adapter to the Hub.

In [None]:
from huggingface_hub import login
login()


In [None]:
GDRIVEPATH = "/content/drive/MyDrive/TU/Sem 4/NLP"

In [None]:
import pandas as pd

import torch
import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig       # For quantization
from peft import prepare_model_for_kbit_training

from peft import LoraConfig                       # For LORA
from peft import get_peft_model

from datasets import Dataset, load_dataset, DatasetDict

## Dataset

In [None]:
df = pd.read_csv(f"{GDRIVEPATH}/data/nyt_train.csv")

df = df[["content"]]
display(df)

data = Dataset.from_pandas(df)

Unnamed: 0,content
0,"WASHINGTON — Three years ago, President Barack..."
1,When Honda Motor Company said two months ago t...
2,WASHINGTON — Four months after a historic acco...
3,WASHINGTON — Within hours of opening an invest...
4,"WASHINGTON — The F.B.I. director, James B. Com..."
...,...
8580,Senator Marco Rubio of Florida won the Puerto ...
8581,"EXETER, N.H. — Eight years after aggressively ..."
8582,"PRINCETON, Ore. — The armed occupation of a wi..."
8583,"WASHINGTON — As North Korea’s reclusive ruler,..."


## PARAMS

In [None]:
TRAINING_MAX_STEPS = 300



LORAPARAM_RANK = 8
LORAPARAM_ALPHA = 32
LORAPARAM_DROPOUT = 0.05
LORAPARAM_BIAS = "none"

## Load the Model

In [None]:
base_model_id = "tiiuae/falcon-7b"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
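    bnb_4bit_use_double_quant=True,         # nested quantization: also quantize the quantization constants to save memory
    bnb_4bit_quant_type="nf4",              # NF4 (NormalFloat4), the data type recommended over FP4 in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16   # weights are dequantized to bfloat16 for the actual computation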
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
                base_model_id,
                quantization_config=bnb_config,
                device_map={"":0}
              )


Loading tiiuae/falcon-7b requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co/tiiuae/falcon-7b. You can dismiss this prompt by passing `trust_remote_code=True`.
Do you accept? [y/N] y



In [None]:
data = data.map(
    lambda samples: tokenizer(samples["content"], truncation=True, max_length=512),
    batched=True
)

Map:   0%|          | 0/8585 [00:00<?, ? examples/s]
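
To give an intuition for what loading the model in 4 bits means, here is a toy sketch of blockwise absmax quantization in plain Python. It is a simplification and not what bitsandbytes does internally: NF4 uses non-uniformly spaced levels instead of the evenly spaced integer codes below, and the block size of 4 is chosen only to keep the example readable. All function names are hypothetical.

```python
def quantize_block(block, n_bits=4):
    """Map one block of floats to signed integer codes via absmax scaling."""
    max_code = 2 ** (n_bits - 1) - 1            # 7 for 4 bits, so codes lie in [-7, 7]
    absmax = max(abs(v) for v in block) or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / absmax * max_code) for v in block]
    return codes, absmax

def dequantize_block(codes, absmax, n_bits=4):
    """Invert quantize_block up to rounding error."""
    max_code = 2 ** (n_bits - 1) - 1
    return [c / max_code * absmax for c in codes]

def blockwise_roundtrip(weights, block_size=4):
    """Quantize and dequantize each block with its own scaling constant."""
    restored = []
    for start in range(0, len(weights), block_size):
        codes, absmax = quantize_block(weights[start:start + block_size])
        restored.extend(dequantize_block(codes, absmax))
    return restored

# Toy weights: the second block contains a larger value than the first.
weights = [0.12, -0.50, 0.33, 0.05, 2.00, -1.50, 0.25, 0.00]
restored = blockwise_roundtrip(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("restored:", [round(r, 3) for r in restored])
print("max abs error:", round(max_error, 4))
```

Each block stores one floating-point constant (`absmax`) next to its 4-bit codes, so a large value in one block does not degrade the resolution of the others. The double quantization enabled in `bnb_config` goes one step further and compresses those per-block constants as well.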

## Preprocessing

In [None]:
base_model.gradient_checkpointing_enable()                  # recompute activations during the backward pass to save memory (costs extra compute)
base_model = prepare_model_for_kbit_training(base_model)      # prepare the quantized model for 4-bit training

### Investigation

With LoRA, we do not train all of the model's parameters. In the next two cells, we check what fraction of the original 7 billion parameters is actually trainable once the LoRA adapters are attached.
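
The idea behind LoRA can be sketched in a few lines of plain Python. This is a toy illustration with made-up dimensions and hypothetical helper names (`matvec`, `lora_forward`), not PEFT's implementation: the frozen weight matrix `W` is left untouched, and a trainable low-rank product `B @ A`, scaled by `alpha / r`, is added to its output.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 8               # toy sizes, chosen for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]      # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                     # trainable, d_out x r, zero-initialised
x = [1.0] * d_in

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
unchanged_at_init = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
frozen_count = d_in * d_out
trainable_count = r * (d_in + d_out)
print(unchanged_at_init, frozen_count, trainable_count)
```

The number of trainable values grows with `r * (d_in + d_out)` rather than `d_in * d_out`, so for the large projection matrices of a real model and a small rank the adapter is tiny compared to the frozen weights.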

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} \n all params: {all_param} \n trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

# https://huggingface.co/docs/peft/conceptual_guides/lora#common-lora-parameters-in-peft
config = LoraConfig(
    r=LORAPARAM_RANK,                                    # rank of the update matrices
    lora_alpha=LORAPARAM_ALPHA,                          # LoRA scaling factor (the update is scaled by alpha / r)
    target_modules=["query_key_value"],                 # attach low-rank adapters to the attention projection; the original weights stay frozen
    lora_dropout=LORAPARAM_DROPOUT,                      # dropout applied on the LoRA branch
    bias=LORAPARAM_BIAS,                                 # whether to train the bias params
    task_type="CAUSAL_LM"
)

adapter_model = get_peft_model(base_model, config)
print_trainable_parameters(adapter_model)

trainable params: 2359296 
 all params: 3611104128 
 trainable%: 0.06533447711203746


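As we can see, only about `0.065%` of the model's parameters are trainable, which greatly reduces the memory and compute needed for training. Note that the reported total of about 3.6 billion is lower than 7 billion, most likely because the 4-bit weights are stored in packed form (two values per byte), so `numel()` undercounts them.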
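
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes Falcon-7B's architecture values (hidden size 4544, head dimension 64, 32 decoder layers, and multi-query attention with a single shared key and value head); these numbers are not stated in this notebook and are labeled as assumptions in the code. For each adapted layer, LoRA adds a matrix of shape `r x d_in` and one of shape `d_out x r`.

```python
# Assumed Falcon-7B dimensions (not taken from this notebook).
HIDDEN_SIZE = 4544
HEAD_DIM = 64
N_LAYERS = 32
# Multi-query attention: all query heads plus one shared key head and one shared value head.
QKV_OUT = HIDDEN_SIZE + 2 * HEAD_DIM

RANK = 8                          # LORAPARAM_RANK in this notebook
REPORTED_ALL_PARAMS = 3611104128  # "all params" as printed by print_trainable_parameters

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

per_layer = lora_params(HIDDEN_SIZE, QKV_OUT, RANK)
total_trainable = per_layer * N_LAYERS
trainable_percent = 100 * total_trainable / REPORTED_ALL_PARAMS

print("per layer:", per_layer)
print("total trainable:", total_trainable)
print("trainable %:", trainable_percent)
```

Under these assumptions the total matches the 2359296 trainable parameters reported above, which also confirms that only the `query_key_value` projection in each layer carries an adapter.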

## Training

In [None]:
# the tokenizer has no pad token set; reuse the EOS token so the data collator can pad batches
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=adapter_model,
    train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=TRAINING_MAX_STEPS,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
adapter_model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.9354
2,2.1078
3,2.1141
4,2.1873
5,1.9325
6,2.1343
7,1.9966
8,2.1361
9,2.0938
10,1.931


TrainOutput(global_step=300, training_loss=2.028775629599889, metrics={'train_runtime': 540.0799, 'train_samples_per_second': 2.222, 'train_steps_per_second': 0.555, 'total_flos': 1.2192498643753728e+16, 'train_loss': 2.028775629599889, 'epoch': 0.14})
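
The metrics in `TrainOutput` follow from the training arguments. The short calculation below is a sketch that assumes a single GPU and uses the dataset size of 8585 examples shown in the tokenization output; it recovers the reported epoch fraction and throughput.

```python
MAX_STEPS = 300          # TRAINING_MAX_STEPS
PER_DEVICE_BATCH = 1     # per_device_train_batch_size
GRAD_ACCUM = 4           # gradient_accumulation_steps
N_EXAMPLES = 8585        # rows in the tokenized dataset
RUNTIME_S = 540.0799     # train_runtime reported by the Trainer

# Assumption: one device, so the effective batch is batch size times accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
samples_seen = MAX_STEPS * effective_batch

epoch_fraction = samples_seen / N_EXAMPLES
samples_per_second = samples_seen / RUNTIME_S
steps_per_second = MAX_STEPS / RUNTIME_S

print("effective batch size:", effective_batch)
print("samples seen:", samples_seen)
print("epoch fraction:", round(epoch_fraction, 2))
print("samples/s:", round(samples_per_second, 3))
print("steps/s:", round(steps_per_second, 3))
```

In other words, 300 optimizer steps cover only 1200 of the 8585 articles, about 14% of one epoch, so this run is a quick demonstration rather than a full pass over the data.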

In [None]:
adapter_model.config.use_cache = True

## Save The Model

To save the adapter, we push it to the Hugging Face Hub. Only the LoRA adapter weights are uploaded, not the full base model.

In [None]:
adapter_model_id = "falcon_cssnlp"

# Push the model to the Hugging Face Hub
adapter_model.push_to_hub(f"sudhangshankar/{adapter_model_id}", use_auth_token=True)




CommitInfo(commit_url='https://huggingface.co/sudhangshankar/falcon_cssnlp/commit/1438d089d6cf798db1c540ffcbda1c65d67da333', commit_message='Upload model', commit_description='', oid='1438d089d6cf798db1c540ffcbda1c65d67da333', pr_url=None, pr_revision=None, pr_num=None)