<a href="https://colab.research.google.com/github/sudhang/css-nlp/blob/master/llama/LLaMa2_7B_QLORA_Train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we fine-tune Llama 2 from Meta, which was released very recently and for which we obtained access from Meta, using QLoRA. We apply 4-bit quantization, which allows this 7-billion-parameter model to be trained on a free Google Colab instance.
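
To see why 4-bit quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at different precisions. It assumes a round 7 billion parameters and ignores activations, optimizer state, and quantization constants, so the real footprint is somewhat higher. The helper name `weight_memory_gb` is only illustrative.

```python
# Rough weight-only memory estimate for a 7-billion-parameter model.
# Assumption: a round 7e9 parameters; the real checkpoint differs slightly.
N_PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16 / bfloat16": 2.0,
    "4-bit": 0.5,
}


def weight_memory_gb(n_params, bytes_per_param):
    """Return the memory needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>20}: {weight_memory_gb(N_PARAMS, size):5.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB in half precision to about 3.5 GB in 4-bit form.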

We rely heavily on the Google Colab notebooks and tutorials provided by Hugging Face: https://huggingface.co/blog/4bit-transformers-bitsandbytes

Apart from that, we drew on several tutorials, papers, and YouTube videos:



1. [Fine-tuning Alpaca and LLaMA: Training on a Custom Dataset](https://www.mlexpert.io/machine-learning/tutorials/alpaca-fine-tuning#user-content-fn-6)
2. [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685.pdf)
3. [How to Fine-Tune Open-Source LLMs Locally Using QLoRA!](https://youtu.be/2bkrL2ZcOiM)
4. [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/pdf/2305.14314.pdf)
5. [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
6. [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)







## Installations

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

## Imports

To download the Llama 2 models from the Hugging Face Hub, we first need to log in with an access token.

In [None]:
from huggingface_hub import login
login()


In [None]:
GDRIVEPATH = "/content/drive/MyDrive/TU/Sem 4/NLP"

In [None]:
import pandas as pd

import torch
import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import BitsAndBytesConfig       # For quantization
from peft import prepare_model_for_kbit_training

from peft import LoraConfig                       # For LORA
from peft import get_peft_model

from datasets import Dataset, load_dataset, DatasetDict

## Dataset

In [None]:
df = pd.read_csv(f"{GDRIVEPATH}/data/nyt_train.csv")

df = df[["content"]]
display(df)

data = Dataset.from_pandas(df)

| | content |
|---|---|
| 0 | WASHINGTON — Three years ago, President Barack... |
| 1 | When Honda Motor Company said two months ago t... |
| 2 | WASHINGTON — Four months after a historic acco... |
| 3 | WASHINGTON — Within hours of opening an invest... |
| 4 | WASHINGTON — The F.B.I. director, James B. Com... |
| ... | ... |
| 8580 | Senator Marco Rubio of Florida won the Puerto ... |
| 8581 | EXETER, N.H. — Eight years after aggressively ... |
| 8582 | PRINCETON, Ore. — The armed occupation of a wi... |
| 8583 | WASHINGTON — As North Korea’s reclusive ruler,... |


## Parameters

In [None]:
TRAINING_MAX_STEPS = 300



LORAPARAM_RANK = 8
LORAPARAM_ALPHA = 32
LORAPARAM_DROPOUT = 0.05
LORAPARAM_BIAS = "none"

## Load the Model

In [None]:
base_model_id = "meta-llama/Llama-2-7b-hf"


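bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the linear-layer weights in 4 bits
    bnb_4bit_use_double_quant=True,         # nested quantization: also quantize the quantization constants to save memory
    bnb_4bit_quant_type="nf4",              # NF4 (4-bit NormalFloat), designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for computation after dequantization
)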

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
                base_model_id,
                quantization_config=bnb_config,
                device_map={"":0}
              )
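
As rough intuition for what storing weights in 4 bits means, here is a toy blockwise absmax quantizer in plain Python. This is a simplified sketch and not the bitsandbytes implementation: it uses evenly spaced integer codes in [-7, 7], whereas NF4 uses a non-uniform set of levels. The function names `quantize_block` and `dequantize_block` are hypothetical. The one floating-point `scale` kept per block is the kind of quantization constant that double quantization compresses further.

```python
def quantize_block(block):
    """Map a block of floats to integer codes in [-7, 7] plus one float scale."""
    # The largest magnitude in the block sets the scale (fall back to 1.0 for an all-zero block).
    scale = max(abs(x) for x in block) or 1.0
    codes = [round(x / scale * 7) for x in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes and the block scale."""
    return [c / 7 * scale for c in codes]


# Toy "weights" for one block (illustrative values only).
weights = [0.12, -0.50, 0.33, 0.01, -0.27, 0.49, -0.05, 0.20]

codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("scale:   ", scale)
print("max error:", max_err)
```

The reconstruction error per weight is bounded by `scale / 14`, which is why quantizing in small blocks, each with its own scale, keeps the error low.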



In [None]:
data = data.map(
    lambda samples: tokenizer(samples["content"], truncation=True, max_length=512),
    batched=True
)


## Preprocessing

In [None]:
base_model.gradient_checkpointing_enable()                  # Recompute activations during the backward pass instead of storing them: less memory, more compute
base_model = prepare_model_for_kbit_training(base_model)      # Prepare the quantized model for k-bit (here 4-bit) training

### Investigation

With LoRA, we do not train all of the model's parameters. The cells below measure what percentage of the original 7 billion parameters are actually trainable once the LoRA adapters are attached.
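
For intuition, LoRA as described in the paper keeps a weight matrix `W` frozen and learns a low-rank correction, so the layer computes `W x + (alpha / r) * B (A x)`, with `B` starting at zero. The sketch below illustrates this with tiny plain-Python matrices. The sizes, the helper names `matvec` and `lora_forward`, and the omission of dropout are simplifying assumptions; this is not how PEFT implements it internally.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy sizes; the notebook itself uses r=8 and alpha=32 on much wider projections.
d, r, alpha = 6, 2, 8
random.seed(0)

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen base weight, d x d
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                   # trainable, d x r, starts at zero


def lora_forward(x):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


x = [1.0] * d
print("same as base model at start:", lora_forward(x) == matvec(W, x))
print("trainable entries:", r * d + d * r, "vs frozen entries:", d * d)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the base layer, and only `A` and `B` receive gradients during training.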

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} \n all params: {all_param} \n trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

# https://huggingface.co/docs/peft/conceptual_guides/lora#common-lora-parameters-in-peft
config = LoraConfig(
    r=LORAPARAM_RANK,                                    # rank of the update matrices
    lora_alpha=LORAPARAM_ALPHA,                          # LoRA Scaling factor
    target_modules=["q_proj", "v_proj"],                 # attention query/value projections get trainable low-rank update matrices; their original weights stay frozen
    lora_dropout=LORAPARAM_DROPOUT,
    bias=LORAPARAM_BIAS,                                 # whether to train the bias params
    task_type="CAUSAL_LM"
)

adapter_model = get_peft_model(base_model, config)
print_trainable_parameters(adapter_model)

trainable params: 4194304 
 all params: 3504607232 
 trainable%: 0.11967971650867153


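As we can see, only about `0.12%` of the counted parameters are trainable (the printed value is already a percentage), which greatly reduces the training cost. Note that the reported total of roughly 3.5 billion is about half of the model's 7 billion parameters, most likely because the 4-bit weights are stored packed, two per element, so `numel()` undercounts them. Measured against all 7 billion parameters, the trainable share is closer to `0.06%`.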
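
The printed counts can be reproduced by hand. The sketch below assumes the public Llama-2-7B configuration (hidden size 4096, intermediate size 11008, 32 layers, vocabulary size 32000) and assumes that each quantized linear weight is stored with two 4-bit values per element, so that `numel()` reports half of its true size. Both are assumptions on our part rather than something the notebook verifies.

```python
# Assumed Llama-2-7B architecture figures (from the public model configuration).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000
RANK = 8  # LORAPARAM_RANK in this notebook

# LoRA: q_proj and v_proj in every layer each gain A (RANK x HIDDEN) and B (HIDDEN x RANK).
lora = LAYERS * 2 * (RANK * HIDDEN + HIDDEN * RANK)

# Linear weights inside the decoder layers (the ones stored in 4 bits).
attention = 4 * HIDDEN * HIDDEN       # q, k, v, o projections
mlp = 3 * HIDDEN * INTERMEDIATE       # gate, up, down projections
quantized = LAYERS * (attention + mlp)

# Weights left unquantized: input embeddings, output head, and the norm scales.
unquantized = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN

true_total = quantized + unquantized + lora
counted_total = quantized // 2 + unquantized + lora   # packed weights count as half

print("trainable params:", lora)
print("counted total:   ", counted_total)
print("true total:      ", true_total)
print("trainable % of counted total:", 100 * lora / counted_total)
print("trainable % of true total:   ", 100 * lora / true_total)
```

Under these assumptions the sketch reproduces both numbers printed above (4194304 trainable and 3504607232 in total), and it puts the trainable share of the full model at roughly 0.06%.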

## Training

In [None]:
# Llama 2's tokenizer defines no padding token; reuse the EOS token so batches can be padded
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=adapter_model,
    train_dataset=data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=TRAINING_MAX_STEPS,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
adapter_model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.6465
2,1.7389
3,1.8117
4,1.8475
5,1.6636
6,1.7836
7,1.6908
8,1.8814
9,1.7838
10,1.6611


TrainOutput(global_step=300, training_loss=1.7232818460464479, metrics={'train_runtime': 559.5514, 'train_samples_per_second': 2.145, 'train_steps_per_second': 0.536, 'total_flos': 1.2414117117616128e+16, 'train_loss': 1.7232818460464479, 'epoch': 0.14})
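
The throughput and epoch figures in this output follow directly from the training arguments. The short check below assumes a training set of 8585 rows, as in our run, and takes the runtime from the `TrainOutput` above.

```python
max_steps = 300           # TRAINING_MAX_STEPS
per_device_batch = 1      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
num_examples = 8585       # assumed size of the training set in our run
train_runtime = 559.5514  # seconds, from the TrainOutput above

# Each optimizer step consumes per_device_batch * grad_accum examples.
samples_seen = max_steps * per_device_batch * grad_accum

print("examples seen:     ", samples_seen)
print("fraction of epoch: ", round(samples_seen / num_examples, 2))
print("samples per second:", round(samples_seen / train_runtime, 3))
print("steps per second:  ", round(max_steps / train_runtime, 3))
```

With these settings the model sees only 1200 articles, about 14% of one pass over the data, so this run is a short demonstration rather than a full fine-tune.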

In [None]:
adapter_model.config.use_cache = True

## Save The Model

To save the adapter, we push it to the Hugging Face Hub. Only the LoRA adapter weights are uploaded (about 17 MB), not the base model.

In [None]:
adapter_model_id = "llama2_cssnlp"

# Push the model to the Hugging Face Hub
adapter_model.push_to_hub(f"sudhangshankar/{adapter_model_id}", use_auth_token=True)


CommitInfo(commit_url='https://huggingface.co/sudhangshankar/llama2_cssnlp/commit/6230551299daf374015d22c6a7148ee76927414a', commit_message='Upload model', commit_description='', oid='6230551299daf374015d22c6a7148ee76927414a', pr_url=None, pr_revision=None, pr_num=None)
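
To use the adapter later, one option is to reload the quantized base model and attach the adapter from the Hub with `PeftModel.from_pretrained`. The sketch below wraps this in a hypothetical helper, `load_finetuned_model`, which is our own name. It assumes the same quantization settings as above, a logged-in Hugging Face account with access to Llama 2, and an available GPU. The imports sit inside the function so that defining it has no side effects.

```python
def load_finetuned_model(
    base_model_id="meta-llama/Llama-2-7b-hf",
    adapter_repo="sudhangshankar/llama2_cssnlp",
):
    """Reload the 4-bit base model and attach the LoRA adapter from the Hub."""
    # Imported here so that merely defining the function needs no GPU or network.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import PeftModel

    # Same quantization settings as used for training in this notebook.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map={"": 0},
    )
    # Wrap the base model with the adapter weights stored in the Hub repository.
    return PeftModel.from_pretrained(base, adapter_repo)
```

Calling `load_finetuned_model()` downloads the base checkpoint again if it is not cached, so it is best run in a fresh session with a GPU attached.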