<a href="https://colab.research.google.com/github/rawatnikhil857/knightRiders-hackon/blob/main/LoRA_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LoRA Implementation

### Install Dependencies
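Libraries used: bitsandbytes, datasets, accelerate, loralib, and Hugging Face's PEFT and Transformers (the latter two installed from their GitHub repositories).
PEFT (Parameter-Efficient Fine-Tuning) provides LoRA (Low-Rank Adaptation), an efficient way to fine-tune large language models: the pretrained weights stay frozen and only small low-rank update matrices are trained.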
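To make the idea concrete, here is a minimal NumPy sketch of a LoRA-adapted linear layer. It is an illustration only, not PEFT's implementation, and the class name `LoRALinear` is hypothetical. The frozen weight `W` is left untouched and the layer adds a scaled low-rank update, so the output is `W x + b + (alpha / r) * B A x`. Following the LoRA paper's description, `A` is initialised randomly and `B` to zero (PEFT's own initialisation details may differ), so the adapted layer starts out identical to the base layer and only `r * (d_in + d_out)` numbers are trainable.

```python
import numpy as np


class LoRALinear:
    """Sketch of LoRA: y = W x + b + (alpha / r) * B (A x), with W and b frozen."""

    def __init__(self, weight, bias, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                              # frozen pretrained weight, (d_out, d_in)
        self.bias = bias                                  # frozen pretrained bias, (d_out,)
        self.A = rng.normal(0.0, 0.01, size=(r, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, r))                     # trainable up-projection, starts at zero
        self.scaling = alpha / r

    def __call__(self, x):
        # x: (batch, d_in) -> output: (batch, d_out)
        base = x @ self.weight.T + self.bias
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def trainable_parameters(self):
        # Only A and B would receive gradients
        return self.A.size + self.B.size

    def merged_weight(self):
        # The update can be folded into W for inference: W' = W + (alpha / r) * B A
        return self.weight + self.scaling * (self.B @ self.A)
```

`merged_weight()` shows why a trained LoRA update adds no extra inference cost once merged: the low-rank product collapses back into a single matrix of the original shape.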

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git


#### Confirm CUDA

In [None]:
import torch
torch.cuda.is_available()

True

#### Load Base Model

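BLOOM-1b7 (`bigscience/bloom-1b7`) is used as the base LLM to fine-tune. It is a Transformer-based causal language model with about 1.7 billion parameters, loaded here in float16 to halve its memory footprint compared with float32. The tokenizer comes from `bigscience/tokenizer`, which is shared by all BLOOM models.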

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only the first GPU
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

# Load BLOOM-1b7 in half precision; device_map='auto' lets accelerate place the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    torch_dtype=torch.float16,
    device_map='auto',
)

# Tokenizer shared by all BLOOM models
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/227 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

##### View Model Summary
The model summary tells us which modules LoRA should target; module names differ between base models. For bloom-1b7 we target `query_key_value`, the fused linear layer in each attention block that projects the hidden state to queries, keys and values at once (hence `out_features=6144 = 3 × 2048`).
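Instead of reading the printout by eye, candidate names can also be collected programmatically. The sketch below is a hypothetical helper, `candidate_target_modules`, that reduces fully qualified module names to their unique last components, which is the form of name passed to `target_modules` (a plain list is matched against the ends of module names). In this notebook its input could come from `[name for name, module in model.named_modules() if isinstance(module, nn.Linear)]`; the example names below are copied from the BLOOM summary.

```python
def candidate_target_modules(qualified_names):
    """Return the sorted unique leaf names (text after the last '.') of module names."""
    # Skip the empty name that named_modules() yields for the root module
    return sorted({name.rsplit(".", 1)[-1] for name in qualified_names if name})


example_names = [
    "",
    "transformer.h.0.self_attention.query_key_value",
    "transformer.h.0.self_attention.dense",
    "transformer.h.0.mlp.dense_h_to_4h",
    "transformer.h.0.mlp.dense_4h_to_h",
    "transformer.h.1.self_attention.query_key_value",
    "lm_head",
]
print(candidate_target_modules(example_names))
# ['dense', 'dense_4h_to_h', 'dense_h_to_4h', 'lm_head', 'query_key_value']
```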

In [None]:
print(model)

BloomForCausalLM(
  (transformer): BloomModel(
    (word_embeddings): Embedding(250880, 2048)
    (word_embeddings_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
    (h): ModuleList(
      (0-23): 24 x BloomBlock(
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (self_attention): BloomAttention(
          (query_key_value): Linear(in_features=2048, out_features=6144, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (post_attention_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): BloomMLP(
          (dense_h_to_4h): Linear(in_features=2048, out_features=8192, bias=True)
          (gelu_impl): BloomGelu()
          (dense_4h_to_h): Linear(in_features=8192, out_features=2048, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (

In [None]:
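# Pre-processing to improve training stability
for param in model.parameters():
    param.requires_grad = False  # freeze the base model; only the LoRA adapters will be trained
    if param.ndim == 1:
        # 1-D parameters (LayerNorm weights and all biases) are kept in fp32 for numerical stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory
model.enable_input_require_grads()     # make embedding outputs require grad so checkpointing works with a frozen model

class CastOutputToFloat(nn.Sequential):
    """Wraps lm_head so that the logits are returned in fp32."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)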
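The cell above keeps 1-D parameters and the logits in float32. A quick NumPy check illustrates why float16 is fragile: it carries only about three significant decimal digits, so a small update to a value near 1.0 is rounded away, and its largest finite value is 65504, so exponentials of moderately large logits overflow. This illustrates float16 limits in general and does not reproduce the training run; `survives_update` is a hypothetical helper.

```python
import numpy as np


def survives_update(value, update, dtype):
    """True if adding `update` to `value` changes the result when both are stored in `dtype`."""
    return bool(dtype(value) + dtype(update) != dtype(value))


# Near 1.0 the gap between adjacent float16 numbers is ~0.001, so +1e-4 is lost
print(survives_update(1.0, 1e-4, np.float16))  # False
print(survives_update(1.0, 1e-4, np.float32))  # True

# The largest finite float16 value is 65504, so exp(12) (about 162755) overflows
print(float(np.finfo(np.float16).max))  # 65504.0
with np.errstate(over="ignore"):
    print(np.exp(np.float16(12.0)))  # inf
    print(np.exp(np.float32(12.0)))  # finite in float32
```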

#### Helper Function

In [None]:
# Count trainable vs. total parameters to show how few parameters LoRA trains
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

#### Obtain LoRA Model


In [None]:
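from peft import LoraConfig, get_peft_model

# LoraConfig sets the rank of the update matrices and which modules receive adapters
config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices A and B
    lora_alpha=16,                       # updates are scaled by lora_alpha / r = 2
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,                   # dropout applied to the input of the LoRA branch
    bias="none",                         # keep all bias terms frozen
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)  # wrap the base model with LoRA adapters
print_trainable_parameters(model)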

trainable params: 1572864 || all params: 1723981824 || trainable%: 0.09123437254985815


With LoRA, only 1,572,864 of the model's 1,723,981,824 parameters (about 0.09%) need to be trained.
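The trainable-parameter count can be checked by hand. Each adapted `query_key_value` layer gets an `A` matrix of shape `r × 2048` and a `B` matrix of shape `6144 × r`, and there are 24 BloomBlocks, so the count is `24 · 8 · (2048 + 6144)`. For comparison, fully fine-tuning those same projection weights would mean updating `24 · 2048 · 6144` values.

```python
# Values read from the model summary and the LoraConfig above
n_layers = 24             # (0-23): 24 x BloomBlock
d_in, d_out = 2048, 6144  # query_key_value: Linear(in_features=2048, out_features=6144)
r = 8                     # LoraConfig(r=8)

lora_params = n_layers * r * (d_in + d_out)  # A is r x d_in, B is d_out x r
full_qkv_params = n_layers * d_in * d_out    # the frozen query_key_value weights themselves
total_params = 1723981824                    # "all params" printed by print_trainable_parameters

print(lora_params)                       # 1572864
print(full_qkv_params)                   # 301989888
print(100 * lora_params / total_params)  # 0.09123437254985815
```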

#### Load Sample Dataset


In [None]:
from datasets import load_dataset
qa_dataset = load_dataset('csv', data_files='Context-Question-AnswerDatabase.csv')
type(qa_dataset)

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

datasets.dataset_dict.DatasetDict

Below is the prompt template used to fine-tune the model. By training on many prompts in this format, the model learns the structure of the incoming query (a context followed by a question) and the structure of the answer it is expected to generate, terminated by the `</s>` end-of-sequence token.

```
### CONTEXT
{context}

### QUESTION
{question}

### ANSWER
{answer}</s>
```
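For inference, the same layout can be reused, stopping right after the `### ANSWER` header so that the model generates the answer itself. The helpers below are a hypothetical sketch (`build_inference_prompt` and `extract_answer` are not part of the original notebook); the returned prompt would be tokenized and passed to `model.generate`, and the decoded output trimmed at the first `</s>`. The context and question in the example are made up.

```python
PROMPT_PREFIX = "### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n"


def build_inference_prompt(context, question):
    """Training-prompt layout without the answer, so the model continues after '### ANSWER'."""
    return PROMPT_PREFIX.format(context=context, question=question)


def extract_answer(generated_text):
    """Keep the text after the ANSWER header, cut at the end-of-sequence marker if present."""
    answer = generated_text.split("### ANSWER\n", 1)[-1]
    return answer.split("</s>", 1)[0].strip()


prompt = build_inference_prompt("Inception was directed by Christopher Nolan.", "Who directed Inception?")
print(extract_answer(prompt + "Christopher Nolan</s>"))  # Christopher Nolan
```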

In [None]:
# Format one CSV row with the prompt template
def create_prompt(context, question, answer):
    prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer}</s>"
    return prompt_template

# Tokenize every row (map is called once per row, so `sample` is a single example)
mapped_qa_dataset = qa_dataset.map(
    lambda sample: tokenizer(create_prompt(sample['Context'], sample['Question'], sample['Answer']))
)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
mapped_qa_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'Context', 'Question', 'Answer', 'input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

#### Train LoRA

In [None]:
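import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_qa_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,  # effective batch size of 1 x 4 = 4 examples per optimizer step
        warmup_steps=100,               # equal to max_steps, so the learning rate is still warming up when training ends
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    # mlm=False selects the causal LM objective: labels are copied from input_ids, with padding masked to -100
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing; re-enable it for inference!
trainer.train()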

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|-----:|--------------:|
| 1 | 1.4716 |
| 2 | 1.6636 |
| 3 | 1.4315 |
| 4 | 1.4775 |
| 5 | 1.384 |
| 6 | 1.6359 |
| 7 | 1.3372 |
| 8 | 1.4705 |
| 9 | 1.5183 |
| 10 | 1.395 |


TrainOutput(global_step=100, training_loss=1.060699276328087, metrics={'train_runtime': 493.102, 'train_samples_per_second': 0.811, 'train_steps_per_second': 0.203, 'total_flos': 2788108539985920.0, 'train_loss': 1.060699276328087, 'epoch': 0.4})
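Two numbers in this run follow directly from the training arguments. Each optimizer step consumes `per_device_train_batch_size × gradient_accumulation_steps = 4` examples, so 100 steps cover 400 of the 1000 rows, which is the `epoch: 0.4` in the output. And because `warmup_steps` equals `max_steps`, the default linear schedule never leaves its warmup phase: the learning rate climbs from 0 toward 1e-3 for the whole run and never decays. The function below is a plain-Python sketch that mirrors the formula of `transformers.get_linear_schedule_with_warmup` (the Trainer's default `linear` scheduler); the name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, base_lr=1e-3, warmup_steps=100, max_steps=100):
    """Learning rate at an optimizer step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


# Settings from the TrainingArguments above: warmup_steps == max_steps == 100
print([round(linear_warmup_lr(s), 6) for s in (0, 25, 50, 99)])  # [0.0, 0.00025, 0.0005, 0.00099]

# With a shorter warmup (e.g. 10 steps) the LR would peak early and then decay
print(round(linear_warmup_lr(55, warmup_steps=10), 6))  # 0.0005

# Fraction of the 1000-row dataset seen in 100 steps of 1 x 4 examples
epochs = 1 * 4 * 100 / 1000
print(epochs)  # 0.4
```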

The trained adapter can be pushed to the Hugging Face Hub after logging in:

In [None]:
HUGGING_FACE_USER_NAME = "rawatnikhil857"
from huggingface_hub import notebook_login
notebook_login()


In [None]:
model_name = "movieRec-bloom-1b7"
model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", use_auth_token=True)



adapter_model.bin:   0%|          | 0.00/6.31M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/rawatnikhil857/movieRec-bloom-1b7-v2/commit/9ca1dfdd6f6df6e79322eefc5f25d075570758cc', commit_message='Upload model', commit_description='', oid='9ca1dfdd6f6df6e79322eefc5f25d075570758cc', pr_url=None, pr_revision=None, pr_num=None)

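The uploaded artifact holds only the LoRA adapter (`adapter_model.bin`, about 6.31 MB), so the fine-tuned model can be pulled back from the Hugging Face Hub and applied on top of the original `bigscience/bloom-1b7` weights without retraining or re-uploading the full base model.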
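The file sizes are consistent with a quick estimate. Assuming the adapter weights are saved in float32 (4 bytes each), 1,572,864 parameters take about 6.29 MB, with the rest of the 6.31 MB file plausibly being serialization overhead; the frozen base model at 2 bytes per float16 parameter comes to about 3.44 GB, matching the `model.safetensors` download size.

```python
# Parameter counts printed by print_trainable_parameters above
lora_params = 1_572_864
base_params = 1_723_981_824 - lora_params  # frozen BLOOM-1b7 weights

adapter_bytes = lora_params * 4  # assumption: adapter tensors stored as float32
base_bytes = base_params * 2     # base model stored as float16

print(adapter_bytes / 1e6)                # 6.291456 (MB)
print(base_bytes / 1e9)                   # 3.44481792 (GB)
print(round(base_bytes / adapter_bytes))  # 548 (base checkpoint is ~548x larger)
```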
