<a href="https://colab.research.google.com/github/rhqrhq/Adversarial_Examples_Papers/blob/main/LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LoRA
based on a combination of this:
https://huggingface.co/docs/peft/task_guides/semantic_segmentation_lora
and this:

https://www.youtube.com/watch?v=iYr1xZn26R8

https://github.com/huggingface/peft/issues/493


Recommended runtime: v100 high RAM. A100 high RAM if you use a larger BLOOM model.

Note, I generally prioritize ease of comparing a model and it's fine tuned counterpart over inference time.

---

## Downloading Dependencies
- **bitsandbytes:** for representing models using smaller datatypes, saving on memory.
- **datasets:** for downloading datasets
- **accelerate:** required dependency for machine learning interoperability
- **loralib:** LoRA implementation
- **peft:** a general "parameter efficient fine tuning" module, our interface for LoRA
- **transformers:** for downloading and using pre-trained transformers from huggingface.

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.1/59.1 MB[0m [31m37.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m521.0/521.0 kB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for peft (pyproject.toml) ... [?25l[?25hdone
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 5.1.2 requires transformers<5.0.0,>=4.41.0, but you have transformer

## Loading Pre-Trained Model
in this model we're using the BLOOM, decoder only, causal language model. This is a permissively source language model trained on a variety of data.

We'll be using the 560m parameter version to save on GPU memory, but if you use an A100 instance you should be able to run the 3b parameter version. While not thoroughly tested, all code should work for any flavor of BLOOM

In [2]:
"""Importing dependencies and downloading pre-trained bloom model
"""

import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

#loading model
model = AutoModelForCausalLM.from_pretrained(
    # "bigscience/bloom-3b",
    # "bigscience/bloom-1b1",
    "bigscience/bloom-560m",
    torch_dtype=torch.float16,
    device_map='auto',
)

#loading tokenizer for this model (which turns text into an input for the model)
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/693 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Loading weights:   0%|          | 0/293 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/227 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

## Setting up LoRA
- **r:** the rank of the A and B matrices
- **lora_alpha:** this is a pretty controversial parameter. A lot of people hava a lot of ideas about it. You can consider it a scaling factor, and by default it should be equal to `r`, as far as I understand.
- **target_modules:** the portions of the model we want to optimize with LoRA. the BLOOM module has parameters named `query_key_value` which we want to optimize.
- **lora_dropout:** dropout is a technique which hides inputs to suppress the model from overfitting (called regularization). This is a probability of being hidden.
- **bias:** neural networks typically have two paramet per connection, a "weight" and a "bias". We're only training weights in this example.
- **task_type:** not super necessary, used in the superclass `PeftConfig`. Setting to `CAUSAL_LM` because the specific language model we're using is "causal".

In [3]:
"""Setting up LoRA using parameter efficient fine tuning
"""

from peft import LoraConfig, get_peft_model

#defining how LoRA will work in this particular example
config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

#this actually overwrites the model in memory, so
#the rename is only for ledgibility.
peft_model = get_peft_model(model, config)

## Printing Trainable Parameter Difference

In [None]:
"""Comparing parameters before and after LoRA
"""

trainable_params = 0
all_param = 0

#iterating over all parameters
for _, param in peft_model.named_parameters():
    #adding parameters to total
    all_param += param.numel()
    #adding parameters to trainable if they require a graident
    if param.requires_grad:
        trainable_params += param.numel()

#printing results
print(f"trainable params: {trainable_params}")
print(f"all params: {all_param}")
print(f"trainable: {100 * trainable_params / all_param:.2f}%")

trainable params: 786432
all params: 560001024
trainable: 0.14%


## Loading Dataset
this is the stanford question answering dataset (SQUAD), which we'll use to fine tune BLOOM to improve performance on question answering.

In [None]:
"""Loading SQUAD dataset
"""

from datasets import load_dataset
qa_dataset = load_dataset("squad_v2")

## Re-Formatting
We're going to get the LLM to learn a specific format (a common use of fine tuning).

```
**CONTEXT:**
{context}

**QUESTION:**
{question}

**ANSWER:**
{answer}</s>
```

So, we'll reformat our SQUAD dataset to respect that format.

In [None]:
"""Reformatting SQUAD to respect our defined structure
"""

#defining a function for reformatting
def create_prompt(context, question, answer):
  if len(answer["text"]) < 1:
    answer = "Cannot Find Answer"
  else:
    answer = answer["text"][0]
  prompt_template = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}</s>"
  return prompt_template

#applying the reformatting function to the entire dataset
mapped_qa_dataset = qa_dataset.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))

## Training our LoRA model on SQUAD
Updating the decomposed matrices to improve the model on question answering, and teach it the desired structure.

In [None]:
"""Fine Tuning
This code is largly co-opted. In the absence of a rigid validation
procedure, the best practice is to just copy a successful tutorial or,
better yet, directly from the documentation.
"""

import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=mapped_qa_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
peft_model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,3.4549
2,3.3474
3,3.3741
4,3.5238
5,3.4935
6,3.4134
7,3.2365
8,3.4875
9,3.5387
10,3.469


TrainOutput(global_step=100, training_loss=3.054349970817566, metrics={'train_runtime': 53.5128, 'train_samples_per_second': 29.899, 'train_steps_per_second': 1.869, 'total_flos': 740744642985984.0, 'train_loss': 3.054349970817566, 'epoch': 0.01})

## Saving Locally
saving our LoRA fine tune results.

In [None]:
"""Saving the LoRA fine tuning locally
"""
model_id = "BLOOM-560m-LoRA"
peft_model.save_pretrained(model_id)

## Checking File Size
Compare this to the size of the initial model download to get an idea of the memory savings.

In [None]:
!ls -lh {model_id}

total 3.1M
-rw-r--r-- 1 root root  482 Nov  6 14:17 adapter_config.json
-rw-r--r-- 1 root root 3.1M Nov  6 14:17 adapter_model.bin
-rw-r--r-- 1 root root 5.3K Nov  6 14:17 README.md


## Testing

In [None]:
"""Helper Function for Comparing Results
"""

from IPython.display import display, Markdown

def make_inference(context, question):

    #turn the input into tokens
    batch = tokenizer(f"**CONTEXT:**\n{context}\n\n**QUESTION:**\n{question}\n\n**ANSWER:**\n", return_tensors='pt', return_token_type_ids=False)
    #move the tokens onto the GPU, for inference
    batch = batch.to(device='cuda')

    #make an inference with both the fine tuned model and the raw model
    with torch.cuda.amp.autocast():
        #I think inference time would be faster if these were applied,
        #but the fact that LoRA is not applied allows me to experiment
        #with before and after fine tuning simultaniously

        #raw model
        peft_model.disable_adapter_layers()
        output_tokens_raw = model.generate(**batch, max_new_tokens=200)

        #LoRA model
        peft_model.enable_adapter_layers()
        output_tokens_qa = peft_model.generate(**batch, max_new_tokens=200)

    #display results
    display(Markdown("# Raw Model\n"))
    display(Markdown((tokenizer.decode(output_tokens_raw[0], skip_special_tokens=True))))
    display(Markdown("\n# QA Model\n"))
    display(Markdown((tokenizer.decode(output_tokens_qa[0], skip_special_tokens=True))))

In [None]:
context = "You are a monster, and you eat yellow legos."
question = "What is the best food?"

make_inference(context, question)

# Raw Model


**CONTEXT:**
You are a monster, and you eat yellow legos.

**QUESTION:**
What is the best food?

**ANSWER:**
The best food is the one that is not poisonous, but that is
not poisonous at all.

**QUESTION:**
What is the best food?

**ANSWER:**
The best food is the one that is not poisonous, but that is
not poisonous at all.

**QUESTION:**
What is the best food?

**ANSWER:**
The best food is the one that is not poisonous, but that is
not poisonous at all.

**QUESTION:**
What is the best food?

**ANSWER:**
The best food is the one that is not poisonous, but that is
not poisonous at all.

**QUESTION:**
What is the best food?

**ANSWER:**
The best food is the one that is not poisonous, but that is
not poisonous at all.

**QUESTION:**
What is the best food?

**ANSWER:**



# QA Model


**CONTEXT:**
You are a monster, and you eat yellow legos.

**QUESTION:**
What is the best food?

**ANSWER:**
yellow legos

In [None]:
context = "you are a math wizard"
question = "what is 1+1 equal to?"

make_inference(context, question)

# Raw Model


**CONTEXT:**
you are a math wizard

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal to 1.0

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 is equal


# QA Model


**CONTEXT:**
you are a math wizard

**QUESTION:**
what is 1+1 equal to?

**ANSWER:**
1+1 = 1

In [None]:
context = "Answer the riddle"
question = "What gets bigger the more you take away?"

make_inference(context, question)

# Raw Model


**CONTEXT:**
Answer the riddle

**QUESTION:**
What gets bigger the more you take away?

**ANSWER:**
The answer is that the more you take away, the more you get away.


# QA Model


**CONTEXT:**
Answer the riddle

**QUESTION:**
What gets bigger the more you take away?

**ANSWER:**
Cannot Find Answer