<a href="https://colab.research.google.com/github/stevoslates/LoRA/blob/main/LoRa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLM using QLoRA.

## Loading the Model (inference only; this copy is not used for fine-tuning)

In [None]:
!pip install -U bitsandbytes

In [None]:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_3b_v2'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map='cuda',
)


## Model Understanding

In [None]:
print(model.modules)

<bound method Module.modules of LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 3200, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (k_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (v_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (o_proj): Linear(in_features=3200, out_features=3200, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3200, out_features=8640, bias=False)
          (up_proj): Linear(in_features=3200, out_features=8640, bias=False)
          (down_proj): Linear(in_features=8640, out_features=3200, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )

In [None]:
parameters, trainable = 0, 0

for _, p in model.named_parameters():
  parameters += p.numel()
  if p.requires_grad:
    trainable += p.numel()

print(f"parameters: {parameters:,}, trainable: {trainable:,}")

parameters: 3,426,473,600, trainable: 3,426,473,600
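
As a sanity check, the total can be reproduced by hand from the layer shapes in the printout above. This is only a sketch, and it assumes two things the truncated printout does not show: an `lm_head` of shape 3200 x 32000 that is not tied to the embedding, and a final RMSNorm with 3200 weights.

```python
# Shapes read off the printed architecture
vocab, hidden, mlp_hidden, n_layers = 32000, 3200, 8640, 26

embed = vocab * hidden                 # embed_tokens
attention = 4 * hidden * hidden        # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * hidden * mlp_hidden          # gate_proj, up_proj, down_proj (no bias)
norms = 2 * hidden                     # input_layernorm + post_attention_layernorm
per_layer = attention + mlp + norms

# Assumptions (not visible in the truncated printout):
final_norm = hidden                    # one RMSNorm after the last decoder layer
lm_head = hidden * vocab               # output projection, not tied to embed_tokens

total = embed + n_layers * per_layer + final_norm + lm_head
print(f"per layer: {per_layer:,}")
print(f"total: {total:,}")
```

Under those assumptions the total comes to 3,426,473,600, which matches the count printed by the cell above.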


## Let's see all the linear layers that we can choose to fine-tune by adding a LoRA adapter to them.

In [None]:
import re

# The string form of model.modules contains the module tree printed above
model_modules = str(model.modules)
# Capture the name in parentheses that precedes each "Linear" entry
pattern = r'\((\w+)\): Linear'
linear_layer_names = re.findall(pattern, model_modules)

# De-duplicate: the same layer names repeat in every decoder block
target_modules = list(set(linear_layer_names))
target_modules

['gate_proj',
 'lm_head',
 'v_proj',
 'q_proj',
 'o_proj',
 'down_proj',
 'k_proj',
 'up_proj']
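
To make it clearer what the regular expression picks up, here is a small self-contained sketch that runs the same pattern over a few lines copied from the printed module tree. The `sample` string and the use of `sorted` (to get a stable order between runs) are illustrative choices, not part of the original cell.

```python
import re

# A short excerpt of the printed module tree, copied from the output above
sample = """
(q_proj): Linear(in_features=3200, out_features=3200, bias=False)
(k_proj): Linear(in_features=3200, out_features=3200, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
(gate_proj): Linear(in_features=3200, out_features=8640, bias=False)
(act_fn): SiLU()
(q_proj): Linear(in_features=3200, out_features=3200, bias=False)
"""

pattern = r'\((\w+)\): Linear'
matches = re.findall(pattern, sample)

# sorted(set(...)) removes duplicates and gives the same order on every run
unique_names = sorted(set(matches))
print(unique_names)
```

Note that the pattern only checks that the class name starts with `Linear`, so it would also match any module class whose name merely begins with that word.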

# Quick Inference Test - Pre Fine-Tuning

**Note:** The Llama model has not been fine-tuned to follow instructions, so the prompt needs to be in a format that it can understand and continue; otherwise it can generate very odd responses.

In [None]:
prompt = "Instruction: Your task is to answer the following question regarding finance.\nQuestion: Does bull/bear market actually make a difference?\nAnswer:"
#print(prompt)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=128, no_repeat_ngram_size=3, eos_token_id=tokenizer.eos_token_id)

#print(generation_output)
print("Prompt:")
print(prompt)
print("*" * 50)
print("Model Answer")
print(tokenizer.decode(generation_output[0]))
print("*" * 50)
print("Model Answer - Generated Text Only")
print(tokenizer.batch_decode(generation_output[:, input_ids.shape[1]:])[0])

Prompt:
Instruction: Your task is to answer the following question regarding finance.
Question: Does bull/bear market actually make a difference?
Answer:
**************************************************
Model Answer
<s>Instruction: Your task is to answer the following question regarding finance.
Question: Does bull/bear market actually make a difference?
Answer:
The answer is yes. The bull/ bear market is a very important factor in determining the performance of the stock market. The performance of a stock market is determined by the performance in the stock prices. The stock prices are determined by a number of factors. The most important factor is the performance and the expectations of the investors. The investors are the ones who decide the performance. The expectations of investors are determined based on the performance, the expectations and the performance are interrelated. The expectation of the investor is based on his/her knowledge of the market.
The performance of stock ma

**Model's answer**: The answer is yes. The bull/ bear market is a very important factor in determining the performance of the stock market. The performance of a stock market is determined by the performance in the stock prices. The stock prices are determined by a number of factors. The most important factor is the performance and the expectations of the investors. The investors are the ones who decide the performance. The expectations of investors are determined based on the performance, the expectations and the performance are interrelated. The expectation of the investor is based on his/her knowledge of the market. The performance of stock market depends on the expectations. The market is expected to

**Answer from Dataset**: If you know what you are doing, bear markets offer fantastic trading opportunities. I'm a futures and futures options trader, and am equally comfortable trading long or short, although I have a slight preference for the short side, in that moves are typically much quicker to the down side

**We would like the model to speak as a financial expert and give advice. Its current answer is very safe and generic, which is not a bad thing, but for the purpose of this fine-tuning we want its outputs to be more like the dataset.**


# Loading in Dataset
Using the Finance-Alpaca dataset, which combines Stanford's Alpaca (https://github.com/tatsu-lab/stanford_alpaca) and FiQA (https://sites.google.com/view/fiqa/) with another 1.3k custom pairs generated using GPT-3.5.


In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

ds = load_dataset("gbharti/finance-alpaca")

## Putting the data into a consistent format for instruction tuning.

In [None]:
import pandas as pd
df = pd.DataFrame(ds['train'])

# Define the template
template = """Below is an question about finance. Write a response that appropriately answers the question.

Q: {}

A: {}"""

df['text'] = df.apply(lambda row: template.format(row['instruction'], row['output']), axis=1)
df


| | text | instruction | input | output |
|---|---|---|---|---|
| 0 | Below is an question about finance. Write a re... | For a car, what scams can be plotted with 0% f... | | The car deal makes money 3 ways. If you pay in... |
| 1 | Below is an question about finance. Write a re... | Why does it matter if a Central Bank has a neg... | | That is kind of the point, one of the hopes is... |
| 2 | Below is an question about finance. Write a re... | Where should I be investing my money? | | Pay off your debt. As you witnessed, no "inve... |
| 3 | Below is an question about finance. Write a re... | Specifically when do options expire? | | Equity options, at least those traded in the A... |
| 4 | Below is an question about finance. Write a re... | Negative Balance from Automatic Options Exerci... | | Automatic exercisions can be extremely risky, ... |
| ... | ... | ... | ... | ... |
| 68907 | Below is an question about finance. Write a re... | Generate an example of what a resume should li... | | Jean Tremaine\n1234 Main Street, Anytown, CA 9... |
| 68908 | Below is an question about finance. Write a re... | Arrange the items given below in the order to ... | cake, me, eating | I eating cake. |
| 68909 | Below is an question about finance. Write a re... | Write an introductory paragraph about a famous... | Michelle Obama | Michelle Obama is an inspirational woman who r... |
| 68910 | Below is an question about finance. Write a re... | Generate a list of five things one should keep... | | 1. Research potential opportunities and carefu... |


In [None]:
df.iloc[0]['text']

"Below is an question about finance. Write a response that appropriately answers the question.\n\nQ: For a car, what scams can be plotted with 0% financing vs rebate?\n\nA: The car deal makes money 3 ways. If you pay in one lump payment. If the payment is greater than what they paid for the car, plus their expenses, they make a profit. They loan you the money. You make payments over months or years, if the total amount you pay is greater than what they paid for the car, plus their expenses, plus their finance expenses they make money. Of course the money takes years to come in, or they sell your loan to another business to get the money faster but in a smaller amount. You trade in a car and they sell it at a profit. Of course that new transaction could be a lump sum or a loan on the used car... They or course make money if you bring the car back for maintenance, or you buy lots of expensive dealer options. Some dealers wave two deals in front of you: get a 0% interest loan. These tend 
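
The same template can serve both training (question plus answer) and inference (question only, leaving the answer for the model to write). Below is a minimal sketch of that idea; `build_prompt` is a hypothetical helper that does not appear in the notebook, and `<answer text>` is just a placeholder. The template wording is kept exactly as in the notebook so that prompts at inference time match what the model sees during training.

```python
# Same wording as the notebook's template, kept verbatim
TEMPLATE = """Below is an question about finance. Write a response that appropriately answers the question.

Q: {}

A: {}"""


def build_prompt(question, answer=""):
    """Hypothetical helper: fill the template for training (with an answer)
    or for inference (answer left empty, so the text ends at 'A:')."""
    # rstrip() drops the trailing space left after 'A:' when there is no answer
    return TEMPLATE.format(question, answer).rstrip()


train_text = build_prompt("Specifically when do options expire?", "<answer text>")
infer_text = build_prompt("Specifically when do options expire?")
print(infer_text)
```

Building the inference prompt from the same template helps avoid small mismatches, such as extra leading newlines, between the training text and the prompt.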

## Only using the first 1000 to train, for speed.

In [None]:
train_df = df.iloc[:1000]
#test_df = df.iloc[5000:]

In [None]:
from datasets import Dataset
train_dataset = Dataset.from_pandas(train_df)

In [None]:
#drop everything but the text
train_dataset = train_dataset.remove_columns(['instruction', 'input', 'output'])

# LoRA - Using the PEFT Library
## LoRA Config:
- Target Modules: the layers we want to fine-tune with LoRA adapters (currently just the attention layers, because adapting all of the linear layers was too large for the GPU).
- r: the rank of the low-rank update matrices.
- bias: how biases are handled during fine-tuning; "none" means we don't fine-tune them and only train the low-rank matrices.
- Task type: the type of task. Other task types such as classification are available, but most use "CAUSAL_LM" for causal language modelling.
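
For reference, the update that LoRA learns for one adapted layer can be written as below. This is the standard formulation of the method rather than anything specific to this notebook: $W_0$ is the frozen pretrained weight, while $A$ and $B$ are the trainable low-rank factors.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Each adapted layer therefore adds $r(d + k)$ trainable parameters, and `lora_alpha` sets the numerator of the scaling factor $\alpha / r$ applied to the update.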

In [None]:
!pip install peft

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16, #the smaller the rank the less memory used.
    lora_alpha=16,
    target_modules = ['q_proj', 'k_proj','v_proj','o_proj'], #Just attention Layers (Query, Key, Value, Output)
    lora_dropout=0.05, #Conventional
    bias="none",
    #modules_to_save=["decode_head"],
    task_type="CAUSAL_LM",
)


## Training Args

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=5e-4,
    num_train_epochs=3,
    per_device_train_batch_size=4, #Small batch size, for memory
    save_total_limit=3,
    save_strategy="epoch",
    logging_steps=500,
    fp16=True, #faster
    remove_unused_columns=False
)
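
A quick back-of-the-envelope check shows how these settings translate into optimizer steps for the 1000-example training subset. The sketch assumes a single device and `gradient_accumulation_steps` left at its default of 1; neither is stated explicitly in the notebook.

```python
import math

num_examples = 1000          # size of the training subset used in this notebook
batch_size = 4               # per_device_train_batch_size
epochs = 3                   # num_train_epochs
logging_steps = 500

# Assumptions: one device, gradient_accumulation_steps = 1 (the default)
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which a training loss would be logged
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_at)
```

With 750 total steps and `logging_steps=500`, only one loss value is logged during the run; a smaller `logging_steps` would give a more detailed loss curve.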

## Tokenize Dataset

In [None]:
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_3b_v2'
tokenizer = LlamaTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

train_dataset = train_dataset.map(
    lambda samples: tokenizer(
        samples['text'],
        padding=True,    # Pad to the longest sequence in the batch
        truncation=True,
        return_tensors="pt"),
    batched=True,
    remove_columns=train_dataset.column_names,
)



You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message



# Quantizing Model using BitsAndBytes
**Note:** I could not get training to work on the quantized model, so the quantization config below is defined but not used. Training without it seems to run okay now that I have reduced the number of layers the LoRA adapters are applied to.

In [None]:
!pip install -U bitsandbytes

In [None]:
# Not in use: defined for reference, but not passed to from_pretrained
from transformers import BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_use_double_quant=True,
  bnb_4bit_compute_dtype=torch.bfloat16
)
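
To give a rough idea of why 4-bit quantization is attractive here, the sketch below estimates the memory needed just to hold the weights at different precisions, using the parameter count printed earlier. It is an approximation only: it ignores the quantization constants, any layers kept in higher precision, activations, gradients and optimizer state.

```python
n_params = 3_426_473_600     # parameter count printed earlier in the notebook

bits_per_param = {"fp32": 32, "fp16/bf16": 16, "4-bit (nf4)": 4}

weights_gb = {
    name: n_params * bits / 8 / 1e9   # bits -> bytes -> GB (decimal)
    for name, bits in bits_per_param.items()
}

for name, gb in weights_gb.items():
    print(f"{name:>12}: {gb:5.2f} GB")
```

By this estimate the weights alone take about 13.7 GB in fp32, 6.85 GB in 16-bit and roughly 1.7 GB in 4-bit.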

# Loading in Model

In [None]:
model = LlamaForCausalLM.from_pretrained(
    model_path, device_map='auto') #quantization_config=nf4_config,

model.gradient_checkpointing_enable()




## Create LoRA Model

In [None]:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 10,649,600 || all params: 3,437,123,200 || trainable%: 0.3098


**Note**: Only about 0.31% of the parameters are trainable, which shows clearly how many parameters LoRA saves us from having to update.
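
The trainable parameter count can be reproduced from the LoRA settings and the layer shapes. The sketch below assumes each adapted `Linear(in_features, out_features)` gains two matrices, `A` of shape `r x in_features` and `B` of shape `out_features x r`, and that nothing else is trainable (`bias="none"`); `lora_params` is a helper name introduced here for illustration.

```python
r = 16            # LoRA rank from lora_config
n_layers = 26     # decoder layers in the printed architecture
hidden = 3200     # in/out features of q_proj, k_proj, v_proj, o_proj


def lora_params(in_features, out_features, rank):
    """Parameters added by one adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


# Four attention projections per layer, all 3200 -> 3200
per_layer = 4 * lora_params(hidden, hidden, r)
trainable = n_layers * per_layer

base_params = 3_426_473_600   # base model count printed earlier
all_params = base_params + trainable
pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || all params: {all_params:,} || trainable%: {pct:.4f}")
```

This gives 10,649,600 trainable parameters out of 3,437,123,200, the same figures reported by `print_trainable_parameters()`.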

## Train

In [None]:
!pip install trl

In [None]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    max_seq_length=256,
    args=training_args,
)

trainer.train()


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss
500,2.1544


TrainOutput(global_step=750, training_loss=2.053173583984375, metrics={'train_runtime': 1600.5316, 'train_samples_per_second': 1.874, 'train_steps_per_second': 0.469, 'total_flos': 1.229312360448e+17, 'train_loss': 2.053173583984375, 'epoch': 3.0})
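
The throughput figures in `TrainOutput` follow directly from the runtime, the step count and the dataset size. A small sketch that recomputes them, assuming the 1000-example subset was seen exactly three times:

```python
train_runtime = 1600.5316    # seconds, from TrainOutput
global_step = 750            # from TrainOutput
num_examples, epochs = 1000, 3

samples_per_second = num_examples * epochs / train_runtime
steps_per_second = global_step / train_runtime
minutes = train_runtime / 60

print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s, {minutes:.1f} min")
```

The recomputed values agree with the reported `train_samples_per_second` and `train_steps_per_second`, and the whole run took a little under 27 minutes.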

In [None]:
model.save_pretrained("./outputs")

## Inference Test on Fine-Tuned Model

In [None]:
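# Fresh copy of the base model, used only for comparison at inference time
# (gradient checkpointing is a training-time feature, so it is not enabled here)
basemodel = LlamaForCausalLM.from_pretrained(
    model_path, device_map='auto') #quantization_config=nf4_config,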

In [None]:
prompt = """
Below is an question about finance. Write a response that appropriately answers the question.

Q: What's the difference between Term and Whole Life insurance??

A:"""
#print(prompt)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

generation_output_lora = model.generate(
    input_ids=input_ids, max_new_tokens=128, no_repeat_ngram_size=3, eos_token_id=tokenizer.eos_token_id)

generation_output_base = basemodel.generate(
    input_ids=input_ids, max_new_tokens=128, no_repeat_ngram_size=3, eos_token_id=tokenizer.eos_token_id)

#print(generation_output)
print("Prompt:")
print(prompt)
print("*" * 50)
print("Answer from Dataset")
print("")
print("*" * 50)
print("Base Model Answer")
print(tokenizer.batch_decode(generation_output_base[:, input_ids.shape[1]:])[0])
print("*" * 50)
print("Fine-Tuned Model Answer - Generated Text Only")
print(tokenizer.batch_decode(generation_output_lora[:, input_ids.shape[1]:])[0])



Prompt:

Below is an question about finance. Write a response that appropriately answers the question.

Q: What's the difference between Term and Whole Life insurance??

A:
**************************************************
Answer from Dataset

**************************************************
Base Model Answer
Term life insurance is a type of life insurance that provides coverage for a specific period of time. The policyholder pays a premium for the coverage, and the policy pays a death benefit if the policyholder dies during the term of the policy. Term life policies typically have a lower premium than whole life policies, but they do not provide any cash value.
Whole life insurance policies are similar to term life policies in that they provide coverage for an extended period of
time. However, whole life insurance also provides a cash value that can be used to pay premiums or to borrow against. Whole life policies also typically have higher premiums than term life
*****************

# TO DO
- Calculate ROUGE scores between the original model and the new model on the question-answering dataset.
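
As a starting point for this, here is a minimal pure-Python sketch of a ROUGE-1 style F1 score (unigram overlap). It is a simplification: established ROUGE implementations apply their own tokenization and optional stemming and also report ROUGE-2 and ROUGE-L, so this function is for illustration only. The example strings are made up and are not taken from the dataset or from model outputs.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (lowercased whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Toy strings, made up purely for illustration
reference = "term life covers a fixed period"
identical_answer = "term life covers a fixed period"
different_answer = "whole life builds cash value"

print(rouge1_f1(identical_answer, reference))
print(rouge1_f1(different_answer, reference))
```

In practice the same function would be applied to the base model's answer and the fine-tuned model's answer for each question, with the dataset answer as the reference, and the scores averaged over the evaluation set.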