<a href="https://colab.research.google.com/github/shirinyamani/mistral7b-lora-finetuning/blob/main/mistral-7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning 🤗

## Can you explain the memory requirements for running this model for inference and for training?

The short answer is: it depends on *how* you load the model (that is, in what precision) for inference and for training. Before elaborating, let's do some quick math for the example model, [Mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1). According to the [official documentation](https://docs.mistral.ai/getting-started/open_weight_models/), [Mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1) has 7.3B parameters. That parameter count only accounts for the memory needed to store the model **weights**. If you want to **train** the model, additional components use GPU memory during training: gradients, optimizer states (Adam keeps 2 states per parameter, ~8 bytes), activations saved for gradient computation, temporary buffers, and functionality-specific memory needed by your functions. This can easily add up to 20 extra bytes of memory **per** model **parameter**. The table below summarizes the approximate memory needed per parameter to **train** a 1B-parameter model.

<h3><b>GPU Memory Needed to <i>Train</i> a 1B-Parameter Model</b></h3>
<table>
  <tr style="background-color: #FFC0CB;">
    <th><font size="4"></font></th>
    <th><font size="4">Bytes per parameter</font></th>
  </tr>
  <tr>
    <td><font size="4">Model Parameters (Weights)</font></td>
    <td><font size="4">4 bytes per parameter</font></td>
  </tr>
  <tr>
    <td><font size="4">Adam optimizer (2 states)</font></td>
    <td><font size="4">+ 8 bytes per parameter</font></td>
  </tr>
  <tr>
    <td><font size="4">Gradients</font></td>
    <td><font size="4">+ 4 bytes per parameter</font></td>
  </tr>
  <tr>
    <td><font size="4">Activations and temp memory</font></td>
    <td><font size="4">+ 8 bytes per parameter</font></td>
  </tr>
  <tr>
    <th><font size="4">TOTAL</font></th>
    <td><font size="4">4 bytes per parameter + 20 extra bytes per parameter  🤯</font></td>
  </tr>
</table>


Now suppose we want to train and run inference with Mistral-7B in full 32-bit precision (4 bytes per parameter). Rounding the parameter count to 7B, the math works out as follows:

#### **Mistral 7B *inference* using full FP32 precision**
- 7B parameters x 32 bits per parameter / (8 bits per byte) = 28 GB of memory 🤯

#### **Mistral 7B *training* using full FP32 precision**
- Model Parameters: 28 GB
- Gradients: 28 GB
- Optimizer States: 56 GB (28GB x 2 for Adam optimizer)
- Activations and Temporary Memory: 56GB (high-end estimate)
- Total approximate memory training in FP32:
  28 GB (model) + 28 GB (gradients) + 56 GB (optimizer states) + 56 GB (activations & temp memory)= 168 GB 🤯 🤯
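
The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: the function name `training_memory_gb` and the dictionary are made up for illustration, the bytes-per-parameter values are the rule-of-thumb numbers from the table above, and 1 GB is taken as 1e9 bytes.

```python
# Rule-of-thumb bytes per parameter for FP32 training with Adam,
# copied from the table above (these are estimates, not measurements).
BYTES_PER_PARAM_FP32 = {
    "weights": 4,
    "gradients": 4,
    "adam_states": 8,       # two FP32 states per parameter
    "activations_temp": 8,  # high-end estimate
}


def training_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Return a per-component memory estimate in GB (1 GB = 1e9 bytes)."""
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Mistral-7B, rounded to 7 billion parameters
estimate = training_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>18}: {gb:6.1f} GB")
```

Passing a different `bytes_per_param` dictionary (for example halving every entry) gives the corresponding estimate for another precision.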


Given your restriction of only having access to a **Colab T4 GPU**, which offers only about 15 GB of GPU memory, neither training nor inference in full precision is feasible: we cannot even load the model. And... that is where **quantization** comes in handy! 🤗

Quantization is a way to load the model in lower precision so that it fits within the user's memory budget. Going back to our Mistral-7B model: instead of loading it in full precision (32-bit floating point, FP32), which does not fit in our memory budget (Colab T4, 15 GB), we load it in lower precision (e.g. FP16 or INT8). This saves a lot of memory. Here is a summary of the common precision levels before going further:

**Precision**:

- FP32 (Full Precision): Highest memory usage.
- FP16/Bfloat16 (Half Precision): Reduces memory usage by approximately half compared to FP32.
- INT8/4-bit Quantization: Further reduces memory usage but introduces some loss in precision.
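
To make the "loss in precision" concrete, here is a minimal sketch of a simple absmax INT8 scheme in plain Python. It is only an illustration of the general idea: it is not the NF4 scheme that `bitsandbytes` uses later in this notebook, and the function names and sample weights are made up.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -0.98]  # made-up example weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# The round trip is lossy: each value can be off by up to half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 values:", q)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the step size `scale`.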

Now, given this information, let's redo the math for lower-precision versions of `mistral-7B`:

**Inference**
- Model Size: 7B
- FP16 Precision:  14 GB (7 billion * 2 bytes/parameter).
- Quantized (4-bit): ONLY 3.5 GB (7 billion * 0.5 bytes/parameter).

**Training using FP16 precision**

- Model Parameters (weights): 14 GB
- Gradients: 14 GB
- Optimizer States: 28 GB (14 GB * 2 for Adam optimizer)
- Activations and temp memory: 28 GB
- Total: 14 GB (model) + 14 GB (gradients) + 28 GB (optimizer states) + 28 GB (activations & temp memory) = 84 GB


**Observation**

As you can see, training in half precision (or with an even lower-precision quantized model) saves a LOT of memory compared to training in full 32-bit precision!

**Takeaway**

In general, quantization work has largely focused on inference. You can use quantization whenever you have memory restrictions, but loading the model in lower precision of course comes with a trade-off, usually between efficiency and model quality. Still, without it, given the memory restriction, we wouldn't even be able to load the model, right?
In the following section (question 2) I cover the implementation of what we discussed here (quantization) using the [`bitsandbytes`](https://huggingface.co/docs/bitsandbytes/main/en/index) library.



**Additional resources on HF for Quantization? 🤗**

If you are interested in the topic of quantization, I suggest checking out our two Hugging Face short courses (about 1 hour each), which cover quantization in great detail, along with some further reading:
1. [Quantization Fundamentals with Huggingface](https://learn.deeplearning.ai/courses/quantization-fundamentals/)
2. [Quantization in depth](https://learn.deeplearning.ai/courses/quantization-in-depth/)
3. [MIT Han Lab](https://hanlab.mit.edu/) content on [SmoothQuant](https://hanlab.mit.edu/projects/smoothquant) and [AWQ](https://hanlab.mit.edu/projects/awq)
4. [LLM.int8() paper](https://arxiv.org/abs/2208.07339)


## Would you be able to provide a full notebook example with one of those implementations that makes the most sense for our needs? [code example required]

Oh absolutely! Below is an example implementation that fine-tunes [Mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1) on the [openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) instruction dataset, both pulled directly from the HF model and dataset hubs. Given your compute restriction (Colab T4), I used PEFT with the QLoRA technique to be able to load and train the model. Before jumping into the code, let's talk a little about the [PEFT](https://arxiv.org/abs/2312.12148) method. I hope this explanation makes it easier to see why we use this technique.



## PEFT (Parameter-Efficient Fine-Tuning)

As explained above, training LLMs is memory-intensive. **Full** fine-tuning, which means updating all the weights of the model, requires memory not just to store the model but also for everything else needed during training: optimizer states, gradients, forward activations, and temporary memory. Too large!!! 🤯 That's where **PEFT** comes in!

In contrast to **full** fine-tuning, where every model weight is updated during supervised learning, **PEFT** methods only **update** a **small subset of parameters**. This is usually done in one of two ways:
- **Some** PEFT techniques **freeze most of the model weights** and focus on **fine tuning a subset of existing model parameters**, like, particular layers or components.
- Other techniques **don't touch** the original model weights at all, and instead **add a small number of new parameters or layers** and **fine-tune only the new components**.

With PEFT, **most** if not all of the LLM weights are kept frozen. As a result, **the number of trained parameters is much smaller than the number of parameters in the original LLM**, in some cases just 15-20% of the original LLM weights. This makes the memory requirements for training much more manageable.

<center>
<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*SOHsccmwk52jgKzm.png" alt="drawing" width="700" class="center"/>
    <figcaption><a href="https://medium.com/@akriti.upadhyay/optimizing-performance-with-peft-a-deep-dive-into-prompt-tuning-2b9a17bc9851" target="_blank">img source</a></figcaption>
</figure>
</center>



Methods to implement PEFT;  
- Additive methods
- Selective Methods
- Reparameterization-based Methods (LoRa implementation)

<center>
<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1400/1*4LOEgon8uwQrwAPU_pzY4w.png" alt="drawing" width="700" class="center"/>
    <figcaption><a href="https://arxiv.org/abs/2312.12148" target="_blank">peft paper</a></figcaption>
</figure>
</center>

Each of these methods of course comes with trade-offs in parameter efficiency, memory efficiency, training speed, model quality, and inference cost!

### PEFT using a reparameterization-based method (LoRA)
In the following code we implement PEFT using the [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation of Large Language Models) technique, which belongs to the reparameterization-based family.
<center>
<figure>
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/lora_diagram.png" alt="drawing" width="700" class="center"/>
    <figcaption><a href="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/peft/" target="_blank">source huggingface</a></figcaption>
</figure>
</center>

In short, LoRA works by freezing the pretrained model weights and injecting a pair of trainable rank-decomposition matrices into the layers of the Transformer architecture.
A very important aspect of LoRA is that it introduces **no additional inference latency**. This is because the low-rank update can be merged into the frozen weights: when deployed in production, one can explicitly compute and store $W = W_0 + BA$ and perform inference as usual, since both $W_0$ and $BA$ are in $\mathbb{R}^{d \times k}$. When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then adding a different $B'A'$, a quick operation with very little memory overhead. Critically, this guarantees that we do not introduce any additional latency during inference compared to a fully fine-tuned model, by construction ([LoRa paper](https://arxiv.org/abs/2106.09685)).
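
The merge-and-restore argument can be checked on toy matrices. The sketch below uses plain Python lists with made-up values and tiny dimensions, and it leaves out the `lora_alpha / r` scaling factor for brevity; the helper names `matmul` and `add` are only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, sign=1):
    """Element-wise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: d = 3 outputs, k = 2 inputs, rank r = 1 (all values are made up).
W0 = [[1, 2], [3, 4], [5, 6]]  # frozen pretrained weight, d x k
B = [[1], [0], [2]]            # trainable, d x r
A = [[3, -1]]                  # trainable, r x k
x = [[1], [2]]                 # one input column vector, k x 1

delta = matmul(B, A)                                  # BA has the same d x k shape as W0
adapter_out = add(matmul(W0, x), matmul(delta, x))    # W0 x + BA x (two separate paths)
W_merged = add(W0, delta)                             # W = W0 + BA, computed once
merged_out = matmul(W_merged, x)                      # a single matmul at inference time
W_restored = add(W_merged, delta, sign=-1)            # subtract BA to recover W0

print(adapter_out == merged_out, W_restored == W0)
```

Because the merged weight has the same shape as the original one, the forward pass costs exactly one matrix multiplication, just as in the base model.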

### **Potential questions**:

**1. Which parts of the model can LoRA be applied to?**

Researchers have found that applying LoRA to just the self-attention layers of the model is often enough to fine-tune for a task and achieve performance gains. In principle, you can also use LoRA on other components such as the feed-forward layers. Restricting LoRA to the attention weight matrices keeps the number of trainable parameters, and therefore the extra memory, small.

**1.1. Where did we apply LoRA in the code provided below?**

To the attention-layer parameters of the model!
If you want to see the attention and MLP parameters of your model, read the documentation of the model you are using or simply run `print(model)`, which shows all the details of the model architecture.
Note that in the following code example LoRA is applied to the `self-attention` block of the model. You can also apply it to other blocks of your target model (e.g. the `MLP`); keep in mind that targeting more modules adds more trainable parameters, so it trades some extra memory for additional adaptation capacity.


**2. Is there any way to be even more efficient on memory usage? If yes, how?**

**[QLoRA](https://arxiv.org/abs/2305.14314) (4-bit quantization)** quantizes an LLM’s weights to 4-bit and uses LoRA to fine-tune the quantized LLM. QLoRA reduces the memory footprint of the weights and optimizer states; as a result, for instance, fine-tuning a 65B LLM requires less than 48 GB of memory!!! 🫠

**NOTE:** In the following code I also used QLoRA to be on the safe side in terms of memory usage. The basic difference between the LoRA and QLoRA implementations is that you set `load_in_4bit` to `True` and `bnb_4bit_quant_type="nf4"`. In addition, I provide sample inference code so you can see everything in one place.

## Step 1) Implementation

In [1]:
# Install the libraries
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets


In [2]:
# Required when training models/data that are gated on HuggingFace, and required for pushing models to HuggingFace
from huggingface_hub import notebook_login
notebook_login()


### Loading the model and its tokenizer
`load_in_4bit=True` ---> QLoRA


In [3]:
# setting up the config for 4-bit quantization of Qlora
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### Prepare model for PEFT

In [4]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [5]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

#### Setup `LoRaConfig`

In [6]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projection module names in Mistral; check print(model) and adjust if your model names them differently
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 6815744 || all params: 3758886912 || trainable%: 0.18132346515244138


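## Observation 💥 :
BOOM!! 💥 Only about 0.18% of the parameters are being trained, and the whole setup consumes only 5.2 GB of GPU memory! (The reported "all params" figure of roughly 3.76B is below 7B most likely because the 4-bit weights are stored packed, so the true trainable share is even smaller.)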
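
The trainable-parameter count printed by `print_trainable_parameters` (6,815,744) can be reproduced by hand. The sketch below assumes the layer shapes of Mistral-7B-v0.1 as I understand them from its public config (hidden size 4096, 32 decoder layers, 1024-wide key/value projections); treat those numbers as assumptions and double-check them with `print(model)`.

```python
# Assumed Mistral-7B-v0.1 shapes (taken from the public model config):
HIDDEN = 4096     # hidden size, also the q_proj / o_proj width
KV_DIM = 1024     # k_proj / v_proj output width (grouped-query attention)
NUM_LAYERS = 32   # number of decoder layers
RANK = 8          # r in LoraConfig

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * in_features + out_features * rank


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total_trainable = per_layer * NUM_LAYERS

# Denominator is the "all params" value reported in the cell output above.
share = 100 * total_trainable / 3_758_886_912
print(total_trainable, f"{share:.4f}%")
```

The result matches the count in the cell output, which is a useful sanity check that the adapters were attached to exactly the four attention projections in every layer.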


## Step 2) Fine-tuning process


In [7]:
# Load the dataset from HF
from datasets import load_dataset

data = load_dataset("timdettmers/openassistant-guanaco")
data = data.map(lambda samples: tokenizer(samples["text"]), batched=True)


## Training

For the sake of the demo, we ran training for only 10 steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

In [8]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss
1,1.0799
2,1.6832
3,1.5365
4,1.7591
5,1.7915
6,1.4908
7,1.4289
8,0.8238
9,1.3245
10,1.6538


TrainOutput(global_step=10, training_loss=1.4572090446949004, metrics={'train_runtime': 95.0505, 'train_samples_per_second': 0.421, 'train_steps_per_second': 0.105, 'total_flos': 610850246492160.0, 'train_loss': 1.4572090446949004, 'epoch': 0.004062563477554337})

## Observation

We only used 5.2 GB of the 15 GB available on the Colab T4!!!

## Sample Example Code

In [9]:
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_p

In [11]:
from transformers import GenerationConfig

max_new_tokens = 120
top_p = 0.9
temperature=0.7
user_question = "What is the purpose of quantization in LLMs?"


prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "### Human: {user_question}"
    "### Assistant: "
)

def generate(model, user_question, max_new_tokens=max_new_tokens, top_p=top_p, temperature=temperature):
    inputs = tokenizer(prompt.format(user_question=user_question), return_tensors="pt").to('cuda')

    outputs = model.generate(
        **inputs,
        generation_config=GenerationConfig(
            do_sample=True,
            max_new_tokens=max_new_tokens,
            top_p=top_p,
            temperature=temperature,
        )
    )

    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    #print(text)
    return text

generate(model, user_question)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. ### Human: What is the purpose of quantization in LLMs?### Assistant:  The purpose of quantization in LLMs is to reduce the amount of memory and computational resources required to store and process the model's parameters. This is important because the number of parameters in large LLMs can be very large, making them difficult to train and deploy on resource-constrained devices.  By quantizing the parameters, we can reduce the amount of memory and computational resources required to store and process the model's parameters, making it easier to train and deploy the model on resource-constrained devices.  In addition, quantization can also improve the model'"

## Sample Inference Code

Before running the code below, make sure to:
1. Save your updated weights and adapter to your target repo on Hugging Face.
2. Push your model and updated weights to the Hugging Face model hub.

Note that in the above implementation the LoRA config is only applied to the attention layers. You can choose to apply LoRA to any subset of weight matrices (e.g. the MLP layers) in the model to control the number of trainable parameters. Running `print(model)` confirms that the LoRA config is applied to the attention projections we defined above: `["q_proj", "k_proj", "v_proj", "o_proj"]`.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig
from peft import PeftModel

# Update variables!
max_new_tokens = 100
top_p = 0.9
temperature=0.7
user_question = "What is Einstein's theory of relativity?"

# Base model
model_name_or_path = 'YOUR_BASE_MODEL'
adapter_path = 'YOUR_ADAPTER_PATH'

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# Only needed for early LLaMA HF checkpoint conversions that shipped a wrong BOS token id.
tokenizer.bos_token_id = 1

# Load the model (use bf16 for faster inference)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    # Qlora -- 4-bit config
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    )
)

model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "### Human: {user_question}"
    "### Assistant: "
)

def generate(model, user_question, max_new_tokens=max_new_tokens, top_p=top_p, temperature=temperature):
    inputs = tokenizer(prompt.format(user_question=user_question), return_tensors="pt").to('cuda')

    outputs = model.generate(
        **inputs,
        generation_config=GenerationConfig(
            do_sample=True,
            max_new_tokens=max_new_tokens,
            top_p=top_p,
            temperature=temperature,
        )
    )

    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(text)
    return text

generate(model, user_question)

# Can you specify how the data should be formatted for this type of fine-tuning?

In general, the dataset you use for training consists of many prompt-completion pairs for the task you are interested in, each of which includes an instruction. For example, if you want to fine-tune your model to improve its summarization ability, you would build a dataset of examples that begin with the instruction "Summarize the following text" or a similar phrase. If you are improving the model's translation skills, your examples would include instructions like "Translate this sentence". These prompt-completion examples allow the model to learn to generate responses that follow the given instructions.

There are many publicly available datasets that have been used to train earlier generations of language models, although most of them are not formatted as instructions. Luckily, developers have assembled prompt template libraries that can be used to take existing datasets, for example, the large data set of Amazon product reviews and turn them into instruction prompt datasets for fine-tuning. Prompt template libraries include many templates for different tasks and different data sets. [Prompt Template Source](https://github.com/bigscience-workshop/promptsource).


For the specific dataset and model you gave me, below is an example of the formatted prompt used in the inference code.

In [12]:
prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "### Human: {user_question}"
    "### Assistant: "
)
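
If you start from raw instruction/response pairs, a small helper along these lines could turn them into single-string records. This is only a sketch: the `### Human:` / `### Assistant:` markers mirror the prompt above, the `text` field name matches the `samples["text"]` column used in the tokenization cell, and the function name and example pairs are made up.

```python
def format_example(instruction, response):
    """Join one instruction/response pair into a single training string."""
    return f"### Human: {instruction}### Assistant: {response}"


# Hypothetical raw pairs for two different tasks
pairs = [
    ("Summarize the following text: LoRA trains small low-rank matrices.",
     "LoRA fine-tunes a model by training small low-rank matrices."),
    ("Translate this sentence to French: Good morning.",
     "Bonjour."),
]

# One dict per example, with the whole conversation in a "text" field
records = [{"text": format_example(q, a)} for q, a in pairs]
print(records[0]["text"])
```

A list of such dictionaries can then be tokenized on the `text` field in the same way as the dataset used in this notebook.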

# What is the most optimal configuration of parameters to use for this technique?
The general answer is that it depends on many factors, such as the size of the model you are using, the GPU memory you have access to, and the nature of the dataset you are fine-tuning on. For instance, in the `LoraConfig` setup, `r` is the rank of the low-rank matrices learned during fine-tuning. As this value increases, the number of parameters updated during low-rank adaptation increases. Intuitively, a lower `r` may lead to a quicker, less computationally intensive training process, but may affect the quality of the resulting model. However, increasing `r` beyond a certain value may not yield any discernible increase in the quality of the model output. So if you only have access to a Colab T4, a lower rank is probably the more practical choice, with the trade-off explained above. In general, the best values depend on the model size and the memory available, and can vary ([useful article](https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms)).

In the code, I tried to set the config parameters with your compute restrictions in mind.
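
To get a feel for how `r` drives the adapter size, the sketch below counts the parameters LoRA adds to a single 4096x4096 projection for a few ranks. The layer size and the list of ranks are illustrative assumptions; `lora_alpha / r` is the scaling factor standard LoRA applies to the update.

```python
def lora_trainable_params(rank, in_features=4096, out_features=4096):
    """Parameters added by one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


LORA_ALPHA = 32  # same value as in the LoraConfig cell

for rank in (4, 8, 16, 64):
    params = lora_trainable_params(rank)
    scaling = LORA_ALPHA / rank  # factor applied to the low-rank update
    print(f"r={rank:>2}: {params:>7} params per 4096x4096 layer, scaling={scaling}")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter parameters (and their gradients and optimizer states), while the frozen base weights stay the same size.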

# Additionally, would it be feasible to adapt this code for the fine-tuning of a larger model, such as llama-13B? What changes would we need to make? [No code required]

- `LoraConfig`: each model architecture has its own module names, so if you apply PEFT through LoRA or QLoRA you need to make sure `target_modules` in `LoraConfig` matches the model you are targeting. For example, for Llama you can set `target_modules=["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj"]` to apply it to the attention block. You can also choose to apply it to other layers, such as the MLP part of the model.
- For inference, you can again adjust the hyperparameters to your computational resources (see the inference code provided).
- Depending on your model and task, you can update the `trainer` hyperparameters (e.g. `max_steps`).
- Set `model_name_or_path` to your desired `llama-13B` version.
- For Llama-specific behavior, you may need to set `tokenizer.bos_token_id = 1` to fix the early conversion issues.
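
Before switching models it is worth checking whether the weights alone still fit on the GPU. The sketch below is a back-of-the-envelope estimate only: it counts weights and ignores activations, LoRA adapters, optimizer states, and framework overhead, and the 15 GB budget is the Colab T4 figure quoted earlier. All names are made up for illustration.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}
GPU_BUDGET_GB = 15  # approximate usable memory of a Colab T4


def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for name, n_params in (("7B", 7e9), ("13B", 13e9)):
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(n_params, precision)
        verdict = "fits" if gb < GPU_BUDGET_GB else "does not fit"
        print(f"{name} {precision:>4}: {gb:5.1f} GB -> {verdict}")
```

Under these assumptions a 13B model needs about 6.5 GB for its 4-bit weights, which leaves less headroom for activations than the 7B model, so a smaller batch size or shorter sequences may be needed.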


# Deployment/cloud considerations

I'm also interested in your perspective on how we could deploy the model.. Our production workloads will be predominantly hosted on AWS.

### Could you suggest any deployment architectures, considering our need for the system to engage with customers in real-time?

- Again, depending on the model you are targeting and your fine-tuning configuration, there are different services you can use. I highly recommend checking out [Amazon Bedrock](https://aws.amazon.com/bedrock/?gclid=CjwKCAjwvIWzBhAlEiwAHHWgvTxddQo8OrFed-ovSNossKoIQiP_XGHwDRHBme8ach05RWD5N9hWrhoCwacQAvD_BwE&trk=8228be07-d2ee-417f-b236-33eb068829a6&sc_channel=ps&ef_id=CjwKCAjwvIWzBhAlEiwAHHWgvTxddQo8OrFed-ovSNossKoIQiP_XGHwDRHBme8ach05RWD5N9hWrhoCwacQAvD_BwE:G:s&s_kwcid=AL!4422!3!692006005915!e!!g!!bedrock!21054971261!157173597057). This service is accessible through APIs and you don’t manage the infrastructure. If your concern is data privacy, then according to the FAQ under [security](https://aws.amazon.com/bedrock/faqs/), the data is encrypted and stored at rest in the AWS Region where you are using Amazon Bedrock. If you still want to fully own the infrastructure, you can host the model using [jumpstart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html) (see this related [post](https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/)). In this scenario you’ll be running your own endpoint, and thus paying whether or not you are running inference. Depending on the usage pattern of your service, this could mean burning through your budget while the endpoint sits idle.

- If none of the above services work for you, you can always go for [EC2](https://aws.amazon.com/pm/ec2/?gclid=CjwKCAjwvIWzBhAlEiwAHHWgvTBUUW0H3OsPq-xdoGcjmMzdZ5nuKLtz8XmX6TzjYcjaTla7tS4IDBoCbBcQAvD_BwE&trk=8c0f4d22-7932-45ae-9a50-7ec3d0775c47&sc_channel=ps&ef_id=CjwKCAjwvIWzBhAlEiwAHHWgvTBUUW0H3OsPq-xdoGcjmMzdZ5nuKLtz8XmX6TzjYcjaTla7tS4IDBoCbBcQAvD_BwE:G:s&s_kwcid=AL!4422!3!472464674288!e!!g!!aws%20ec2!11346198414!112250790958), which lets you monitor both CPU and GPU usage and gives you root access to the instance. Hope this helps!

- Useful AWSxHF Implementation Notebook: [Huggingface Sagemaker-sdk - Getting Started Demo](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb)

### Could you share insights on reducing latency and increasing throughput with such a deployment?

Given that you will be using the QLoRA config if you reuse the same code, in general there should not be significant added latency in your production pipelines. The quantized version of the model (QLoRA, 4-bit) uses far less memory, and because the LoRA update can be merged into the base weights, the adapter itself adds little to no inference latency. According to the [LoRa paper](https://arxiv.org/abs/2106.09685):



> **No Additional Inference Latency**. When deployed in production, we can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. Note that both $W_0$ and $BA$ are in $\mathbb{R}^{d \times k}$. When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then adding a different $B'A'$, a quick operation with very little memory overhead. Critically, this guarantees that we do not introduce any additional latency during inference compared to a fine-tuned model by construction.



Also, a few useful resources for AWS implementation: [throughput1](https://go.aws/44fsTy9) & [throughput2](https://go.aws/49AFVan)

