## Mistral-7B-Instruct_GPTQ - Finetune on finance-alpaca dataset

### Check out my [Twitter(@rohanpaul_ai)](https://twitter.com/rohanpaul_ai) for daily LLM bits

<a href="https://colab.research.google.com/github/rohan-paul/LLM-FineTuning-Large-Language-Models/blob/main/Mistral_7B_Instruct_GPTQ_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# !pip install --upgrade trl peft accelerate bitsandbytes datasets auto-gptq optimum -q

In [None]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, BitsAndBytesConfig
from datasets import load_dataset
import pandas as pd
import logging
import os
from pathlib import Path
from typing import Optional, Tuple
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import GPTQConfig

In [3]:
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

dataset = load_dataset('gbharti/finance-alpaca')
# Split the dataset into train and test sets
train_test_split = dataset['train'].train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Further split the train dataset into train and validation sets
train_val_split = train_dataset.train_test_split(test_size=0.1)
train_dataset = train_val_split['train']
eval_dataset = train_val_split['test']



##############

pretrained_model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

![](assets/2023-12-30-23-50-29.png)

In [4]:
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

def format_input_data_to_build_model_prompt(data_point):
    instruction = str(data_point['instruction'])
    input_query = str(data_point['input'])
    response = str(data_point['output'])

    # Use the "with input" template only when the record actually has an input.
    if len(input_query.strip()) > 0:
        full_prompt_for_model = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input_query}\n\n### Response:\n{response}"""

    else:
        full_prompt_for_model = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}"""
    return tokenize(full_prompt_for_model)

## Need for input data formatting i.e. `format_input_data_to_build_model_prompt` method

📌 The `format_input_data_to_build_model_prompt` method processes one record of the dataset, which has the fields 'instruction', 'input', and 'output', representing the different components of a training sample. The method consolidates these components into a single prompt string, structured to match the training requirements of LLMs, and then tokenizes it.

📌 Specifically, the method builds each prompt as a concatenation of the instruction, the context (if provided), and the expected response. This formatting is key for fine-tuning transformer-based LLMs: it keeps each prompt (instruction and input query) associated with its expected response.
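To make the formatting concrete, here is a minimal, string-only sketch of the same idea. The helper name `build_alpaca_prompt` and the sample record are hypothetical, made up for illustration; the notebook itself goes on to tokenize the prompt.

```python
def build_alpaca_prompt(instruction: str, input_query: str, response: str) -> str:
    """Assemble one training prompt; the Input section is added only when present."""
    if input_query.strip():
        header = ("Below is an instruction that describes a task, paired with an input "
                  "that provides further context. Write a response that appropriately "
                  "completes the request.")
        return (f"{header}\n\n### Instruction:\n{instruction}"
                f"\n\n### Input:\n{input_query}\n\n### Response:\n{response}")
    header = ("Below is an instruction that describes a task. Write a response that "
              "appropriately completes the request.")
    return f"{header}\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}"


# Hypothetical record, made up for illustration (not taken from finance-alpaca).
sample = {
    "instruction": "Define the term.",
    "input": "dividend",
    "output": "A payout to shareholders.",
}
prompt_with_input = build_alpaca_prompt(sample["instruction"], sample["input"], sample["output"])
prompt_without_input = build_alpaca_prompt(sample["instruction"], "", sample["output"])
print(prompt_with_input)
```

Records with an empty `input` field get the shorter template, so the model never sees an empty `### Input:` section.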

==============

##  Prompt format for mistralai/Mixtral-8x7B-v0.1 🔥

https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/discussions/22


"Mixtral-8x7B-v0.1" is a base model, therefore it doesn't need to be prompted in a specific way in order to get started with the model. If you want to use the instruct version of the model, you need to follow the template that is on the model card: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1#instruction-format

The template used to build a prompt for the Instruct model is defined as follows:

```
<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
```
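As a rough illustration of the template above, the sketch below renders a conversation into that string layout. The helper `build_instruct_prompt` is a hypothetical name, and this is string formatting only: when tokenizing, take care that the tokenizer does not add a second BOS token on top of the literal `<s>`.

```python
from typing import List, Tuple


def build_instruct_prompt(turns: List[Tuple[str, str]], follow_up: str) -> str:
    """Render completed (instruction, answer) turns plus one open instruction."""
    prompt = "<s>"
    for instruction, answer in turns:
        # Each finished turn is closed with the end-of-sequence marker.
        prompt += f" [INST] {instruction} [/INST] {answer}</s>"
    # The last instruction is left open so the model writes the answer.
    prompt += f" [INST] {follow_up} [/INST]"
    return prompt


example = build_instruct_prompt([("Instruction", "Model answer")], "Follow-up instruction")
print(example)
```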

In [5]:
def build_qlora_model(
    pretrained_model_name_or_path: str = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
    gradient_checkpointing: bool = True,
    cache_dir: Optional[Path] = None,
) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
    """
    Args:
        pretrained_model_name_or_path (str): The name or path of the pretrained model to use.
        gradient_checkpointing (bool): Whether to use gradient checkpointing or not.
        cache_dir (Optional[Path]): The directory to cache the model in.

    Returns:
        Tuple[AutoModelForCausalLM, AutoTokenizer]: A tuple containing the built model and tokenizer.
    """

    # A GPTQ checkpoint is already quantized, so the bitsandbytes config below
    # stays commented out: an already-quantized model cannot be quantized again.

    # bnb_config = BitsAndBytesConfig(
    #     load_in_4bit=True,
    #     bnb_4bit_use_double_quant=True,
    #     bnb_4bit_compute_dtype=torch.bfloat16
    # )

    # For the same reason, `from_pretrained` below receives a GPTQConfig
    # instead of bnb_config as its quantization_config.

    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token

    quantization_config_loading = GPTQConfig(bits=4, use_exllama=False, tokenizer=tokenizer)

    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path,
        # quantization_config=bnb_config,
        quantization_config=quantization_config_loading,
        device_map="auto",
        cache_dir=str(cache_dir) if cache_dir else None,
    )

    #disable tensor parallelism
    model.config.pretraining_tp = 1

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()
        model.config.use_cache = (
            False  # Gradient checkpointing is not compatible with caching.
        )
    else:
        model.gradient_checkpointing_disable()
        model.config.use_cache = True  # It is good practice to enable caching when using the model for inference.

    return model, tokenizer

In [6]:
model, tokenizer = build_qlora_model(pretrained_model_name_or_path)

You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.


In [7]:

model = prepare_model_for_kbit_training(model)

In [8]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)


In [10]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)

# trainable params: 6815744 || all params: 269225984 || trainable%: 2.5316070532033046

# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)

trainable params: 6815744 || all params: 269225984 || trainable%: 2.5316070532033046



In [None]:
tokenized_train_dataset = train_dataset.map(format_input_data_to_build_model_prompt)
tokenized_val_dataset = eval_dataset.map(format_input_data_to_build_model_prompt)

### Let's grab a single data point from our testset (both instruction and output) to see how the base model does on it.

In [12]:
print("Instruction Sentence: " + test_dataset[1]['instruction'])
print("Output: " + test_dataset[1]['output'] + "\n")

Instruction Sentence: Describe how a person's life might be different if he/she won the lottery.
Output: The person could easily afford their desired lifestyle, from buying luxury cars and homes to traveling the world and not having to worry about financial concerns. They could pursue their dream career or start a business or charity of their own, leaving them with a much more fulfilling life. They could give back to their communities and make a positive difference. They can use their wealth to make a lasting impact in the lives of family and friends. All in all, winning the lottery can drastically change a person's life for the better.



In [13]:
eval_prompt = """Given an instruction sentence construct the output.

### Instruction sentence:
Generate a sentence that describes the main idea behind a stock market crash.


### Output


"""

Before fine-tuning a quantized model, we have to apply some preprocessing to prepare it for training. That is what the `prepare_model_for_kbit_training` method from PEFT did in cell `In [7]` above. The cell below passes the model through `accelerator.prepare_model`.

In [14]:
# Apply the accelerator. You can comment this out to remove the accelerator.
# prepare_model - Prepares a PyTorch model for training in any distributed setup.
model = accelerator.prepare_model(model)

In [15]:
# Re-init the tokenizer so it doesn't add padding or eos token
eval_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    add_bos_token=True,
)

In [16]:
device = "cuda"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to(device)

In [None]:
model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=128)[0], skip_special_tokens=True))

It actually did not do very well out of the box.

Let's print the model to examine its layers. We applied LoRA to the attention projection layers `q_proj`, `k_proj`, `v_proj`, `o_proj`, and the printout shows the LoRA modules wrapped around each of them.

In [19]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (rotary_emb): MistralRotaryEmbedding()
              (k_proj): QuantLinear(
                (base_layer): QuantLinear()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (quant_linear_module): QuantLinear()
              )
     


📌 `LoraConfig` allows you to control how LoRA is applied to the base model:

📌 Rank of Decomposition r

r represents the rank of the low-rank matrices learned during the finetuning process. As this value is increased, the number of parameters that need to be updated during the low-rank adaptation increases. Intuitively, a lower r may lead to a quicker, less computationally intensive training process, but may reduce the quality of the resulting model. However, increasing r beyond a certain value may not yield any discernible improvement in the quality of the model output.
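The link between r and the number of trainable parameters can be checked with simple arithmetic. The sketch below assumes the layer shapes of Mistral-7B: `k_proj` is read off the printed model, while the shapes of `q_proj`, `v_proj` and `o_proj` are an assumption based on the architecture's hidden size of 4096 and key/value width of 1024. The function name is hypothetical.

```python
# (in_features, out_features) of the four targeted projections.
# k_proj comes from the printed model; the other three are assumed.
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32  # "(0-31): 32 x MistralDecoderLayer" in the printout


def lora_trainable_params(r: int) -> int:
    """lora_A holds r * in_features weights and lora_B holds out_features * r."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


for rank in (4, 8, 16):
    print(rank, lora_trainable_params(rank))
```

With r=8 this gives 6815744, the same `trainable params` figure printed earlier, and doubling r doubles the count.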

---------------

`target_modules` are the names of the modules LoRA is applied to. Here it is set to the query, key, value and output projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`), which are the inner linear layers of the self-attention block in the Transformer architecture.

---------------

📌 Alpha Parameter for LoRA Scaling `lora_alpha`

According to the LoRA paper (Hu et al.), ∆W is scaled by α / r, where α is a constant. When optimizing with Adam, tuning α is roughly the same as tuning the learning rate if the initialization was scaled appropriately. The reason is that the number of parameters increases linearly with r: as you increase r, the values of the entries in ∆W also scale linearly with r. We want ∆W to scale consistently with the pretrained weights no matter what r is used. That's why the authors set α to the first r they try and do not tune it. The default of α is 8.
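As a small worked example of the α / r factor, the sketch below evaluates it for the values used in this notebook's `LoraConfig` (r=8, lora_alpha=16) and for two other settings. The function name `lora_scaling` is hypothetical.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update: delta_W = (alpha / r) * (B @ A)."""
    return lora_alpha / r


# Values used in the LoraConfig of this notebook.
print(lora_scaling(lora_alpha=16, r=8))   # 2.0
# Setting alpha equal to r leaves the update unscaled.
print(lora_scaling(lora_alpha=8, r=8))    # 1.0
# Raising r while holding alpha fixed shrinks the factor.
print(lora_scaling(lora_alpha=16, r=32))  # 0.5
```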

---------

📌 `Dropout Rate (lora_dropout)`: This is the probability that each neuron's output is set to zero during training, used to prevent overfitting.

Dropout is a general technique in deep learning that reduces overfitting by randomly selecting neurons to ignore, each with the dropout probability, during training. The contribution of the selected neurons to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass. The default of lora_dropout is 0.
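The mechanism can be sketched in a few lines of plain Python. This is an illustration rather than PEFT's implementation: it follows the "inverted dropout" convention, also used by `torch.nn.Dropout`, in which surviving values are rescaled by 1 / (1 - p) during training and nothing is changed at inference time.

```python
import random
from typing import List


def dropout(values: List[float], p: float, training: bool, rng: random.Random) -> List[float]:
    """Zero each value with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)  # identity at inference time
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
train_out = dropout(activations, p=0.05, training=True, rng=rng)
eval_out = dropout(activations, p=0.05, training=False, rng=rng)
print(train_out)
print(eval_out)
```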

### Training!

In [22]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

In [23]:
torch.cuda.device_count()

1

In [None]:
import transformers
from datetime import datetime

project = "finetune-alpaca-GPTQ"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5,
        logging_steps=25,
        fp16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step schedule
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step schedule
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Run evaluation on eval_dataset
        # report_to="wandb",         # Uncomment to log to Weights & Biases
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Re-enable for inference!
trainer.train()
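For a rough sense of the training budget these arguments imply, the sketch below does the arithmetic. It assumes a single GPU, as reported by `torch.cuda.device_count()` above; the variable names simply mirror the `TrainingArguments` fields.

```python
# Values copied from the TrainingArguments above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1  # assumption: single GPU
max_steps = 1000
save_steps = 50

# Samples contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total training samples consumed over the whole run.
examples_seen = effective_batch_size * max_steps
# Steps at which a checkpoint directory is written.
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))

print(effective_batch_size)     # 4
print(examples_seen)            # 4000
print(len(checkpoint_steps))    # 20
print(500 in checkpoint_steps)  # True
```

So the run sees 4000 samples in total and writes a checkpoint every 50 steps, which is where a directory such as `checkpoint-500` comes from.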

### Evaluate the Trained Model!

However, before running the evaluation code, it's a good idea to kill the current process, to avoid running out of memory when loading the base model again on top of the model we just trained.

To kill the current process, go to `Kernel > Restart Kernel` or kill the process via the terminal (`nvidia-smi` > `kill [PID]`).

### By default, the PEFT library will only save the QLoRA adapters, so we need to first load the base Mistral model from the Hugging Face Hub:


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GPTQConfig

pretrained_model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# After a kernel restart the GPTQ config no longer exists, so rebuild it
# with the same bits and exllama settings used for training.
quantization_config_loading = GPTQConfig(bits=4, use_exllama=False)

base_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path,  # Mistral, same as before
    # quantization_config=bnb_config,  # Commented out: a GPTQ model is already quantized
    quantization_config=quantization_config_loading,
    device_map="auto",
    trust_remote_code=True,
)

eval_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    add_bos_token=True,
    trust_remote_code=True,
)


Now load the QLoRA adapter from the appropriate checkpoint directory, i.e. the best performing model checkpoint:

In [None]:
from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model, "mistral-finetune-alpaca-GPTQ/checkpoint-500")

# Here, "mistral-finetune-alpaca-GPTQ/checkpoint-500" is the path to the saved adapter checkpoint (the `model_id` argument)

and run your inference!

Let's try the same `eval_prompt` and thus `model_input` as above, and see if the new finetuned model performs better.

In [None]:
eval_prompt = """Given an instruction sentence construct the output.

### Instruction sentence:
Generate a sentence that describes the main idea behind a stock market crash.


### Output


"""

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=50)[0], skip_special_tokens=True))

## `PeftModel.from_pretrained` - Explanations

https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel.from_pretrained


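When I run the line below, PEFT reads the adapter configuration and weights from `model_id` and attaches them to `base_model`. With the default `is_trainable=False`, the adapter is loaded for inference only.

`ft_model = PeftModel.from_pretrained(base_model, model_id)`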


--------------------

### Source Code

https://github.com/huggingface/peft/blob/v0.7.1/src/peft/peft_model.py#L282

```
def from_pretrained(
        cls,
        model: torch.nn.Module,
        model_id: Union[str, os.PathLike],
        adapter_name: str = "default",
        is_trainable: bool = False,
        config: Optional[PeftConfig] = None,
        **kwargs: Any,
    ) -> "PeftModel":
        r"""
        Instantiate a PEFT model from a pretrained model and loaded PEFT weights.

        Note that the passed `model` may be modified inplace.

        Args:
            model ([`torch.nn.Module`]):
                The model to be adapted. For 🤗 Transformers models, the model should be initialized with the
                [`~transformers.PreTrainedModel.from_pretrained`].
            model_id (`str` or `os.PathLike`):
                The name of the PEFT configuration to use. Can be either:
                    - A string, the `model id` of a PEFT configuration hosted inside a model repo on the Hugging Face
                      Hub.
                    - A path to a directory containing a PEFT configuration file saved using the `save_pretrained`
                      method (`./my_peft_config_directory/`).
            adapter_name (`str`, *optional*, defaults to `"default"`):
                The name of the adapter to be loaded. This is useful for loading multiple adapters.


```

