# Fine-tune a Mistral-7B-Instruct-v0.1 model and quantize it with the AWQ algorithm

The following resources were used as references for the fine-tuning process and for choosing the hyperparameters:
- [Tuning-the-Finetuning](https://github.com/avisoori-databricks/Tuning-the-Finetuning)
- [Mistral Mastery: Fine-Tuning & Fast Inference Guide](https://medium.com/@parikshitsaikia1619/mistral-mastery-fine-tuning-fast-inference-guide-62e163198b06)
- [4-bit Transformers with Hugging Face](https://huggingface.co/blog/4bit-transformers-bitsandbytes)
- [Supervised Fine-tuning Trainer (TRL documentation)](https://huggingface.co/docs/trl/en/sft_trainer)
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)


## Install the necessary prerequisite modules

This notebook was run with PyTorch 2.1 and CUDA 12.1.
Set up a virtual environment before installing the packages.

In [2]:
!pip install -U -q trl accelerate bitsandbytes peft transformers autoawq torch


### Optional
Install Weights & Biases (W&B) to track the training progress.
The template below sets the W&B project that the training run is assigned to. `WANDB_RESUME` and `WANDB_RUN_ID` resume an existing run; remove those two lines to start a new run.

In [1]:
!pip install -q wandb -U

import os
import wandb

wandb.login()

wandb_project = "mistral-finetune-v2.1"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project
    # Resume an existing run; "must" raises an error if the run ID does not exist
    os.environ["WANDB_RESUME"] = "must"
    os.environ["WANDB_RUN_ID"] = "f5rpfdmw"

wandb: Currently logged in as: vpmb. Use `wandb login --relogin` to force relogin


### Perform the necessary imports to load the model to memory

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

In [2]:
# Quantization config: load the model in 4-bit (NF4) rather than 16-bit.
# This reduces memory usage and enables QLoRA fine-tuning.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    device_map="auto",
    quantization_config=nf4_config,
)

# Load the tokenizer; pass use_fast=True to enable the fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Set up the padding token. Because the output is a constrained JSON object,
# generation has to stop at the end of the JSON.
# tokenizer.pad_token = tokenizer.eos_token yielded errors, so unk_token is used as the padding token.
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"

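To give a rough sense of why the 4-bit load matters, the sketch below estimates the memory needed for the weights alone at 16-bit and at NF4 with double quantization. The parameter count (about 7.24 billion) and the per-parameter overhead of the quantization constants (about 0.127 bits with double quantization) are assumptions for illustration, not values measured in this notebook; activations, the LoRA adapter and the optimizer state are not counted.

```python
# Rough weight-memory estimate; the figures are illustrative assumptions, not measurements.
ASSUMED_PARAMS = 7.24e9      # approximate parameter count of a 7B-class model
NF4_OVERHEAD_BITS = 0.127    # assumed per-parameter cost of the quantization constants


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


bf16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)
nf4_gib = weight_memory_gib(ASSUMED_PARAMS, 4 + NF4_OVERHEAD_BITS)

print(f"bf16 weights: {bf16_gib:.1f} GiB")
print(f"nf4 weights:  {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the 16-bit footprint, which is what makes QLoRA fine-tuning feasible on a single GPU.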

## Load the dataset
Prepare the dataset in JSONL format, where each line is a single JSON object.

A sample entry looks like this: ```{"prompt": "<s>[INST] ### Instruction : You are given a sequence of text. Generate a logical question, correct answer and three incorrect answers suitable for mcq from the input text.The output should be formatted as a json in the below format.{\"type\": \"object\", \"properties\": {\"Output\": {\"type\": \"object\", \"properties\": {\"question\": {\"type\": \"string\"}, \"correct_answer\": {\"type\": \"string\"}, \"incorrect_answers\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"question\", \"correct_answer\", \"incorrect_answers\"]}}, \"required\": [\"Output\"]} ### Input : a = 3\nif a==3 :\nb = a * 2  \nif a < 4:\nb = a + 2\nif a > 2:\nb = a * 2\nPage 21Exercise 2 - What is the final value in b ?  [/INST]", "completion": "### Output : {\n \"Output\": {\n  \"question\": \"In the given program, what is the final value in b ?\",\n  \"correct_answer\": \"6\",\n  \"incorrect_answers\": [\n   \"5\",\n   \"4\",\n   \"7\"\n  ]\n }\n} </s>"}``` Notice the use of ```<s>``` and ```</s>```, the ```[INST]``` and ```[/INST]``` tokens, and the single prompt field.

The dataset is not split into training and validation sets; all of it is used for training.

In [3]:
from datasets import load_dataset

path_to_dataset = "dataset_mistral_instruct1.jsonl"

dataset = load_dataset("json", data_files=path_to_dataset, split="train")
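As a minimal sketch of how such a file could be produced, the helper below builds one record in the prompt/completion layout described above and writes records as JSONL. The function names `build_record` and `write_jsonl`, and the shortened instruction text, are hypothetical and only serve the illustration; the actual dataset was not necessarily built this way.

```python
import json


def build_record(instruction: str, input_text: str, output: dict) -> dict:
    """Build one prompt/completion record in the layout of the sample entry."""
    prompt = f"<s>[INST] ### Instruction : {instruction} ### Input : {input_text} [/INST]"
    completion = f"### Output : {json.dumps(output, indent=1)} </s>"
    return {"prompt": prompt, "completion": completion}


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical example record (shortened instruction, toy input)
record = build_record(
    instruction="Generate a question, a correct answer and three incorrect answers.",
    input_text="a = 3\nb = a * 2",
    output={"Output": {"question": "What is the final value in b ?",
                       "correct_answer": "6",
                       "incorrect_answers": ["5", "4", "7"]}},
)
line = json.dumps(record)
```

Serializing with `json.dumps` takes care of escaping the quotes and newlines inside the prompt and completion strings, which is easy to get wrong when writing the lines by hand.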

## Load the model as a PEFT model
The quantized base model is wrapped with LoRA adapters using PEFT, so only the adapter weights are trained.

In [4]:

from peft import LoraConfig, get_peft_model

# Modules that receive LoRA adapters during QLoRA fine-tuning
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'lm_head']

# Sample LoRA config
lora_config = LoraConfig(
    r=8,  # or r=16
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
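To make the effect of `r` and `target_modules` more concrete, the sketch below counts the weights that LoRA adds for the modules listed above. The layer shapes (hidden size 4096, MLP size 14336, key/value projection size 1024, vocabulary 32000, 32 layers) are assumptions taken to match Mistral-7B and should be checked against `model.config`; the helper names are hypothetical.

```python
# Assumed layer shapes (in_features, out_features); verify against model.config.
HIDDEN, INTERMEDIATE, KV_DIM, VOCAB, N_LAYERS = 4096, 14336, 1024, 32000, 32

PER_LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is (r x in) and B is (out x r), so one adapter adds r * (in + out) weights."""
    return r * (in_features + out_features)


def total_lora_params(r: int) -> int:
    """Sum the adapter weights over all decoder layers, plus the lm_head adapter."""
    per_layer = sum(lora_params(i, o, r) for i, o in PER_LAYER_SHAPES.values())
    return per_layer * N_LAYERS + lora_params(HIDDEN, VOCAB, r)


r, lora_alpha = 8, 8
print(f"trainable LoRA parameters at r={r}: {total_lora_params(r):,}")
print(f"LoRA scaling (lora_alpha / r): {lora_alpha / r}")
```

The count grows linearly with `r`, so moving from `r=8` to `r=16` doubles the adapter size. On the real model, `model.print_trainable_parameters()` reports the actual figure.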

## Define the training arguments
Changing the optimization algorithm is not recommended.
The per-device train batch size and the gradient accumulation steps can be lowered to 1 and 2 respectively.
Using a lower learning rate yielded erroneous results.
Fine-tuning for at most one epoch is recommended, because after one epoch the model loses the ability to generate responses under constrained generation.
A checkpoint is saved at each epoch.

In [5]:
from transformers import TrainingArguments
from datetime import datetime

output_dir = './mistral-finetune-v2.1'
run_name = 'v2.1.1'

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
optim = 'paged_adamw_8bit'
learning_rate = 1e-5
max_grad_norm = 0.3
warmup_ratio = 0.03
lr_scheduler_type = "linear"

training_args = TrainingArguments(
    output_dir=output_dir,
    save_strategy="epoch",
    num_train_epochs=3.0,
    logging_steps=5,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    learning_rate=learning_rate,
    bf16=True,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    report_to="wandb",  # Comment this out if you don't want to use Weights & Biases
    run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",  # Name of the W&B run (optional)
)
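The arithmetic behind these settings can be sketched as follows: the effective batch size is the per-device batch size times the gradient accumulation steps, and the number of optimizer steps follows from the dataset size. The dataset size of 1856 examples is a hypothetical value chosen only so that the numbers line up with the 696 steps reported by the training run; the real size is not stated in this notebook, and the step formula is an approximation of what the trainer does.

```python
import math


def effective_batch_size(per_device_bs: int, grad_accum: int, n_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device_bs * grad_accum * n_devices


def steps_per_epoch(n_examples: int, per_device_bs: int, grad_accum: int, n_devices: int = 1) -> int:
    """Approximate optimizer steps per epoch: batches per epoch // accumulation steps."""
    batches = math.ceil(n_examples / (per_device_bs * n_devices))
    return max(batches // grad_accum, 1)


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Steps over which the learning rate ramps up before the linear decay."""
    return math.ceil(total_steps * warmup_ratio)


n_examples = 1856  # hypothetical dataset size, for illustration only
per_epoch = steps_per_epoch(n_examples, per_device_bs=2, grad_accum=4)
total = per_epoch * 3  # num_train_epochs = 3
print(effective_batch_size(2, 4), per_epoch, total, warmup_steps(total, 0.03))
```

Lowering the batch size to 1 and the accumulation steps to 2, as suggested above, shrinks the effective batch from 8 to 2 and multiplies the number of optimizer steps per epoch accordingly.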

## Set up the SFTTrainer
The maximum sequence length is set to 1024 to enable AutoAWQ quantization.

In [6]:
from trl import SFTTrainer

model.config.use_cache = False  # disable the KV cache during training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
    max_seq_length=1024,
    args=training_args,
)



## Commence the training

In [7]:
trainer.train()


| Step | Training Loss |
|------|---------------|
| 5 | 2.1295 |
| 10 | 2.3588 |
| 15 | 2.2764 |
| 20 | 2.2489 |
| 25 | 2.2202 |
| 30 | 2.1308 |
| 35 | 2.0636 |
| 40 | 2.0017 |
| 45 | 1.8394 |
| 50 | 1.777 |




TrainOutput(global_step=696, training_loss=0.9421885708967844, metrics={'train_runtime': 2291.0674, 'train_samples_per_second': 2.434, 'train_steps_per_second': 0.304, 'total_flos': 9.670066722627994e+16, 'train_loss': 0.9421885708967844, 'epoch': 2.99})

## Reload the base model
Load the base model again from scratch, this time in bfloat16 rather than 4-bit, so that the adapter can be merged into it.

In [5]:
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1",  device_map='auto', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1")
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"

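## Load the PEFT model
Load the saved adapter on top of the base model; the two are merged in a later step. `checkpoint-232` is one third of the 696 training steps, that is, the checkpoint saved at the end of the first epoch.

In [6]:
from peft import PeftModel

# Attach the fine-tuned LoRA adapter to the base model
ft_model = PeftModel.from_pretrained(model, "mistral-finetune-v2.1/checkpoint-232")
ft_model.config.use_cache = False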

## Run inference on the fine-tuned model

In [7]:
eval_prompt = """[INST]### Instruction : You are given a sequence of text. Generate a logical question, correct answer and three incorrect answers suitable for mcq from the input text.The output should be formatted as a json in the below format." + "{\"type\": \"object\", \"properties\": {\"Output\": {\"type\": \"object\", \"properties\": {\"question\": {\"type\": \"string\"}, \"correct_answer\": {\"type\": \"string\"}, \"incorrect_answers\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"question\", \"correct_answer\", \"incorrect_answers\"]}}, \"required\": [\"Output\"]}"
### Input :Object Oriented Principles we saw so far
• Encapsulation
It keeps the data and the code safe from external interference. It is a
mechanism for restricting direct access to some of the object’s
component. Binding the data with the code that manipulates it.
• Inheritance
Inheritance allows a class to use the properties and methods of another
class. In other words, the derived class inherits the states and behaviors
from the base class.
• Polymorphism.
Polymorphism is the ability of an object to take on many forms. The most
common use of polymorphism in OOP occurs when a parent class
reference is used to refer to a child class object.[/INST]
### Output :
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    output_ids = ft_model.generate(**model_input, max_new_tokens=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=False))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST]### Instruction : You are given a sequence of text. Generate a logical question, correct answer and three incorrect answers suitable for mcq from the input text.The output should be formatted as a json in the below format." + "{"type": "object", "properties": {"Output": {"type": "object", "properties": {"question": {"type": "string"}, "correct_answer": {"type": "string"}, "incorrect_answers": {"type": "array", "items": {"type": "string"}}}, "required": ["question", "correct_answer", "incorrect_answers"]}}, "required": ["Output"]}"
### Input :Object Oriented Principles we saw so far
• Encapsulation
It keeps the data and the code safe from external interference. It is a
mechanism for restricting direct access to some of the object’s
component. Binding the data with the code that manipulates it.
• Inheritance
Inheritance allows a class to use the properties and methods of another
class. In other words, the derived class inherits the states and behaviors
from the base class.
• Po
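Because the decoded text echoes the whole prompt before the answer, the generated JSON has to be cut out before it can be used. The sketch below is one possible way to do that, assuming the answer follows the last `### Output :` marker as in the training data; the function name `extract_output` is hypothetical and not part of any library.

```python
import json

REQUIRED_KEYS = ("question", "correct_answer", "incorrect_answers")


def extract_output(decoded: str, marker: str = "### Output :"):
    """Return the parsed "Output" object after the last marker, or None if absent or invalid."""
    _, found, tail = decoded.rpartition(marker)
    start = tail.find("{")
    if not found or start == -1:
        return None
    try:
        # raw_decode parses the first JSON value and ignores trailing text such as </s>
        obj, _ = json.JSONDecoder().raw_decode(tail[start:])
    except json.JSONDecodeError:
        return None
    output = obj.get("Output") if isinstance(obj, dict) else None
    if not isinstance(output, dict) or any(key not in output for key in REQUIRED_KEYS):
        return None
    return output
```

Returning `None` for missing or malformed output makes it easy to count how often a checkpoint fails to produce valid JSON, which is the failure mode described for training beyond one epoch.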

## Save the model by merging the QLoRA adapter with the base model

In [8]:
model = ft_model.merge_and_unload()

# Save the merged model to a directory in the safetensors format
model_dir = "./merged_model/"
model.save_pretrained(model_dir, safe_serialization=True)

# Save the custom tokenizer in the same directory
tokenizer.save_pretrained(model_dir)

('./merged_model/tokenizer_config.json',
 './merged_model/special_tokens_map.json',
 './merged_model/tokenizer.model',
 './merged_model/added_tokens.json',
 './merged_model/tokenizer.json')
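Conceptually, merging folds each low-rank update into its base weight as W + (lora_alpha / r) · B · A, after which the adapter is no longer needed at inference time. The toy example below illustrates that identity with plain Python lists; it is a sketch of the idea under small, made-up shapes and values, not the code that `merge_and_unload()` runs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, lora_alpha, r):
    """Fold the adapter into the base weight: W + (lora_alpha / r) * B @ A."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: W is 2x3 (out x in), A is 1x3 (r x in), B is 2x1 (out x r), so r = 1.
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]
lora_b = [[0.5], [-1.0]]
x = [[1.0], [1.0], [1.0]]  # input column vector
lora_alpha, r = 2, 1

# Merged path: a single matrix multiplication with the folded weight.
y_merged = matmul(merge_lora(w, lora_a, lora_b, lora_alpha, r), x)

# Unmerged path: base output plus the scaled adapter output.
y_base = matmul(w, x)
y_adapter = matmul(lora_b, matmul(lora_a, x))
y_unmerged = [[b[0] + (lora_alpha / r) * a[0]] for b, a in zip(y_base, y_adapter)]

print(y_merged, y_unmerged)
```

Both paths give the same output, which is why the merged model can be saved and quantized as an ordinary checkpoint.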

## Set up AutoAWQ for quantization

In [14]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "merged_model"
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


## Install AutoAWQ
If the CUDA version is not 12.1, install a prebuilt wheel that matches it instead. The cell below installs the CUDA 11.8 (`cu118`) build of AutoAWQ 0.1.6 for Python 3.10 on Linux.

In [12]:
pip install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl



Collecting autoawq==0.1.6+cu118
  Downloading https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl (20.0 MB)
Installing collected packages: autoawq
  Attempting uninstall: autoawq
    Found existing installation: autoawq 0.1.8
    Uninstalling autoawq-0.1.8:
      Successfully uninstalled autoawq-0.1.8
Successfully installed autoawq-0.1.6+cu118
Note: you may need to restart the kernel to use updated packages.


## Quantize with AutoAWQ
Quantize the merged model to 4-bit weights, then save the quantized model and the tokenizer.

In [15]:
quant_path = "quant"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

AWQ: 100%|██████████| 32/32 [15:51<00:00, 29.72s/it]


('quant/tokenizer_config.json',
 'quant/special_tokens_map.json',
 'quant/tokenizer.model',
 'quant/added_tokens.json',
 'quant/tokenizer.json')
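To illustrate what `w_bit`, `q_group_size` and `zero_point` refer to, the sketch below applies plain group-wise asymmetric quantization to a short list of weights: every group gets its own scale and zero point, and each weight is stored as a 4-bit integer code. This is a simplified illustration with made-up values and hypothetical function names; it leaves out the activation-aware scaling that gives AWQ its name and is not AutoAWQ's implementation.

```python
def quantize_group(values, w_bit=4):
    """Asymmetric (zero-point) quantization of one group to w_bit integer codes."""
    q_max = 2**w_bit - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / q_max or 1.0  # fall back to 1.0 for a constant group
    zero = round(-lo / scale)
    codes = [min(max(round(v / scale) + zero, 0), q_max) for v in values]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero) * scale for q in codes]


def quantize(weights, q_group_size=128, w_bit=4):
    """Quantize a flat weight list group by group; each group keeps its own scale and zero."""
    groups = [weights[i:i + q_group_size] for i in range(0, len(weights), q_group_size)]
    return [quantize_group(group, w_bit) for group in groups]


# Toy weights and a small group size so the result is easy to inspect
weights = [-0.8, -0.3, 0.0, 0.1, -0.2, 0.45, 0.7, 2.5]
packed = quantize(weights, q_group_size=4)
restored = [v for codes, scale, zero in packed for v in dequantize_group(codes, scale, zero)]

for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

A smaller group size gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales and zero points.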

## Load the quantized model

In [16]:
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

Replacing layers...: 100%|██████████| 32/32 [00:03<00:00,  8.71it/s]
Fusing layers...: 100%|██████████| 32/32 [00:05<00:00,  5.36it/s]


## Run inference on the quantized model

In [31]:
tokens = tokenizer(
    text=eval_prompt, 
    return_tensors='pt'
).input_ids.cuda()


# Generate output
generation_output = model.generate(
    tokens, 
    max_new_tokens=512
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [34]:
tokenizer.decode(generation_output[0],skip_special_tokens=False)

'<s> [INST]### Instruction : You are given a sequence of text. Generate a logical question, correct answer and three incorrect answers suitable for mcq from the input text.The output should be formatted as a json in the below format." + "{"type": "object", "properties": {"Output": {"type": "object", "properties": {"question": {"type": "string"}, "correct_answer": {"type": "string"}, "incorrect_answers": {"type": "array", "items": {"type": "string"}}}, "required": ["question", "correct_answer", "incorrect_answers"]}}, "required": ["Output"]}"\n### Input :Object Oriented Principles we saw so far\n• Encapsulation\nIt keeps the data and the code safe from external interference. It is a\nmechanism for restricting direct access to some of the object’s\ncomponent. Binding the data with the code that manipulates it.\n• Inheritance\nInheritance allows a class to use the properties and methods of another\nclass. In other words, the derived class inherits the states and behaviors\nfrom the base c