# Generalized Post-Training Quantization (GPTQ) Template





GPTQ compresses a model's weights to 4 bits, choosing the quantized values so that the mean squared error introduced by quantization is as small as possible. During inference, the weights are dequantized to float16 on the fly, so computation runs in float16 while the stored model stays small.
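To make the quantize/dequantize round trip concrete, here is a minimal pure-Python sketch. It uses plain round-to-nearest quantization on a single group of weights, which is a simpler technique than GPTQ itself (GPTQ additionally adjusts the remaining weights of a layer to compensate for each rounding error). The function names and the sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Round-to-nearest asymmetric 4-bit quantization of one group of weights."""
    w_min, w_max = min(weights), max(weights)
    # 4 bits give 16 integer levels (0..15); fall back to 1.0 for a constant group
    scale = (w_max - w_min) / 15 or 1.0
    codes = [min(15, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, zero):
    """Map the 4-bit codes back to floating-point values."""
    return [c * scale + zero for c in codes]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights, chosen only for illustration
weights = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.12, 0.49]
codes, scale, zero = quantize_4bit(weights)
restored = dequantize(codes, scale, zero)

print("codes:", codes)
print("mse:  ", mse(weights, restored))
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the error that GPTQ-style methods then try to reduce further.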

`I prepared this Supervised Fine-Tuning (SFT) template for my use case, but you could change it to suit your requirements.`



My accounts:

* [Hugging Face](https://huggingface.co/santhoshmlops)

* [GitHub](https://github.com/santhoshmlops)

Other fine-tuning templates:

* [Fine-Tuning Templates](https://github.com/santhoshmlops/MyHF_LLM_FineTuning/tree/main/FineTuningTemplate)


My model fine-tuning notebooks:

* [My HF LLM Fine-Tuning](https://github.com/santhoshmlops/MyHF_LLM_FineTuning)



## Setting Up on Google Colab
Google Colab provides a convenient, cloud-based environment with access to powerful GPUs like the `T4`. If you choose Colab for this tutorial, make sure to select a GPU runtime by going to `Runtime > Change runtime type > T4 GPU`. This ensures that your notebook has access to the necessary computational resources.
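One way to confirm that the runtime actually has a GPU attached is to query `nvidia-smi`. The sketch below assumes an NVIDIA GPU runtime where the `nvidia-smi` tool is on the `PATH`; the helper name `detect_gpu_name` is made up for this example.

```python
import shutil
import subprocess


def detect_gpu_name():
    """Return the name of the first visible NVIDIA GPU, or None if there is none."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    names = result.stdout.strip().splitlines()
    return names[0].strip() if names else None


gpu = detect_gpu_name()
print(gpu or "No GPU detected: select a GPU runtime before continuing.")
```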

## Setting Up Hugging Face Authentication

On Google Colab, you can safely store your Hugging Face token by using Colab's "Secrets" feature. This can be done by clicking on the "Key" icon in the sidebar, selecting "`Secrets`", and adding a new secret with the name `HF_TOKEN` and your Hugging Face token as the value. This method ensures that your token remains secure and is not exposed in your notebook's code.
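A stored secret can be read from code roughly as follows. This is a sketch: it assumes the secret is named `HF_TOKEN` as described above, uses Colab's `google.colab.userdata` module when it is available, and otherwise falls back to an environment variable of the same name. The helper `get_hf_token` is a made-up name.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Read the token from Colab secrets, falling back to an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(secret_name)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Secret missing or notebook access not granted: try the environment instead
        return os.environ.get(secret_name)


token = get_hf_token()
print("Token found" if token else "No token found")
```

The returned value could then be passed to `huggingface_hub.login(token=token)` instead of pasting the token interactively.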

# Step 1 - Install the required Python packages

In [None]:
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U bitsandbytes
!pip install -q -U trl
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U auto-gptq
!pip install -q -U optimum
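Because every package is installed with `-U`, the versions you get depend on when you run the notebook. A small check like the one below records what was actually installed; it is a sketch using only the standard library, and the helper name `installed_versions` is made up for this example.

```python
from importlib import metadata

# Distribution names as passed to pip in the cell above
PACKAGES = [
    "transformers", "peft", "bitsandbytes", "trl",
    "accelerate", "datasets", "auto-gptq", "optimum",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```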


# Step 2 - Logging into Hugging Face Hub
When prompted, paste a Hugging Face Hub access token that has write permission.

In [None]:
from huggingface_hub import notebook_login
notebook_login()


# Step 3 - Loading Required Libraries

In [None]:
import os
import torch
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, PeftModel, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from accelerate import Accelerator

# Step 4 - Setting Model Parameters for SFT
`Note:` You can change these parameters for your own fine-tuning run or leave them as they are.

In [None]:
# Load Model for Tuning
model_ckpt = "TheBloke/zephyr-7B-alpha-GPTQ"  # Change model_ckpt as you wish, e.g. "microsoft/phi-1_5"
hf_user_name = "santhoshmlops"
hub_model_ckpt = hf_user_name + "/" + model_ckpt.split("/")[-1] + "-SFT"  # Change hub_model_ckpt as you wish, e.g. "santhoshmlops/microsoft_phi-1_5_merged-SFT"
dataset_name = "bitext/Bitext-customer-support-llm-chatbot-training-dataset"

# GPTQ Parameters
bits = 4
disable_exllama = True

# LoRA Parameters
r = 16
lora_alpha = 32
lora_dropout = 0.05
bias = "none"
task_type = "CAUSAL_LM"
target_modules = ["q_proj", "v_proj"]   # Change the target modules to match the model being tuned, e.g. ["q_proj", "k_proj"]

# Automodel Parameters
device_map = {"": Accelerator().local_process_index}
torch_dtype = torch.float16

# Tokenizer Parameters
trust_remote_code = True

# Training Parameters
output_dir = model_ckpt.split("/")[-1] + "-SFT"   # Change output_dir as you wish, e.g. "microsoft_phi-1_5_merged-SFT"
num_train_epochs = 1
per_device_train_batch_size = 3
gradient_accumulation_steps = 3
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.003
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = 250  # When set to a positive number, this overrides num_train_epochs
warmup_ratio = 0.03
group_by_length = True
save_steps = 50  # Only takes effect when save_strategy = "steps"
save_strategy = "epoch"
logging_steps = 50
logging_dir = "./logs"
fp16 = False
bf16 = False
push_to_hub = True
neftune_noise_alpha = 5
report_to = "tensorboard"

# SFT Training Parameters
train_cln_name = "text"
packing = False
max_seq_length = 1024

# Merge and push the model to Hub
low_cpu_mem_usage = True
return_dict = True
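The training parameters above interact, so it can help to work out what they imply before starting a run. The arithmetic below is a sketch under two assumptions: training runs on a single GPU, and the warmup length is the warmup ratio times the total number of steps, rounded up. The dataset size of 26,872 examples is the figure reported when this notebook was run.

```python
import math

per_device_train_batch_size = 3
gradient_accumulation_steps = 3
max_steps = 250
warmup_ratio = 0.03
num_examples = 26872  # train split size reported by this notebook's run
num_devices = 1       # assumption: one Colab GPU

# Examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples consumed over the whole run
examples_seen = effective_batch * max_steps
# Share of the dataset covered by the run
epoch_fraction = examples_seen / num_examples
# Steps spent ramping the learning rate up (assumed rounding: ceiling)
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"examples seen:        {examples_seen}")
print(f"fraction of an epoch: {epoch_fraction:.3f}")
print(f"warmup steps:         {warmup_steps}")
```

With these values the run covers 2,250 examples, well under one pass over the data, so `max_steps` rather than `num_train_epochs` determines how long training lasts.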

# Step 5 - Loading and Formatting the Dataset
`Note:` Prepare your dataset for fine-tuning by loading it and formatting it for your use case. The `create_data()` function below is one example of how to format a dataset.

In [None]:
def create_data():
    data = load_dataset(dataset_name, split="train")
    data_df = data.to_pandas()

    # Create a new DataFrame with combined text for each example
    processed_df = pd.DataFrame({
        "text": "<|system|>\n You are a support chatbot who helps with user queries chatbot who always responds in the style of a professional.\n<|user|>\n" + data_df["instruction"] + "\n<|assistant|>\n" + data_df["response"]
    })

    return Dataset.from_pandas(processed_df)

data = create_data()
print(data[0])



{'text': "<|system|>\n You are a support chatbot who helps with user queries chatbot who always responds in the style of a professional.\n<|user|>\nquestion about cancelling order {{Order Number}}\n<|assistant|>\nI've understood you have a question regarding canceling order {{Order Number}}, and I'm here to provide you with the information you need. Please go ahead and ask your question, and I'll do my best to assist you."}
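The same prompt layout is needed again at inference time, so it can be convenient to build it in one place. The helper below is a sketch of that idea and is not part of the original notebook; `format_example` and `SYSTEM_PROMPT` are made-up names, and the system prompt text is copied verbatim (including its leading space) so that the result matches the training data shown above.

```python
# Copied verbatim from the training template, including the leading space
SYSTEM_PROMPT = (
    " You are a support chatbot who helps with user queries chatbot "
    "who always responds in the style of a professional."
)


def format_example(instruction, response=None):
    """Build a training example, or an inference prompt when response is None."""
    prompt = (
        "<|system|>\n" + SYSTEM_PROMPT
        + "\n<|user|>\n" + instruction
        + "\n<|assistant|>\n"
    )
    return prompt if response is None else prompt + response


print(format_example("question about cancelling order {{Order Number}}"))
```

Calling it with a `response` gives the full training text, and calling it without one gives the prompt that the model should complete.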


# Step 6 - Fine-Tuning with LoRA and Supervised Fine-Tuning

In [None]:
# Load the model and tokenizer with specified configurations.
tokenizer = AutoTokenizer.from_pretrained(
    model_ckpt,
    trust_remote_code=trust_remote_code
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

gptq_config = GPTQConfig(
    bits=bits,
    use_exllama=not disable_exllama,  # `disable_exllama` is deprecated in GPTQConfig; `use_exllama` replaces it
    tokenizer=tokenizer
)

model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    quantization_config=gptq_config,
    device_map=device_map,
    trust_remote_code=trust_remote_code,
    torch_dtype=torch_dtype
)
model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    max_grad_norm=max_grad_norm,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    optim=optim,
    lr_scheduler_type=lr_scheduler_type,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    save_steps=save_steps,
    save_strategy=save_strategy,
    logging_steps=logging_steps,
    logging_dir=logging_dir,
    fp16=fp16,
    bf16=bf16,
    push_to_hub=push_to_hub,
    neftune_noise_alpha=neftune_noise_alpha,
    report_to=report_to,
)

# Prepare the model with LoRA (Low-Rank Adaptation) configuration.
lora_config = LoraConfig(
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias=bias,
    task_type=task_type,
    target_modules=target_modules
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Create a trainer for training the model.
trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    peft_config=lora_config,
    dataset_text_field=train_cln_name,
    args=training_args,
    tokenizer=tokenizer,
    packing=packing,
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.



trainable params: 6,815,744 || all params: 269,225,984 || trainable%: 2.5316070532033046
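The trainable parameter count can be checked by hand. Each LoRA adapter on a linear layer adds two matrices, one of shape `r x in_features` and one of shape `out_features x r`. The sketch below assumes the Mistral-7B dimensions that Zephyr-7B is built on (hidden size 4096, 32 layers, 8 key/value heads of dimension 128), which are not stated in this notebook; `lora_param_count` is a made-up helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


r = 16
hidden_size = 4096   # assumption: Mistral-7B hidden size
num_layers = 32      # assumption: Mistral-7B decoder layers
kv_dim = 8 * 128     # assumption: 8 key/value heads x head dimension 128

q_proj = lora_param_count(hidden_size, hidden_size, r)  # q_proj: 4096 -> 4096
v_proj = lora_param_count(hidden_size, kv_dim, r)       # v_proj: 4096 -> 1024
total = (q_proj + v_proj) * num_layers

print(f"{total:,}")  # 6,815,744
```

Under these assumptions the result matches the `trainable params` figure printed above.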




# Step 7 - Let's start the training process

In [None]:
# Train the model and save it.
trainer.train()
trainer.push_to_hub()



| Step | Training Loss |
|------|---------------|
| 50   | 0.9721        |
| 100  | 0.7173        |
| 150  | 0.6507        |
| 200  | 0.6301        |
| 250  | 0.5989        |
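A cross-entropy loss is sometimes easier to read as a perplexity, which is the exponential of the loss. The sketch below converts the logged values; it assumes the reported loss is a mean per-token cross-entropy in nats, and note that these are training losses, not results on held-out data.

```python
import math

# Training loss values logged during the run above
loss_by_step = {50: 0.9721, 100: 0.7173, 150: 0.6507, 200: 0.6301, 250: 0.5989}

# Perplexity = exp(mean cross-entropy loss)
perplexity_by_step = {step: math.exp(loss) for step, loss in loss_by_step.items()}

for step, ppl in perplexity_by_step.items():
    print(f"step {step:3d}: loss {loss_by_step[step]:.4f} -> perplexity {ppl:.2f}")
```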



CommitInfo(commit_url='https://huggingface.co/santhoshmlops/zephyr-7B-alpha-GPTQ-SFT/commit/82cf9c98e402c16b398d9d6825d6675eb91e872f', commit_message='End of training', commit_description='', oid='82cf9c98e402c16b398d9d6825d6675eb91e872f', pr_url=None, pr_revision=None, pr_num=None)

# Step 8 - Inferencing with the LLM

In [None]:
from peft import AutoPeftModelForCausalLM
from transformers import GenerationConfig
from transformers import AutoTokenizer
import torch

def process_data_sample(example):

    processed_example = "<|system|>\n You are a support chatbot who helps with user queries chatbot who always responds in the style of a professional.\n<|user|>\n" + example["instruction"] + "\n<|assistant|>\n"

    return processed_example

tokenizer = AutoTokenizer.from_pretrained("santhoshmlops/zephyr-7B-alpha-GPTQ-SFT")

inp_str = process_data_sample(
    {
        "instruction": "i have a question about new order {{Order Number}}",
    }
)

inputs = tokenizer(inp_str, return_tensors="pt").to("cuda")

model = AutoPeftModelForCausalLM.from_pretrained(
    "santhoshmlops/zephyr-7B-alpha-GPTQ-SFT",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda")

generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,
    temperature=0.1,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id
)
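The generation settings above combine `do_sample=True` with `top_k=1`. With `top_k=1` only the single most likely token survives the filter, so sampling behaves like greedy decoding whatever the temperature. The pure-Python sketch below illustrates top-k sampling with temperature on made-up logits; `sample_token` is a hypothetical helper and not the library's implementation.

```python
import math
import random


def sample_token(logits, top_k, temperature, rng):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    # Keep only the ids of the top_k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature scaling followed by a numerically stable softmax (unnormalized)
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [1.2, 3.4, 0.5, 2.9]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
picks = {sample_token(logits, top_k=1, temperature=0.1, rng=rng) for _ in range(100)}

print(picks)  # {1}
```

Raising `top_k` and the temperature would make the output more varied between runs.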



In [None]:

import time
st_time = time.time()
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time()-st_time)

<|system|>
 You are a support chatbot who helps with user queries chatbot who always responds in the style of a professional.
<|user|>
i have a question about new order {{Order Number}}
<|assistant|>
I'm on it! I'm here to assist you with any questions you have regarding your order with the order number {{Order Number}}. Let's explore the details together. Could you please provide me with more specific information about the question you have? This will help me provide you with the most accurate and relevant response. Your satisfaction is our top priority, and I'm here to ensure that you have a seamless experience with your order. Let's work together to resolve any concerns you may have. How can I assist you further?
6.770001649856567
