<a href="https://colab.research.google.com/github/shameer-phy/GenAI/blob/main/FineTuning/LoRa_Customer_support_lora_llama_3_2_3b_instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing necessary libraries

* transformers: Provides state-of-the-art pretrained models for NLP, computer vision, and beyond.
* datasets: A library for easy access to a wide range of datasets for ML and NLP tasks.
* accelerate: Simplifies distributed training and inference for PyTorch models.
* torch: PyTorch library for building and training deep learning models.
* bitsandbytes: Optimized GPU quantization and acceleration for large-scale models.
* peft: Parameter-efficient fine-tuning techniques for large language models.
* trl: Tools for post-training transformer models, including supervised fine-tuning (the `SFTTrainer` used below) and reinforcement learning techniques.

In [1]:
!pip install transformers accelerate bitsandbytes trl datasets peft tokenizers huggingface_hub

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.1-py3-none-manylinux_2_24_x86_64.whl.metadata (5.8 kB)
Collecting trl
  Downloading trl-0.14.0-py3-none-any.whl.metadata (12 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4

In [None]:
#!pip install transformers==4.47.1 accelerate==0.34.2 bitsandbytes==0.45.0 trl==0.13.0 datasets==3.2.0 peft==0.14.0 tokenizers==0.21.0 huggingface_hub==0.26.0
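
Because the first install cell is unpinned, the versions that end up installed can differ from the pinned ones above. As a minimal sketch (the helper name `installed_versions` is hypothetical, not part of any library), the standard library can report what is actually present:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the pip command above
PACKAGES = ["transformers", "accelerate", "bitsandbytes", "trl",
            "datasets", "peft", "tokenizers", "huggingface_hub"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Comparing this output with the pinned command is a quick way to spot a version mismatch before training.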

# Importing Libraries

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, BitsAndBytesConfig
import torch
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
from peft import PeftModel,get_peft_model,LoraConfig, TaskType
from trl import SFTTrainer, SFTConfig

In [3]:
from huggingface_hub import login
#import os
#hf_token = os.environ["HF_TOKEN"]  # Alternative outside Colab: read the token from an environment variable

from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
login(token = hf_token) # Logging into Hugging Face Hub to access models and other resources

# Loading Model configurations and Dataset Preparation

Huggingface model link: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

In [4]:
base_model = 'meta-llama/Llama-3.2-3B-Instruct'

#base_model = 'deepseek-ai/DeepSeek-R1'

In [None]:
# Load model directly
from transformers import AutoModelForCausalLM, AutoModel
#model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

In [5]:
# Configure 4-bit quantization settings using the BitsAndBytesConfig class
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable loading the model with 4-bit precision for reduced memory usage
    bnb_4bit_quant_type='nf4',  # Use NormalFloat4 (nf4), a 4-bit data type designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.float16,  # Dequantize to float16 for computation to balance speed and precision
    bnb_4bit_use_double_quant=True  # Enable double quantization: also quantize the quantization constants to save additional memory
)

# Load the pre-trained model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    base_model,  # Name of the base model defined earlier
    #trust_remote_code=True,
    #quantization=bitsandbytes_4bit,
    device_map="auto",  # Automatically map model layers to available devices (e.g., GPU/CPU)
    quantization_config=bnb_config  # Apply the defined 4-bit quantization configuration
    #trust_remote_code=True
)

# Note:
# 1. The use of 4-bit quantization helps in reducing memory requirements while maintaining reasonable performance.
# 2. `device_map="auto"` ensures the model layers are automatically distributed across available hardware for efficient loading.

config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]
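
The memory saving from 4-bit loading can be estimated with simple arithmetic. The sketch below is a rough estimate only: it assumes every weight is quantized (in practice some layers stay in 16-bit), ignores activations and the KV cache, and takes the block sizes (64 weights per block, 256 constants per second-level block) as assumed bitsandbytes/QLoRA defaults. The parameter count is the "all params" figure reported later in this notebook minus the LoRA parameters.

```python
# Base model parameters: 3,213,896,704 total minus 1,146,880 LoRA parameters
BASE_PARAMS = 3_213_896_704 - 1_146_880

def weights_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_gb = weights_gb(BASE_PARAMS, 16)  # 16-bit weights
nf4_gb = weights_gb(BASE_PARAMS, 4)    # 4-bit weights, before quantization constants

# Overhead of the quantization constants, in bits per parameter (assumed defaults):
# one float32 constant per block of 64 weights ...
single_quant_bits = 32 / 64
# ... or, with double quantization, one 8-bit constant per block of 64 weights
# plus one float32 constant per 256 of those constants
double_quant_bits = 8 / 64 + 32 / (64 * 256)

print(f"fp16 weights: {fp16_gb:.2f} GB")
print(f"nf4 weights:  {nf4_gb:.2f} GB")
print(f"constants overhead: {single_quant_bits:.3f} -> {double_quant_bits:.3f} bits/param")
```

The 16-bit figure agrees with the two downloaded shards (4.97 GB + 1.46 GB), which suggests the parameter count is a reasonable basis for the estimate.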

In [6]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
# Set the padding token to the end-of-sequence (eos) token
# so that shorter sequences in a batch can be padded
tokenizer.pad_token = tokenizer.eos_token
# Pad on the right side of the input, the usual choice when training causal language models
# (batched generation typically uses left padding instead)
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

Dataset link: https://huggingface.co/datasets/Victorano/customer-support-1k

In [7]:
# Load the 'Victorano/customer-support-1k' dataset from the Hugging Face Hub,
# keep only the 'instruction' and 'response' columns, and hold out 20% for evaluation
dataset = load_dataset("Victorano/customer-support-1k", split="train")
dataset = dataset.remove_columns(['flags', 'category','intent','text'])
dataset = dataset.train_test_split(test_size=0.2)

README.md:   0%|          | 0.00/471 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/648k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'response'],
        num_rows: 800
    })
    test: Dataset({
        features: ['instruction', 'response'],
        num_rows: 200
    })
})

In [9]:
# Defining the instruction that will guide the assistant's behavior for providing customer support answers.

instruction = """You are a helpful and efficient customer support bot designed to assist users by providing answers to frequently asked questions (FAQs) related to our products and services. Your responses should be concise, clear, and friendly, ensuring the user feels heard and supported. If the user’s question is outside the scope of the FAQ, gently direct them to contact customer support.

Always prioritize accuracy and clarity in your answers.
If the user asks a complex question, break it down into smaller, manageable parts and answer step-by-step.
Provide useful links or references to detailed documentation when appropriate.
Use a friendly and professional tone, ensuring the response is easy to understand.
If the FAQ does not cover the question, offer an apology and suggest contacting customer support.
"""

def template(row):
    # Creating a list of message exchanges (system, user, assistant)
    row_json = [{"role": "system", "content": instruction }, # System message with the pre-defined instructions
               {"role": "user", "content": row["instruction"]}, # User's question from the dataset
               {"role": "assistant", "content": row["response"]}] # The assistant's answer from the dataset

    # Rendering the messages with the chat template as a plain string (tokenize=False) and storing it in the 'text' column
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

# Applying the template function to each row in the dataset using multi-processing (4 processes in parallel)
dataset = dataset.map(template, num_proc=4)

Map (num_proc=4):   0%|          | 0/800 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/200 [00:00<?, ? examples/s]
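
To make the template step less opaque, here is a simplified sketch of the string layout that `apply_chat_template` produces for Llama 3 style models. It is an approximation, not the tokenizer's actual template: the function name `render_llama3_chat` is hypothetical, and the real template also injects the "Cutting Knowledge Date" preamble into the system block, as the sample record printed further down shows.

```python
BOS = "<|begin_of_text|>"

def render_llama3_chat(messages, add_generation_prompt=False):
    """Approximate the Llama 3 chat layout: a role header, the content, then <|eot_id|>."""
    parts = [BOS]
    for message in messages:
        header = f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        # Surrounding whitespace in the content is trimmed, as in the sample record
        parts.append(header + message["content"].strip() + "<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header tells the model to start its reply
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

example = [
    {"role": "system", "content": "You are a helpful customer support bot.\n"},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Open Settings and choose Reset password."},
]
print(render_llama3_chat(example))
```

For training, the full conversation including the assistant turn is rendered. For inference, the assistant turn is left out and `add_generation_prompt=True` appends the open assistant header instead.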

In [19]:
#import pandas
#pandas.DataFrame(dataset['train'])

In [20]:
# To check a sample record from dataset
dataset['train']['text'][10]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 02 Feb 2025\n\nYou are a helpful and efficient customer support bot designed to assist users by providing answers to frequently asked questions (FAQs) related to our products and services. Your responses should be concise, clear, and friendly, ensuring the user feels heard and supported. If the user’s question is outside the scope of the FAQ, gently direct them to contact customer support.\n\nAlways prioritize accuracy and clarity in your answers.\nIf the user asks a complex question, break it down into smaller, manageable parts and answer step-by-step.\nProvide useful links or references to detailed documentation when appropriate.\nUse a friendly and professional tone, ensuring the response is easy to understand.\nIf the FAQ does not cover the question, offer an apology and suggest contacting customer support.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI don't k

In [21]:
# Configure LoRA (Low-Rank Adaptation) for fine-tuning the model on a language modeling task
lora_config = LoraConfig(
    r=4,                   # Rank for low-rank matrices
    lora_alpha=8,         # Scaling factor
    lora_dropout=0.2,      # Regularization dropout
    task_type="CAUSAL_LM"  # For language modeling
)
model = get_peft_model(model, lora_config)
# Print the number of trainable parameters in the model after applying LoRA
model.print_trainable_parameters()

trainable params: 1,146,880 || all params: 3,213,896,704 || trainable%: 0.0357
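
The trainable parameter count can be reproduced by hand. This sketch rests on assumptions not stated in the notebook: that PEFT's default LoRA targets for Llama models are the `q_proj` and `v_proj` attention projections, and that Llama-3.2-3B has a hidden size of 3072, 28 layers, and 8 key/value heads of dimension 128. A LoRA adapter on a layer with `in_features` inputs and `out_features` outputs adds two matrices with `r * in_features` and `out_features * r` entries.

```python
R = 4            # LoRA rank, from lora_config
LORA_ALPHA = 8   # LoRA scaling factor, from lora_config

# Assumed Llama-3.2-3B dimensions (not stated in this notebook)
HIDDEN = 3072
KV_DIM = 8 * 128   # 8 key/value heads x head dimension 128
LAYERS = 28

def lora_params(in_features, out_features, r):
    """Parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

# Assumed default targets: q_proj (HIDDEN -> HIDDEN) and v_proj (HIDDEN -> KV_DIM)
per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total_trainable = per_layer * LAYERS
trainable_pct = 100 * total_trainable / 3_213_896_704
scaling = LORA_ALPHA / R  # the adapter update is multiplied by alpha / r

print(f"trainable params: {total_trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
print(f"scaling: {scaling}")
```

Under these assumptions the result matches the printed count of 1,146,880. At 4 bytes per parameter that is roughly 4.6 MB, which is in line with the size of the adapter file uploaded later.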


In [22]:
# Setting up training arguments for the model training process
training_arguments = TrainingArguments(
    output_dir="./results",  # Directory where the results will be saved
    num_train_epochs=1,  # Number of training epochs
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=5,  # Number of warmup steps to gradually increase the learning rate during training
    learning_rate=2e-4,  # Learning rate for the optimizer
    fp16=True,  # Enabling 16-bit floating point precision for faster training on GPUs that support it (reduces memory usage)
    report_to="none",  # Disabling logging/reporting to external services (e.g., TensorBoard, Weights & Biases)
)

# Initializing the SFTTrainer for supervised fine-tuning
trainer = SFTTrainer(
    model=model,  # The pre-trained model to be fine-tuned
    train_dataset=dataset["train"], # The dataset used for training
    eval_dataset=dataset["test"],  # The dataset used for validation
    tokenizer=tokenizer,  # Tokenizer to process input text for the model
    args=training_arguments,  # The training arguments defined above
    peft_config=lora_config,  # LoRA configuration (the model was already wrapped by get_peft_model above)
)



Map:   0%|          | 0/800 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
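
The effect of `warmup_steps=5` and `learning_rate=2e-4` is easier to see as a function of the step number. The sketch below assumes the default linear scheduler of `TrainingArguments` (linear warmup followed by linear decay to zero) and 800 total optimizer steps (800 training rows, batch size 1, one epoch). The function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=800):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 400, 800):
    print(step, linear_schedule_lr(step))
```

With only 5 warmup steps out of 800, the learning rate reaches its peak almost immediately and spends nearly the whole run decaying.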

In [23]:
model.train()
trainer.train()

Step,Training Loss
500,0.6753


TrainOutput(global_step=800, training_loss=0.6168776702880859, metrics={'train_runtime': 375.7962, 'train_samples_per_second': 2.129, 'train_steps_per_second': 2.129, 'total_flos': 4314878139666432.0, 'train_loss': 0.6168776702880859, 'epoch': 1.0})
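
The numbers in this output follow from the training configuration. A small sketch of the arithmetic, assuming the `TrainingArguments` defaults `gradient_accumulation_steps=1` and `logging_steps=500`, neither of which is set in the cell above:

```python
import math

TRAIN_ROWS = 800        # rows in dataset["train"]
BATCH_SIZE = 1          # per_device_train_batch_size
GRAD_ACCUM = 1          # assumed default gradient_accumulation_steps
EPOCHS = 1              # num_train_epochs
LOGGING_STEPS = 500     # assumed default logging_steps
RUNTIME_S = 375.7962    # train_runtime from the output above

steps_per_epoch = math.ceil(TRAIN_ROWS / (BATCH_SIZE * GRAD_ACCUM))
global_steps = steps_per_epoch * EPOCHS
samples_per_second = TRAIN_ROWS * EPOCHS / RUNTIME_S

# Steps at which a training loss row is logged
logged_steps = list(range(LOGGING_STEPS, global_steps + 1, LOGGING_STEPS))

print(global_steps, round(samples_per_second, 3), logged_steps)
```

This explains why the loss table has a single row at step 500: the next logging point would be step 1000, beyond the 800 steps of the run. Lowering `logging_steps` would give a more detailed loss curve.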

# Testing

In [24]:
# Function to generate a response based on the input prompt
def generate(input_prompt):
    # Define the system and user messages to provide context for the conversation
    messages = [
        {"role": "system", "content": instruction},  # System message with the pre-defined instructions
        {"role": "user", "content": input_prompt}   # User's input prompt
    ]

    # Apply the chat template to format the messages, without tokenizing yet, and add the generation prompt
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize the formatted prompt, padding and truncating as necessary, and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(model.device)

    # Generate the model's output based on the tokenized input, limiting to a maximum of 2048 new tokens
    outputs = model.generate(**inputs, max_new_tokens=2048, num_return_sequences=1)

    # Decode the output tokens back into text, skipping any special tokens (like padding)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return text  # Return the generated text

In [25]:
response = generate("Where to see what payment options are available?")
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 02 Feb 2025

You are a helpful and efficient customer support bot designed to assist users by providing answers to frequently asked questions (FAQs) related to our products and services. Your responses should be concise, clear, and friendly, ensuring the user feels heard and supported. If the user’s question is outside the scope of the FAQ, gently direct them to contact customer support.

Always prioritize accuracy and clarity in your answers.
If the user asks a complex question, break it down into smaller, manageable parts and answer step-by-step.
Provide useful links or references to detailed documentation when appropriate.
Use a friendly and professional tone, ensuring the response is easy to understand.
If the FAQ does not cover the question, offer an apology and suggest contacting customer support.user

Where to see what payment options are available?assistant

I'm glad you asked! To explore the available payment options, y

In [26]:
# Keep only the assistant's reply (the text after the last "assistant" marker)
print(response.split("assistant")[-1])



I'm glad you asked! To explore the available payment options, you can visit our website at {{Website URL}} and navigate to the {{Section Name}} section. There, you will find a comprehensive list of payment methods we accept. If you prefer a more personalized approach, feel free to reach out to our customer support team, and they will be more than happy to assist you in finding the perfect payment option for your needs. Remember, we're here to help you every step of the way!
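
Splitting on the word "assistant" cuts the reply short whenever the reply itself contains that word. As a hedged alternative, the sketch below splits on the user prompt followed by the role marker instead. The helper name `extract_reply` is hypothetical, and it assumes the decoded text reproduces the prompt verbatim, as in the output above.

```python
def extract_reply(decoded, user_prompt):
    """Return the text after '<user_prompt>assistant', or the whole text if the marker is absent."""
    marker = user_prompt + "assistant"
    _, found, reply = decoded.partition(marker)
    return reply.strip() if found else decoded.strip()

# Decoded output in the same shape as above, with a reply that contains "assistant"
decoded = (
    "system\n\nBe helpful.user\n\nWho are you?assistant\n\n"
    "I am your virtual assistant, here to help."
)

naive = decoded.split("assistant")[-1]
robust = extract_reply(decoded, "Who are you?")

print(repr(naive))
print(repr(robust))
```

Another common approach is to decode only the newly generated tokens, for example by slicing `outputs[0]` from `inputs["input_ids"].shape[1]` onward before decoding, which avoids string matching altogether.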


# Save the model

In [27]:
model.save_pretrained("/content/customer-faq-llama-3.2-3B") # Saves only the LoRA adapter weights and config to this local directory

In [28]:
# Push the LoRA adapter to the Hugging Face Hub
model.push_to_hub("customer-faq-llama-3.2-3B")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/4.60M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Shameer-Khan/customer-faq-llama-3.2-3B/commit/2270085c12196b3fbccde915e7ae8b55159d5c08', commit_message='Upload model', commit_description='', oid='2270085c12196b3fbccde915e7ae8b55159d5c08', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Shameer-Khan/customer-faq-llama-3.2-3B', endpoint='https://huggingface.co', repo_type='model', repo_id='Shameer-Khan/customer-faq-llama-3.2-3B'), pr_revision=None, pr_num=None)

The uploaded model appears on the Hub at a URL of this form: https://huggingface.co/Chirag4579/customer-faq-llama-3.2-3B

# Load a model

In [None]:
# Load the adapter from the local directory
# local_model = AutoModelForCausalLM.from_pretrained("/content/customer-faq-llama-3.2-3B")

# Load the saved adapter from the Hugging Face Hub

# model = AutoModelForCausalLM.from_pretrained(
#     "Shameer-Khan/customer-faq-llama-3.2-3B",  # Repository the adapter was pushed to above
#     device_map="auto",  # Automatically map model layers to available devices (e.g., GPU/CPU)
#     quantization_config=bnb_config,  # Apply the defined 4-bit quantization configuration
# )