# QnA Chatbot for Chocolate Retail Business | Powered by LLaMA2

This project builds a custom chatbot designed to assist customers of a **retail business specializing in chocolates**, leveraging the **LLaMA2-7B-Chat** model for intelligent, context-aware responses.

The model is fine-tuned using **QLoRA** (LoRA adapters + 4-bit quantization), enabling efficient training within a Google Colab environment while maintaining high performance and relevance to domain-specific queries.

---
### 📊 Dataset
- The dataset consists of **1000 QnA pairs** tailored specifically for a chocolate-focused retail business.  
- Each entry reflects realistic customer interactions on topics such as:
  - Taste profiles
  - Pricing and offers
  - Shipping/delivery
  - Gifting and packaging
  - Usage, storage, and dietary information

- Data was generated and refined using AI tools (e.g., ChatGPT), then formatted for supervised fine-tuning.

- Structure:
  - Stored in plain `.txt` files, where each QnA block is separated by two newlines.
  - Split into:
    - **800 training samples**
    - **100 validation samples**
    - **100 test samples**

- These are preprocessed into the Hugging Face `Dataset` format and used in conjunction with `trl.SFTTrainer` for QLoRA fine-tuning.
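
The notebook does not show how the 1000 blocks were divided into the three files. As a minimal sketch of one way to do it, assuming a single raw text with blocks separated by two newlines, the helper below shuffles the blocks with a fixed seed and slices them. The function name `split_qna_blocks`, the seed, and the tiny demo corpus are hypothetical and only illustrate the idea.

```python
import random

def split_qna_blocks(raw_text, n_train=800, n_val=100, seed=0):
    """Split double-newline-separated QnA blocks into train/val/test lists."""
    blocks = [b.strip() for b in raw_text.strip().split("\n\n") if b.strip()]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(blocks)
    train = blocks[:n_train]
    val = blocks[n_train:n_train + n_val]
    test = blocks[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test

# Tiny synthetic corpus (10 blocks) standing in for the real 1000-pair file
demo = "\n\n".join(f"[INST] Question {i}? [/INST]\nAnswer {i}." for i in range(10))
train, val, test = split_qna_blocks(demo, n_train=8, n_val=1)
print(len(train), len(val), len(test))
```

With the default arguments and 1000 input blocks, the same call would yield the 800/100/100 split described above.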

---

### Key Features
- Uses **LLaMA2-7B-Chat** model
- Fine-tuned with **QLoRA** for low-memory training
- Supports **fallback response** for unknown or unclear inputs
- Model trained on **custom chocolate-related QnA data**
- Exports ready-to-use model for chatbot deployment



## Libraries

In [1]:
# Install required libraries (TRL is installed from the main branch on GitHub)
!pip install -U transformers accelerate peft bitsandbytes
!pip install -U git+https://github.com/huggingface/trl.git@main

import math
import re
from random import choice

import torch
from torch.nn import CrossEntropyLoss
from datasets import Dataset
from huggingface_hub import login
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer



## Dataset

In [2]:
# 🔁 Helper function to load QnA blocks from .txt file
def load_qna(path):
    with open(path, "r", encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")  # Split on double newlines
    return [{"text": block.strip()} for block in blocks]  # Wrap each block as a dict


In [4]:
# Load and format custom chocolate-retail QnA dataset
# Each .txt file contains realistic customer interactions
# Format: [INST] question [/INST] answer
# Files: Train.txt, Validation.txt, Test.txt

train_dataset = Dataset.from_list(load_qna("/content/Train.txt"))
val_dataset   = Dataset.from_list(load_qna("/content/Validation.txt"))
test_dataset  = Dataset.from_list(load_qna("/content/Test.txt"))

# Preview dataset stats
print("📊 Training Samples:", len(train_dataset))
print("📊 Validation Samples:", len(val_dataset))
print("📊 Test Samples:", len(test_dataset))

# Show one sample from each
print("\nExample Training Sample:\n", train_dataset[0]["text"])
print("\nExample Validation Sample:\n", val_dataset[0]["text"])
print("\nExample Test Sample:\n", test_dataset[0]["text"])


📊 Training Samples: 800
📊 Validation Samples: 100
📊 Test Samples: 100

Example Training Sample:
 [INST] How do I get in touch with the company? [/INST]
You can email us directly at unthealthandfood@gmail.com for any help or questions.

Example Validation Sample:
 [INST] Seriously, do you ship to the UK? [/INST]
Totally valid doubt! Yes! We ship worldwide. Just email unthealthandfood@gmail.com to get started.

Example Test Sample:
 [INST] Seriously, is this okay as a Diwali gift for employees? [/INST]
Totally valid doubt! Absolutely! It’s one of the classiest gifts, thanks to the elegant packaging.


## Model building

In [22]:
# Authenticate with the Hugging Face Hub to access gated models (e.g., LLaMA 2)

# The token is confidential, so it is not stored in this notebook.
# Generate your own token at https://huggingface.co/settings/tokens and uncomment the line below.

#login(token=input("Enter your HF token: "))



In [6]:
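# Load the base pre-trained LLaMA2-7B checkpoint from Hugging Face
# Note: this is the base (non-chat) model; the chat-tuned variant is published as "meta-llama/Llama-2-7b-chat-hf"
model_name = "meta-llama/Llama-2-7b-hf"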

# ⚙️ Configure 4-bit quantization using BitsAndBytes for efficient memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enables 4-bit loading
    bnb_4bit_compute_dtype=torch.float16  # Use float16 for matrix multiplications
)

#  Load tokenizer for the LLaMA2 model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Set padding token to EOS (required for some models during generation)
tokenizer.pad_token = tokenizer.eos_token

# Load the quantized LLaMA2 model with device mapping (e.g., to GPU)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically maps model layers to available GPU(s)
    quantization_config=bnb_config  # Apply 4-bit quantization config
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
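
To make the memory benefit of 4-bit loading concrete, here is a back-of-the-envelope sketch. It assumes a round 7 billion parameters (implied by the "7B" name) and counts weight storage only. It ignores quantization constants, any layers that stay in higher precision, activations, LoRA parameters, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: round figure implied by the "7B" model name

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"fp16 weights : ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

The roughly fourfold reduction in weight storage is what lets the model fit on a single Colab GPU.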



In [7]:
#  Define LoRA (Low-Rank Adaptation) configuration
lora_config = LoraConfig(
    r=16,                             # Rank of the low-rank matrices
    lora_alpha=32,                   # Scaling factor for LoRA weights
    target_modules=["q_proj", "v_proj"],  # Target specific attention projection layers
    lora_dropout=0.05,               # Dropout applied to LoRA layers during training
    bias="none",                     # Don't train bias parameters
    task_type=TaskType.CAUSAL_LM     # Set task type for causal language modeling
)

#  Apply LoRA adapters to the base model
model = get_peft_model(model, lora_config)
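
As a rough check on how small the trainable part is, the sketch below counts the parameters this LoRA configuration adds. It assumes the standard Llama-2-7B shape (hidden size 4096, 32 decoder layers, with `q_proj` and `v_proj` both 4096 x 4096). These dimensions are not printed anywhere in the notebook, so treat them as assumptions. In the notebook itself, PEFT's `model.print_trainable_parameters()` reports the actual figure.

```python
def lora_param_count(r, in_features, out_features):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

# Assumptions: Llama-2-7B uses hidden size 4096 and 32 decoder layers,
# and q_proj / v_proj are both 4096 x 4096 linear layers.
HIDDEN, LAYERS, R, ALPHA = 4096, 32, 16, 32
TARGETS_PER_LAYER = 2  # q_proj and v_proj

per_module = lora_param_count(R, HIDDEN, HIDDEN)
total = per_module * TARGETS_PER_LAYER * LAYERS
print(f"per adapted module: {per_module:,}")
print(f"total trainable   : {total:,}")
print(f"scaling alpha / r : {ALPHA / R}")
```

Under these assumptions the adapters add about 8.4 million parameters, a small fraction of the roughly 7 billion frozen base weights.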


In [8]:
# Define training configuration using Hugging Face's TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama2-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=10,
    save_total_limit=2,
    save_steps=100,
    fp16=True,
    report_to="none"
)


In [9]:
# Fine-tune the model using TRL's SFTTrainer (Supervised Fine-Tuning)
# Using formatting_func=lambda x: x["text"] because each dataset row is a dictionary like {"text": "[INST] question [/INST] answer"}.


trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_config,
    args=training_args,
    formatting_func=lambda x: x["text"]
)





No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [10]:
# Model Training
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 2.6614 |
| 20 | 1.6911 |
| 30 | 1.0481 |
| 40 | 0.5996 |
| 50 | 0.3482 |
| 60 | 0.2472 |
| 70 | 0.2026 |
| 80 | 0.1686 |
| 90 | 0.1558 |
| 100 | 0.1454 |


TrainOutput(global_step=300, training_loss=0.3206971565882365, metrics={'train_runtime': 871.3312, 'train_samples_per_second': 2.754, 'train_steps_per_second': 0.344, 'total_flos': 4991173019074560.0, 'train_loss': 0.3206971565882365})
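
The `global_step=300` and throughput figures in the `TrainOutput` above can be cross-checked from the training arguments. This is a minimal sketch assuming a single GPU and a dataset size that divides evenly by the effective batch size. The helper name `optimizer_steps` is hypothetical and is not part of the Trainer API.

```python
import math

def optimizer_steps(n_samples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, total optimizer steps) for a training run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = optimizer_steps(
    n_samples=800, per_device_batch=2, grad_accum=4, epochs=3
)
print("effective batch size:", effective_batch)
print("total optimizer steps:", total_steps)

# Throughput implied by the reported train_runtime of 871.3312 s
runtime_s = 871.3312
print(f"samples/s: {800 * 3 / runtime_s:.3f}")
print(f"steps/s  : {total_steps / runtime_s:.3f}")
```

An effective batch of 8 over 800 samples gives 100 steps per epoch, so three epochs account for the 300 steps reported.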

## Model Evaluation

In [11]:
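# Sample prediction on the test dataset
# Note: each sample holds both the question and its reference answer, so the
# prompt passed to the model here already contains the expected answer.
sample = choice(test_dataset)["text"]

# Tokenize the sample prompt and move it to the GPU
inputs = tokenizer(sample, return_tensors="pt").to("cuda")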

outputs = model.generate(**inputs, max_new_tokens=100)


print("📌 Prompt:\n", sample)
print("\n📦 Response:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))


📌 Prompt:
 [INST] Seriously, can I have it during my keto diet? [/INST]
Totally valid doubt! It’s vegetarian-friendly! But double-check the label for your dietary needs.

📦 Response:
 [INST] Seriously, can I have it during my keto diet? [/INST]
Totally valid doubt! It’s vegetarian-friendly! But double-check the label for your dietary needs.
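
In the cell above, the text passed to the model is the whole block, reference answer included, so the decoded output necessarily begins with that answer. To see what the model generates on its own, only the part up to `[/INST]` should be used as the prompt. A minimal sketch of such a split follows. The helper name `split_prompt_and_answer` is hypothetical, and the sample string is copied from the output above.

```python
END_TAG = "[/INST]"

def split_prompt_and_answer(text):
    """Return (prompt, reference_answer) for a '[INST] q [/INST] a' block."""
    head, sep, tail = text.partition(END_TAG)
    if not sep:
        raise ValueError("block has no [/INST] tag")
    return head + sep, tail.strip()

sample = ("[INST] Seriously, can I have it during my keto diet? [/INST]\n"
          "Totally valid doubt! It’s vegetarian-friendly! "
          "But double-check the label for your dietary needs.")
prompt, reference = split_prompt_and_answer(sample)
print("Prompt   :", prompt)
print("Reference:", reference)
```

Tokenizing `prompt` instead of the full sample, then comparing the generated continuation with `reference`, would give a more informative spot check.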


In [12]:
# Function to compute average loss and perplexity on a dataset

def compute_perplexity(model, tokenizer, dataset, max_length=512):
    model.eval()
    losses = []

    for sample in dataset:
        text = sample['text']
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        input_ids = inputs["input_ids"].to("cuda")
        with torch.no_grad():
            outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        losses.append(loss.item())

    avg_loss = sum(losses) / len(losses)
    ppl = math.exp(avg_loss)
    return avg_loss, ppl

#  Evaluate on validation set

val_loss, val_ppl = compute_perplexity(model, tokenizer, val_dataset)
print(f"📊 Validation Loss: {val_loss:.4f}")
print(f"📉 Validation Perplexity: {val_ppl:.2f}")


📊 Validation Loss: 0.1156
📉 Validation Perplexity: 1.12
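
Perplexity is the exponential of the mean negative log-likelihood, so the two numbers above are tied together. The sketch below checks that relation and also shows a token-weighted mean as an alternative to the per-sample mean used in `compute_perplexity`, where short and long samples count equally. The per-sample losses and token counts in the second half are hypothetical values for illustration only.

```python
import math

def perplexity_from_loss(mean_nll):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    return math.exp(mean_nll)

def token_weighted_mean(losses, token_counts):
    """Average per-sample losses, weighting each by its number of tokens."""
    total_tokens = sum(token_counts)
    return sum(l * n for l, n in zip(losses, token_counts)) / total_tokens

# Reported validation loss from the run above
print(f"perplexity at loss 0.1156: {perplexity_from_loss(0.1156):.2f}")

# Hypothetical per-sample losses and token counts, for illustration only
losses = [0.10, 0.20]
counts = [30, 10]
simple_mean = sum(losses) / len(losses)
weighted_mean = token_weighted_mean(losses, counts)
print(f"per-sample mean loss    : {simple_mean:.3f}")
print(f"token-weighted mean loss: {weighted_mean:.3f}")
```

When samples have similar lengths, as short QnA pairs often do, the two averages will usually be close.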


In [13]:
# Function to compute Exact Match (EM) accuracy on the test dataset

def exact_match_accuracy(model, tokenizer, dataset):
    model.eval()
    match = 0

    for sample in dataset:
        prompt = sample["text"]
        if "[/INST]" not in prompt:
            continue
        split_idx = prompt.index("[/INST]") + len("[/INST]")
        expected_answer = prompt[split_idx:].strip()

        input_prompt = prompt[:split_idx]
        inputs = tokenizer(input_prompt, return_tensors="pt", truncation=True, max_length=512).to("cuda")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=100)
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
        predicted_answer = generated_text[split_idx:].strip()

        if predicted_answer == expected_answer:
            match += 1

    return match / len(dataset) * 100

#  Evaluate Exact Match on Test Set

em_score = exact_match_accuracy(model, tokenizer, test_dataset)
print(f"✅ Exact Match Accuracy on Test Set: {em_score:.2f}%")


✅ Exact Match Accuracy on Test Set: 100.00%
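
A 100% exact-match score is easier to interpret if one knows how many test answers also occur verbatim in the training set. The following sketch measures that overlap. The function names and the toy blocks are hypothetical, and the notebook does not report this figure.

```python
def extract_answer(block, tag="[/INST]"):
    """Return the text after the closing instruction tag, stripped."""
    return block.partition(tag)[2].strip()

def answer_overlap(train_blocks, test_blocks):
    """Fraction of test answers that appear verbatim among training answers."""
    train_answers = {extract_answer(b) for b in train_blocks}
    test_answers = [extract_answer(b) for b in test_blocks]
    if not test_answers:
        return 0.0
    seen = sum(a in train_answers for a in test_answers)
    return seen / len(test_answers)

# Toy blocks for illustration only (not taken from the real files)
toy_train = ["[INST] Do you ship to the UK? [/INST]\nYes! We ship worldwide."]
toy_test = [
    "[INST] Can you deliver to France? [/INST]\nYes! We ship worldwide.",
    "[INST] Is it vegan? [/INST]\nPlease check the label.",
]
overlap = answer_overlap(toy_train, toy_test)
print(f"test answers also seen in training: {overlap:.0%}")
```

In the notebook, the same function could be applied to `train_dataset["text"]` and `test_dataset["text"]`. A high overlap would suggest that exact match mostly measures recall of memorized answer templates.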


In [14]:
print(f"📊 Validation Loss: {val_loss:.4f}")
print(f"📉 Validation Perplexity: {val_ppl:.2f}")
print(f"✅ Exact Match Accuracy on Test Set: {em_score:.2f}%")


📊 Validation Loss: 0.1156
📉 Validation Perplexity: 1.12
✅ Exact Match Accuracy on Test Set: 100.00%


## Testing the QnA Bot

In [15]:
# Set padding token to the same as end-of-sequence (EOS) token

tokenizer.pad_token = tokenizer.eos_token

In [16]:
# Add a fallback message



# Detect gibberish input (very few real words or too many non-alphabet characters)
def is_gibberish(text):
    # Gibberish if fewer than 2 words of 4+ letters, or if more than 40% of characters are non-alphabetic (spaces excluded)
    words = re.findall(r'\b[a-zA-Z]{4,}\b', text)
    non_alpha_ratio = sum(1 for c in text if not c.isalpha() and c != ' ') / max(len(text), 1)
    return len(words) < 2 or non_alpha_ratio > 0.4


#  Main: Ask the chatbot and return a response
def ask_bot(question, max_new_tokens=100):
    fallback_msg = "I'm not sure how to answer that. Please contact the seller for more help."

    # Handle garbage or unclear user input
    if is_gibberish(question):
        return fallback_msg

    # Format user input into prompt for LLaMA2
    prompt = f"[INST] {question.strip()} [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False, truncation=True, max_length=512).to("cuda")

    # Generate response from model
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.05
        )


    # Postprocess the model output
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # Clean up
    decoded = decoded.replace(prompt, "").strip()
    decoded = re.sub(r"\[/?INST\]", "", decoded)
    decoded = re.sub(r"<s>|</s>", "", decoded).strip()


    # Check if model response is valid, or fallback
    is_empty = decoded == ""
    is_disclaimer = any(x in decoded.lower() for x in [
        "language model", "as an ai", "i cannot", "i do not have access"
    ])
    looks_like_question = decoded.endswith("?") and not decoded.lower().startswith("yes")
    is_non_alpha = sum(c.isalpha() for c in decoded) / max(len(decoded), 1) < 0.4

    if is_empty or is_disclaimer or is_non_alpha or looks_like_question:
        return fallback_msg

    return decoded
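
To see which inputs the input filter sends straight to the fallback message, the heuristic can be exercised without loading the model. The sketch below repeats `is_gibberish` so that it runs standalone. The last two example inputs are hypothetical and illustrate a side effect: short but valid questions with fewer than two words of four or more letters are also treated as gibberish.

```python
import re

def is_gibberish(text):
    # Same heuristic as the cell above, repeated so this sketch runs standalone
    words = re.findall(r'\b[a-zA-Z]{4,}\b', text)
    non_alpha_ratio = sum(1 for c in text if not c.isalpha() and c != ' ') / max(len(text), 1)
    return len(words) < 2 or non_alpha_ratio > 0.4

examples = [
    "can i order in bulk",  # two long words ("order", "bulk")
    "what is the price",    # two long words ("what", "price")
    "sjhdg",                # a single token
    " ",                    # empty input
    "price?",               # hypothetical: valid but only one long word
    "do you ship?",         # hypothetical: valid but only one long word
]
for q in examples:
    route = "fallback" if is_gibberish(q) else "model"
    print(f"{q!r:24} -> {route}")
```

If such short questions matter in practice, lowering the word-length or word-count threshold is one option, at the cost of letting more noise through to the model.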


In [18]:
#  Test chatbot for 10 interactions
for i in range(10):
    user_input = input(f"🧑‍💻 You ({i+1}/10): ")

    if user_input.lower() in ["exit", "quit"]:
        print("👋 Goodbye!")
        break

    response = ask_bot(user_input)

    print("🤖 Bot:", response, "\n")


🧑‍💻 You (1/10): can i order in bulk
🤖 Bot: Yes, you can email us at unthealthandfood@gmail.com for bulk orders and discounts. 

🧑‍💻 You (2/10): what is the price
🤖 Bot: ₹699 for 96 grams. 

🧑‍💻 You (3/10): can you deliver to cannada
🤖 Bot: Yes! We ship worldwide. Just email unthealthandfood@gmail.com to get started. 

🧑‍💻 You (4/10): sjhdg
🤖 Bot: I'm not sure how to answer that. Please contact the seller for more help. 

🧑‍💻 You (5/10): does the parcel comes in hidden package?
🤖 Bot: Yes! It arrives in discreet packaging with no product name on the box. 

🧑‍💻 You (6/10):  
🤖 Bot: I'm not sure how to answer that. Please contact the seller for more help. 

🧑‍💻 You (7/10): do you do custom gift wrapping
🤖 Bot: Yes! We do free gift wrapping. Just email unthealthandfood@gmail.com to get started. 

🧑‍💻 You (8/10): can I gift the chocolate to my girlfriend
🤖 Bot: Absolutely! It’s one of the classiest gifts, thanks to the elegant packaging. 

🧑‍💻 You (9/10): can i eat the chocolate daily?
🤖 Bo

## Saving the Model

In [19]:
# Merge LoRA adapter into base model
merged_model = model.merge_and_unload()

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Save merged model + tokenizer to Drive
save_path = "/content/drive/MyDrive/llama2-final"
merged_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)




Mounted at /content/drive


('/content/drive/MyDrive/llama2-final/tokenizer_config.json',
 '/content/drive/MyDrive/llama2-final/special_tokens_map.json',
 '/content/drive/MyDrive/llama2-final/tokenizer.model',
 '/content/drive/MyDrive/llama2-final/added_tokens.json',
 '/content/drive/MyDrive/llama2-final/tokenizer.json')

## Generate Requirements File

In [20]:
# Generate the requirements.txt file
!pip freeze > requirements.txt
print("✅ requirements.txt generated with all installed packages.")


✅ requirements.txt generated with all installed packages.


In [21]:
# Generate a minimal requirements file listing only the core libraries
!pip freeze | grep -E 'transformers|torch|accelerate|peft|bitsandbytes' > requirements_minimal.txt


## Conclusion

In this notebook, I successfully fine-tuned the LLaMA2-7B-Chat model using QLoRA to build a custom chatbot for a chocolate-based retail business. I started by preparing a domain-specific QnA dataset and formatting it for training. The model was trained on Colab using a memory-efficient setup, evaluated thoroughly, and finally saved for deployment.

To make the bot more user-friendly, I also added fallback logic to handle gibberish or irrelevant questions.

### Model Accuracy
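After training, the model achieved an Exact Match Accuracy of 100% on the test set, meaning it reproduced the expected answer for every test case. Together with the validation perplexity of 1.12, this shows that the model has learned the dataset's answer patterns very closely. Because the test questions come from the same AI-generated dataset as the training questions, this score measures fit to that data rather than performance on unseen real-world queries, which would need a separate evaluation.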