# Fine-Tuning StableLM-2-Zephyr-1.6B with RAG for EADE Business School Chatbot

This Kaggle notebook fine-tunes `stabilityai/stablelm-2-zephyr-1_6b` on a small question-answer dataset about EADE Business School and uses Retrieval-Augmented Generation (RAG) at inference time to ground its answers in that dataset. It is sized for a single T4 GPU (14.74 GiB) and addresses the following errors: CUDA out-of-memory, device mismatch, loss computation, `max_seq_length`, and `metric_for_best_model`.

**Key Features**:
- Lightweight model (~1.6B parameters, ~3-4 GB memory)
- RAG with SentenceTransformer/FAISS for grounded responses
- Fixed all errors: OOM, device mismatch (`cuda:0`), loss, `max_seq_length`, `metric_for_best_model`
- Memory optimizations: 4-bit quantization, batch size 1, gradient checkpointing
- ROUGE/BLEU metrics for evaluation
- Gradio interface for interactive testing

**Requirements**:
- Upload `eade_business_school_data.json` (61 prompt-response pairs) to `/kaggle/working/`
- T4 GPU (single GPU, cuda:0)
- Hugging Face token for optional model push (set in Kaggle Secrets)

**Dataset**: Ensure `eade_business_school_data.json` is uploaded. Example format:
```json
[
  {"prompt": "What is the full name of EADE Business School?", "response": "The full name of EADE Business School is Escuela de Alta Dirección Empresarial."},
  {"prompt": "What are the facilities at EADE Business School?", "response": "EADE Business School offers state-of-the-art facilities, including modern classrooms, 24/7 library access with over 10,000 books and 2,000 academic articles, high-tech offices with advanced technology for business simulations, and amenities like cafeterias, wellness centers, and event spaces designed for a supportive learning environment."},
  ...
]
```
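
Before training, it can help to confirm that every record matches the format above. The following is a minimal sketch, not part of the original notebook; `validate_records` is a hypothetical helper, and the toy records stand in for the real file contents.

```python
import json

def validate_records(records):
    """Return a list of human-readable problems found in the dataset records."""
    problems = []
    for i, item in enumerate(records):
        if not isinstance(item, dict):
            problems.append(f"record {i}: expected an object, got {type(item).__name__}")
            continue
        for key in ("prompt", "response"):
            value = item.get(key)
            # Each field must be a non-empty string
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{key}'")
    return problems

# Toy records standing in for the real file contents
sample = json.loads('[{"prompt": "Q1", "response": "A1"}, {"prompt": "Q2"}]')
print(validate_records(sample))  # ["record 1: missing or empty 'response'"]
```

Running the same check on the loaded `data` list before converting it to a `Dataset` would surface malformed records early, rather than as a tokenization error later.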


In [None]:
# Install required packages (peft is imported below for LoRA, so install it explicitly)
!pip install -q transformers gradio torch huggingface_hub datasets sentence-transformers faiss-cpu rouge_score nltk wandb trl peft bitsandbytes accelerate

import json
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset
from sentence_transformers import SentenceTransformer
import faiss
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import gradio as gr
from huggingface_hub import login
import os
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from transformers import BitsAndBytesConfig
import wandb

# Restrict to single GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Set environment variable to reduce memory fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Initialize Weights & Biases
wandb.init(project="eade-chatbot-stablelm-1.6b", anonymous="allow")

# Authenticate with Hugging Face (optional; only needed to push the model)
from kaggle_secrets import UserSecretsClient
try:
    hf_token = UserSecretsClient().get_secret("HF_TOKEN")  # Add HF_TOKEN in Kaggle Secrets
except Exception:
    hf_token = None  # get_secret raises if the secret is not attached to the notebook
if hf_token:
    login(hf_token)
    print("Hugging Face authentication successful")
else:
    print("No Hugging Face token provided; assuming public model")

# Check GPU availability and clear memory
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory cleared")
print(f"Using device: {device}")


## Load and Prepare Dataset

Load the EADE Business School dataset from `/kaggle/working/` and split into training (80%) and validation (20%) sets.
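
The split in this notebook takes the first 80% of records in file order. If the file groups related questions together, a shuffled split may give a more representative validation set. The sketch below is an alternative under that assumption, not the notebook's method; `split_indices` and the seed value are hypothetical.

```python
import random

def split_indices(n_samples, train_fraction=0.8, seed=42):
    """Shuffle sample indices reproducibly and cut them into train/eval lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

# 61 prompt-response pairs, as stated in the requirements
train_idx, eval_idx = split_indices(61)
print(len(train_idx), len(eval_idx))  # 48 13
```

The resulting index lists could be passed to `dataset.select(...)`. The `datasets` library also offers `dataset.train_test_split(test_size=0.2, seed=42)`, which shuffles by default.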


In [None]:
# Define paths
OUTPUT_DIR = "/kaggle/working/stablelm-2-zephyr-1_6b-eade-finetuned"
DATA_PATH = "/kaggle/working/eade_business_school_data.json"

# Load dataset
try:
    with open(DATA_PATH, 'r') as f:
        data = json.load(f)
    print(f"Loaded dataset with {len(data)} samples")
except Exception as e:
    raise Exception(f"Error loading dataset: {str(e)}. Ensure `eade_business_school_data.json` is uploaded to /kaggle/working/")

# Convert to Hugging Face Dataset
dataset = Dataset.from_list(data)
train_dataset = dataset.select(range(int(len(dataset) * 0.8)))  # 80% for training
eval_dataset = dataset.select(range(int(len(dataset) * 0.8), len(dataset)))  # 20% for validation
print(f"Training samples: {len(train_dataset)}, Validation samples: {len(eval_dataset)}")


## Set Up RAG

Initialize SentenceTransformer and FAISS for RAG to ensure accurate responses.
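
To make the retrieval step concrete, here is a pure-Python sketch of what an exhaustive ("flat") L2 index computes: the squared Euclidean distance from the query to every stored vector, returning the `k` smallest. `flat_l2_search` and the two-dimensional toy vectors are illustrative only; the notebook itself relies on `faiss.IndexFlatL2` over sentence embeddings.

```python
def flat_l2_search(query, vectors, k):
    """Brute-force nearest-neighbour search using squared L2 distance."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices

# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions
docs = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search([0.9, 0.1], docs, k=2)
print(indices)  # [1, 0]
```

Because every query is compared against every document, this approach is exact but scales linearly with corpus size, which is not a concern for 61 documents.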


In [None]:
# Initialize RAG components
retriever_model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [item['response'] for item in data]
document_embeddings = retriever_model.encode(documents, convert_to_numpy=True)

# Create FAISS index
dimension = document_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(document_embeddings)
print(f"FAISS index created with {len(documents)} documents")


## Load Model and Tokenizer

Load `stablelm-2-zephyr-1.6b` with 4-bit quantization and LoRA, ensuring all tensors are on `cuda:0`.
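
A rough back-of-the-envelope calculation shows why 4-bit quantization and LoRA fit comfortably on a T4. The sketch below is an estimate only: it ignores quantization metadata, layers kept in higher precision, activations, and optimizer state. The hidden size (2048) and layer count (24) are assumptions about the model's configuration and should be checked against `model.config`.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

def lora_trainable_params(rank, in_features, out_features, n_modules):
    """Each adapted linear layer adds A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features) * n_modules

N_PARAMS = 1.6e9  # approximate parameter count from the model name
HIDDEN = 2048     # assumed hidden size; verify with model.config.hidden_size
LAYERS = 24       # assumed layer count; verify with model.config.num_hidden_layers

print(f"fp16 weights:  {weight_memory_gib(N_PARAMS, 16):.2f} GiB")
print(f"4-bit weights: {weight_memory_gib(N_PARAMS, 4):.2f} GiB")

# r=16 on q_proj and v_proj in every layer, as in the LoRA config used here
trainable = lora_trainable_params(16, HIDDEN, HIDDEN, n_modules=2 * LAYERS)
print(f"LoRA trainable parameters: {trainable:,}")
```

Under these assumptions the adapters add about 3 million trainable parameters, a small fraction of the base model, which is why gradients and optimizer state stay small.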


In [None]:
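# Load model and tokenizer
model_name = "stabilityai/stablelm-2-zephyr-1_6b"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The T4 has no native bfloat16 support, so compute in float16
    quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    # device_map already places the quantized weights on cuda:0; calling .to(device)
    # afterwards may be rejected for 4-bit models, depending on installed versions
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        device_map={"": "cuda:0"},  # Explicitly use cuda:0
        use_cache=False
    )
    # Prepare the quantized model for training: enables gradient checkpointing and
    # makes inputs require grads so LoRA gradients flow through checkpointed blocks
    from peft import prepare_model_for_kbit_training
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
    print(f"Model and tokenizer loaded from {model_name} on {device}")
except Exception as e:
    raise Exception(f"Error loading model or tokenizer: {str(e)}")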

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    print("Padding token set to EOS token")

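# Apply LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"  # Wrap the model as a causal LM so generate() and labels work as expected
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
print("LoRA configuration applied")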


## Tokenize Dataset

Tokenize the dataset, ensuring `labels` are included for loss computation.
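
For a causal language model, the labels are a copy of the input IDs; the model shifts them by one position internally so that each token is trained to predict the next one. Padding positions are usually replaced with `-100`, the value PyTorch's cross-entropy loss ignores by default. `DataCollatorForLanguageModeling(mlm=False)` applies this masking to pad tokens. The toy sketch below illustrates the idea with plain lists; `make_labels` and `shifted_pairs` are hypothetical helpers.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def make_labels(input_ids, attention_mask):
    """Copy input IDs as labels, masking out padded positions."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

def shifted_pairs(input_ids, labels):
    """Pair the token at position t with its target at t+1, skipping ignored labels."""
    return [(input_ids[t], labels[t + 1])
            for t in range(len(input_ids) - 1)
            if labels[t + 1] != IGNORE_INDEX]

input_ids = [11, 12, 13, 0, 0]   # three real tokens followed by two pad tokens (id 0)
attention_mask = [1, 1, 1, 0, 0]
labels = make_labels(input_ids, attention_mask)
print(labels)                            # [11, 12, 13, -100, -100]
print(shifted_pairs(input_ids, labels))  # [(11, 12), (12, 13)]
```

Only the two real next-token predictions contribute to the loss; the padded tail is skipped entirely.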


In [None]:
# Tokenize function
def tokenize_function(examples):
    # Format each pair with the model's own chat template so role markers map to real tokens
    texts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": p}, {"role": "assistant", "content": r}],
            tokenize=False,
        )
        for p, r in zip(examples['prompt'], examples['response'])
    ]
    tokenized = tokenizer(texts, padding="max_length", truncation=True, max_length=256)
    tokenized['labels'] = [ids.copy() for ids in tokenized['input_ids']]  # Labels for causal LM
    return tokenized

# Tokenize datasets
try:
    tokenized_train = train_dataset.map(tokenize_function, batched=True, remove_columns=['prompt', 'response'])
    tokenized_eval = eval_dataset.map(tokenize_function, batched=True, remove_columns=['prompt', 'response'])
    print("Datasets tokenized successfully")
except Exception as e:
    raise Exception(f"Error tokenizing datasets: {str(e)}")


## Define Training Arguments and Metrics

Set up training arguments and metrics, using SFTTrainer with `metric_for_best_model='rouge1'`.
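
ROUGE-1, the metric used for best-model selection, measures unigram overlap between a prediction and a reference. The sketch below is a simplified pure-Python version for intuition only: it lowercases and splits on whitespace, whereas the `rouge_score` package used in this notebook also applies its own tokenization and optional stemming, so the numbers may differ slightly.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Two of three words match in each direction, so precision = recall = 2/3
print(round(rouge1_f("the cat sat", "the cat ran"), 4))  # 0.6667
```

Note that metrics computed inside the trainer compare teacher-forced next-token predictions with the labels, which tends to score higher than free-running generation would.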


In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    warmup_steps=10,
    logging_steps=10,
    # Evaluate and save once per epoch: with ~48 training samples and an effective
    # batch of 8, training runs only ~30 optimizer steps, so a 50-step interval
    # would never trigger and rouge1 would never be computed
    eval_strategy="epoch",  # named `evaluation_strategy` in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",  # Use rouge1 for best model selection
    greater_is_better=True,
    fp16=True,
    report_to="wandb",
    gradient_checkpointing=True,
    max_grad_norm=0.5
)

# Define metrics
def rouge_score(predictions, references):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
    return {
        "rouge1": np.mean([s['rouge1'].fmeasure for s in scores]),
        "rougeL": np.mean([s['rougeL'].fmeasure for s in scores])
    }

def bleu_score(predictions, references):
    return np.mean([sentence_bleu([ref.split()], pred.split(), weights=(0.25, 0.25, 0.25, 0.25)) for ref, pred in zip(references, predictions)])

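def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to token IDs if the trainer passed raw logits
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)
    # Logits at position t predict token t+1, so shift to align with the labels
    predictions = predictions[:, :-1]
    labels = labels[:, 1:]
    # The collator marks ignored positions with -100, which cannot be decoded
    mask = labels != -100
    predictions = np.where(mask, predictions, tokenizer.pad_token_id)
    labels = np.where(mask, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return {
        **rouge_score(decoded_preds, decoded_labels),
        "bleu": bleu_score(decoded_preds, decoded_labels)
    }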


## Train the Model

Use SFTTrainer with a data collator to ensure proper loss computation.
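
With a dataset this small, it is worth working out how many optimizer steps training actually performs, since logging, evaluation, and checkpoint intervals are all expressed in steps. The sketch below uses the values from this notebook (48 training samples, batch size 1, 8 gradient-accumulation steps, 5 epochs). `optimizer_steps` is a hypothetical helper, and the exact rounding the trainer applies may differ between library versions.

```python
import math

def optimizer_steps(n_samples, per_device_batch, grad_accum, epochs):
    """Estimate optimizer steps per epoch and in total for single-GPU training."""
    batches_per_epoch = math.ceil(n_samples / per_device_batch)
    # One optimizer step is taken after every `grad_accum` batches
    steps_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = optimizer_steps(n_samples=48, per_device_batch=1, grad_accum=8, epochs=5)
print(per_epoch, total)  # 6 30
```

At roughly 30 total steps, any step-based interval larger than that (for example, evaluating every 50 steps) would never be reached, so per-epoch evaluation and saving is the safer choice here.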


In [None]:
# Initialize data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize SFTTrainer
# The model is already wrapped by get_peft_model, so peft_config is not passed again
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the model
try:
    model.train()
    trainer.train()
except Exception as e:
    print(f"Error during training: {str(e)}")
    raise e

# Save the model
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Fine-tuned model saved to: {OUTPUT_DIR}")


## RAG-Enhanced Response Generation

Generate responses using RAG, ensuring inputs are on `cuda:0`.


In [None]:
# RAG-based response generation
def chatbot_response(input_text, max_new_tokens=100, temperature=0.7, top_p=0.9, top_k=3):
    # Note: top_k is the number of documents to retrieve, not a sampling parameter
    try:
        # Retrieve relevant documents
        query_embedding = retriever_model.encode([input_text], convert_to_numpy=True)
        distances, indices = index.search(query_embedding, top_k)
        retrieved_docs = [documents[i] for i in indices[0]]
        context = "\n".join(retrieved_docs)

        # Format prompt with RAG context using the model's own chat template
        user_message = (
            f"Context: {context}\n"
            f"Question: {input_text}\n"
            "Answer only with verified information from the provided context."
        )
        formatted_prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": user_message}],
            tokenize=False,
            add_generation_prompt=True
        )

        # Tokenize input; the limit leaves room for the retrieved documents, since
        # right-side truncation at a short length could cut off the question itself
        inputs = tokenizer(
            formatted_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=1024
        ).to(device)  # Explicitly move to cuda:0

        # Generate response
        model.eval()
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                top_p=top_p,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens (everything after the prompt)
        prompt_length = inputs["input_ids"].shape[1]
        assistant_response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
        return assistant_response
    except Exception as e:
        return f"Error generating response: {str(e)}"


## Training Summary and Memory Usage

Display training summary and GPU memory usage.


In [None]:
# Training summary
print("\n" + "="*50)
print("TRAINING SUMMARY")
print("="*50)
print(f"Total training samples: {len(train_dataset)}")
print(f"Total validation samples: {len(eval_dataset)}")
print(f"Training epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Model output directory: {OUTPUT_DIR}")

# Memory usage
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(device='cuda:0') / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved(device='cuda:0') / 1024**3:.2f} GB")


## Launch Gradio Interface

Launch a Gradio interface for interactive testing.


In [None]:
# Create Gradio interface
interface = gr.Interface(
    fn=chatbot_response,
    inputs=gr.Textbox(label="Enter your question about EADE Business School"),
    outputs=gr.Textbox(label="Response"),
    title="EADE Business School Chatbot with RAG",
    description="Ask questions about EADE Business School, powered by a fine-tuned StableLM-2-Zephyr-1.6B with RAG."
)

# Launch the interface
try:
    interface.launch(share=True, debug=True)  # Note: share=True may not work in Kaggle; use local or deploy to Hugging Face Spaces
    print("Gradio interface launched successfully. Access it via the public URL.")
except Exception as e:
    print(f"Error launching Gradio interface: {str(e)}. In Kaggle, try running locally or deploy to Hugging Face Spaces.")
