# Multi-LoRA RAG Assistant

Production-style LLM system featuring:

- Dual LoRA adapters (Technical + Research)
- Embedding-based routing
- Retrieval Augmented Generation (FAISS)
- Conversation memory
- Adapter merging
- Latency benchmarking
- Gradio UI

Run top to bottom on GPU.


## 1. Environment Setup

In [1]:

!pip install -q transformers accelerate peft bitsandbytes datasets gradio sentence-transformers faiss-cpu


## 2. Global Configuration

In [2]:

BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
TECH_PATH = "/content/tech_lora"
RESEARCH_PATH = "/content/research_lora"


## 3. Load Base Model (4-bit)

In [3]:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto"
)

print("Base model loaded")


config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/290 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

Base model loaded


## 4. Load Datasets

In [4]:

from datasets import load_dataset

tech_ds = load_dataset("HuggingFaceH4/CodeAlpaca_20K", split="train[:2000]")
research_ds = load_dataset("cnn_dailymail", "3.0.0", split="train[:1000]")

print(tech_ds[0].keys())
print(research_ds[0].keys())


dict_keys(['prompt', 'completion'])
dict_keys(['article', 'highlights', 'id'])


## 5. Dataset Preprocessing

In [5]:

def preprocess_tech(example):
    text = f"""### Prompt:
{example['prompt']}

### Completion:
{example['completion']}"""
    tokens = tokenizer(text, truncation=True, padding="max_length", max_length=256)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

def preprocess_research(example):
    text = f"""Article:
{example['article']}

Summary:
{example['highlights']}"""
    tokens = tokenizer(text, truncation=True, padding="max_length", max_length=256)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tech_ds = tech_ds.map(preprocess_tech, remove_columns=tech_ds.column_names)
research_ds = research_ds.map(preprocess_research, remove_columns=research_ds.column_names)


Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

## 6. LoRA Configuration

In [6]:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)


## 7. Train Technical LoRA

In [7]:

from transformers import TrainingArguments, Trainer

tech_model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./tech",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=300,
    fp16=True,
    logging_steps=20,
    save_strategy="no"
)

trainer = Trainer(model=tech_model, args=training_args, train_dataset=tech_ds)
trainer.train()

tech_model.save_pretrained(TECH_PATH)
print("Saved TECH LoRA")


Step,Training Loss
20,9.611572
40,2.139327
60,0.443237
80,0.440865
100,0.366539
120,0.362984
140,0.388137
160,0.357526
180,0.350845
200,0.313849


Saved TECH LoRA


## 8. Train Research LoRA

In [8]:

research_model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./research",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=300,
    fp16=True,
    logging_steps=20,
    save_strategy="no"
)

trainer = Trainer(model=research_model, args=training_args, train_dataset=research_ds)
trainer.train()

research_model.save_pretrained(RESEARCH_PATH)
print("Saved RESEARCH LoRA")




Step,Training Loss
20,2.893396
40,2.833511
60,2.812193
80,2.773328
100,2.703939
120,2.689964
140,2.743816
160,2.69005
180,2.734476
200,2.683607


Saved RESEARCH LoRA


## 9. Reload Base Model For Inference

In [9]:

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto"
)
print("Inference model ready")


Loading weights:   0%|          | 0/290 [00:00<?, ?it/s]

Inference model ready


## 10. Embedding Router

In [10]:

from sentence_transformers import SentenceTransformer
import numpy as np

router_model = SentenceTransformer("all-MiniLM-L6-v2")

tech_examples = ["debug python", "write code", "fix bug", "algorithm"]
research_examples = ["summarize paper", "experiment design", "related work", "research"]

tech_emb = router_model.encode(tech_examples, normalize_embeddings=True).mean(axis=0)
research_emb = router_model.encode(research_examples, normalize_embeddings=True).mean(axis=0)

def route_prompt(prompt):
    emb = router_model.encode([prompt], normalize_embeddings=True)[0]
    return ("tech", float(np.dot(emb, tech_emb))) if np.dot(emb, tech_emb) > np.dot(emb, research_emb) else ("research", float(np.dot(emb, research_emb)))


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## 11. RAG (FAISS)

In [11]:

import faiss

docs = [
    "LoRA adapts low rank matrices.",
    "Transformers use self attention.",
    "CNN DailyMail benchmarks summarization.",
    "FAISS enables fast retrieval."
]

doc_emb = router_model.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

def retrieve(q, k=2):
    qe = router_model.encode([q], normalize_embeddings=True)
    _, ids = index.search(qe, k)
    return [docs[i] for i in ids[0]]


## 12. Conversation Memory

In [12]:

chat_history = []

def build_context(p):
    global chat_history
    chat_history.append(p)
    chat_history = chat_history[-3:]
    return "\n".join(chat_history)


## 13. Adapter Loading + Merging

In [13]:

from peft import PeftModel

def load_lora(path):
    return PeftModel.from_pretrained(base_model, path)


## 14. Unified Generate Function

In [14]:

import time

def generate(prompt):
    start = time.time()
    domain, conf = route_prompt(prompt)
    context = build_context(prompt)
    rag = "\n".join(retrieve(prompt))

    full = f"""Context:
{rag}

Conversation:
{context}

User:
{prompt}
"""

    model = load_lora(TECH_PATH if domain=="tech" else RESEARCH_PATH)

    inputs = tokenizer(full, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)

    text = tokenizer.decode(out[0], skip_special_tokens=True)
    return f"[{domain.upper()} | {time.time()-start:.2f}s | {conf:.2f}]\n\n{text}"


## 15. Benchmarking

In [15]:

def benchmark():
    for name,path in [("TECH",TECH_PATH),("RESEARCH",RESEARCH_PATH)]:
        m = load_lora(path)
        t=time.time()
        _=m.generate(**tokenizer("test",return_tensors="pt").to("cuda"),max_new_tokens=10)
        print(name, time.time()-t)

benchmark()


TECH 0.967710018157959




RESEARCH 0.8148577213287354


## 16. Gradio UI

In [16]:

import gradio as gr

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(lines=4, placeholder="Ask coding or research questions"),
    outputs="text",
    title="Multi-LoRA RAG Assistant",
    description="Embedding router + RAG + LoRA + memory"
)

demo.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://93fb7d7163b4dcff27.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [17]:
tech_model.save_pretrained("tech_lora")
research_model.save_pretrained("research_lora")


In [18]:
!zip -r tech_lora.zip tech_lora
!zip -r research_lora.zip research_lora


updating: tech_lora/ (stored 0%)
updating: tech_lora/adapter_config.json (deflated 57%)
updating: tech_lora/adapter_model.safetensors (deflated 7%)
updating: tech_lora/README.md (deflated 65%)
updating: research_lora/ (stored 0%)
updating: research_lora/adapter_config.json (deflated 57%)
updating: research_lora/adapter_model.safetensors (deflated 7%)
updating: research_lora/README.md (deflated 65%)


In [19]:
 from google.colab import files

files.download("tech_lora.zip")
files.download("research_lora.zip")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [20]:
%%writefile requirements.txt
torch
transformers
accelerate
peft
bitsandbytes
sentence-transformers
faiss-cpu
gradio


Overwriting requirements.txt


In [21]:
from google.colab import files
files.download("requirements.txt")


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## 17. Summary
You now have a full multi-LoRA RAG system.