# KUchat Test Backend - Google Colab Free/T4 GPU

**Lightweight version for testing on Google Colab Free with T4 GPU**

This notebook uses smaller models suitable for free Colab resources:
- **Text Model**: Qwen/Qwen2-7B-Instruct (7B parameters; fits on a T4 when loaded with 4-bit quantization)
- **GPU**: T4 (15GB VRAM) - Available on Colab Free
- **Purpose**: Test functionality before deploying to A100
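
The claim that a 7B model fits on a T4 can be sanity-checked with weight-only arithmetic. The sketch below is a rough estimate, not a measurement: it assumes a round 7e9 parameters and ignores activations, the KV cache, quantization constants, and CUDA overhead. The helper name `estimate_weight_memory_gib` is hypothetical.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: a nominal 7e9 parameters for a "7B" model.
NUM_PARAMS = 7e9

fp16_gib = estimate_weight_memory_gib(NUM_PARAMS, 16)  # about 13.0 GiB
int4_gib = estimate_weight_memory_gib(NUM_PARAMS, 4)   # about 3.3 GiB

print(f"FP16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions, FP16 weights alone would nearly fill the T4's memory, while 4-bit weights leave room for activations and the embedding model.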

## Setup Instructions

1. **Runtime**: Runtime → Change runtime type → **T4 GPU**
2. **Run cells** in order
3. **Test** with the API endpoints
4. If everything works, upgrade to the full version on an A100 GPU

## 1️⃣ Install Dependencies

In [None]:
%%capture
# Install required packages
!pip install torch transformers accelerate bitsandbytes
!pip install langchain langchain-community chromadb sentence-transformers
!pip install fastapi uvicorn pydantic
!pip install duckduckgo-search wikipedia beautifulsoup4
!pip install pypdf python-multipart aiofiles
!pip install gradio
!pip install pyngrok nest_asyncio requests

print("✅ All packages installed successfully!")

## 2️⃣ Import Libraries

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import Optional, List
import uvicorn
import json
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
import chromadb
from duckduckgo_search import DDGS
import wikipedia
import nest_asyncio
from pyngrok import ngrok
import threading
import time

nest_asyncio.apply()

print("✅ Libraries imported successfully!")
print(f"🔥 GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"💎 GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 3️⃣ Configuration

In [None]:
# Model Configuration (Optimized for T4 GPU)
MODEL_NAME = "Qwen/Qwen2-7B-Instruct"  # 7B model; fits on a T4 with 4-bit quantization
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Use 4-bit quantization to save memory
USE_4BIT = True

# Docs folder path
DOCS_FOLDER = "./docs"

# Server config
PORT = 8000

print("✅ Configuration set!")
print(f"📦 Model: {MODEL_NAME}")
print(f"🔢 4-bit Quantization: {USE_4BIT}")
print(f"📁 Docs folder: {DOCS_FOLDER}")

## 4️⃣ Download Documents from GitHub

**Option 1: Download docs folder from GitHub (Recommended)**

In [None]:
# Download only docs folder using sparse checkout
!git clone --depth 1 --filter=blob:none --sparse https://github.com/themistymoon/KUchat.git
%cd KUchat
!git sparse-checkout set docs

# Move docs to current directory
!mv docs /content/docs
%cd /content
!rm -rf KUchat

print("✅ Documents downloaded from GitHub!")
print(f"📁 Documents location: {DOCS_FOLDER}")

# List downloaded files
!ls -lh docs/

**Option 2: Upload from Google Drive (Alternative)**

Uncomment and run this cell if you want to use files from Google Drive instead:

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')
# !cp -r /content/drive/MyDrive/KUchat/docs ./docs
# print("✅ Documents copied from Google Drive!")

## 5️⃣ Load AI Model (7B - T4 Compatible)

In [None]:
print("="*60)
print("KUCHAT TEST BACKEND - T4 GPU VERSION")
print("="*60)
print(f"\n[1/4] Loading AI model: {MODEL_NAME}...")
print("⏳ This will take 2-5 minutes on first run...\n")

# Configure 4-bit quantization for T4 GPU
if USE_4BIT:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16  # T4 (Turing) has no native bfloat16 support
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
    print("✅ Model loaded with 4-bit quantization (weights take roughly a quarter of the FP16 footprint)")
else:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        trust_remote_code=True
    )
    print("✅ Model loaded in FP16")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

print(f"✅ Model loaded successfully!")
# num_parameters() accounts for 4-bit packed weights; summing p.numel() would undercount them
print(f"💾 Model size: ~{model.num_parameters() / 1e9:.1f}B parameters")
if torch.cuda.is_available():
    print(f"🔥 GPU memory used: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

## 6️⃣ Initialize RAG System

In [None]:
print("\n[2/4] Initializing RAG system...")

class RAGSystem:
    def __init__(self):
        self.embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        self.vectorstore = None
        self.docs_loaded = False

    def load_documents(self, folder_path: str):
        if not os.path.exists(folder_path):
            print(f"⚠️  Folder {folder_path} not found")
            return

        documents = []
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                if file.endswith('.pdf'):
                    file_path = os.path.join(root, file)
                    try:
                        loader = PyPDFLoader(file_path)
                        documents.extend(loader.load())
                        print(f"  📄 Loaded: {file}")
                    except Exception as e:
                        print(f"  ❌ Error loading {file}: {e}")
                elif file.endswith(('.txt', '.md')):
                    file_path = os.path.join(root, file)
                    try:
                        loader = TextLoader(file_path, encoding="utf-8")  # docs may contain Thai text
                        documents.extend(loader.load())
                        print(f"  📄 Loaded: {file}")
                    except Exception as e:
                        print(f"  ❌ Error loading {file}: {e}")

        if documents:
            texts = self.text_splitter.split_documents(documents)
            self.vectorstore = Chroma.from_documents(texts, self.embeddings)
            self.docs_loaded = True
            print(f"✅ Successfully loaded {len(documents)} documents ({len(texts)} chunks)")
        else:
            print("⚠️  No documents found in folder")

    def query(self, question: str, k: int = 3):
        if not self.docs_loaded or self.vectorstore is None:
            return None
        results = self.vectorstore.similarity_search(question, k=k)
        return "\n\n".join([doc.page_content for doc in results])

# Initialize RAG
rag_system = RAGSystem()
print("✅ RAG system initialized!")
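
The `chunk_size=1000` and `chunk_overlap=200` settings mean that consecutive chunks share up to 200 characters, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below illustrates that idea with a plain sliding window. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which also prefers to split on separators such as paragraph breaks. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap improves the chance that a retrieved chunk contains the full answer, at the cost of more chunks to embed and store.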

## 7️⃣ Load Documents into RAG

In [None]:
print(f"\n[3/4] Loading documents from {DOCS_FOLDER}...\n")
rag_system.load_documents(DOCS_FOLDER)
if rag_system.docs_loaded:
    print("\n✅ Documents loaded into RAG system!")
else:
    print("\n⚠️  No documents loaded; RAG context will be skipped")

## 8️⃣ Web Search System

In [None]:
class WebSearchSystem:
    @staticmethod
    def search_duckduckgo(query: str, max_results: int = 3):
        try:
            with DDGS() as ddgs:
                results = list(ddgs.text(query, max_results=max_results))
                return "\n\n".join([f"**{r['title']}**\n{r['body']}" for r in results])
        except Exception as e:
            return f"Web search failed: {str(e)}"

    @staticmethod
    def search_wikipedia(query: str):
        try:
            wikipedia.set_lang('th')
            summary = wikipedia.summary(query, sentences=3)
            return summary
        except Exception:  # Thai lookup failed; fall back to English
            try:
                wikipedia.set_lang('en')
                summary = wikipedia.summary(query, sentences=3)
                return summary
            except Exception as e:
                return f"Wikipedia search failed: {str(e)}"

web_search = WebSearchSystem()
print("✅ Web search system initialized!")

## 9️⃣ FastAPI Backend

In [None]:
app = FastAPI(title="KUchat Test Backend - T4 GPU")

class QueryRequest(BaseModel):
    question: str
    max_tokens: int = 512
    temperature: float = 0.7
    use_rag: bool = True
    use_web_search: bool = False

@app.get("/")
async def root():
    return {
        "status": "running",
        "model": MODEL_NAME,
        "version": "test-t4",
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
    }

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "model_loaded": True,
        "rag_loaded": rag_system.docs_loaded,
        "gpu_available": torch.cuda.is_available()
    }

@app.post("/query")
async def query(request: QueryRequest):
    try:
        # Build context
        context = ""
        
        if request.use_rag and rag_system.docs_loaded:
            rag_context = rag_system.query(request.question)
            if rag_context:
                # Thai label: "Relevant documents:"
                context += f"\n\nเอกสารที่เกี่ยวข้อง:\n{rag_context}"
        
        if request.use_web_search:
            web_results = web_search.search_duckduckgo(request.question)
            # Thai label: "Information from web search:"
            context += f"\n\nข้อมูลจากการค้นหาเว็บ:\n{web_results}"
        
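        # Build prompt. The Thai text reads: "You are an assistant that answers
        # questions about Kasetsart University curricula", "Question:", "Answer:"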
        prompt = f"""คุณคือผู้ช่วยตอบคำถามเกี่ยวกับหลักสูตรมหาวิทยาลัยเกษตรศาสตร์

คำถาม: {request.question}
{context}

คำตอบ:"""
        
        # Generate response
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        outputs = model.generate(
            **inputs,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
        
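        # Decode only the newly generated tokens; splitting the full text on
        # the answer marker breaks if the retrieved context or the output repeats it
        prompt_len = inputs["input_ids"].shape[1]
        answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()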
        
        return {
            "response": answer,
            "model": MODEL_NAME,
            "used_rag": request.use_rag and rag_system.docs_loaded,
            "used_web_search": request.use_web_search
        }
    
    except Exception as e:
        return {"error": str(e)}

print("✅ FastAPI backend created!")
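
The prompt assembly in the `/query` handler can be checked without a GPU by factoring it into a pure function. The sketch below mirrors the handler's string layout; `build_prompt` and `SYSTEM_LINE` are hypothetical names that do not exist in the notebook.

```python
from typing import Optional

# Thai: "You are an assistant that answers questions about Kasetsart University curricula"
SYSTEM_LINE = "คุณคือผู้ช่วยตอบคำถามเกี่ยวกับหลักสูตรมหาวิทยาลัยเกษตรศาสตร์"


def build_prompt(
    question: str,
    rag_context: Optional[str] = None,
    web_results: Optional[str] = None,
) -> str:
    """Assemble the prompt in the same layout the /query handler uses."""
    context = ""
    if rag_context:
        # Thai label: "Relevant documents:"
        context += f"\n\nเอกสารที่เกี่ยวข้อง:\n{rag_context}"
    if web_results:
        # Thai label: "Information from web search:"
        context += f"\n\nข้อมูลจากการค้นหาเว็บ:\n{web_results}"
    # "คำถาม:" means "Question:" and "คำตอบ:" means "Answer:"
    return f"{SYSTEM_LINE}\n\nคำถาม: {question}\n{context}\n\nคำตอบ:"


print(build_prompt("example question", rag_context="example retrieved chunk"))
```

Printing the result for a few inputs makes it easy to see how much of the context window the retrieved chunks consume before any tokens are generated.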

## 🔟 Setup Ngrok Authentication

In [None]:
# Get Ngrok auth token from https://dashboard.ngrok.com/get-started/your-authtoken
NGROK_AUTH_TOKEN = ""  # Paste your token here

if NGROK_AUTH_TOKEN:
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
    print("✅ Ngrok authenticated!")
else:
    print("⚠️  Please set NGROK_AUTH_TOKEN above")
    print("   Get it from: https://dashboard.ngrok.com/get-started/your-authtoken")

## 1️⃣1️⃣ Start Backend Server

In [None]:
print("\n[4/4] Starting backend server...\n")

# Start Uvicorn in background thread
def run_server():
    uvicorn.run(app, host="0.0.0.0", port=PORT, log_level="info")

server_thread = threading.Thread(target=run_server, daemon=True)
server_thread.start()

# Wait for server to start
time.sleep(5)

# Start Ngrok tunnel; connect() returns an NgrokTunnel object, so keep only its URL string
public_url = ngrok.connect(PORT).public_url

print("="*60)
print("🎉 BACKEND SERVER IS RUNNING!")
print("="*60)
print(f"\n🌐 Public URL: {public_url}")
print(f"📝 API Documentation: {public_url}/docs")
print(f"\n💡 Use this URL in your frontend_app.py:")
print(f"   API_URL = \"{public_url}\"")
print("\n⏸️  Keep this notebook session connected; the server stops when the runtime disconnects!")
print("="*60)
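
The fixed `time.sleep(5)` in the cell above is a guess at how long Uvicorn needs to start. One alternative is to poll a local endpoint until it responds. The sketch below uses only the standard library; `wait_until_ready` is a hypothetical helper and the timeout values are arbitrary assumptions.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error responses
            pass
        time.sleep(interval_s)
    return False
```

In this notebook it could replace the sleep with something like `wait_until_ready(f"http://127.0.0.1:{PORT}/health")`, opening the Ngrok tunnel only if the call returns `True`.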

## 1️⃣2️⃣ Test the API

In [None]:
import requests

# Test health endpoint
response = requests.get(f"{public_url}/health")
print("Health Check:")
print(response.json())

# Test query
test_query = {
    # Thai, roughly: "What does the computer engineering curriculum include?"
    "question": "หลักสูตรวิศวกรรมคอมพิวเตอร์มีอะไรบ้าง",
    "max_tokens": 256,
    "temperature": 0.7,
    "use_rag": True,
    "use_web_search": False
}

print("\nTest Query:")
response = requests.post(f"{public_url}/query", json=test_query)
print(response.json())

---

## 🎯 Next Steps

1. **Copy the public URL** from above
2. **Update frontend_app.py** with the URL
3. **Run frontend** on your local computer
4. **Test the chatbot**!

If everything works well, you can upgrade to the full version with:
- **GPU**: A100 (80GB)
- **Model**: Qwen3-Omni-30B + GPT-OSS-120B
- **Performance**: 20-50 tokens/second

---

## 📊 Performance Comparison

| Feature | Test (T4) | Production (A100) |
|---------|-----------|-------------------|
| GPU | T4 (15GB) | A100 (80GB) |
| Model | Qwen2-7B | Qwen3-Omni-30B + GPT-OSS-120B |
| Speed | 5-10 tok/s | 20-50 tok/s |
| Quality | Good | Excellent |
| Cost | Free | ~$1/hour |
| Purpose | Testing | Production |

---