# KUchat Production Deployment

**Enterprise-Grade AI Assistant for Kasetsart University Curriculum Information**

## Overview

This notebook provides a complete production deployment of the KUchat AI assistant system, featuring:

- GPT-OSS-20B language model (20 billion parameters)
- 4-bit quantization using Unsloth optimization framework
- Retrieval-Augmented Generation (RAG) system with Qdrant + BGE-M3-Thai + Reranking
- Web search integration (DuckDuckGo and Wikipedia)
- Public-facing Gradio web interface
- Zero local configuration required

## System Requirements

1. **Google Colab Pro+ subscription** (A100 GPU access required)
2. **Documentation repository** (automatically downloaded from GitHub)
3. **Internet connection** for model downloads and web search

## Deployment Instructions

1. Navigate to Runtime → Change runtime type → Select **A100 GPU**
2. Execute all cells sequentially from top to bottom
3. Wait for automatic documentation download from repository
4. Access the generated public URL for the demo interface

## Technical Specifications

- **Model**: GPT-OSS-20B (20B parameters)
- **Quantization**: 4-bit BitsAndBytes (Unsloth optimized)
- **Memory Footprint**: Approximately 12GB VRAM (model only)
- **Total VRAM Usage**: ~22GB (including RAG components)
- **Inference Speed**: 40-80 tokens per second
- **Optimization**: 2x faster inference, 75% memory reduction vs FP16
- **Quality**: Production-ready with enterprise-grade responses
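
The memory figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate under simplifying assumptions (exactly 20 billion weights, every weight quantized, no activation or KV-cache overhead); the helper name `weight_memory_gib` is illustrative and not part of the notebook.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


NUM_PARAMS = 20e9  # assumption: exactly 20 billion parameters

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
reduction = 1 - int4_gib / fp16_gib

print(f"FP16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
print(f"Reduction:     {reduction:.0%}")
```

The quoted ~12GB footprint is somewhat higher than this raw 4-bit estimate, presumably because quantization constants and any layers kept at higher precision add overhead.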

---

## Step 1: System Configuration

In [None]:
# Model Configuration
MODEL_NAME = "unsloth/gpt-oss-20b-unsloth-bnb-4bit"

# Embedding model: BGE-M3-Thai (embeddings are L2-normalized at encode time)
EMBEDDING_MODEL = "jaeyong2/bge-m3-Thai"
RERANKER_MODEL = "BAAI/bge-reranker-v2-m3"  # Cross-encoder for reranking

DOCS_FOLDER = "./docs"
HF_TOKEN = "YOUR_HUGGINGFACE_TOKEN_HERE"  # Replace with your token from https://huggingface.co/settings/tokens

print("="*60)
print("Configuration initialized")
print("="*60)
print(f"LLM Model: {MODEL_NAME}")
print(f"Embedding Model: {EMBEDDING_MODEL}")
print(f"  - Dimensions: 1024 (BGE-M3)")
print(f"  - Normalized: Yes")
print(f"  - Thai-optimized: Yes")
print(f"Reranker Model: {RERANKER_MODEL}")
print(f"Vector DB: Qdrant (in-memory)")
print(f"Expected VRAM: ~22GB (12GB model + 10GB RAG)")
print("="*60)

Configuration initialized
LLM Model: unsloth/gpt-oss-20b-unsloth-bnb-4bit
Embedding Model: jaeyong2/bge-m3-Thai
  - Dimensions: 1024 (BGE-M3)
  - Normalized: Yes
  - Thai-optimized: Yes
Reranker Model: BAAI/bge-reranker-v2-m3
Vector DB: Qdrant (in-memory)
Expected VRAM: ~22GB (12GB model + 10GB RAG)
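
Hard-coding a token in a shared notebook makes it easy to leak. One alternative is to read it from the environment, as in the minimal sketch below. The environment variable name `HF_TOKEN` and the helper `resolve_hf_token` are assumptions for illustration, not something the notebook defines.

```python
import os
from typing import Optional

PLACEHOLDER = "YOUR_HUGGINGFACE_TOKEN_HERE"


def resolve_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token from the environment, or None if unset or still the placeholder."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == PLACEHOLDER:
        return None
    return token


token = resolve_hf_token()
print("Token found" if token else "No token set in the environment")
```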


## Step 2: Install Dependencies and Verify GPU

In [2]:
# Install quietly with -q; a cell-level %%capture would also hide the GPU report printed below
!pip install -q -U unsloth transformers accelerate bitsandbytes
!pip install -q langchain langchain-community langchain-huggingface
!pip install -q sentence-transformers torch tiktoken
!pip install -q qdrant-client
!pip install -q fastapi uvicorn pyngrok
!pip install -q gradio duckduckgo-search wikipedia
!pip install -q pypdf

import warnings
warnings.filterwarnings("ignore")

import torch
print(f"\nPyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

print("\nAll dependencies installed (including Qdrant)")
print("Ready for BGE-M3-Thai + Reranking + Qdrant")

## Step 4: Download Documentation from Repository

Documentation files are automatically downloaded from the GitHub repository.

In [3]:
print("Downloading documentation from GitHub repository...\n")

# Clone repository with sparse checkout (docs folder only)
!git clone --depth 1 --filter=blob:none --sparse https://github.com/themistymoon/KUchat.git
%cd KUchat
!git sparse-checkout set docs

# Move docs folder from /content/KUchat/docs to /content/docs
%cd /content
!mv KUchat/docs docs
!rm -rf KUchat

print("\nDocumentation downloaded successfully")
print(f"Location: {DOCS_FOLDER}")

# Display folder structure
!echo "\nFolder structure:"
!ls -lh docs/ | head -20

Downloading documentation from GitHub repository...

Cloning into 'KUchat'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 24 (delta 0), reused 22 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (24/24), 7.15 KiB | 7.15 MiB/s, done.
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 11 (delta 1), reused 8 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 44.77 KiB | 3.20 MiB/s, done.
Resolving deltas: 100% (1/1), done.
/content/KUchat
remote: Enumerating objects: 133, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (133/133), done.
remote: Total 133 (delta 0), reused 132 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (133/133), 150.64 MiB | 40.73 MiB/s, done.
Updating files: 10
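
After the download it can be useful to confirm that the folder contains what the later steps expect. The sketch below counts PDFs and checks for the catalog file `curricula_catalog.json` (the name used by `load_catalog` in Step 6). The helper `summarize_docs_folder` is hypothetical.

```python
from pathlib import Path


def summarize_docs_folder(docs_folder: str) -> dict:
    """Count PDFs recursively and report whether the program catalog is present."""
    root = Path(docs_folder)
    exists = root.is_dir()
    pdf_count = sum(1 for _ in root.rglob("*.pdf")) if exists else 0
    return {
        "exists": exists,
        "pdf_count": pdf_count,
        "has_catalog": (root / "curricula_catalog.json").is_file(),
    }


print(summarize_docs_folder("./docs"))
```

A `pdf_count` of zero or a missing catalog at this point would explain a later failure in `load_documents`, which returns `False` in both cases.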

## Step 5: Load GPT-OSS-20B Language Model

In [4]:
import unsloth
import time
from unsloth import FastLanguageModel

print("=" * 60)
print("[1/4] Loading GPT-OSS-20B Language Model")
print("=" * 60)
print("Estimated loading time: 2-3 minutes (first run)")
print("\nModel: GPT-OSS-20B (20 billion parameters)")
print("Quantization: 4-bit BitsAndBytes (Unsloth optimized)")
print("Expected VRAM usage: ~12GB\n")

start_time = time.time()

# Configure model parameters
max_seq_length = 2048  # Maximum context length

# Load model using Unsloth's FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=max_seq_length,
    dtype=None,  # Automatic dtype detection
    load_in_4bit=True,  # Enable 4-bit quantization
)

# Optimize for inference
FastLanguageModel.for_inference(model)

load_time = time.time() - start_time

print(f"\nModel loaded successfully in {load_time:.1f} seconds")
print(f"Model: GPT-OSS-20B (20B parameters)")
print(f"Quantization: 4-bit BitsAndBytes")
print(f"Maximum sequence length: {max_seq_length} tokens")
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    print(f"Total VRAM usage: ~{torch.cuda.memory_allocated() / 1024**3:.0f} GB")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[1/4] Loading GPT-OSS-20B Language Model
Estimated loading time: 2-3 minutes (first run)

Model: GPT-OSS-20B (20 billion parameters)
Quantization: 4-bit BitsAndBytes (Unsloth optimized)
Expected VRAM usage: ~12GB

==((====))==  Unsloth 2025.10.3: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model loaded successfully in 56.6 seconds
Model: GPT-OSS-20B (20B parameters)
Quantization: 4-bit BitsAndBytes
Maximum sequence length: 2048 tokens
GPU memory allocated: 11.67 GB
GPU memory reserved: 19.30 GB
Total VRAM usage: ~12 GB


## Step 6: Initialize RAG System

In [None]:
import os
import json
import numpy as np
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchAny
import re

class RAGSystem:
    def __init__(self):
        """
        Initialize RAG System with Lenient Search Strategy

        Strategy: Broad → Rerank → Post-filter + Boost + Fallback
        """
        print("Initializing RAGSystem with strategy: Broad → Rerank → Post-filter + Boost")

        self.embedding_model = None
        self.reranker = None
        self.qdrant_client = None
        self.collection_name = "ku_curricula"
        self.docs_loaded = False
        self.catalog = {}
        self.gened_catalog = None  # General Education catalog
        self.all_chunks = []
        self.chunk_metadata = []

        # Thai keyword patterns
        self.year_keywords = {
            "1": ["ปีหนึ่ง", "ปี 1", "ปี1", "ปีที่ 1", "ปีที่หนึ่ง", "ชั้นปีที่ 1", "ชั้นปีที่หนึ่ง"],
            "2": ["ปีสอง", "ปี 2", "ปี2", "ปีที่ 2", "ปีที่สอง", "ชั้นปีที่ 2", "ชั้นปีที่สอง"],
            "3": ["ปีสาม", "ปี 3", "ปี3", "ปีที่ 3", "ปีที่สาม", "ชั้นปีที่ 3", "ชั้นปีที่สาม"],
            "4": ["ปีสี่", "ปี 4", "ปี4", "ปีที่ 4", "ปีที่สี่", "ชั้นปีที่ 4", "ชั้นปีที่สี่"]
        }

        self.semester_keywords = {
            "1": ["ภาคต้น", "ภาค 1", "ภาค1", "ภาคการศึกษาที่ 1", "เทอม 1", "เทอม1"],
            "2": ["ภาคปลาย", "ภาค 2", "ภาค2", "ภาคการศึกษาที่ 2", "เทอม 2", "เทอม2"],
            "summer": ["ภาคฤดูร้อน", "ภาคร้อน", "summer"]
        }

        # Faculty/Major mappings
        self.faculty_mappings = {
            # คณะวิทยาศาสตร์
            "วิทยาการคอมพิวเตอร์": ["คอมพิวเตอร์", "computer science", "cs", "คอม", "compsci"],
            "เคมี": ["chemistry", "เคมี", "chem"],
            "ฟิสิกส์": ["physics", "ฟิสิกส์", "phy"],
            "คณิตศาสตร์": ["mathematics", "math", "คณิต"],
            "ชีววิทยา": ["biology", "bio", "ชีวะ"],
            "จุลชีววิทยา": ["microbiology", "micro", "จุลชีว"],

            # คณะวิศวกรรมศาสตร์
            "วิศวกรรมไฟฟ้า": ["electrical engineering", "ee", "ไฟฟ้า"],
            "วิศวกรรมเครื่องกล": ["mechanical engineering", "me", "เครื่องกล"],
            "วิศวกรรมโยธา": ["civil engineering", "ce", "โยธา"],
            "วิศวกรรมเคมี": ["chemical engineering", "เคมี"],
            "วิศวกรรมอุตสาหการ": ["industrial engineering", "ie", "อุตสาหการ"],

            # คณะเกษตร
            "เกษตรศาสตร์": ["agriculture", "เกษตร", "agri"],
            "พืชศาสตร์": ["plant science", "พืช"],
            "สัตวศาสตร์": ["animal science", "สัตว์"],
            "กีฏวิทยา": ["entomology", "แมลง"],

            # คณะบริหารธุรกิจ
            "บริหารธุรกิจ": ["business administration", "ba", "บธ", "business"],
            "บัญชี": ["accounting", "acc", "บัญชี"],
            "การจัดการ": ["management", "จัดการ"],
            "การตลาด": ["marketing", "mkt"],
            "การเงิน": ["finance", "fin"],

            # คณะมนุษยศาสตร์
            "ภาษาไทย": ["thai", "ไทย"],
            "ภาษาอังกฤษ": ["english", "eng", "อังกฤษ"],
            "ภาษาจีน": ["chinese", "จีน"],
            "ภาษาญี่ปุ่น": ["japanese", "ญี่ปุ่น"],

            # คณะสังคมศาสตร์
            "เศรษฐศาสตร์": ["economics", "econ", "เศรษฐ"],
            "รัฐศาสตร์": ["political science", "รัฐศาสตร์"],
            "สังคมวิทยา": ["sociology", "สังคม"],

            # คณะศึกษาศาสตร์
            "ครุศาสตร์": ["education", "ศึกษาศาสตร์", "ครู"]
        }

        print("✓ RAG System initialized")
        print("  Next step: load_models() then load_documents()")

    def load_models(self):
        """Load embedding model and reranker"""
        print("Loading embedding & reranker models...")

        # Embedding: BGE-M3-Thai (1024 dimensions), name taken from the configuration cell
        self.embedding_model = SentenceTransformer(EMBEDDING_MODEL)
        print("✓ Loaded BGE-M3-Thai (1024d, normalized)")

        # Reranker: BGE-Reranker-v2-m3, name taken from the configuration cell
        self.reranker = CrossEncoder(RERANKER_MODEL)
        print("✓ Loaded BGE-Reranker-v2-m3 (Cross-Encoder)")

        # Initialize Qdrant client
        self.qdrant_client = QdrantClient(":memory:")
        print("✓ Initialized Qdrant (in-memory)")

    def load_catalog(self, docs_folder: str):
        """Load program catalog from JSON"""
        catalog_path = os.path.join(docs_folder, "curricula_catalog.json")

        if not os.path.exists(catalog_path):
            print(f"⚠ Warning: Catalog not found at {catalog_path}")
            return False

        try:
            with open(catalog_path, 'r', encoding='utf-8') as f:
                catalog_data = json.load(f)

            # Build file_path → metadata mapping
            for program_id, metadata in catalog_data.items():
                file_path = metadata['file_path']
                self.catalog[file_path] = metadata

            print(f"✓ Loaded catalog: {len(self.catalog)} programs")
            print(f"  Faculties: {len(set(m['faculty'] for m in self.catalog.values()))}")
            
            # Load General Education catalog
            gened_path = os.path.join(docs_folder, "general_education_catalog.json")
            if os.path.exists(gened_path):
                try:
                    with open(gened_path, 'r', encoding='utf-8') as f:
                        self.gened_catalog = json.load(f)
                    print(f"✓ Loaded Gen Ed catalog: {self.gened_catalog['total_courses']} courses")
                    print(f"  Categories: {', '.join([cat['category_th'] for cat in self.gened_catalog['categories']])}")
                except Exception as e:
                    print(f"⚠ Warning: Could not load Gen Ed catalog: {e}")
                    self.gened_catalog = None

            return True
        except Exception as e:
            print(f"Error loading catalog: {e}")
            return False

    def load_documents(self, docs_folder: str):
        """Load curriculum PDFs with proper metadata"""
        if not self.load_catalog(docs_folder):
            return False

        print(f"\nLoading documents from {docs_folder}...")

        # Find all curriculum PDFs
        pdf_files = []
        for root, dirs, files in os.walk(docs_folder):
            for file in files:
                if file.endswith('.pdf'):
                    file_path = os.path.join(root, file)
                    pdf_files.append(file_path)

        if not pdf_files:
            print("No PDF files found!")
            return False

        print(f"Found {len(pdf_files)} PDF files")
        print(f"Catalog has metadata for {len(self.catalog)} programs")

        # Text splitter
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1500,
            chunk_overlap=300,
            separators=["\n\n", "\n", ".", "!", "?", " ", ""],
            length_function=len
        )

        # Process PDFs
        all_chunks = []
        all_metadata = []

        for file_path in pdf_files:
            try:
                loader = PyPDFLoader(file_path)
                docs = loader.load()

                # Get metadata from catalog
                catalog_metadata = self.catalog.get(file_path, {})

                # Split into chunks
                chunks = text_splitter.split_documents(docs)

                for i, chunk in enumerate(chunks):
                    metadata = {
                        'file_path': file_path,
                        'file_name': os.path.basename(file_path),
                        'chunk_index': i,
                        'total_chunks': len(chunks),
                        'program': catalog_metadata.get('program', os.path.basename(file_path)),
                        'faculty': catalog_metadata.get('faculty', 'Unknown'),
                        'degree': catalog_metadata.get('degree', 'Unknown'),
                        'id': catalog_metadata.get('id', ''),
                        'keywords': catalog_metadata.get('keywords', [])
                    }

                    all_chunks.append(chunk.page_content)
                    all_metadata.append(metadata)

            except Exception as e:
                print(f"Error processing {file_path}: {e}")

        if not all_chunks:
            print("No chunks created!")
            return False

        # Create Qdrant collection
        try:
            self.qdrant_client.recreate_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(
                    size=1024,  # BGE-M3-Thai dimension
                    distance=Distance.COSINE
                )
            )
            print(f"✓ Created Qdrant collection: {self.collection_name}")
        except Exception as e:
            print(f"Error creating collection: {e}")
            return False

        # Generate embeddings and upload
        print(f"\nGenerating embeddings for {len(all_chunks)} chunks...")

        batch_size = 32
        for i in range(0, len(all_chunks), batch_size):
            batch_chunks = all_chunks[i:i+batch_size]
            batch_metadata = all_metadata[i:i+batch_size]

            # Generate embeddings
            embeddings = self.embedding_model.encode(
                batch_chunks,
                normalize_embeddings=True,
                show_progress_bar=False
            )

            # Upload to Qdrant
            points = [
                PointStruct(
                    id=i+j,
                    vector=embeddings[j].tolist(),
                    payload={
                        'text': batch_chunks[j],
                        **batch_metadata[j]
                    }
                )
                for j in range(len(batch_chunks))
            ]

            self.qdrant_client.upsert(
                collection_name=self.collection_name,
                points=points
            )

            # Report progress every 10 batches (320 chunks)
            if (i // batch_size + 1) % 10 == 0:
                print(f"  Processed {min(i+batch_size, len(all_chunks))}/{len(all_chunks)} chunks")

        self.all_chunks = all_chunks
        self.chunk_metadata = all_metadata
        self.docs_loaded = True

        print(f"\n✓ Loaded {len(all_chunks)} chunks from {len(pdf_files)} PDFs")
        print(f"  Embedding dimensions: 1024 (BGE-M3-Thai)")
        print(f"  Vector DB: Qdrant (cosine similarity)")

        return True

    def extract_keywords_from_question(self, question: str) -> List[str]:
        """Extract keywords from Thai question"""
        keywords = []

        # Extract year
        for year, patterns in self.year_keywords.items():
            if any(pattern in question for pattern in patterns):
                keywords.append(f"ปี {year}")
                keywords.append(f"ชั้นปีที่ {year}")

        # Extract semester
        for sem, patterns in self.semester_keywords.items():
            if any(pattern in question for pattern in patterns):
                if sem != "summer":
                    keywords.append(f"ภาค {sem}")
                    keywords.append(f"ภาคการศึกษาที่ {sem}")
                else:
                    keywords.append("ภาคฤดูร้อน")

        # Extract major/program (lenient: check if ANY synonym appears)
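        # ASCII synonyms must match as whole words so short codes such as "me" or "ce"
        # do not fire inside unrelated English words; Thai synonyms use substring matching.
        question_lower = question.lower()

        def synonym_in_question(syn: str) -> bool:
            syn = syn.lower()
            if syn.isascii():
                return re.search(rf"(?<![a-z0-9]){re.escape(syn)}(?![a-z0-9])", question_lower) is not None
            return syn in question_lower

        for major, synonyms in self.faculty_mappings.items():
            if any(synonym_in_question(syn) for syn in synonyms):
                keywords.append(major)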

        return keywords

    def query(self, question: str, k: int = 5, initial_k: int = 50, score_threshold: float = 0.3):
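        """
        Lenient query strategy:
        1. Broad search (no pre-filtering; retrieve `initial_k` candidates, 50 by default)
        2. Rerank with the cross-encoder
        3. Add keyword boosts, re-sort, and keep the top `k`

        Note: `score_threshold` is currently unused; no candidate is dropped by score.
        """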
        if not self.docs_loaded:
            print("No documents loaded!")
            return None

        print(f"\n{'='*70}")
        print(f"Query: {question}")
        print(f"{'='*70}")

        # Extract keywords for BOOSTING (not filtering!)
        keywords = self.extract_keywords_from_question(question)

        if keywords:
            print(f"Keywords detected: {keywords[:5]}")

        # Stage 1: BROAD Semantic Search (ALL programs)
        query_embedding = self.embedding_model.encode(
            question,
            normalize_embeddings=True
        )

        print(f"\n[Stage 1] Semantic Search (Qdrant)")
        print(f"  Searching ALL programs (no pre-filter)")
        print(f"  Query embedding norm: {np.linalg.norm(query_embedding):.4f}")

        # Search ALL programs
        results = self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=initial_k
        )

        if not results:
            print(f"\nNo results found")
            return None

        print(f"  Retrieved {len(results)} candidates")
        print(f"  Top 3 semantic scores: {[f'{r.score:.3f}' for r in results[:3]]}")

        # Stage 2: Reranking
        print(f"\n[Stage 2] Reranking (Cross-Encoder)")
        print(f"  Reranking {len(results)} candidates...")

        pairs = [[question, result.payload['text']] for result in results]
        rerank_scores = self.reranker.predict(pairs)

        reranked_results = [
            (result, float(rerank_score))
            for result, rerank_score in zip(results, rerank_scores)
        ]

        reranked_results.sort(key=lambda x: x[1], reverse=True)

        print(f"  Reranking complete")
        print(f"  Top 3 rerank scores: {[f'{score:.3f}' for _, score in reranked_results[:3]]}")

        # Stage 3: Post-filter + Boost
        print(f"\n[Stage 3] Post-filter + Boost + Fallback")

        filtered_results = []
        for result, rerank_score in reranked_results:
            # Keyword Boosting
            boost = 0.0
            metadata = result.payload
            text_content = metadata.get('text', '').lower()

            if keywords:
                # Check if chunk contains keywords
                for keyword in keywords:
                    if keyword.lower() in text_content:
                        boost += 0.1

                # Check if metadata matches program/faculty
                for keyword in keywords:
                    if keyword.lower() in metadata.get('program', '').lower():
                        boost += 0.2
                    if keyword.lower() in metadata.get('faculty', '').lower():
                        boost += 0.1

            final_score = rerank_score + boost
            filtered_results.append((result, final_score))

        # Sort by boosted score
        filtered_results.sort(key=lambda x: x[1], reverse=True)

        # Take top K
        final_results = filtered_results[:k]

        if not final_results:
            print("  No results after filtering!")
            return None

        print(f"  Final results: {len(final_results)}")
        print(f"  Top 3 boosted scores: {[f'{score:.3f}' for _, score in final_results[:3]]}")

        # Build context
        result_texts = []
        source_files_metadata = []
        seen_files = set()

        for i, (result, score) in enumerate(final_results):
            metadata = result.payload
            text = metadata['text']

            # Source tracking
            file_name = metadata.get('file_name', 'Unknown')
            if file_name not in seen_files:
                source_files_metadata.append({
                    'file_name': file_name,
                    'program': metadata.get('program', 'Unknown'),
                    'faculty': metadata.get('faculty', 'Unknown')
                })
                seen_files.add(file_name)

            # Add rank and source info
            chunk_info = f"[Document {i+1}] {metadata.get('program', 'Unknown')}"
            result_texts.append(f"{chunk_info}\n\n{text}")

        print(f"\n[Result Summary]")
        print(f"  Total chunks: {len(result_texts)}")
        print(f"  Unique programs: {len(seen_files)}")
        print(f"  Sources: {', '.join([m['program'] for m in source_files_metadata[:3]])}")

        # Show related programs
        if len(seen_files) > 1:
            related_programs = set()
            for metadata in [r[0].payload for r in final_results]:
                related_programs.add(metadata.get('id', ''))

            related_programs_ids = [pid for pid in related_programs if pid]

            suggestions = []
            for rel_id in list(related_programs_ids)[:3]:
                for file_path, metadata in self.catalog.items():
                    if metadata.get('id') == rel_id:
                        suggestions.append(f"- {metadata['program']}")
                        break

            if suggestions:
                result_texts.append(
                    f"\nหลักสูตรที่เกี่ยวข้อง:\n" + "\n".join(suggestions)
                )

        # AUTO-APPEND: Add General Education info for year/course queries
        gen_ed_appended = False
        if re.search(r'(ปี\s*\d+|รายวิชา|เรียนอะไร|วิชาเลือก|วิชาเสรี|ศึกษาทั่วไป)', question.lower()):
            if self.gened_catalog:
                # Build Gen Ed context from JSON catalog
                gen_ed_context = f"""
[วิชาศึกษาทั่วไป - General Education]
นักศึกษาทุกหลักสูตรต้องลงทะเบียนวิชาศึกษาทั่วไป รวม {self.gened_catalog['credit_requirements']['total_minimum']} หน่วยกิต

**วิชาบังคับ:**
"""
                # Add required courses
                for req in self.gened_catalog['required_courses']:
                    gen_ed_context += f"- {req['course_code']} {req['course_name_th']} ({req['credits']}) - บังคับทุกหลักสูตร\n"
                
                gen_ed_context += "\n**วิชาเลือก (เลือกตามกลุ่มสาระ):**\n"
                
                # Add categories with sample courses
                for category in self.gened_catalog['categories']:
                    cat_req = self.gened_catalog['credit_requirements']['by_category'].get(
                        category['category_id'], {}
                    )
                    min_credits = cat_req.get('minimum', 3)
                    
                    gen_ed_context += f"\n**{category['category_th']} ({category['category_en']})** - ขั้นต่ำ {min_credits} หน่วยกิต\n"
                    gen_ed_context += f"  {category['description']}\n"
                    
                    # Show 3-5 sample courses
                    sample_courses = category['courses'][:5]
                    for course in sample_courses:
                        gen_ed_context += f"  - {course['course_code']} {course['course_name_th']} ({course['credits']})\n"
                    
                    if category['total_courses'] > 5:
                        gen_ed_context += f"  ... และอีก {category['total_courses'] - 5} วิชา\n"
                
                gen_ed_context += f"\n**รวมหน่วยกิต:** {self.gened_catalog['credit_requirements']['required']} (บังคับ) + {self.gened_catalog['credit_requirements']['elective_minimum']} (เลือก) = {self.gened_catalog['credit_requirements']['total_minimum']} หน่วยกิต"
                
                result_texts.append(gen_ed_context)
                gen_ed_appended = True
                print(f"  ✓ Auto-appended General Education info from catalog ({self.gened_catalog['total_courses']} courses)")
        
        # Return both context and source metadata
        context_text = "\n\n---\n\n".join(result_texts)
        
        # Add Gen Ed to sources metadata if appended
        if gen_ed_appended:
            source_files_metadata.append({
                'file_name': 'general_education_catalog.json',
                'program': 'วิชาศึกษาทั่วไป',
                'faculty': 'ทุกคณะ'
            })
        
        return {'context': context_text, 'sources': source_files_metadata}

# Initialize RAG system with LENIENT mode
print("[2/4] Initializing RAG System...")
print("Strategy: Broad search → Rerank → Post-filter + Boost + Fallback")
print()

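rag_system = RAGSystem()
rag_system.load_models()  # must run before load_documents(), which needs the embedding model and Qdrant client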




[2/4] Initializing RAG System...
Strategy: Broad search → Rerank → Post-filter + Boost + Fallback


Initializing RAG System v3.0 (Lenient Search)
Search Strategy: Broad → Rerank → Post-filter + Boost
No pre-filtering - searches ALL programs for best accuracy

[1/3] Loading BGE-M3-Thai embedding model...
  Model: jaeyong2/bge-m3-Thai
  Dimensions: 1024
  Normalization: Enabled


  Loaded! Embedding dimension: 1024

[2/3] Loading BGE-Reranker-v2-m3...
  Model: BAAI/bge-reranker-v2-m3
  Type: Cross-Encoder


  Loaded!

[3/3] Initializing Qdrant vector database...
  Mode: In-memory (fast)
  Distance: Cosine
  Initialized!

RAG System Ready
  - Search Strategy: Broad → Rerank → Post-filter + Boost
  - Embedding: BGE-M3-Thai (1024d, normalized)
  - Reranker: BGE-Reranker-v2-m3 (Cross-Encoder)
  - Vector DB: Qdrant (Cosine similarity)
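
The third stage of the strategy above adds a small keyword bonus to each reranker score before the final sort. The following is a simplified, dependency-free sketch of that idea, using the same bonus values as `query` (0.1 for a keyword found in the chunk text or faculty, 0.2 for one found in the program name). The `Candidate` dataclass and the sample data are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str
    program: str
    faculty: str
    rerank_score: float


def boost_and_rank(
    candidates: List[Candidate], keywords: List[str], k: int = 5
) -> List[Tuple[Candidate, float]]:
    """Add keyword bonuses to rerank scores and return the top-k (candidate, score) pairs."""
    scored = []
    for cand in candidates:
        boost = 0.0
        for kw in (w.lower() for w in keywords):
            if kw in cand.text.lower():
                boost += 0.1
            if kw in cand.program.lower():
                boost += 0.2
            if kw in cand.faculty.lower():
                boost += 0.1
        scored.append((cand, cand.rerank_score + boost))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Illustrative data only: the second chunk starts with a lower rerank score
candidates = [
    Candidate("general regulations", "B.Sc. Chemistry", "Science", 0.62),
    Candidate("computer science year 1 courses", "B.Sc. Computer Science", "Science", 0.55),
]
top = boost_and_rank(candidates, ["computer science"], k=2)
for cand, score in top:
    print(f"{score:.2f}  {cand.program}")
```

Because the bonus is added rather than used as a filter, a chunk with no keyword match can still rank first on reranker score alone.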



## Step 7: Load Curriculum Documents

In [6]:
print("[3/4] Loading documents into RAG system...\n")
rag_system.load_documents(DOCS_FOLDER)

[3/4] Loading documents into RAG system...

Loaded catalog: 131 programs from 20 faculties
  Total keywords: 863

Loading documents from: ./docs


                        PDF EXTRACTION SAMPLE                         
──────────────────────────────────────────────────────────────────────
File: Bachelor of Science (Microbiology).pdf
Pages: 14

First page content (first 300 chars):
...
──────────────────────────────────────────────────────────────────────

  Loaded 10 PDF files...
  Loaded 20 PDF files...
  Loaded 30 PDF files...
  Loaded 40 PDF files...
  Loaded 50 PDF files...
  Loaded 60 PDF files...
  Loaded 70 PDF files...
  Loaded 80 PDF files...
  Loaded 90 PDF files...
  Loaded 100 PDF files...
  Loaded 110 PDF files...
  Loaded 120 PDF files...
  Loaded 130 PDF files...

Loaded 2155 documents from 131 files

Splitting documents into chunks...
Created 1010 text chunks

                             CHUNK SAMPLE                             
─────────────────────────────────────────

Created 1010 embeddings (normalized)

──────────────────────────────────────────────────────────────────────
EMBEDDING SAMPLE
──────────────────────────────────────────────────────────────────────
Text (first 200 chars):
  รายละเอียดของหลักสูตร 
หลักสูตรวิทยาศาสตรบัณฑิต 
สาขาวิชาวิทยาการคอมพิวเตอร์ 
หลักสูตรปรับปรุง พ.ศ. 2565 
(หลักสูตรปรับปรุงแบบแยก) 
ชื่อสถาบันอุดมศึกษา มหาวิทยาลัยเกษตรศาสตร์ 
วิทยาเขต/คณะ/ภาควิชา คณะ...

Embedding info:
  Shape: (1024,)
  Norm: 1.0000 (should be ~1.0 after normalization)
  Min/Max: -0.1154 / 0.4638
  First 10 values: [ 0.05831048 -0.00651068 -0.02117609 -0.03192695 -0.00056314 -0.00315247
  0.05329962  0.03234383  0.03291174  0.02079656]
──────────────────────────────────────────────────────────────────────

Creating Qdrant collection 'ku_curricula'...
Collection created (dimension: 1024, distance: Cosine)

Uploading vectors to Qdrant...

RAG SYSTEM READY
  Documents loaded: 2155
  Text chunks: 1010
  Vectors in Qdrant: 1010
  Embedding model: BGE-M
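
The embedding sample above reports a norm of 1.0000, and the Qdrant collection uses cosine distance. For unit-length vectors, cosine similarity reduces to a plain dot product, which is why normalizing at embedding time is convenient. The following is a minimal sketch of that relationship using only the standard library; the 4-dimensional vectors are invented stand-ins for real 1024-d embeddings, and `l2_normalize`, `dot`, and `cosine_similarity` are illustrative helpers, not part of the notebook.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Invented toy vectors standing in for a query and a chunk embedding.
raw_query = [0.2, -0.1, 0.4, 0.3]
raw_chunk = [0.1, 0.0, 0.5, 0.2]

query = l2_normalize(raw_query)
chunk = l2_normalize(raw_chunk)

print(f"Norm after normalization: {math.sqrt(dot(query, query)):.4f}")
# Dot product of the normalized vectors equals cosine similarity of the raw ones.
print(math.isclose(dot(query, chunk), cosine_similarity(raw_query, raw_chunk)))
```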

## Step 8: Initialize Web Search System

In [7]:
class WebSearchSystem:
    @staticmethod
    def search_duckduckgo(query: str, max_results: int = 3):
        try:
            with DDGS() as ddgs:
                results = list(ddgs.text(query, max_results=max_results))
                return "\n\n".join([f"**{r['title']}**\n{r['body']}" for r in results])
        except Exception as e:
            return f"Web search failed: {str(e)}"

    @staticmethod
    def search_wikipedia(query: str):
        try:
            wikipedia.set_lang('th')
            summary = wikipedia.summary(query, sentences=3)
            return summary
        except Exception:
            # Thai lookup failed (missing page, disambiguation, network error); retry in English
            try:
                wikipedia.set_lang('en')
                summary = wikipedia.summary(query, sentences=3)
                return summary
            except Exception as e:
                return f"Wikipedia search failed: {str(e)}"

web_search = WebSearchSystem()
print("Web search system initialized")

Web search system initialized
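
`search_duckduckgo` joins each hit into a `**title**` line followed by its body. The formatting step can be tried without any network access. The sketch below assumes the result dictionaries carry `title` and `body` keys, as the cell above does; `format_results` is a hypothetical helper and the sample hits are invented.

```python
from typing import Dict, List


def format_results(results: List[Dict[str, str]]) -> str:
    """Render search hits as '**title**' + body blocks separated by blank lines."""
    blocks = []
    for hit in results:
        title = hit.get("title", "").strip()
        body = hit.get("body", "").strip()
        if title or body:  # skip hits that carry no usable text
            blocks.append(f"**{title}**\n{body}")
    return "\n\n".join(blocks)


# Invented sample hits; a real call would return live search results.
sample = [
    {"title": "Kasetsart University", "body": "A public university in Bangkok."},
    {"title": "Curriculum", "body": "Program details."},
]

print(format_results(sample))
```

An empty result list yields an empty string, so a caller can skip the web-search section of the prompt instead of adding a heading with nothing under it.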


## Step 9: Chat Function with RAG and Web Search

In [None]:
def chat_with_bot(
    message: str,
    history: List[Tuple[str, str]],
    use_rag: bool,
    use_web_search: bool,
    temperature: float,
    max_tokens: int
) -> Tuple[str, str]:
    """
    Main chat function with RAG and web search
    Returns: (response, log)
    """
    import time
    log_messages = []

    try:
        start_time = time.time()
        log_msg = f"[{time.strftime('%H:%M:%S')}] Processing query: '{message[:50]}...'"
        log_messages.append(log_msg)
        print(log_msg)  # Real-time console output

        # Build context
        context = ""

        if use_rag and rag_system.docs_loaded:
            log_msg = f"[{time.strftime('%H:%M:%S')}] Searching curriculum documents..."
            log_messages.append(log_msg)
            print(log_msg)

            rag_result = rag_system.query(message, k=5)  # Returns dict with 'context' and 'sources'

            if rag_result and isinstance(rag_result, dict):
                rag_context = rag_result['context']
                sources = rag_result['sources']
                word_count = len(rag_context.split())

                # Check if retrieved context is relevant (has enough content)
                if word_count < 20:
                    log_msg = f"[{time.strftime('%H:%M:%S')}] Found only {word_count} words - insufficient information"
                    log_messages.append(log_msg)
                    print(log_msg)
                    context += f"\n\nเอกสารที่พบ: ไม่เพียงพอหรือไม่เกี่ยวข้อง (เพียง {word_count} คำ)"
                else:
                    context += f"\n\nเอกสารจากหลักสูตรมหาวิทยาลัยเกษตรศาสตร์:\n{rag_context}"

                    # Log source documents
                    log_msg = f"[{time.strftime('%H:%M:%S')}] Found {word_count} words from {len(sources)} document(s)"
                    log_messages.append(log_msg)
                    print(log_msg)

                    # Add detailed source information to log
                    unique_sources = []
                    for src in sources:
                        source_label = f"{src['program']} ({src['file_name']})" if src['program'] else src['file_name']
                        if source_label not in unique_sources:
                            unique_sources.append(source_label)

                    if unique_sources:
                        log_msg = f"[{time.strftime('%H:%M:%S')}] Sources used:"
                        log_messages.append(log_msg)
                        for i, src in enumerate(unique_sources[:5], 1):
                            log_msg = f"    {i}. {src}"
                            log_messages.append(log_msg)
                        if len(unique_sources) > 5:
                            log_msg = f"    ... and {len(unique_sources) - 5} more"
                            log_messages.append(log_msg)
            else:
                log_msg = f"[{time.strftime('%H:%M:%S')}] No relevant documents found"
                log_messages.append(log_msg)
                print(log_msg)
                context += "\n\nไม่พบเอกสารที่เกี่ยวข้อง"

        if use_web_search:
            log_msg = f"[{time.strftime('%H:%M:%S')}] Searching web..."
            log_messages.append(log_msg)
            print(log_msg)

            web_results = web_search.search_duckduckgo(message, max_results=2)
            context += f"\n\nข้อมูลจากอินเทอร์เน็ต:\n{web_results}"

            log_msg = f"[{time.strftime('%H:%M:%S')}] Web search completed"
            log_messages.append(log_msg)
            print(log_msg)

        # Build conversation history
        conversation = ""
        for user_msg, bot_msg in history[-3:]:  # Last 3 exchanges
            conversation += f"คำถาม: {user_msg}\nคำตอบ: {bot_msg}\n\n"

        # Build prompt with strict instructions and few-shot examples
        system_instruction = """คุณเป็นผู้ช่วยตอบคำถามเกี่ยวกับหลักสูตรมหาวิทยาลัยเกษตรศาสตร์

กฎสำคัญ (ห้ามละเมิด):
1. ตอบเป็นภาษาไทยเท่านั้น
2. ห้ามแสดงกระบวนการคิด - ห้ามพูดว่า "We need to", "The user is asking", "We can mention", "We should answer"
3. เริ่มตอบทันที - ไม่ต้องอธิบายว่าคิดยังไง
4. ใช้เฉพาะข้อมูลจากเอกสาร - ห้ามสมมุติ
5. ตอบโดยละเอียดและครบถ้วน - ระบุรหัสวิชา ชื่อวิชา หน่วยกิต ทุกวิชาที่พบในเอกสาร
6. ถ้าถามเกี่ยวกับ "ปี 1" ต้องตอบทั้งภาคเรียนที่ 1 และ 2 (ถ้ามีในเอกสาร)
7. ต้องรวมวิชาศึกษาทั่วไป (ภาษาไทย ภาษาอังกฤษ ศาสตร์แห่งแผ่นดิน) ด้วย

ตัวอย่างที่ 1:
คำถาม: "ปี 1 หลักสูตรวิทยาการคอมพิวเตอร์ต้องเรียนอะไรบ้าง"
เอกสาร: "ปีที่ 1 ภาค 1: 01417111 แคลคูลัส I (3), 01418111 วิทยาการคอมพิวเตอร์เบื้องต้น (2), 01418112 แนวคิดการโปรแกรม (3), 01418141 ทรัพย์สินทางปัญญา (3), 01999111 ศาสตร์แห่งแผ่นดิน (2), ภาษาไทย (3), ภาษาอังกฤษ (3). ภาค 2: 01417322 พีชคณิตเชิงเส้นพื้นฐาน 3..."

คำตอบที่ถูก:
"ปีที่ 1 ของหลักสูตรวิทยาการคอมพิวเตอร์ มีรายวิชาดังนี้:

**ภาคการศึกษาที่ 1:**
1. 01417111 แคลคูลัส I (3 หน่วยกิต)
2. 01418111 วิทยาการคอมพิวเตอร์เบื้องต้น (2 หน่วยกิต)
3. 01418112 แนวคิดการโปรแกรมเบื้องต้น (3 หน่วยกิต)
4. 01418141 ทรัพย์สินทางปัญญาและจรรยาบรรณวิชาชีพ (3 หน่วยกิต)
5. 01999111 ศาสตร์แห่งแผ่นดิน (2 หน่วยกิต)
6. วิชาภาษาไทย (3 หน่วยกิต)
7. วิชาภาษาอังกฤษ (3 หน่วยกิต)
รวม 19 หน่วยกิต

**ภาคการศึกษาที่ 2:**
1.01417322 พีชคณิตเชิงเส้นพื้นฐาน 3(3-0-6)
2.01418113 การโปรแกรมคอมพิวเตอร์ 3(2-2-5)
3.01418131 การโปรแกรมทางสถิติ 3(3-0-6)
4.01418132 หลักมูลการคณนา 3(3-0-6)
5.01175xxx กิจกรรมพลศึกษา 1(0-2-1)
6.วิชาศึกษาทั่วไปกลุ่มสาระศาสตร์แห่งผู้ประกอบการ 3( - - )
7.วิชาศึกษาทั่วไปกลุ่มสาระสุนทรียศาสตร์ 3( - - )
รวม 19 หน่วยกิต
...

คำตอบที่ผิด:
- ตอบเฉพาะภาค 1 (ควรตอบทั้งภาค 1 และ 2)
- ข้ามวิชาศึกษาทั่วไป (ต้องรวมทุกวิชา)
- "We have a long text..." (ห้ามแสดงกระบวนการคิด)

ตัวอย่างที่ 2:
คำถาม: "ค่าเทอมคณะวิทยาศาสตร์เท่าไหร่"
เอกสาร: "ไม่มีข้อมูลค่าเทอม"
คำตอบที่ถูก:
"ขออภัยค่ะ ไม่พบข้อมูลค่าเทอมในเอกสารหลักสูตร กรุณาติดต่อสำนักทะเบียนและประมวลผล มหาวิทยาลัยเกษตรศาสตร์ โทร. 02-942-8000"

ตัวอย่างที่ 3:
คำถาม: "วิศวกรรมคอมพิวเตอร์ต่างจากวิทยาการคอมพิวเตอร์อย่างไร"
เอกสาร: "วิศวกรรมคอมพิวเตอร์ เน้นฮาร์ดแวร์และระบบฝังตัว วิทยาการคอมพิวเตอร์ เน้นซอฟต์แวร์และอัลกอริทึม"
คำตอบที่ถูก:
"ความแตกต่างระหว่าง 2 หลักสูตร:

**วิศวกรรมคอมพิวเตอร์:**
- เน้นด้านฮาร์ดแวร์และระบบฝังตัว
- เรียนวิชาวงจรดิจิทัล ไมโครโปรเซสเซอร์

**วิทยาการคอมพิวเตอร์:**
- เน้นด้านซอฟต์แวร์และอัลกอริทึม
- เรียนวิชาโครงสร้างข้อมูล ปัญญาประดิษฐ์"

เริ่มตอบเลย:
"""

        user_query = f"""
{conversation}

คำถาม: {message}

เอกสาร:
{context if context else "ไม่พบเอกสาร"}

คำตอบ:"""

        prompt = system_instruction + user_query

        # Generate response
        log_msg = f"[{time.strftime('%H:%M:%S')}] Generating response (max {max_tokens} tokens)..."
        log_messages.append(log_msg)
        print(log_msg)
        print(f"[{time.strftime('%H:%M:%S')}] Prompt length: {len(prompt)} characters")

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        print(f"[{time.strftime('%H:%M:%S')}] Input tokens: {len(inputs.input_ids[0])}")

        gen_start = time.time()
        try:
            with torch.no_grad():
                # Clear CUDA cache before generation
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()

                print(f"[{time.strftime('%H:%M:%S')}] Starting model inference...")
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_tokens,
                    temperature=temperature,
                    do_sample=True,
                    top_p=0.9,
                    top_k=50,
                    pad_token_id=tokenizer.eos_token_id,
                    use_cache=True
                )
                print(f"[{time.strftime('%H:%M:%S')}] Model inference completed")
        except RuntimeError as e:
            if "Offset increment" in str(e) or "graph capture" in str(e):
                log_msg = f"[{time.strftime('%H:%M:%S')}] Retrying with fallback generation mode..."
                log_messages.append(log_msg)
                print(log_msg)

                # Retry with greedy decoding and no KV cache (sampling parameters do not apply)
                with torch.no_grad():
                    if torch.cuda.is_available():
                        torch.cuda.empty_cache()
                    outputs = model.generate(
                        **inputs,
                        max_new_tokens=max_tokens,
                        do_sample=False,  # Greedy decoding as fallback; temperature is not used
                        pad_token_id=tokenizer.eos_token_id,
                        use_cache=False
                    )
                print(f"[{time.strftime('%H:%M:%S')}] Fallback generation completed")
            else:
                raise

        gen_time = time.time() - gen_start
        print(f"[{time.strftime('%H:%M:%S')}] Decoding output...")

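        # Decode only the newly generated tokens so the prompt is never part of the answer
        prompt_length = inputs.input_ids.shape[1]
        response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

        # If the model repeats the "คำตอบ:" marker, keep only the text after its last occurrence
        answer = response.split("คำตอบ:")[-1].strip()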

        # POST-PROCESSING: Remove chain of thought in English (ENHANCED)
        import re
        
        original_answer = answer
        
        # STAGE 1: Remove model control tokens
        answer = re.sub(r'(assistant|user|system)(final|response|answer)?\*{0,2}', '', answer, flags=re.IGNORECASE)
        
        # STAGE 2: Remove English thinking patterns (COMPREHENSIVE)
        thinking_patterns = [
            r'^.*?(We have to|We need to|We can|We should|We must).*?(?=\*\*|ปี|หลักสูตร|รายวิชา|ขออภัย|###|##)',
            r'^.*?(Let\'s|Thus|Therefore|However|But the|Also|So).*?(?=\*\*|ปี|หลักสูตร|รายวิชา|ขออภัย|###|##)',
            r'^.*?(The (user|question|document|answer|provided)).*?(?=\*\*|ปี|หลักสูตร|รายวิชา|ขออภัย|###|##)',
            r'^.*?(They want|They are|It is|It seems|This is).*?(?=\*\*|ปี|หลักสูตร|รายวิชา|ขออภัย|###|##)',
            r'^.*?(likely expects|should list|can answer|need to).*?(?=\*\*|ปี|หลักสูตร|รายวิชา|ขออภัย|###|##)',
        ]
        
        for pattern in thinking_patterns:
            answer = re.sub(pattern, '', answer, flags=re.DOTALL | re.IGNORECASE)
        
        # STAGE 3: Line-by-line filtering (STRICTER)
        lines = answer.split('\n')
        filtered_lines = []
        skip_mode = True  # Start in skip mode to remove leading junk
        
        for line in lines:
            line_stripped = line.strip()
            
            # Skip empty lines while in skip mode
            if not line_stripped and skip_mode:
                continue
            
            # Check if line is English thinking or junk
            is_english_thinking = re.match(
                r'^(We |Let\'s |They |The |This |That |It |Should |Must |Need |Can |Will |'
                r'But |Also |Thus |Therefore |However |So |Yet |Still |Answer:|Question:|'
                r'ปี \d+$|^[ก-๙]{1,3}$|^[ก-๙\s.]{1,10}$)',  # Single Thai chars or fragments
                line_stripped, 
                re.IGNORECASE
            )
            
            # Check if we hit REAL Thai content (5+ Thai characters or markdown formatting)
            is_real_content = (
                re.search(r'[ก-๙]{5,}', line_stripped) or  # 5+ Thai chars
                line_stripped.startswith('**') or
                line_stripped.startswith('###') or
                line_stripped.startswith('##') or
                line_stripped.startswith('|') or  # Table
                line_stripped.startswith('-') or  # List
                re.match(r'^\d+\.', line_stripped)  # Numbered list
            )
            
            if is_real_content:
                skip_mode = False
            
            if is_english_thinking and skip_mode:
                continue
            
            if not skip_mode:
                filtered_lines.append(line)
        
        answer = '\n'.join(filtered_lines).strip()
        
        # STAGE 4: Remove leading junk (AGGRESSIVE)
        # Find first real Thai content or markdown formatting
        match = re.search(r'(\*\*|###|##|ปี.*?ของ|หลักสูตร|รายวิชา|ขออภัย|ภาค|คณะ|\|.*?\|)', answer)
        if match and match.start() > 0:
            answer = answer[match.start():].strip()
        
        # STAGE 5: Clean up remaining fragments
        # Remove lines with only 1-2 Thai characters at the start
        lines = answer.split('\n')
        while lines and re.match(r'^[ก-๙\s.]{1,3}$', lines[0].strip()):
            lines.pop(0)
        answer = '\n'.join(lines).strip()
        
        if answer != original_answer:
            log_msg = f"[{time.strftime('%H:%M:%S')}] ✂️ Removed English chain-of-thought"
            log_messages.append(log_msg)
            print(log_msg)

        # Calculate tokens generated
        tokens_generated = len(outputs[0]) - len(inputs.input_ids[0])
        tokens_per_sec = tokens_generated / gen_time if gen_time > 0 else 0

        total_time = time.time() - start_time

        log_msg = f"[{time.strftime('%H:%M:%S')}] Generated {tokens_generated} tokens in {gen_time:.2f}s ({tokens_per_sec:.1f} tok/s)"
        log_messages.append(log_msg)
        print(log_msg)

        log_msg = f"[{time.strftime('%H:%M:%S')}] Total time: {total_time:.2f}s"
        log_messages.append(log_msg)
        print(log_msg)
        print("="*60)

        return answer, "\n".join(log_messages)

    except Exception as e:
        error_log = "\n".join(log_messages) + f"\n[{time.strftime('%H:%M:%S')}] ERROR: {str(e)}"
        print(f"\n[{time.strftime('%H:%M:%S')}] ERROR: {str(e)}")
        print("="*60)
        return f"Error: {str(e)}", error_log

print("Chat function initialized")

Chat function initialized
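
The post-processing in `chat_with_bot` starts in a skip mode and discards lines until it reaches the first line that looks like real answer content (a run of five or more Thai characters, or markdown formatting). The sketch below isolates that idea in a simplified form. `strip_leading_reasoning` is a hypothetical helper, and unlike the notebook code it falls back to the original text when no line qualifies, so an English-only reply is not erased.

```python
import re

# Five or more consecutive Thai characters, the same range the notebook uses.
THAI_RUN = re.compile(r"[ก-๙]{5,}")
# Bold text, ##/### headings, table rows, bullet items, or numbered items.
MARKDOWN_START = re.compile(r"^(\*\*|#{2,3}|\||-|\d+\.)")


def strip_leading_reasoning(text: str) -> str:
    """Drop every line before the first one that looks like answer content."""
    lines = text.split("\n")
    for index, line in enumerate(lines):
        stripped = line.strip()
        if THAI_RUN.search(stripped) or MARKDOWN_START.match(stripped):
            return "\n".join(lines[index:]).strip()
    # Assumption of this sketch: keep the text rather than return nothing.
    return text.strip()


raw = (
    "We need to answer in Thai.\n"
    "The user is asking about year 1.\n"
    "**ภาคการศึกษาที่ 1:**\n"
    "1. 01417111 แคลคูลัส I"
)
print(strip_leading_reasoning(raw))
```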


## Step 10: Launch Gradio Demo Interface

In [9]:
import gradio as gr

print("\n[4/4] Launching Gradio interface...\n")

# Create Gradio interface
with gr.Blocks(
    title="KUchat - Kasetsart University AI Assistant",
    theme=gr.themes.Soft()
) as demo:

    gr.Markdown("""
    # KUchat - Kasetsart University AI Assistant

    ผู้ช่วยตอบคำถามเกี่ยวกับหลักสูตรมหาวิทยาลัยเกษตรศาสตร์

    **Powered by GPT-OSS-20B (4-bit, Unsloth) on A100 GPU**

    ---
    """)

    with gr.Row():
        with gr.Column(scale=3):
            chatbot = gr.Chatbot(
                height=400,
                label="Chat History",
                show_copy_button=True,
                type="messages"
            )

            # Log display
            log_box = gr.Textbox(
                label="System Log",
                lines=4,
                max_lines=4,
                interactive=False,
                show_copy_button=True
            )

            msg = gr.Textbox(
                label="Your Question",
                placeholder="ถามคำถามเกี่ยวกับหลักสูตร เช่น 'หลักสูตรวิศวกรรมคอมพิวเตอร์มีอะไรบ้าง'",
                lines=2
            )

            with gr.Row():
                submit_btn = gr.Button("Send", variant="primary")
                clear_btn = gr.Button("Clear")

        with gr.Column(scale=1):
            gr.Markdown("### Settings")

            use_rag = gr.Checkbox(
                label="Use RAG (Curriculum Documents)",
                value=True,
                info="Search in curriculum documents"
            )

            use_web_search = gr.Checkbox(
                label="Use Web Search",
                value=False,
                info="Search online for latest information"
            )

            temperature = gr.Slider(
                minimum=0.1,
                maximum=1.0,
                value=0.7,
                step=0.1,
                label="Temperature",
                info="Higher values increase creativity"
            )

            max_tokens = gr.Slider(
                minimum=128,
                maximum=2048,
                value=512,  # Raised to 512 to allow more detailed answers
                step=128,
                label="Max Tokens",
                info="Maximum response length"
            )

            gr.Markdown("""
            ---
            ### System Information
            - **Model**: GPT-OSS-20B
            - **Parameters**: 20B (4-bit)
            - **GPU**: A100 80GB
            - **VRAM Usage**: ~12GB (model) + ~10GB (RAG) = ~22GB total
            - **Free VRAM**: ~58GB
            - **Inference Speed**: 40-80 tokens/sec
            - **Optimization**: Unsloth Framework
            - **Vector DB**: Qdrant
            - **Embedding**: BGE-M3-Thai (1024d)
            - **Documents**: 131 programs
            """)

    gr.Markdown("""
    ---
    ### Example Questions:
    - หลักสูตรวิศวกรรมคอมพิวเตอร์มีอะไรบ้าง
    - คณะวิทยาศาสตร์มีกี่สาขา
    - ค่าเทอมคณะบริหารธุรกิจเท่าไหร่
    - วิทยาการคอมพิวเตอร์ต่างจากวิศวกรรมคอมพิวเตอร์อย่างไร
    """)

    # Chat function
    def respond(message, chat_history, use_rag, use_web_search, temperature, max_tokens):
        # Convert chat_history from messages format to tuples for chat_with_bot
        history_tuples = [(msg["content"], resp["content"])
                          for msg, resp in zip(chat_history[::2], chat_history[1::2])] if chat_history else []

        bot_message, log = chat_with_bot(
            message, history_tuples, use_rag, use_web_search, temperature, max_tokens
        )

        # Return in OpenAI messages format
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})

        return "", chat_history, log

    # Event handlers
    submit_btn.click(
        respond,
        inputs=[msg, chatbot, use_rag, use_web_search, temperature, max_tokens],
        outputs=[msg, chatbot, log_box]
    )

    msg.submit(
        respond,
        inputs=[msg, chatbot, use_rag, use_web_search, temperature, max_tokens],
        outputs=[msg, chatbot, log_box]
    )

    clear_btn.click(lambda: ([], ""), None, [chatbot, log_box], queue=False)

# Launch with public URL
print("="*60)
print("LAUNCHING KUCHAT DEMO")
print("="*60)
print("\nStarting Gradio interface...\n")

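# With debug=True, launch() blocks in Colab until the cell is interrupted,
# so the print statements after this call only run once the demo has stopped.
demo.launch(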
    share=True,  # Creates public URL
    debug=True,  # Show errors in Colab notebook
    show_error=True,
    server_port=7860
)

print("\n" + "="*60)
print("KUCHAT IS LIVE")
print("="*60)
print("\nPublic URL generated above")
print("Share the URL with anyone to use the chatbot")
print("Keep this cell running to keep the demo alive")
print("="*60)


[4/4] Launching Gradio interface...

LAUNCHING KUCHAT DEMO

Starting Gradio interface...

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://18ff34bec9d50d3ee9.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


[06:09:29] Processing query: 'วิทยาการคอมเรียนอะไรบ้าง...'
[06:09:29] Searching curriculum documents...

                            SEARCH PROCESS                            
──────────────────────────────────────────────────────────────────────
Query: วิทยาการคอมเรียนอะไรบ้าง
Mode: LENIENT (No pre-filtering)
──────────────────────────────────────────────────────────────────────

[Stage 1] Semantic Search (Qdrant)
  Searching ALL programs (no pre-filter)
  Query embedding norm: 1.0000
  Retrieved 50 candidates
  Top 3 semantic scores: ['0.606', '0.558', '0.512']

[Stage 2] Reranking (Cross-Encoder)
  Reranking 50 candidates...
  Reranking complete
  Top 5 rerank scores: ['0.849', '0.813', '0.661', '0.604', '0.355']

[Stage 3] Post-filter + Boost + Fallback
  Applied keyword boosting
  Top 5 scores: ['0.849 (+0.00)', '0.813 (+0.00)', '0.661 (+0.00)', '0.604 (+0.00)', '0.355 (+0.00)']
  Filtered to 5 results (threshold: 0.3)

Source Documents:
  1. วท.บ. วิทยาการคอมพิวเตอร์ (Bachelor of
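
The log above shows the last retrieval stage: reranked candidates receive a keyword boost, anything below the 0.3 threshold is dropped, and the top results are kept. A minimal sketch of that stage follows. The first five scores are taken from the log; the candidate texts, the sixth candidate, the substring matching rule, and the `boost` value of 0.05 are all assumptions, and `boost_and_filter` is a hypothetical function rather than the notebook's implementation.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    text: str
    rerank_score: float


def boost_and_filter(
    candidates: List[Candidate],
    keywords: List[str],
    boost: float = 0.05,      # assumed bonus per matched keyword
    threshold: float = 0.3,   # threshold reported in the log
    k: int = 5,
) -> List[Tuple[float, Candidate]]:
    """Add a per-keyword bonus, drop low scores, return the top k by final score."""
    scored = []
    for cand in candidates:
        hits = sum(1 for kw in keywords if kw in cand.text)
        scored.append((cand.rerank_score + boost * hits, cand))
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Invented texts; the first five scores mirror the rerank scores in the log.
candidates = [
    Candidate("Computer Science curriculum, year 1 courses", 0.849),
    Candidate("Computer Science program overview", 0.813),
    Candidate("Computer Engineering curriculum", 0.661),
    Candidate("Statistics program course list", 0.604),
    Candidate("General education requirements", 0.355),
    Candidate("Campus map and facilities", 0.120),
]

for score, cand in boost_and_filter(candidates, keywords=[]):
    print(f"{score:.3f}  {cand.text}")
```

With no keywords the boost is zero for every candidate, which matches the `(+0.00)` entries in the log, and the sixth candidate falls below the threshold.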

---

## Demo Features

### Chat Interface
- Gradio UI with chat history
- Real-time responses from GPT-OSS-20B
- Copy/paste support
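
The chat history is stored in the messages format (a list of `role`/`content` dictionaries), while `chat_with_bot` expects `(user, assistant)` tuples. The `respond` function pairs entries by position, which works when turns strictly alternate. As a hedged alternative, the sketch below pairs by role instead, so an unanswered user turn does not shift later pairs. `messages_to_pairs` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, List, Optional, Tuple


def messages_to_pairs(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each user message with the assistant reply that follows it."""
    pairs: List[Tuple[str, str]] = []
    pending_user: Optional[str] = None
    for message in messages:
        role = message.get("role")
        content = message.get("content", "")
        if role == "user":
            pending_user = content  # a newer user turn replaces an unanswered one
        elif role == "assistant" and pending_user is not None:
            pairs.append((pending_user, content))
            pending_user = None
    return pairs


history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},   # never answered
    {"role": "user", "content": "q3"},
    {"role": "assistant", "content": "a3"},
]
print(messages_to_pairs(history))
```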

### Controls
- **RAG Toggle**: Search curriculum documents
- **Web Search**: Get latest online information
- **Temperature**: Adjust creativity (0.1-1.0)
- **Max Tokens**: Control response length
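
When sampling is enabled, the temperature setting divides the model's logits before they are turned into probabilities, so low values concentrate probability on the top token and high values flatten the distribution. The sketch below illustrates this with three invented logits; `softmax_with_temperature` is an illustrative helper, not code from the notebook.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
for temperature in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```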

### Public Access
- Share URL expires after 1 week (as reported by Gradio at launch)
- Anyone can access without login
- Multiple users can connect at the same time; requests share a single GPU, so responses may be queued

### Performance
- **Model**: GPT-OSS-20B (20B parameters)
- **Quantization**: 4-bit BNB (Unsloth optimized)
- **GPU**: A100 80GB (~22GB VRAM used, including RAG components)
- **Speed**: 40-80 tokens/second (2x faster)
- **Quality**: Production-ready
- **VRAM Savings**: 75% compared to FP16
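
The 75% figure follows from bit widths alone: 4-bit weights take a quarter of the space of 16-bit weights. The arithmetic below is a rough sketch for the 20B parameter count given in the Technical Specifications. It counts weights only, so the ~12GB reported for the loaded model is presumably the 10GB estimate plus tensors kept in higher precision and quantization metadata; `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9  # parameter count stated in the Technical Specifications

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
savings = 1 - int4_gb / fp16_gb

print(f"FP16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
print(f"Savings:       {savings:.0%}")
```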

---

## Troubleshooting

### No A100 GPU?
- Go to: Runtime → Change runtime type → A100 GPU
- Requires Google Colab Pro+ subscription
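
To confirm which GPU the runtime actually received before loading the model, one option is to query `nvidia-smi`. The sketch below is a hedged example that uses only the standard library; `list_gpus` is a hypothetical helper, and it returns `None` when `nvidia-smi` is missing or fails, which is what a CPU-only runtime would produce.

```python
import shutil
import subprocess
from typing import List, Optional


def list_gpus() -> Optional[List[str]]:
    """Return one 'name, total memory' string per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        completed = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]


gpus = list_gpus()
if not gpus:
    print("No NVIDIA GPU detected; select an A100 runtime before loading the model.")
elif not any("A100" in gpu for gpu in gpus):
    print(f"GPU found, but not an A100: {gpus}")
else:
    print(f"A100 available: {gpus}")
```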

### Model loading fails?
- Check that the Hugging Face token (`HF_TOKEN`) has been replaced with a valid token
- Verify the internet connection
- Try restarting the runtime
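
A common cause of loading failures is that the placeholder token from the configuration cell was never replaced. The sketch below shows one way to catch that early and to fall back to an `HF_TOKEN` environment variable instead of hardcoding the token in the notebook. `resolve_hf_token` is a hypothetical helper, and it only checks that a token is present, not that it is valid.

```python
import os
from typing import Optional

PLACEHOLDER = "YOUR_HUGGINGFACE_TOKEN_HERE"  # placeholder used in the configuration cell


def resolve_hf_token(configured: Optional[str]) -> Optional[str]:
    """Return the configured token, or the HF_TOKEN environment variable, or None."""
    token = (configured or "").strip()
    if not token or token == PLACEHOLDER:
        token = os.environ.get("HF_TOKEN", "").strip()
    return token or None


token = resolve_hf_token(PLACEHOLDER)
if token is None:
    print("No Hugging Face token found; set HF_TOKEN before loading the model.")
else:
    print("Hugging Face token found.")
```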

### Demo stops working?
- Cell must keep running for demo to work
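- Colab disconnects idle sessions and limits total session length, so a long-running demo will eventually stop
- Re-run the Step 10 cell to restart the demo (a new public URL is generated)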

---

## Cost Estimate

**Google Colab Pro+**: ~$50/month
- A100 GPU access
- ~$1-2 per hour of usage
- Background execution
- Priority access

**Alternative**: Use test version (T4 GPU) for free
- See `colab_backend_test.ipynb`
- Smaller model but still functional
- Good for testing/development

---