# Simple Agentic RAG System


## **Same Model Configuration as Original**
- **LLM**: `lmsys/vicuna-7b-v1.5` with 4-bit NF4 quantization
- **Embeddings**: `bkai-foundation-models/vietnamese-bi-encoder`
- **Text Splitter**: SemanticChunker with Vietnamese embeddings
- **Vector Database**: Chroma
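
As a rough sanity check on why 4-bit quantization is used, the sketch below estimates the weight footprint of a model with about 7 billion parameters. `PARAMS` is an assumed round number rather than a measured count, and the estimate ignores activations, the KV cache, and quantization metadata.

```python
# Rough weight-memory estimate. PARAMS is an assumed round number, not a measured value.
PARAMS = 7_000_000_000

def weight_gib(params: int, bits_per_param: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return params * bits_per_param / 8 / 2**30

fp16_gib = weight_gib(PARAMS, 16)  # 2 bytes per weight
nf4_gib = weight_gib(PARAMS, 4)    # half a byte per weight

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone come close to the 15360MiB reported by `nvidia-smi` for the T4, while the 4-bit weights leave room for the embedding model and generation.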

## **Added Agentic Features (Simplified)**
- **Multiple Tools**: Document search, web search, calculator, summarizer
- **Simple Tool Selection**: Basic logic to choose appropriate tools
- **Interactive Interface**: Chat functionality

### How to use:
- Run `simple_chat()` to start interactive mode
- Or test individual queries with `agentic_rag('your question')`
- Available commands in chat: `quit`, `memory`, `clear`

In [1]:
!nvidia-smi

Thu Jul 24 15:58:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# 1. Setup

In [2]:
# Install packages (same as original + agentic tools)
!pip install -q transformers==4.52.4
!pip install -q bitsandbytes==0.46.0
!pip install -q accelerate==1.7.0
!pip install -q langchain==0.3.25
!pip install -q langchainhub==0.1.21
!pip install -q langchain-chroma==0.2.4
!pip install -q langchain_experimental==0.3.4
!pip install -q langchain-community==0.3.24
!pip install -q langchain_huggingface==0.2.0
!pip install -q python-dotenv==1.1.0
!pip install -q pypdf

# Additional packages for agentic features
!pip install -q requests
!pip install -q beautifulsoup4

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.5/10.5 MB[0m [31m120.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.0/67.0 MB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m53.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m39.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [None]:
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface.llms import HuggingFacePipeline
from langchain.memory import ConversationBufferMemory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.chains import ConversationalRetrievalChain
from langchain_experimental.text_splitter import SemanticChunker
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain import hub

import requests
from bs4 import BeautifulSoup
import re
import math
from urllib.parse import quote_plus
import json



✅ All libraries imported successfully!


In [None]:
!gdown 1lWuq0COKnU9mCfMvTEq54DBLgAh3yYDx

Downloading...
From: https://drive.google.com/uc?id=1lWuq0COKnU9mCfMvTEq54DBLgAh3yYDx
To: /content/YOLOv10_Tutorials.pdf
100% 16.6M/16.6M [00:00<00:00, 18.4MB/s]


# 2. Load File

In [None]:
Loader = PyPDFLoader
file_path = "YOLOv10_Tutorials.pdf"  
loader = Loader(file_path)
documents = loader.load()

print(f"✅ Loaded {len(documents)} pages from {file_path}")

✅ Loaded 20 pages from YOLOv10_Tutorials.pdf


# 3. Model Configuration (Same as Original Simple-RAG)

In [6]:
model_name = "lmsys/vicuna-7b-v1.5"
embeddings_model = "bkai-foundation-models/vietnamese-bi-encoder"

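# 4-bit NF4 quantization with double quantization to reduce GPU memory use
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    # Note: the Tesla T4 shown above has no native bfloat16 support,
    # so torch.float16 may be a faster compute dtype on that GPU.
    bnb_4bit_compute_dtype=torch.bfloat16
)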

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.1,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=1000
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)



Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Device set to use cuda:0


✅ Model loaded with same configuration as original Simple-RAG!


# 4. Embeddings and Document Processing (Enhanced from Original)

In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name=embeddings_model,
    encode_kwargs={'normalize_embeddings': True}
)
semantic_splitter = SemanticChunker(embeddings)
chunks = semantic_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")


🔄 Processing documents with enhanced chunking...
✅ SemanticChunker: Created 45 chunks
📄 Final result: 45 chunks ready for vector store
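
To illustrate the idea behind semantic chunking, here is a minimal sketch that splits text wherever the embedding distance between neighbouring sentences spikes above a percentile threshold. It is not the `SemanticChunker` implementation: the function names, the 95th-percentile default, and the 2-D toy vectors are all assumptions made for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_split(sentences, vectors, pct=95):
    """Group consecutive sentences, starting a new chunk where the distance spikes."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy data: hand-made 2-D vectors stand in for real sentence embeddings.
sentences = ["YOLOv10 is a detector.", "It runs in real time.",
             "Chroma stores vectors.", "It supports similarity search."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = semantic_split(sentences, vectors)
print(chunks)
```

In the toy data the only large jump is between the second and third sentence, so the text is split into two chunks there.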


In [None]:
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})



✅ Vector store created with same configuration as original Simple-RAG!
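
The retriever returns the `k` chunks closest to the query embedding. The sketch below shows that ranking step on toy data. `top_k` and the 2-D vectors are hypothetical, and a real vector store uses an index rather than a full scan. Because the embeddings are normalized, ranking by dot product, cosine similarity, or Euclidean distance gives the same order.

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to query_vec."""
    # For unit-length vectors the dot product equals the cosine similarity.
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy unit vectors standing in for chunk embeddings (illustrative values only).
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [0.8, 0.6]

best = top_k(query_vec, doc_vecs, k=3)
print(best)
```

With `k=3`, the least similar vector (index 3, pointing away from the query) is the one left out.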


# 5. Agentic Tools (Simple Functions - No Complex Classes)

In [None]:
# Tool 1: Document Search (Enhanced from original RAG)
def search_documents(query: str) -> str:
    relevant_docs = retriever.invoke(query)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])
    return f"📚 Document Search Results:\n{context}"

✅ Document search tool ready


In [None]:
# Tool 2: Web Search
def search_web(query: str, max_results: int = 3) -> str:
    """
    Simple web search that scrapes the DuckDuckGo HTML results page.
    Returns up to max_results result titles and URLs as formatted text.
    """
    search_url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        response = requests.get(search_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        # A network failure should not crash the agent loop
        return f"Web search failed: {e}"
    soup = BeautifulSoup(response.content, 'html.parser')
    results = []
    # Each result title is an <a class="result__a"> element
    for result in soup.find_all('a', class_='result__a')[:max_results]:
        title = result.get_text().strip()
        url = result.get('href', '')
        if title and url:
            results.append(f"• {title}\n  URL: {url}")
    if results:
        return "🌐 Web Search Results:\n" + "\n\n".join(results)
    else:
        return "No web results found."


✅ Web search tool ready
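
The search URL is built with `quote_plus`, and the result links may come back wrapped in a redirect. The sketch below shows both steps with the standard library. The redirect shape (`//duckduckgo.com/l/?uddg=<encoded target>`) is an assumption about the results page and should be checked against a real response. `unwrap_redirect` is a hypothetical helper name.

```python
from urllib.parse import quote_plus, urlparse, parse_qs

def unwrap_redirect(href: str) -> str:
    """Return the target URL from a redirect-style href, or the href unchanged."""
    # Assumption: the real destination is carried in the `uddg` query parameter.
    targets = parse_qs(urlparse(href).query).get("uddg")
    return targets[0] if targets else href

encoded_query = quote_plus("latest AI news")
wrapped = "//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fpage&rut=abc"

print(encoded_query)
print(unwrap_redirect(wrapped))
print(unwrap_redirect("https://example.com/direct"))
```

Links without a `uddg` parameter pass through unchanged, so the helper is safe to apply to every `href`.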


In [None]:
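# Tool 3: Calculator
def calculate(expression: str) -> str:
    # Keep only digits, arithmetic operators, parentheses, dots, and whitespace
    cleaned = re.sub(r'[^0-9+\-*/().\s]', '', expression).strip()
    if not cleaned or '**' in cleaned:
        # Reject empty input and exponentiation, which could produce huge numbers
        return "🧮 Calculation Error:\nNo valid arithmetic expression found."
    try:
        # eval only sees the filtered characters and has no access to builtins
        result = eval(cleaned, {"__builtins__": {}}, {})
    except (SyntaxError, ZeroDivisionError, TypeError, OverflowError) as e:
        return f"🧮 Calculation Error:\nCould not evaluate '{cleaned}': {e}"
    return f"🧮 Calculation Result:\n{expression} = {result}"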


# Tool 4: Text Summarizer 
def summarize_text(text: str, max_length: int = 200) -> str:
    if len(text) <= max_length:
        return f"📝 Text Summary:\n{text}"

    prompt = f"Summarize this text in {max_length} characters or less:\n\n{text}"
    summary = llm.invoke(prompt)
    return f"📝 Text Summary:\n{summary}"

✅ Calculator and summarizer tools ready
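
If avoiding `eval` altogether is preferred, one alternative is to walk the expression's syntax tree and allow only basic arithmetic. This is a minimal sketch and not part of the original notebook. `safe_arithmetic` and the operator tables are hypothetical names.

```python
import ast
import operator

# Only these operators are allowed. Anything else raises ValueError.
_BINARY_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
               ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_arithmetic(expression: str):
    """Evaluate +, -, *, / and parentheses over numbers without using eval."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression.strip(), mode="eval"))

result = safe_arithmetic("15 * 7 + 23")
print(result)
```

Function calls, names, and exponentiation are rejected with `ValueError`. Malformed input still raises `SyntaxError` from `ast.parse`, so a caller would catch both.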


# 6. Simple Tool Selection Logic (No Complex Agent Framework)

In [None]:
# Simple tool selection - no complex classes needed
available_tools = {
    "document_search": search_documents,
    "web_search": search_web,
    "calculator": calculate,
    "summarizer": summarize_text
}

def select_tool(query: str) -> str:
    query_lower = query.lower()

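    # Simple keyword-based tool selection.
    # Operator characters count only between two numbers, so hyphens or slashes
    # in ordinary words (e.g. "real-time") do not trigger the calculator.
    has_arithmetic = re.search(r'\d\s*[+\-*/]\s*\(?\s*\d', query_lower) is not None
    if has_arithmetic or any(word in query_lower for word in ['calculate', 'math', 'compute']):
        return "calculator"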
    elif any(word in query_lower for word in ['web', 'search online', 'google', 'internet', 'current', 'latest']):
        return "web_search"
    elif any(word in query_lower for word in ['summarize', 'summary', 'brief', 'short']):
        return "summarizer"
    else:
        return "document_search"

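def use_tool(tool_name: str, query: str) -> str:
    if tool_name == "summarizer":
        # Summarize the retrieved document text, not the query string itself
        docs = retriever.invoke(query)
        return summarize_text("\n\n".join(doc.page_content for doc in docs))
    return available_tools[tool_name](query)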

✅ Simple tool selection logic ready
🔧 Available tools: ['document_search', 'web_search', 'calculator', 'summarizer']
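
The keyword routing can also be written as an ordered table, which makes the priority between tools explicit: the first matching rule wins. This is a minimal sketch with hypothetical names (`ROUTES`, `route`). It reuses the keyword lists from `select_tool` but deliberately leaves out the single-character operators.

```python
# Ordered routing table: earlier entries take priority over later ones.
ROUTES = [
    ("calculator", ["calculate", "math", "compute"]),
    ("web_search", ["web", "search online", "google", "internet", "current", "latest"]),
    ("summarizer", ["summarize", "summary", "brief", "short"]),
]

def route(query: str, default: str = "document_search") -> str:
    """Return the first tool whose keyword list matches, else the default."""
    q = query.lower()
    for tool, keywords in ROUTES:
        if any(keyword in q for keyword in keywords):
            return tool
    return default

for example in ["What is YOLOv10?",
                "Calculate 15 * 7 + 23",
                "Search online for latest AI news",
                "Summarize the latest results"]:
    print(f"{example!r} -> {route(example)}")
```

The last query contains both "summarize" and "latest". It goes to `web_search` because that rule is checked first, which is the kind of ordering effect to keep in mind when adding keywords.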


# 7. Simple Memory and Main Agentic Function

In [None]:
# Simple conversation memory (just a list - no complex memory classes)
from datetime import datetime

conversation_history = []

def add_to_memory(user_query: str, assistant_response: str):
    conversation_history.append({
        "timestamp": datetime.now().strftime("%H:%M:%S"),
        "user": user_query,
        "assistant": assistant_response
    })
    if len(conversation_history) > 4:
        conversation_history.pop(0)

def get_memory_context() -> str:
    if not conversation_history:
        return ""

    context = "Recent conversation:\n"
    for entry in conversation_history[-3:]:  # Last 3 exchanges
        context += f"User: {entry['user']}\nAssistant: {entry['assistant']}\n\n"
    return context

def agentic_rag(query: str) -> str:
    print(f"🤖 Processing query: {query}")

    # Step 1: Select appropriate tool
    selected_tool = select_tool(query)
    print(f"🔧 Selected tool: {selected_tool}")

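    # Step 2: Use the tool to get information
    try:
        tool_result = use_tool(selected_tool, query)
    except Exception as e:
        # Continue with an error note so one failing tool does not abort the request
        tool_result = f"Tool '{selected_tool}' failed: {e}"
    print(f"📊 Tool result length: {len(tool_result)} characters")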

    # Step 3: Get conversation context
    memory_context = get_memory_context()

    # Step 4: Generate final response using LLM (same model as original)
    prompt = f"""You are a helpful AI assistant. Use the information provided to answer the user's question.

{memory_context}

Tool used: {selected_tool}
Information gathered: {tool_result}

User question: {query}

Provide a helpful, accurate response based on the information gathered:"""

    try:
        response = llm.invoke(prompt)

        # Add to memory
        add_to_memory(query, response)

        return response

    except Exception as e:
        error_response = f"❌ Error generating response: {str(e)}"
        add_to_memory(query, error_response)
        return error_response



✅ Agentic RAG system ready!
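
The sliding-window memory can also be expressed with `collections.deque`, which drops the oldest entry automatically once `maxlen` is reached. This is a minimal sketch of that alternative, not a replacement for the code above. `history`, `remember`, and `recent_context` are hypothetical names.

```python
from collections import deque

# Bounded history: appending a fifth entry silently discards the oldest one.
history = deque(maxlen=4)

def remember(user: str, assistant: str) -> None:
    """Store one exchange in the bounded history."""
    history.append({"user": user, "assistant": assistant})

def recent_context(n: int = 3) -> str:
    """Format the last n exchanges for inclusion in a prompt."""
    entries = list(history)[-n:]
    return "\n\n".join(
        f"User: {e['user']}\nAssistant: {e['assistant']}" for e in entries
    )

for i in range(6):
    remember(f"question {i}", f"answer {i}")

print(len(history))
print(recent_context())
```

After six exchanges only the last four remain stored, and the prompt context covers the last three, matching the window sizes used in `add_to_memory` and `get_memory_context`.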


# 8. Tests and Interactive Chat Interface (Simple Loop)

In [None]:
# Test the agentic RAG system with different types of queries
print("Testing Simple Agentic RAG System")


# Test 1: Document search 
print("\n📚 Test 1: Document Search")
response1 = agentic_rag("What is YOLOv10?")
print(f"Response: {response1[:200]}...")

# Test 2: Calculator
print("\n🧮 Test 2: Calculator")
response2 = agentic_rag("Calculate 15 * 7 + 23")
print(f"Response: {response2}")

# Test 3: Web search 
print("\n🌐 Test 3: Web Search")
response3 = agentic_rag("Search online for latest AI news")
print(f"Response: {response3[:200]}...")



🚀 Testing Simple Agentic RAG System

📚 Test 1: Document Search
🤖 Processing query: What is YOLOv10?
🔧 Selected tool: document_search
📊 Tool result length: 1146 characters
Response: 
YOLOv10 is an object detection algorithm introduced in 2024 by Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. It improves accuracy and speed over YOLOv8 and introduces new techniques such as Pro...

🧮 Test 2: Calculator
🤖 Processing query: Calculate 15 * 7 + 23
🔧 Selected tool: calculator
📊 Tool result length: 49 characters
Response: 
The result of 15 multiplied by 7 plus 23 is 128.

🌐 Test 3: Web Search
🤖 Processing query: Search online for latest AI news
🔧 Selected tool: web_search
📊 Tool result length: 721 characters
Response: 
Latest AI news can be found on various tech news websites such as TechCrunch, Google News, and WIRED. These sources provide updates on artificial intelligence research, developments, and applications...

✅ All tools tested successfully!


In [None]:
# Simple interactive chat loop (no complex UI framework)
def simple_chat():
    print("💬 Simple Agentic RAG Chat Started!")
    print("Type 'quit' to exit, 'memory' to see conversation history")
    print("=" * 60)
    while True:
        try:
            user_input = input("\nYou: ").strip()

            if user_input.lower() == 'quit':
                print("👋 Chat ended. Goodbye!")
                break
            elif user_input.lower() == 'memory':
                print("\n🧠 Conversation History:")
                for i, entry in enumerate(conversation_history, 1):
                    print(f"{i}. [{entry['timestamp']}]")
                    print(f"   User: {entry['user']}")
                    print(f"   Assistant: {entry['assistant'][:100]}...")
                continue
            elif user_input.lower() == 'clear':
                conversation_history.clear()
                print("Memory cleared!")
                continue

            if not user_input:
                continue
            print("\n🤖 Assistant: ", end="")
            response = agentic_rag(user_input)
            print(response)

        except KeyboardInterrupt:
            print("\n👋 Chat interrupted. Goodbye!")
            break
        except Exception as e:
            print(f"\n❌ Error: {e}")

# Instructions for running the chat

print("")
print("🎯 Example queries to try:")
print("• 'What is YOLOv10?' (document search)")
print("• 'Calculate 25 * 4 + 16' (calculator)")
print("• 'Search web for Python tutorials' (web search)")
print("• 'Summarize the YOLOv10 document' (summarizer)")

📋 How to use:
1. Run: simple_chat() to start interactive mode
2. Or test individual queries: agentic_rag('your question')
3. Available commands in chat: 'quit', 'memory', 'clear'

🎯 Example queries to try:
• 'What is YOLOv10?' (document search)
• 'Calculate 25 * 4 + 16' (calculator)
• 'Search web for Python tutorials' (web search)
• 'Summarize the YOLOv10 document' (summarizer)


In [21]:
simple_chat()

💬 Simple Agentic RAG Chat Started!
Type 'quit' to exit, 'memory' to see conversation history

👤 You: chào bạn (Vietnamese: "hello")

🤖 Assistant: 🤖 Processing query: chào bạn
🔧 Selected tool: document_search
📊 Tool result length: 1193 characters
 Hello! How may I assist you today?

👤 You: tổng thống việt nam là ai (Vietnamese: "who is the president of Vietnam")

🤖 Assistant: 🤖 Processing query: tổng thống việt nam là ai
🔧 Selected tool: document_search
📊 Tool result length: 990 characters
 The current President of Vietnam is Nguyen Xuan Phuc.

👤 You: quit
👋 Chat ended. Goodbye!
