<a href="https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/13-Adding_Router.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Packages and Set Up Variables


In [None]:
!pip install -qU llama-index==0.14.0 llama-index-llms-openai==0.5.6 openai==1.107.0 cohere==5.18.0 jedi==0.19.2 \
                 llama-index-llms-google-genai==0.5.0 chromadb==1.0.21 llama-index-vector-stores-chroma==0.5.3

In [None]:
import os

# Set the following API keys in the Python environment. They are used later in the notebook.
# os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"
# os.environ["GOOGLE_API_KEY"] = "<YOUR_GOOGLE_API_KEY>"

from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["GOOGLE_API_KEY"] = userdata.get('Google_api_key')
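
Before moving on, it can help to confirm that both keys are actually set, since a missing key otherwise only surfaces later as an authentication error. The helper below is a small sketch: `missing_keys` is a hypothetical name (not part of any library used here), and it is demonstrated on a sample dictionary rather than the real environment.

```python
import os


def missing_keys(required, env=os.environ):
    """Return the names in `required` that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Demonstration on a sample mapping (assumed values, not the real environment).
sample_env = {"OPENAI_API_KEY": "sk-example", "GOOGLE_API_KEY": ""}
print(missing_keys(["OPENAI_API_KEY", "GOOGLE_API_KEY"], env=sample_env))
```

In the notebook itself, calling `missing_keys(["OPENAI_API_KEY", "GOOGLE_API_KEY"])` without the `env` argument checks the real environment.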

In [None]:
# Allows running asyncio in environments with an existing event loop, like Jupyter notebooks.
import nest_asyncio

nest_asyncio.apply()

# Load a Model


In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-5-mini", additional_kwargs={'reasoning_effort':'minimal'})
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load Indexes


In [None]:
# Download the vector store from the Hugging Face Hub
from huggingface_hub import hf_hub_download

vectorstore = hf_hub_download(repo_id="jaiganesan/ai_tutor_knowledge", filename="vectorstore.zip", repo_type="dataset", local_dir=".")

In [None]:
!unzip vectorstore.zip

Archive:  vectorstore.zip
   creating: ai_tutor_knowledge/
   creating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/length.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/index_metadata.pickle  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/link_lists.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/header.bin  
  inflating: ai_tutor_knowledge/684af133-f877-4230-bde4-575cf53b6688/data_level0.bin  
  inflating: ai_tutor_knowledge/chroma.sqlite3  


In [None]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex

# Create your index
db = chromadb.PersistentClient(path="./ai_tutor_knowledge")
chroma_collection = db.get_or_create_collection("ai_tutor_knowledge")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
vector_index = VectorStoreIndex.from_vector_store(vector_store)

In [None]:
# Query Engine
ai_tutor_knowledge_query_engine = vector_index.as_query_engine(similarity_top_k=3)

res = ai_tutor_knowledge_query_engine.query("How does Retrieval Augmented Generation (RAG) work?")
print(res.response)

Retrieval-Augmented Generation (RAG) is a hybrid approach that augments a generative large language model with retrieved external knowledge to improve factuality, currency, and domain-specific accuracy. The process has two main interacting components—Retrieval and Generation—and typically follows several ordered processing steps:

1. Query classification
   - Decide whether the input/query requires retrieval (e.g., is external knowledge needed) or can be handled directly by the LLM.

2. Retrieval (Indexing and Searching)
   - Index external documents for efficient access (sparse inverted indexes or dense vector encodings).
   - Search the index to fetch documents or document chunks relevant to the query.

3. Reranking
   - Optionally re-order the retrieved results using a reranker to prioritize the most relevant evidence.

4. Repacking (Organization)
   - Organize and assemble the selected retrieved materials into a structured context (e.g., concatenating, chunking or otherwise packagi
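
The `similarity_top_k=3` argument asks the retriever for the three stored chunks whose embeddings score highest against the query embedding. The sketch below illustrates that selection using cosine similarity, one common similarity measure, over hand-made 2-D vectors. The vectors, chunk names, and the `top_k` helper are illustrative assumptions; a real vector store works on high-dimensional embeddings and typically uses an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumed values chosen for illustration only).
chunks = {
    "rag_intro": [0.9, 0.1],
    "rag_reranking": [0.8, 0.3],
    "lora": [0.1, 0.9],
    "quantization": [0.2, 0.8],
}
query_vec = [1.0, 0.0]

results = top_k(query_vec, chunks, k=3)
for name, score in results:
    print(f"{name}\t{score:.3f}")
```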

In [None]:
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Title\t", src.metadata["title"])
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("Metadata\t", src.metadata)
    print("-_" * 20)

Node ID	 2aa05360-f43a-4819-bce7-0acf7b897eab
Title	 Searching for Best Practices in Retrieval-Augmented Generation:1 Introduction
Text	 Generative large language models are prone to producing outdated information or fabricating facts, although they were aligned with human preferences by reinforcement learning [1] or lightweight alternatives [2–5]. Retrieval-augmented generation (RAG) techniques address these issues by combining the strengths of pretraining and retrieval-based models, thereby providing a robust framework for enhancing model performance [6]. Furthermore, RAG enables rapid deployment of applications for specific organizations and domains without necessitating updates to the model parameters, as long as query-related documents are provided. Many RAG approaches have been proposed to enhance large language models (LLMs) through query-dependent retrievals [6–8]. A typical RAG workflow usually contains multiple intervening processing steps: query classification (determining w

# Router

Routers are modules that take in a user query and a set of “choices” (defined by metadata) and return one or more selected choices.

They can be used for the following use cases and more:

- Selecting the right data source among a diverse range of data sources

- Deciding whether to do summarization (e.g. using summary index query engine) or semantic search (e.g. using vector index query engine)

- Deciding whether to try several choices at once and combine the results (using multi-routing capabilities).
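
To make the idea concrete, here is a minimal sketch of single-choice routing in plain Python. It is not how LlamaIndex implements its selectors, which ask an LLM to pick from the choice descriptions. The `Choice` class, the `route` function, and the word-overlap scoring are hypothetical stand-ins that only show the shape of the problem: a query and a set of described choices go in, and one selected choice comes out.

```python
import re
from dataclasses import dataclass


@dataclass
class Choice:
    """A routable option described by free-text metadata (hypothetical type)."""
    name: str
    description: str


def words(text):
    """Lowercase alphanumeric words, with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(query, choices):
    """Pick the choice whose description shares the most words with the query.

    Assumption: word overlap stands in for the LLM-based selection that a
    real router would perform.
    """
    query_words = words(query)
    # max() keeps the first choice on ties, so list order acts as a tie-breaker.
    return max(choices, key=lambda c: len(query_words & words(c.description)))


choices = [
    Choice("concepts", "general generative AI concepts such as RAG and fine-tuning"),
    Choice("models", "particular LLMs like Mistral, Claude, OpenAI, Gemini"),
]

selected = route("Tell me about Mistral and Claude", choices)
print(selected.name)
```

The important design point carries over to the real router: the quality of the routing depends almost entirely on how well the descriptions distinguish the choices.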


## Let's create a second query engine with information about specific LLMs (Mistral AI, Llama, Claude, OpenAI, Gemini)


In [None]:
from pathlib import Path
import requests
import time

wiki_titles = [
    "Mistral AI",
    "Llama (language model)",
    "Claude AI",
    "OpenAI",
    "Gemini AI",
]

data_path = Path("llm_data_wiki")
if not data_path.exists():
    data_path.mkdir()

# Set up headers with User-Agent (REQUIRED by Wikipedia API)
headers = {
    'User-Agent': 'YourAppName/1.0 (your-email@example.com)'  # Replace with your info if this dummy gives an error
}

for title in wiki_titles:
    try:
        # Make the request with headers
        response = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "format": "json",
                "titles": title,
                "prop": "extracts",
                "explaintext": True,
            },
            headers=headers  # Add headers here
        )

        # Check if request was successful
        response.raise_for_status()

        if not response.text:
            print(f"Empty response for '{title}'")
            continue

        data = response.json()

        # Extract the page content
        if "query" in data and "pages" in data["query"]:
            page = next(iter(data["query"]["pages"].values()))
            if "extract" in page:
                wiki_text = page["extract"]
                with open(data_path / "llm_data_wiki.txt", "a", encoding="utf-8") as fp:
                    fp.write(f"Title: {title}\n{wiki_text}\n\n")
                print(f"Successfully saved: {title}")
            else:
                print(f"No extract found for '{title}'")
        else:
            print(f"Unexpected response format for '{title}'")
        time.sleep(0.5)

    except requests.exceptions.RequestException as e:
        print(f"Request error for '{title}': {e}")
    except ValueError as e:  # JSON decode error
        print(f"JSON decode error for '{title}': {e}")
        print(f"Response text: {response.text[:200]}...")

Successfully saved: Mistral AI
Successfully saved: Llama (language model)
Successfully saved: Claude AI
Successfully saved: OpenAI
Successfully saved: Gemini AI


In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.text_splitter import TokenTextSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    KeywordExtractor,
)

# Load the Wikipedia text that the previous cell saved to llm_data_wiki
documents = SimpleDirectoryReader("llm_data_wiki").load_data()

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

transformations = [
    text_splitter,
    QuestionsAnsweredExtractor(questions=2),
    SummaryExtractor(summaries=["prev", "self"]),
    KeywordExtractor(keywords=10),
    OpenAIEmbedding(model="text-embedding-3-small"),
]

llm_index = VectorStoreIndex.from_documents(documents=documents, transformations=transformations)

llm_query_engine = llm_index.as_query_engine(similarity_top_k=2)

100%|██████████| 41/41 [00:26<00:00,  1.55it/s]
100%|██████████| 41/41 [01:24<00:00,  2.06s/it]
100%|██████████| 41/41 [00:19<00:00,  2.07it/s]
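
The `TokenTextSplitter` above is configured for chunks of up to 512 tokens with a 128-token overlap between neighbouring chunks, so text near a chunk boundary appears in both chunks. The sketch below shows that sliding-window idea on a small scale. It is a simplification under stated assumptions: it counts list items (think whitespace-separated words) rather than real tokenizer tokens, and `sliding_window_chunks` is a hypothetical helper, not the library's implementation.

```python
def sliding_window_chunks(words, chunk_size, chunk_overlap):
    """Split `words` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

A larger overlap reduces the chance that a relevant sentence is cut in half, at the cost of more chunks to embed and store.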


In [None]:
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# initialize tools
ai_tutor_knowledge_tool = QueryEngineTool.from_defaults(
    query_engine=ai_tutor_knowledge_query_engine,
    description="Useful for questions about general generative AI concepts",
)
llm_tool = QueryEngineTool.from_defaults(
    query_engine=llm_query_engine,
    description="Useful for questions about particular LLMs like Mistral, Claude, OpenAI, Gemini",
)

# initialize router query engine (single selection, LLM-based selector)
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        ai_tutor_knowledge_tool,
        llm_tool,
    ],
)

In [None]:
res = query_engine.query(
    "What is the LLama model?",
)
print(res.response)

The Llama model is a family of large language models developed by Meta AI. Key points:

- It began with a first release announced February 24, 2023, and spans model sizes from roughly 1 billion to about 2 trillion parameters.
- It is a foundation language model trained on publicly available data and intended to be accessible across a range of hardware sizes.
- Later versions include instruction‑tuned variants and expanded availability and licensing terms; Llama 3 added virtual-assistant features used in services like Facebook and WhatsApp.
- The project has been notable for research on scaling behavior (e.g., models trained beyond the “Chinchilla‑optimal” dataset size continuing to improve) and for third‑party tools and reimplementations enabling local inference without GPUs.


In [None]:
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 c1efb160-a4cf-4a00-8dc3-176e4d511761
Text	 at École Polytechnique Fédérale de Lausanne School of Computer and Communication Sciences, and the Yale School of Medicine. It shows increased performance on medical-related benchmarks such as MedQA and MedMCQA.
Zoom used Meta Llama 2 to create an AI Companion that can summarize meetings, provide helpful presentation tips, and assist with message responses. This AI Companion is powered by multiple models, including Meta Llama 2.
Reuters reported in 2024 that many Chinese foundation models relied on Llama models for their training.


=== llama.cpp ===

Software developer Georgi Gerganov released llama.cpp as open-source on March 10, 2023. It's a re-implementation of Llama in C++, allowing systems without a powerful GPU to run the model locally. The llama.cpp project introduced the GGUF file format, a binary format that stores both tensors and metadata. The format focuses on supporting different quantization types, which can reduce memo

In [None]:
res = query_engine.query("Explain parameter-efficient finetuning methods")
print(res.response)

Parameter-efficient fine-tuning (PEFT) refers to methods that adapt large pretrained models to new tasks while updating only a small fraction of the model’s parameters (or adding a small number of new parameters). This reduces compute, memory, and storage costs versus full fine-tuning. Three main PEFT approaches are described:

1. Selective
- What it is: Fine-tune only a subset of the model’s existing parameters (e.g., certain layers, biases, or normalization parameters).
- Effect: Keeps most weights frozen, so storage and compute remain low while allowing targeted adaptation.

2. Reparameterization (example: LoRA)
- What it is: Replace or augment dense weight updates with a low-rank decomposition. Instead of learning a full weight update ΔW, learn small matrices A and B such that ΔW ≈ B·A (or similar low-rank factorization).
- Effect: Greatly reduces number of trainable parameters. The rank controls the trade-off between parameter savings and approximation fidelity (lower rank → fewer

In [None]:
for src in res.source_nodes:
    print("Node ID\t", src.node_id)
    print("Text\t", src.text)
    print("Score\t", src.score)
    print("-_" * 20)

Node ID	 6be88fa3-2f8b-43e7-aba0-d874b39809fc
Text	 # FourierFT: Discrete Fourier Transformation Fine-Tuning[FourierFT](https://huggingface.co/papers/2405.03003) is a parameter-efficient fine-tuning technique that leverages Discrete Fourier Transform to compress the model's tunable weights. This method outperforms LoRA in the GLUE benchmark and common ViT classification tasks using much less parameters.FourierFT currently has the following constraints:- Only `nn.Linear` layers are supported.- Quantized layers are not supported.If these constraints don't work for your use case, consider other methods instead.The abstract from the paper is:> Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices A and B to represent the weight change, i.e., Delta W=BA. Despite LoRA's progress, it faces storage challenges when handling extensive customization adaptations or 

# Function Agent using the OpenAI GPT-5 Model


In [None]:
system_message_openai_agent = """You are an AI teacher, answering questions from students of an applied AI course on Large Language Models (LLMs or llm) and Retrieval Augmented Generation (RAG) for LLMs. Topics covered include training models, fine-tuning models, giving memory to LLMs, prompting tips, hallucinations and bias, vector databases, transformer architectures, embeddings, RAG frameworks, Langchain, LlamaIndex, making LLMs interact with tools, AI agents, reinforcement learning with human feedback. Questions should be understood in this context.

Your answers are aimed to teach students, so they should be complete, clear, and easy to understand.

Use the available tools to gather insights pertinent to the field of AI. Always use two tools at the same time. These tools accept a string (a user query rewritten as a statement) and return informative content regarding the domain of AI.
e.g:
User question: 'How can I fine-tune an LLM?'
Input to the tool: 'Fine-tuning an LLM'

User question: 'How can I quantize an LLM?'
Input to the tool: 'Quantization for LLMs'

User question: 'Teach me how to build an AI agent'
Input to the tool: 'Building an AI Agent'

Only some information returned by the tools might be relevant to the question, so ignore the irrelevant part and answer the question with what you have.

Your responses are exclusively based on the output provided by the tools. Refrain from incorporating information not directly obtained from the tool's responses.

When the conversation deepens or shifts focus within a topic, adapt your input to the tools to reflect these nuances. This means if a user requests further elaboration on a specific aspect of a previously discussed topic, you should reformulate your input to the tool to capture this new angle or more profound layer of inquiry.

Provide comprehensive answers, ideally structured in multiple paragraphs, drawing from the tool's variety of relevant details. The depth and breadth of your responses should align with the scope and specificity of the information retrieved.

Should the tools repository lack information on the queried topic, politely inform the user that the question transcends the bounds of your current knowledge base, citing the absence of relevant content in the tool's documentation.

At the end of your answers, always invite the students to ask deeper questions about the topic if they have any. Make sure to reformulate the question to the tool to capture this new angle or more profound layer of inquiry.

Do not refer to the documentation directly, but use the information provided within it to answer questions.

If code is provided in the information, share it with the students. It's important to provide complete code blocks so they can execute the code when they copy and paste them.

Make sure to format your answers in Markdown format, including code blocks and snippets.

Politely reject questions not related to AI, while being cautious not to reject unfamiliar terms or acronyms too quickly."""

In [None]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

# Initialize the LLM
llm = OpenAI(model="gpt-5", additional_kwargs={"reasoning_effort":"minimal"})

# Create the FunctionAgent
agent = FunctionAgent(
    tools=[ai_tutor_knowledge_tool, llm_tool],
    llm=llm,
    system_prompt=system_message_openai_agent,
    verbose=False
)

# Run the agent queries
import asyncio

async def run_agent(query):
    response = await agent.run(query)
    return response

In [None]:
# Execute the async function
response = asyncio.run(run_agent("What is the LLama model?"))

print(response)

LLaMA (often stylized as LLaMA or Llama) is a family of transformer-based large language models developed by Meta AI. First announced on February 24, 2023, LLaMA models are foundation models trained on publicly available data and designed to be adapted (e.g., via instruction tuning) for a wide range of NLP tasks. The family spans multiple sizes—from small models suitable for limited hardware budgets to very large ones—and has evolved through several releases (e.g., Llama 1, Llama 2, Llama 3).

Key points:
- Architecture and purpose: Transformer-based foundation models intended for broad NLP use, with instruction-tuned variants available in later releases.
- Sizes: Early releases included variants like 13B and 65B parameters. Llama 2 was released at 7B, 13B, and 70B. Later Llama 3 releases included sizes such as 8B, 13B, and 65B, with the family ultimately extending to much larger scales.
- Training data and scaling: Trained on publicly available data. Studies around Llama 3 noted that 

In [None]:
# Execute the async function
response = asyncio.run(run_agent("Explain parameter-efficient finetuning methods"))

print(response)

Parameter-efficient fine-tuning (PEFT) adapts large language models by training a tiny fraction of parameters while keeping the base weights frozen. This reduces compute, memory, and storage, and makes it easy to maintain many task-specific variants. Here are the main methods, their trade-offs, and practical tips:

1) Adapter modules
- What: Insert small trainable layers (bottleneck MLPs) between transformer layers; only adapters are trained.
- Pros: Low parameter overhead; easy to add/remove; maintain separate adapters per task.
- Cons: Adds inference latency/compute (extra forward ops) and minor architecture changes.

2) Low-Rank Adaptation (LoRA)
- What: Replace full weight updates with low-rank matrices (W_delta = A·B) on selected projection matrices (often attention Q/K/V/O or MLPs); base W is frozen.
- Pros: Very parameter-efficient; simple; negligible inference overhead if merged into base weights; widely used.
- Cons: Performance sensitive to rank choice and which layers are ta

In [None]:
# Execute the async function
response = asyncio.run(run_agent("Write the recipe for a chocolate cake."))

print(response)

I’m here to help with questions about AI, LLMs, and RAG. A chocolate cake recipe falls outside that scope, so I can’t provide it.

If you’d like, I can share how to build an AI assistant that retrieves recipes using RAG, or how to fine-tune an LLM on a corpus of cooking instructions to generate reliable recipes. For example:
- How to build a recipe RAG system (data ingestion, embeddings, vector search, prompt design)
- How to evaluate hallucinations in generated recipes
- How to make an agent that plans a cooking workflow and shopping list

Tell me which direction you prefer, and I’ll dive in. For instance, I can query: “Building a recipe RAG system with vector databases and LLMs” or “Fine-tuning an LLM on cooking instructions and preventing hallucinations.”


# Route code-related questions to GPT-5 and the remaining questions to Gemini

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector
from llama_index.core.tools import QueryEngineTool

# initialize LLMs
gpt_5_llm = OpenAI(model="gpt-5", additional_kwargs={"reasoning_effort":"minimal"})

gemini_llm = GoogleGenAI(model="gemini-2.5-flash", temperature=1, max_tokens=512)

# define query engines
llm_query_engine_code = vector_index.as_query_engine(
    llm=gpt_5_llm,
    similarity_top_k=3,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small", mode="text_search"),
)

llm_query_engine_rest = vector_index.as_query_engine(
    llm=gemini_llm,
    similarity_top_k=3,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small", mode="text_search"),
)

# define tools for LLM
llm_tool_code = QueryEngineTool.from_defaults(
    query_engine=llm_query_engine_code,
    description="Ideal for handling code-related queries, technical implementations, and troubleshooting involving Large Language Models.",
    name="LLMCodeTool",
)

llm_tool_rest = QueryEngineTool.from_defaults(
    query_engine=llm_query_engine_rest,
    description="Best suited for answering conceptual, theoretical, and general questions about Large Language Models.",
    name="LLMGeneralTool",

)


system_message_openai_agent_tools = """
You are a highly knowledgeable assistant specialized in Large Language Models. Your primary role is to assist users by providing accurate, detailed, and context-specific responses. You have access to two specialized tools:

1. **LLMCodeTool** – Use this tool when the query involves code-related tasks, technical implementations, debugging, or troubleshooting issues in code.
2. **LLMGeneralTool** – Use this tool for answering conceptual, theoretical, or general questions about Large Language Models that do not involve code specifics.

When a query is received:
- First, decide which tool best fits the user's request.
- If the question is technical or code-oriented, route the query to LLMCodeTool.
- If the question is more general or conceptual, route the query to LLMGeneralTool.
- If the query does not clearly fall into either category, provide a direct answer using your own capabilities.

Always ensure your responses are clear, concise, and directly address the user’s needs. Maintain a professional tone and provide detailed explanations where necessary.
"""
# Create the FunctionAgent
agent = FunctionAgent(
    tools=[llm_tool_code, llm_tool_rest],
    llm=gpt_5_llm,
    system_prompt=system_message_openai_agent_tools,
    verbose=False
)

# Run the agent queries
import asyncio

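async def run_agent(query):
    response = await agent.run(query)

    # The system prompt lets the agent answer directly, so there may be no tool call.
    if not response.tool_calls:
        return None
    return response.tool_calls[0].tool_name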


In [None]:
# Execute the async function
response_code = asyncio.run(run_agent("How do I fine-tune the LLama model? Write the code for it"))

print(response_code)

LLMCodeTool


In [None]:
response_general = asyncio.run(run_agent("What is the relationship between Llama models and Meta"))

print(response_general)

LLMGeneralTool
