<a href="https://colab.research.google.com/github/rubaahmedkhan/Agentic-RAG-System/blob/main/RAG_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📚 RAG System with Gemini + FAISS (No LangChain)

## 1. Install Libraries



In [None]:
!pip install sentence-transformers faiss-cpu PyPDF2 google-generativeai


# **2. Upload PDF File**

### Upload a file (first way)

In [None]:
from google.colab import files
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]
print("Uploaded:", pdf_path)



Saving Application for Not Attending College on Saturday1.pdf to Application for Not Attending College on Saturday1.pdf
Uploaded: Application for Not Attending College on Saturday1.pdf


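### Or use a file already in `/content` (second way)

In [None]:
# Overrides pdf_path from the upload cell; run only one of the two cells
pdf_path = "/content/resign letter.pdf"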

# **3. Extract Text from PDF**

In [None]:
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

text = extract_text_from_pdf(pdf_path)
print("Extracted text length:", len(text))


Extracted text length: 641
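The retrieved chunks printed in step 7 show extraction damage such as `le Ʃer`, `wriƟng` and `considera Ɵon`. A minimal clean-up sketch, assuming (from those examples only) that this PDF's font makes PyPDF2 emit `Ʃ` for "tt" and `Ɵ` for "ti"; the `GLYPH_FIXES` mapping and `clean_pdf_text` helper are hypothetical and specific to this document. Unicode NFKC normalization additionally folds compatibility ligatures such as `ﬁ` into plain `fi`.

```python
import unicodedata

# Hypothetical mapping inferred from this PDF's output ("wriƟng" -> "writing", "le Ʃer" -> "letter");
# other fonts or PDFs may mis-map different characters.
GLYPH_FIXES = {"Ʃ": "tt", "Ɵ": "ti"}

def clean_pdf_text(text: str) -> str:
    # NFKC turns compatibility ligatures such as "ﬁ" (U+FB01) into "fi"
    text = unicodedata.normalize("NFKC", text)
    for bad, good in GLYPH_FIXES.items():
        text = text.replace(bad, good)
    return text

print(clean_pdf_text("I am wriƟng to inform you. I trust this ﬁnds you well."))
# → I am writing to inform you. I trust this finds you well.
```

It would be applied before chunking, e.g. `text = clean_pdf_text(extract_text_from_pdf(pdf_path))`. Stray spaces inside words (`considera Ɵon` becomes `considera tion`) are left alone, because a space cannot be removed safely without a dictionary.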


# **4. Split Text into Chunks**

In [None]:
def split_text(text, chunk_size=500, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunk = text[start:start + chunk_size]
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

chunks = split_text(text)
print(f"Total chunks: {len(chunks)}")


Total chunks: 2
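With `chunk_size=500` and `overlap=50` the window advances 450 characters, so the 641-character letter yields chunks starting at 0 and 450. `split_text` has two edge cases: an `overlap` of `chunk_size` or more makes the step zero or negative, so the loop never ends; and when the previous chunk already reached the end of the text, it still emits a final chunk that is fully contained in that previous one (a 500-character text yields 2 chunks). A sketch of a drop-in variant that guards against both:

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks that share `overlap` characters with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap          # 450 with the defaults
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # this chunk already reaches the end of the text
        start += step
    return chunks

demo = "".join(chr(ord("a") + i % 26) for i in range(641))   # 641 chars, like the letter
demo_chunks = split_text(demo)
print([len(c) for c in demo_chunks])      # → [500, 191]
print(len(split_text("x" * 500)))         # → 1 (the original loop returns 2 here)
```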


# **5. Generate Embeddings**

## **🧠 1. Local Embedding (e.g., SentenceTransformer)**


✅ **Pros of Local Embedding**

- ⚡ Fast (after loading): once the model is loaded, embeddings are generated quickly.

- 🔐 Private: embedding runs on your own machine, so no document text leaves it at this step.

- 📴 Works offline: after the model has been downloaded, it runs without an internet connection.

❌ **Cons of Local Embedding**
- 🖥️ Uses your system resources (RAM and CPU/GPU): this can be demanding on lower-spec machines.

- 🧠 Initial load time: the first load (including the model download) can take several seconds or more.

- 💻 May lag or crash on low-RAM machines: with roughly 4 GB of RAM or less, expect performance issues.



### **Local embedding models**

| Model Name                  | Size (approx.) | Speed       | Quality / use case              |
| --------------------------- | -------------- | ----------- | ------------------------------- |
| `paraphrase-MiniLM-L3-v2`   | \~70MB         | ✅ Fastest  | ⚠️ Lower quality                |
| `all-MiniLM-L12-v2`         | \~130MB        | Medium      | Higher quality                  |
| `multi-qa-MiniLM-L6-cos-v1` | \~90MB         | Good        | Tuned for question answering    |
| `all-MiniLM-L6-v2`          | \~90MB         | ✅ Balanced | ✅ Good (used in this notebook) |


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')

def create_embeddings(chunks):
    return embed_model.encode(chunks)

embeddings = create_embeddings(chunks)
print("Embeddings shape:", embeddings.shape)



Embeddings shape: (2, 384)


# **6. Build FAISS Index**

In [None]:
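import faiss
import numpy as np

def create_faiss_index(embeddings):
    dim = embeddings.shape[1]                            # 384 for all-MiniLM-L6-v2
    index = faiss.IndexFlatL2(dim)                       # exact (brute-force) search by L2 distance
    index.add(np.asarray(embeddings, dtype="float32"))   # FAISS expects float32 vectors
    return index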

index = create_faiss_index(embeddings)


# **7. Search Similar Chunks**

| Distance Type         | Description                                                                                                   |
| --------------------- | ------------------------------------------------------------------------------------------------------------- |
| **L2 distance**       | Euclidean distance: √((x₁−y₁)² + (x₂−y₂)² + …). `IndexFlatL2` reports the *squared* distance; smaller means more similar. |
| **Cosine similarity** | Measures the angle between vectors; use `IndexFlatIP` on L2-normalized vectors; larger means more similar.    |
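For unit-length vectors, ‖a − b‖² = 2 − 2·cos(a, b), so squared L2 and cosine give the same ranking once vectors are normalized; in FAISS this is typically done with `faiss.normalize_L2` followed by an `IndexFlatIP`. The NumPy sketch below (no FAISS required; the helper names are illustrative) checks that identity on random vectors.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (what faiss.normalize_L2 does in place)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def squared_l2(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance, the score IndexFlatL2 reports."""
    return ((vectors - query) ** 2).sum(axis=1)

def cosine(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity = inner product of unit vectors (IndexFlatIP's score after normalization)."""
    return normalize(vectors) @ normalize(query)

rng = np.random.default_rng(0)
demo_docs = rng.normal(size=(5, 384))
demo_q = rng.normal(size=384)

d2 = squared_l2(normalize(demo_q), normalize(demo_docs))
cos = cosine(demo_q, demo_docs)
print(np.allclose(d2, 2 - 2 * cos))                          # the identity holds
print(np.argsort(d2).tolist() == np.argsort(-cos).tolist())  # same nearest-neighbour order
```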


In [None]:
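def search_similar_chunks(query, k=3):
    q_emb = embed_model.encode([query])
    D, I = index.search(np.asarray(q_emb, dtype="float32"), k)
    # FAISS returns -1 for missing neighbours when k exceeds the number of indexed chunks;
    # without this filter, chunks[-1] silently repeats the last chunk
    return [chunks[i] for i in I[0] if i != -1]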

# Test search
print(search_similar_chunks("what is the main purpose of this letter"))


['Subject: Availability for Academic Session 2025–26  \nRespected Principal,  \nI trust this le Ʃer ﬁnds you in the best of health and spirits.  \nI am wriƟng to inform you that, due to personal and professional commitments, I will not be \navailable for the upcoming academic session 2025–26. I am sincerely grateful for the \nopportunity to work under your guidance, and I truly value the experience and growth I have \ngained during my Ɵme here. \nI kindly request that you take my unavailability into cons', ' request that you take my unavailability into considera Ɵon while planning for the next \nsession. \nThank you once again for your understanding and con Ɵnued support.  \nWarm regards,  \nRuba \n ', ' request that you take my unavailability into considera Ɵon while planning for the next \nsession. \nThank you once again for your understanding and con Ɵnued support.  \nWarm regards,  \nRuba \n ']


# **8. Setup Gemini API**

In [None]:
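import os
import google.generativeai as genai
from google.colab import userdata

# Read the key from Colab secrets instead of pasting it into the notebook
GOOGLE_API_KEY = userdata.get("GEMINI_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)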

def get_answer_from_context_gemini(context_chunks, query):
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    model = genai.GenerativeModel("gemini-2.5-flash")
    response = model.generate_content(prompt)
    return response.text.strip()
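The prompt above simply concatenates the retrieved chunks with the question. A slightly stricter template, sketched here as a hypothetical `build_grounded_prompt` helper (not part of the original notebook), numbers each chunk and tells the model to answer only from that context, which makes unsupported answers easier to spot.

```python
def build_grounded_prompt(context_chunks: list[str], query: str) -> str:
    """Hypothetical helper: number the retrieved chunks and restrict the model to them."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you don't know.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )

print(build_grounded_prompt(["I will not be available for the 2025–26 session."],
                            "Why has Ruba written this letter?"))
```

Inside `get_answer_from_context_gemini` it would replace the f-string, e.g. `model.generate_content(build_grounded_prompt(context_chunks, query))`.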


# **9. End-to-End Example**

In [None]:
query = "Why has Ruba written this letter?"
relevant_chunks = search_similar_chunks(query, k=5)
answer = get_answer_from_context_gemini(relevant_chunks, query)

# print("📄 Context Chunks:")
# for c in relevant_chunks:
#     print("-", c[:100].replace("\n"," "), "...")
print("\n💡 Gemini says:")
print(answer)



💡 Gemini says:
Ruba has written this letter to **inform the Principal that she will not be available for the upcoming academic session 2025-26** due to personal and professional commitments, and to request that this unavailability be taken into consideration during the planning for the next session.


# **AGENTIC AI with RAG**

In [None]:
!pip install -qU openai-agents

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
import os

from agents import Agent, Runner, AsyncOpenAI, OpenAIChatCompletionsModel, ModelSettings
from agents.run import RunConfig
from google.colab import userdata



In [None]:
gemini_api_key = userdata.get("GEMINI_API_KEY")


# Check that the API key is present; if not, raise an error
if not gemini_api_key:
    raise ValueError("GEMINI_API_KEY is not set. Please add it to your Colab secrets.")

# Reference: https://ai.google.dev/gemini-api/docs/openai
external_client = AsyncOpenAI(
    api_key=gemini_api_key,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# Other ModelSettings fields include temperature, top_p and tool_choice
model_settings = ModelSettings(
    max_tokens=2000,
)

model = OpenAIChatCompletionsModel(
    model="gemini-2.0-flash",
    openai_client=external_client
)

# `model` is already a Model instance, so no separate model_provider is needed
config = RunConfig(
    model=model,
    model_settings=model_settings,
    tracing_disabled=True
)

In [None]:
from agents import function_tool

@function_tool
def ask_rag_question(query: str) -> str:
    """Answer a question about the uploaded PDF using retrieved chunks and Gemini."""
    relevant_chunks = search_similar_chunks(query, k=5)
    answer = get_answer_from_context_gemini(relevant_chunks, query)

    # Preview the first 100 characters of each retrieved chunk
    chunks_preview = "\n".join("- " + c[:100].replace("\n", " ") + " ..." for c in relevant_chunks)
    return f"📄 Context Chunks:\n{chunks_preview}\n\n💡 Gemini says:\n{answer}"

In [None]:
from agents import Agent

agent = Agent(
    name="rag_agent",
    instructions="""
You are a professional assistant specialized in handling queries related to resignation letters.

You MUST use the tool called `ask_rag_question` to answer all user questions. Do not generate any answer yourself.

If the tool returns a result, present it clearly to the user.

If the tool is unable to return an answer, politely respond with something like:
"I'm sorry, but I couldn't find enough information to answer your question at the moment."

Never attempt to generate answers on your own — using the tool is mandatory.
"""
,
    tools=[ask_rag_question],
)


In [None]:
from agents import Runner

result = await Runner.run(
    agent,
    input="hello",
    run_config=config  # RunConfig with the Gemini model, defined above
)

print(result.final_output)


How can I help you with your resignation letter today?



# **🌐  Hugging Face Inference API (Cloud-based)**

✅ **Pros of Using the Hugging Face Inference API**

- 🚫 No local compute required: the model runs on Hugging Face's servers, not on your machine.

- 💻 Suitable for low-RAM systems: works even on lightweight or older devices.

- 🔄 Easy to share across clients: many lightweight clients or users can call the same hosted model.

❌ **Cons of Using the Hugging Face Inference API**
- 🌐 Requires an internet connection: cannot run offline.

- 🕓 Added latency: every request pays for a network round trip plus server-side processing.

- 🔐 Data goes to the cloud: avoid sending sensitive or private content.

- 🔁 Rate limits apply: the free tier caps how many requests you can make; the exact limits depend on your plan and change over time.
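When a hosted API starts rejecting requests because of rate limits, the usual remedy is to retry with exponential backoff. A minimal sketch, assuming a zero-argument callable `fn` that performs one request (for example a hypothetical `lambda: embed_remote(text)`); in practice `retry_on` should be narrowed to the error your client raises for HTTP 429, and `sleep` is injectable only so the sketch can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a retryable error, wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: re-raise the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    raise RuntimeError("unreachable")
```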



# **1. Web Scraping**

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://openai.github.io/openai-agents-python/"
visited = set()

def scrape_all(url):
    if url in visited: return []
    visited.add(url)
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Extract text & code
    texts = [el.get_text(separator="\n") for el in soup.select("p, li, code, pre, h1, h2, h3")]
    results = [(url, "\n".join(texts))]

    # Recursively follow internal links (relative hrefs are resolved against the current page)
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if link.startswith(base_url):
            results += scrape_all(link)

    return results

pages = scrape_all(base_url)
print(f"Scraped {len(pages)} pages")


Scraped 1839 pages
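The page count is probably inflated: documentation sites link to in-page anchors (`page/#section`), and every anchor variant of a URL counts as a separate entry in `visited`. The hypothetical variant below drops `#fragment`s with `urllib.parse.urldefrag`, resolves each link against the page it appears on, and walks the site with a queue instead of recursion (deep recursion can hit Python's recursion limit on large sites). `get_links` is an assumed callable returning a page's raw `href`s, so the sketch runs without network access; in the notebook it would wrap `requests.get` plus BeautifulSoup.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

def normalize_link(page_url: str, href: str) -> str:
    """Resolve href against the page it appears on and drop any #fragment."""
    return urldefrag(urljoin(page_url, href)).url

def crawl(start_url, get_links, prefix=None, max_pages=2000):
    """Breadth-first crawl of pages under `prefix`; get_links(url) returns the raw hrefs on a page."""
    prefix = prefix or start_url
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            link = normalize_link(url, href)
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for the live site
site = {
    "https://example.com/docs/": ["agents/", "handoffs/#usage", "#top", "https://other.com/"],
    "https://example.com/docs/agents/": ["../handoffs/", "../"],
    "https://example.com/docs/handoffs/": [],
}
demo_pages = crawl("https://example.com/docs/", lambda u: site.get(u, []))
print(demo_pages)  # each page once, despite the anchor and relative links
```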


# **2. Chunk Text & Code**

In [None]:
def chunk_text(text, chunk_size=500, overlap=50):
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i+chunk_size])
        i += chunk_size - overlap
    return chunks

all_chunks = []
for url, txt in pages:
    for chunk in chunk_text(txt):
        all_chunks.append((url, chunk))
print(f"Total chunks: {len(all_chunks)}")

Total chunks: 47392
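A quick size check before embedding: a flat FAISS index stores every vector as float32, so its raw memory is roughly n × d × 4 bytes, and each query is compared against all n vectors. The helper name is illustrative; the batch count assumes sentence-transformers' default `batch_size=32` for `encode`.

```python
import math

def flat_index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage of a flat FAISS index: one float32 (4 bytes) per dimension per vector."""
    return n_vectors * dim * bytes_per_value

n_chunks, dim = 47_392, 384          # chunk count above, all-MiniLM-L6-v2 embedding size
size = flat_index_bytes(n_chunks, dim)
print(f"{size:,} bytes ≈ {size / 1e6:.1f} MB")                      # → 72,794,112 bytes ≈ 72.8 MB
print(math.ceil(n_chunks / 32), "encode batches at batch_size=32")  # → 1481 encode batches at batch_size=32
```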


# **3. Embed & Store in FAISS**

In [None]:
import faiss, numpy as np
from sentence_transformers import SentenceTransformer

# Named `embedding_model` so it does not overwrite the Gemini chat `model` used by the agent
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [chunk for _, chunk in all_chunks]
embs = embedding_model.encode(texts, show_progress_bar=True)

index = faiss.IndexFlatL2(embs.shape[1])
index.add(np.asarray(embs, dtype="float32"))


# **4. Search Function**

In [None]:
def search_docs(query, k=5):
    qv = embedding_model.encode([query])
    D, I = index.search(np.asarray(qv, dtype="float32"), k)
    # Each entry of all_chunks is a (url, chunk) pair; -1 marks a missing neighbour
    return [all_chunks[i] for i in I[0] if i != -1]



# **5. Answering via Gemini**

In [None]:
import google.generativeai as genai
from google.colab import userdata

genai.configure(api_key=userdata.get("GEMINI_API_KEY"))


# Embedding model

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # ✅ Rename to avoid overwrite

# Gemini model
llm = genai.GenerativeModel("gemini-2.5-flash")

def answer_query(query):
    docs = search_docs(query)
    context = "\n\n".join(f"From {url}:\n{chunk}" for url, chunk in docs)
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    res = llm.generate_content(prompt)
    return res.text


In [None]:
def search_docs(query, k=5):
    qv = embedding_model.encode([query])  # ✅ Correct variable name
    D, I = index.search(np.array(qv), k)
    return [(all_chunks[i][0], all_chunks[i][1]) for i in I[0]]


# **Agents with RAG**

In [None]:
query = "what is handoff and guardrils"
answer = answer_query(query)

print("🔍 Question:", query)
print("\n🧠 Gemini's Answer:\n", answer)


🔍 Question: what is handoff and guardrils

🧠 Gemini's Answer:
 Based on the provided text:

**Handoffs:**
*   Handoffs allow an agent to **delegate tasks to another agent**.
*   They are particularly useful when different agents **specialize in distinct areas** (e.g., a customer support app having agents for order status, refunds, FAQs).
*   Handoffs are **represented as tools to the LLM**.
*   There is a `handoff()` function that can be used for **customizing handoffs**.
*   They involve **handoff inputs** and **input filters**, and there are **recommended prompts** for their use.

**Guardrails:**
*   "Guardrails" is listed as a feature or section in the documentation, appearing under "Orchestrating multiple agents" and also as a standalone concept.
*   **The provided text does not offer a definition or explanation of what guardrails are.** It only indicates that it's a topic covered in the documentation.


In [None]:
from agents import Agent , function_tool, Runner


@function_tool
def search_openai_docs(query: str) -> str:
    """Search OpenAI Agents documentation and answer using Gemini Flash."""
    docs = search_docs(query)
    context = "\n\n".join(f"From {url}:\n{chunk}" for url, chunk in docs)
    prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
    res = llm.generate_content(prompt)
    return res.text


In [None]:


@function_tool
def summarize_text() -> str:
    """
    Summarize the OpenAI Agents SDK documentation.
    This version is safe for testing and avoids hanging by limiting chunk count.
    """
    # Limit for testing – only first 10 chunks
    selected_chunks = [chunk for _, chunk in all_chunks[:10]]

    partial_summaries = []
    for i, chunk in enumerate(selected_chunks):
        prompt = f"Summarize the following part of the OpenAI SDK documentation:\n\n{chunk}"
        try:
            print(f"⏳ Summarizing chunk {i+1}/{len(selected_chunks)}...")
            response = llm.generate_content(prompt)
            partial_summaries.append(response.text)
        except Exception as e:
            print(f"⚠️ Skipping chunk {i+1} due to error: {e}")
            continue  # skip if Gemini fails

    # Combine all summaries
    if not partial_summaries:
        return "❌ Summary could not be generated. Please try again."

    final_prompt = (
        "Combine the following into a clear summary of the OpenAI Agents SDK:\n\n"
        + "\n\n".join(partial_summaries)
    )

    try:
        final_response = llm.generate_content(final_prompt)
        return final_response.text
    except Exception as e:
        return "⚠️ Final summarization failed. Try again later."
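`summarize_text` follows a map-reduce pattern: summarize each chunk, then combine the partial summaries in one final call. With many chunks that final prompt can grow too long, so a common refinement is to reduce in groups until one summary remains. A sketch with a hypothetical `map_reduce_summarize` helper; `summarize` stands in for a Gemini call such as `lambda text: llm.generate_content("Summarize:\n\n" + text).text`.

```python
from typing import Callable

def map_reduce_summarize(chunks: list[str], summarize: Callable[[str], str], group_size: int = 5) -> str:
    """Summarize every chunk (map), then merge partial summaries group by group (reduce) until one remains."""
    if not chunks:
        raise ValueError("nothing to summarize")
    if group_size < 2:
        raise ValueError("group_size must be at least 2, otherwise the reduce step never shrinks")
    summaries = [summarize(chunk) for chunk in chunks]           # map: one call per chunk
    while len(summaries) > 1:                                    # reduce: may take several rounds
        summaries = [
            summarize("\n\n".join(summaries[i:i + group_size]))
            for i in range(0, len(summaries), group_size)
        ]
    return summaries[0]

# Stand-in for an LLM call so the sketch runs offline
calls = []
def fake_summarize(text: str) -> str:
    calls.append(text)
    return f"S{len(calls)}"

print(map_reduce_summarize([f"chunk {i}" for i in range(10)], fake_summarize))  # → S13
print(len(calls))  # → 13: 10 map calls, then 2 + 1 reduce calls
```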


In [None]:
agent = Agent(
    name="docs_agent",
    instructions=(
        "You answer questions using information from the OpenAI Agents SDK documentation. "
        "You MUST use the `search_openai_docs` tool for answers. "
        "If no relevant info is found, politely decline: 'Sorry, I couldn't find anything.' "
        "If the user asks for a summary, overview, or explanation of the entire SDK, "
        "use the `summarize_text` tool to generate a summary from the documentation."
    ),
    tools=[search_openai_docs, summarize_text],
)
result = await Runner.run(agent,"give summary",run_config=config)
print(result.final_output)


⏳ Summarizing chunk 1/10...
⏳ Summarizing chunk 2/10...
⏳ Summarizing chunk 3/10...
⏳ Summarizing chunk 4/10...
⏳ Summarizing chunk 5/10...
⏳ Summarizing chunk 6/10...
⏳ Summarizing chunk 7/10...
⏳ Summarizing chunk 8/10...
⏳ Summarizing chunk 9/10...
⏳ Summarizing chunk 10/10...
The OpenAI Agents SDK is a toolkit for developing, deploying, and managing AI agents, providing a structured approach to building agent-powered applications. It includes functionalities such as: core agent functions, workflow orchestration, control and safety features like guardrails, tool integration, and handoffs between agents. The SDK supports voice agents, various models (including those via LiteLLM), and offers configuration options. It also provides observability tools like tracing and agent visualization, along with utilities like a REPL for development. The SDK covers aspects like inputs, results, events, exceptions and extensions.

