# RAG Q&A with Google Gemini 2.5 Flash

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline:
1. Load a PDF document
2. Split into chunks
3. Create embeddings and store them in a vector database (Qdrant Cloud)
4. Ask questions using Gemini 2.5 Flash

## Step 1: Load PDF Document

In [1]:
!pip install qdrant-client langchain langchain-community langchain-google-genai \
             langchain-text-splitters \
             sentence-transformers \
             pypdf pandas \
             python-dotenv



In [54]:
response_cache = {}

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Transformers.pdf")
data = loader.load()
print(f"Loaded {len(data)} pages")
data[0]



Loaded 11 pages


Document(metadata={'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'author': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin', 'book': 'Advances in Neural Information Processing Systems 30', 'created': '2017', 'date': '2017', 'description': 'Paper accepted and presented at the Neural Information Processing Systems Conference (http://nips.cc/)', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly les

## Step 2: Split Documents into Chunks

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(data)

print(f"Total number of chunks: {len(docs)}")
docs[0]

Total number of chunks: 43


Document(metadata={'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'author': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin', 'book': 'Advances in Neural Information Processing Systems 30', 'created': '2017', 'date': '2017', 'description': 'Paper accepted and presented at the Neural Information Processing Systems Conference (http://nips.cc/)', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly les
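
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the helper name `sliding_chunks` is hypothetical and only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With an overlap, a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.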

## Step 3: Vector Database connection using Qdrant Cloud


In [41]:
import os
from dotenv import load_dotenv
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
from langchain_community.embeddings import HuggingFaceEmbeddings

load_dotenv()

qdrant = QdrantClient(
    url=os.getenv("QDRANT_URL"),
    api_key=os.getenv("QDRANT_API_KEY"),
    timeout=60
)

print("Connected to Qdrant Cloud")

Connected to Qdrant Cloud


## Adding the Embeddings from HuggingFace
Model ID: `sentence-transformers/all-MiniLM-L6-v2`

In [6]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vector_size = len(embeddings.embed_query("test"))
print("Embedding dimension:", vector_size)


Embedding dimension: 384


## Adding BM25 Reranking on Retrieved Docs

In [48]:
!pip install rank_bm25



In [50]:
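import asyncio

from rank_bm25 import BM25Okapi

# Relies on retrieve(), which is defined in the "Async Hybrid Retrieval" cell further down.
async def hybrid_retrieve(query, selected):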
    tasks = [retrieve(c, query) for c in selected]
    results = await asyncio.gather(*tasks)

    merged = []
    for r in results:
        merged.extend(r)

    # Step 1: Vector ranking
    merged.sort(key=lambda x: x[1], reverse=True)
    top_vector_docs = [doc for doc, score in merged[:6]]

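    # Step 2: BM25 reranking
    # BM25Okapi cannot be built from an empty corpus, so return early
    if not top_vector_docs:
        return []

    # Lowercase before splitting so matching is case-insensitive
    tokenized_docs = [doc.lower().split() for doc in top_vector_docs]
    bm25 = BM25Okapi(tokenized_docs)

    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)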

    combined = list(zip(top_vector_docs, bm25_scores))
    combined.sort(key=lambda x: x[1], reverse=True)

    # Return top 5 after hybrid rerank
    return [doc for doc, score in combined[:5]]
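
Note that the function above orders the final list by BM25 score alone; the vector score only decides which six candidates are considered. One common way to let both signals contribute is reciprocal rank fusion (RRF), which is a different technique from the one used in this notebook. The sketch below is an illustration under assumptions: the helper name `reciprocal_rank_fusion` is hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


vector_ranking = ["a", "b", "c"]  # order from the vector search
bm25_ranking = ["b", "c", "a"]    # order from BM25
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

RRF only needs ranks, so it avoids having to put cosine similarities and BM25 scores on a common scale.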

## Creating Collections for 4 Different Data Sources
1. research_papers: research paper content, used for answers grounded in papers
2. knowledge_base: business knowledge documents
3. code_docs: code and API documentation from a GitHub repository
4. faq_data: combined question-and-answer pairs

In [39]:
collections = [
    "research_papers",
    "knowledge_base",
    "code_docs",
    "faq_data"
]

# Fetch the existing collection names once instead of on every loop iteration
existing = {c.name for c in qdrant.get_collections().collections}

for name in collections:
    if name not in existing:
        qdrant.create_collection(
            collection_name=name,
            vectors_config=VectorParams(
                size=vector_size,
                distance=Distance.COSINE
            )
        )

print("Collections ready")

Collections ready
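
The collections are created with `Distance.COSINE`, so search results are ranked by the cosine similarity between the query vector and each stored vector. As a small worked example of what that metric measures, here is a pure-Python sketch; the helper name `cosine_similarity` is illustrative, and Qdrant computes this internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

Because only the direction matters, two texts of very different length can still score as highly similar.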


## Step 4: Ingestion Pipeline

In [10]:
# Research pipeline

from langchain_community.document_loaders import PyPDFLoader
import uuid

def ingest_research(pdf_path):
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )

    chunks = splitter.split_documents(docs)

    vectors = embeddings.embed_documents(
        [doc.page_content for doc in chunks]
    )

    points = [
        {
            "id": str(uuid.uuid4()),
            "vector": vectors[i],
            "payload": {
                "text": chunks[i].page_content,
                "page": chunks[i].metadata.get("page"),
                "source_file": pdf_path,
                "collection": "research_papers"
            }
        }
        for i in range(len(chunks))
    ]

    qdrant.upsert(
        collection_name="research_papers",
        points=points
    )

    print(f"Research ingested: {len(points)} chunks")
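
Because the point IDs above come from `uuid.uuid4()`, running the ingestion twice on the same file stores every chunk twice. One possible alternative, sketched here as an assumption rather than as part of the original pipeline, is to derive a deterministic UUID from the source path and chunk index with `uuid.uuid5`, so that a repeated upsert overwrites the existing point. The helper name `stable_point_id` and the `#chunk-` naming scheme are hypothetical.

```python
import uuid


def stable_point_id(source: str, chunk_index: int) -> str:
    """Return the same UUID string for the same (source, chunk_index) pair."""
    name = f"{source}#chunk-{chunk_index}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


first = stable_point_id("Transformers.pdf", 0)
again = stable_point_id("Transformers.pdf", 0)
other = stable_point_id("Transformers.pdf", 1)
print(first == again, first == other)
```

This only keeps IDs stable while the chunking parameters stay the same; changing `chunk_size` or `chunk_overlap` shifts the chunk boundaries, so the old points would need to be cleared first.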

In [11]:
# Knowledge Base Ingestion Pipeline

from langchain_community.document_loaders import DirectoryLoader, TextLoader

def ingest_knowledge_base(folder_path):
    loader = DirectoryLoader(
        folder_path,
        glob="**/*.md",
        loader_cls=TextLoader
    )

    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100
    )

    chunks = splitter.split_documents(docs)

    vectors = embeddings.embed_documents(
        [doc.page_content for doc in chunks]
    )

    points = [
        {
            "id": str(uuid.uuid4()),
            "vector": vectors[i],
            "payload": {
                "text": chunks[i].page_content,
                "source_file": chunks[i].metadata.get("source"),
                "collection": "knowledge_base"
            }
        }
        for i in range(len(chunks))
    ]

    qdrant.upsert(
        collection_name="knowledge_base",
        points=points
    )

    print(f"Knowledge base ingested: {len(points)} chunks")

In [12]:
# Code Docs Ingestion Pipeline

def ingest_code_docs(repo_path):
    loader = DirectoryLoader(
        repo_path,
        glob="**/*.py",
        loader_cls=TextLoader
    )

    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )

    chunks = splitter.split_documents(docs)

    vectors = embeddings.embed_documents(
        [doc.page_content for doc in chunks]
    )

    points = [
        {
            "id": str(uuid.uuid4()),
            "vector": vectors[i],
            "payload": {
                "text": chunks[i].page_content,
                "file": chunks[i].metadata.get("source"),
                "collection": "code_docs"
            }
        }
        for i in range(len(chunks))
    ]

    qdrant.upsert(
        collection_name="code_docs",
        points=points
    )

    print(f"Code docs ingested: {len(points)} chunks")

In [13]:
# FAQ Data Ingestion Pipeline

import pandas as pd

def ingest_faq(csv_path):
    df = pd.read_csv(csv_path)

    documents = []

    for _, row in df.iterrows():
        content = f"Question: {row['question']}\nAnswer: {row['answer']}"

        documents.append(
            {
                "id": str(uuid.uuid4()),
                "text": content,
                "category": row.get("category", "general")
            }
        )

    vectors = embeddings.embed_documents(
        [doc["text"] for doc in documents]
    )

    points = [
        {
            "id": documents[i]["id"],
            "vector": vectors[i],
            "payload": {
                "text": documents[i]["text"],
                "category": documents[i]["category"],
                "collection": "faq_data"
            }
        }
        for i in range(len(documents))
    ]

    qdrant.upsert(
        collection_name="faq_data",
        points=points
    )

    print(f"FAQ data ingested: {len(points)} entries")

In [14]:
# Master ingest function

def ingest_all():
    ingest_research("Transformers.pdf")
    ingest_knowledge_base("./knowledge_docs/")
    ingest_code_docs("./repo/")
    ingest_faq("faq.csv")
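
Each ingestion function above sends all of its points in a single `qdrant.upsert` call. For larger inputs it may help to upload in smaller batches so that one request does not run into the client timeout. The sketch below shows only the batching step; the helper name `batched` and the batch size of 64 are assumptions, not values from the original notebook.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage inside an ingestion function:
#     for batch in batched(points, 64):
#         qdrant.upsert(collection_name="research_papers", points=batch)
print(list(batched([1, 2, 3, 4, 5], 2)))
```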

## Step 5: Set Up Gemini 2.5 Flash & Ask Questions

In [15]:
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv
load_dotenv()

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3, max_tokens=500)
print("Gemini 2.5 Flash LLM ready!")

Gemini 2.5 Flash LLM ready!


In [46]:
def stream_answer(prompt):
    response = llm.stream(prompt)

    full_answer = ""
    for chunk in response:
        print(chunk.content, end="", flush=True)
        full_answer += chunk.content

    print()  # newline
    return full_answer

## Rewrite Query

In [17]:
def rewrite_query(question):
    rewrite_prompt = f"""
    Rewrite the following question into ONE clear standalone search query.

    Return ONLY the rewritten query.
    Do NOT explain.
    Do NOT give options.

    Question: {question}
    """

    response = llm.invoke(rewrite_prompt)
    return response.content.strip()
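
Even with "Return ONLY the rewritten query" in the prompt, a model can occasionally wrap the query in quotes or add a second line of explanation. A small post-processing step can guard against that. This is a sketch under assumptions: the helper name `clean_rewritten_query` is hypothetical, and falling back to the original question is one possible policy.

```python
def clean_rewritten_query(raw: str, fallback: str) -> str:
    """Keep the first non-empty line of the model output, without quotes."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return fallback  # the model returned nothing usable
    cleaned = lines[0].strip("\"'` ")
    return cleaned or fallback


print(clean_rewritten_query('"attention mechanism in transformers"\nNote: rewritten.', "attention?"))
print(clean_rewritten_query("   ", "attention?"))
```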

## Context Builder

In [18]:
def build_context_with_sources(docs):
    context = ""
    sources = []

    for doc in docs:
        page = doc.metadata.get("page", "unknown")
        context += f"[Page {page}]\n{doc.page_content}\n\n"
        sources.append(page)

    return context, list(set(sources))
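
The builder above expects LangChain `Document` objects, while the Qdrant search in this notebook returns payload dictionaries with keys such as `text`, `page`, `source_file`, and `file`. A variant that works directly on those payloads could look like the sketch below. The function name `build_context_from_payloads` is hypothetical, and the label format is an assumption.

```python
def build_context_from_payloads(payloads):
    """Build a labelled context string and an ordered list of unique sources."""
    blocks = []
    sources = []
    for payload in payloads:
        source = payload.get("source_file") or payload.get("file") or "unknown"
        page = payload.get("page")
        # Compare with None explicitly because page 0 is a valid page
        label = source if page is None else f"{source}, page {page}"
        blocks.append(f"[{label}]\n{payload['text']}")
        if label not in sources:
            sources.append(label)
    return "\n\n".join(blocks), sources


example_payloads = [
    {"text": "A", "source_file": "Transformers.pdf", "page": 0},
    {"text": "B", "file": "repo/x.py"},
    {"text": "C", "source_file": "Transformers.pdf", "page": 0},
]
context, sources = build_context_from_payloads(example_payloads)
print(context)
print(sources)
```

Keeping `sources` as an ordered list rather than a set makes the citation order match the order of the retrieved chunks.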


## Adding Memory for RAG

In [44]:
import langchain
print(langchain.__version__)

1.2.10


In [19]:
# Simple memory storage
chat_history = []

def update_memory(user_input, assistant_output):
    chat_history.append({
        "user": user_input,
        "assistant": assistant_output
    })

    # Optional: keep only last 5 conversations
    if len(chat_history) > 5:
        chat_history.pop(0)


def format_chat_history():
    formatted = ""
    for turn in chat_history:
        formatted += f"User: {turn['user']}\n"
        formatted += f"Assistant: {turn['assistant']}\n\n"
    return formatted
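
The same "keep only the last 5 turns" behaviour can also be expressed with `collections.deque`, which discards the oldest entry automatically once `maxlen` is reached. This is an alternative sketch, not the notebook's implementation; the names `history`, `remember`, and `format_history` are hypothetical.

```python
from collections import deque

MAX_TURNS = 5
history = deque(maxlen=MAX_TURNS)  # oldest turn is dropped automatically


def remember(user_input: str, assistant_output: str) -> None:
    history.append({"user": user_input, "assistant": assistant_output})


def format_history() -> str:
    return "".join(
        f"User: {turn['user']}\nAssistant: {turn['assistant']}\n\n"
        for turn in history
    )
```

Usage: after seven calls to `remember`, only the five most recent turns remain and `format_history()` starts with the third turn.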

In [20]:
# Adding the Hybrid logic 

COLLECTION_CONFIDENCE = {
    "research_papers": 1.0,
    "knowledge_base": 0.8,
    "code_docs": 1.2,
    "faq_data": 0.7
}
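
These weights are multiplied into the similarity score of each hit, so they can change which collection wins when results are merged. A short worked example with made-up similarity scores (the numbers and the names `COLLECTION_WEIGHTS`, `apply_weights`, and `raw_hits` are illustrative only):

```python
COLLECTION_WEIGHTS = {
    "research_papers": 1.0,
    "knowledge_base": 0.8,
    "code_docs": 1.2,
    "faq_data": 0.7,
}


def apply_weights(hits):
    """Scale each (collection, score) pair by its collection weight and sort."""
    weighted = [(name, score * COLLECTION_WEIGHTS[name]) for name, score in hits]
    return sorted(weighted, key=lambda hit: hit[1], reverse=True)


# Raw cosine scores: the FAQ hit is the closest match before weighting
raw_hits = [("faq_data", 0.95), ("research_papers", 0.90), ("code_docs", 0.80)]
ranked = apply_weights(raw_hits)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

After weighting, the `code_docs` hit (0.80 × 1.2) moves ahead of the `research_papers` hit (0.90 × 1.0), and the `faq_data` hit (0.95 × 0.7) drops to last place, even though it had the highest raw similarity.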

In [21]:
# adding the Dynamic K
def dynamic_k(name):
    if name == "code_docs":
        return 5
    return 3


In [22]:
# Hybrid Planner

def planner(question):
    # Lowercase first so "How", "API", etc. match regardless of capitalization
    q = question.lower()
    if "api" in q or "function" in q:
        return ["code_docs", "research_papers"]
    if "how" in q:
        return ["faq_data", "knowledge_base"]
    return ["research_papers", "knowledge_base"]
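
Plain substring checks also fire inside longer words: "api" occurs in "rapid" and "how" occurs in "shown". A possible refinement, shown here as a sketch and not as the notebook's planner, is to match whole words with regular expressions. The names `ROUTES`, `DEFAULT_ROUTE`, and `plan_collections` are hypothetical.

```python
import re

# Each route pairs a whole-word, case-insensitive pattern with target collections
ROUTES = [
    (re.compile(r"\b(api|function)\b", re.IGNORECASE), ["code_docs", "research_papers"]),
    (re.compile(r"\bhow\b", re.IGNORECASE), ["faq_data", "knowledge_base"]),
]
DEFAULT_ROUTE = ["research_papers", "knowledge_base"]


def plan_collections(question: str) -> list[str]:
    """Return the collections for the first route whose pattern matches."""
    for pattern, target_collections in ROUTES:
        if pattern.search(question):
            return list(target_collections)
    return list(DEFAULT_ROUTE)


print(plan_collections("How does attention work?"))
print(plan_collections("What is shown in the rapid results table?"))
```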

In [28]:
!pip install nest_asyncio



In [29]:
import nest_asyncio
nest_asyncio.apply()

In [52]:
# Async Hybrid Retrieval

import asyncio

async def retrieve(collection, query, filter_condition=None):
    vector = embeddings.embed_query(query)
    k = dynamic_k(collection)

    results = qdrant.query_points(
        collection_name=collection,
        query=vector,
        limit=k,
        query_filter=filter_condition
    )

    return [
        (
            point.payload["text"],
            point.score * COLLECTION_CONFIDENCE[collection]
        )
        for point in results.points
    ]


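# Note: this redefines hybrid_retrieve() without the BM25 reranking step from the
# earlier cell; whichever definition runs last is the one hybrid_agent() uses.
async def hybrid_retrieve(query, selected):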
    tasks = [retrieve(c, query) for c in selected]
    results = await asyncio.gather(*tasks)

    merged = []
    for r in results:
        merged.extend(r)

    merged.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in merged[:6]]
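
One caveat about the cell above: `qdrant.query_points` is a blocking call, so declaring `retrieve` as `async def` does not by itself let `asyncio.gather` overlap the requests. A possible way to get real overlap with a synchronous client is `asyncio.to_thread` (Python 3.9+), sketched below with a stand-in function instead of a real Qdrant call. The names `blocking_search` and `search_all` are hypothetical.

```python
import asyncio
import time


def blocking_search(collection: str, delay: float = 0.2):
    """Stand-in for a synchronous network call such as a vector search."""
    time.sleep(delay)
    return [(f"{collection}-doc", 1.0)]


async def search_all(collections):
    # Each blocking call runs in a worker thread, so the waits overlap
    tasks = [asyncio.to_thread(blocking_search, name) for name in collections]
    results = await asyncio.gather(*tasks)  # results keep the input order
    return [hit for hits in results for hit in hits]


# In a notebook with a running event loop, use `await search_all(...)` instead
hits = asyncio.run(search_all(["research_papers", "knowledge_base"]))
print(hits)
```

Another option would be an asynchronous client, which avoids threads entirely; which approach fits better depends on the rest of the application.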

## Convert to agent 

In [24]:
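def document_search(query):
    # NOTE: `retriever` is not defined anywhere in this notebook, and
    # hybrid_agent() below does not call this helper.
    docs = retriever.invoke(query)
    return docs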

In [56]:
async def hybrid_agent(question):

    if question in response_cache:
        print("Cached response")
        return response_cache[question]

    rewritten = rewrite_query(question)
    selected = planner(rewritten)

    docs = await hybrid_retrieve(rewritten, selected)

    context = "\n\n".join(docs)
    memory = format_chat_history()

    prompt = f"""
    You are a multi-resource AI assistant.

    Previous Conversation:
    {memory}

    Context:
    {context}

    Question:
    {question}
    """

    answer = stream_answer(prompt)

    update_memory(question, answer)

    response_cache[question] = answer

    return answer
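
The cache above is keyed on the exact question string, so "What is attention?" and " what is  attention? " are treated as different questions. A possible refinement is to normalize the text before using it as a key, as in the sketch below. The helper name `cache_key` is hypothetical, and hashing is optional (it only keeps the keys short).

```python
import hashlib


def cache_key(question: str) -> str:
    """Normalize case and whitespace, then hash to a fixed-length key."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


key_a = cache_key("What is attention?")
key_b = cache_key("  what is   ATTENTION? ")
key_c = cache_key("What is self-attention?")
print(key_a == key_b, key_a == key_c)
```

Keep in mind that a cached answer ignores the chat history, so a follow-up question whose meaning depends on earlier turns may be better served by skipping the cache.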

## Adding the Evaluation Cell

In [57]:
await hybrid_agent("Explain how attention works in the transformer model.")

range dependencies and contextual relationships between words, regardless of their position in the sequence.

At its


'range dependencies and contextual relationships between words, regardless of their position in the sequence.\n\nAt its'