# RAG Q&A with Google Gemini 2.5 Flash

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline:
1. Load a PDF document
2. Split it into chunks
3. Create embeddings and a FAISS vector store
4. Test retrieval
5. Ask questions using Gemini 2.5 Flash, with query rewriting, re-ranking, and conversation memory

## Step 1: Load PDF Document

In [34]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("Transformers.pdf")
data = loader.load()
print(f"Loaded {len(data)} pages")
data[0]

Loaded 11 pages


Document(metadata={'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'author': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin', 'book': 'Advances in Neural Information Processing Systems 30', 'created': '2017', 'date': '2017', 'description': 'Paper accepted and presented at the Neural Information Processing Systems Conference (http://nips.cc/)', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly les

## Step 2: Split Documents into Chunks

In [35]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(data)

print(f"Total number of chunks: {len(docs)}")
docs[0]

Total number of chunks: 43


Document(metadata={'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'author': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin', 'book': 'Advances in Neural Information Processing Systems 30', 'created': '2017', 'date': '2017', 'description': 'Paper accepted and presented at the Neural Information Processing Systems Conference (http://nips.cc/)', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly les
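To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at separators such as paragraph and line breaks), and the function name `sliding_window_chunks` is hypothetical. It only illustrates how consecutive chunks share their boundary text.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2,500-character text (made up for illustration)
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # neighbours share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.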

## Step 3: Create Embeddings & Vector Store

This step embeds each chunk with the HuggingFace `all-MiniLM-L6-v2` model and indexes the resulting vectors in a FAISS vector store.

In [36]:
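# Note: recent LangChain releases also ship this class in the separate
# `langchain_huggingface` package, which is the recommended import path.
from langchain_community.embeddings import HuggingFaceEmbeddings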
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Test embedding
vector = embeddings.embed_query("hello, world!")
print(f"Embedding dimension: {len(vector)}")
vector[:5]

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 1038.30it/s, Materializing param=pooler.dense.weight]                             
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Embedding dimension: 384


[-0.038177136331796646,
 0.03291105851531029,
 -0.00545938266441226,
 0.014369907788932323,
 -0.040291059762239456]

In [37]:
vectorstore = FAISS.from_documents(documents=docs, embedding=embeddings)
print("Vector store created successfully!")

Vector store created successfully!


## Step 4: Test Retrieval

In [38]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

retrieved_docs = retriever.invoke("What is the main topic of this paper?")
print(f"Retrieved {len(retrieved_docs)} chunks")

# Preview first 3 chunks
for i, doc in enumerate(retrieved_docs[:3]):
    print(f"\n--- Chunk {i+1} (Page {doc.metadata.get('page', '?')}) ---")
    print(doc.page_content[:200] + "...")

Retrieved 5 chunks

--- Chunk 1 (Page 8) ---
tensorflow/tensor2tensor.
Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful
comments, corrections and inspiration.
9...

--- Chunk 2 (Page 9) ---
2017.
[16] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks.
In International Conference on Learning Representations , 2017.
[17] Diederik Kingma and Jimmy Ba. ...

--- Chunk 3 (Page 0) ---
efﬁcient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and
implementing tensor2tensor, replacing our earlier codebase, greatly improving results a...


In [39]:
# Debug helper: print each retrieved chunk together with its FAISS score

def debug_retrieval(query):
    results = vectorstore.similarity_search_with_score(query, k=5)

    for doc, score in results:
        print("Score:", round(score, 4))
        print("Page:", doc.metadata.get("page"))
        print(doc.page_content[:150])
        print("-" * 50)
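The scores printed by `debug_retrieval` are easy to misread. Assuming the default LangChain FAISS configuration, which uses Euclidean (L2) distance, the score is a distance rather than a similarity, so lower values mean closer matches. The pure-Python sketch below shows that ranking rule on made-up 3-dimensional vectors; the helper names `squared_l2` and `top_k` are hypothetical and are not part of FAISS or LangChain.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Return (index, distance) pairs for the k closest vectors, nearest first."""
    scored = [(i, squared_l2(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # ascending: smaller distance = better match
    return scored[:k]


# Toy vectors standing in for chunk embeddings (made up for illustration)
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs))  # [(0, 0.0), (2, 0.5)]
```

If a different distance strategy is configured (for example inner product), the ordering rule changes, so it is worth checking which one the vector store was built with before interpreting the numbers.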

In [40]:
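# Re-rank retrieved chunks by asking the LLM to score each one.
# Note: `llm` is created in Step 5, so run that cell before calling this function.

def rerank_docs(question, docs):
    scored_docs = []

    for doc in docs:
        score_prompt = f"""
        Rate from 1-10 how relevant this passage is
        for answering the question.

        Question: {question}

        Passage:
        {doc.page_content}

        Only return a number.
        """

        # One LLM call per chunk
        score = llm.invoke(score_prompt).content.strip()

        try:
            score = int(score)
        except ValueError:
            # Treat any reply that is not a plain integer as "not relevant"
            score = 0

        scored_docs.append((doc, score))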

    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, _ in scored_docs[:3]]
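`rerank_docs` makes one LLM call per chunk, and a reply such as `8/10` falls through to a score of 0 because `int()` rejects it. One possible alternative, sketched below, is to score all passages in a single call and parse the reply more leniently. The function names, the prompt wording, and the reply format are assumptions for illustration and have not been tested against Gemini.

```python
import re


def build_batch_rerank_prompt(question, passages):
    """Build one prompt that asks for a score per numbered passage."""
    numbered = "\n\n".join(
        f"Passage {i + 1}:\n{text}" for i, text in enumerate(passages)
    )
    return (
        "Rate from 1-10 how relevant each passage is for answering the question.\n"
        "Reply with one line per passage in the form '<passage number>: <score>'.\n\n"
        f"Question: {question}\n\n{numbered}"
    )


def parse_batch_scores(reply, n_passages):
    """Parse '<number>: <score>' lines; missing or out-of-range entries score 0."""
    scores = [0] * n_passages
    pattern = r"^[ \t]*(\d+)[ \t]*:[ \t]*(\d+)"
    for idx, score in re.findall(pattern, reply, flags=re.MULTILINE):
        idx, score = int(idx), int(score)
        if 1 <= idx <= n_passages and 1 <= score <= 10:
            scores[idx - 1] = score
    return scores


reply = "1: 8\n2: 3/10\n3: not sure"  # hypothetical model reply
print(parse_batch_scores(reply, 3))  # [8, 3, 0]
```

In the notebook, the prompt would be sent with something like `llm.invoke(build_batch_rerank_prompt(question, [d.page_content for d in docs])).content`, and the parsed scores zipped back onto `docs` before sorting.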

## Step 5: Set Up Gemini 2.5 Flash & Ask Questions

In [41]:
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv
load_dotenv()

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3, max_tokens=500)
print("Gemini 2.5 Flash LLM ready!")

Gemini 2.5 Flash LLM ready!


## Rewrite Query

In [42]:
def rewrite_query(question):
    rewrite_prompt = f"""
    Rewrite the following question into ONE clear standalone search query.

    Return ONLY the rewritten query.
    Do NOT explain.
    Do NOT give options.

    Question: {question}
    """

    response = llm.invoke(rewrite_prompt)
    return response.content.strip()
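`rewrite_query` sees only the current question, so a follow-up such as "How is it different from RNN?" gives the model nothing to resolve "it" against. In the evaluation run at the end of this notebook, that question was rewritten to "LSTM vs RNN differences". A possible fix, sketched here under assumptions, is to include recent conversation turns in the rewrite prompt. The function name `build_rewrite_prompt` and the extra prompt line are hypothetical; `history` is assumed to use the same list-of-dicts shape as `chat_history` in the memory section.

```python
def build_rewrite_prompt(question, history, max_turns=3):
    """Build a rewrite prompt that includes the most recent conversation turns."""
    recent = history[-max_turns:] if max_turns > 0 else []
    history_text = "\n".join(
        f"User: {turn['user']}\nAssistant: {turn['assistant']}" for turn in recent
    )
    return (
        "Rewrite the following question into ONE clear standalone search query.\n"
        "Resolve pronouns such as 'it' using the previous conversation.\n"
        "Return ONLY the rewritten query.\n\n"
        f"Previous conversation:\n{history_text or '(none)'}\n\n"
        f"Question: {question}"
    )


# Example history in the same shape as `chat_history` (content made up)
history = [{"user": "What is Transformer?", "assistant": "A self-attention model."}]
print(build_rewrite_prompt("How is it different from RNN?", history))
```

The returned string can be passed to `llm.invoke(...)` in place of `rewrite_prompt`. Whether the model then resolves the pronoun correctly still needs to be checked on real questions.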

## Context Builder

In [43]:
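def build_context_with_sources(docs):
    """Join chunks into one context string and collect the pages they came from."""
    context = ""
    sources = []

    for doc in docs:
        # PyPDFLoader stores a 0-based page index in the metadata
        page = doc.metadata.get("page", "unknown")
        context += f"[Page {page}]\n{doc.page_content}\n\n"
        sources.append(page)

    # De-duplicate while keeping the order in which pages were retrieved
    return context, list(dict.fromkeys(sources))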


def ask_question(question):
    # Note: format_chat_history() and update_memory() are defined in the
    # memory section below; run that cell before calling this function.

    # 1. Retrieve relevant docs
    docs = retriever.invoke(question)

    context = "\n\n".join([doc.page_content for doc in docs])

    # 2. Get formatted memory
    memory_text = format_chat_history()

    # 3. Build prompt with memory
    prompt = f"""
    You are an AI assistant answering questions from a document.

    Previous Conversation:
    {memory_text}

    Use ONLY the following context to answer.
    If answer is not found, say 'Not found in document.'

    Context:
    {context}

    Question:
    {question}
    """

    # 4. Generate response
    response = llm.invoke(prompt)
    answer = response.content

    # 5. Save memory
    update_memory(question, answer)

    return answer

## Add Memory for RAG

In [44]:
import langchain
print(langchain.__version__)

1.2.10


In [45]:
# Simple memory storage
chat_history = []

def update_memory(user_input, assistant_output):
    chat_history.append({
        "user": user_input,
        "assistant": assistant_output
    })

    # Keep only the last 5 turns so the prompt stays short
    if len(chat_history) > 5:
        chat_history.pop(0)


def format_chat_history():
    formatted = ""
    for turn in chat_history:
        formatted += f"User: {turn['user']}\n"
        formatted += f"Assistant: {turn['assistant']}\n\n"
    return formatted
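The keep-the-last-5 behavior above can also be expressed with the standard library's `collections.deque`, which drops the oldest item automatically once `maxlen` is reached. This is a minimal alternative sketch; the names `history` and `remember` are hypothetical and are not used elsewhere in this notebook.

```python
from collections import deque

MAX_TURNS = 5  # same window size as the cell above
history = deque(maxlen=MAX_TURNS)  # oldest turn is dropped automatically when full


def remember(user_input, assistant_output):
    """Append one question/answer turn to the bounded history."""
    history.append({"user": user_input, "assistant": assistant_output})


# Add 7 made-up turns; only the last 5 should remain
for i in range(7):
    remember(f"question {i}", f"answer {i}")

print(len(history))        # 5
print(history[0]["user"])  # question 2
```

A `deque` iterates in insertion order, so a formatter like `format_chat_history` would work on it unchanged.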

## Convert to Agent

In [46]:
def document_search(query):
    docs = retriever.invoke(query)
    return docs

In [47]:
def agentic_rag(question):

    print(" Step 1: Rewrite Query")
    rewritten_query = rewrite_query(question)
    print("Rewritten:", rewritten_query)

    print(" Step 2: Retrieve Documents")
    docs = retriever.invoke(rewritten_query)

    print(" Step 3: Re-rank Documents")
    docs = rerank_docs(question, docs)

    print(" Step 4: Build Context")
    context, sources = build_context_with_sources(docs)

    memory_text = format_chat_history()

    prompt = f"""
    You are an AI assistant answering questions from a research paper.

    Previous Conversation:
    {memory_text}

    Use ONLY the context below to answer.
    Cite pages if relevant.
    If answer not found, say "Not found in document."

    Context:
    {context}

    Question:
    {question}

    Answer clearly in 2-4 sentences.
    """

    print(" Step 5: Generate Answer")
    response = llm.invoke(prompt)
    answer = response.content

    update_memory(question, answer)

    return answer, sources

## Evaluation

In [48]:
test_questions = [
    "What is Transformer?",
    "How is it different from RNN?",
    "What is the main contribution?"
]

for q in test_questions:
    print("="*80)
    print("Question:", q)
    answer, sources = agentic_rag(q)
    print("Answer:", answer)
    print("Sources:", sources)

Question: What is Transformer?
 Step 1: Rewrite Query
Rewritten: What is Transformer AI
 Step 2: Retrieve Documents
 Step 3: Re-rank Documents
 Step 4: Build Context
 Step 5: Generate Answer
Answer: The Transformer is a neural sequence transduction model that follows an encoder-decoder architecture. It utilizes stacked self-attention and point-wise, fully connected layers for both its encoder and decoder components. Notably, it is the first transduction model to rely entirely on self-attention to compute representations of its input and output, without employing sequence-aligned RNNs or convolution. (Page 1)
Sources: [1, 2]
Question: How is it different from RNN?
 Step 1: Rewrite Query
Rewritten: LSTM vs RNN differences
 Step 2: Retrieve Documents
 Step 3: Re-rank Documents
 Step 4: Build Context
 Step 5: Generate Answer
Answer: The Transformer relies entirely on self-attention to compute representations of its input and output, without employing sequence-aligned RNNs or convolution (P

ChatGoogleGenerativeAIError: Error calling model 'gemini-2.5-flash' (RESOURCE_EXHAUSTED): 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 20, model: gemini-2.5-flash\nPlease retry in 31.896004035s.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Learn more about Gemini API quotas', 'url': 'https://ai.google.dev/gemini-api/docs/rate-limits'}]}, {'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_requests', 'quotaId': 'GenerateRequestsPerDayPerProjectPerModel-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-2.5-flash'}, 'quotaValue': '20'}]}, {'@type': 'type.googleapis.com/google.rpc.RetryInfo', 'retryDelay': '31s'}]}}
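The 429 error reports a free-tier limit of 20 requests for `gemini-2.5-flash`. Counting the LLM calls in `agentic_rag` as written gives one call for the rewrite, one per retrieved chunk for re-ranking (`k=5`), and one for the answer, so 7 calls per question. The arithmetic below is a rough budget check; the helper `calls_per_question` is hypothetical, and the `batched_rerank` option assumes re-ranking were changed to score all chunks in a single call.

```python
def calls_per_question(k, batched_rerank=False):
    """LLM calls made by one agentic_rag run: rewrite + re-rank + answer."""
    rerank_calls = 1 if batched_rerank else k
    return 1 + rerank_calls + 1


QUOTA = 20  # limit reported in the error message above
K = 5       # search_kwargs={"k": 5} from the retriever

per_question = calls_per_question(K)
print(per_question)           # 7 calls per question
print(3 * per_question)       # 21 calls for the three test questions
print(QUOTA // per_question)  # 2 whole questions fit within the quota
print(QUOTA // calls_per_question(K, batched_rerank=True))  # 6 with one re-rank call
```

Under these assumptions, the three test questions need 21 calls, one more than the quota allows even before counting any earlier calls made that day. Lowering `k` or reducing the number of re-ranking calls are the most direct ways to stay under the limit.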