# **Week 3 Assignment: Building an Advanced RAG System**
---

### **Objective**

The goal of this assignment is to build, evaluate, and iteratively improve a Retrieval-Augmented Generation (RAG) system using a state-of-the-art Large Language Model from Google's Gemini family. You will move beyond a basic pipeline to implement advanced techniques like reranking, with the final application answering complex questions from a real-world financial document.

### **Problem Statement**

You are an AI Engineer at a top financial services firm. Your team has been tasked with creating a tool to help financial analysts quickly extract key information from lengthy, complex annual reports (10-K filings). Manually searching these 100+ page documents for specific figures or risk assessments is slow and error-prone.

Your task is to build a RAG-based Q&A system that allows an analyst to ask natural language questions about a company's 10-K report and receive accurate, grounded answers powered by Gemini.

### **Dataset**

You will be using the official 2022 10-K annual report for **Microsoft**. A 10-K report is a comprehensive summary of a company's financial performance.
*   **Download Link:** [Microsoft Corp. 2022 10-K Report (HTML)](https://www.sec.gov/Archives/edgar/data/789019/000156459022026876/msft-10k_20220630.htm)
    *   *Instructions: Go to the link, and save the webpage as a `.txt` file or copy-paste the relevant sections into a text file for easier processing.*

---

### **Tasks & Instructions**

Structure your work in a Jupyter Notebook (`.ipynb`) or Python files. Use markdown cells or comments (in case of Python file-based submissions) to explain your methodology, justify your choices, and present your findings at each stage.

**Part 1: Setup and API Configuration**
*   **Objective:** To configure your environment to use the Google Gemini API (or an equivalent model).
*   **Tasks:**
    1.  **Get Your API Key:**
        *   Go to [Google AI Studio](https://aistudio.google.com/).
        *   Sign in with your Google account.
        *   Click on **"Get API key"** and create a new API key. **Treat this key like a password and do not share it publicly.**
    2.  **Environment Setup:**
        *   In your development environment (for example, Google Colab notebook or VSCode on your local machine), install the necessary libraries: `pip install -q -U google-generativeai langchain-google-genai langchain chromadb sentence-transformers`.
        *   If you're using Colab, use the "Secrets" feature (look for the key icon 🔑 on the left sidebar) to securely store your API key. Create a new secret named `GEMINI_API_KEY` and paste your key there.
    3.  **Configure the LLM:** In your code, import the necessary libraries and configure your LLM. For example, if you're using Colab:
        ```python
        import google.generativeai as genai
        from langchain_google_genai import ChatGoogleGenerativeAI
        from google.colab import userdata

        # Configure the API key
        api_key = userdata.get('GEMINI_API_KEY')
        genai.configure(api_key=api_key)

        # Instantiate the Gemini model
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
        ```
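
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch for a local environment, assuming the key has been exported as an environment variable named `GEMINI_API_KEY`; the helper name `load_api_key` is hypothetical and not part of any library.

```python
import os


def load_api_key(name="GEMINI_API_KEY", env=None):
    """Return the API key stored under `name`, or raise a clear error.

    `env` defaults to os.environ; it is a parameter only so the helper
    can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

The returned value can then be passed to `genai.configure(api_key=...)` in place of the `userdata.get(...)` call.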

**Part 2: Building the Baseline RAG System**
*   **Objective:** To construct a standard, vector-search-only RAG pipeline using Gemini (or an equivalent model) as the generator.
*   **Tasks:**
    1.  **Document Loading:** Load the Microsoft 10-K report into your application.
    2.  **Chunking:** Split the document into chunks. **In a markdown cell (or in a comment, if using Python instead of Jupyter), explicitly state your chosen `chunk_size` and `chunk_overlap` and briefly explain why you chose those values.**
    3.  **Vector Store:** Create embeddings for your chunks using an open-source model (e.g., `sentence-transformers/all-MiniLM-L6-v2`) and store them in a vector database (e.g., ChromaDB).
    4.  **QA Chain:** Create a standard `RetrievalQA` chain using the `llm` object (Gemini 2.5 Flash or equivalent) you configured in Part 1.
    5.  **Initial Test:** Test your baseline system with the following question: `"What were the company's total revenues for the fiscal year that ended on June 30, 2022?"`. Display the answer.

**Part 3: Evaluating the Baseline**
*   **Objective:** To quantitatively and qualitatively assess the performance of your LLM-powered system.
*   **Tasks:**
    1.  **Create a Test Set:** Create a small evaluation set of at least **five** questions. These questions should be a mix of:
        *   **Specific Fact Retrieval:** (e.g., "What is the name of the company's independent registered public accounting firm?")
        *   **Summarization:** (e.g., "Summarize the key risks related to competition.")
        *   **Keyword-Dependent:** (e.g., "What does the report say about 'Azure'?")
    2.  **Qualitative Evaluation:** Run your five questions through the baseline RAG system. For each question, display the generated answer and the source chunks that were retrieved.
    3.  **Analysis:** In a markdown cell (or in a comment, if using Python instead of Jupyter), write a brief analysis. Did the system answer correctly? Were the retrieved chunks relevant? Did you notice any failures?

**Part 4: Implementing an Advanced RAG Technique**
*   **Objective:** To improve upon the baseline by implementing a reranker.
*   **Tasks:**
    1.  **Implement a Reranker:** Add a reranker (e.g., using `CohereRerank` or a Hugging Face cross-encoder model) into your pipeline. The flow should be: Retrieve top 10 docs -> Rerank to get the best 3 -> Pass only these 3 to LLM for the final answer.
    2.  **Re-Evaluation:** Run your same five evaluation questions through your new, advanced RAG pipeline. Display the generated answer and the final source chunks for each.
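
The retrieve-then-rerank flow described above can be sketched independently of any particular reranker. In this illustration `retrieve_then_rerank` and `word_overlap_score` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a real cross-encoder or hosted reranking model.

```python
def word_overlap_score(query, doc):
    """Toy relevance score: fraction of query words that appear in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def retrieve_then_rerank(query, candidates, score_fn, fetch_k=10, top_n=3):
    """Take the first `fetch_k` candidates (assumed to be ordered by vector
    similarity), rescore each against the query, and keep the best `top_n`."""
    shortlist = candidates[:fetch_k]
    ranked = sorted(shortlist, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]


docs = [
    "azure revenue grew",
    "weather report",
    "total revenue for fiscal year 2022",
    "revenue",
]
best = retrieve_then_rerank("total revenue fiscal year 2022", docs, word_overlap_score)
```

Only the `top_n` documents in `best` would be passed to the LLM; swapping `score_fn` for a stronger model changes the ranking quality without changing the flow.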

**Part 5: Final Analysis and Conclusion**
*   **Objective:** To compare the baseline and advanced systems and articulate the value of the advanced technique.
*   **Tasks:**
    1.  **Comparison:** In a markdown cell (or in a comment, if using Python instead of Jupyter), create a simple table or a structured list comparing the answers from the **Baseline RAG** vs. the **Advanced RAG** for your five evaluation questions.
    2.  **Conclusion:** Write a concluding paragraph answering the following:
        *   Did adding the reranker improve the results? How?
        *   Based on your experience, what is the biggest challenge in building a reliable RAG system for dense documents?

**Bonus Section (Optional)**
*   **Objective:** To demonstrate a deeper understanding by implementing more complex features.
*   **Choose any of the following to implement:**
    *   **Implement Query Rewriting:** Before the retrieval step, use Gemini itself to rewrite the user's query to be more effective for a financial document.
    *   **Automated Evaluation with RAGAS:** Use the `ragas` library to automatically score the faithfulness and relevance of your baseline vs. your advanced system.
    *   **Source Citing:** Modify your pipeline to not only return the answer but also explicitly cite the source chunk(s) it used.
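
For the source-citing option, one possible shape is a small formatter that appends numbered references to the answer. This is a sketch only: `format_with_citations` is a hypothetical helper, and it assumes each source is available as a `(chunk_id, text)` pair.

```python
def format_with_citations(answer, sources, preview_chars=80):
    """Append a numbered source list to the answer.

    `sources` is an iterable of (chunk_id, text) pairs. Whitespace in each
    text is collapsed so the preview stays on one line.
    """
    lines = [answer.strip(), "", "Sources:"]
    for number, (chunk_id, text) in enumerate(sources, start=1):
        preview = " ".join(text.split())[:preview_chars]
        lines.append(f"[{number}] chunk {chunk_id}: {preview}")
    return "\n".join(lines)
```

For example, `format_with_citations(result["result"], [(d.metadata["chunk_id"], d.page_content) for d in result["source_documents"]])` would work with a chain that returns source documents carrying a `chunk_id` in their metadata.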

---

### **Submission Instructions**

1.  **Deadline:** You have **two weeks** from the assignment release date to submit your work.
2.  **Platform:** All submissions must be made to your allocated private GitLab repository. You **must** submit your work in a branch named `week_3`.
3.  **Format:** You can submit your work as either a Jupyter Notebook (`.ipynb`) or a collection of Python scripts (`.py`).
4.  After pushing, verify that your branch and files are visible on the GitLab web interface. No further action is needed. The trainers will review all submissions on the `week_3` branch after the deadline. Assignments submitted after the deadline will not be reviewed, and this will be reflected in your course score.
5.  The use of LLMs is encouraged, but ensure that you’re not copying solutions blindly. Always review, test, and understand any code generated, adapting it to the specific requirements of your assignment. Your submission should demonstrate your own comprehension, problem-solving process, and coding style, not just an unedited output from an AI tool.

## Part 1: Setup and API Configuration


In [3]:
!pip install -U \
  langchain==0.3.27 \
  langchain-core==0.3.72 \
  langchain-text-splitters==0.3.9 \
  langchain-google-genai==2.0.10 \
  langchain-community \
  chromadb \
  sentence-transformers \
  opentelemetry-api==1.37.0 \
  opentelemetry-sdk==1.37.0 \
  opentelemetry-proto==1.37.0 \
  opentelemetry-exporter-otlp-proto-common==1.37.0 \
  opentelemetry-exporter-otlp-proto-http==1.37.0

Collecting langchain-core==0.3.72
  Downloading langchain_core-0.3.72-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain-text-splitters==0.3.9
  Downloading langchain_text_splitters-0.3.9-py3-none-any.whl.metadata (1.9 kB)
Collecting langchain-google-genai==2.0.10
  Downloading langchain_google_genai-2.0.10-py3-none-any.whl.metadata (3.6 kB)
Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting chromadb
  Downloading chromadb-1.2.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain-google-genai==2.0.10)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
INFO: pip is looking at multiple versions of langchain-community to determine which version is compatible with other requirements. This could take a while.
Collecting langc

In [32]:
# Import necessary libraries
import google.generativeai as genai
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata
import requests
from bs4 import BeautifulSoup
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.schema import Document
import os
import tempfile

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''
os.environ['NO_GCE_CHECK'] = 'True'

os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_retries=2
)

print("Successfully imported and configured model")

Successfully imported and configured model


## Part 2: Building the Baseline RAG System

### Step 1: Extract Microsoft 10-K Report


In [34]:
with open('data.txt', 'r', encoding='utf-8') as file:
    document_text = file.read()
    print(f"Loaded {len(document_text):,} characters")

Loaded 392,586 characters


### Step 2: Chunking
chunk_size = 1000 (characters): Large enough to capture a complete passage of financial context without mixing in unrelated sections.

chunk_overlap = 200 (characters): Preserves continuity across chunks so that context at a chunk boundary is not lost.
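
To make the effect of these two values concrete, here is a simplified fixed-window illustration. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter breaks on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper used only to show how size and overlap interact.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample)
```

With a step of 800 characters, a fixed window over a document of this size would give roughly 490 chunks; a separator-aware splitter can produce more because it often ends chunks early at natural boundaries.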

In [35]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
chunks = text_splitter.split_text(document_text)
documents = [Document(page_content=chunk, metadata={"chunk_id": i}) for i, chunk in enumerate(chunks)]

print(f"Created {len(documents)} chunks")

Created 542 chunks


### Step 3: Create Vector Store with Embeddings


In [36]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents=documents,embedding=embeddings)

print(f"Vector store created with {len(documents)} chunks")

Vector store created with 542 chunks


### Step 4: Create Baseline QA Chain

In [37]:
retriever = vectorstore.as_retriever(search_type="similarity",search_kwargs={"k": 4})

baseline_qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

print("Baseline QA chain created!")

Baseline QA chain created!


### Step 5: Initial Test - Baseline System

In [60]:
test_question = "What were the company's total revenues for the fiscal year that ended on June 30, 2022?"
print(f"Question: {test_question}")
result = baseline_qa_chain.invoke({"query": test_question})
print(f"Answer: {result['result']}")

print(f"\nBaseline Sources ({len(result['source_documents'])}):")
for j, doc in enumerate(result["source_documents"]):
    print(f"\nSource {j+1}:")
    print(doc.page_content[:200] + "...")


Question: What were the company's total revenues for the fiscal year that ended on June 30, 2022?
Answer: The company's total revenues for the fiscal year that ended on June 30, 2022, were $28,033.

Baseline Sources (4):

Source 1:
Total $ 2.48 $ 18,556 Fiscal Year 2021 ...

Source 2:
Total $ 2.48 $ 18,556 Fiscal Year 2021 ...

Source 3:
Total $ 2.48 $ 18,556 Fiscal Year 2021 ...

Source 4:
Shares Amount Year Ended June 30, 2022 2021 2020

## Part 3: Evaluating the Baseline System


In [59]:
evaluation_questions = [
    {
        "question": "What is the name of the company's independent registered public accounting firm?",
        "type": "Specific Fact Retrieval",
        "expected_info": "Should find auditor name"
    },
    {
        "question": "What were Microsoft's total revenues for fiscal year 2022?",
        "type": "Specific Fact Retrieval",
        "expected_info": "Should find exact revenue figure"
    },
    {
        "question": "Summarize the key risks related to competition mentioned in the report.",
        "type": "Summarization",
        "expected_info": "Should provide overview of competitive risks"
    },
    {
        "question": "What does the report say about Azure's performance and growth?",
        "type": "Keyword-Dependent",
        "expected_info": "Should find Azure-related information"
    },
    {
        "question": "What are the main segments of Microsoft's business according to the 10-K?",
        "type": "Summarization",
        "expected_info": "Should identify business segments"
    }
]

baseline_results = {}

for i, eval_item in enumerate(evaluation_questions, 1):
    print(f"\nTest {i}/5 - {eval_item['type']}")

    question = eval_item['question']
    print(f"Question: {question}")

    result = baseline_qa_chain.invoke({"query": question})

    print(f"Answer:")
    print(result["result"])

    print(f"\nSources ({len(result['source_documents'])}):")
    for j, doc in enumerate(result["source_documents"]):
        print(f"\nSource {j+1}:")
        print(doc.page_content[:200] + "...")

    baseline_results[f"Q{i}"] = {
        "question": question,
        "type": eval_item['type'],
        "answer": result["result"]
    }



Test 1/5 - Specific Fact Retrieval
Question: What is the name of the company's independent registered public accounting firm?
Answer:
I don't know the answer, as the provided text does not state the name of the independent registered public accounting firm. It only states that "We have audited..." but does not identify "We."

Sources (4):

Source 1:
Item 9A

 

 

REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM

To the Stockholders and the Board of Directors of Microsoft Corporation

Opinion on Internal Control over Financial Reporting

W...

Source 2:
Item 9A

 

 

REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM

To the Stockholders and the Board of Directors of Microsoft Corporation

Opinion on Internal Control over Financial Reporting

W...

Source 3:
Item 9A

 

 

REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM

To the Stockholders and the Board of Directors of Microsoft Corporation

Opinion on Internal Control over Financial Reporting

W...

Source 4:
12

### Analysis

**Score: 3/5**

#### What Worked
- Identified the business segments correctly
- Produced good summaries of risks and competition
- Retrieved relevant sections for those questions

#### What Failed
- Missed specific facts (auditor name, revenue figures)
- Could not extract exact details when they were absent from the retrieved text

#### Summary
The system shows promise for document summarization but needs improvement for precise fact extraction.
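
In the Test 1 output above, Sources 1 to 3 begin with the same text, which suggests the retriever may be returning repeated copies of one chunk. One hedged way to avoid wasting context slots on duplicates is to filter retrieved documents by their `chunk_id` before they reach the LLM. The `Doc` class below is a stand-in for LangChain's `Document`, and `dedupe_by_chunk_id` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_chunk_id(docs):
    """Keep the first occurrence of each chunk, preserving retrieval order.

    Falls back to the text itself when a document has no chunk_id.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = doc.metadata.get("chunk_id", doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


retrieved = [Doc("a", {"chunk_id": 467})] * 3 + [Doc("b", {"chunk_id": 486})]
unique_docs = dedupe_by_chunk_id(retrieved)
```

This treats the symptom at query time. If the duplicates come from the same chunks being indexed more than once, rebuilding the vector store from scratch would address the cause.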

## Part 4: Implementing Advanced RAG with Reranking


In [58]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

cohere_api_key = userdata.get('COHERE_API_KEY')

base_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Rerank the 10 retrieved chunks and keep only the best 3
compressor = CohereRerank(cohere_api_key=cohere_api_key, model="rerank-english-v3.0", top_n=3)

reranked_retriever = ContextualCompressionRetriever(base_compressor=compressor,base_retriever=base_retriever)

advanced_qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=reranked_retriever,
    return_source_documents=True
)

for i, eval_item in enumerate(evaluation_questions, 1):
    print(f"\nTest {i}/5 - {eval_item['type']}")

    question = eval_item['question']
    print(f"Question: {question}")

    result = advanced_qa_chain.invoke({"query": question})

    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Answer:{result['result']}")

    print(f"\nCohere Reranked Sources ({len(result['source_documents'])}):")
    for j, doc in enumerate(result["source_documents"]):
        print(f"\nSource {j+1}:")
        print(doc.page_content[:200] + "...")


Test 1/5 - Specific Fact Retrieval
Question: What is the name of the company's independent registered public accounting firm?
Answer:I don't know the answer. The provided text states that an independent registered public accounting firm audited Microsoft Corporation, but it does not name the firm.

Cohere Reranked Sources (3):

Source 1:
Item 9A

 

 

REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM

To the Stockholders and the Board of Directors of Microsoft Corporation

Opinion on Internal Control over Financial Reporting

W...

Source 2:
Item 9A

 

 

REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM

To the Stockholders and the Board of Directors of Microsoft Corporation

Opinion on Internal Control over Financial Reporting

W...

Source 3:
Item 9A

 

 

REPORT OF INDEPENDENT REGISTERED PUBLIC ACCOUNTING FIRM

To the Stockholders and the Board of Directors of Microsoft Corporation

Opinion on Internal Control over Financial Reporting

W...

Test 2/5 - Specific Fact

## Part 5: Final Analysis and Conclusion

### **Analysis**

| Test | Type                              | Baseline RAG                | Advanced RAG (with Reranker)          | Result             |
| ---- | --------------------------------- | --------------------------- | ------------------------------------- | ------------------ |
| 1    | Fact – Accounting firm            | Couldn’t find firm name     | Same, but retrieved more focused text | Slight improvement |
| 2    | Fact – Revenue 2022               | Found % growth only         | Same result                           | No major change    |
| 3    | Summarization – Competition risks | Correct but brief           | More complete and clear               | Better summary     |
| 4    | Keyword – Azure growth            | General info                | More focused and detailed             | Improved accuracy  |
| 5    | Summarization – Business segments | Correct but with extra info | Cleaner and precise                   | Improved relevance |

### **Conclusion**
The advanced RAG system produced more detailed and focused answers by retrieving more relevant context compared to the baseline. It showed clear improvement in summarization and context-heavy questions, offering better organization and precision. However, it still struggled with specific factual details like company names or exact figures when those facts weren’t present in the retrieved text. Overall, adding the reranker improved relevance and answer quality but didn’t fully solve the challenge of missing exact facts in dense documents.

### **Biggest Challenge**
Getting specific facts from long, complex documents. While the system can find the right sections, it often struggles to extract exact details—like company names or revenue numbers—because key information may be split across chunks or partially missed during retrieval.
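
The retrieved table chunks shown in this notebook are dominated by blank-looking lines, so only a few figures fit into each 1000-character chunk. A possible mitigation is to normalize whitespace before chunking. The sketch below assumes those lines contain only spaces or non-breaking spaces; `normalize_whitespace` is a hypothetical helper, and it does not reconstruct table structure.

```python
import re


def normalize_whitespace(text):
    """Collapse whitespace left over from HTML tables before chunking."""
    text = text.replace("\u00a0", " ")       # non-breaking spaces -> plain spaces
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line
    return text.strip()
```

Applying this to `document_text` before calling the splitter would change the chunk count and the retrieved passages, so the evaluation questions would need to be re-run to see whether fact retrieval actually improves.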

## Bonus Section: Advanced Features (Optional)

### Query Rewriting with Gemini

This bonus feature uses Gemini itself to rewrite user queries to be more effective for financial document search.

In [62]:

original_question = "How much money did the company make last year?"
print(f"Original Question: {original_question}")

rewrite_prompt = f"""
You are an expert at searching financial documents like 10-K reports.
Rewrite the following user query to be more effective for document retrieval.

Guidelines:
- Add relevant financial terminology
- Include context about annual reports/10-K filings
- Make the query more specific and searchable
- Keep the original intent

Original query: "{original_question}"

Rewritten query:
"""

response = llm.invoke(rewrite_prompt)
rewritten_query = response.content.strip()
print(f"Rewritten Query: {rewritten_query}")
search_query = rewritten_query

result = advanced_qa_chain.invoke({"query": search_query})

print(f"Answer:")
print(f"Answer:{result['result']}")

print(f"\nSource Citations ({len(result['source_documents'])}):")

for i, doc in enumerate(result["source_documents"]):
    print(f"\n[Citation {i+1}]:")
    print(f"Content: {doc.page_content[:200]}...")

Original Question: How much money did the company make last year?
Rewritten Query: Retrieve the company's **net income** (or **net earnings** / **profit**) for the **most recent fiscal year** as reported in the **Consolidated Statements of Operations** (or **Income Statement**) within its **annual 10-K filing**. Additionally, identify the **total revenue** for the same period.
Answer:
Answer:I'm sorry, but the provided text does not contain the company's net income (or net earnings/profit) or total revenue for the most recent fiscal year. The text discusses critical accounting estimates and recent accounting guidance, but it does not include the actual financial figures from the Consolidated Statements of Operations.

Source Citations (3):

[Citation 1]:
Content: RECENT ACCOUNTING GUIDANCE

Refer to Note 1 – Accounting Policies of the Notes to Financial Statements (Part II, Item 8 of this Form 10-K) for further discussion.

CRITICAL ACCOUNTING ESTIMATES

Our c...

[Citation 2]:
Content

In [63]:
# Check if vectorstore has diverse content
print("🔍 Checking document diversity:")
test_search = vectorstore.similarity_search("total revenue fiscal year 2022", k=10)
for i, doc in enumerate(test_search):
    print(f"Doc {i+1}: {doc.page_content[:100]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 40)

🔍 Checking document diversity:
Doc 1: Total $ 2.48 $ 18,556 ...
Metadata: {'chunk_id': 467}
----------------------------------------
Doc 2: Total $ 2.48 $ 18,556 ...
Metadata: {'chunk_id': 467}
----------------------------------------
Doc 3: Total $ 2.48 $ 18,556 ...
Metadata: {'chunk_id': 467}
----------------------------------------
Doc 4: 19,439 15,911 Tot...
Metadata: {'chunk_id': 486}
----------------------------------------
Doc 5: 19,439 15,911 Tot...
Metadata: {'chunk_id': 486}
----------------------------------------
Doc 6: 19,439 15,911