<!-- # CS224V Fall 2025 HW1: Autonomous Research Agents

## From Literature Search to Investigative Analysis

> **Reference:** This notebook implements the coding exercises described in the [**CS224V HW1 Handout**](handout/cs224v_hw1_handout.tex). Please refer to the handout for detailed background, theory, and implementation context.

### Assignment Overview
This assignment involves building a **Deep Research Lite (DRLite)** system that progresses through two main phases:
- **Foundation Building Blocks:** RAG Systems and Autonomous Literature Search
- **Investigative Research:** Database Exploration and Automated Report Synthesis

--- -->

# CS224V Fall 2025 Homework 1 Code

### Environment configuration

In [4]:
# run this cell if you haven't installed the requirements
! pip install -r requirements.txt
! playwright install



In [5]:
TOPIC = "Evolving Military Strategies in the Russia-Ukraine War and Future Implications"

ACLED_DB_DESCRIPTION = """ACLED is a global conflict and event data repository. It captures and analyzes conflicts worldwide, from local conflicts in Africa to international armed conflicts. It includes events like civil wars, military operations, and terrorist attacks. Acled's data helps monitor conflicts, understand conflict dynamics, and inform policy decisions."""

In [7]:
import json
import os
from typing import List, Tuple

import dspy
import httpx
from dotenv import load_dotenv
from tqdm import tqdm

from src.dataclass import RetrievedDocument, LiteratureSearchAgentResponse, LiteratureSearchAgentRequest
from src.encoder import Encoder
from src.literature_search import LiteratureSearchAgent
from src.lm import init_lm, LanguageModelProviderConfig, LanguageModelProvider, LiteLLMServerConfig
from src.retriever_agent.serper_rm import SerperRM
from src.rag import RagAgent
from src.dataclass import RagResponse, RagRequest

load_dotenv()

True

## Action Item 1: LLM API Configuration

> **Reference:** See **Section 2 (Background) → Action Item 1** in the handout for complete setup instructions and theoretical background.

### Setup Requirements

| Step | Task | Implementation Details |
|------|------|------------------------|
| 1 | **Obtain API Key** | Access the provided portal at [cs224v-litellm-portal.genie.stanford.edu](http://cs224v-litellm-portal.genie.stanford.edu) |
| 2 | **Configure Secrets** | Add `LITELLM_API_KEY` and `LITELLM_API_BASE` to .env |
| 3 | **Verify Configuration** | Execute the validation code block below |
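
Before running the validation cell, it can help to confirm that the secrets from step 2 are actually visible to the process. The helper below is a minimal sketch, not part of the course code: `missing_env_vars` is a hypothetical name, and the example passes an explicit dictionary so it does not depend on your real environment.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(names: Iterable[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the variable names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Illustrative environment: the key is set, but the base URL was left empty.
example_env = {"LITELLM_API_KEY": "sk-example", "LITELLM_API_BASE": ""}
print(missing_env_vars(["LITELLM_API_KEY", "LITELLM_API_BASE"], example_env))
```

Calling `missing_env_vars([...])` without the second argument, after `load_dotenv()`, checks the real environment instead.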

---

In [8]:
## Action item 1. 
load_dotenv()

test_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini",
    temperature=0.0,
    max_tokens=10,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
test_lm = init_lm(test_lm_config)
print(test_lm("say 'Hello!' as is")[0]) # Expect to see "Hello!" or something similar

Hello!


In [29]:
import requests
import json

# Read the API key from the environment (.env) instead of hard-coding a secret in the notebook
API_KEY = os.getenv("LITELLM_API_KEY")
BASE_URL = "https://cs224v-litellm.genie.stanford.edu"

# Make the request
response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "gpt-4.1",
        "messages": [
            {"role": "user", "content": "Say hi in one word"}
        ],
        "max_tokens": 5
    }
)

# Check if request was successful
if response.status_code == 200:
    result = response.json()
    print("Success!")
    print(f"Response: {result['choices'][0]['message']['content']}")
    print(f"Tokens used: {result.get('usage', {}).get('total_tokens', 'N/A')}")
else:
    print(f"Error {response.status_code}: {response.text}")

Success!
Response: Hello!
Tokens used: 15


In [40]:
encoder = Encoder(model_name="text-embedding-3-small", **{"api_key": os.getenv("LITELLM_API_KEY"), "api_base": os.getenv("LITELLM_API_BASE")})
embedding = await encoder.aencode("hello")
assert len(embedding[0]) == 1536
print(f"✅ encoder is working")

✅ encoder is working


## Action Item 2: Web Retrieval Service Configuration

> **Reference:** See **Section 3.1 (RAG Systems) → Action Item 2** in the handout for detailed background on web retrieval architectures and alternative search providers.

### Serper.dev Setup Process

The Retrieval-Augmented Generation pipeline requires a web search backend for information retrieval. This implementation uses Serper.dev as the search provider.

**Configuration Steps:**
1. **API Key Acquisition:** Register at [serper.dev](https://serper.dev) to obtain 2,500 free search credits
2. **Secret Configuration:** Add `SERPER_API_KEY` to .env file under repo root
3. **System Validation:** Execute the test code block below

### Expected Output
Successful configuration should produce:
- Confirmation message: "✅ retriever is working"
- Example document structure demonstrating the retrieved data format

### Technical Context
This setup establishes the foundation for the RAG pipeline's retrieval component. The retrieved document structure includes URL sources, content snippets, and metadata that will be processed through subsequent pipeline stages including content extraction, chunking, and semantic reranking.
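
The chunking and reranking stages mentioned above can be illustrated with a toy sketch. This is not the course implementation: the real pipeline reranks with embeddings from `Encoder`, whereas the code below substitutes a simple bag-of-words cosine similarity so it runs without any API access. The function names (`chunk_words`, `rerank`) and the default chunk sizes are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_words(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping word windows."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    # Stop early enough that the last window is not just a repeat of the previous tail.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query: str, chunks: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Score each chunk against the query and return the top_k (score, chunk) pairs."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(c.lower().split())), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


chunks = ["drone strikes in ukraine", "apple pie recipe"]
print(rerank("ukraine drone", chunks, top_k=1))
```

Swapping the bag-of-words vectors for dense embeddings changes only how the vectors are produced; the score-then-sort step stays the same.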

---

In [None]:
load_dotenv()

serper_retriever = SerperRM(api_key=os.getenv("SERPER_API_KEY"), encoder=encoder)
retrieved_documents: List[RetrievedDocument] = await serper_retriever.aretrieve("stanford new AI research")
assert len(retrieved_documents) > 0
print("✅ retriever is working")
print("example output")
print(json.dumps(retrieved_documents[0].to_dict(), indent=2))

## Action Item 3: End-to-End RAG System Evaluation

> **Reference:** See **Section 3.1 (RAG Systems) → Action Item 3** in the handout for comprehensive RAG pipeline theory and component analysis.

### Objective
Evaluate the complete Retrieval-Augmented Generation pipeline to understand information flow through the following stages:

```
Query → Internet Search → Content Extraction → Document Chunking → Semantic Reranking → Answer Generation
```

### Step 1: Basic Pipeline Test
Execute the code block below to test the RAG system with a basic query. The system will:
- Process the input question through the complete pipeline
- Save results to `output/action_item_3_rag_response.json`
- Demonstrate the integration of retrieval and generation components

### Analysis Requirements
Review the output format to understand:
- Retrieved document structure and source attribution
- Semantic reranking results and relevance scoring
- Generated answer quality and citation methodology

---

In [None]:
# configure the language model for RAG agent
rag_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini",
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
rag_lm = init_lm(rag_lm_config)

# initialize the RAG agent
rag = RagAgent(retriever=serper_retriever, rag_lm=rag_lm)

# forward the request to the RAG agent
rag_response: RagResponse = await rag.aforward(RagRequest(question="Tell me about the latest advances in precision neuroscience", max_retriever_calls=1))

# make output directory if it doesn't exist
os.makedirs("output", exist_ok=True)

# save the response to a file
with open("output/action_item_3_rag_response.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response.json")


### Step 2: Advanced Query Testing

> **Reference:** Section 3.1 emphasizes the importance of evaluating RAG systems across diverse query types to assess robustness and performance characteristics.

#### Test Case A: Recency-Focused Query (`rag_question_recency`)
**Objective:** Evaluate the system's capability to retrieve and synthesize recent information.

**Requirements:**
- Formulate a factual question requiring current information for `rag_question_recency` (e.g., "What are the latest products from Apple in 2025?")
- **Note:** Do not use the provided example; create an original query
- Focus on testing temporal information retrieval accuracy

#### Evaluation Criteria

For this test question, document and analyze:
- **Quality of retrieved sources** (`rag_question_recency_quality_comment`, `rag_question_depth_quality_comment`): List every unique URL from the RAG results, one per line, followed by [yes] or [no, {reason}] to indicate whether the URL is high quality and should be included and cited in the final answer.
- **Relevance of retrieved information** (`rag_question_recency_relevance_comment`, `rag_question_depth_relevance_comment`): Do all retrieved documents contain information that directly answers the question? Look at the uncited documents and comment on whether they went uncited because they are irrelevant to the question.
- **Comprehensiveness of generated answer** (`rag_question_recency_comprehensiveness_comment`, `rag_question_depth_comprehensiveness_comment`): Does the final answer synthesize the information well and provide a complete response? Does the LLM make full use of the retrieved information? Which details are left out of the final answer?
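
To build the per-URL checklist for the quality comment, the saved JSON can be scanned for URLs. The sketch below makes one assumption about the output files: that source links are stored as string values under a `"url"` key somewhere in the structure. It walks the whole document recursively, so it does not depend on the exact nesting. `collect_urls` and `url_checklist` are hypothetical helper names.

```python
import json
from typing import Any, Dict, List


def collect_urls(node: Any, key: str = "url") -> List[str]:
    """Recursively collect unique string values stored under `key`, in first-seen order."""
    seen: Dict[str, None] = {}

    def walk(value: Any) -> None:
        if isinstance(value, dict):
            for k, v in value.items():
                if k == key and isinstance(v, str):
                    seen.setdefault(v, None)
                else:
                    walk(v)
        elif isinstance(value, list):
            for item in value:
                walk(item)

    walk(node)
    return list(seen)


def url_checklist(urls: List[str]) -> str:
    """Format one line per URL, ready to be annotated by hand."""
    return "\n".join(f"{url} [yes / no, reason]" for url in urls)


# Toy structure standing in for a loaded response file.
sample = json.loads('{"docs": [{"url": "https://a.example"}, {"url": "https://b.example", "meta": {"url": "https://a.example"}}]}')
print(url_checklist(collect_urls(sample)))
```

For a real run, replace `sample` with the result of `json.load` on one of the files under `output/`.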

---

In [None]:
rag_question_recency = ...

rag_response: RagResponse = await rag.aforward(RagRequest(question=rag_question_recency, max_retriever_calls=1))

# save the response to a file
with open("output/action_item_3_rag_response_recency.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response_recency.json")

In [None]:
rag_question_recency_quality_comment = """
 {YOUR COMMENT HERE}
"""
rag_question_recency_relevance_comment = """
 {YOUR COMMENT HERE}
"""
rag_question_recency_comprehensiveness_comment = """
 {YOUR COMMENT HERE}
"""


with open("output/action_item_3_rag_response_recency_comment.json", "w") as f:
    json.dump({
        "quality_comment": rag_question_recency_quality_comment,
        "relevance_comment": rag_question_recency_relevance_comment,
        "comprehensiveness_comment": rag_question_recency_comprehensiveness_comment
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_3_rag_response_recency_comment.json")

#### Test Case B: Technical Depth Query (`rag_question_depth`)
**Objective:** Assess the system's performance on specialized technical topics requiring domain expertise.

**Requirements:**
- Design a question addressing a niche technical domain for `rag_question_depth` (e.g., "How do transformer attention mechanisms handle positional encoding?")
- **Note:** Do not use the provided example; formulate an original technical query
- Evaluate the system's ability to retrieve and synthesize specialized knowledge

**Apply the same evaluation framework as detailed above for comprehensive analysis.**

---

In [None]:
rag_question_depth = "Which biotypes are involved in remission rates for major depressive disorder according to the latest research from Stanford?"

rag_response: RagResponse = await rag.aforward(RagRequest(question=rag_question_depth, max_retriever_calls=1))

# save the response to a file
with open("output/action_item_3_rag_response_depth.json", "w") as f:
    json.dump(rag_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_3_rag_response_depth.json")

In [11]:
rag_question_depth_quality_comment = """
 "url": "https://www.insideprecisionmedicine.com/topics/patient-care/stanford-researchers-identify-six-subtypes-of-depression-using-imaging-technology-and-machine-learning/"
 This is a low-quality blog source that summarizes the research of Dr. Williams, whose work primarily inspired the question. 
 "url": "https://www.techexplorist.com/new-category-depression-found-affects-about-quarter-patients/62996/"
 This is essentially a copy of the above source but by a different news outlet, and is similarly low-quality.
"url": "https://pmc.ncbi.nlm.nih.gov/articles/PMC10273022/"
This is the recent research which provided new insights into the biotypes involved in MDD by Dr. Williams, which is a high-quality primary source. 
"""
rag_question_depth_relevance_comment = """
Each document is certainly relevant, as it has content pertaining to Dr. Williams' recent research on the biotypes involved in MDD. However, the ranking is incorrect: the third document provides the most essential knowledge, and the other two do not necessarily provide anything new. Ideally, the results would also have included Dr. Williams' 2025 paper. 
"""
rag_question_depth_comprehensiveness_comment = """
 The response comprehensively summarizes Dr. Williams' 2023 papers regarding how the cognitive function 'biotype' impacts MDD, but excludes her more recent research on the more specific brain networks involved (specifically the default mode network). It also hallucinates with some statistics (e.g "remission was 38% in the biotype group and 50% in others...") Overall I would characterize the response as non-comprehensive. 
"""


with open("output/action_item_3_rag_response_depth_comment.json", "w") as f:
    json.dump({
        "quality_comment": rag_question_depth_quality_comment,
        "relevance_comment": rag_question_depth_relevance_comment,
        "comprehensiveness_comment": rag_question_depth_comprehensiveness_comment
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_3_rag_response_depth_comment.json")

✅ Result saved to output/action_item_3_rag_response_depth_comment.json


## Action Item 4: Autonomous Literature Search Evaluation

> **Reference:** See **Section 3.2 (Autonomous Literature Search)** in the handout for comprehensive coverage of deep research paradigms, trajectory tracking, and autonomous exploration strategies.

### Objective
Evaluate the autonomous literature search system using a complex, multi-dimensional research topic that requires systematic exploration and evidence synthesis.

**Research Topic:** *"Evolving Military Strategies in the Russia-Ukraine War and Future Implications"*

### Step 1: System Execution
Execute the literature search agent using the code block below. The system will:
- Autonomously decompose the research topic into focused sub-queries
- Maintain an exploration trajectory across multiple search iterations
- Synthesize findings into a comprehensive research summary

**Output Location:** `output/action_item_4_literature_search_response.json`
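
The general shape of such an agent loop (answer a question, record it in a trajectory, ask a planner for follow-ups, stop on a budget or a completion signal) can be sketched as follows. This is a toy illustration under stated assumptions, not the logic in `src/literature_search.py`: `explore`, `fake_answer`, and `fake_plan` are hypothetical names, and the stand-in functions replace the RAG agent and the planning LM.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Trajectory = List[Tuple[str, str]]  # (question, answer) pairs in the order explored


def explore(
    topic: str,
    answer_fn: Callable[[str], str],
    plan_fn: Callable[[str, Trajectory], Tuple[bool, List[str]]],
    max_calls: int = 5,
) -> Trajectory:
    """Breadth-first exploration that stops on a call budget or a planner 'complete' signal."""
    queue: Deque[str] = deque([topic])
    trajectory: Trajectory = []
    while queue and len(trajectory) < max_calls:
        question = queue.popleft()
        trajectory.append((question, answer_fn(question)))
        is_complete, follow_ups = plan_fn(topic, trajectory)
        if is_complete:
            break
        queue.extend(follow_ups)
    return trajectory


def fake_answer(question: str) -> str:
    """Stand-in for a RAG call."""
    return f"notes on: {question}"


def fake_plan(topic: str, trajectory: Trajectory) -> Tuple[bool, List[str]]:
    """Stand-in planner: declare completion after three answers, else propose one follow-up."""
    if len(trajectory) >= 3:
        return True, []
    return False, [f"{topic} / sub-question {len(trajectory)}"]


for question, answer in explore("drone warfare", fake_answer, fake_plan, max_calls=10):
    print(question, "->", answer)
```

Lowering `max_calls` below three makes the budget, rather than the planner, end the run.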

---

In [None]:
literature_search_planning_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: planning involves intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

answer_synthesis_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-5-mini", # NOTE: synthesis does not require high intelligence, but requires minimal hallucination. GPT-5-mini is a good balance.
    temperature=1.0,
    max_tokens=20000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

# initialize the literature search agent
literature_search_agent = LiteratureSearchAgent(rag_agent=rag, literature_search_lm=literature_search_planning_lm, answer_synthesis_lm=answer_synthesis_lm)

# run the literature search agent
literature_search_response: LiteratureSearchAgentResponse = await literature_search_agent.aforward(LiteratureSearchAgentRequest(topic=TOPIC))

# save the response to a file
with open("output/action_item_4_literature_search_response.json", "w") as f:
    json.dump(literature_search_response.to_dict(), f, indent=2)

print(f"✅ Result saved to output/action_item_4_literature_search_response.json")

### Step 2: Systematic Literature Search Analysis

> **Analysis Target:** `output/action_item_4_literature_search_response.json`

#### Analysis Questions

1. **Information Quality:** What information did you find most valuable? Which insights provided new understanding of the topic?
2. **Coverage Gaps:** What important aspects or perspectives appear missing from the investigation?
3. **Source Diversity:** How well did the system identify and incorporate diverse viewpoints and source types?
4. **Research Depth:** Did the investigation progress beyond surface-level information to uncover nuanced insights?

#### Algorithm Question
5. **Read the code and comment on the stopping criteria of this pipeline.** (Hint: the code entry point is `cs224v_hw1/src/literature_search.py`, line 311.)

---

In [None]:
literature_search_information_quality_comment = """
The information quality is pretty mediocre. It does not cite relevant statistics very often, and it rambles on. For example, there are 5 sentences that mention the use of drones and UAVs, without providing any detail. I am relatively uninformed about the conflict, but I have a more detailed understanding of Ukraine's offensive strategy using drones. Also the main thesis is weak (This survey synthesizes observed strategic shifts, tactical and technological innovations, external influences, and projected future implications arising from the Russia\u2013Ukraine war.) It provides a very high level narrative of general dynamics of the war without providing an interesting storyline with specific details to follow.  
"""

literature_search_coverage_gaps_comment = """
Although the prompt is 'Evolving Military Strategies in the Russia-Ukraine War and Future Implications,' I still think it would be appropriate to mention the civilian tragedies that have defined the conflict. In addition, the literature search makes it seem that small groups of Russian soldiers striking and then retreating are part of some high-level strategy that departs from traditional Russian doctrine, but it would benefit from the nuance that some actions (especially the aforementioned small Russian strikes) are actually a result of disorganization. Finally, I think the report should mention Putin, Zelensky, and other political factors that affect strategy. 
"""

literature_search_source_diversity_comment = """
It appears that the sources are balanced and offer diverse viewpoints. It surveys the mainstream media, bipartisan research agencies like the CSIS, research articles, video investigative journalism from YouTube, and many other different kinds of sources. 
"""

literature_search_research_depth_comment = """
I think that the report offers excellent breadth on the technologies being used, but has limited depth on how these technologies work, why they are being used, specific examples, and the implications of their use. 
"""

literature_search_stop_criteria_comment = """
The pipeline uses logic-based tree exploration capped at a certain number of RAG function calls. It also uses the NextStepPlanner DSPy signature to check whether the literature search is 'complete' according to the guidelines, which allows the search to end early. 
"""

with open("output/action_item_4_literature_search_response_comment.json", "w") as f:
    json.dump({
        "information_quality_comment": literature_search_information_quality_comment,
        "coverage_gaps_comment": literature_search_coverage_gaps_comment,
        "source_diversity_comment": literature_search_source_diversity_comment,
        "research_depth_comment": literature_search_research_depth_comment,
        "stop_criteria_comment": literature_search_stop_criteria_comment
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_4_literature_search_response_comment.json")

## Action Item 5: Primary Source Database Exploration

> **Reference:** See **Section 4.2 (ACLED Database & Primary Source Analysis)** in the handout for comprehensive coverage of investigative research methodologies and database exploration strategies.

### Objective
Transition from literature synthesis to investigative analysis through systematic exploration of primary source data. This exercise demonstrates capabilities beyond web-accessible information by leveraging structured conflict databases.

**Data Source:** Armed Conflict Location & Event Data Project (ACLED)
- **Coverage:** Global conflict events with structured metadata
- **Temporal Scope:** Real-time updates with historical records
- **Analytical Advantages:** Systematic pattern detection, geographic clustering, temporal trend analysis
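
As a small illustration of temporal trend analysis on structured event data, the sketch below counts events per month and event type. The records are invented toy data, and the field names `event_date` and `event_type` are assumptions about the table layout rather than something confirmed by this notebook. `monthly_counts` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def monthly_counts(events: List[Dict[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count events per (YYYY-MM, event_type), assuming ISO-formatted dates."""
    counts: Counter = Counter()
    for event in events:
        month = event["event_date"][:7]  # "2024-03-15" -> "2024-03"
        counts[(month, event["event_type"])] += 1
    return dict(sorted(counts.items()))


# Invented toy records, not real ACLED data.
toy_events = [
    {"event_date": "2024-03-02", "event_type": "Explosions/Remote violence"},
    {"event_date": "2024-03-15", "event_type": "Explosions/Remote violence"},
    {"event_date": "2024-03-20", "event_type": "Battles"},
    {"event_date": "2024-04-01", "event_type": "Battles"},
]
for (month, event_type), n in monthly_counts(toy_events).items():
    print(month, event_type, n)
```

The same grouping extends to geographic clustering by adding a region field to the key.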

### Step 1: Database Exploration Agent Initialization

---

In [23]:
# This DSPy signature generates seed questions from the write-up of the previous literature search to kick-start the database exploration.
class ResearchQuestionGenerator(dspy.Signature):
    """You are conducting research to extract previously unknown insights by exploring and observing information in a database. Generate research questions that an investigator will be interested in. The questions will be used to generate search queries in the database to help answer them. The questions should be self-contained, meaning they must include any specific years, months, locations, or other details instead of references that require the reader to know additional context. All questions must be related to the research goal and topic. Investigate any correlations as you see fit. The questions should be completely independent of each other - if you believe some questions need to be answered first before others can be meaningful, only generate those foundational questions and not others that depend on the answers to the first questions. You do not need to generate the maximum number of questions if you believe fewer questions would be better for the research goals."""
    
    topic: str = dspy.InputField(desc="The research goal or topic being investigated")
    db_description: str = dspy.InputField(desc="Description of the available database including its structure, contents, and capabilities")
    max_questions: int = dspy.InputField(desc="Maximum number of questions to generate - you may generate fewer if appropriate")
    previous_insights: str = dspy.InputField(desc="Previous insights from the database")
    
    questions: List[str] = dspy.OutputField(desc="List of independent, self-contained research questions that will extract previously unknown insights related to the topic from the database")

generate_exploration_questions = dspy.Predict(ResearchQuestionGenerator)

In [None]:
# research question generation
database_exploration_lm_config = LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # database exploration planning requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
)
database_exploration_lm = init_lm(database_exploration_lm_config)

# load the literature search response
assert os.path.exists("output/action_item_4_literature_search_response.json"), "Please run the literature search first"
with open("output/action_item_4_literature_search_response.json", "r") as f:
    literature_search_response = LiteratureSearchAgentResponse.from_dict(json.load(f))

with dspy.context(lm=database_exploration_lm):
    seed_questions = (await generate_exploration_questions.aforward(
        topic=TOPIC, 
        db_description=ACLED_DB_DESCRIPTION, 
        max_questions=4, 
        previous_insights=literature_search_response.writeup
    )).questions

seed_questions_formatted = "\n\t- ".join(seed_questions)
print(f"✅ Seed questions generated. Seed questions:\n\t- {seed_questions_formatted}")

### Step 2: Primary Source Analysis Framework

> **Reference:** See **Action Item 5 Step 2** in the handout for detailed investigative analysis methodology.


Database exploration may take 30-60 minutes with the given topic and the seed questions generated above. We have therefore precomputed a result from the database exploration agent, available at `data/action_item_5_database_exploration_precomputed.json`.
> **Implementation Reference:** Complete code provided in Appendix section

#### Key Analysis Dimensions:
1. **Database Structure and Operations:** Document which specific ACLED data tables, fields, and query patterns the agent utilized. Understanding the data architecture helps assess the comprehensiveness and reliability of findings.

2. **Quantitative Evidence Gaps:** Compare database findings with your literature search results from Action Item 4. Identify novel insights unavailable through web-based sources and assess how structured data analysis reveals patterns obscured in traditional reporting.

#### Step 3: Investigative Hypothesis Formation

Based on your comparative analysis, develop at least **2 investigative thesis statements** that synthesize insights from both the literature search and the database exploration. Each thesis should be specific, highlight a tension or contradiction, name the actor and the action where present, use concrete numbers or data when they are central to the story, signal the impact on readers or society, and stay concise, punchy, and memorable while remaining grounded in the facts.

---

In [None]:
action_item_5_database_operation_comment = """
{YOUR COMMENT HERE}
"""

action_item_5_database_evidence_comment = """
{YOUR COMMENT HERE}
"""

action_item_5_proposed_theses = """
1. 
2. 
(optional) more...
"""

with open("output/action_item_5_comments.json", "w") as f:
    json.dump({
        "action_item_5_database_operation_comment": action_item_5_database_operation_comment,
        "action_item_5_database_evidence_comment": action_item_5_database_evidence_comment,
        "action_item_5_proposed_theses": action_item_5_proposed_theses
    }, f, indent=2)

print(f"✅ Result saved to output/action_item_5_comments.json")


## Action Item 6: DSPy Implementation for Automated Thesis Generation

> **Reference:** See **Section 4.3 (DSPy Framework)** in the handout.

### Objective
Implement an LLM-based thesis generation mechanism using the DSPy framework to automate the investigative hypothesis development process carried out manually in Action Item 5.

### Implementation Requirements

**DSPy Signature Specification:**
```python
class ThesisGenerator(dspy.Signature):
    # Implementation required
```

**Technical Specifications:**
- **Input Fields:** Research topic and database exploration insights
- **Output Specification:** Exactly 5 thesis statements
- **Constraint Requirements:** Each thesis must be specific, evidence-based, and analytically defensible
---
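
One way to pass database exploration insights to the signature is to format the question-answer pairs into a single numbered text block rather than concatenating them raw. The sketch below shows that idea along with a check for the five-thesis requirement. It assumes each entry is a dictionary with `question` and `answer` keys; `format_qa_pairs` and `check_thesis_count` are hypothetical helper names, not part of the course code.

```python
from typing import Dict, List


def format_qa_pairs(pairs: List[Dict[str, str]]) -> str:
    """Render question-answer pairs as a numbered block that is easy for an LLM to reference."""
    blocks = []
    for index, pair in enumerate(pairs, start=1):
        question = pair["question"].strip()
        answer = pair["answer"].strip()
        blocks.append(f"[{index}] Question: {question}\n    Answer: {answer}")
    return "\n\n".join(blocks)


def check_thesis_count(theses: List[str], expected: int = 5) -> List[str]:
    """Raise if the generator did not return exactly `expected` theses."""
    if len(theses) != expected:
        raise ValueError(f"expected {expected} theses but found {len(theses)}")
    return theses


toy_pairs = [
    {"question": "How many events were recorded? ", "answer": " (toy answer)"},
    {"question": "Which regions were affected?", "answer": "(toy answer)"},
]
print(format_qa_pairs(toy_pairs))
```

The numbered prefixes give the model stable handles for pointing back at specific evidence.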

In [None]:
from typing import Dict
class ThesisGenerator(dspy.Signature):
    """
    Generate one-sentence, argumentative theses that analyze the evidence and are insightful and relevant, in the style of an investigative journalist.
    """
    original_topic: str = dspy.InputField(
        desc="The initial research topic that was explored"
    )
    cited_documents: List[Dict[str, str]] = dspy.InputField(
        desc = "Flattened list of the questions and answers of each query."
    )

    # ===============================================
    # additional input fields here
    # Hint1: how to pass insights from database exploration agent to the thesis generator?
    # Hint2: will directly concatenate question and answers from database exploration agent work? Do we need any formatting?
    # ===============================================


    # Should only have one output field as defined below. Do not change the name of the output field.
    proposed_theses: List[str] = dspy.OutputField(
        desc="""Five research theses that are argumentative, well researched, and relevant, in the style of an investigative journalist."""
    )

thesis_generator = dspy.Predict(ThesisGenerator)
thesis_generator_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: thesis generation requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

In [20]:
database_exploration_rag_responses = []
with open("data/action_item_5_database_exploration_precomputed.json", "r") as f:
    database_exploration_precomputed = json.load(f)
    database_exploration_rag_responses = [{"question": entry["question"], "answer": entry["answer"]} for entry in database_exploration_precomputed['rag_service_responses']]

with dspy.context(lm=thesis_generator_lm):
    generated_theses = (await thesis_generator.aforward(
        original_topic=TOPIC,
        cited_documents=database_exploration_rag_responses
    )).proposed_theses

formatted_generated_theses = "\n\t- ".join(generated_theses)
print(f"generated_theses:\n\t- {formatted_generated_theses}")

with open("output/action_item_6_generated_theses.json", "w") as f:
    json.dump({
        "generated_theses": generated_theses
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_6_generated_theses.json")

print(f"\n\nTake a look at the output format and skim through the content. Are you satisfied with the results? Will your audience be interested in your proposed theses? When in doubt, it's always a good idea to revise the prompt and try again.")
if len(generated_theses) != 5:
    raise ValueError(f"❌ Please generate exactly 5 proposed theses but found {len(generated_theses)}.")


generated_theses:
	- The proliferation of drone warfare in the Russia-Ukraine conflict from 2022–2024 evidences a paradigmatic shift toward technology-driven, asymmetric military strategies, with Ukraine's adoption of long-range UAVs for cross-border operations signaling a new era of cost-effective, precision warfare that is likely to reshape regional security doctrines across Eastern Europe.
	- The escalation and broadening of civilian-targeted missile, shelling, and drone attacks in major Ukrainian urban centers demonstrates a deliberate Russian military strategy to degrade urban resilience and psychological morale, indicating a shift toward total warfare tactics and raising urgent questions about the future of urban conflict and civilian protection under international law.
	- The intensification and diversification of Ukrainian cross-border operations in Russian regions such as Belgorod and Kursk in 2024 reflect a doctrinal evolution from sporadic defensive actions to sustained, multi

**Extra human-in-the-loop step**
Manually review all generated theses and hand-pick the best two.

In [None]:
# Review the generated theses and hand-pick the best 2.
selected_theses = ["The proliferation of drone warfare in the Russia-Ukraine conflict from 2022–2024 evidences a paradigmatic shift toward technology-driven, asymmetric military strategies, with Ukraine's adoption of long-range UAVs for cross-border operations signaling a new era of cost-effective, precision warfare that is likely to reshape regional security doctrines across Eastern Europe.",
                   "Despite large numbers of drone, missile, and artillery strikes on civilian infrastructure, low per-event fatality rates combined with the routine occurrence of high-precision attacks suggest a strategic shift aimed not at mass casualties, but at long-term disruption, economic attrition, and internal destabilization, offering a blueprint for future 'hybrid warfare' where technology is weaponized for persistent low-intensity conflict."] # TODO: add the selected theses

assert len(selected_theses) == 2, f"❌ Please select exactly 2 theses but found {len(selected_theses)}."
with open("output/action_item_6_selected_theses.json", "w") as f:
    json.dump({
        "selected_theses": selected_theses
    }, f, indent=2)
print(f"✅ Result saved to output/action_item_6_selected_theses.json")

✅ Result saved to output/action_item_6_selected_theses.json


## Action Item 7: Automated Investigative Report Synthesis

> **Reference:** See **Section 4.4 (Synthesis & Report Composition)** in the handout for comprehensive methodology on evidence synthesis and investigative narrative construction.

### Objective
Synthesize your research findings into a comprehensive investigative report using automated DSPy-based processing pipelines that demonstrate advanced research capabilities beyond traditional literature synthesis.

### Generated Report Structure:
1. **Executive Summary:** A concise overview of your key findings and their significance
2. **Sections:** Use `#` for primary sections; Use `##` for subsections that organize detailed analysis; etc.
3. **Inline Citations:** Use numbered references in square brackets (e.g., `[1]`, `[2]`, `[3]`)
4. **Bibliography:** Conclude with a reference section where each line follows the format: `[index]. URL or source description`
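
The structure above can be checked mechanically before submission. The following is a minimal sketch, not part of the assignment code: `check_report_structure` is a hypothetical helper that assumes inline citations look like `[3]` and bibliography lines look like `[3]. URL`, as described in the list above.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")               # inline marker such as [3]
BIB_LINE = re.compile(r"^\[(\d+)\]\.\s+(\S.*)$")  # bibliography line such as "[3]. https://..."


def check_report_structure(report: str) -> dict:
    """Compare citation indices used in the body with those defined in the bibliography."""
    body_indices, bib_indices = set(), set()
    for line in report.splitlines():
        match = BIB_LINE.match(line.strip())
        if match:
            bib_indices.add(int(match.group(1)))
        else:
            body_indices.update(int(n) for n in CITATION.findall(line))
    return {
        "missing_from_bibliography": sorted(body_indices - bib_indices),
        "never_cited": sorted(bib_indices - body_indices),
    }


# Illustrative report text only; the URLs are placeholders.
sample = """# Executive Summary
Drone use expanded sharply [1], changing regional doctrine [2].

# Bibliography
[1]. https://example.org/a
[3]. https://example.org/c
"""
result = check_report_structure(sample)
print(result)
```

An empty list under both keys suggests that every inline citation has a bibliography entry and that every entry is cited at least once.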

### Implementation Process:

**Step 1: Thesis-Specific Literature Search**

Using the two theses you selected in Action Item 6, conduct a targeted literature search for each thesis. Build a `LiteratureSearchAgentRequest` with the `guideline` parameter to focus each search on evidence relevant to that specific thesis. Set `with_synthesis=False` to collect the raw `rag_responses` for later processing.

**Step 2: Key Insight Identification**

Implement a `KeyInsightIdentifier` DSPy signature to automatically extract the most important insight from each RAG response. This reduces noise and focuses on essential information for report composition. The key insight should be a concise, one-sentence summary capturing the most relevant information for each question-answer pair.
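
Because each RAG response is processed independently, the extraction calls do not have to run one after another. The sketch below shows one way to bound concurrency with the standard library. `gather_bounded` and `fake_extract_insight` are hypothetical names introduced for illustration, and the stand-in coroutine simply returns the first sentence instead of calling a language model.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Run `worker` on every item with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> str:
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_extract_insight(answer: str) -> str:
    """Hypothetical stand-in for the LLM call: returns the first sentence."""
    await asyncio.sleep(0.01)
    return answer.split(". ")[0].rstrip(".") + "."


answers = ["Drones replaced artillery. More detail.", "Doctrine is shifting. Extra context."]
insights = asyncio.run(gather_bounded(answers, fake_extract_insight, max_concurrency=2))
print(insights)
```

Inside a notebook, where an event loop is already running, `await gather_bounded(...)` takes the place of `asyncio.run(...)`. A small `max_concurrency` is a conservative choice when the API endpoint enforces rate limits.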

**Step 3: Report Structure and Guideline Generation**

Create a `FinalWritingGuidelineProposal` DSPy signature that uses the collected key insights to:
- Generate a unified report thesis that synthesizes your investigative findings
- Propose a structured writing guideline in bullet-point format outlining key sections and content organization
- Ensure logical flow from background through specific discoveries to implications

**Step 4: Automated Report Synthesis**

Implement a `FinalReportSynthesizer` DSPy signature that combines the report thesis, writing guidelines, and all collected evidence into a coherent investigative report. The synthesizer should:
- Merge relevant information into logically coherent narrative sections
- Preserve all original citations exactly as provided
- Eliminate redundancy while maintaining completeness
- Create smooth transitions between thematic sections
- Constrain content to provided information without external speculation
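
The citation-preservation requirement can be spot-checked after synthesis. This is a minimal sketch under the assumption that citations appear as bracketed integers such as `[1]`; `audit_citations` is a hypothetical helper, not part of the provided `src` package.

```python
import re
from typing import Dict, List

CITATION = re.compile(r"\[(\d+)\]")


def audit_citations(gathered_information: str, final_report: str) -> Dict[str, List[int]]:
    """Compare the citation indices in the evidence with those in the report."""
    provided = {int(n) for n in CITATION.findall(gathered_information)}
    used = {int(n) for n in CITATION.findall(final_report)}
    return {
        "invented": sorted(used - provided),  # cited in the report but absent from the evidence
        "dropped": sorted(provided - used),   # present in the evidence but never cited
    }


# Illustrative strings only.
evidence = "Sub-question: Q1\nAnswer: A [1] and B [2].\nSub-question: Q2\nAnswer: C [3]."
report = "# Findings\nA and B are linked [1][2]. A further claim [7]."
audit = audit_citations(evidence, report)
print(audit)
```

A non-empty `invented` list points to citations the synthesizer made up. A non-empty `dropped` list is not necessarily an error, since the synthesizer may legitimately exclude irrelevant answers, but it is worth reviewing.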

### Quality Standards:

Your final report should demonstrate investigative depth by presenting original analytical insights not readily available in existing literature, comprehensive coverage through automated integration of evidence from multiple search iterations, logical organization with clear progression from initial questions to final conclusions, and proper attribution through systematic citation preservation.

The completed report should read as a coherent investigative piece rather than a collection of separate research summaries, showcasing the power of automated synthesis in building compelling investigative narratives from distributed evidence sources.

---

In [None]:
# After selecting your theses, run another round of literature search to gather evidence for each of them.
rag_responses = []

for idx, thesis in enumerate(selected_theses, 1):
    # ===============================================
    # TODO: generate literature search request. Take a look at definition of LiteratureSearchAgentRequest
    # Hint: previously we only use the field `topic` and leave other fields as default. Now we need to make use of the field `guideline`. Take a look at the source code to understand how the guideline is used.
    # Hint 2: we disable the synthesis step by setting `with_synthesis=False` as we only need the rag responses.
    # ===============================================
    literature_search_response: LiteratureSearchAgentResponse = await literature_search_agent.aforward(LiteratureSearchAgentRequest(topic = thesis, guideline = "Searches contain citations and statistics that support or refute the thesis.", with_synthesis=False))
    
    with open(f"output/action_item_7_literature_search_response_{idx}.json", "w") as f:
        json.dump(literature_search_response.to_dict(), f, indent=2)
    print(f"✅ Result saved to output/action_item_7_literature_search_response_{idx}.json")

    rag_responses.extend(literature_search_response.rag_responses) # TODO: add the rag response to the list


In [33]:
# Identify key insight for each rag response

# The goal of key insight identification is to extract the most important insight from each RAG response. Think of it as a one-sentence summary of the part of the RAG response that matters most to the question. Doing this reduces the noise in the RAG responses and keeps the focus on the most important information when determining the final report outline.

class KeyInsightIdentifier(dspy.Signature):
    """
    Analyze the question, content, and answer provided and identify a one sentence key insight that summarizes the most important information turned up by the RAG query.
    """
    question: str = dspy.InputField(
        desc="The question that was asked"
    )
    question_context: str = dspy.InputField(
        desc="The context of the question"
    )
    answer: str = dspy.InputField(
        desc="The answer to the question aggregating information from external sources"
    )
    key_insight: str = dspy.OutputField(
        desc="One sentence key insight."
    )

key_insight_identifier = dspy.Predict(KeyInsightIdentifier)
key_insight_identifier_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1-mini", # NOTE: should we use a more powerful model? You can play around with it.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))


for rag_response in tqdm(rag_responses):
    with dspy.context(lm=key_insight_identifier_lm):
        rag_response.key_insight = (await key_insight_identifier.aforward(
            question=rag_response.question,
            question_context=rag_response.question_context,
            answer=rag_response.answer
        )).key_insight

with open("output/action_item_7_rag_responses_with_key_insight.json", "w") as f:
    json.dump([rag_response.to_dict() for rag_response in rag_responses], f, indent=2)

for rag_response in rag_responses:
    print(rag_response.key_insight + "\n")
print(f"✅ Result saved to output/action_item_7_rag_responses_with_key_insight.json")

100%|██████████| 8/8 [00:09<00:00,  1.20s/it]

Between 2022 and 2024, drone warfare in the Russia-Ukraine conflict escalated to unprecedented scales, with both nations deploying and producing millions of UAVs—Russia launching over a million drones including 6,000+ Shahed strike drones monthly and Ukraine manufacturing millions via decentralized grassroots efforts—signifying a fundamental shift that sees drones increasingly replacing artillery and transforming battlefield dynamics through mass usage and advanced technological integration.

Ukraine’s long-range UAVs have enabled cost-effective, precision cross-border strikes that disrupt strategic Russian military assets and infrastructure through innovative AI-driven swarm tactics and scalable domestic drone production, proving a transformative model for modern warfare.

The extensive and innovative use of drones in the Russia-Ukraine conflict is driving Eastern European countries and NATO to rapidly adapt their regional security doctrines by prioritizing integrated drone warfare ca




In [44]:
# Final report title and guideline proposal

# We will generate the thesis and the writing guideline for the final report. The guideline should be in bullet-point format, outlining the key points that should be included in the report.

class FinalWritingGuidelineProposal(dspy.Signature):
    """
    Generate an executive summary and final report that use markdown emphasis and sections, built around a one-sentence argumentative thesis that analyzes the evidence the way an investigative journalist would to produce insightful and relevant claims. It should include inline citations and a bibliography.

    """
    ... # TODO: define any input fields. Hint: you can use the key insight from each rag response to help you decide what to include in the guideline.
    question: List[str] = dspy.InputField(
        desc="The sub-research question that was explored"
    )
    key_insights: List[str] = dspy.InputField(
        desc="Key insight for each RAG response (the answer)"
    )
    context: List[str] = dspy.InputField(
        desc="Rationale behind asking the sub-research question"
    )
    cited_documents: List[List[str]] = dspy.InputField(
        desc="URLs of cited documents"
    )
    thesises: str = dspy.InputField(
        desc="The two selected theses that the final report will be based on"
    )
    report_thesis: str = dspy.OutputField(
        desc="A sentence no more than 14 words that provides an investigative argument.") # TODO: add/improve instructions
    writing_guideline: str = dspy.OutputField(
        desc="Detailed writing guidelines, in bullet-point format, that a synthesis model can follow to turn the report thesis and the key insights into an investigative report." # TODO: add/improve instructions
    )

final_writing_guideline_proposal = dspy.Predict(FinalWritingGuidelineProposal)
final_writing_guideline_proposal_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-4.1", # NOTE: report title proposal requires high intelligence, so we use a more powerful model.
    temperature=1.0,
    max_tokens=10000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))
# Reload the RAG responses saved in the previous cell, including their key insights.
with open("output/action_item_7_rag_responses_with_key_insight.json", "r") as f:
    saved_rag_responses = json.load(f)

guideline_inputs = [
    {
        "question": entry["question"],
        # Fall back to the full answer if no key insight was saved for this entry.
        "key_insight": entry.get("key_insight") or entry["answer"],
        "context": entry["question_context"],
        "sources": [cited_document["url"] for cited_document in entry["cited_documents"]],
    }
    for entry in saved_rag_responses
]

with dspy.context(lm=final_writing_guideline_proposal_lm):
    final_writing_guideline_proposal_response = (await final_writing_guideline_proposal.aforward(
        question=[entry["question"] for entry in guideline_inputs],
        key_insights=[entry["key_insight"] for entry in guideline_inputs],
        context=[entry["context"] for entry in guideline_inputs],
        cited_documents=[entry["sources"] for entry in guideline_inputs],
        thesises=f"Thesis 1: {selected_theses[0]}, Thesis 2: {selected_theses[1]}"
    ))

final_writing_thesis = final_writing_guideline_proposal_response.report_thesis
final_writing_guideline = final_writing_guideline_proposal_response.writing_guideline

with open("output/action_item_7_final_writing_guideline_proposal.json", "w") as f:
    json.dump({
        "report_thesis": final_writing_thesis,
        "writing_guideline": final_writing_guideline
    }, f, indent=2)

print(f"report title: {final_writing_thesis}\n\n")
print(f"writing guideline: {final_writing_guideline}")
print(f"✅ Result saved to output/action_item_7_final_writing_guideline_proposal.json")


report title: The Russia-Ukraine drone war signals a shift from mass casualties to technological, disruptive hybrid warfare.


writing guideline: 1. **Executive Summary:**  
   - Begin with a concise synthesis of the key findings that supports the argumentative thesis: the Russia-Ukraine conflict is a watershed moment marking a shift in warfare from mass casualty events to persistent, technologically-enabled disruption and attritional strategies.  
   - Clearly state that drone warfare, as seen in 2022–2024, exemplifies this transition and shapes both local and regional security doctrines.

2. **Evidence-based Narrative:**  
   - Structure the report to tangibly link both main theses:  
     a) The vast proliferation and sophistication of drone and drone-assisted warfare, emphasizing quantitative escalations (millions of UAVs, daily attack rates), innovation (AI integration, decentralized production, long-range UAV operations), and operational outcomes (cross-border strikes, cost effic

In [45]:
def _normalize_rag_response_citation_indices(rag_responses: List[RagResponse]) -> Tuple[List[str], List[RetrievedDocument]]:
        """
        Normalize citation indices across multiple RAG (retrieval-augmented generation) responses.

        Each `RagResponse` contains:
        - `answer`: a string with inline citations like [1], [2], ...
        - `cited_documents`: the list of documents those citations refer to

        Problem:
        Citation indices restart at [1] for every response, but when combining answers,
        we want all citations to point to a single global list of retrieved documents.

        What this function does:
        1. Iterates over all RAG responses in order.
        2. Shifts the local citation indices in each answer so that they correctly map
            into the combined list of all retrieved documents.
            - For example, if the first response cited 3 docs ([1], [2], [3]),
            then the second response’s citations start at [4], not [1].
        3. Prefixes each updated answer with its corresponding sub-question for clarity.
        4. Returns:
            - A list of normalized answers (with corrected citation indices).
            - The flattened list of all retrieved documents in the proper order.

        Example:
            Input (two RAG responses):
                R1: "Paris is in France [1].", docs=[docA]
                R2: "Berlin is in Germany [1].", docs=[docB]

            Output:
                answers = [
                "Sub-question: ...\nAnswer: Paris is in France [1].",
                "Sub-question: ...\nAnswer: Berlin is in Germany [2]."
                ]
                documents = [docA, docB]
        """
        all_documents: List[RetrievedDocument] = []
        all_updated_answers: List[str] = []
        for idx, rag_response in enumerate(rag_responses):
            citation_offset = len(all_documents)
            updated_answer = rag_response.answer
            for i in range(len(rag_response.cited_documents)):
                updated_answer = updated_answer.replace(f"[{i+1}]", f"[tmp_{citation_offset+i+1}]")
            for i in range(len(rag_response.cited_documents)):
                updated_answer = updated_answer.replace(f"[tmp_{citation_offset+i+1}]", f"[{citation_offset+i+1}]")

            all_updated_answers.append(
                f"Sub-question: {rag_response.question}\nAnswer: {updated_answer}\n")
            all_documents.extend(rag_response.cited_documents)
        return all_updated_answers, all_documents
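
The same offset idea can also be expressed with a single regex substitution, which avoids the temporary `[tmp_…]` placeholders. The sketch below is an illustrative alternative rather than the notebook's implementation: `shift_citations` is a hypothetical name, and plain strings stand in for `RagResponse` answers and `RetrievedDocument` objects.

```python
import re
from typing import List, Tuple

CITATION = re.compile(r"\[(\d+)\]")


def shift_citations(answers_and_docs: List[Tuple[str, List[str]]]) -> Tuple[List[str], List[str]]:
    """Renumber per-answer citations so they index one combined document list."""
    all_docs: List[str] = []
    shifted_answers: List[str] = []
    for answer, docs in answers_and_docs:
        offset = len(all_docs)

        def renumber(match: "re.Match[str]", offset: int = offset, count: int = len(docs)) -> str:
            local = int(match.group(1))
            # Leave markers that do not map to a cited document untouched.
            if 1 <= local <= count:
                return f"[{local + offset}]"
            return match.group(0)

        shifted_answers.append(CITATION.sub(renumber, answer))
        all_docs.extend(docs)
    return shifted_answers, all_docs


# Illustrative inputs mirroring the docstring example above.
pairs = [
    ("Paris is in France [1].", ["docA"]),
    ("Berlin is in Germany [1], near Poland [2].", ["docB", "docC"]),
]
shifted, documents = shift_citations(pairs)
print(shifted)
print(documents)
```

Because each marker is rewritten exactly once, an already shifted index cannot be picked up again by a later replacement.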

In [48]:
# Final report synthesis

# We will synthesize the final report using generated thesis, guideline, and rag responses.

class FinalReportSynthesizer(dspy.Signature):
    """
    You are an investigative journalist composing a report with given thesis, guideline, and useful information from previous literature search.

    CONTENT INTEGRATION RULES:
    - Merge all relevant sub-question answers into a logically coherent narrative
    - Create clear thematic sections with smooth transitions between topics. Use #, ##, ###, etc. to create title of sections and sub-sections.
    - Eliminate redundancy while preserving all unique factual content
    - Exclude sub-questions/answers that don't contribute meaningfully to the survey topic
    - Maintain completeness - no loss of relevant information from source material
    - No title, conclusion, summary, or reference at the end of the answer.

    CITATION PRESERVATION:
    - Preserve ALL original citations exactly as provided - no format modifications
    - They should be bracketed and numbered. 
    - Only use citations provided in the gathered information. Do not create new citations.
    - Ensure every citation in the text corresponds to a source in the bibliography.

    CONTENT CONSTRAINTS:
    - All content should be able to be connected to a cited document or the ACLED database. 
    - Identify when the model might be speculating based on assumed knowledge or generalizations, and remove such sections. 
    """
    report_thesis: str = dspy.InputField(
        desc="The proposed thesis for the investigative journalism report"
    )
    writing_guideline: str = dspy.InputField(
        desc="The proposed writing guideline for the final report in bullet point format"
    )
    gathered_information: str = dspy.InputField(
        desc="""Complete set of sub-question answers with their inline citations from previous research steps. 
        Format typically includes:
        - Sub-question: [question text]
        - Answer: [detailed response with inline citations [1], [2], etc.]
        - (Repeated for multiple sub-questions)"""
    )

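    # Optional extra input field: the synthesis call passes `report_style`, so it is declared here.
    report_style: str = dspy.InputField(
        desc="The desired style and level of detail for the final report"
    )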

    final_report: str = dspy.OutputField(
        desc="The final investigative report in markdown format. Use # for primary sections; Use ## for subsections that organize detailed analysis."
    )

final_report_synthesizer = dspy.Predict(FinalReportSynthesizer)
final_report_synthesizer_lm = init_lm(LanguageModelProviderConfig(
    provider=LanguageModelProvider.LANGUAGE_MODEL_PROVIDER_LITELLM_SERVER,
    model_name="gpt-5-mini", # NOTE: final report synthesis has to consolidate information from a very long context and requires strong reasoning ability. A larger model can improve quality; a smaller one such as gpt-5-mini (used here) or gpt-4.1 keeps the run time manageable.
    temperature=1.0,
    max_tokens=20000,
    litellm_server_config=LiteLLMServerConfig(api_key=os.getenv("LITELLM_API_KEY"), api_base=os.getenv("LITELLM_API_BASE"))
))

# TODO: complete the input fields. Read the function doc to understand what the function does.
all_updated_answers, all_documents = _normalize_rag_response_citation_indices(rag_responses)

with dspy.context(lm=final_report_synthesizer_lm):
    final_report = (await final_report_synthesizer.aforward(
        report_thesis=final_writing_thesis,
        writing_guideline=final_writing_guideline,
        gathered_information="\n".join(all_updated_answers),  # the field is declared as str, so join the list
        report_style="Comprehensive, highly accurate, and exhaustive; include every relevant detail and ensure no important information is omitted."
    )).final_report

with open("output/action_item_7_final_report_raw.md", "w") as f:
    f.write(final_report)
print(f"✅ Result saved to output/action_item_7_final_report_raw.md")

✅ Result saved to output/action_item_7_final_report_raw.md


In [49]:
print(all_documents)



In [50]:
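# TODO: manually add the bibliography to the final report. Review the output. Adjust the prompt and rerun the report synthesis if necessary. Make sure the report has a title, an executive summary, sections with the desired inline citations, and a bibliography.

# Append the bibliography to the synthesized report, one "[index]. URL" entry per line.
bibliography = "\n".join([f"[{i+1}]. {doc.url}" for i, doc in enumerate(all_documents)])
final_report_with_bibliography = f"{final_report}\n\n# Bibliography\n\n{bibliography}\n"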

with open("output/action_item_7_final_report.md", "w") as f:
    f.write(final_report_with_bibliography)
print(f"✅ Result saved to output/action_item_7_final_report.md")

✅ Result saved to output/action_item_7_final_report.md


**Review the generated report. Reflect on its weaknesses and explain in detail how you would improve it.**

In [None]:
weaknesses_and_improvements = """
{weaknesses_and_improvements}
"""

with open("output/action_item_7_weaknesses_and_improvements.md", "w") as f:
    f.write(weaknesses_and_improvements)
print(f"✅ Result saved to output/action_item_7_weaknesses_and_improvements.md")


## Create submission

In [None]:
! python create_submission.py

## Appendix

### Database Exploration Agent Implementation

> **Note:** This implementation is provided for reference purposes. Utilize the pre-computed results in Action Item 5 for efficient completion of the assignment.

---

In [None]:
# API endpoint URL
api_url = "https://cs224v-database-agent.genie.stanford.edu/database-exploration"

# Prepare the request payload
payload = {
    "topic": TOPIC,
    "seed_questions": seed_questions,
    "lm_config": database_exploration_lm_config.to_dict()
}

# NOTE: uncomment the code below to make the request

# print("Making request to database exploration endpoint... Might take up to 30 minutes")
# async with httpx.AsyncClient(timeout=6000.0) as client:
#     r = await client.post(api_url, json=payload)
#     r.raise_for_status()
#     database_exploration_response = r.json()

# with open("output/action_item_5_database_exploration.json", "w") as f:
#     json.dump(database_exploration_response, f, indent=2)

# print(f"✅ Result saved to output/action_item_5_database_exploration.json")
