# Stanford CS 329a Self-Improving AI Agents, Homework 3

This assignment builds an end-to-end agentic workflow by combining LLMs with external APIs and self-improvement techniques.

- **Part 1:** Implementing an API-augmented LLM pipeline that integrates four external APIs and generates responses grounded in the API outputs.
- **Part 2:** Applying self-improvement methods:
  - Query decomposition & fusion: Breaks complex queries into sub-queries, retrieves API outputs, and fuses the results.
  - Iterative refinement: Uses the LLM to trigger additional API calls and refine the response until sufficient information is gathered to answer the query.
- **Part 3:** Integrating these components into a full agentic workflow and evaluating accuracy on the test set.
- **Part 4:** Extending the system into a deep research agent for knowledge-intensive tasks, inspired by recent AI launches.

**Final Deliverable**: A zipped folder (.zip) of your edited files.

In [None]:
# Auto-reloads imported modules, so changes to .py files are automatically reflected in the notebook
%load_ext autoreload
%autoreload 2

## Package Installation

#### API Configuration and Setup

Before we can use the APIs, we need to set up our environment and initialize the API manager. This involves:

1. Loading API keys from environment variables
2. Setting up the API manager with the necessary credentials
3. Validating that all required keys are present

For this homework, we'll use:
- Google Custom Search API (requires API key and Custom Search Engine ID):
     - https://developers.google.com/custom-search/v1/overview
     - https://programmablesearchengine.google.com/controlpanel/create
       -  Once created, the Custom Search Engine ID can be found here: https://programmablesearchengine.google.com/controlpanel/all
- Polygon API (requires API key):
     - https://polygon.io/dashboard/keys
- Wolfram Alpha API (requires app ID): 
     - https://developer.wolframalpha.com/access

**Note**: Be mindful of your API usage and compute budget. Use smaller, more cost-effective models (e.g. gemma-3n-E4B-it) for development before scaling to larger models. For security, never hardcode API keys; always use a `.env` file and add it to your `.gitignore`.
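
When a key is missing, it helps to know which one. The sketch below reports the names of unset or empty variables. The helper name `find_missing_keys` is hypothetical (it is not part of the starter code), and it accepts an explicit mapping so it can be tried without touching the real environment.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the setup cell in this notebook
REQUIRED_KEYS = (
    "GOOGLE_API_KEY",
    "GOOGLE_CX_ID",
    "POLYGON_API_KEY",
    "TOGETHER_API_KEY",
    "WOLFRAM_APP_ID",
)


def find_missing_keys(
    required: Sequence[str] = REQUIRED_KEYS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment
example_env = {"GOOGLE_API_KEY": "abc", "GOOGLE_CX_ID": "", "POLYGON_API_KEY": "xyz"}
print(find_missing_keys(env=example_env))
# ['GOOGLE_CX_ID', 'TOGETHER_API_KEY', 'WOLFRAM_APP_ID']
```

Raising an error that lists these names usually makes a misconfigured `.env` file quicker to fix than a generic message.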

In [None]:
import os
from dotenv import load_dotenv
from cs329a_hw3.api_manager import APIManager

# Load environment variables from .env file
load_dotenv()

# Get API keys
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
GOOGLE_CX_ID = os.getenv('GOOGLE_CX_ID')
POLYGON_API_KEY = os.getenv('POLYGON_API_KEY')
TOGETHER_API_KEY = os.getenv('TOGETHER_API_KEY')
WOLFRAM_APP_ID = os.getenv('WOLFRAM_APP_ID')

if not all([GOOGLE_API_KEY, GOOGLE_CX_ID, POLYGON_API_KEY, TOGETHER_API_KEY, WOLFRAM_APP_ID]):
    raise ValueError("One or more required API keys are missing from environment variables")

api_manager = APIManager(
    google_api_key=GOOGLE_API_KEY,
    google_cx_id=GOOGLE_CX_ID,
    polygon_api_key=POLYGON_API_KEY,
    wolfram_app_id=WOLFRAM_APP_ID
)

print("APIManager initialized successfully.")

## Part 0 - LLM Performance with a Single Call [5 points]

To establish a baseline, let's test the performance of a standalone Large Language Model (LLM) without any external tools or APIs. This helps illustrate the limitations of relying solely on the model's internal knowledge, especially for questions requiring real-time, specific, or computational information.

**Deliverable:**
- In the `cs329a_hw3/multi_lm_agent.py` file, implement the `generate` method in the MultiLMAgent class.

**Observe that:**
- LLMs can hallucinate or provide outdated information for fact-based queries.
- Models often struggle with precise mathematical calculations, partly because of how numbers are tokenized, whereas a tool like Wolfram Alpha can solve them instantly.
- Queries about current stock prices or today's weather are outside the scope of a model's static training data.
- Evaluation is difficult for unstructured model outputs. Numerical answers can appear in multiple formats, which makes naive string matching unreliable, and natural-language answers can be phrased in many ways. To that end, we've provided an `evaluate_qa` function that uses a fast, cheap model (gemma-3n-E4B-it) to evaluate the model's output more reliably.
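
To make the last point concrete, here is a small illustration of why exact string comparison fails on numeric answers. This is not how the provided `evaluate_qa` works (it uses an LLM judge); the helpers `extract_number` and `numbers_match` are hypothetical and exist only for this demonstration.

```python
import re
from typing import Optional


def extract_number(text: str) -> Optional[float]:
    """Return the first number found in text, ignoring thousands separators."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def numbers_match(response: str, answer: str, rel_tol: float = 1e-3) -> bool:
    """Compare the first number in each string within a relative tolerance."""
    a, b = extract_number(response), extract_number(answer)
    if a is None or b is None:
        return False
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)


response, answer = "The closing price was $1,234.50.", "1234.5"
print(response == answer)               # False: the strings differ
print(numbers_match(response, answer))  # True: the numbers agree
```

Even this normalization breaks down with units, ranges, or answers phrased in words, which is the motivation for using a model-based judge instead.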

In [None]:
from cs329a_hw3.multi_lm_agent import MultiLMAgent
from cs329a_hw3.evaluation import prepare_dataset, evaluate_qa

multi_lm_agent = MultiLMAgent(api_manager)
dataset = prepare_dataset(debug_mode=False)

queries, answers = dataset['query'], dataset['answer']
num_correct = 0
print("Zero-Shot Responses:")
zero_shot_responses = []
for query, answer in zip(queries, answers):
    response = multi_lm_agent.generate(query=query, model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")
    zero_shot_responses.append(response)
    print(f"Query: {query}")
    print(f"Zero-Shot Response: {response}")
    print(f"Answer: {answer}")
    is_correct = evaluate_qa(query, response, answer, model="google/gemma-3n-E4B-it")
    num_correct += is_correct
    print("Is Correct:", is_correct)
    print("-" * 50)

accuracy_zero_shot = num_correct / len(queries)
print(f"Accuracy = {accuracy_zero_shot*100:.1f}%")

## Part 1 - API-augmented LLM pipeline [40 points]

In this part, we augment the LLM with four different APIs. We first implement each API call, then build an LLM-based router that sends each query to the appropriate API, and finally prompt the LLM to generate a response grounded in the API output.

We will work with the following APIs and select the appropriate API for a given query:

1. **Google Custom Search API** - For web search
2. **Polygon API** - For financial data
3. **Wolfram Alpha API** - For mathematical calculations
4. **Weather API** - For location-based weather data

#### 1a. Google Custom Search API [5 points]

The Google Custom Search API allows us to programmatically search the web. We'll use this to gather information and context for our tasks.

Key features:
- Web search with customizable parameters
- Filtering and sorting options
- Rich metadata about search results

Deliverable: In the ``cs329a_hw3/api_manager.py`` file, implement the ``_extract_webpage_content`` and ``google_search`` functions.

**Note**: The `google_search` function will return long webpages, so we will need to truncate or parse the response to get the relevant information. Otherwise, the added context will exceed the context window of the LMs in later functions.
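
One simple way to keep page text within a budget is character-based truncation at a word boundary. The sketch below is only one option, and the helper name `truncate_text` and the default budget are assumptions rather than part of the starter code.

```python
def truncate_text(text: str, max_chars: int = 2000) -> str:
    """Collapse whitespace and cut text to roughly max_chars characters."""
    cleaned = " ".join(text.split())
    if len(cleaned) <= max_chars:
        return cleaned
    cut = cleaned[:max_chars]
    # Drop the last (possibly partial) word when there is a space to cut at
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    # The marker adds a few characters beyond the kept text
    return cut + " ..."


print(truncate_text("alpha beta gamma delta", max_chars=12))
# alpha beta ...
```

A token-based budget or extracting only query-relevant passages would be tighter, at the cost of more code.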

In [None]:
search_query = "Apple Product News"
results = api_manager.google_search(
    search_query=search_query,
    num_results=3
)

print("\nSearch results for:", search_query)
print("-" * 50)
for i, result in enumerate(results, 1):
    print(f"Result {i}:")
    for k, v in result.items():
        print(f"\t{k}: {v}")
    print("-" * 50)

#### 1b. Polygon API [5 points]

The Polygon API provides real-time and historical financial data. We'll use this for analyzing stock market information.

Key features:
- Real-time stock quotes
- Historical price data
- Technical indicators
- Company fundamentals

Deliverable: In the ``cs329a_hw3/api_manager.py`` file, implement the ``get_stock_data`` function.

In [None]:
ticker, date = "TSLA", "2025-08-27"
stock_data = api_manager.get_stock_data(ticker=ticker, date=date)
if isinstance(stock_data, dict):
    print(f"\nStock data for {ticker} on {date}:")
    for k, v in stock_data.items():
        print(f"\t{k}: {v}")
else:
    print("\tNo stock data available")

#### 1c. Wolfram Alpha API [5 points]

The Wolfram Alpha API provides powerful computational capabilities to answer mathematical queries.

**Deliverable**: In the ``cs329a_hw3/api_manager.py`` file, implement the `compute` function.

In [None]:
math_query = "integral cos(x)/sqrt(x) from 0 to 1"
result = api_manager.compute(math_query)
print(f"Wolfram Alpha result for {math_query}: {result}")

#### 1d. Weather API [5 points]

The Weather API provides historical weather data for any location, including temperature, precipitation, wind, etc.

To implement this, we'll use Nominatim from the [Geopy API](https://geopy.readthedocs.io/en/stable/) to geocode a location, and then use the [Open-Meteo API](https://open-meteo.com/) to get the weather data for that location.

Deliverable: In the ``cs329a_hw3/api_manager.py`` file, implement the ``get_weather`` function.
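
Open-Meteo returns hourly data as parallel arrays under an `hourly` key, with a `time` array of ISO 8601 timestamps. Assuming that layout, the sketch below picks out the values for one hour. The helper name `select_hour` is hypothetical, the sample payload values are made up, and the `"error"` key simply mirrors the check used in the cell that follows.

```python
from typing import Any, Dict


def select_hour(payload: Dict[str, Any], date: str, hour: str) -> Dict[str, Any]:
    """Return the hourly variables for the given date and hour, or an error dict."""
    hourly = payload.get("hourly", {})
    target = f"{date}T{int(hour):02d}:00"
    times = hourly.get("time", [])
    if target not in times:
        return {"error": f"No data for {target}"}
    idx = times.index(target)
    # Every other array under "hourly" is aligned with the "time" array
    return {name: values[idx] for name, values in hourly.items() if name != "time"}


# Made-up payload in the assumed shape, for illustration only
sample_payload = {
    "hourly": {
        "time": ["2025-01-01T13:00", "2025-01-01T14:00"],
        "temperature_2m": [14.2, 15.1],
        "wind_speed_10m": [8.0, 9.5],
    }
}
print(select_hour(sample_payload, "2025-01-01", "14"))
# {'temperature_2m': 15.1, 'wind_speed_10m': 9.5}
```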

In [None]:
# We can use the API manager's weather functionality
from datetime import datetime
location = "Palo Alto, CA"
date = datetime.now().strftime('%Y-%m-%d')
hour = "14"

weather_data = api_manager.get_weather(location, date, hour)
if "error" not in weather_data:
    print(f"Weather conditions for {location} on {date} at {hour}:00:")
    for k, v in weather_data.items():
        print(f"\t{k}: {v}")
else:
    print(f"\tCould not get weather data for {location}: {weather_data['error']}")

#### 1e. API Routing [15 points]

API routing uses a language model to decide which API function should handle a query and with which parameters. This allows us to build an agent that can use multiple APIs to answer user queries.

Key requirements for the function logic and prompt construction:
- Query an LLM to determine the appropriate API to use for the query
- Correctly parse the query response and map it to the appropriate API function
- In the query response, include the API name and parameters to be used
- Query the selected API and return the response from the API after parsing
- Handle edge cases and fallbacks for query parsing and API selection

Deliverable: In the ``cs329a_hw3/api_manager.py`` file, implement the ``_parse_query_params`` and ``route_query`` functions.

Note: TogetherAI supports structured outputs, which can simplify the implementation of these functions: https://docs.together.ai/docs/json-mode.
We use Pydantic models to define the expected parameters for each API.
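
The parsing and fallback requirements above can be sketched as follows. To stay dependency-free, this version validates with the standard library instead of Pydantic, and the JSON shape (`"api"` and `"params"` keys) is an assumption for illustration, not the format the starter code expects. The API names are the `APIManager` methods used in this notebook.

```python
import json
from typing import Any, Dict

ALLOWED_APIS = {"google_search", "get_stock_data", "compute", "get_weather"}


def parse_routing_decision(raw: str, query: str) -> Dict[str, Any]:
    """Parse the router LLM's JSON output, falling back to web search on any problem."""
    fallback = {"api": "google_search", "params": {"search_query": query}}
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(decision, dict):
        return fallback
    api, params = decision.get("api"), decision.get("params")
    if not isinstance(api, str) or api not in ALLOWED_APIS or not isinstance(params, dict):
        return fallback
    return {"api": api, "params": params}


good = '{"api": "get_stock_data", "params": {"ticker": "TSLA", "date": "2025-08-27"}}'
print(parse_routing_decision(good, "What did TSLA close at?"))
print(parse_routing_decision("I would use Wolfram Alpha here.", "2+2"))
```

With Pydantic, the manual type checks would be replaced by a model per API, and a validation error would trigger the same fallback.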

In [None]:
from cs329a_hw3.evaluation import prepare_dataset

dataset = prepare_dataset(debug_mode=True)
queries, answers = dataset['query'], dataset['answer']
for query in queries:
    print(f"Query: {query}")
    output = api_manager.route_query(query, model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")
    print(f"Called `{output['api_used']}` with parameters: {output['params']}")
    print("Result:", output["results"])
    print("API Used:", output["api_used"])
    print("-" * 50)
    break

#### 1f. Evaluation with a single API call [5 points]

With the API router complete, let's evaluate its performance. This involves routing a query to the correct API, getting the data, and then prompting an LLM with both the original query and the API data to generate a final answer.

**Deliverable:**
- In `cs329a_hw3/multi_lm_agent.py`, implement the `single_LM_with_single_API_call` function.

Key requirements for the function prompt and logic:
- Query the API manager for the necessary data
- Use the query and the data retrieved from the API manager to create a prompt for the model
- Use the model to generate the response
- Return the response from the model



In [None]:
from cs329a_hw3.evaluation import evaluate_qa, prepare_dataset

multi_lm_agent = MultiLMAgent(api_manager)
dataset = prepare_dataset(debug_mode=True)

queries, answers = dataset['query'], dataset['answer']
num_correct = 0

print("Single-Call LM with Single API Call:")
single_LM_with_single_API_call_responses = []
for query, answer in zip(queries, answers):
    response = multi_lm_agent.single_LM_with_single_API_call(query=query, model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free")
    single_LM_with_single_API_call_responses.append(response)
    print(f"Query: {query}")
    print(f"Single-Call LM Response: {response}")
    print(f"Answer: {answer}")
    is_correct = evaluate_qa(query, response, answer, model="google/gemma-3n-E4B-it")
    num_correct += is_correct
    print("Is Correct:", is_correct)
    print("-" * 50)

accuracy_singleLM_with_single_API_call = num_correct / len(queries)
print(f"Accuracy = {accuracy_singleLM_with_single_API_call*100:.1f}%")

## Part 2 - Self-improvement techniques [40 points]
We now improve the accuracy using two techniques: a) **query decomposition and fusion** and b) **iterative self-refinement**.

#### 2a. Query Decomposition [10 points]

Complex queries often require information from multiple API calls (e.g., "Which city was windier, Chicago or Boston?"). To address such queries, we will first create the Query Decomposition component, which breaks down complex queries into simpler, more manageable parts, allowing us to use multiple APIs to answer the query.

Key requirements:
- Use the LLM to decompose the query into multiple independent sub-queries relevant for answering the original query (`_get_query_decomposition_prompt` and `_get_sub_queries` in `multi_lm_agent.py`)
- Route each sub-query to the appropriate API using the API manager, and gather the structured API results and parameters from each sub-query
- Error handling for failed decompositions, failed API calls, and failed query parsing

Deliverable: In the `cs329a_hw3/multi_lm_agent.py` file, implement the `decompose_query` method; as discussed above, this also requires implementing the `_get_query_decomposition_prompt` and `_get_sub_queries` helper functions.

**Note**:
- Using a `ThreadPoolExecutor` in your `decompose_query` implementation can significantly speed up execution by making multiple API calls concurrently.
- Google Search webpage results can be very long, so they must be truncated or parsed to extract the relevant information and avoid exceeding the context window of the LMs.
- The `decompose_query` method takes a query and returns a list of sub-queries. How do these sub-queries help with the overall task? What information do they provide that the original query does not? 
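
As a rough illustration of the `ThreadPoolExecutor` suggestion, the sketch below routes several sub-queries concurrently and records failures instead of raising. The names `route_all` and `fake_route` are hypothetical, and `fake_route` is a stub standing in for `api_manager.route_query`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def route_all(
    sub_queries: List[str],
    route: Callable[[str], Dict[str, Any]],
    max_workers: int = 4,
) -> List[Dict[str, Any]]:
    """Run route() on every sub-query concurrently, capturing errors per sub-query."""

    def safe_route(sub_query: str) -> Dict[str, Any]:
        try:
            return {"sub_query": sub_query, "ok": True, "output": route(sub_query)}
        except Exception as exc:  # one failed call should not sink the others
            return {"sub_query": sub_query, "ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order
        return list(pool.map(safe_route, sub_queries))


def fake_route(sub_query: str) -> Dict[str, Any]:
    """Stub router: fails for Boston to exercise the error path."""
    if "Boston" in sub_query:
        raise RuntimeError("simulated API failure")
    return {"api_used": "get_weather", "results": f"stub result for {sub_query!r}"}


results = route_all(["wind speed in Chicago", "wind speed in Boston"], fake_route)
for r in results:
    print(r["sub_query"], "->", "ok" if r["ok"] else r["error"])
```

Threads suit this workload because the calls are I/O-bound, so most of the time is spent waiting on network responses.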

In [None]:
from cs329a_hw3.multi_lm_agent import MultiLMAgent
from cs329a_hw3.evaluation import prepare_dataset, evaluate_qa

multi_lm_agent = MultiLMAgent(api_manager)

dataset = prepare_dataset(debug_mode=True)
queries, answers = dataset['query'], dataset['answer']

print("Testing Query Decomposition:")
decomposed_queries = []  # List[List[Dict]], where each inner list contains the decomposed queries for a single query
for query in queries:
    print(f"Original Query: {query}")
    decomposed_query = multi_lm_agent.decompose_query(query=query, max_sub_queries=3)

    print("Sub-queries:")
    for sub_query in decomposed_query:
        for k, v in sub_query.items():
            print(f"\t{k}: {v}")
        print()
    print("-" * 50)

    decomposed_queries.append(decomposed_query)

#### 2b. Synthesizing information from sub-query API responses [5 points]

After executing the sub-queries, we need to assemble a new, context-rich prompt that incorporates each sub-query's API response and the user's query. This will be given to a model to synthesize a final answer.

Deliverable: In the `cs329a_hw3/multi_lm_agent.py` file, implement `_get_synthesis_prompt`.

**Note**:
- Do the prompts constructed provide all the necessary information to answer the query?
- A good synthesis prompt should clearly present the original query and neatly format the results from each successful API call.
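
One possible layout for such a prompt is sketched below. The function name `format_synthesis_prompt` and the `sub_query` key are assumptions for illustration; `api_used` and `results` follow the keys printed in the routing cell earlier. Failed sub-queries are left out so the model is not distracted by error text.

```python
from typing import Any, Dict, List


def format_synthesis_prompt(query: str, sub_results: List[Dict[str, Any]]) -> str:
    """Build a prompt containing the query and each successful sub-query result."""
    lines = [f"Question: {query}", "", "Information gathered from tools:"]
    usable = [r for r in sub_results if r.get("results") and "error" not in r]
    if not usable:
        lines.append("(no tool results were available)")
    for i, r in enumerate(usable, 1):
        lines.append(f"{i}. Sub-query: {r.get('sub_query', '')}")
        lines.append(f"   Source: {r.get('api_used', 'unknown')}")
        lines.append(f"   Result: {r['results']}")
    lines += ["", "Answer the question using only the information above."]
    return "\n".join(lines)


# Stub results for illustration only
sub_results = [
    {"sub_query": "wind speed in Chicago", "api_used": "get_weather", "results": {"wind_speed": "stub"}},
    {"sub_query": "wind speed in Boston", "api_used": "get_weather", "error": "timeout"},
]
prompt = format_synthesis_prompt("Which city was windier, Chicago or Boston?", sub_results)
print(prompt)
```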

In [None]:
dataset = prepare_dataset(debug_mode=True)

queries, answers = dataset['query'], dataset['answer']
generated_prompts = []
for query, decomposed_query in zip(queries, decomposed_queries):
    generated_prompt = multi_lm_agent._get_synthesis_prompt(query, decomposed_query)
    print("Generated Prompt:")
    print(generated_prompt)
    print("-" * 50)
    generated_prompts.append(generated_prompt)

#### 2c. Fusion of responses [5 points]

With the constructed prompt, we can generate multiple responses with different models to get a diverse set of responses, and then use a fusion model to combine the best elements from each into a single coherent output. This generally provides a more robust, comprehensive response than any single model could provide.

Key requirements for prompt construction:
- Call `decompose_query` and `_get_synthesis_prompt` to get the decomposed queries and the generated synthesis prompt (which contains the original query and the results from each sub-query)
- Query multiple models with the generated synthesis prompt (specifically `"google/gemma-3n-E4B-it"`, `"meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"`, and `"OpenAI/gpt-oss-20B"`)
- Combine elements from these multiple responses by creating a new fusion prompt, while maintaining consistency and clarity in the final response
- Handle edge cases and fallbacks
- Return the final response from the fusion model

Deliverable: In the `cs329a_hw3/multi_lm_agent.py` file, implement the `decompose_and_fuse` method. This function orchestrates the full query decomposition and fusion workflow. It should make internal calls to `decompose_query`, `_get_synthesis_prompt`, and `generate`. For the fusion step, you should sample from the models listed above in parallel, then feed their responses to the final fusion model.

**Note**: The `fuse` method should take the generated prompt and multiple generated responses before returning a single fused response. Compare how the fused response is different from the individual responses.
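
The sample-then-fuse pattern, including its fallbacks, might look roughly like the sketch below. Everything here is a stand-in: `sample_and_fuse` is a hypothetical name, `fake_generate` is a stub for the agent's `generate` method, and the model names are placeholders rather than real model identifiers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def sample_and_fuse(
    prompt: str,
    models: List[str],
    fusion_model: str,
    generate: Callable[[str, str], str],
) -> str:
    """Query several models in parallel, then ask a fusion model to merge the answers."""

    def try_generate(model: str) -> Optional[str]:
        try:
            return generate(prompt, model)
        except Exception:
            return None  # a failed model is skipped

    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        candidates = [c for c in pool.map(try_generate, models) if c]

    if not candidates:
        return "No model produced a response."
    if len(candidates) == 1:
        return candidates[0]  # nothing to fuse

    numbered = "\n\n".join(f"Candidate {i}:\n{c}" for i, c in enumerate(candidates, 1))
    fusion_prompt = (
        f"{prompt}\n\nCandidate answers:\n{numbered}\n\n"
        "Write one final answer that keeps the correct, consistent parts of the candidates."
    )
    return generate(fusion_prompt, fusion_model)


def fake_generate(prompt: str, model: str) -> str:
    """Stub LLM call: one model fails, and the fusion model reports what it saw."""
    if model == "model-b":
        raise TimeoutError("simulated timeout")
    if model == "fuser":
        return f"fused {prompt.count('says 42')} candidates"
    return f"{model} says 42"


print(sample_and_fuse("What is 6*7?", ["model-a", "model-b", "model-c"], "fuser", fake_generate))
# fused 2 candidates
```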

In [None]:
from cs329a_hw3.evaluation import evaluate_qa, prepare_dataset

dataset = prepare_dataset(debug_mode=True)
queries, answers = dataset['query'], dataset['answer']

print("Generated Responses with Query Decomposition + Fusion:")
query_decomp_responses = []
num_correct = 0
for query, answer in zip(queries, answers):
    response = multi_lm_agent.decompose_and_fuse(query)
    query_decomp_responses.append(response)
    print(f"Query: {query}")
    print(f"Query Decomposition + Fusion Response: {response}")
    print(f"Answer: {answer}")
    is_correct = evaluate_qa(query, response, answer, model="google/gemma-3n-E4B-it")
    num_correct += is_correct
    print("Is Correct:", is_correct)
    print("-" * 50)

accuracy_query_decomp = num_correct / len(queries)
print(f"Accuracy = {accuracy_query_decomp*100:.1f}%")

#### 2d. Iterative Self-Refinement [20 points]

Sometimes a single round of decomposition is not enough to answer the query. To address this, Iterative Refinement improves the response by querying for more information as needed across multiple APIs.

At each step, the agent can either issue a new sub-query to gather more information (e.g. if a prior sub-query failed or if additional information is needed); or generate the final answer if it has enough information. This can be particularly useful for complex queries that require multiple API calls to answer, such as multi-hop question-answering where the result of one step informs the next.

Deliverable: In `cs329a_hw3/multi_lm_agent.py`, implement the `iterative_refine` method.

Key requirements for function logic and prompt construction:
- At each step, you should construct a prompt with the user query, cleanly formatted API results of all previous sub-queries, and instructions for the LLM to either issue a new sub-query or generate the final answer (`_get_iterative_refinement_prompt` of `cs329a_hw3/multi_lm_agent.py`)
- Based on the LLM's response, you should either get an API response for a new sub-query (using the `route_query` method) or return its response as the final answer.
- Exit when the model outputs a final answer, or when the maximum number of iterations is reached.

Note that, unlike query decomposition + fusion, the iterative refinement module allows sub-queries to be re-issued if an earlier sub-query failed, and allows intermediate information to be incorporated into subsequent sub-queries.
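
The control flow described above can be sketched with stubs in place of the LLM and the router. The JSON action format (`sub_query` versus `final_answer`) and all function names here are assumptions for illustration; a real implementation would call the model and `route_query` instead of `fake_llm` and `fake_route`.

```python
import json
from typing import Any, Callable, Dict, List


def build_prompt(query: str, history: List[Dict[str, Any]]) -> str:
    """Format the query plus every earlier sub-query and its result."""
    lines = [f"Question: {query}", "Results so far:"]
    lines += [f"- {h['sub_query']} = {h['result']}" for h in history] or ["- (none yet)"]
    lines.append('Reply with JSON: {"sub_query": "..."} or {"final_answer": "..."}')
    return "\n".join(lines)


def refine_loop(
    query: str,
    llm: Callable[[str], str],
    route: Callable[[str], str],
    max_iterations: int = 4,
) -> str:
    history: List[Dict[str, Any]] = []
    for _ in range(max_iterations):
        try:
            action = json.loads(llm(build_prompt(query, history)))
        except json.JSONDecodeError:
            action = {}
        if not isinstance(action, dict):
            action = {}
        if "final_answer" in action:
            return str(action["final_answer"])
        # Unparseable output falls back to routing the original query
        sub_query = action.get("sub_query") or query
        try:
            result = route(sub_query)
        except Exception as exc:
            result = f"ERROR: {exc}"  # kept in history so the model can retry
        history.append({"sub_query": sub_query, "result": result})
    # Budget exhausted: return whatever the model says when pressed to answer
    return llm(build_prompt(query, history) + "\nYou must answer now in plain text.")


def fake_llm(prompt: str) -> str:
    """Stub model: plans the next step from the results visible in the prompt."""
    if "= 35" in prompt:
        return json.dumps({"final_answer": "35"})
    if "= 7" in prompt:
        return json.dumps({"sub_query": "7 * 5"})
    return json.dumps({"sub_query": "3 + 4"})


def fake_route(sub_query: str) -> str:
    """Stub tool: evaluates 'a + b' or 'a * b' for integers."""
    a, op, b = sub_query.split()
    return str(int(a) + int(b) if op == "+" else int(a) * int(b))


print(refine_loop("What is (3 + 4) * 5?", fake_llm, fake_route))
# 35
```

The second sub-query depends on the result of the first, which is the multi-hop behavior that a single round of decomposition cannot express.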

For resources on LM judges and self-verification, see: 
- [LM Judge Survey](https://arxiv.org/abs/2411.15594)
- [Large Language Models are Better Reasoners with Self-Verification](https://arxiv.org/abs/2212.09561)

In [None]:
from cs329a_hw3.evaluation import evaluate_qa, prepare_dataset
from tqdm import tqdm

dataset = prepare_dataset(debug_mode=True)
queries, answers = dataset['query'], dataset['answer']

multi_lm_agent = MultiLMAgent(api_manager)
print("Generated Responses with Iterative Refinement:")
iterative_refine_responses = []
num_correct = 0
for query, answer in tqdm(zip(queries, answers), total=len(queries)):  # total lets tqdm show progress for a zip
    print(f"Query: {query}")
    
    response = multi_lm_agent.iterative_refine(query, max_iterations=4)
    print(f"Iterative Refinement Response: {response}")
    iterative_refine_responses.append(response)
    
    print(f"Answer: {answer}")
    is_correct = evaluate_qa(query, response, answer, model="google/gemma-3n-E4B-it")
    num_correct += is_correct
    print("Is Correct:", is_correct)
    print("-" * 50)
    
accuracy_iterative_refine = num_correct / len(queries)
print(f"Accuracy = {accuracy_iterative_refine*100:.1f}%")

Create a bar graph plotting the accuracies of zero-shot prompting, a single API router and LM call, query decomposition and fusion, and iterative refinement (max 4 rounds).

In [None]:
from matplotlib import pyplot as plt

# Create data for the bar graph
methods = ['Zero-shot', 'Single API router and LM call', 'Query decomposition and fusion', 'Iterative refinement']

# Must be in this order, as fractions between 0.0 and 1.0.
# Each accuracy_* value above is already num_correct / len(queries), so no rescaling is needed.
accuracies = [
    accuracy_zero_shot,
    accuracy_singleLM_with_single_API_call,
    accuracy_query_decomp,
    accuracy_iterative_refine
]

colors = ['#FF9999', '#66B2FF', '#99FF99', '#FFCC99']

plt.figure(figsize=(10, 6))
bars = plt.bar(methods, accuracies, color=colors)

plt.title('Accuracy Comparison Across Different LLM Pipelines')
plt.ylabel('Accuracy')
plt.ylim(0, 1.0)

plt.xticks(rotation=45, ha='right')
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height, f'{height:.2%}', ha='center', va='bottom')
plt.tight_layout()
plt.show()

## Part 3 - Building an LLM Agentic Workflow [15 points]

Using the components we've built (or new ones you've implemented), build an LLM agent that can answer the following questions about the world. These questions will be a mix of the types of questions we've built components for and will require using the APIs in creative ways.

Some questions will require just a single API call, while others will require multiple API calls and multiple rounds of iterative refinement. Create a pipeline that can dynamically adjust to the complexity of the question. Feel free to implement new components or use the ones we've already built!

If you get above 70% accuracy on the entire dataset, you will get full points. For scores below 70%, you will get partial credit based on the percentage of accuracy.

**Important**: Make sure to evaluate over the entire dataset only when you are confident with your implementation! This will help you preserve inference compute credits and speed up the development process. Please use "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free" for the iterative refinement and decomposition models, and "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo" for the fusion model.

Deliverable: In the `cs329a_hw3/multi_lm_agent.py` file, implement the `run_pipeline` method, which takes a query and returns a final response.

**Note**: How does performance on the dataset compare between the single-call LM and the complete multi-LM agent pipeline? How much of the improvement comes from access to tool APIs, and how much from more complex reasoning?

Let's first evaluate over the entire dataset with just a zero-shot model
- **IMPORTANT**: Make sure to evaluate over the entire dataset only when you are confident with your implementation! 
- This will help you preserve inference compute credits and speed up the development process.

In [None]:
# Let's first evaluate the zero-shot performance over the entire dataset
# IMPORTANT: Make sure to evaluate over the entire dataset only when you are confident with your implementation!
# This will help you preserve inference compute credits and speed up the development process.

from cs329a_hw3.evaluation import evaluate_qa, prepare_dataset

debug_mode = False # Loads the entire dataset for evaluation.
dataset = prepare_dataset(debug_mode=debug_mode)
queries, answers = dataset['query'], dataset['answer']

zero_shot_responses, num_correct = [], 0
for query, answer in zip(queries, answers):
    print(f"Query: {query}")
    response = multi_lm_agent.generate(query, model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free")
    zero_shot_responses.append(response)
    print(f"Zero-Shot Response: {response}")
    print(f"Answer: {answer}")
    is_correct = evaluate_qa(query, response, answer, model="google/gemma-3n-E4B-it")
    num_correct += is_correct
    print("Is Correct:", is_correct)
    print("-" * 50)

print("Evaluating full zero-shot performance...")
complete_set_zero_shot_accuracy = num_correct / len(queries)
print(f"Accuracy = {complete_set_zero_shot_accuracy*100:.1f}%")


Now let's evaluate over the entire dataset with our multi-LM agentic pipeline

In [None]:
from cs329a_hw3.evaluation import evaluate_qa, prepare_dataset

dataset = prepare_dataset(debug_mode=False)
queries, answers = dataset['query'], dataset['answer']

multi_lm_agent = MultiLMAgent(api_manager)

multi_lm_responses, num_correct = [], 0
print("\nGenerating responses with Multi-LM Agent...")
for query, answer in zip(queries, answers):
    print(f"Query: {query}")
    response = multi_lm_agent.run_pipeline(query)
    multi_lm_responses.append(response)
    print(f"Multi-LM Response: {response}")
    print(f"Answer: {answer}")
    is_correct = evaluate_qa(query, response, answer, model="google/gemma-3n-E4B-it")
    num_correct += is_correct
    print("Is Correct:", is_correct)
    print("-" * 50)

complete_set_multi_lm_agent_accuracy = num_correct / len(queries)
print(f"Accuracy = {complete_set_multi_lm_agent_accuracy*100:.1f}%")


## Part 4 - Deep Research Agent [20 points]

Agentic LM systems are used everywhere today! From chatbots to coding agents to task automation, they are becoming more and more prevalent in our daily lives. Gemini, OpenAI, and Perplexity (among others) have "deep research" agents that are capable of synthesizing large amounts of online information and completing multi-step research tasks: [Introducing Deep Research](https://openai.com/index/introducing-deep-research/), [Deep Research](https://blog.google/products/gemini/google-gemini-deep-research/).

Using the components we've built and extending them if needed, implement your own deep research agent that can generate comprehensive analyses from online sources. The agent should be able to handle complex queries requiring multi-step research, synthesizing information from multiple sources, and generating a comprehensive final report.

**Key requirements for implementation:**
- Generate a four-to-five-paragraph report
- Proper, easy-to-read structuring of the report
- Usage of multiple sources of information with appropriate link citations
- Track temporal information and maintain chronological accuracy

Deliverable: In the `cs329a_hw3/deep_research_agent.py` file, implement the `research` method in the `DeepResearchAgent` class.

The method should:
- Take a complex query (e.g., "What was the UK's macroeconomic performance in 2024?")
- Break it down into sub-questions
- Research each sub-question using the search engine API
- Synthesize and summarize findings with appropriate formatting as a report
- Return a report and list of sources 
- **IMPORTANT**: Make sure to use cheaper models during the development process to help you preserve inference compute credits and speed up the process!
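
For the citation requirement, one simple bookkeeping approach is to number each distinct URL in order of first appearance and reuse that number wherever the source is cited. The sketch below assumes a hypothetical `findings` structure with `claim` and `url` keys and uses placeholder `example.com` URLs; the returned dictionary mirrors the `report` and `sources` keys read in the test cell.

```python
from typing import Dict, List, Tuple


def number_sources(findings: List[Dict[str, str]]) -> Tuple[Dict[str, int], List[str]]:
    """Assign each distinct URL a citation number in order of first appearance."""
    numbers: Dict[str, int] = {}
    for finding in findings:
        url = finding["url"]
        if url not in numbers:
            numbers[url] = len(numbers) + 1
    source_list = [f"[{n}] {url}" for url, n in numbers.items()]
    return numbers, source_list


# Placeholder findings for illustration only
findings = [
    {"claim": "Claim A", "url": "https://example.com/a"},
    {"claim": "Claim B", "url": "https://example.com/b"},
    {"claim": "Claim C", "url": "https://example.com/a"},
]
numbers, sources = number_sources(findings)
cited = [f"{f['claim']} [{numbers[f['url']]}]" for f in findings]
report = {"report": " ".join(cited), "sources": sources}
print(report["report"])   # Claim A [1] Claim B [2] Claim C [1]
print(report["sources"])  # ['[1] https://example.com/a', '[2] https://example.com/b']
```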

In [None]:
from cs329a_hw3.deep_research_agent import DeepResearchAgent  # module name matches the deliverable file above

test_queries = [
    "What are the key developments and challenges in solid-state battery technology for electric vehicles in 2024, including major company announcements and technical breakthroughs?",
    "How has the implementation of the UK's post-Brexit immigration policy affected its labor market and key industries between 2021-2024? Include specific policy changes and their measured impacts.",
    "What progress has been made in nuclear fusion energy in 2024, focusing on major research milestones, private sector investments, and timeline predictions for commercial viability?"
]

custom_agent = DeepResearchAgent(api_manager)

for query in test_queries:
    print(f"\nQuery: {query}")
    single_LM_response = custom_agent.generate(query, model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo")
    report = custom_agent.research(query)
    print(f"Single-LM Response: {single_LM_response}")
    print("\n" * 3)
    print(f"Report: {report['report']}")
    print(f"Sources: {report['sources']}")
    print("-" * 50)