# Research Paper Report Generation Agent

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/report_generation/research_paper_report_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Report generation is a frequent use case among our enterprise customers. This notebook demonstrates report generation for a practical use case that we commonly encounter.

AI papers are published on arXiv at a pace that many people struggle to keep up with. It would be valuable to generate a report from papers selected by criteria such as date, author, company, or affiliation, or from specific papers that the user names in the outline.

Here’s a proposed workflow:
1. Use the arXiv API to pull daily papers.
2. Generate metadata such as publication date, update date, authors, research lab, etc.
3. Index the data on LlamaCloud.
4. Repeat steps 1-3 on a daily basis.
5. Create an outline for the report (ideally provided by the user).
6. Develop a report-generating agent.
7. Generate the report based on the outline.

**NOTE:** 

1. Please adjust the paper titles in the outline based on the date the notebook was run, as they may differ.
2. This iteration does not use filters during the retrieval or querying stage.

![research_paper_report_generation](research_paper_report_generation.png)

### Installation

We'll be utilizing various packages along with LlamaIndex:

1. LlamaCloud - For creating a managed index in the cloud.
2. LlamaParse - For effective document parsing.
3. arxiv - For accessing the latest research papers.

In [None]:
!pip install -U llama-index llama-index-indices-managed-llama-cloud llama-parse llama-cloud arxiv

### Setup API Keys

Set up the `LLAMA_CLOUD_API_KEY` and `OPENAI_API_KEY` for accessing the LlamaCloud managed index and OpenAI LLMs.

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [2]:
import os

os.environ["OPENAI_API_KEY"] = 'sk-...' # Get your API key from https://platform.openai.com/account/api-keys
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..." # Get your API key from https://cloud.llamaindex.ai/api-key
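
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, assuming the keys are already exported in the shell that launched the notebook, a small check can fail early with a clear message. The helper name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("OPENAI_API_KEY")` and `require_env("LLAMA_CLOUD_API_KEY")` before building any client surfaces a missing key immediately instead of at the first API call.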

### Setup LLM

For this demonstration, we'll use the `OpenAI` LLM, but you are free to experiment with any LLM of your choice for further exploration.

In [3]:
from llama_index.llms.openai import OpenAI

llm = OpenAI()

### Download `arxiv` papers based on topics.

For this demonstration, we will download papers related to specific research topics of interest, focusing on `RAG` and `Agent`.

In [4]:
research_paper_topics = ["RAG", "Agent"]

In [5]:
import arxiv

from pathlib import Path

def download_papers(client, topics, num_results_per_topic):
    """Function to download papers from arxiv for given topics and number of results per topic"""
    for topic in topics:

        # sort by most recent submission date and cap the number of results
        search = arxiv.Search(
            query=topic,
            max_results=num_results_per_topic,
            sort_by=arxiv.SortCriterion.SubmittedDate,
        )

        # get the results
        results = client.results(search)

        # download each pdf into the current working directory
        for r in results:
            r.download_pdf()

def list_pdf_files(directory):
    # List all .pdf files using pathlib
    pdf_files = [file.name for file in Path(directory).glob('*.pdf')]
    return pdf_files

We will download three research papers for each topic.

In [6]:
# create a client
client = arxiv.Client()

download_papers(client, research_paper_topics, 3)
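
Step 2 of the workflow mentions metadata such as publication date, update date, and authors. The objects yielded by `client.results(search)` already carry these fields, so they can be captured at download time. The sketch below is an assumption about how one might do that; `result_to_metadata` is a hypothetical helper, and it only relies on the `title`, `published`, `updated`, and `authors` attributes, so it does not import `arxiv` itself.

```python
from typing import Any, Dict


def result_to_metadata(result: Any) -> Dict[str, Any]:
    """Build a flat metadata dict from an arxiv.Result-like object.

    Only attribute access is used, so any object exposing `title`,
    `published`, `updated` (datetimes) and `authors` (each with `.name`) works.
    """
    return {
        "title": result.title,
        "published": result.published.date().isoformat(),
        "updated": result.updated.date().isoformat(),
        "author_names": [author.name for author in result.authors],
    }
```

Inside the `for r in results:` loop of `download_papers`, one could collect `result_to_metadata(r)` next to each downloaded PDF and attach it to the document later.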

### Parse the documents using `LlamaParse`

We'll use `LlamaParse` to parse the documents with `result_type="markdown"`, because markdown preserves the text and table structure found in PDFs in a form that LLMs interpret well.

In [7]:
from llama_parse import LlamaParse

def parse_files(pdf_files):
    """Function to parse the pdf files using LlamaParse in markdown format"""

    parser = LlamaParse(
        result_type="markdown",  # "markdown" and "text" are available
        num_workers=4,  # if multiple files passed, split in `num_workers` API calls
        verbose=True,
    )

    documents = []

    for index, pdf_file in enumerate(pdf_files):
        print(f"Processing file {index + 1}/{len(pdf_files)}: {pdf_file}")
        document = parser.load_data(pdf_file)
        documents.append(document)

    return documents

Parse the downloaded documents.

In [16]:
directory = './'
pdf_files = list_pdf_files(directory)

documents = parse_files(pdf_files)

Processing file 1/6: 2410.13825v1.AgentOccam__A_Simple_Yet_Strong_Baseline_for_LLM_Based_Web_Agents.pdf
Started parsing the file under job_id d53aa174-0758-4320-888d-29c3b332d639
Processing file 2/6: 2410.13553v1.Integrating_Temporal_Representations_for_Dynamic_Memory_Retrieval_and_Management_in_Large_Language_Models.pdf
Started parsing the file under job_id 81dd0a7a-3ad7-419c-b1d2-11d11b94dceb
Processing file 3/6: 2410.13671v1.HEALTH_PARIKSHA__Assessing_RAG_Models_for_Health_Chatbots_in_Real_World_Multilingual_Settings.pdf
Started parsing the file under job_id 0dee1b41-a6c7-4f97-80c7-fa23d740b546
Processing file 4/6: 2410.13824v1.Harnessing_Webpage_UIs_for_Text_Rich_Visual_Understanding.pdf
Started parsing the file under job_id e89139eb-26b2-4f07-89ec-d1c4ca44461f
Processing file 5/6: 2410.13860v1.VLM_Grounder__A_VLM_Agent_for_Zero_Shot_3D_Visual_Grounding.pdf
Started parsing the file under job_id c4815177-79a5-4476-b38d-ad7c15ba94e7
Processing file 6/6: 2410.13716v1.MIRAGE_Bench__Aut

### Utils

Here, we define some utilities to help us extract metadata from each document, create a LlamaCloud pipeline/index, and upload the documents to the pipeline/index.

1. `Metadata` - Pydantic model to extract metadata of author names, companies and general AI tags.
2. `get_papers_metadata` - Extracts the metadata information from the research paper.
3. `create_llamacloud_pipeline` - Creates a `LlamaCloud` pipeline.
4. `upload_documents` - Uploads the research papers to the `LlamaCloud` index.

In [9]:
from llama_cloud.types import CloudDocumentCreate
from pydantic import BaseModel, Field
from typing import List
from llama_cloud.client import LlamaCloud
from llama_index.core.prompts import PromptTemplate
from llama_index.core.async_utils import run_jobs

class Metadata(BaseModel):
    """Output containing the authors names, authors companies, and general AI tags."""

    author_names: List[str] = Field(..., description="List of author names of the paper. Give empty list if not available")

    author_companies: List[str] = Field(..., description="List of author companies of the paper. Give empty list if not available")

    ai_tags: List[str] = Field(..., description="List of general AI tags related to the paper. Give empty list if not available")

def create_llamacloud_pipeline(pipeline_name, embedding_config, transform_config, data_sink_id=None):
    """Function to create a pipeline in llamacloud"""

    client = LlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])

    pipeline = {
        'name': pipeline_name,
        'transform_config': transform_config,
        'embedding_config': embedding_config,
        'data_sink_id': data_sink_id
    }

    pipeline = client.pipelines.upsert_pipeline(request=pipeline)

    return client, pipeline

async def get_papers_metadata(text):
    """Function to get the metadata from the given paper"""
    prompt_template = PromptTemplate("""Generate authors names, authors companies, and general top 3 AI tags for the given research paper.

    Research Paper:

    {text}""")

    metadata = await llm.astructured_predict(
        Metadata,
        prompt_template,
        text=text,
    )

    return metadata

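async def get_document_upload(document, llm):
    """Build a CloudDocumentCreate for one paper (`document` is the list of Documents LlamaParse returned for it)."""
    # Authors and affiliations normally appear at the start of a paper, so only the
    # first three parsed chunks (or fewer for short papers) are used for metadata extraction.
    # NOTE: `llm` is accepted here, but `get_papers_metadata` uses the global `llm`.
    text_for_metadata_extraction = "".join(doc.text for doc in document[:3])
    full_text = "\n\n".join([doc.text for doc in document])
    metadata = await get_papers_metadata(text_for_metadata_extraction)
    return CloudDocumentCreate(
        text=full_text,
        metadata={
            'author_names': metadata.author_names,
            'author_companies': metadata.author_companies,
            'ai_tags': metadata.ai_tags
        }
    )

async def upload_documents(client, documents):
    """Function to upload the documents to the cloud"""

    # Extract metadata for every paper, with at most 4 jobs running at once
    extract_jobs = []
    for document in documents:
        extract_jobs.append(get_document_upload(document, llm))

    document_upload_objs = await run_jobs(extract_jobs, workers=4)

    # NOTE: relies on the global `pipeline` created in the next section
    _ = client.pipelines.create_batch_pipeline_documents(pipeline.id, request=document_upload_objs)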
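
The call `run_jobs(extract_jobs, workers=4)` runs the metadata extraction jobs concurrently while capping how many are in flight. The sketch below shows the general idea of bounded concurrency with plain `asyncio`. It is an illustration under that assumption, not the library's implementation, and `run_with_limit` is a hypothetical name.

```python
import asyncio
from typing import Any, Awaitable, List


async def run_with_limit(jobs: List[Awaitable[Any]], workers: int = 4) -> List[Any]:
    """Await all jobs with at most `workers` running at once; results keep input order."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Awaitable[Any]) -> Any:
        # Each job waits for a free slot before it starts running
        async with semaphore:
            return await job

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Capping the number of workers keeps the number of simultaneous LLM requests predictable, which helps stay within API rate limits when many papers are processed.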

### Create `LlamaCloud` pipeline.

We will first create a `LlamaCloud` pipeline (an empty index) before uploading documents. Creating it requires an `embedding_config` and a `transform_config`.

`embedding_config` - This config provides details about the embedding model and the corresponding API key used for creating embeddings during the indexing stage. Here we use OpenAI embeddings.

`transform_config` - This config outlines the `chunk_size` and `chunk_overlap` parameters used during the indexing stage.

In [10]:
# Embedding config
embedding_config = {
    'type': 'OPENAI_EMBEDDING',
    'component': {
        'api_key': os.environ["OPENAI_API_KEY"], # editable
        'model_name': 'text-embedding-ada-002' # editable
    }
}

# Transformation auto config
transform_config = {
    'mode': 'auto',
    'config': {
        'chunk_size': 1024,
        'chunk_overlap': 20
    }
}

client, pipeline = create_llamacloud_pipeline('report_generation', embedding_config, transform_config)
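
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over a list of tokens. It is an illustration only: the splitter LlamaCloud actually applies is not shown in this notebook and is likely more sophisticated, and `chunk_tokens` is a hypothetical helper.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of `chunk_size`; each shares `chunk_overlap` tokens with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=1024` and `chunk_overlap=20`, each window in this sketch would start 1004 tokens after the previous one, so neighbouring chunks share a small amount of context.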

### Upload documents to `LlamaCloud` index.

Now that we have set up a pipeline (an empty index), we will upload the downloaded documents using the specified `embedding_config` and `transform_config` configurations.

In [18]:
await upload_documents(client, documents)

### Create Index and QueryEngine

Let's connect to the created LlamaCloud index and build a QueryEngine to use it for report generation.

We will utilize hybrid search combined with re-ranking (cohere-reranker) for this purpose.

In [19]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# connect to existing index
index = LlamaCloudIndex(
    name="report_generation",
    project_name="Default",
    api_key=os.environ['LLAMA_CLOUD_API_KEY'],
)

query_engine = index.as_query_engine(
    dense_similarity_top_k=10,
    sparse_similarity_top_k=10,
    alpha=0.5,
    enable_reranking=True,
    rerank_top_n=5,
    retrieval_mode="chunks",
)
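
The following sketch illustrates what hybrid search with a cut-off can look like. It assumes that `alpha` weights the dense score and `1 - alpha` the sparse score, and it uses a simple truncation where the real pipeline applies a reranking model. How LlamaCloud combines scores internally is not shown in this notebook, and `hybrid_rank` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def hybrid_rank(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    alpha: float = 0.5,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Blend two score maps with weight `alpha` on dense scores and keep the best `top_n`."""
    chunk_ids = set(dense) | set(sparse)
    blended = {
        # A chunk missing from one retriever contributes 0.0 for that side
        cid: alpha * dense.get(cid, 0.0) + (1.0 - alpha) * sparse.get(cid, 0.0)
        for cid in chunk_ids
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Under this assumption, `alpha=1.0` ranks purely by dense (embedding) similarity, `alpha=0.0` purely by sparse (keyword) scores, and `alpha=0.5` gives both equal weight.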

### Utils to create queries for generating report based on outline.

#### Sample Outline of the Report

#### START OF OUTLINE OF THE REPORT

```
# Research Paper Report on RAG - Retrieval Augmented Generation and Agentic World.

## 1. Introduction

## 2. Retrieval Augmented Generation (RAG) and Agents
2.1. Fundamentals of RAG and Agents.
2.2. Current State and Applications

## 3. Latest Papers:
3.1. Paper-1 title (to be filled).
3.2. Paper-2 title (to be filled).
3.3. Paper-3 title (to be filled).

## 4. Conclusion:
```

#### END OF OUTLINE OF THE REPORT

Here is a sample outline for the report we intend to generate. We need to populate the sections `Introduction`, `Retrieval Augmented Generation (RAG) and Agents` (with its sub-sections `Fundamentals of RAG and Agents` and `Current State and Applications`), `Latest Papers`, and the final `Conclusion`. Each part can be filled either by the LLM directly or by querying the LlamaCloud index.

To complete these sections, we'll need to query the index/LLM for relevant information. We will craft queries based on the sub-sections and sections within the context of the report's title. Here are some utilities to assist in this task.

1. `extract_title`: Function to extract the title from the first line of the outline.
2. `generate_query_with_llm`: Function to generate a query for a report using LLM.
3. `classify_query`: Function to classify the query as either 'LLM' or 'INDEX' based on the query content.
4. `parse_outline_and_generate_queries`: Function to parse the outline and generate queries for each section and subsection.

**NOTE**: The utilities should be adjusted based on the specific outline of the report we are considering. This ensures that they align with the sections and sub-sections we need to populate.

In [20]:
import re

def extract_title(outline):
    """Function to extract the title from the first line of the outline"""

    first_line = outline.strip().split('\n')[0]
    return first_line.strip('# ').strip()

def generate_query_with_llm(title, section, subsection):
    """Function to generate a query for a report using LLM"""

    prompt = f"Generate a research query for a report on {title}. "
    prompt += f"The query should be for the subsection '{subsection}' under the main section '{section}'. "
    prompt += "The query should guide the research to gather relevant information for this part of the report. The query should be clear, short and concise. "

    response = llm.complete(prompt)

    return str(response).strip()

def classify_query(query):
    """Function to classify the query as either 'LLM' or 'INDEX' based on the query content"""

    prompt = f"""Classify the following query as either "LLM" if it can be answered directly by a large language model with general knowledge, or "INDEX" if it likely requires querying an external index or database for specific or up-to-date information.

    Query: "{query}"

    Consider the following:
    1. If the query asks for general knowledge, concepts, or explanations, classify as "LLM".
    2. If the query asks for specific facts, recent events, or detailed information that might not be in the LLM's training data, classify as "INDEX".
    3. If unsure, err on the side of "INDEX".

    Classification:"""

    classification = str(llm.complete(prompt)).strip().upper()

    if classification not in ["LLM", "INDEX"]:
        classification = "INDEX"  # Default to INDEX if the response is unclear

    return classification

def parse_outline_and_generate_queries(outline):
    """Function to parse the outline and generate queries for each section and subsection"""
    
    lines = outline.strip().split('\n')
    title = extract_title(outline)
    current_section = ""
    queries = {}

    for line in lines[1:]:  # Skip the title line
        if line.startswith('## '):
            current_section = line.strip('# ').strip()
            queries[current_section] = {}
        elif re.match(r'^\d+\.\d+\.', line):
            subsection = line.strip()
            query = generate_query_with_llm(title, current_section, subsection)
            classification = classify_query(query)
            queries[current_section][subsection] = {"query": query, "classification": classification}

    # Handle sections without subsections
    for section in queries:
        if not queries[section]:
            query = generate_query_with_llm(title, section, "General overview")
            queries[section]["General"] = {"query": query, "classification": "LLM"}

    return queries

### `ReportGenerationAgent`

Here we create an agent to generate the final report based on the outline.

The agent follows these steps to generate the report:

1. Generate queries from the outline to fill the report.
2. Answer each query using the LLM or the LlamaCloud index, depending on its classification.
3. Fill the answers back into the relevant parts of the report.
4. Format the final report.

In [21]:
from typing import Any, List
from llama_index.core.llms.function_calling import FunctionCallingLLM
from llama_index.core.workflow import Workflow, StartEvent, StopEvent, Context, step
from llama_index.core.workflow import Event

class ReportGenerationEvent(Event):
    pass


class ReportGenerationAgent(Workflow):
    """Report generation agent."""

    def __init__(
        self,
        query_engine: Any,
        llm: FunctionCallingLLM | None = None,
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.query_engine = query_engine
        self.llm = llm or OpenAI(model='gpt-4o-mini')

    def format_report(self, section_contents, outline):
        """Format the report based on the section contents."""
        report = ""

        for section, subsections in section_contents.items():
            section_match = re.match(r'^(\d+\.)\s*(.*)$', section)
            if section_match:
                section_num, section_title = section_match.groups()
                
                if "introduction" in section.lower():
                    introduction_num, introduction_title = section_num, section_title
                elif "conclusion" in section.lower():
                    conclusion_num, conclusion_title = section_num, section_title
                else:
                    combined_content = "\n".join(subsections.values())
                    summary_query = f"Provide a short summary for section '{section}':\n\n{combined_content}"
                    section_summary = str(self.llm.complete(summary_query))
                    report += f"# {section_num} {section_title}\n\n{section_summary}\n\n"

                    report = self.get_subsections_content(subsections, report)

        # Add introduction

        introduction_query = f"Create an introduction for the report:\n\n{report}"
        introduction = str(self.llm.complete(introduction_query))
        report = f"# {introduction_num} {introduction_title}\n\n{introduction}\n\n" + report

        # Add conclusion

        conclusion_query = f"Create a conclusion for the report:\n\n{report}"
        conclusion = str(self.llm.complete(conclusion_query))
        report += f"# {conclusion_num} {conclusion_title}\n\n{conclusion}"

        # Add title
        title = extract_title(outline)
        report = f"# {title}\n\n{report}"
        return report

    def get_subsections_content(self, subsections, report):
        """Generate content for each subsection in the outline."""
        # Sort subsections by their keys before adding them to the report
        for subsection in sorted(subsections.keys(), key=lambda x: re.search(r'(\d+\.\d+)', x).group(1) if re.search(r'(\d+\.\d+)', x) else x):
            content = subsections[subsection]
            subsection_match = re.search(r'(\d+\.\d+)\.\s*(.+)', subsection)
            if subsection_match:
                subsection_num, subsection_title = subsection_match.groups()
                report += f"## {subsection_num} {subsection_title}\n\n{content}\n\n"
            else:
                report += f"## {subsection}\n\n{content}\n\n"
        return report

    def generate_section_content(self, queries, reverse=False):
        """Generate content for each section and subsection in the outline."""
        section_contents = {}
        for section, subsections in queries.items():
            section_contents[section] = {}
            subsection_keys = reversed(sorted(subsections.keys())) if reverse else sorted(subsections.keys())
            for subsection in subsection_keys:
                data = subsections[subsection]
                query = data['query']
                classification = data['classification']
                if classification == "LLM":
                    answer = str(self.llm.complete(query + " Give a short answer."))
                else:
                    answer = str(self.query_engine.query(query))
                section_contents[section][subsection] = answer
        return section_contents

    @step(pass_context=True)
    async def queries_generation_event(self, ctx: Context, ev: StartEvent) -> ReportGenerationEvent:
        """Generate queries for the report."""
        ctx.data["outline"] = ev.outline
        queries = parse_outline_and_generate_queries(ctx.data["outline"])

        return ReportGenerationEvent(queries=queries)

    @step(pass_context=True)
    async def generate_report(
        self, ctx: Context, ev: ReportGenerationEvent
    ) -> StopEvent:
        """Generate report."""

        queries = ev.queries

        # Generate contents for sections in reverse order
        section_contents = self.generate_section_content(queries, reverse=True)
        # Format and compile the final report
        report = self.format_report(section_contents, ctx.data["outline"])
       
        return StopEvent(result={"response": report})
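
`format_report` writes the body sections first and only then asks the LLM for an introduction and a conclusion, so both prompts can see the content they are framing. The sketch below isolates that ordering without LlamaIndex. `assemble_report` and the `complete` callable are hypothetical stand-ins for the method and the LLM, and the headings are simplified.

```python
from typing import Callable, Dict


def assemble_report(title: str, body_sections: Dict[str, str], complete: Callable[[str], str]) -> str:
    """Write the body first, then wrap it with a generated introduction and conclusion."""
    body = "".join(f"# {heading}\n\n{text}\n\n" for heading, text in body_sections.items())
    # The introduction prompt sees the finished body
    introduction = complete(f"Create an introduction for the report:\n\n{body}")
    report = f"# Introduction\n\n{introduction}\n\n{body}"
    # The conclusion prompt sees the introduction and the body
    conclusion = complete(f"Create a conclusion for the report:\n\n{report}")
    return f"# {title}\n\n{report}# Conclusion\n\n{conclusion}"
```

Passing a stub for `complete` (for example, a function that returns a fixed string) makes it possible to check the report layout offline before running the full agent.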

### Outline of the report.

Here's the outline of the report.

Please update the paper titles in the 'Latest Papers' section according to the user's interests.

In [22]:
outline = """
# Research Paper Report on RAG - Retrieval Augmented Generation and Agentic World.

## 1. Introduction

## 2. Retrieval Augmented Generation (RAG) and Agents
2.1. Fundamentals of RAG and Agents.
2.2. Current State and Applications

## 3. Latest Papers:
3.1. HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
3.2. MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
3.3. VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

## 4. Conclusion:
"""

### Generate report

Now that everything is set up, we will create an agent to generate the report.

In [23]:
agent = ReportGenerationAgent(
    query_engine=query_engine,
    llm=llm,
    verbose=True,
    timeout=1200.0,
)

In [24]:
report = await agent.run(outline=outline)

Running step queries_generation_event
Step queries_generation_event produced event ReportGenerationEvent
Running step generate_report
Step generate_report produced event StopEvent


In [25]:
print(report['response'])

# Research Paper Report on RAG - Retrieval Augmented Generation and Agentic World.

# 1. Introduction

In this report, we delve into the advancements in Retrieval-Augmented Generation (RAG) and Agents technologies, focusing on their impact on memory recall and management in AI systems. These technologies aim to enhance response generation in dialogue agents by combining retrieval and generation-based methods, improving memory recall accuracy, and context-awareness in interactions. We also explore the latest papers in the field, including zero-shot 3D visual grounding, multilingual benchmarking for RAG systems, and the assessment of RAG models for health chatbots in real-world multilingual settings. These studies shed light on the significant performance variations, challenges, and advancements in RAG and Agents technologies, showcasing their potential in revolutionizing dialogue agents' cognitive abilities and interaction capabilities.

# 2. Retrieval Augmented Generation (RAG) and Age

### Save the final report

In [26]:
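# Write as UTF-8 so non-ASCII characters in the report are saved consistently across platforms
with open("report.md", "w", encoding="utf-8") as f:
    f.write(report['response'])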