# Extract Data from Financial Reports - with Citations and Reasoning

Given complex files such as financial reports, contracts, and invoices, LlamaExtract lets you use an LLM to extract the information relevant to you in a structured format.

In this example, we'll be using [LlamaExtract](https://docs.cloud.llamaindex.ai/llamaextract/getting_started?utm_campaign=extract&utm_medium=recipe) to extract structured data from an SEC filing (specifically, the filing by Nvidia for fiscal year 2025).

On top of simple data extraction, we'll ask our extraction agent to provide citations and reasoning for each extracted field. This allows us to:
- Confirm the accuracy of each extracted field
- Understand why the LLM extracted a given piece of information
- Use that reasoning to adjust the system prompt or field descriptions and improve results where needed


The example below can also be replicated within Llama Cloud, where you can pick from a number of pre-defined schemas instead of building your own.

In [None]:
!pip install llama-cloud-services

## Connect to Llama Cloud

To get started, make sure you provide your [Llama Cloud](https://cloud.llamaindex.ai?utm_campaign=extract&utm_medium=recipe) API key.

In [None]:
import os
from getpass import getpass

if "LLAMA_CLOUD_API_KEY" not in os.environ:
    os.environ["LLAMA_CLOUD_API_KEY"] = getpass("Enter your Llama Cloud API Key: ")

Enter your Llama Cloud API Key: ··········


## Extract Data with Llama Extract Agent

In [None]:
from llama_cloud_services import LlamaExtract

# Optionally provide your project ID; otherwise the 'Default' project is used
llama_extract = LlamaExtract()

No project_id provided, fetching default project.


### Provide Your Custom Schema

When using LlamaExtract via the API, you provide your own schema that describes what you want extracted from files and data provided to your agent. Here, we are essentially building an SEC filings extraction agent.

In [None]:
from pydantic import BaseModel, Field
from enum import Enum


class FilingType(str, Enum):
    ten_k = "10 K"
    ten_q = "10-Q"
    ten_ka = "10-K/A"
    ten_qa = "10-Q/A"


class FinancialReport(BaseModel):
    company_name: str = Field(description="The name of the company")
    description: str = Field(
        description="Short description of the filing and what it contains"
    )
    filing_type: FilingType = Field(description="Type of SEC filing")
    filing_date: str = Field(description="Date when filing was submitted to SEC")
    fiscal_year: int = Field(description="Fiscal year")
    unit: str = Field(
        description="Unit of financial figures (thousands, millions, etc.)"
    )
    revenue: int = Field(description="Total revenue for period")
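
Because the schema is an ordinary Pydantic model, you can also use it locally to check that a payload has the expected shape. The sketch below is a minimal illustration, not part of LlamaExtract: `DemoFilingType`, `DemoReport`, and `is_valid` are hypothetical names, the model is a trimmed-down stand-in for `FinancialReport`, and the values are made up.

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class DemoFilingType(str, Enum):
    # Trimmed, illustrative subset of filing types
    ten_q = "10-Q"
    ten_qa = "10-Q/A"


class DemoReport(BaseModel):
    company_name: str = Field(description="The name of the company")
    filing_type: DemoFilingType = Field(description="Type of SEC filing")
    revenue: int = Field(description="Total revenue for period")


def is_valid(payload: dict) -> bool:
    """Return True if the payload satisfies the DemoReport schema."""
    try:
        DemoReport(**payload)
        return True
    except ValidationError:
        return False


# Made-up values, for illustration only
ok = DemoReport(company_name="Example Corp", filing_type="10-Q", revenue=1200)

# The string "10-Q" is coerced into the enum member
print(ok.filing_type is DemoFilingType.ten_q)

# A filing type outside the enum is rejected
print(is_valid({"company_name": "Example Corp", "filing_type": "8-K", "revenue": 1200}))
```

The enum is what constrains `filing_type` to a closed set of values, so any filing type you expect to see needs a matching member.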

### Set Up Citations and Reasoning

Optionally, we can configure `ExtractConfig` to return citations for each field the agent extracts. These citations point to the specific pages and sections of the file from which a given field was extracted.

By setting `use_reasoning` to `True`, we also ask the agent to perform an additional reasoning step that explains why a given field was extracted.

In [None]:
from llama_cloud.types import ExtractConfig, ExtractMode

config = ExtractConfig(
    use_reasoning=True, cite_sources=True, extraction_mode=ExtractMode.MULTIMODAL
)

In [None]:
agent = llama_extract.create_agent(
    name="filing-parser", data_schema=FinancialReport, config=config
)



### Demo Time - Download a PDF and Extract Data with Citations

In [None]:
import requests

url = "https://raw.githubusercontent.com/run-llama/llama_cloud_services/refs/heads/main/examples/extract/data/sec_filings/nvda_10k.pdf"

response = requests.get(url)

if response.status_code == 200:
    with open("/content/nvda_10k.pdf", "wb") as f:
        f.write(response.content)
    print("PDF downloaded successfully.")
else:
    print(f"Failed to download. Status code: {response.status_code}")

PDF downloaded successfully.


In [None]:
filing_info = agent.extract("/content/nvda_10k.pdf")

Uploading files: 100%|██████████| 1/1 [00:00<00:00,  1.83it/s]
Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00,  4.38it/s]
Extracting files: 100%|██████████| 1/1 [02:03<00:00, 123.40s/it]


In [None]:
filing_info.data

{'company_name': 'NVIDIA Corporation',
 'description': "The filing provides a detailed overview of NVIDIA's business as a full-stack computing infrastructure company, discusses various technologies including digital avatars and autonomous vehicles, outlines numerous risk factors affecting operations such as supply chain issues and geopolitical tensions, and describes employee stock purchase plans and related compliance requirements.",
 'filing_type': '10 K',
 'filing_date': 'February 26, 2025',
 'fiscal_year': 2025,
 'unit': 'millions',
 'revenue': 130497}
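
Note that `revenue` is reported in the filing's own unit and `filing_date` comes back as free text. A small post-processing step can make both easier to work with downstream. The following is a sketch under stated assumptions: `normalize` and `UNIT_MULTIPLIERS` are hypothetical helpers, the unit strings are assumed to be lowercase words like the one shown above, and the date is assumed to follow the "Month day, year" format seen in this output.

```python
from datetime import datetime

# Assumed multipliers for the free-text `unit` field; extend as needed
UNIT_MULTIPLIERS = {
    "thousands": 1_000,
    "millions": 1_000_000,
    "billions": 1_000_000_000,
}


def normalize(data: dict) -> dict:
    """Return a copy with revenue as an absolute amount and a parsed filing date."""
    unit = data["unit"].strip().lower()
    if unit not in UNIT_MULTIPLIERS:
        raise ValueError(f"Unrecognized unit: {data['unit']!r}")
    out = dict(data)
    out["revenue_absolute"] = data["revenue"] * UNIT_MULTIPLIERS[unit]
    # Matches dates like "February 26, 2025"; other formats raise ValueError
    out["filing_date_parsed"] = datetime.strptime(
        data["filing_date"], "%B %d, %Y"
    ).date()
    return out


# Values copied from the extraction output above
sample = {"filing_date": "February 26, 2025", "unit": "millions", "revenue": 130497}
result = normalize(sample)
print(f"{result['revenue_absolute']:,}")  # 130,497,000,000
print(result["filing_date_parsed"])  # 2025-02-26
```

Raising on an unrecognized unit is a deliberate choice here: silently assuming a multiplier would produce figures that are wrong by orders of magnitude.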

### Inspect Citations and Reasoning

In [None]:
filing_info.extraction_metadata

{'field_metadata': {'company_name': {'reasoning': 'VERBATIM EXTRACTION',
   'citation': [{'page': 1, 'matching_text': 'NVIDIA CORPORATION'},
    {'page': 2, 'matching_text': 'NVIDIA Corporation'},
    {'page': 3,
     'matching_text': 'All references to "NVIDIA," "we," "us," "our," or the "Company" mean NVIDIA Corporation and its subsidiaries.'},
    {'page': 35,
     'matching_text': 'Comparison of 5 Year Cumulative Total Return* Among NVIDIA Corporation'},
    {'page': 49,
     'matching_text': 'To the Board of Directors and Shareholders of NVIDIA Corporation'},
    {'page': 90, 'matching_text': 'NVIDIA Corporation'},
    {'page': 119,
     'matching_text': '*"Company"* means NVIDIA Corporation, a Delaware corporation.'},
    {'page': 126,
     'matching_text': 'Annual Report on Form 10-K of NVIDIA Corporation'}]},
  'filing_type': {'reasoning': "VERBATIM EXTRACTION from multiple sources confirming the filing type as '10 K'.",
   'citation': [{'page': 1, 'matching_text': 'FORM 10-K'}
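
The metadata is a plain nested dictionary, so it is straightforward to turn it into a per-field review summary. The sketch below assumes the structure shown in the output above (`field_metadata`, then `reasoning` and a `citation` list with `page` and `matching_text`). The `review` helper is hypothetical, and the sample is a trimmed copy of the `company_name` entry rather than the full output.

```python
# A trimmed copy of the metadata structure printed above
metadata = {
    "field_metadata": {
        "company_name": {
            "reasoning": "VERBATIM EXTRACTION",
            "citation": [
                {"page": 1, "matching_text": "NVIDIA CORPORATION"},
                {"page": 2, "matching_text": "NVIDIA Corporation"},
                {
                    "page": 126,
                    "matching_text": "Annual Report on Form 10-K of NVIDIA Corporation",
                },
            ],
        }
    }
}
data = {"company_name": "NVIDIA Corporation"}


def review(data: dict, metadata: dict) -> dict:
    """Map each field to its cited pages and whether a citation contains the value."""
    summary = {}
    for field, meta in metadata.get("field_metadata", {}).items():
        citations = meta.get("citation", [])
        value = str(data.get(field, "")).casefold()
        summary[field] = {
            "pages": sorted({c["page"] for c in citations}),
            # Case-insensitive substring check; an empty value never matches
            "value_in_citation": bool(value)
            and any(value in c["matching_text"].casefold() for c in citations),
            "reasoning": meta.get("reasoning"),
        }
    return summary


print(review(data, metadata)["company_name"]["pages"])  # [1, 2, 126]
```

The substring check is only a rough signal. It works for verbatim fields such as `company_name`, but numeric fields may be formatted differently in the source text (for example, with thousands separators), so a `False` there is a prompt for manual review rather than proof of an error.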

## What's Next?

In this example, we built an extraction agent that can cite its sources in the document it extracts data from and explain the reasoning behind its response. To further customize and improve the results, you can also try adjusting the `system_prompt` in the `ExtractConfig`.

#### Learn More

- [LlamaExtract Documentation](https://docs.cloud.llamaindex.ai/llamaextract/getting_started)
- [Example Notebooks](https://github.com/run-llama/llama_cloud_services/tree/main/examples/extract)