<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/form_filling/Form_Filling_10K_SEC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Form Filling using LlamaCloud and LlamaParse

Form filling is a use case we frequently encounter when working with our customers.

Customers often index documents on our enterprise platform, LlamaCloud, and then want to fill in the fields of a form using information from those indexed documents.

To demonstrate this use case, we’ve indexed the 2021 and 2022 10-K SEC filings for MSFT, AMZN, and AAPL. Using this indexed data, the next step is to fill in the required details in the Excel file `sec_10k_analysis_form_filling.xlsx`.

**NOTE**: Before proceeding, you need to create an index on [LlamaCloud](https://cloud.llamaindex.ai/) from the 2021 and 2022 10-K SEC filings for Microsoft (MSFT), Amazon (AMZN), and Apple (AAPL).

### Installation

In [None]:
# !pip install llama-index llama-parse llama-index-indices-managed-llama-cloud

In [None]:
from typing import List
from pydantic import BaseModel
import os
import json
import csv

from llama_parse import LlamaParse
from llama_index.core.schema import Document
from llama_index.llms.openai import OpenAI

import nest_asyncio

nest_asyncio.apply()

### Setup API Keys

In [None]:
os.environ['LLAMA_CLOUD_API_KEY'] = 'llx-...' # Get it from https://cloud.llamaindex.ai/api-key

os.environ['OPENAI_API_KEY'] = 'sk-...' # Get it from https://platform.openai.com/api-keys
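
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As an alternative, the sketch below reads a key from the environment and only prompts for it when it is missing. `ensure_api_key` is a hypothetical helper introduced here for illustration; it is not part of LlamaCloud or LlamaParse.

```python
import os
from getpass import getpass


def ensure_api_key(name: str) -> str:
    """Return the key stored in the environment, prompting once if it is missing."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so it does not end up in the cell output.
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_api_key('LLAMA_CLOUD_API_KEY')` and `ensure_api_key('OPENAI_API_KEY')` would then replace the two assignments above.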

### Setup LLM

In [None]:
llm = OpenAI(model='gpt-4o-mini')

### Parse the Form Filling File

In [None]:
def parse_file(file_path: str) -> List[Document]:
    llama_parse = LlamaParse(
        api_key=os.environ['LLAMA_CLOUD_API_KEY'],
        result_type='markdown',
    )

    result = llama_parse.load_data(
        file_path,
    )
    return result

In [None]:
documents = parse_file('sec_10k_analysis_form_filling.xlsx')

Started parsing the file under job_id 112912f7-51d4-4ae7-8f94-7e41bf4a710e


In [None]:
print(documents[0].text)

|Parameter                            |2021         |                |            |2022         |                |            |
|-------------------------------------|-------------|----------------|------------|-------------|----------------|------------|
|                                     |Amazon (AMZN)|Microsoft (MSFT)|Apple (AAPL)|Amazon (AMZN)|Microsoft (MSFT)|Apple (AAPL)|
|1. Revenue                           |             |                |            |             |                |            |
|2. Net Income                        |             |                |            |             |                |            |
|3. Earnings Per Share (EPS)          |             |                |            |             |                |            |
|4. EBITDA                            |             |                |            |             |                |            |
|5. Free Cash Flow                    |             |                |            |             |       

In [None]:
text = documents[0].text
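
Because the parsed form is a plain markdown table, its row labels can also be read without an LLM, which gives a cheap cross-check for the structured extraction step. The sketch below assumes the layout printed above (a header row, a separator row, then one row per numbered parameter). `parse_markdown_table` and `parameter_names` are hypothetical helpers, not LlamaParse functions.

```python
import re
from typing import List


def parse_markdown_table(markdown: str) -> List[List[str]]:
    """Split a pipe-delimited markdown table into rows of stripped cell strings."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Drop the leading pipe and, if present, the trailing pipe.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the separator row, which contains only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def parameter_names(rows: List[List[str]]) -> List[str]:
    """Return first-column labels below the header, without the '1. ' style numbering."""
    names = []
    for row in rows[1:]:
        label = row[0]
        if label:
            names.append(re.sub(r"^\d+\.\s*", "", label))
    return names


sample = """
|Parameter     |2021         |                |
|--------------|-------------|----------------|
|              |Amazon (AMZN)|Microsoft (MSFT)|
|1. Revenue    |             |                |
|2. Net Income |             |                |
"""
sample_rows = parse_markdown_table(sample)
sample_parameters = parameter_names(sample_rows)
```

Running `parameter_names(parse_markdown_table(text))` on the real form should give a list that can be compared with the `FinancialParameters` field extracted below.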

### Structured Extraction

In [None]:
prompt = f"""
You are an AI assistant specializing in financial analysis. You've been given an Excel spreadsheet containing financial data for multiple companies. Your task is to extract and structure this information in a clear, organized format.

The Excel sheet contains the following:
1. Various financial parameters (rows)
2. Years (columns)
3. Multiple companies (sub-columns under each year)

Input Excel data:
{text}

Please present the extracted and structured information in a clear, easy-to-read format.
"""

In [None]:
class CompanyParameters(BaseModel):
    """Data model for an SEC filing analysis form."""

    Companies: List[str]
    FinancialParameters: List[str]
    Years: List[str]

In [None]:
from llama_index.core.llms import ChatMessage

sllm = llm.as_structured_llm(output_cls=CompanyParameters)
input_msg = ChatMessage.from_str(prompt)

In [None]:
output = sllm.chat([input_msg])

In [None]:
output_obj = output.raw

In [None]:
output_obj

CompanyParameters(Companies=['Amazon (AMZN)', 'Microsoft (MSFT)', 'Apple (AAPL)'], FinancialParameters=['Revenue', 'Net Income', 'Earnings Per Share (EPS)', 'EBITDA', 'Free Cash Flow', 'Return on Equity (ROE)', 'Return on Assets (ROA)', 'Debt-to-Equity Ratio', 'Current Ratio', 'Gross Margin', 'Operating Margin', 'Net Profit Margin', 'Inventory Turnover', 'Accounts Receivable Turnover', 'Capital Expenditures', 'Research and Development Expenses', 'Market Cap', 'Price-to-Earnings (P/E) Ratio', 'Dividend Yield', 'Year-over-Year Growth Rate'], Years=['2021', '2022'])

You can now inspect the extracted companies, financial parameters, and years in structured form.
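
Since every extracted value turns into queries against the index, it can be worth sanity-checking the three lists before moving on. The following is a minimal sketch of such a check on plain Python lists. `check_extraction` is a hypothetical helper, and the rules it applies (non-empty, no duplicates, four-digit years) are assumptions about what a valid extraction looks like for this form.

```python
import re
from typing import List


def check_extraction(companies: List[str], parameters: List[str], years: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the lists look sane."""
    problems = []
    fields = (("Companies", companies), ("FinancialParameters", parameters), ("Years", years))
    for name, values in fields:
        if not values:
            problems.append(f"{name} is empty")
        if len(set(values)) != len(values):
            problems.append(f"{name} contains duplicates")
    for year in years:
        if not re.fullmatch(r"\d{4}", year):
            problems.append(f"unexpected year value: {year!r}")
    return problems
```

With the output shown above, `check_extraction(output_obj.Companies, output_obj.FinancialParameters, output_obj.Years)` would be expected to return an empty list. The sizes of the lists (3 companies, 20 parameters, 2 years) also imply 120 queries in the answer-generation step.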

### Connect to LlamaCloud Index

In [None]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="AMZN_MSFT_APPL_2021_2022",
  project_name="Default",
  organization_id="c2086254-1398-4ee0-acb4-a6ae35d8d947",
  api_key=os.environ['LLAMA_CLOUD_API_KEY'],
)

In [None]:
query_engine = index.as_query_engine(
    dense_similarity_top_k=10,
    sparse_similarity_top_k=10,
    alpha=0.5,
    enable_reranking=True,
    rerank_top_n=5,
)

### Generate Answers

Query the index for every combination of financial parameter, company, and year.

In [None]:
from tqdm import tqdm


def generate_answers(companies: List[str], financial_parameters: List[str], years: List[str]) -> dict:
    """Query the index once per (year, company, parameter).

    Returns a nested dict indexed as answers[year][company][parameter].
    """
    companies_financial_parameters_answers = {}

    for year in years:
        companies_financial_parameters_answers[year] = {}
        for company in companies:
            companies_financial_parameters_answers[year][company] = {}
            # One progress bar per (year, company) pair
            for financial_parameter in tqdm(financial_parameters):
                query = f"What is the {financial_parameter} of {company} for the year {year}? Don't be verbose. Provide 1-5 word answers for numerical values. If you are unable to provide an answer, output NA."
                answer = str(query_engine.query(query))
                companies_financial_parameters_answers[year][company][financial_parameter] = answer

    return companies_financial_parameters_answers
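
The loop above issues one network call per combination, so a single failed request would abort the whole run. One possible way to make it more robust is sketched below: a small wrapper that retries a query and falls back to `NA`. `safe_answer` is a hypothetical helper, and the retry count and fallback string are assumptions chosen to match the prompt's `NA` convention.

```python
from typing import Callable


def safe_answer(ask: Callable[[str], object], query: str, retries: int = 2, fallback: str = "NA") -> str:
    """Call ask(query), retrying on any exception; return fallback if every attempt fails."""
    for _ in range(retries + 1):
        try:
            return str(ask(query)).strip()
        except Exception:
            # Swallow the error and try again until the attempts run out.
            continue
    return fallback
```

Inside `generate_answers`, the call `str(query_engine.query(query))` could then be replaced by `safe_answer(query_engine.query, query)`.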

In [None]:
companies = output_obj.Companies
financial_parameters = output_obj.FinancialParameters
years = output_obj.Years

In [None]:
answers = generate_answers(companies, financial_parameters, years)

100%|██████████| 20/20 [00:32<00:00,  1.61s/it]
100%|██████████| 20/20 [00:26<00:00,  1.34s/it]
100%|██████████| 20/20 [00:27<00:00,  1.40s/it]
100%|██████████| 20/20 [00:36<00:00,  1.85s/it]
100%|██████████| 20/20 [00:27<00:00,  1.36s/it]
100%|██████████| 20/20 [00:26<00:00,  1.32s/it]


In [None]:
answers

{'2021': {'Amazon (AMZN)': {'Revenue': '469,822 million',
   'Net Income': '33,364',
   'Earnings Per Share (EPS)': '64.81',
   'EBITDA': '24,879',
   'Free Cash Flow': '-$11,569',
   'Return on Equity (ROE)': '22.9%',
   'Return on Assets (ROA)': '5.9%',
   'Debt-to-Equity Ratio': '0.63',
   'Current Ratio': '1.02',
   'Gross Margin': '41.0%',
   'Operating Margin': '13.7%',
   'Net Profit Margin': '6.8%',
   'Inventory Turnover': '4.8 times',
   'Accounts Receivable Turnover': '5.6 times',
   'Capital Expenditures': '58.3 billion',
   'Research and Development Expenses': 'Not significant',
   'Market Cap': '$1.66 trillion',
   'Price-to-Earnings (P/E) Ratio': '73.6',
   'Dividend Yield': 'NA',
   'Year-over-Year Growth Rate': '22%'},
  'Microsoft (MSFT)': {'Revenue': '$168,088 million',
   'Net Income': '$61,271 million',
   'Earnings Per Share (EPS)': '8.05',
   'EBITDA': '$ 76,632 million',
   'Free Cash Flow': '28.7 billion',
   'Return on Equity (ROE)': '27.4%',
   'Return on Ass
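
The answers come back in inconsistent formats (`'469,822 million'`, `'-$11,569'`, `'22.9%'`, `'NA'`), which makes them hard to compare directly. The sketch below shows one way to turn such strings into numbers. `parse_amount` is a hypothetical helper. It only scales values when a unit word is present, ignores percent signs, and cannot infer a unit the model left out (for example `'33,364'` stays 33364.0).

```python
import re
from typing import Optional

# Multipliers for unit words that may follow the number.
_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}

_AMOUNT = re.compile(
    r"(-?)\s*\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|trillion)?"
)


def parse_amount(answer: str) -> Optional[float]:
    """Convert an answer such as '-$11,569' to a float; return None if no number is found."""
    match = _AMOUNT.search(answer.strip().lower())
    if not match:
        return None
    sign, digits, unit = match.groups()
    value = float(digits.replace(",", ""))
    if sign == "-":
        value = -value
    return value * _SCALE.get(unit, 1.0)
```

Non-numeric answers such as `'NA'` or `'Not significant'` return `None`, so they can be filtered out or reported separately.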

### Write data to `csv`

In [None]:
def flatten_dict(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

# Build one flat row per (year, company) pair
flat_data = []
for year, company_answers in answers.items():
    for company, metrics in company_answers.items():
        flat_metrics = flatten_dict(metrics)
        flat_metrics['Year'] = year
        flat_metrics['Company'] = company
        flat_data.append(flat_metrics)

# Get all unique keys to use as CSV headers
headers = set()
for item in flat_data:
    headers.update(item.keys())

# Sort headers to ensure 'Year' and 'Company' come first
headers = sorted(headers)
headers.insert(0, headers.pop(headers.index('Year')))
headers.insert(1, headers.pop(headers.index('Company')))

# Write to CSV
with open('sec_10k_analysis_form_filling.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=headers)
    writer.writeheader()
    for row in flat_data:
        writer.writerow(row)
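
The CSV written above has one row per year and company, whereas the original form has one row per parameter and one column per year and company. The sketch below is one way to write the same `answers` dictionary in a layout closer to the form. `write_form_layout` is a hypothetical helper. It repeats the year in every column instead of merging header cells, and it does not write back into the original `.xlsx` file.

```python
import csv
from typing import Dict, List


def write_form_layout(answers: Dict[str, Dict[str, Dict[str, str]]], parameters: List[str], path: str) -> None:
    """Write one row per parameter and one column per (year, company) pair."""
    columns = [
        (year, company)
        for year, by_company in answers.items()
        for company in by_company
    ]
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        # Two header rows, mirroring the form: years first, then companies.
        writer.writerow(["Parameter"] + [year for year, _ in columns])
        writer.writerow([""] + [company for _, company in columns])
        for number, parameter in enumerate(parameters, start=1):
            row = [f"{number}. {parameter}"]
            for year, company in columns:
                # Missing answers are written as NA, matching the query prompt.
                row.append(answers[year][company].get(parameter, "NA"))
            writer.writerow(row)
```

For example, `write_form_layout(answers, financial_parameters, 'sec_10k_analysis_form_layout.csv')` would produce a file that can be pasted alongside the original form (the output filename here is only illustrative).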