<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/form_filling/Form_Filling_10K_SEC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Form Filling using LlamaCloud and LlamaParse

Form filling is a use case we encounter frequently when working with our customers.

Our customers often index documents on our enterprise platform, LlamaCloud, and want to fill in details within a document based on the information from the indexed documents.

To demonstrate this use case, we’ve indexed the 2021 and 2022 10-K SEC filings for MSFT, AMZN, and AAPL. Using this indexed data, the next step is to fill in the required details in the Excel file `sec_10k_analysis_form_filling.xlsx`.

**NOTE**: Before proceeding, you need to create an index on [LlamaCloud](https://cloud.llamaindex.ai/) from the 2021 and 2022 10-K SEC filings for Microsoft (MSFT), Amazon (AMZN), and Apple (AAPL).

### Installation

In [None]:
# !pip install llama-index llama-parse llama-index-indices-managed-llama-cloud

In [1]:
from typing import List
from pydantic import BaseModel
import os
import json
import csv

from llama_parse import LlamaParse
from llama_index.core.schema import Document
from llama_index.llms.openai import OpenAI

import nest_asyncio

nest_asyncio.apply()

### Setup API Keys

In [None]:
os.environ['LLAMA_CLOUD_API_KEY'] = 'llx-...' # Get it from https://cloud.llamaindex.ai/api-key

os.environ['OPENAI_API_KEY'] = 'sk-...' # Get it from https://platform.openai.com/api-keys
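
Hardcoded keys are easy to leak when a notebook is shared. As an alternative, here is a minimal sketch that assumes the keys are already exported in your shell and fails early with a clear message if one is missing. `require_env` is a hypothetical helper, not part of LlamaIndex or LlamaParse.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


# Example usage (uncomment once the keys are exported in your shell):
# llama_cloud_key = require_env("LLAMA_CLOUD_API_KEY")
# openai_key = require_env("OPENAI_API_KEY")
```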

### Setup LLM

In [2]:
llm = OpenAI(model='gpt-4o-mini')

### Parse the Form Filling File

In [3]:
def parse_file(file_path: str) -> List[Document]:
    llama_parse = LlamaParse(
        api_key=os.environ['LLAMA_CLOUD_API_KEY'],
        result_type='markdown',
    )

    result = llama_parse.load_data(
        file_path,
    )
    return result

In [5]:
documents = parse_file('data/sec_10k_analysis_form_filling.xlsx')

Started parsing the file under job_id ba44b8f0-5fed-461a-a117-88f0a786dc11


In [6]:
print(documents[0].text)

|Parameter                            |2021         |                |            |2022         |                |            |
|-------------------------------------|-------------|----------------|------------|-------------|----------------|------------|
|                                     |Amazon (AMZN)|Microsoft (MSFT)|Apple (AAPL)|Amazon (AMZN)|Microsoft (MSFT)|Apple (AAPL)|
|1. Revenue                           |             |                |            |             |                |            |
|2. Net Income                        |             |                |            |             |                |            |
|3. Earnings Per Share (EPS)          |             |                |            |             |                |            |
|4. EBITDA                            |             |                |            |             |                |            |
|5. Free Cash Flow                    |             |                |            |             |       

In [7]:
text = documents[0].text

### Structured Extraction

In [8]:
prompt = f"""
You are an AI assistant specializing in financial analysis. You've been given an Excel spreadsheet containing financial data for multiple companies. Your task is to extract and structure this information in a clear, organized format.

The Excel sheet contains the following:
1. Various financial parameters (rows)
2. Years (columns)
3. Multiple companies (sub-columns under each year)

Input Excel data:
{text}

Please present the extracted and structured information in a clear, easy-to-read format.
"""

In [9]:
class CompanyParameters(BaseModel):
    """Data model for an SEC filing analysis."""

    Companies: List[str]
    FinancialParameters: List[str]
    Years: List[str]

In [10]:
from llama_index.core.llms import ChatMessage

sllm = llm.as_structured_llm(output_cls=CompanyParameters)
input_msg = ChatMessage.from_str(prompt)

In [11]:
output = sllm.chat([input_msg])

In [12]:
output_obj = output.raw

In [13]:
output_obj

CompanyParameters(Companies=['Amazon (AMZN)', 'Microsoft (MSFT)', 'Apple (AAPL)'], FinancialParameters=['Revenue', 'Net Income', 'Earnings Per Share (EPS)', 'EBITDA', 'Free Cash Flow', 'Return on Equity (ROE)', 'Return on Assets (ROA)', 'Debt-to-Equity Ratio', 'Current Ratio', 'Gross Margin', 'Operating Margin', 'Net Profit Margin', 'Inventory Turnover', 'Accounts Receivable Turnover', 'Capital Expenditures', 'Research and Development Expenses', 'Market Cap', 'Price-to-Earnings (P/E) Ratio', 'Dividend Yield', 'Year-over-Year Growth Rate'], Years=['2021', '2022'])

The structured output lists the companies, financial parameters, and years found in the form.
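
Each combination of year, company, and financial parameter corresponds to one empty cell in the form, and therefore to one query against the index. The sketch below enumerates those combinations with `itertools.product`. The lists are copied from the output above, with the parameters shortened to three for brevity.

```python
from itertools import product

# Values copied from the structured output above (parameters truncated for brevity).
companies = ["Amazon (AMZN)", "Microsoft (MSFT)", "Apple (AAPL)"]
financial_parameters = ["Revenue", "Net Income", "Earnings Per Share (EPS)"]
years = ["2021", "2022"]

# One form cell per (year, company, parameter) combination.
cells = list(product(years, companies, financial_parameters))

print(len(cells))  # 2 years * 3 companies * 3 parameters = 18
print(cells[0])    # ('2021', 'Amazon (AMZN)', 'Revenue')
```

With all 20 parameters, the same product gives 2 × 3 × 20 = 120 cells.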

### Connect to the LlamaCloud Index

In [None]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="AMZN_MSFT_APPL_2021_2022",
  project_name="Default",
  organization_id="c2086254-1398-4ee0-acb4-a6ae35d8d947",
  api_key=os.environ['LLAMA_CLOUD_API_KEY'],
)

In [None]:
query_engine = index.as_query_engine(
    dense_similarity_top_k=10,
    sparse_similarity_top_k=10,
    alpha=0.5,
    enable_reranking=True,
    rerank_top_n=5,
)

### Generate Answers

Query the index for every combination of financial parameter, company, and year.

In [None]:
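from typing import Dict

from tqdm import tqdm


def generate_answers(companies: List[str], financial_parameters: List[str], years: List[str]) -> Dict[str, Dict[str, Dict[str, str]]]:
    # Returns a nested dict indexed as answers[year][company][financial_parameter].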
    companies_financial_parameters_answers = {}

    for year in years:
        companies_financial_parameters_answers[year] = {}
        for company in companies:
            companies_financial_parameters_answers[year][company] = {}
            for financial_parameter in tqdm(financial_parameters):
                query = f"What is the {financial_parameter} of {company} for the year {year}? Don't be verbose. Provide 1-5 word answers for numerical values. If you are unable to provide an answer, output NA."
                answer = str(query_engine.query(query))
                companies_financial_parameters_answers[year][company][financial_parameter] = answer

    return companies_financial_parameters_answers
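
The loop above issues one query per cell, so a single failed request would discard all answers collected so far. One possible safeguard is sketched below. `query_with_retry` is a hypothetical helper, not part of LlamaIndex, and it takes any callable so it does not depend on a particular query engine.

```python
import time
from typing import Callable


def query_with_retry(
    run_query: Callable[[str], str],
    query: str,
    retries: int = 2,
    delay_seconds: float = 1.0,
) -> str:
    """Call run_query(query); after repeated failures, return "NA" instead of raising."""
    for attempt in range(retries + 1):
        try:
            return run_query(query)
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < retries:
                time.sleep(delay_seconds)
    return "NA"
```

Inside `generate_answers`, the call could then read `answer = query_with_retry(lambda q: str(query_engine.query(q)), query)`.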

In [None]:
companies = output_obj.Companies
financial_parameters = output_obj.FinancialParameters
years = output_obj.Years

In [None]:
answers = generate_answers(companies, financial_parameters, years)

100%|██████████| 20/20 [00:32<00:00,  1.61s/it]
100%|██████████| 20/20 [00:26<00:00,  1.34s/it]
100%|██████████| 20/20 [00:27<00:00,  1.40s/it]
100%|██████████| 20/20 [00:36<00:00,  1.85s/it]
100%|██████████| 20/20 [00:27<00:00,  1.36s/it]
100%|██████████| 20/20 [00:26<00:00,  1.32s/it]


In [None]:
answers

{'2021': {'Amazon (AMZN)': {'Revenue': '469,822 million',
   'Net Income': '33,364',
   'Earnings Per Share (EPS)': '64.81',
   'EBITDA': '24,879',
   'Free Cash Flow': '-$11,569',
   'Return on Equity (ROE)': '22.9%',
   'Return on Assets (ROA)': '5.9%',
   'Debt-to-Equity Ratio': '0.63',
   'Current Ratio': '1.02',
   'Gross Margin': '41.0%',
   'Operating Margin': '13.7%',
   'Net Profit Margin': '6.8%',
   'Inventory Turnover': '4.8 times',
   'Accounts Receivable Turnover': '5.6 times',
   'Capital Expenditures': '58.3 billion',
   'Research and Development Expenses': 'Not significant',
   'Market Cap': '$1.66 trillion',
   'Price-to-Earnings (P/E) Ratio': '73.6',
   'Dividend Yield': 'NA',
   'Year-over-Year Growth Rate': '22%'},
  'Microsoft (MSFT)': {'Revenue': '$168,088 million',
   'Net Income': '$61,271 million',
   'Earnings Per Share (EPS)': '8.05',
   'EBITDA': '$ 76,632 million',
   'Free Cash Flow': '28.7 billion',
   'Return on Equity (ROE)': '27.4%',
   'Return on Ass

In [15]:
answers = json.loads("""\
{'2021': {'Amazon (AMZN)': {'Revenue': '469,822 million',
   'Net Income': '33,364',
   'Earnings Per Share (EPS)': '64.81',
   'EBITDA': '24,879',
   'Free Cash Flow': '-$11,569',
   'Return on Equity (ROE)': '22.9%',
   'Return on Assets (ROA)': '5.9%',
   'Debt-to-Equity Ratio': '0.63',
   'Current Ratio': '1.02',
   'Gross Margin': '41.0%',
   'Operating Margin': '13.7%',
   'Net Profit Margin': '6.8%',
   'Inventory Turnover': '4.8 times',
   'Accounts Receivable Turnover': '5.6 times',
   'Capital Expenditures': '58.3 billion',
   'Research and Development Expenses': 'Not significant',
   'Market Cap': '$1.66 trillion',
   'Price-to-Earnings (P/E) Ratio': '73.6',
   'Dividend Yield': 'NA',
   'Year-over-Year Growth Rate': '22%'},
  'Microsoft (MSFT)': {'Revenue': '$168,088 million',
   'Net Income': '$61,271 million',
   'Earnings Per Share (EPS)': '8.05',
   'EBITDA': '$ 76,632 million',
   'Free Cash Flow': '28.7 billion',
   'Return on Equity (ROE)': '27.4%',
   'Return on Assets (ROA)': '10.9%',
   'Debt-to-Equity Ratio': '0.30',
   'Current Ratio': '1.79',
   'Gross Margin': '$115.9 billion',
   'Operating Margin': '32%',
   'Net Profit Margin': '19.7%',
   'Inventory Turnover': '2.5 times',
   'Accounts Receivable Turnover': '7.5 times',
   'Capital Expenditures': '$9.5 billion',
   'Research and Development Expenses': '20,716 million',
   'Market Cap': '$2.5 trillion',
   'Price-to-Earnings (P/E) Ratio': '34.5',
   'Dividend Yield': '1. 0.8%',
   'Year-over-Year Growth Rate': '18%'},
  'Apple (AAPL)': {'Revenue': '$365.817 billion',
   'Net Income': '94,680 million',
   'Earnings Per Share (EPS)': '5.61',
   'EBITDA': '19,863',
   'Free Cash Flow': '$ 73,000',
   'Return on Equity (ROE)': '21.9%',
   'Return on Assets (ROA)': '5.1%',
   'Debt-to-Equity Ratio': '0.93',
   'Current Ratio': '1.12',
   'Gross Margin': '$152,836 million',
   'Operating Margin': '44.7%',
   'Net Profit Margin': '21.7%',
   'Inventory Turnover': 'Not available',
   'Accounts Receivable Turnover': '6.2 times',
   'Capital Expenditures': '$9,000 million',
   'Research and Development Expenses': '$21,914 million',
   'Market Cap': '$2.46 trillion',
   'Price-to-Earnings (P/E) Ratio': '28.11',
   'Dividend Yield': '1. 0.0065',
   'Year-over-Year Growth Rate': '33%'}},
 '2022': {'Amazon (AMZN)': {'Revenue': '513,983',
   'Net Income': '(2,722)',
   'Earnings Per Share (EPS)': '9.70',
   'EBITDA': '$15,432 million',
   'Free Cash Flow': '$-11,569 million',
   'Return on Equity (ROE)': '8.4%',
   'Return on Assets (ROA)': '6.9%',
   'Debt-to-Equity Ratio': '1.39',
   'Current Ratio': '1.1',
   'Gross Margin': 'NA',
   'Operating Margin': '16.9%',
   'Net Profit Margin': 'NA',
   'Inventory Turnover': 'NA',
   'Accounts Receivable Turnover': '6.4 times',
   'Capital Expenditures': '$58.3 billion',
   'Research and Development Expenses': '$73,213 million',
   'Market Cap': '$1.47 trillion',
   'Price-to-Earnings (P/E) Ratio': 'NA',
   'Dividend Yield': 'NA',
   'Year-over-Year Growth Rate': '13%'},
  'Microsoft (MSFT)': {'Revenue': '$198,270 million',
   'Net Income': '72,738',
   'Earnings Per Share (EPS)': '9.65',
   'EBITDA': '$107.895 billion',
   'Free Cash Flow': '$58.7 billion',
   'Return on Equity (ROE)': '38.7%',
   'Return on Assets (ROA)': '9.99%',
   'Debt-to-Equity Ratio': '0.59',
   'Current Ratio': '1.78',
   'Gross Margin': '$135.620 billion',
   'Operating Margin': '19%',
   'Net Profit Margin': '19%',
   'Inventory Turnover': '3.4 times',
   'Accounts Receivable Turnover': '6.1 times',
   'Capital Expenditures': '$8.5 billion',
   'Research and Development Expenses': '$24,512 million',
   'Market Cap': '$1.87 trillion',
   'Price-to-Earnings (P/E) Ratio': '38.6',
   'Dividend Yield': '2.0%',
   'Year-over-Year Growth Rate': '18%'},
  'Apple (AAPL)': {'Revenue': '394,328 million',
   'Net Income': '99,803',
   'Earnings Per Share (EPS)': '6.15',
   'EBITDA': '$145,787 million',
   'Free Cash Flow': '$ 88,531',
   'Return on Equity (ROE)': '13.15%',
   'Return on Assets (ROA)': '6.7%',
   'Debt-to-Equity Ratio': '0.68',
   'Current Ratio': '2.78',
   'Gross Margin': '$170,782 million',
   'Operating Margin': '37.9%',
   'Net Profit Margin': '21.9%',
   'Inventory Turnover': '1. 6.87',
   'Accounts Receivable Turnover': '6.0 times',
   'Capital Expenditures': '$42,117 million',
   'Research and Development Expenses': '$26,251 million',
   'Market Cap': '$2.9 trillion',
   'Price-to-Earnings (P/E) Ratio': '24.6',
   'Dividend Yield': '0.0054',
   'Year-over-Year Growth Rate': '8%'}}}
""".replace("\'", "\""))

### Write Data to CSV
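
The raw answers are inconsistently formatted: some carry a stray list marker (`'1. 0.8%'`), some have a space after the currency sign (`'$ 76,632 million'`), and missing values appear as both `'NA'` and `'Not available'`. It can help to normalize them before writing. The sketch below is one way to do that. `clean_answer` and the set of missing-value markers are assumptions based only on the outputs shown above.

```python
import re
from typing import Optional

# Strings treated as "no answer"; this set is an assumption based on the outputs above.
MISSING_MARKERS = {"na", "n/a", "not available"}


def clean_answer(answer: str) -> Optional[str]:
    """Normalize one raw answer string; return None when no value was found."""
    cleaned = answer.strip()
    # Drop a stray list marker such as the "1. " in "1. 0.8%".
    cleaned = re.sub(r"^\d+\.\s+", "", cleaned)
    # Remove the space between a currency sign and the number: "$ 76,632" -> "$76,632".
    cleaned = re.sub(r"\$\s+", "$", cleaned)
    if cleaned.lower() in MISSING_MARKERS:
        return None
    return cleaned


print(clean_answer("1. 0.8%"))           # 0.8%
print(clean_answer("$ 76,632 million"))  # $76,632 million
print(clean_answer("Not available"))     # None
```

Plain decimals such as `'1.02'` are left untouched, because the list-marker pattern requires whitespace after the dot.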

In [17]:
def flatten_dict(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

# Flatten the nested dictionary
flat_data = []
for year, by_company in answers.items():  # avoid shadowing the earlier `companies` list
    for company, metrics in by_company.items():
        flat_metrics = flatten_dict(metrics)
        flat_metrics['Year'] = year
        flat_metrics['Company'] = company
        flat_data.append(flat_metrics)

# Get all unique keys to use as CSV headers
headers = set()
for item in flat_data:
    headers.update(item.keys())

# Sort headers to ensure 'Year' and 'Company' come first
headers = sorted(headers)
headers.insert(0, headers.pop(headers.index('Year')))
headers.insert(1, headers.pop(headers.index('Company')))

# Write to CSV
with open('sec_10k_analysis_form_filling.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=headers)
    writer.writeheader()
    for row in flat_data:
        writer.writerow(row)
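
The CSV above has one row per year and company, whereas the original form has one row per parameter and one column per year and company. The sketch below shows one way to pivot the nested answers back into that layout. `to_form_rows` is a hypothetical helper, and unlike the original form it repeats the year in every column of the first header row.

```python
import csv
import io
from typing import Dict, List

Answers = Dict[str, Dict[str, Dict[str, str]]]  # year -> company -> parameter -> answer


def to_form_rows(answers: Answers) -> List[List[str]]:
    """Build rows in the form's layout: one row per parameter, one column per (year, company)."""
    columns = [(year, company) for year, by_company in answers.items() for company in by_company]

    # Collect parameters in first-seen order.
    parameters: List[str] = []
    for by_company in answers.values():
        for metrics in by_company.values():
            for parameter in metrics:
                if parameter not in parameters:
                    parameters.append(parameter)

    rows = [
        ["Parameter"] + [year for year, _ in columns],
        [""] + [company for _, company in columns],
    ]
    for number, parameter in enumerate(parameters, start=1):
        values = [answers[year][company].get(parameter, "NA") for year, company in columns]
        rows.append([f"{number}. {parameter}"] + values)
    return rows


# Tiny illustrative input; the values are taken from the answers above.
sample = {
    "2021": {"Amazon (AMZN)": {"Revenue": "469,822 million"}},
    "2022": {"Amazon (AMZN)": {"Revenue": "513,983"}},
}
form_rows = to_form_rows(sample)

# Write to an in-memory buffer here; pass an open file instead to save to disk.
buffer = io.StringIO()
csv.writer(buffer).writerows(form_rows)
print(buffer.getvalue())
```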

In [34]:
import pandas as pd
from IPython.display import HTML  # importing HTML from IPython.core.display is deprecated

pd.set_option('display.max_colwidth', 10)

out_df = pd.read_csv("sec_10k_analysis_form_filling.csv")

In [35]:
html = out_df.to_html()
HTML(html)

Unnamed: 0,Year,Company,Accounts Receivable Turnover,Capital Expenditures,Current Ratio,Debt-to-Equity Ratio,Dividend Yield,EBITDA,Earnings Per Share (EPS),Free Cash Flow,Gross Margin,Inventory Turnover,Market Cap,Net Income,Net Profit Margin,Operating Margin,Price-to-Earnings (P/E) Ratio,Research and Development Expenses,Return on Assets (ROA),Return on Equity (ROE),Revenue,Year-over-Year Growth Rate
0,2021,Amazon (AMZN),5.6 times,58.3 billion,1.02,0.63,,24879,64.81,"-$11,569",41.0%,4.8 times,$1.66 trillion,33364,6.8%,13.7%,73.6,Not significant,5.9%,22.9%,"469,822 million",22%
1,2021,Microsoft (MSFT),7.5 times,$9.5 billion,1.79,0.3,1. 0.8%,"$ 76,632 million",8.05,28.7 billion,$115.9 billion,2.5 times,$2.5 trillion,"$61,271 million",19.7%,32%,34.5,"20,716 million",10.9%,27.4%,"$168,088 million",18%
2,2021,Apple (AAPL),6.2 times,"$9,000 million",1.12,0.93,1. 0.0065,19863,5.61,"$ 73,000","$152,836 million",Not available,$2.46 trillion,"94,680 million",21.7%,44.7%,28.11,"$21,914 million",5.1%,21.9%,$365.817 billion,33%
3,2022,Amazon (AMZN),6.4 times,$58.3 billion,1.1,1.39,,"$15,432 million",9.7,"$-11,569 million",,,$1.47 trillion,"(2,722)",,16.9%,,"$73,213 million",6.9%,8.4%,513983,13%
4,2022,Microsoft (MSFT),6.1 times,$8.5 billion,1.78,0.59,2.0%,$107.895 billion,9.65,$58.7 billion,$135.620 billion,3.4 times,$1.87 trillion,72738,19%,19%,38.6,"$24,512 million",9.99%,38.7%,"$198,270 million",18%
5,2022,Apple (AAPL),6.0 times,"$42,117 million",2.78,0.68,0.0054,"$145,787 million",6.15,"$ 88,531","$170,782 million",1. 6.87,$2.9 trillion,99803,21.9%,37.9%,24.6,"$26,251 million",6.7%,13.15%,"394,328 million",8%
