# Building a Sub-Question Query Engine with LlamaCloud

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/10k_apple_tesla/demo_subquestion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we build an advanced RAG pipeline with LlamaCloud + LlamaIndex over ten SEC 10-K filings (five each for Apple and Tesla, covering 2019-2023). We plug a default LlamaCloud query engine into the sub-question query engine (available in open-source LlamaIndex) to decompose complex questions into simpler sub-questions.

## Setup

Install core packages, download files, parse documents.

In [None]:
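!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-index-question-gen-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse
# needed for the LlamaCloudIndex and Chroma imports used later in this notebook
!pip install llama-index-indices-managed-llama-cloud
!pip install llama-index-vector-stores-chroma chromadb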

In [None]:
# create the target directory first (wget -O does not create it), then download Apple
!mkdir -p data
!wget "https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf" -O data/apple_2023.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2022/q4/_10-K-2022-(As-Filed).pdf" -O data/apple_2022.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O data/apple_2021.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2020/ar/_10-K-2020-(As-Filed).pdf" -O data/apple_2020.pdf
!wget "https://www.dropbox.com/scl/fi/i6vk884ggtq382mu3whfz/apple_2019_10k.pdf?rlkey=eudxh3muxh7kop43ov4bgaj5i&dl=1" -O data/apple_2019.pdf

# download Tesla
!wget "https://ir.tesla.com/_flysystem/s3/sec/000162828024002390/tsla-20231231-gen.pdf" -O data/tesla_2023.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000095017023001409/tsla-20221231-gen.pdf" -O data/tesla_2022.pdf
!wget "https://www.dropbox.com/scl/fi/ptk83fmye7lqr7pz9r6dm/tesla_2021_10k.pdf?rlkey=24kxixeajbw9nru1sd6tg3bye&dl=1" -O data/tesla_2021.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000156459021004599/tsla-10k_20201231-gen.pdf" -O data/tesla_2020.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000156459020004475/tsla-10k_20191231-gen_0.pdf" -O data/tesla_2019.pdf

Some OpenAI and LlamaParse details. The OpenAI LLM is used for response synthesis.

In [1]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"

In [3]:
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-"

In [3]:
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex, Settings

llm = OpenAI(model="gpt-4-0125-preview", temperature=0)

Settings.llm = llm

## Load Documents into LlamaCloud

The first order of business is to upload the ten filings downloaded above (five Apple and five Tesla 10-Ks) into LlamaCloud.

You can easily do this by creating a pipeline and uploading docs via the "Files" mode.
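
Before uploading, it can help to confirm that all ten PDFs are present locally. The sketch below assumes the `data/{company}_{year}.pdf` naming used by the download cell; `expected_filing_paths` and `missing_filings` are hypothetical helper names, not part of LlamaCloud or LlamaIndex.

```python
from pathlib import Path

COMPANIES = ["apple", "tesla"]
YEARS = range(2019, 2024)  # 2019 through 2023 inclusive


def expected_filing_paths(data_dir="data", companies=COMPANIES, years=YEARS):
    """Return the paths the download cell is expected to have written."""
    return [Path(data_dir) / f"{c}_{y}.pdf" for c in companies for y in years]


def missing_filings(data_dir="data"):
    """Return expected filings that are absent or empty."""
    return [
        p for p in expected_filing_paths(data_dir)
        if not p.is_file() or p.stat().st_size == 0
    ]


missing = missing_filings()
print(f"{len(missing)} of {len(expected_filing_paths())} filings missing")
for path in missing:
    print(f"  {path}")
```

An empty file is treated as missing so that a download that produced no content is not uploaded by accident.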

After this is done, proceed to the next section.

## Define LlamaCloud Retriever over Documents

In this section we define a LlamaCloud Retriever over these SEC 10K Filings. 

LlamaCloud natively uses LlamaParse + integrated table parsing + indexing to give back a retrieval endpoint over these documents.

We primarily use the default settings, without modification.

In [6]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="<index_name>",
  project_name="<project_name>",
  api_key="llx-"
)

In [7]:
query_engine = index.as_query_engine()

In [9]:
response = query_engine.query("What are risk factors for tesla in 2022")

In [10]:
print(str(response))

The risk factors for Tesla in 2022 include potential product liability claims that could lead to lawsuits, substantial monetary damages, product recalls or redesign efforts, and the need to defend against even meritless claims, all of which could generate negative publicity and be expensive and time-consuming. Additionally, maintaining public credibility and confidence in Tesla's long-term business prospects is crucial for its success. Challenges also arise from the need to effectively grow and manage the compliance, residual value, financing, and credit risks related to Tesla's various financing programs.


## Define Custom Sub-Question Query Engine

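Wrap the LlamaCloud query engine as a tool and pass it to the sub-question query engine, which breaks a complex question into sub-questions, answers each one with the tool, and synthesizes a final response.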

In [11]:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

In [12]:
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="query_tool",
        description=(
            "Search tool for Apple or Tesla financials for any year (please specify)"
        )
    )
)

rag_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[query_engine_tool],
    use_async=True
)
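
To make the decomposition step concrete, here is a toy sketch of the decompose, answer, and combine flow. It is not how the sub-question query engine works internally: the real engine asks an LLM to generate the sub-questions, whereas `decompose` below is a hypothetical rule-based stand-in that only looks for company names, and `answer_fn` stands in for a query engine tool.

```python
def decompose(question, entities=("Apple", "Tesla")):
    """Toy rule: emit one sub-question per entity mentioned in the question."""
    mentioned = [e for e in entities if e.lower() in question.lower()]
    if len(mentioned) < 2:
        return [question]  # single-entity question: nothing to split
    return [f"{question} (answer for {e} only)" for e in mentioned]


def answer_with_subquestions(question, answer_fn):
    """Answer each sub-question, then join the pairs into one context string."""
    pairs = [(sq, answer_fn(sq)) for sq in decompose(question)]
    return "\n".join(f"Sub question: {sq}\nResponse: {ans}" for sq, ans in pairs)


# Stand-in for a real query engine tool.
stub_tool = lambda q: f"<answer to: {q}>"
print(answer_with_subquestions(
    "Tell me risk factors for both Tesla and Apple in 2023", stub_tool
))
```

In the real pipeline the joined sub-question/answer pairs are passed to the LLM for a final synthesized response; the toy version stops at the joined context.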

In [13]:
response = rag_query_engine.query(
    "Get Property, plant and equipment assets for Tesla in 2022"
)

Generated 1 sub questions.
[1;3;38;2;237;90;200m[query_tool] Q: Get Property, plant and equipment assets for Tesla in 2022
[0m[1;3;38;2;237;90;200m[query_tool] A: The value of Property, Plant, and Equipment assets for Tesla as of December 31, 2022, was $23,548 million.
[0m

In [14]:
print(response.source_nodes[3].get_content(metadata_mode="all"))

table_df: {'Assets': {0: 'Current assets', 1: 'Cash and cash equivalents', 2: 'Short-term investments', 3: 'Accounts receivable, net', 4: 'Inventory', 5: 'Prepaid expenses and other current assets', 6: 'Total current assets', 7: 'Operating lease vehicles, net', 8: 'Solar energy systems, net', 9: 'Property, plant and equipment, net', 10: 'Operating lease right-of-use assets', 11: 'Digital assets, net', 12: 'Intangible assets, net', 13: 'Goodwill', 14: 'Deferred tax assets', 15: 'Other non-current assets', 16: 'Total assets'}, 'December 31, 2023': {0: ' ', 1: '$16,398', 2: '$12,696', 3: '$3,508', 4: '$13,626', 5: '$3,388', 6: '$49,616', 7: '$5,989', 8: '$5,229', 9: '$29,725', 10: '$4,180', 11: '$184', 12: '$178', 13: '$253', 14: '$6,733', 15: '$4,531', 16: '$106,618'}, 'December 31, 2022': {0: ' ', 1: '$16,253', 2: '$5,932', 3: '$2,952', 4: '$12,839', 5: '$2,941', 6: '$40,917', 7: '$5,035', 8: '$5,489', 9: '$23,548', 10: '$2,563', 11: '$184', 12: '$215', 13: '$194', 14: '$328', 15: '$3,8
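
The retrieved node carries the balance sheet as a column-oriented dict, so the answer can be cross-checked against it directly. The sketch below uses a two-row excerpt of the `table_df` dict printed above; `parse_millions` and `lookup` are hypothetical helpers written for this illustration.

```python
# Excerpt of the printed table_df (rows 1 and 9 only), values as strings.
table = {
    "Assets": {1: "Cash and cash equivalents", 9: "Property, plant and equipment, net"},
    "December 31, 2023": {1: "$16,398", 9: "$29,725"},
    "December 31, 2022": {1: "$16,253", 9: "$23,548"},
}


def parse_millions(cell):
    """Convert a cell such as '$23,548' to an int (figures are in millions)."""
    return int(cell.replace("$", "").replace(",", "").strip())


def lookup(table, row_label, column, label_column="Assets"):
    """Find the row whose label matches and return the parsed value in `column`."""
    for row_idx, label in table[label_column].items():
        if label == row_label:
            return parse_millions(table[column][row_idx])
    raise KeyError(row_label)


ppe_2022 = lookup(table, "Property, plant and equipment, net", "December 31, 2022")
print(ppe_2022)  # 23548, matching the $23,548 million in the response
```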

In [15]:
# print(str(response))
print(response.source_nodes[0].get_content(metadata_mode="all"))

Sub question: Get Property, plant and equipment assets for Tesla in 2022
Response: The value of Property, Plant, and Equipment assets for Tesla as of December 31, 2022, was $23,548 million.


In [16]:
print(str(response))

The value of Property, Plant, and Equipment assets for Tesla as of December 31, 2022, was $23,548 million.


## Define Baseline 

Here we define a baseline index and naive RAG pipeline over these documents.

Documents are stored in ChromaDB - after they're indexed they can be loaded directly from Chroma instead of re-indexing/re-embedding all the documents.

We retrieve from the index using a standard top-k dense retrieval + generation approach.
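
As a rough illustration of what top-k dense retrieval means, the sketch below ranks toy chunk vectors by similarity to a query vector and keeps the best `k`. It assumes cosine similarity and hand-written two-dimensional vectors; in the actual baseline the vectors come from an embedding model and the similarity metric depends on how the vector store is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: chunk 2 points almost the same way as the query, chunk 1 does not.
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
best = top_k([1.0, 0.0], chunks, k=2)
print([i for i, _ in best])  # [0, 2]
```

The retrieved chunks are then placed in the LLM prompt for generation. A table split across chunks, or flattened by the parser, may never rank in the top `k`, which is one way a baseline like this can miss figures.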

#### Load in Documents using a Baseline Parser

Here we load in documents using PyPDF.

In [17]:
from llama_index.core import SimpleDirectoryReader
from collections import defaultdict

companies = ["apple", "tesla"]
years = list(range(2019, 2024))

all_docs_base = defaultdict(dict)
for company in companies:
    for year in years:
        print((company, year))
        reader = SimpleDirectoryReader(input_files=[f"./data/{company}_{year}.pdf"])
        all_docs_base[company][year] = reader.load_data()

('apple', 2019)
('apple', 2020)
('apple', 2021)
('apple', 2022)
('apple', 2023)
('tesla', 2019)
('tesla', 2020)
('tesla', 2021)
('tesla', 2022)
('tesla', 2023)


#### Save Documents to Disk

To save compute time, we define functions to save the parsed docs to disk and load them back.

In [17]:
# save to disk
import json
import os

def save_docs(save_dir, all_docs, companies, years):
    # write one JSON file per (company, year) under save_dir
    os.makedirs(save_dir, exist_ok=True)
    for company in companies:
        for year in years:
            docs_list = [d.dict() for d in all_docs[company][year]]
            json_str = json.dumps(docs_list)
            with open(f"{save_dir}/{company}_{year}.json", "w") as fp:
                fp.write(json_str)

In [19]:
# load from disk
from llama_index.core import Document
import json

def load_docs(save_dir, companies, years):
    # read back the files written by save_docs from the same save_dir
    tmp_all_docs = defaultdict(dict)
    for company in companies:
        for year in years:
            with open(f"{save_dir}/{company}_{year}.json", "r") as fp:
                docs_list = json.load(fp)
            docs = [Document.parse_obj(d) for d in docs_list]
            tmp_all_docs[company][year] = docs
    return tmp_all_docs

In [27]:
# save to disk
save_docs("parsed_docs_base", all_docs_base, companies, years)

In [None]:
# load from disk
all_docs_base = load_docs("parsed_docs_base", companies, years)
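
The save and load helpers amount to a JSON round trip keyed by company and year. The sketch below shows the same pattern with plain dicts standing in for `Document` objects and a temporary directory standing in for the save directory; `save_records` and `load_records` are hypothetical names used only for this illustration.

```python
import json
import tempfile
from pathlib import Path


def save_records(save_dir, records, company, year):
    """Write one JSON file per (company, year), creating the directory if needed."""
    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{company}_{year}.json"
    path.write_text(json.dumps(records))
    return path


def load_records(save_dir, company, year):
    """Read back the file written by save_records."""
    return json.loads((Path(save_dir) / f"{company}_{year}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    pages = [{"text": "page one", "metadata": {"page_label": "1"}}]
    save_records(tmp, pages, "tesla", 2022)
    restored = load_records(tmp, "tesla", 2022)

print(restored == pages)  # True: the round trip preserves the records
```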

In [None]:
all_docs_base_list = []

for company in companies:
    for year in years:
        all_docs_base_list.extend(all_docs_base[company][year])

#### Index via ChromaDB

We process the documents through a naive RAG pipeline (default chunking and embedding) and store them in a local, persistent Chroma collection.

In [21]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

In [None]:
## Option 1: Run this cell if you haven't loaded documents into Chroma yet.
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    all_docs_base_list, storage_context=storage_context, show_progress=True
)

In [None]:
## Option 2: Run this cell if you want to load an index from an existing populated Chroma index.
db2 = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db2.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
)

In [23]:
base_query_engine = index.as_query_engine(similarity_top_k=5)

In [24]:
response = base_query_engine.query("Get Property, plant and equipment assets for Tesla in 2021")
print(str(response))

The information provided does not include data on Tesla's property, plant, and equipment assets for 2021.


## Try out Some Comparisons

We're now ready to try out some comparison queries. We compare our sub-question + LlamaCloud query engine against the baseline query engine.

The advanced query engine outperforms the baseline on questions over single documents (which don't leverage the sub-question query engine but rely on the parsing/retrieval capability of the LlamaCloud index), as well as on questions over multiple documents (which do rely on the sub-question query engine to break down questions).
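
For questions with a numeric answer, one lightweight way to score the two engines is to check whether each response mentions the figure reported in the filing. The sketch below is a hypothetical check, not part of LlamaIndex; the two answers are copied from this notebook's output for the first question, and the expected figure is the 2022 value from the balance-sheet table retrieved earlier.

```python
import re


def mentions_figure(response_text, figure):
    """True if `figure` appears as a standalone number, ignoring '$' and ','."""
    strip = lambda s: s.replace("$", "").replace(",", "")
    pattern = rf"(?<!\d){re.escape(strip(figure))}(?!\d)"
    return re.search(pattern, strip(response_text)) is not None


# Answers to "Get Property, plant and equipment assets for Tesla in 2022".
answers = {
    "sub-question + LlamaCloud": (
        "The value of Property, Plant, and Equipment assets for Tesla "
        "as of December 31, 2022, was $23,548 million."
    ),
    "baseline": (
        "The information provided does not include data on Tesla's "
        "property, plant, and equipment assets for 2022."
    ),
}
expected = "23,548"  # Property, plant and equipment, net (in millions)

for name, text in answers.items():
    print(f"{name}: {'hit' if mentions_figure(text, expected) else 'miss'}")
```

A string match like this only catches exact figures, so open-ended questions such as the risk-factor comparison still need manual review.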

In [25]:
questions = [
    "Get Property, plant and equipment assets for Tesla in 2022",
    "What are the excess tax benefits from equity awards in 2023 for Apple and Tesla", 
    "Payments for acquisition of property, plant and equipment for Apple in 2020?",
    "Get cash / cash equivalents for 2021 and 2023 for Tesla",
    "Tell me risk factors for both Tesla and Apple in 2023"
]

In [26]:
responses = []
base_responses = []
for question in questions:
    print(f"Question: {question}")
    response = rag_query_engine.query(question) 
    print(f"Response: {str(response)}")
    base_response = base_query_engine.query(question)
    print(f"Base Response: {str(base_response)}")
    responses.append(response)
    base_responses.append(base_response)

Question: Get Property, plant and equipment assets for Tesla in 2022
Generated 1 sub questions.
[1;3;38;2;237;90;200m[query_tool] Q: Get Property, plant and equipment assets for Tesla in 2022
[0m[1;3;38;2;237;90;200m[query_tool] A: The value of Property, Plant, and Equipment assets for Tesla as of December 31, 2022, was $23,548 million.
[0mResponse: The value of Property, Plant, and Equipment assets for Tesla as of December 31, 2022, was $23,548 million.
Base Response: The information provided does not include data on Tesla's property, plant, and equipment assets for 2022.
Question: What are the excess tax benefits from equity awards in 2023 for Apple and Tesla
Generated 2 sub questions.
[1;3;38;2;237;90;200m[query_tool] Q: What are the excess tax benefits from equity awards in 2023 for Apple?
[0m[1;3;38;2;90;149;237m[query_tool] Q: What are the excess tax benefits from equity awards in 2023 for Tesla?
[0m[1;3;38;2;237;90;200m[query_tool] A: The excess tax benefits from equity