<a href="https://colab.research.google.com/github/thegallier/configs/blob/main/Copy_of_tesla_10q_table.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/query_engine/sec_tables/tesla_10q_table.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Joint Tabular/Semantic QA over Tesla 10K

In this example, we show how to ask questions over 10K with understanding of both the unstructured text as well as embedded tables.

We use Unstructured to parse out the tables, and use LlamaIndex recursive retrieval to index/retrieve tables if necessary given the user question.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

In [1]:
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.9.15.post2-py3-none-any.whl (966 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m966.5/966.5 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
Collecting beautifulsoup4<5.0.0,>=4.12.2 (from llama-index)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.0/143.0 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting dataclasses-json (from llama-index)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama-index)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting httpx (from llama-index)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.0/75.0 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
Collecting openai>=1.1.0 (from llama-index)
  Downloading openai-1.5.0-py3-none-any.whl (223 kB)
[2K     [90m

In [2]:
%load_ext autoreload
%autoreload 2

In [38]:
!pip install structured

Collecting structured
  Downloading structured-0.1.tar.gz (3.9 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting nose (from structured)
  Downloading nose-1.3.7-py3-none-any.whl (154 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.7/154.7 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: structured
  Building wheel for structured (setup.py) ... [?25l[?25hdone
  Created wheel for structured: filename=structured-0.1-py3-none-any.whl size=4914 sha256=02cc12014a8a22a2dc378b3a0a42284c7d4de545dab23d9014397edd36bd8428
  Stored in directory: /root/.cache/pip/wheels/56/6f/eb/f856d8ce5d9d310badb0d9b61e3a8842b5c6e353683732085f
Successfully built structured
Installing collected packages: nose, structured
Successfully installed nose-1.3.7 structured-0.1


In [42]:
!pip install unstructured

Collecting unstructured
  Downloading unstructured-0.11.5-py3-none-any.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.9.0-py2.py3-none-any.whl (397 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m397.5/397.5 kB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
Collecting python-iso639 (from unstructured)
  Downloading python_iso639-2023.12.11-py3-none-any.whl (275 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m275.1/275.1 kB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting langdetect (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [11]:
from pydantic import BaseModel
from unstructured.partition.html import partition_html
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_colwidth", None)

## Perform Data Extraction

In these sections we use Unstructured to parse out the table and non-table elements.

### Extract Elements

We use Unstructured to extract table and non-table elements from the 10-K filing.

In [5]:
!wget "https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1" -O tesla_2021_10k.htm
!wget "https://www.dropbox.com/scl/fi/rkw0u959yb4w8vlzz76sa/tesla_2020_10k.htm?rlkey=tfkdshswpoupav5tqigwz1mp7&dl=1" -O tesla_2020_10k.htm

--2023-12-17 18:53:32--  https://www.dropbox.com/scl/fi/mlaymdy1ni1ovyeykhhuk/tesla_2021_10k.htm?rlkey=qf9k4zn0ejrbm716j0gg7r802&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.84.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.84.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://uc54870edc2796f2991b9e0fa379.dl.dropboxusercontent.com/cd/0/inline/CJmYIvrBiRgGICjFDRg0qQDDmqXiwPxK9fbtC4NbLLowaxqFEjA7_fix-3kATfRO9sifHZR5wt6sud-_xSoGBuwSApukTqUEEx5KyhkzUnD1kmVIkBfFmj0wcFqof7tx-is/file?dl=1# [following]
--2023-12-17 18:53:34--  https://uc54870edc2796f2991b9e0fa379.dl.dropboxusercontent.com/cd/0/inline/CJmYIvrBiRgGICjFDRg0qQDDmqXiwPxK9fbtC4NbLLowaxqFEjA7_fix-3kATfRO9sifHZR5wt6sud-_xSoGBuwSApukTqUEEx5KyhkzUnD1kmVIkBfFmj0wcFqof7tx-is/file?dl=1
Resolving uc54870edc2796f2991b9e0fa379.dl.dropboxusercontent.com (uc54870edc2796f2991b9e0fa379.dl.dropboxusercontent.com)... 162.125.81.15, 2620:100:601f:15::a27d:9

In [79]:
!pip install edgartools

Collecting edgartools
  Downloading edgartools-2.6.2-py3-none-any.whl (285 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m285.0/285.0 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
Collecting markdownify>0.11.0 (from edgartools)
  Downloading markdownify-0.11.6-py3-none-any.whl (16 kB)
Collecting pandas>=2.0.0 (from edgartools)
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.3/12.3 MB[0m [31m16.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pyarrow>=14.0.0 (from edgartools)
  Downloading pyarrow-14.0.1-cp310-cp310-manylinux_2_28_x86_64.whl (38.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m38.0/38.0 MB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pydantic>=2.0.0 (from edgartools)
  Downloading pydantic-2.5.2-py3-none-any.whl (381 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m381.9/381.9 kB

In [12]:
from llama_index.readers.file.flat_reader import FlatReader
from pathlib import Path

reader = FlatReader()
docs_2021 = reader.load_data(Path("tesla_2021_10k.htm"))
docs_2020 = reader.load_data(Path("tesla_2020_10k.htm"))

In [98]:
aapl_2023

[Document(id_='f429904d-e108-4a1c-a70e-6d2ad5ff8a82', embedding=None, metadata={'filename': '2023_aapl', 'extension': ''}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='11b244e2fba57b16f1bde496cd34c4552822f83b900b1b6e7938397d5b5ae64f', text='<?xml version="1.0" ?><!--XBRL Document Created with the Workiva Platform--><!--Copyright 2023 Workiva--><!--r:ecc8a488-ea65-4937-b6f1-c1002556177f,g:54641354-c9ff-45f0-9d99-6c61917b3c72,d:89a6f703d15142c8bed03bbbc5198dc5--><html xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:dei="http://xbrl.sec.gov/dei/2023" xmlns:srt="http://fasb.org/srt/2023" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:ecd="http://xbrl.sec.gov/ecd/2023" xmlns:aapl="http://www.apple.com/20230701" xmlns="http://www.w3.org/1999/xhtml" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transf

In [1]:
from edgar import *
set_identity("Michael Mccallum mike.mccalum@indigo.com")

filings = get_filings()

In [3]:
filings = Company("AAPL").get_filings(form="10-Q").latest(1)

In [8]:
with open("2023_aapl","w") as f:
  f.write(filings.html())

In [156]:
aapl_2023 = reader.load_data(Path("2023_aapl"),extra_info={'company':'apple','ticker':'aapl','quarter':'3q2023'})

In [157]:
aapl_2023

[Document(id_='aecf3656-e701-4042-847d-8f5bb61652c5', embedding=None, metadata={'filename': '2023_aapl', 'extension': '', 'company': 'apple', 'ticker': 'aapl', 'quarter': '3q2023'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='0b46c951b073d7a7166021466b29b7ea45f6f2e28fc2c5c0c3e70d0ae03319ea', text='<?xml version="1.0" ?><!--XBRL Document Created with the Workiva Platform--><!--Copyright 2023 Workiva--><!--r:ecc8a488-ea65-4937-b6f1-c1002556177f,g:54641354-c9ff-45f0-9d99-6c61917b3c72,d:89a6f703d15142c8bed03bbbc5198dc5--><html xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:dei="http://xbrl.sec.gov/dei/2023" xmlns:srt="http://fasb.org/srt/2023" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:ecd="http://xbrl.sec.gov/ecd/2023" xmlns:aapl="http://www.apple.com/20230701" xmlns="http://www.w3.org/1

In [89]:
filings.html().find("interest rate")

605494

In [118]:
aapl_2023[0]

Document(id_='b4c21180-8d78-49bb-b2a2-6fdb90ac0864', embedding=None, metadata={'filename': '2023_aapl', 'extension': '', 'company': 'apple', 'ticker': 'aapl', 'quarter': '3q2023'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='0b46c951b073d7a7166021466b29b7ea45f6f2e28fc2c5c0c3e70d0ae03319ea', text='<?xml version="1.0" ?><!--XBRL Document Created with the Workiva Platform--><!--Copyright 2023 Workiva--><!--r:ecc8a488-ea65-4937-b6f1-c1002556177f,g:54641354-c9ff-45f0-9d99-6c61917b3c72,d:89a6f703d15142c8bed03bbbc5198dc5--><html xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:dei="http://xbrl.sec.gov/dei/2023" xmlns:srt="http://fasb.org/srt/2023" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:ecd="http://xbrl.sec.gov/ecd/2023" xmlns:aapl="http://www.apple.com/20230701" xmlns="http://www.w3.org/19

In [18]:
import openai


In [19]:
from llama_index.node_parser import (
    UnstructuredElementNodeParser,
)

node_parser = UnstructuredElementNodeParser()

In [99]:
raw_nodes_2023

[TextNode(id_='8b98a05d-8a00-4d5a-9dc9-e6ad120a8e91', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='7d0a91ee-c50d-4a35-a2b1-ce5f84ef2532', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='cfc6d54ca84f67cfcf27a55c10dd3ab79345a405fa1cb958e8b41ec811ac1fa4'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='id_19_table_ref', node_type=<ObjectType.INDEX: '3'>, metadata={'col_schema': 'Column: Title of each class\nType: string\nSummary: Title of the securities\n\nColumn: Trading symbol(s)\nType: string\nSummary: Trading symbols of the securities\n\nColumn: Name of each exchange on which registered\nType: string\nSummary: Exchanges where the securities are registered'}, hash='588322e1e096b026e5ac174be33fed1c41d7e4b2f6312bbdb74780b79d9700ec')}, hash='cfc6d54ca84f67cfcf27a55c10dd3ab79345a405fa1cb958e8b41ec811ac1fa4', text='UNITED STATES\n\nSECURITIES AND EXCHANGE 

In [160]:
 raw_nodes_2023 = node_parser.get_nodes_from_documents(aapl_2023,include_metadata=True)

100%|██████████| 1/1 [00:02<00:00,  2.65s/it]


In [161]:
raw_nodes_2023

[TextNode(id_='caddbef9-75fe-43e6-a292-db6fb1d63cfc', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f6c530e2-4fe3-44d1-a0d3-a807803263b7', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='cfc6d54ca84f67cfcf27a55c10dd3ab79345a405fa1cb958e8b41ec811ac1fa4'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='id_19_table_ref', node_type=<ObjectType.INDEX: '3'>, metadata={'col_schema': 'Column: Trading symbol(s)\nType: string\nSummary: Trading symbols for various securities\n\nColumn: Name of each exchange on which registered\nType: string\nSummary: Names of exchanges where the securities are registered'}, hash='277859fa7d1081469f0d6aee99325bfc7e0cb21577a09a49d34c33e9093e19f1')}, hash='cfc6d54ca84f67cfcf27a55c10dd3ab79345a405fa1cb958e8b41ec811ac1fa4', text='UNITED STATES\n\nSECURITIES AND EXCHANGE COMMISSION\n\nWashington, D.C. 20549\n\nFORM 10-Q\n\n(Mark One)\n

In [45]:
import os
import pickle

if not os.path.exists("2021_nodes.pkl"):
    raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)
    pickle.dump(raw_nodes_2021, open("2021_nodes.pkl", "wb"))
else:
    raw_nodes_2021 = pickle.load(open("2021_nodes.pkl", "rb"))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
0it [00:00, ?it/s]


In [22]:
base_nodes_2021, node_mappings_2021 = node_parser.get_base_nodes_and_mappings(
    raw_nodes_2021
)

In [134]:
base_nodes_2023, node_mappings_2023 = node_parser.get_base_nodes_and_mappings(
    raw_nodes_2023
)

## Setup Recursive Retriever

Now that we've extracted tables and their summaries, we can setup a recursive retriever in LlamaIndex to query these tables.

### Construct Retrievers

In [23]:
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import VectorStoreIndex

In [139]:
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[
    ExactMatchFilter(
        key="ticker",
        value="aapl"
    )
])


In [127]:
# construct top-level vector index + query engine
vector_index = VectorStoreIndex(base_nodes_2021)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
vector_query_engine = vector_index.as_query_engine(similarity_top_k=1)

In [25]:
import os
import pickle

if not os.path.exists("2021_nodes.pkl"):
    raw_nodes_2021 = node_parser.get_nodes_from_documents(docs_2021)
    pickle.dump(raw_nodes_2021, open("2021_nodes.pkl", "wb"))
else:
    raw_nodes_2021 = pickle.load(open("2021_nodes.pkl", "rb"))

In [26]:
from llama_index.node_parser import (
    UnstructuredElementNodeParser,
)

node_parser = UnstructuredElementNodeParser()

In [51]:
base_nodes_2021, node_mappings_2021 = node_parser.get_base_nodes_and_mappings(
    raw_nodes_2021
)

In [147]:
base_nodes_2023, node_mappings_2023 = node_parser.get_base_nodes_and_mappings(
    raw_nodes_2023
)

In [176]:
vector_index = VectorStoreIndex(base_nodes_2023)
vector_retriever = vector_index.as_retriever(similarity_top_k=1,filters=filters)
vector_query_engine = vector_index.as_query_engine(similarity_top_k=1,filters=filters)

In [174]:
len(base_nodes_2023)

22

In [175]:
for i in range(22):
  base_nodes_2023[i].metadata={'ticker':'aapl'}

In [167]:
vector_retriever

<llama_index.indices.vector_store.retrievers.retriever.VectorIndexRetriever at 0x7a4bb3195570>

In [25]:
!pip install llama_index[langchain]

Collecting langchain>=0.0.303 (from llama_index[langchain])
  Downloading langchain-0.0.350-py3-none-any.whl (809 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m809.1/809.1 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
Collecting jsonpatch<2.0,>=1.33 (from langchain>=0.0.303->llama_index[langchain])
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.2 (from langchain>=0.0.303->llama_index[langchain])
  Downloading langchain_community-0.0.3-py3-none-any.whl (1.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting langchain-core<0.2,>=0.1 (from langchain>=0.0.303->llama_index[langchain])
  Downloading langchain_core-0.1.1-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.6/190.6 kB[0m [31m16.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting langsmith<0.1.0,>=0.0.63 (from langchain>=0.0.303

In [28]:
from llama_index.langchain_helpers.text_splitter import SentenceSplitter

In [29]:
from llama_index.retrievers import RecursiveRetriever

recursive_retriever = RecursiveRetriever(
    "vector", #SentenceSplitter,
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings_2021,
    verbose=False,
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)

In [30]:
nodes = query_engine.retrieve("What was the revenue for Tesla in 2020?")


In [177]:
recursive_retriever = RecursiveRetriever(
    "vector", #SentenceSplitter,
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings_2023,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)

In [180]:
query_engine.query("What derivatives does gm have?")

[1;3;34mRetrieving with query id None: What derivatives does gm have?
[0m[1;3;38;5;200mRetrieving text node: The Company designates these instruments as either cash flow or fair value hedges. As of July 1, 2023, the maximum length of time over which the Company is hedging its exposure to the variability in future cash flows for term debt–related foreign currency transactions is 19 years.

Apple Inc. | Q3 2023 Form 10-Q | 8

The Company may also enter into derivative instruments that are not designated as accounting hedges to protect gross margins from certain fluctuations in foreign currency exchange rates, as well as to offset a portion of the foreign currency exchange gains and losses generated by the remeasurement of certain assets and liabilities denominated in non-functional currencies.

Interest Rate Risk

To protect the Company’s term debt or marketable securities from fluctuations in interest rates, the Company may enter into interest rate swaps, options or other instruments

Response(response="I'm sorry, but I cannot provide information about the derivatives that GM (General Motors) has based on the given context. The context information provided is about Apple Inc., not GM.", source_nodes=[NodeWithScore(node=TextNode(id_='64fc46cd-825a-4d81-ac2b-5dee525c6d73', embedding=None, metadata={'ticker': 'aapl'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f297627b-1b65-4c67-8553-fd69cba4fa5c', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='f35f0a63f0da7d56e35310a8929bbac381b84cc8dbb44d6ba7cfbde5946b72cd'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='9b274b2a-f546-4d68-80bc-629a27e2214f', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='a38e83186fa194fe872297105faa891f25c943bc4889211d3c0150930c099bae'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='66c262e0-6d9d-472b-8383-0e00e2ed1ad8', node_type=<ObjectType.TEXT: '1'>, metadata={}, has

### Run some Queries

In [186]:
response = query_engine.query("What is the total amount of earning for apple in 2023?")
print(str(response))

[1;3;34mRetrieving with query id None: What is the total amount of earning for apple in 2023?
[0m[1;3;38;5;200mRetrieving text node: The year-over-year net sales decrease consisted primarily of lower net sales of iPad and iPhone, partially offset by higher net sales of Services.

During the third quarter of 2023, the Company announced the following new products:

15-inch MacBook Air®, powered by the M2 chip;

Mac Studio™, powered by the M2 Max chip and the new M2 Ultra chip;

Mac Pro®, powered by the new M2 Ultra chip; and

Apple Vision Pro™, the Company’s first spatial computer featuring its new visionOS™, expected to be available in early calendar year 2024.

The Company also announced iOS 17, macOS

Sonoma, iPadOS

17, tvOS

17 and watchOS

10, updates to its operating systems that are expected to be available in the fall of 2023.

Apple Inc. | Q3 2023 Form 10-Q | 14

The Company repurchased $18.0 billion of its common stock and paid dividends and dividend equivalents of $3.8 bil

In [96]:
https://docs.llamaindex.ai/en/stable/examples/evaluation/QuestionGeneration.htmlm

SyntaxError: ignored

In [94]:
# compare against the baseline retriever
response = vector_query_engine.query("WHow much gain or losses did Apple incure because of hedges?")
print(str(response))

Apple Inc. incurred a total change in unrealized gains/losses on derivative instruments of $612 million for the three months ended July 1, 2023.


In [60]:
response = query_engine.query("What were the total cash flows in 2021 and in 2020?")

In [61]:
print(str(response))

The total cash flows in 2021 were $5.20 billion, while the total cash flows in 2020 were $9.97 billion.


In [95]:
response = vector_query_engine.query("What were the total cash flows in 2023?")
print(str(response))

The total cash flows in 2023 were $88,945 million.


In [63]:
response = query_engine.query("What are the financial risk factors (interest rate risk, foreign exhange risk and inflation risk) for Tesla? List in bullet points and also any hedges they have")
print(str(response))

- Interest rate risk: Tesla may be exposed to interest rate risk, as changes in interest rates can impact the cost of borrowing and financing activities. Fluctuations in interest rates could affect Tesla's ability to obtain favorable financing terms and could increase its interest expense.
- Foreign exchange risk: Tesla operates globally and generates revenue in various currencies. Changes in foreign exchange rates can impact the value of Tesla's revenue and assets when translated into its reporting currency (e.g., US dollars). Fluctuations in exchange rates could result in foreign exchange losses or gains.
- Inflation risk: Inflation can erode the purchasing power of Tesla's cash flows and assets over time. Rising inflation rates could increase the cost of raw materials, labor, and other inputs, potentially impacting Tesla's profitability.

Hedges:
The provided context does not mention any specific hedges that Tesla has in place to mitigate these financial risks.


In [64]:
response = vector_query_engine.query("What are the risk factors for Tesla?")
print(str(response))

The risk factors for Tesla include the need to ensure compliance with regulatory requirements in various jurisdictions, the dependence on consumer demand for electric vehicles, competition in the automotive industry, perceptions about the limited range and access to charging facilities for electric vehicles, volatility in the cost of oil and gasoline, government regulations and economic incentives, concerns about future viability, cyclical sales in the automotive industry, potential failures or challenges in the supply chain, and the need to secure additional or alternate sources for components.


## Try Table Comparisons

In this setting we load in both the 2021 and 2020 10K filings, parse each into a hierarchy of tables/text objects, define a recursive retriever over each, and then compose both with a SubQuestionQueryEngine.

This allows us to execute document comparisons against both.

### Define E2E Recursive Retriever Function

In [52]:
import pickle
import os


def create_recursive_retriever_over_doc(docs, nodes_save_path=None):
    """Big function to go from document path -> recursive retriever."""
    node_parser = UnstructuredElementNodeParser()
    if nodes_save_path is not None and os.path.exists(nodes_save_path):
        raw_nodes = pickle.load(open(nodes_save_path, "rb"))
    else:
        raw_nodes = node_parser.get_nodes_from_documents(docs)
        if nodes_save_path is not None:
            pickle.dump(raw_nodes, open(nodes_save_path, "wb"))

    base_nodes, node_mappings = node_parser.get_base_nodes_and_mappings(
        raw_nodes
    )

    ### Construct Retrievers
    # construct top-level vector index + query engine
    vector_index = VectorStoreIndex(base_nodes)
    vector_retriever = vector_index.as_retriever(similarity_top_k=2)
    recursive_retriever = RecursiveRetriever(
        "vector",
        retriever_dict={"vector": vector_retriever},
        node_dict=node_mappings,
        verbose=False,
    )
    query_engine = RetrieverQueryEngine.from_args(recursive_retriever)
    return query_engine, base_nodes

### Create Sub Question Query Engine

In [21]:
import openai


In [53]:
import nest_asyncio

nest_asyncio.apply()

In [54]:
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

In [55]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-4")

service_context = ServiceContext.from_defaults(llm=llm)

In [66]:
query_engine_2021, nodes_2021 = create_recursive_retriever_over_doc(
    docs_2021, nodes_save_path="2021_nodes.pkl"
)
query_engine_2020, nodes_2020 = create_recursive_retriever_over_doc(
    docs_2020, nodes_save_path="2020_nodes.pkl"
)


100%|██████████| 7/7 [00:18<00:00,  2.57s/it]


In [78]:
query_engine_aapl, nodes_aapl = create_recursive_retriever_over_doc(
   aapl_2023, nodes_save_path="2023_nodes.pkl"
)

In [77]:
docs_2023

NameError: ignored

In [79]:
# setup base query engine as tool
query_engine_tools = [

        QueryEngineTool(
        query_engine=query_engine_aapl,
        metadata=ToolMetadata(
            name="aapl_2023_10k",
            description=(
                "Provides information about Apple financials for year 2023"
            ),
        ),
    ),
]

sub_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)

In [56]:
# setup base query engine as tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine_2021,
        metadata=ToolMetadata(
            name="tesla_2021_10k",
            description=(
                "Provides information about Tesla financials for year 2021"
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=query_engine_2020,
        metadata=ToolMetadata(
            name="tesla_2020_10k",
            description=(
                "Provides information about Tesla financials for year 2020"
            ),
        ),
    ),
        QueryEngineTool(
        query_engine=query_engine_aapl,
        metadata=ToolMetadata(
            name="aapl_2023_10k",
            description=(
                "Provides information about Apple financials for year 2023"
            ),
        ),
    ),
]

sub_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)

NameError: ignored

### Try out some Comparisons

In [80]:
response = sub_query_engine.query(
    "Did apple have any derivatives?"
)

Generated 2 sub questions.
[1;3;38;2;237;90;200m[aapl_2023_10k] Q: What is the financial statement of Apple for the year 2023?
[0m[1;3;38;2;90;149;237m[aapl_2023_10k] Q: Are there any mentions of derivatives in Apple's 2023 financial report?
[0m[1;3;38;2;90;149;237m[aapl_2023_10k] A: There is no mention of Apple in the provided context information.
[0m[1;3;38;2;237;90;200m[aapl_2023_10k] A: I'm sorry, but I cannot provide information about the financial statement of Apple for the year 2023 based on the given context.
[0m

In [60]:
print(str(response))

I'm sorry, but I cannot provide the answer to your query as the information about Apple's 2023 revenues is not available.


In [70]:
response = sub_query_engine.query(
    "Can you compare and contrast the R&D expenditures in 2021 vs. 2020?"
)

Generated 2 sub questions.
[1;3;38;2;237;90;200m[tesla_2021_10k] Q: What was the R&D expenditure for Tesla in 2021?
[0m[1;3;34mRetrieving with query id None: What was the R&D expenditure for Tesla in 2021?
[0m[1;3;38;5;200mRetrieving text node: 39

R&D expenses increased $1.10 billion, or 74%, in the year ended December 31, 2021 as compared to the year ended December 31, 2020. The increase was primarily due to a $506 million increase in employee and labor related expenses due to an increase in headcount, a $263 million increase in R&D expensed materials, a $211 million increase in facilities, outside services, freight and depreciation expense and an $103 million increase in stock-based compensation expense. These increases were to support our expanding product roadmap such as the new versions of Model S and Model X and technologies including our proprietary battery cells and there were additional R&D expenses as we were in the pre-production phases at both Gigafactory Texas and Gi

In [71]:
print(str(response))

In 2021, Tesla's R&D expenditure was significantly higher than in 2020. Specifically, they spent $2.593 billion in 2021, compared to only $148 million in 2020. This represents a substantial increase in research and development investment from one year to the next.


In [72]:
response = sub_query_engine.query(
    "Can you compare and contrast the risk factors in 2021 vs. 2020?"
)

Generated 2 sub questions.
[1;3;38;2;237;90;200m[tesla_2021_10k] Q: What were the risk factors for Tesla in 2021?
[0m[1;3;34mRetrieving with query id None: What were the risk factors for Tesla in 2021?
[0m[1;3;38;5;200mRetrieving text node: We also emphasize in our evaluation and career development efforts internal mobility opportunities for employees to drive professional development. Our goal is a long-term, upward-bound career at Tesla for every employee, which we believe also drives our retention efforts.

Our ability to retain our talented workforce is correlated to our compensation practices and culture of open communication. We provide a highly competitive wage that meets or exceeds that of comparable manufacturing roles, even before equity and benefits are factored in. In addition, the majority of our employees have the opportunity to receive additional Tesla equity each year based on their performance. We continue to review salary and wages against benchmarks and adjust t

In [None]:
print(str(response))

In both 2020 and 2021, Tesla faced risk factors related to the global COVID-19 pandemic, which could disrupt operations, deliveries, and business activities. However, the specific challenges differed slightly between the two years. In 2020, Tesla's risks were more focused on logistical issues such as increasing delivery volumes, particularly in international markets, and ramping up logistics channels in China and Europe. They also faced challenges in increasing the number of Supercharger stations and connectors, and meeting sales, delivery, installation, servicing, and vehicle charging targets globally. 

In contrast, the 2021 risks were more centered around personnel and cybersecurity. Tesla's ability to attract and retain senior leadership and a large number of skilled personnel was a significant concern, especially in regions with strong competition. The potential departure of key employees and negative publicity were also seen as risks. Cybersecurity threats, including cyber-attack

#### Try Comparing against Baseline

In [75]:
vector_index_2023 = VectorStoreIndex(nodes_2023)
vector_query_engine_2023 = vector_index_2023.as_query_engine(
    similarity_top_k=2
)
vector_index_2020 = VectorStoreIndex(nodes_2020)
vector_query_engine_2020 = vector_index_2020.as_query_engine(
    similarity_top_k=2
)
# setup base query engine as tool
query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine_2021,
        metadata=ToolMetadata(
            name="tesla_2021_10k",
            description=(
                "Provides information about Tesla financials for year 2021"
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=vector_query_engine_2020,
        metadata=ToolMetadata(
            name="tesla_2020_10k",
            description=(
                "Provides information about Tesla financials for year 2020"
            ),
        ),
    ),
]

base_sub_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)

In [81]:
response = base_sub_query_engine.query(
    "Can you compare and contrast the interest rate risk in 2019 with 2020?"
)
print(str(response))

NameError: ignored

In [None]:
https://github.com/tomasonjo/blogs/blob/master/llm/Llamaindex-rebel-neo4j.ipynb

In [None]:
https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/0f6b2c517db37637d2cfc92c446bcc71134fb07c/extended/src/main/java/apoc/ml/OpenAI.java#L108