# Dataset - Uber SEC 10-K filings


An SEC 10-K filing is an annual report that summarizes a public company's financial performance and business activities. The U.S. Securities and Exchange Commission (SEC) requires all public companies to file a 10-K.

In this notebook, we download Uber's SEC 10-K filings for the years 2019 through 2022 and build a chatbot over them.

In [10]:
%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai



Note: you may need to restart the kernel to use updated packages.




In [8]:
%pip install -U unstructured


In [2]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ········


In [3]:
import nest_asyncio

nest_asyncio.apply()

In [4]:
# set text wrapping
from IPython.display import HTML, display


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

# Ingest Data

In [6]:
!mkdir -p data
!curl -L "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -o data/UBER.zip
!unzip data/UBER.zip -d data


Archive:  data/UBER.zip
   creating: data/UBER/
  inflating: data/UBER/UBER_2021.html  
  inflating: data/__MACOSX/UBER/._UBER_2021.html  
  inflating: data/UBER/UBER_2020.html  
  inflating: data/__MACOSX/UBER/._UBER_2020.html  
  inflating: data/UBER/UBER_2019.html  
  inflating: data/__MACOSX/UBER/._UBER_2019.html  
  inflating: data/UBER/UBER_2022.html  
  inflating: data/__MACOSX/UBER/._UBER_2022.html  
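
The `__MACOSX/._*` entries in the listing above are macOS metadata files, not filings. As a sketch that is not part of the original notebook, the archive could also be extracted from Python with the standard `zipfile` module while skipping those entries. `extract_filings` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_filings(zip_path, dest_dir):
    """Extract a zip archive, skipping macOS metadata entries.

    Returns the sorted list of extracted member names.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            member_path = Path(member)
            # Skip the __MACOSX folder and AppleDouble "._" files.
            if "__MACOSX" in member_path.parts or member_path.name.startswith("._"):
                continue
            archive.extract(member, dest_dir)
            extracted.append(member)
    return sorted(extracted)
```

Calling `extract_filings("data/UBER.zip", "data")` would leave only the `data/UBER/` folder with the four HTML files.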


#### Parsing the HTML with BeautifulSoup

To parse the data, we use BeautifulSoup from bs4, a lightweight and flexible library for handling HTML content. Each filing is reduced to plain text and stored together with its year as metadata.

In [12]:
from bs4 import BeautifulSoup
from pathlib import Path

years = [2022, 2021, 2020, 2019]

doc_set = {}
all_docs = []

# Function to parse HTML content
def parse_html(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    soup = BeautifulSoup(content, "html.parser")
    return soup.get_text()  # Extract plain text from HTML

# Load data
for year in years:
    file_path = Path(f"./data/UBER/UBER_{year}.html")
    if file_path.exists():
        # Parse HTML and store the content
        text = parse_html(file_path)
        
        # Create a document object
        doc = {
            "content": text,
            "metadata": {"year": year}
        }
        
        # Add document to sets
        doc_set[year] = doc_set.get(year, []) + [doc]
        all_docs.append(doc)
    else:
        print(f"File not found: {file_path}")
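
Text extracted from HTML can contain long runs of blank lines and non-breaking spaces. The following is an optional cleanup sketch, not something the original notebook does. `normalize_whitespace` is a hypothetical helper that could be applied to `text` before each document is built.

```python
import re


def normalize_whitespace(text):
    """Collapse whitespace runs while keeping paragraph breaks."""
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim single spaces left around newlines.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Whether this helps retrieval quality is untested here; it mainly reduces the number of near-empty chunks.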



# Setting up Vector Indices for each year

We first set up a vector index for each year. Each vector index lets us ask questions about the 10-K filing for that year.

We build each index and save it to disk.

#### Do NOT run this code if the vector indices have already been created

In [15]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Document
from llama_index.core import Settings

# Update Settings
Settings.chunk_size = 512
Settings.chunk_overlap = 64
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Initialize indices
index_set = {}
for year in years:
    # Convert dictionaries to Document objects
    documents = [
        Document(text=doc["content"], metadata=doc["metadata"]) 
        for doc in doc_set[year]
    ]
    
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        documents=documents,
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")
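
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sliding-window sketch. It is an illustration only: it takes a list of tokens as given, whereas LlamaIndex's own splitter also tries to respect sentence boundaries, so real chunk boundaries will differ. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# With the settings above, each window advances 512 - 64 = 448 tokens.
example = chunk_tokens(list(range(1000)), chunk_size=512, chunk_overlap=64)
print([(c[0], c[-1]) for c in example])
# prints [(0, 511), (448, 959), (896, 999)]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.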


##### To load an index from disk, do the following

In [16]:
# Load indices from disk
from llama_index.core import load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
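
Because rebuilding an index re-embeds the whole filing, it can help to check which years already have a persisted index before running the build cell. The sketch below assumes the `./storage/{year}` layout used above and treats any non-empty directory as an existing index; it does not validate the files inside. `years_missing_index` is a hypothetical helper.

```python
from pathlib import Path


def years_missing_index(years, storage_root="./storage"):
    """Return the years that have no persisted index directory yet."""
    missing = []
    for year in years:
        year_dir = Path(storage_root) / str(year)
        # Treat a missing or empty directory as "not built yet".
        if not year_dir.is_dir() or not any(year_dir.iterdir()):
            missing.append(year)
    return missing
```

Years returned by `years_missing_index(years)` would go through the build cell; the rest can be loaded from disk as shown above.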

# Setting up a Sub Question Query Engine

Since we have access to filings from four years, we may want to ask not only about the 10-K filing of a given year but also questions that require analysis across all of the filings.

To address this, we can use a Sub Question Query Engine. It decomposes a query into subqueries, each answered by an individual vector index, and synthesizes the results to answer the overall query.

LlamaIndex provides some wrappers around indices (and query engines) so that they can be used by query engines and agents. First we define a QueryEngineTool for each vector index. Each tool has a name and a description; these are what the LLM agent sees to decide which tool to choose.


The code below creates a list of QueryEngineTool objects, one for each year's Uber 10-K filing, using the vector indices created earlier. The numbered comments in the code mark each step.



In [27]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

individual_query_engine_tools = []

for year in years:
    # Step 1: Retrieve the query engine for the current year's vector index
    query_engine = index_set[year].as_query_engine()
    print(f"Created query engine for {year}.")

    # Step 2: Define metadata for the query engine tool
    metadata = ToolMetadata(
        name=f"vector_index_{year}",
        description=(
            "Useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber."
        ),
    )
    print(f"ToolMetadata for {year}: {metadata}")

    # Step 3: Create a QueryEngineTool for this year
    tool = QueryEngineTool(
        query_engine=query_engine,
        metadata=metadata,
    )
    print(f"Created QueryEngineTool for {year}: {tool}")

    # Step 4: Add the tool to the list
    individual_query_engine_tools.append(tool)
    print(f"Added tool for {year} to the list.\n")

# Final step: Print all tools created
print(f"All QueryEngineTools created: {individual_query_engine_tools}")


Created query engine for 2022.
ToolMetadata for 2022: ToolMetadata(description='Useful for when you want to answer queries about the 2022 SEC 10-K for Uber.', name='vector_index_2022', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
Created QueryEngineTool for 2022: <llama_index.core.tools.query_engine.QueryEngineTool object at 0x2921b1d10>
Added tool for 2022 to the list.

Created query engine for 2021.
ToolMetadata for 2021: ToolMetadata(description='Useful for when you want to answer queries about the 2021 SEC 10-K for Uber.', name='vector_index_2021', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
Created QueryEngineTool for 2021: <llama_index.core.tools.query_engine.QueryEngineTool object at 0x28f0798d0>
Added tool for 2021 to the list.

Created query engine for 2020.
ToolMetadata for 2020: ToolMetadata(description='Useful for when you want to answer queries about the 2020 SEC 10-K for Uber.', n
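
Since the agent chooses tools by name and description, it is worth checking that each name is acceptable as an OpenAI function name, which as far as the OpenAI API documentation describes means letters, digits, underscores, and dashes, up to 64 characters. The sketch below builds the same names and descriptions as the loop above and validates them. `build_tool_specs` is a hypothetical helper that does not touch LlamaIndex.

```python
import re

# Assumed constraint on OpenAI function names: allowed characters and length.
VALID_TOOL_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def build_tool_specs(years):
    """Return (name, description) pairs mirroring the per-year tools."""
    specs = []
    for year in years:
        name = f"vector_index_{year}"
        if not VALID_TOOL_NAME.fullmatch(name):
            raise ValueError(f"Invalid tool name: {name!r}")
        description = (
            "Useful for when you want to answer queries about the"
            f" {year} SEC 10-K for Uber."
        )
        specs.append((name, description))
    return specs


for name, description in build_tool_specs([2022, 2021, 2020, 2019]):
    print(name, "->", description)
```

Keeping the year in both the name and the description gives the LLM two signals for routing a question to the right filing.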


Now we can create the Sub Question Query Engine, which will allow us to synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools we defined above.

In [18]:
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)
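
Conceptually, the engine's control flow can be sketched without any LLM calls. In the toy version below the sub-questions are written by hand and each tool is a plain function, whereas the real `SubQuestionQueryEngine` uses the LLM both to generate the sub-questions and to synthesize the final answer. All names here (`SubQuestion`, `answer_sub_questions`, `stub_tools`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubQuestion:
    tool_name: str  # which per-year tool should answer
    question: str   # the narrower question sent to that tool


def answer_sub_questions(sub_questions, tools):
    """Route each sub-question to its tool and collect the answers."""
    results = []
    for sq in sub_questions:
        if sq.tool_name not in tools:
            raise KeyError(f"No tool named {sq.tool_name!r}")
        answer = tools[sq.tool_name](sq.question)
        results.append((sq, answer))
    # Stand-in for synthesis: the real engine asks the LLM to combine these.
    return "\n".join(
        f"[{sq.tool_name}] {sq.question} -> {answer}" for sq, answer in results
    )


# Stub tools standing in for the per-year query engines.
stub_tools = {
    "vector_index_2021": lambda q: "stub answer from the 2021 filing",
    "vector_index_2020": lambda q: "stub answer from the 2020 filing",
}
sub_questions = [
    SubQuestion("vector_index_2021", "What was revenue in 2021?"),
    SubQuestion("vector_index_2020", "What was revenue in 2020?"),
]
print(answer_sub_questions(sub_questions, stub_tools))
```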

# Setting up the Chatbot Agent

We use a LlamaIndex Data Agent to set up the outer chatbot agent, which has access to a set of Tools. Specifically, we use an OpenAIAgent, which takes advantage of OpenAI API function calling. We want to give it the separate Tools we defined previously for each index (corresponding to a given year), as well as a tool for the sub question query engine we defined above.

##### First we define a QueryEngineTool for the sub question query engine:

In [19]:
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description=(
            "useful for when you want to answer queries that require analyzing"
            " multiple SEC 10-K documents for Uber"
        ),
    ),
)

##### We now combine the Tools we defined above into a single list of tools for the agent:

In [20]:
tools = individual_query_engine_tools + [query_engine_tool]

##### Finally, we call OpenAIAgent.from_tools to create the agent, passing in the list of tools we defined above:

In [21]:
from llama_index.agent.openai import OpenAIAgent

agent = OpenAIAgent.from_tools(tools, verbose=True)

# Testing the Agent

In [22]:
response = agent.chat("Good morning, this is Varun")
print(str(response))

Added user message to memory: Good morning, this is Varun
Good morning, Varun! How can I assist you today?


In [23]:
response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))

Added user message to memory: What were some of the biggest risk factors in 2020 for Uber?
=== Calling Function ===
Calling function: vector_index_2020 with args: {"input":"biggest risk factors"}
Got output: The biggest risk factors include the ongoing impact of the COVID-19 pandemic, which has adversely affected various aspects of the business and may continue to do so. Additionally, the potential classification of Drivers as employees rather than independent contractors poses a significant risk. The competitive landscape in the mobility, delivery, and logistics industries is also a concern, characterized by well-established alternatives, low barriers to entry, and strong competitors. Furthermore, the need to lower fares or service fees to remain competitive, along with the history of incurring significant losses, adds to the overall risk profile.

In 2020, some of the biggest risk factors for Uber included:

1. **COVID-19 Pandemic**: The ongoing impact of the pandemic adversely affec

In [28]:
response = agent.chat(
    "What were some of the biggest risk factors in 2026 for Uber?"
)
print(str(response))

Added user message to memory: What were some of the biggest risk factors in 2026 for Uber?
=== Calling Function ===
Calling function: vector_index_2022 with args: {"input":"biggest risk factors in 2026"}
Got output: The biggest risk factors in 2026 could include the ongoing challenges related to climate commitments, such as changing regulations, technological advancements, and the availability of electric vehicles and charging infrastructure. Additionally, the potential for future pandemics or outbreaks of contagious diseases, along with the impacts of catastrophic events like natural disasters or geopolitical conflicts, could significantly affect business operations and financial conditions. Economic uncertainties, including the ability to secure financing and the performance of third-party vendors, may also pose substantial risks.

In 2026, some of the biggest risk factors for Uber may include:

1. **Climate Commitments**: Ongoing challenges related to compliance with changing regula
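
In the exchange above, the 2026 question is routed to `vector_index_2022` and answered speculatively, although only the 2019 to 2022 filings were indexed. One possible guard, sketched here as an assumption rather than something the notebook does, is to scan a question for four-digit years and flag any that fall outside the indexed set before calling the agent. `unindexed_years` is a hypothetical helper.

```python
import re


def unindexed_years(question, indexed_years):
    """Return years mentioned in the question that have no index."""
    mentioned = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", question)}
    return sorted(mentioned - set(indexed_years))


question = "What were some of the biggest risk factors in 2026 for Uber?"
missing = unindexed_years(question, [2022, 2021, 2020, 2019])
if missing:
    print(f"No 10-K filing indexed for: {missing}")
```

A regular expression is a blunt check: it will not catch phrasing such as "last year", so it complements the tool descriptions rather than replacing them.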

In [25]:
response = agent.chat(
    "What were the main drivers of Uber's revenue growth or decline in 2021 compared to 2020??"
)
print(str(response))

Added user message to memory: What were the main drivers of Uber's revenue growth or decline in 2021 compared to 2020??
=== Calling Function ===
Calling function: sub_question_query_engine with args: {"input":"main drivers of Uber's revenue growth or decline in 2021 compared to 2020"}
Generated 7 sub questions.
[vector_index_2021] Q: What were the key revenue figures for Uber in 2021 as reported in the 2021 SEC 10-K?
[vector_index_2020] Q: What were the key revenue figures for Uber in 2020 as reported in the 2020 SEC 10-K?
[vector_index_2021] Q: What factors contributed to revenue growth for Uber in 2021 according to the 2021 SEC 10-K?
[vector_index_2021] Q: What factors contributed to revenue decline for Uber in 2021 according to the 2021 SEC 10-K?
[vector_index_2021] Q: How did Uber's business segments perform in 2021 compared to 2020 as per the 2021 SEC 10-K

In [26]:
response = agent.chat(
    "How did the COVID-19 pandemic impact Uber’s operations?"
)
print(str(response))


Added user message to memory: How did the COVID-19 pandemic impact Uber’s operations?
=== Calling Function ===
Calling function: vector_index_2021 with args: {"input":"impact of COVID-19 pandemic on Uber's operations"}
Got output: The COVID-19 pandemic has significantly impacted Uber's operations by altering market and economic conditions globally. It has led to a reduction in demand for Mobility rides due to various governmental restrictions, including emergency declarations, business closures, and limitations on gatherings. This has resulted in driver supply constraints, as concerns regarding the pandemic affect driver availability.

In response to these challenges, Uber has prioritized the health and safety of its consumers, drivers, and employees, while also focusing on preserving liquidity and managing cash flow. The pandemic has accelerated the growth of Uber's Delivery offerings, as demand for these services has increased. To comply with social distancing guidelines, Uber tempor
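
For open-ended exploration, the agent can be wrapped in a small loop. The sketch below only assumes an object with a `chat(message)` method, as `agent` has above. `run_chat`, `EchoAgent`, and the `exit` stop word are hypothetical choices, and the messages are passed in as an iterable so the loop can be exercised without a live agent.

```python
def run_chat(agent, messages, stop_word="exit"):
    """Send each message to agent.chat until stop_word is seen.

    Returns a list of (message, reply) pairs.
    """
    transcript = []
    for message in messages:
        if message.strip().lower() == stop_word:
            break
        reply = str(agent.chat(message))
        transcript.append((message, reply))
    return transcript


class EchoAgent:
    """Stand-in for the real agent, used only to demonstrate the loop."""

    def chat(self, message):
        return f"echo: {message}"


for message, reply in run_chat(EchoAgent(), ["Hello", "exit", "ignored"]):
    print(f"User: {message}\nAgent: {reply}")
```

With the real agent, `run_chat(agent, iter(input, "exit"))` would read prompts from the keyboard until `exit` is typed.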