<a href="https://colab.research.google.com/github/ramesitexp/genai_usecase/blob/main/LLamaIndex_Chatpot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 💬🤖 How to Build a Chatbot

LlamaIndex serves as a bridge between your data and Large Language Models (LLMs), providing a toolkit that enables you to establish a query interface around your data for a variety of tasks, such as question-answering and summarization.

In this tutorial, we'll walk you through building a context-augmented chatbot using a [Data Agent](https://gpt-index.readthedocs.io/en/stable/core_modules/agent_modules/agents/root.html). This agent, powered by LLMs, is capable of intelligently executing tasks over your data. The end result is a chatbot agent equipped with a robust set of data interface tools provided by LlamaIndex to answer queries about your data.

**Note**: This tutorial builds upon initial work on creating a query interface over SEC 10-K filings - [check it out here](https://medium.com/@jerryjliu98/how-unstructured-and-llamaindex-can-help-bring-the-power-of-llms-to-your-own-data-3657d063e30d).

### Context

In this guide, we’ll build a "10-K Chatbot" that uses raw UBER 10-K HTML filings from Dropbox. Users can interact with the chatbot to ask questions related to the 10-K filings.



### Preparation

In [1]:
%pip install llama-index-readers-file
%pip install llama-index-embeddings-openai
%pip install llama-index-agent-openai
%pip install llama-index-llms-openai

Collecting llama-index-readers-file
  Downloading llama_index_readers_file-0.1.19-py3-none-any.whl (36 kB)
Collecting llama-index-core<0.11.0,>=0.10.1 (from llama-index-readers-file)
  Downloading llama_index_core-0.10.33-py3-none-any.whl (15.4 MB)
Collecting pypdf<5.0.0,>=4.0.1 (from llama-index-readers-file)
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Collecting striprtf<0.0.27,>=0.0.26 (from llama-index-readers-file)
  Downloading striprtf-0.0.26-py3-none-any.whl (6.9 kB)
Collecting dataclasses-json (from llama-index-core<0.11.0,>=0.10.1->llama-index-readers-file)
  Downloading dataclasses_json-0.6.5-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama-index-core<0.11.0,>=0.10.1->llama-index-readers-file)
  D

In [2]:
import os

# Replace the placeholder with your own key; never commit a real key to a shared notebook.
os.environ["OPENAI_API_KEY"] = "sk-..."

import nest_asyncio

nest_asyncio.apply()
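
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming you are happy to be prompted once per session. `ensure_api_key` is a hypothetical helper written for this tutorial; it is not part of LlamaIndex or the OpenAI SDK.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", env=os.environ, prompt=getpass):
    """Return the key stored under `name`, prompting for it only if unset.

    Hypothetical helper: `env` and `prompt` are injectable so the function
    can be exercised without touching the real environment.
    """
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so it never lands in cell output
        key = prompt(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} is required")
        env[name] = key
    return key
```

Calling `ensure_api_key()` near the top of the notebook could stand in for the direct assignment to `os.environ`.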

In [3]:
# set text wrapping
from IPython.display import HTML, display


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

### Ingest Data

Let's first download the raw 10-K files from 2019 to 2022.

In [4]:
# NOTE: the code examples assume you're operating within a Jupyter notebook.
# download files
!mkdir -p data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data

--2024-04-29 14:03:27--  https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/948jr9cfs7fgj99/UBER.zip [following]
--2024-04-29 14:03:27--  https://www.dropbox.com/s/dl/948jr9cfs7fgj99/UBER.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc97dc5134c824fe0b01c380d543.dl.dropboxusercontent.com/cd/0/get/CR-c8uw4ts0Vn04BN0m6WvbWxmaoxS6IJismg9MVTBxu88CL3Z4eMia51wMHp6DxF1P_QOpeMjgOpC07u7b8ef4dcTA9rJJWc6hV3qb45RajrxjAUQLMuN8E6_9E_NvJPJ_ALJ1GXecf0aMtkT1jZe_S/file?dl=1# [following]
--2024-04-29 14:03:28--  https://uc97dc5134c824fe0b01c380d543.dl.dropboxusercontent.com/cd/0/get/CR-c8uw4ts0Vn04BN0m6WvbWxmaoxS6IJismg9MVTBxu88CL3Z4eMia51wMHp6DxF1P_QOpeMjgOpC07u7b8ef4dcTA9rJJWc6hV3qb45RajrxjA

To parse the HTML files into formatted text, we use the Unstructured library. Thanks to LlamaHub, we can directly integrate with Unstructured, allowing conversion of any text into a Document format that LlamaIndex can ingest.

First we install the necessary packages and download the NLTK tagger data. Then we can use the UnstructuredReader to parse the HTML files into a list of Document objects.

In [9]:
!pip install unstructured

Collecting unstructured
  Downloading unstructured-0.13.5-py3-none-any.whl (1.9 MB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.11.1-py2.py3-none-any.whl (433 kB)
Collecting python-iso639 (from unstructured)
  Downloading python_iso639-2024.4.27-py3-none-any.whl (274 kB)
Collecting langdetect (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)

In [6]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from llama_index.readers.file import UnstructuredReader
from pathlib import Path

years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )

    # tag each document with the year of the filing it came from
    print(year_docs)
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)
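
If the download or unzip step failed silently, the loader only reports the problem once it reaches the missing file. A small sketch of a pre-flight check, assuming the `UBER_{year}.html` naming used above. `locate_filings` is a hypothetical helper and not part of LlamaIndex.

```python
from pathlib import Path


def locate_filings(years, data_dir="./data/UBER"):
    """Map each year to its expected HTML path, failing early if any is missing.

    Hypothetical helper: the file naming mirrors the archive used in this
    tutorial and would need adjusting for other data sets.
    """
    paths = {year: Path(data_dir) / f"UBER_{year}.html" for year in years}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing filings: {', '.join(missing)}")
    return paths
```

The returned paths can be passed straight to `loader.load_data(file=...)` in place of the inline f-string.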

In [11]:
all_docs

[Document(id_='7eb83df2-bf57-4c0d-bd41-9b9e51c1b195', embedding=None, metadata={'year': 2022}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='UNITED STATES\n\nSECURITIES AND EXCHANGE COMMISSION\n\nWashington, D.C. 20549\n\n____________________________________________\n\nFORM\n\n10-K\n\n____________________________________________\n\n(Mark One)\n\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\nFor the fiscal year ended\n\nDecember 31, 2022\n\nOR\n\nTRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\nFor the transition period from_____ to _____\n\nCommission File Number: 001-38902\n\n____________________________________________\n\nUBER TECHNOLOGIES, INC.\n\n(Exact name of registrant as specified in its charter)\n\n____________________________________________\n\nDelaware 45-2647441 (State or other jurisdiction of incorporation or organization) (I.R.S. Employer Identifi

### Setting up Vector Indices for each year

We first set up a vector index for each year. Each vector index allows us
to ask questions about the 10-K filing of a given year.

We build each index and save it to disk.

In [12]:
# initialize simple vector indices
# NOTE: don't run this cell if the indices are already loaded!
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.chunk_size = 512
Settings.chunk_overlap = 64
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")
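
To build intuition for what `chunk_size = 512` and `chunk_overlap = 64` mean, here is a simplified sliding-window sketch over a list of tokens. It is an illustration only: LlamaIndex's own splitter is more sophisticated (for example, it tries to respect sentence boundaries), and `chunk_tokens` is a hypothetical function.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=64):
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# 1000 dummy tokens produce windows starting at 0, 448 and 896
chunks = chunk_tokens(list(range(1000)))
```

The overlap means the last 64 tokens of one chunk reappear at the start of the next, which helps avoid cutting a relevant passage in half at a chunk boundary.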

To load an index from disk, do the following:

In [13]:
# Load indices from disk
from llama_index.core import load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context,
    )
    index_set[year] = cur_index
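
Because building the indices calls the embedding API, it can be worth checking which years already have a persisted index before deciding whether to build or load. A minimal sketch, assuming the `./storage/{year}` layout used above. `split_by_persisted` is a hypothetical helper, and it only checks that the directory is non-empty, not that its contents are valid.

```python
from pathlib import Path


def split_by_persisted(years, storage_root="./storage"):
    """Return (to_load, to_build): years with and without a non-empty storage dir."""
    to_load, to_build = [], []
    for year in years:
        year_dir = Path(storage_root) / str(year)
        has_files = year_dir.is_dir() and any(year_dir.iterdir())
        (to_load if has_files else to_build).append(year)
    return to_load, to_build
```

Years in `to_load` can go through `load_index_from_storage`, while years in `to_build` still need `VectorStoreIndex.from_documents` followed by `persist`.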

### Setting up a Sub Question Query Engine to Synthesize Answers Across 10-K Filings

Since we have access to filings from four years, we may want to ask not only questions about the 10-K document of a given year, but also questions that require analysis over all 10-K filings.

To address this, we can use a Sub Question Query Engine. It decomposes a query into subqueries, each answered by an individual vector index, and synthesizes the results to answer the overall query.

LlamaIndex provides some wrappers around indices (and query engines) so that they can be used by query engines and agents. First we define a QueryEngineTool for each vector index. Each tool has a name and a description; these are what the LLM agent sees to decide which tool to choose.

In [15]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=(
                "useful for when you want to answer queries about the"
                f" {year} SEC 10-K for Uber"
            ),
        ),
    )
    for year in years
]

Now we can create the Sub Question Query Engine, which will allow us to synthesize answers across the 10-K filings. We pass in the individual_query_engine_tools we defined above.

In [19]:
!pip install llama-index-question-gen-openai

Collecting llama-index-question-gen-openai
  Downloading llama_index_question_gen_openai-0.1.3-py3-none-any.whl (2.9 kB)
Collecting llama-index-program-openai<0.2.0,>=0.1.1 (from llama-index-question-gen-openai)
  Downloading llama_index_program_openai-0.1.6-py3-none-any.whl (5.2 kB)
Installing collected packages: llama-index-program-openai, llama-index-question-gen-openai
Successfully installed llama-index-program-openai-0.1.6 llama-index-question-gen-openai-0.1.3


In [20]:
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
)
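
Conceptually, the engine fans a broad question out into narrower ones, sends each to the matching tool, and combines the partial answers. The sketch below mimics that flow with stub tools and a hand-written plan. In the real engine an LLM generates the sub-questions and writes the final answer; every name here (`SubQuestion`, `answer_with_sub_questions`, `stub_tools`) is hypothetical and not LlamaIndex API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubQuestion:
    tool_name: str  # which per-year tool should answer
    question: str   # the narrower question sent to that tool


def answer_with_sub_questions(
    sub_questions: List[SubQuestion],
    tools: Dict[str, Callable[[str], str]],
    synthesize: Callable[[List[str]], str],
) -> str:
    """Route each sub-question to its tool, then combine the partial answers."""
    partial_answers = []
    for sq in sub_questions:
        if sq.tool_name not in tools:
            raise KeyError(f"Unknown tool: {sq.tool_name}")
        partial_answers.append(f"{sq.tool_name}: {tools[sq.tool_name](sq.question)}")
    return synthesize(partial_answers)


# Stub tools stand in for the per-year query engines; no LLM is called here.
stub_tools = {
    "vector_index_2021": lambda q: f"(2021 answer to '{q}')",
    "vector_index_2022": lambda q: f"(2022 answer to '{q}')",
}
plan = [
    SubQuestion("vector_index_2021", "What was revenue in 2021?"),
    SubQuestion("vector_index_2022", "What was revenue in 2022?"),
]
result = answer_with_sub_questions(plan, stub_tools, synthesize="\n".join)
```

Here synthesis is a plain newline join, which keeps the example deterministic; the real engine passes the partial answers back to the LLM instead.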

### Setting up the Chatbot Agent

We use a LlamaIndex Data Agent to set up the outer chatbot agent, which has access to a set of Tools. Specifically, we will use an OpenAIAgent, which takes advantage of OpenAI API function calling. We want to use the separate Tools we defined previously for each index (corresponding to a given year), as well as a tool for the sub question query engine we defined above.

First we define a QueryEngineTool for the sub question query engine:

In [21]:
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description=(
            "useful for when you want to answer queries that require analyzing"
            " multiple SEC 10-K documents for Uber"
        ),
    ),
)

Then, we combine the Tools we defined above into a single list of tools for the agent:

In [22]:
tools = individual_query_engine_tools + [query_engine_tool]

Finally, we call `OpenAIAgent.from_tools` to create the agent, passing in the list of tools we defined above.

In [23]:
from llama_index.agent.openai import OpenAIAgent

agent = OpenAIAgent.from_tools(tools, verbose=True)

### Testing the Agent

We can now test the agent with various queries.

If we test it with a simple "hello" query, the agent does not use any Tools.

In [24]:
response = agent.chat("hi, i am bob")
print(str(response))

Added user message to memory: hi, i am bob
Hello Bob! How can I assist you today?


If we test it with a query regarding the 10-K of a given year, the agent will use
the relevant vector index Tool.

In [25]:
response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))

Added user message to memory: What were some of the biggest risk factors in 2020 for Uber?
=== Calling Function ===
Calling function: vector_index_2020 with args: {"input":"biggest risk factors"}
Got output: The biggest risk factors include the adverse effects of the COVID-19 pandemic on the business, potential reclassification of Drivers, intense competition in the industries, significant losses incurred, challenges in maintaining a critical mass of platform users, operational, compliance, and cultural challenges, negative impact on brand reputation, difficulties in managing growth, safety incidents affecting platform users, risky investments in new offerings and technologies, and uncertainties surrounding the long-term financial impact of the pandemic.

In 2020, some of the biggest risk factors for Uber included the adverse effects of the COVID-19 pandemic on the business, potential reclassification of Drivers, intense competition in the industries, significant losses incurred, chall
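
The verbose log above shows the model asking for `vector_index_2020` with JSON arguments rather than answering directly. A rough sketch of the dispatch step that follows such a request, using a stub tool. This illustrates the general function-calling pattern and is not how `OpenAIAgent` is implemented internally; `dispatch_tool_call` and the shape of the `call` dictionary are assumptions made for the example.

```python
import json


def dispatch_tool_call(call, tools):
    """Run the tool named in `call` with its JSON-encoded arguments."""
    name = call["name"]
    if name not in tools:
        raise KeyError(f"Model requested unknown tool: {name}")
    kwargs = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return tools[name](**kwargs)


def vector_index_2020(input: str) -> str:
    """Stub standing in for the 2020 query engine tool."""
    return f"2020 filing results for: {input}"


stub_tools = {"vector_index_2020": vector_index_2020}
call = {"name": "vector_index_2020", "arguments": '{"input": "biggest risk factors"}'}
output = dispatch_tool_call(call, stub_tools)
```

The tool's output is then handed back to the model, which writes the final reply shown to the user.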

In [26]:
cross_query_str = (
    "Compare/contrast the risk factors described in the Uber 10-K across"
    " years. Give answer in bullet points."
)

response = agent.chat(cross_query_str)
print(str(response))

Added user message to memory: Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points.
=== Calling Function ===
Calling function: vector_index_2019 with args: {"input": "risk factors"}
Got output: Risks related to the personal mobility, meal delivery, and logistics industries being highly competitive with well-established alternatives, low barriers to entry, low switching costs, and strong competitors in major regions could adversely impact the business and financial prospects.

=== Calling Function ===
Calling function: vector_index_2020 with args: {"input": "risk factors"}
Got output: Some of the risk factors that could have an adverse effect on the business include the impact of the COVID-19 pandemic, potential reclassification of Drivers, intense competition in the industries served, significant losses incurred since inception, challenges in attracting and maintaining platform users, operational and compliance challenges, negative med

### Setting up the Chatbot Loop

Now that we have the chatbot set up, it only takes a few more steps to set up a basic interactive loop to chat with our SEC-augmented chatbot!

In [28]:
agent = OpenAIAgent.from_tools(tools)  # verbose=False by default

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = agent.chat(text_input)
    print(f"Agent: {response}")

# User: What were some of the legal proceedings against Uber in 2022?

User: What were some of the legal proceedings against Uber in 2022?
Agent: In 2022, Uber faced various legal proceedings including lawsuits, investigations, and claims. These legal actions covered a wide range of matters such as driver classification, compliance with laws, workplace practices, and intellectual property infringement. The outcomes of these legal proceedings are unpredictable and could result in significant monetary damages, operational limitations, or changes in business practices, all of which may negatively impact Uber's business, financial condition, and operating results. Additionally, Uber's use of arbitration provisions in its terms of service may pose risks to its reputation and brand, potentially leading to increased litigation costs and exposure.
User: exit
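
The loop above stops only on the exact string `exit`. A slightly more forgiving variant is sketched below, assuming you also want to handle blank lines, different capitalization, and Ctrl-D/Ctrl-C. `run_chat_loop` is a hypothetical helper; its `read` and `write` parameters exist so the loop can be tested without a live agent.

```python
def run_chat_loop(respond, read=input, write=print, exit_words=("exit", "quit")):
    """Read user turns until an exit word or end of input; return the turn count."""
    turns = 0
    while True:
        try:
            text = read("User: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl-D / Ctrl-C ends the session cleanly
        if text.lower() in exit_words:
            break
        if not text:
            continue  # skip blank lines instead of sending them to the agent
        write(f"Agent: {respond(text)}")
        turns += 1
    return turns
```

With the agent defined above, `run_chat_loop(agent.chat)` would behave like the original loop.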
