# Creating a more robust RAQA system using LlamaIndex

We'll be putting together a system for querying both qualitative and quantitative data using LlamaIndex.

To stick to a theme, we'll continue to use BarbenHeimer data as our base - but this can, and should, be extended to other topics/domains.

# Build 🏗️
There are 3 main tasks in this notebook:

- Create a Qualitative VectorStore query engine
- Create a quantitative NLtoSQL query engine
- Combine the two using LlamaIndex's OpenAI agent framework.

# Ship 🚢
Create and host a Gradio or Chainlit application to serve your project on Hugging Face Spaces.

# Share 🚀
Make a social media post about your final application and tag @AIMakerspace

### A note on terminology:

You'll notice quite a few similarities between LangChain and LlamaIndex. In some ways, LlamaIndex can be thought of as an extension to LangChain, but with some of the language moved around. Let's spend a few moments disambiguating the terms.

- `QueryEngine` -> `RetrievalQA`:
  -  `QueryEngine` is just LlamaIndex's way of indicating something is an LLM "chain" on top of a retrieval system
- `OpenAIAgent` vs. `ZeroShotAgent`:
  - The two agents have the same fundamental pattern: Decide which of a list of tools to use to answer a user's query.
  - `OpenAIAgent` (LlamaIndex's primary agent) does not need to rely on an agent executor because it leverages OpenAI's [function calling API](https://openai.com/blog/function-calling-and-other-api-updates), which allows the agent to interface "directly" with the tools instead of operating through an intermediary application process.
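
To make the "decide which tool to use" pattern concrete, here is a minimal sketch of what happens once the model has picked a tool. It assumes the function-call message carries a tool `name` and a JSON string of `arguments`; the two tools and the `dispatch` helper are hypothetical stand-ins, not LlamaIndex code.

```python
import json

# Hypothetical tools, standing in for the query engines built later in this notebook.
def lookup_plot(title: str) -> str:
    return f"(plot summary for {title})"

def average_rating(table: str) -> str:
    return f"(average rating from {table})"

TOOLS = {"lookup_plot": lookup_plot, "average_rating": average_rating}

def dispatch(function_call: dict) -> str:
    """Run the tool named in a function-call message and return its result."""
    fn = TOOLS[function_call["name"]]
    kwargs = json.loads(function_call["arguments"])  # arguments arrive as a JSON string
    return fn(**kwargs)

# Assumed shape of the function-call payload the model returns when it picks a tool.
model_message = {"name": "lookup_plot", "arguments": '{"title": "Barbie (film)"}'}
result = dispatch(model_message)
print(result)
```

In a real agent the tool's result is sent back to the model, which either calls another tool or writes the final answer.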

There is, however, a much larger terminological difference when it comes to discussing data.

##### Nodes vs. Documents

As you're aware from the previous weeks' assignments, there's an idea of `documents` in NLP which refers to text objects that exist within a corpus of documents.

LlamaIndex takes this a step further and reclassifies `documents` as `nodes`. Confusingly, it refers to the `Source Document` as simply `Documents`.

The `Document` -> `node` structure is, almost exactly, equivalent to the `Source Document` -> `Document` structure found in LangChain - but the new terminology comes with some clarity about different structure-indices.

We won't be leveraging those structured indices today, but we will be leveraging a "benefit" of the `node` structure that exists as a default in LlamaIndex, which is the ability to quickly filter nodes based on their metadata.

![image](https://i.imgur.com/B1QDjs5.png)
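
As a rough illustration of the `Document` -> `node` relationship and of metadata filtering, the sketch below uses plain dataclasses. `SketchDocument`, `SketchNode`, and `to_nodes` are hypothetical names invented for this example, and the sentence-based split is a simplification of what a real node parser does.

```python
from dataclasses import dataclass, field

@dataclass
class SketchDocument:
    """Stand-in for a full source document (a whole Wikipedia page, say)."""
    text: str
    metadata: dict = field(default_factory=dict)

@dataclass
class SketchNode:
    """Stand-in for a chunk of a document that inherits its metadata."""
    text: str
    metadata: dict = field(default_factory=dict)

def to_nodes(doc, sentences_per_node=1):
    # Naive sentence split; real parsers split on tokens, not on periods.
    sentences = [s.strip() for s in doc.text.split(".") if s.strip()]
    return [
        SketchNode(
            text=". ".join(sentences[i:i + sentences_per_node]) + ".",
            metadata=dict(doc.metadata),  # copy so nodes do not share one dict
        )
        for i in range(0, len(sentences), sentences_per_node)
    ]

docs = [
    SketchDocument("Barbie leaves Barbieland. Ken follows her.", {"title": "Barbie (film)"}),
    SketchDocument("Oppenheimer leads the Manhattan Project.", {"title": "Oppenheimer (film)"}),
]
nodes = [node for doc in docs for node in to_nodes(doc)]

# Metadata filtering: keep only nodes whose title matches exactly.
barbie_nodes = [n for n in nodes if n.metadata.get("title") == "Barbie (film)"]
print(len(nodes), len(barbie_nodes))
```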

In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [51]:
%cd /content/drive/MyDrive/LLMOps-AIMakerpace/LLM-Ops-Cohort-1-main/Week 2

/content/drive/MyDrive/LLMOps-AIMakerpace/LLM-Ops-Cohort-1-main/Week 2


### BOILERPLATE

This is only relevant when running the code in a Jupyter Notebook.

In [1]:
import nest_asyncio

nest_asyncio.apply()

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

### Primary Dependencies and Context Setting

#### Dependencies and OpenAI API key setting

First of all, we'll need our primary libraries - and to set up our OpenAI API key.

In [2]:
!pip install -U -q openai==0.27.8 llama-index==0.8.6 nltk==3.8.1

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")

import openai
openai.api_key = os.environ["OPENAI_API_KEY"]

OpenAI API Key: ··········


#### Context Setting

Now, LlamaIndex has the ability to set `ServiceContext`. You can think of this as a config file of sorts. The basic idea here is that we use this to establish some core properties and then can pass it to various services.

While we could set this up as a global context, we're going to leave it as `ServiceContext` so we can see where it's applied.

We'll set a few significant contexts:

- `chunk_size` - the maximum size, in tokens, of each chunk our documents are split into
- `llm` - this is where we can set what model we wish to use as our primary LLM when we're making `QueryEngine`s and more
- `embed_model` - this will help us keep our embedding model consistent across use cases


We'll also create some resources we're going to keep consistent across all of our indices today.

- `text_splitter` - This is what we'll use to split our text, feel free to experiment here
- `SimpleNodeParser` - This is what will work in tandem with the `text_splitter` to parse our full sized documents into nodes.

In [4]:
from llama_index import ServiceContext
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.llms import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
chunk_size = 1000
chunk_overlap = 100
llm = OpenAI(
    temperature=0,
    model="gpt-4-32k",
    streaming=True
    )

service_context = ServiceContext.from_defaults(
    llm=llm,
    chunk_size=chunk_size,
    embed_model=embed_model
    )

text_splitter = TokenTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
    )

node_parser = SimpleNodeParser(text_splitter=text_splitter)
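
To see how `chunk_size` and `chunk_overlap` interact, here is a small dependency-free sketch of a sliding-window split. `split_with_overlap` is a hypothetical helper written for this example; the real `TokenTextSplitter` works on tokenizer output and may handle boundaries differently.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With the notebook's settings (1000 and 100), each chunk would repeat the last 100 tokens of the previous one, which helps keep sentences that straddle a boundary retrievable.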

### BarbenHeimer Wikipedia Retrieval Tool

Now we can get to work creating our semantic `QueryEngine`!

We'll follow a similar pattern as we did with LangChain here - and the first step (as always) is to get dependencies.

In [5]:
!pip install -U -q chromadb==0.4.6 tiktoken==0.4.0 sentence-transformers==2.2.2 pydantic==1.10.11

In [7]:
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
import chromadb

#### ChromaDB

We'll be using [ChromaDB](https://www.trychroma.com/) as our `VectorStore` today!

It works in a similar fashion to tools like Pinecone, Weaviate, and more - but it's locally hosted and will serve our purposes fine.

You'll also notice the return of `OpenAIEmbedding()`, which is the embeddings model we'll be leveraging. Of course, this is using OpenAI's `text-embedding-ada-002` model under the hood - and already comes equipped with in-memory caching.

You'll notice we can pass our `service_context` into our `VectorStoreIndex`!

In [8]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("wikipedia_barbie_opp")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
wiki_vector_index = VectorStoreIndex([], storage_context=storage_context, service_context=service_context)
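
For intuition about what a vector store does at query time, the sketch below ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors and the `nearest` helper are made up for illustration; ChromaDB uses real embeddings and its own index structures.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query_vec, store, k):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(store.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions.
store = {
    "barbie_plot": [0.9, 0.1, 0.0],
    "oppenheimer_plot": [0.1, 0.9, 0.0],
    "barbie_cast": [0.8, 0.2, 0.1],
}
hits = nearest([1.0, 0.0, 0.0], store, k=2)
print(hits)
```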

In [9]:
!pip install -U -q wikipedia

This is essentially the same as the LangChain example - we're just going to be pulling information straight from Wikipedia using the built-in `WikipediaReader`.

Setting `auto_suggest=False` ensures we run into fewer auto-correct based errors.

In [10]:
from llama_index.readers.wikipedia import WikipediaReader

movie_list = ["Barbie (film)", "Oppenheimer (film)"]

wiki_docs = WikipediaReader().load_data(pages=movie_list, auto_suggest=False)

#### Node Construction

Now we will loop through our documents and metadata and construct nodes (associated with particular metadata for easy filtration later).

We're using the `node_parser` we created at the top of the Notebook.

In [11]:
for movie, wiki_doc in zip(movie_list, wiki_docs):
    # Parse the wiki document as a list into nodes using the previously defined node_parser
    nodes = node_parser.get_nodes_from_documents([wiki_doc])

    for node in nodes:
        # Associate metadata with the node, such as the movie title
        node.metadata = {"title": movie}

    # Add the nodes to the wiki_vector_index
    wiki_vector_index.insert_nodes(nodes)

In [12]:
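# Alternative to the previous cell: wrap each page in a llama_index Document first.
# Run only one of the two cells; running both inserts every node into the index twice.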

from llama_index import Document

for movie, wiki_doc in zip(movie_list, wiki_docs):
    # Ensure that wiki_doc is treated as a Document object if it's not already
    document = wiki_doc if isinstance(wiki_doc, Document) else Document(text=wiki_doc.text)

    # Parse the document into nodes using the correct method
    nodes = node_parser.get_nodes_from_documents([document])

    for node in nodes:
        # Associate metadata with the node, such as the movie title
        node.metadata = {'title': movie}

    # Add the nodes to the wiki_vector_index
    wiki_vector_index.insert_nodes(nodes)

#### Auto Retriever Functional Tool

This tool will leverage OpenAI's function calling endpoint to select the correct metadata filter and query the filtered index - only looking at nodes with the desired metadata.

A simplified diagram: ![image](https://i.imgur.com/AICDPav.png)

First, we need to create our `VectorStoreInfo` object which will hold all the relevant metadata we need for each component (in this case title metadata).

Notice that `metadata_info` takes a list of `MetadataInfo` objects, and that the allowed titles are spelled out as text in the `description`.

In [13]:
from llama_index.tools import FunctionTool
from llama_index.vector_stores.types import (
    VectorStoreInfo,
    MetadataInfo,
    ExactMatchFilter,
    MetadataFilters,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

from typing import List, Tuple, Any
from pydantic import BaseModel, Field

top_k = 3

vector_store_info = VectorStoreInfo(
    content_info="semantic information about movies",
    metadata_info=[MetadataInfo(
        name="title",
        type="str",
        description="title of the movie, one of [Barbie (film), Oppenheimer (film)]",
    )]
)

Now we'll create our base Pydantic object that we can use to ensure compatibility with our application layer. This verifies that the response from the OpenAI endpoint conforms to this schema.

In [14]:
class AutoRetrieveModel(BaseModel):
    query: str = Field(..., description="natural language query string")
    filter_key_list: List[str] = Field(
        ..., description="List of metadata filter field names"
    )
    filter_value_list: List[str] = Field(
        ...,
        description=(
            "List of metadata filter field values (corresponding to names specified in filter_key_list)"
        )
    )

Now we can build the function that the function calling endpoint will invoke.

>The `docstring` is important to the functionality of the application.

In [15]:
def auto_retrieve_fn(
    query: str, filter_key_list: List[str], filter_value_list: List[str]
):
    """Auto retrieval function.

    Performs auto-retrieval from a vector database, and then applies a set of filters.

    """
    query = query or "Query"

    exact_match_filters = [
        ExactMatchFilter(key=k, value=v)
        for k, v in zip(filter_key_list, filter_value_list)
    ]
    retriever = VectorIndexRetriever(
        wiki_vector_index, filters=MetadataFilters(filters=exact_match_filters), top_k=top_k
    )
    query_engine = RetrieverQueryEngine.from_args(retriever)

    response = query_engine.query(query)
    return str(response)
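
One detail worth noting in the function above: `zip` stops at the shorter of its two inputs, so if the model returns key and value lists of different lengths, the extra entries are silently dropped. Below is a dependency-free sketch of a stricter pairing step. `build_exact_match_filters` and `matches` are hypothetical helpers, not part of LlamaIndex.

```python
from typing import Dict, List

def build_exact_match_filters(keys: List[str], values: List[str]) -> Dict[str, str]:
    """Pair filter keys with values, refusing lists of different lengths."""
    if len(keys) != len(values):
        raise ValueError(
            f"got {len(keys)} filter keys but {len(values)} filter values"
        )
    return dict(zip(keys, values))

def matches(node_metadata: Dict[str, str], filters: Dict[str, str]) -> bool:
    """True when every filter key is present in the metadata with the exact value."""
    return all(node_metadata.get(key) == value for key, value in filters.items())

filters = build_exact_match_filters(["title"], ["Barbie (film)"])
print(matches({"title": "Barbie (film)"}, filters))       # exact match on title
print(matches({"title": "Oppenheimer (film)"}, filters))  # different title
```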

Now we need to wrap our system in a tool in order to integrate it into the larger application.

Source Code Here:
- [`FunctionTool`](https://github.com/jerryjliu/llama_index/blob/d24767b0812ac56104497d8f59095eccbe9f2b08/llama_index/tools/function_tool.py#L21)

In [16]:
description = f"""\
Use this tool to look up semantic information about films.
The vector database schema is given below:
{vector_store_info.json()}
"""

auto_retrieve_tool = FunctionTool.from_defaults(
    fn=auto_retrieve_fn, # Function defined earlier (auto_retrieve_fn), which performs the auto-retrieval operation.
    name="Auto_Retriever", # Human-readable name for the tool, allowing you to easily identify it within the application.
    description=description, # Description crafted above
    fn_schema=AutoRetrieveModel # Pydantic model created earlier (AutoRetrieveModel) that describes and validates the function's input arguments.
)

All that's left to do is attach the tool to an OpenAIAgent and let it rip!

Source Code Here:
- [`OpenAIAgent`](https://github.com/jerryjliu/llama_index/blob/d24767b0812ac56104497d8f59095eccbe9f2b08/llama_index/agent/openai_agent.py#L361)

In [17]:
from llama_index.agent import OpenAIAgent

agent = OpenAIAgent.from_tools(
    tools=[auto_retrieve_tool] # The agent can then use this tool to perform the specific functionality defined.
)

In [18]:
response = agent.chat("Tell me what happens (briefly) in the Barbie movie.")
print(str(response))

In the Barbie movie, Barbie and Ken go on a journey of self-discovery after Barbie experiences an existential crisis. The film is a fantasy comedy and features an ensemble cast, including Margot Robbie as Barbie and Ryan Gosling as Ken. It has received critical acclaim and is one of the highest-grossing films of 2023.


### BarbenHeimer SQL Tool

We'll walk through the steps of creating a natural language to SQL system in the following section.

> NOTICE: This does not have parsing on the inputs or intermediary calls to ensure that users are using safe SQL queries. Use this with caution in a production environment without adding specific guardrails from either side of the application.
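
As one possible starting point for such guardrails, the sketch below combines a coarse statement check with SQLite's `query_only` pragma, using only the standard library. `run_read_only` and the `reviews` table are hypothetical; this is not a complete defense, and the simple `;` check will also reject legitimate queries that contain a semicolon inside a string literal.

```python
import sqlite3

def run_read_only(conn, sql):
    """Run a single SELECT statement and reject everything else."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise ValueError("only a single statement is allowed")
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(statement).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (rating REAL)")  # hypothetical table
conn.executemany("INSERT INTO reviews VALUES (?)", [(7.0,), (9.0,)])
conn.commit()

# Second layer: SQLite itself now refuses writes on this connection.
conn.execute("PRAGMA query_only = ON")

rows = run_read_only(conn, "SELECT AVG(rating) FROM reviews")
print(rows)
```

The two layers are independent on purpose: the string check catches obviously unwanted statements early, while the pragma still blocks writes if a generated query slips past it.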

In [19]:
!pip install -q -U sqlalchemy pandas

The next few steps should be largely straightforward. We'll want to:

1. Read in our `.csv` files into `pd.DataFrame` objects
2. Create an in-memory `sqlite` powered `sqlalchemy` engine
3. Cast our `pd.DataFrame` objects to the SQL engine
4. Create an `SQLDatabase` object through LlamaIndex
5. Use that to create a `QueryEngineTool` that we can interact with through the `NLSQLTableQueryEngine`!

If you get stuck, please consult the documentation.
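
For orientation, here is a dependency-free sketch of steps 1 to 3 that swaps pandas and SQLAlchemy for the standard library's `csv` and `sqlite3` modules. The inline CSV and its `rating` and `review` columns are invented for the example; the real review files may have different columns.

```python
import csv
import io
import sqlite3

# Hypothetical review data standing in for the contents of a .csv file.
csv_text = "rating,review\n8,Fun and sharp\n6,Too long\n"

# Step 1: read the CSV into a list of row dictionaries.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Step 2: create an in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# Step 3: load the rows into a table named after the film.
conn.execute("CREATE TABLE barbie (rating REAL, review TEXT)")
conn.executemany("INSERT INTO barbie VALUES (:rating, :review)", rows)

# The kind of SQL a natural-language question about ratings might translate to.
avg_rating = conn.execute("SELECT AVG(rating) FROM barbie").fetchone()[0]
print(avg_rating)
```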

#### Read `.csv` Into Pandas

In [20]:
import pandas as pd

barbie_df = pd.read_csv("/content/drive/MyDrive/LLMOps-AIMakerpace/LLM-Ops-Cohort-1-main/Week 2/barbie_data/barbie.csv")
oppenheimer_df = pd.read_csv("/content/drive/MyDrive/LLMOps-AIMakerpace/LLM-Ops-Cohort-1-main/Week 2/oppenheimer_data/oppenheimer.csv")

#### Create SQLAlchemy engine with SQLite

In [21]:
from sqlalchemy import create_engine

engine = create_engine("sqlite+pysqlite:///:memory:")

#### Convert `pd.DataFrame` to SQL tables

In [22]:
barbie_df.to_sql(
    name='barbie', # name of table
    con=engine # engine
)

125

In [23]:
oppenheimer_df.to_sql(
    name='oppenheimer', # name of table
    con=engine # engine
)

150

#### Construct a `SQLDatabase` index

Source Code Here:
- [`SQLDatabase`](https://github.com/jerryjliu/llama_index/blob/d24767b0812ac56104497d8f59095eccbe9f2b08/llama_index/langchain_helpers/sql_wrapper.py#L9)

In [24]:
from llama_index import SQLDatabase

sql_database = SQLDatabase(
    engine=engine, # SQLAlchemy engine connected to your SQLite database
    include_tables=['barbie', 'oppenheimer']) # List of table names to include in the index

#### Create the NLSQLTableQueryEngine interface for all added SQL tables

Source Code Here:
- [`NLSQLTableQueryEngine`](https://github.com/jerryjliu/llama_index/blob/d24767b0812ac56104497d8f59095eccbe9f2b08/llama_index/indices/struct_store/sql_query.py#L75C1-L75C1)

In [25]:
from llama_index.indices.struct_store.sql_query import NLSQLTableQueryEngine

sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database, # The SQLDatabase object created earlier
    tables=['barbie', 'oppenheimer']) # List of table names to query

#### Wrap It All Up in a `QueryEngineTool`

You'll want to ensure you have a descriptive...description.

An example is provided here:

```
"Useful for translating a natural language query into a SQL query over a table containing: "
"barbie, containing information related to reviews of the Barbie movie"
"oppenheimer, containing information related to reviews of the Oppenheimer movie"
```

Source Code Here:

- [`QueryEngineTool`](https://github.com/jerryjliu/llama_index/blob/d24767b0812ac56104497d8f59095eccbe9f2b08/llama_index/tools/query_engine.py#L13)

In [35]:
from llama_index.tools.query_engine import QueryEngineTool

sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine, # The NLSQLTableQueryEngine object created earlier
    name="Natural_Language_to_SQL_Tool",
    description=(
        "Useful for translating a natural language query into a SQL query, in which multiple conditions can be provided over tables containing information related to:"
        " 'barbie', containing information related to reviews of the Barbie movie;"
        " 'oppenheimer', containing information related to reviews of the Oppenheimer movie."
    ),
)

In [36]:
agent = OpenAIAgent.from_tools(
    tools=[sql_tool] # The QueryEngineTool object created earlier, passed as a list
)

In [37]:
response = agent.chat("What is the average rating of the two films?")

In [29]:
print(str(response))

The average rating of the two films is 7.36 and 8.35.


### Combining The Tools Together

Now, we can simply add our tools into the `OpenAIAgent`, and off we go!

In [38]:
barbenheimer_agent = OpenAIAgent.from_tools(
    tools=[auto_retrieve_tool, sql_tool] # Combining both tools now
)

In [39]:
response = barbenheimer_agent.chat("What is the lowest rating of the two films - and can you summarize what the reviewer said?")

In [40]:
print(str(response))

I'm sorry, but I couldn't find the lowest rating for the film Barbie in the available information.


In [41]:
response = barbenheimer_agent.chat("How many times do the Barbie reviews mention 'Ken', and what is a summary of his character in the Barbie movie?")

In [42]:
print(str(response))

In the Barbie movie, the character Ken is mentioned multiple times in the reviews. He is portrayed as having low self-esteem and seeking approval from Barbie. The casting process for the film involved considering various actors for the role of Ken, including Ryan Gosling. The film explores the negative consequences of hierarchical power structures, with the director, Greta Gerwig, highlighting that "Barbies rule and Kens are an underclass." Ken also has a power ballad in the film, which is seen as a moment when the movie goes beyond traditional expectations for a Barbie movie.
