LlamaIndex tools
1. Data Connectors
2. Documents/ Nodes
3. Data Indexes

Components of a Retrieval-Augmented Generation (RAG) workflow

**Knowledge Base (Input)**: The knowledge base is like a library filled with useful information such as FAQs, manuals, and other relevant documents. When a question is asked, this is where the system looks to find the answer.

**Trigger/Query (Input)**: This is the spark that gets things going. Typically, it's a question or request from a customer that signals the system to spring into action.

**Task/Action (Output)**: After understanding the trigger or query, the system performs a task to address it. For instance, if it's a question, the system works on providing an answer; if it's a request for a specific action, it carries out that action.

#### Two stages using LlamaIndex to provide inputs to our RAG mechanism

**1. Indexing Stage:** Preparing a knowledge base.
**2. Querying Stage:** Harnessing the knowledge base and the LLM to respond to your queries by generating the final output or performing the final task.
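
The two stages can be illustrated without any library at all. The sketch below is a toy, not how LlamaIndex works internally: the "index" is a bag-of-words count per text, and `fake_llm` is a hypothetical stand-in for a real LLM call.

```python
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(knowledge_base):
    """Indexing stage: store each text with its bag-of-words representation."""
    return [(text, Counter(tokenize(text))) for text in knowledge_base]

def answer(index, query, generate):
    """Querying stage: pick the best-matching text, then hand it to the LLM stub."""
    query_tokens = tokenize(query)
    best_text, _ = max(index, key=lambda item: sum(item[1][t] for t in query_tokens))
    return generate(query, best_text)

# Hypothetical stand-in for an LLM call: it only echoes the retrieved context.
def fake_llm(query, context):
    return f"Q: {query} | context: {context}"

knowledge_base = [
    "Refunds are processed within five business days.",
    "Our support desk is open from 9am to 5pm on weekdays.",
]
index = build_index(knowledge_base)
print(answer(index, "When is the support desk open?", fake_llm))
```

The index is built once and reused for every query, which is the reason the two stages are kept separate.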

**Components of the indexing stage:**

**1. Data Connectors:** Also called readers, these ingest data from diverse sources and formats into a unified document representation.

**2. Documents/ Nodes:** A document is your container for data, whether it comes from a PDF or a database.

A node, on the other hand, is a snippet of a document, enriched with metadata and relationships, paving the way for precise retrieval operations.

**3. Data Indexes:** Post ingestion, LlamaIndex assists in arranging the data into a retrievable format. This process involves parsing, embedding, and metadata inference, and ultimately results in the creation of the knowledge base.

**Components of the querying stage:**

1. Query Engines: These are your end-to-end conduits for querying your data, taking a natural language query and returning a response.

2. Chat Engines: They elevate the interaction to a conversational level, allowing back-and-forths with your data.

3. Agents: Agents are your automated decision-makers, interacting with the world through a toolkit, and manoeuvring through tasks with a dynamic action plan rather than a fixed logic.

A few common building blocks:

- **Retrievers:** They dictate the technique for fetching relevant context from the knowledge base for a query. For example, dense retrieval against a vector index is a prevalent approach.

- **Node Postprocessors:** They refine the set of nodes through transformation, filtering, or re-ranking.

- **Response Synthesizers:** They channel the LLM to generate responses, blending the user query with retrieved text chunks.
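
How these three building blocks chain together can be sketched in plain Python. This is a simplified illustration with made-up function names (`retrieve`, `postprocess`, `synthesize`), using word overlap in place of real embeddings and stopping at the prompt an LLM would receive.

```python
def retrieve(chunks, query, top_k=3):
    """Retriever: rank chunks by how many query words they share, keep the top_k."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(chunk.lower().split())), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def postprocess(scored_chunks, min_score=1):
    """Node postprocessor: filter out chunks whose score falls below min_score."""
    return [(score, chunk) for score, chunk in scored_chunks if score >= min_score]

def synthesize(query, scored_chunks):
    """Response synthesizer: build the prompt an LLM would receive."""
    context = "\n".join(chunk for _, chunk in scored_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "boston has five major sports teams",
    "tokyo is the capital of japan",
    "the boston celtics play basketball",
]
query = "which sports teams play in boston"
prompt = synthesize(query, postprocess(retrieve(chunks, query)))
print(prompt)
```

Keeping the stages separate means each one can be swapped independently, for example a re-ranker in place of the simple score filter.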

In [2]:
%pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0
Note: you may need to restart the kernel to use updated packages.


In [32]:
# %pip uninstall -y llama-index

Found existing installation: llama-index 0.9.3.post1
Uninstalling llama-index-0.9.3.post1:
  Successfully uninstalled llama-index-0.9.3.post1
Note: you may need to restart the kernel to use updated packages.


In [33]:
%pip install llama-index==0.9.0

Collecting llama-index==0.9.0
  Downloading llama_index-0.9.0-py3-none-any.whl.metadata (8.2 kB)
Downloading llama_index-0.9.0-py3-none-any.whl (869 kB)
Installing collected packages: llama-index
Successfully installed llama-index-0.9.0
Note: you may need to restart the kernel to use updated packages.


By default, LlamaIndex uses OpenAI's gpt-3.5-turbo for text generation and text-embedding-ada-002 for retrieval and embeddings.

To set up LlamaCPP, follow its setup guide [here](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html).

With that setup, LlamaCPP and llama2-chat-13B handle text generation, and BAAI/bge-small-en handles retrieval and embeddings.

In [3]:
from dotenv import load_dotenv

load_dotenv()

True
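
The `True` above suggests that a `.env` file was found and loaded into the process environment. LlamaIndex's OpenAI integration reads its key from the `OPENAI_API_KEY` environment variable, so a quick check like the one below can catch a missing key early. The helper name `has_api_key` is hypothetical and uses only the standard library.

```python
import os

def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True when the named variable is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())

# Checks the real process environment; the result depends on your .env file.
print(has_api_key())
```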

### Creating LlamaIndex Documents

LlamaHub is an open-source repository hosting data connectors that can be seamlessly integrated into any LlamaIndex application. All the connectors listed there can be used as shown below ([find more](https://llamahub.ai/)).

In [None]:
from llama_index import download_loader

GoogleDocsReader = download_loader('GoogleDocsReader')
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=[...])

Let's try a couple of data sources:

1. PDF file: We use the **SimpleDirectoryReader** data connector for this. The example below loads Alphabet's (Google's) 10-K report.

In [6]:
%pip install pypdf

Collecting pypdf
  Downloading pypdf-3.17.1-py3-none-any.whl.metadata (7.5 kB)
Downloading pypdf-3.17.1-py3-none-any.whl (277 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.17.1
Note: you may need to restart the kernel to use updated packages.


In [7]:
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_files = ["20230203_alphabet_10K.pdf"]
)

pdf_documents = reader.load_data()

2. Wikipedia Page: We will use the Wikipedia loader from LlamaHub.

In [8]:
from llama_index import download_loader

wikipedia_reader = download_loader("WikipediaReader")

loader = wikipedia_reader()
wikipedia_documents = loader.load_data(pages = ['Delhi','Tokyo','Iceland'])

In [11]:
wikipedia_documents

[Document(id_='2d701003-a778-49ee-b370-4fe69c114495', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='24e721bad80924a2f6f2eb0f6345893c3531895bc2e744164efb84840bd3b04d', text='Del, or nabla, is an operator used in mathematics (particularly in vector calculus) as a vector differential operator, usually represented by the nabla symbol ∇. When applied to a function defined on a one-dimensional domain, it denotes the standard derivative of the function as defined in calculus. When applied to a field (a function defined on a multi-dimensional domain), it may denote any one of three operations depending on the way it is applied: the gradient or (locally) steepest slope of a scalar field (or sometimes of a vector field, as in the Navier–Stokes equations); the divergence of a vector field; or the curl (rotation) of a vector field.\nDel is a very convenient mathematical notation for those three operations (gradient, divergence,

## Creating LlamaIndex Nodes

In LlamaIndex, once the data has been ingested and represented as Documents, there's an option to further process these Documents into Nodes.

Nodes are more granular data entities that represent "chunks" of source Documents, which could be text chunks, images, or other types of data. 



#### 1. Basic - SimpleNodeParser

To parse Documents into nodes, LlamaIndex provides NodeParser classes. These classes help in automatically transforming the content of Documents into Nodes, adhering to a specific structure that can be utilized further in index construction and querying.

In [13]:
from llama_index.node_parser import SimpleNodeParser
from llama_index import SimpleDirectoryReader

# loading the document
reader = SimpleDirectoryReader(
    input_files = ["20230203_alphabet_10K.pdf"]
)

pdf_documents = reader.load_data()

# Initialize the parser
parser = SimpleNodeParser.from_defaults(
    chunk_size = 1024,
    chunk_overlap = 20
    )

# Parse documents into nodes
nodes = parser.get_nodes_from_documents(pdf_documents)
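
The `chunk_size` and `chunk_overlap` arguments above control how large each node is and how much text consecutive nodes share. The sketch below shows the idea on a plain list of words. It is a simplification: the real parser counts tokens and tries to respect sentence boundaries, and `chunk_tokens` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.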

In [14]:
nodes

[TextNode(id_='4c8f3f15-4b80-4021-8333-92291e90319b', embedding=None, metadata={'page_label': '1', 'file_name': '20230203_alphabet_10K.pdf', 'file_path': '20230203_alphabet_10K.pdf', 'file_type': 'application/pdf', 'file_size': 897814, 'creation_date': '2023-11-19', 'last_modified_date': '2023-11-19', 'last_accessed_date': '2023-11-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='729a8903-6e23-4edf-ad52-64da79fc7283', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': '20230203_alphabet_10K.pdf', 'file_path': '20230203_alphabet_10K.pdf', 'file_type': 'application/pdf', 'file_size': 897814, 'creation_date': '2023-11-19', 'last_modified_date': '2023-11-19', 'last_accessed_dat

#### 2. Advanced

Various customization options include:

- text_splitter (default: TokenTextSplitter)
- include_metadata (default: True)
- include_prev_next_rel (default: True)
- metadata_extractor (default: None)

### Text Splitter Customization

Customize text splitter, using either `SentenceSplitter`, `TokenTextSplitter`, or `CodeSplitter` from `llama_index.text_splitter`.

#### SentenceSplitter

In [35]:
import tiktoken
from llama_index.text_splitter import SentenceSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext,VectorStoreIndex

text_splitter = SentenceSplitter(
    separator = " ", 
    chunk_size = 1024,
    chunk_overlap = 20,
    paragraph_separator = "\n\n\n",
    secondary_chunking_regex = "[^,.;。]+[,.;。]?",
    tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

service_context = ServiceContext.from_defaults(text_splitter = text_splitter)

index = VectorStoreIndex.from_documents(
    pdf_documents, service_context=service_context
)

# node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)

#### CodeSplitter

In [39]:
# %pip install tree-sitter-languages

In [38]:
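from llama_index import ServiceContext, VectorStoreIndex
from llama_index.text_splitter import CodeSplitter

# CodeSplitter requires the tree-sitter-languages package (see the cell above)
text_splitter = CodeSplitter(
    language = "python",
    chunk_lines = 40,
    chunk_lines_overlap = 15,
    max_chars = 1500,
)

service_context = ServiceContext.from_defaults(text_splitter = text_splitter)

# from_documents returns an index, so name the result accordingly
index = VectorStoreIndex.from_documents(
    pdf_documents, service_context=service_context
)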

ImportError: Please install tree_sitter_languages to use CodeSplitter.

### SentenceWindowNodeParser

For narrowly scoped embeddings, use SentenceWindowNodeParser to split documents into individual sentences while also capturing a window of the surrounding sentences for each one.

In [40]:
import nltk

from llama_index.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size = 3,
    window_metadata_key = "window",
    original_text_metadata_key = "original_sentence"
)
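
To make the window idea concrete, here is a rough standard-library sketch. It assumes a naive regex sentence splitter and a hypothetical helper `sentence_windows`; the dictionary keys simply mirror the metadata keys configured above.

```python
import re

def sentence_windows(text, window_size=3):
    """Return one record per sentence with its neighbours joined as a window."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({
            "original_sentence": sentence,
            "window": " ".join(sentences[start:end]),
        })
    return records

text = "A is first. B is second. C is third. D is fourth."
for record in sentence_windows(text, window_size=1):
    print(record)
```

The single sentence gives a sharply focused unit to embed, while the window supplies the surrounding context that can later be handed to the LLM.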

## Manual Node Creation

For more control, manually create Node objects and define attributes and relationships.

In [41]:
from llama_index.schema import TextNode, NodeRelationship, RelatedNodeInfo

# Create TextNode objects
node1 = TextNode(text = "<text_chunk>", id_ = "<node_id>")
node2 = TextNode(text = "<text_chunk>", id_ = "<node_id>")

# Define node relationships
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id = node2.node_id)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id = node1.node_id)

# Gather nodes
nodes = [node1, node2]

In this snippet, `TextNode` creates nodes with text content while `NodeRelationship` and `RelatedNodeInfo` define node relationships.

In [42]:
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    chunk_size = 1024,
    chunk_overlap = 20,
    )

pdf_nodes = parser.get_nodes_from_documents(pdf_documents)
wikipedia_nodes = parser.get_nodes_from_documents(wikipedia_documents)

### Create Index with Nodes and Documents

The core of LlamaIndex lies in its ability to build structured indices over ingested data, represented as either Documents or Nodes. This indexing facilitates efficient querying over the data.

In [None]:
# Building index from documents
from llama_index import VectorStoreIndex

# pdf_documents is the list of Document objects loaded earlier
index = VectorStoreIndex.from_documents(pdf_documents)

#### Types of indices in LlamaIndex and how they handle data

- **Summary Index:** Stores Nodes as a sequential chain. All Nodes are loaded into the response synthesis module if no other query parameters are specified.

- **Vector Store Index:** Stores each Node and a corresponding embedding in a vector store, and queries involve fetching the top-k most similar Nodes.

- **Tree Index:** Builds a hierarchical tree from a set of Nodes, and queries involve traversing from root nodes down to leaf nodes.

- **Keyword Table Index:** Extracts keywords from each Node to build a mapping, and queries extract relevant keywords to fetch corresponding Nodes.
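
Two of these lookup strategies are easy to contrast in a few lines. The sketch below is illustrative only: the "embedding" is a word-count vector rather than a real model, and every name in it (`keyword_lookup`, `embed`, `top_k`) is made up for the example.

```python
import math
from collections import defaultdict

nodes = {
    "n1": "boston red sox baseball",
    "n2": "seattle rain coffee",
    "n3": "boston celtics basketball",
}

# Keyword table index: map each keyword to the ids of the nodes containing it.
keyword_table = defaultdict(set)
for node_id, text in nodes.items():
    for word in text.split():
        keyword_table[word].add(node_id)

def keyword_lookup(query):
    """Return the ids of every node that shares a keyword with the query."""
    hits = set()
    for word in query.split():
        hits |= keyword_table.get(word, set())
    return sorted(hits)

# Vector store index: keep one vector per node and rank by cosine similarity.
vocab = sorted({w for text in nodes.values() for w in text.split()})

def embed(text):
    """Toy embedding: a word-count vector over the vocabulary (not a real model)."""
    words = text.split()
    return [words.count(term) for term in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vectors = {node_id: embed(text) for node_id, text in nodes.items()}

def top_k(query, k=2):
    """Return the k node ids whose vectors are most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(vectors, key=lambda nid: cosine(query_vector, vectors[nid]), reverse=True)
    return ranked[:k]

print(keyword_lookup("boston basketball"))
print(top_k("boston basketball", k=1))
```

The keyword table returns every node with any matching keyword, whereas the vector lookup ranks nodes and keeps only the closest ones.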

### Document Summary Index

In [43]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

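    # Create the output folder if it does not exist yet
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # Write as UTF-8 so non-ASCII characters in the article do not fail on Windows
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)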

In [44]:
from llama_index import (SimpleDirectoryReader,
                         ServiceContext,
                         )
# Load all wiki documents
city_docs = []
for wiki_title in wiki_titles:
    docs = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
    docs[0].doc_id = wiki_title
    city_docs.extend(docs)

In [None]:
from llama_index.llms import OpenAI

# LLM (gpt-3.5-turbo)
chatgpt = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=chatgpt, chunk_size=1024)

In [49]:
import nest_asyncio

nest_asyncio.apply()

In [50]:
from llama_index import get_response_synthesizer
from llama_index.indices.document_summary import DocumentSummaryIndex

# default mode of building the index
response_synthesizer = get_response_synthesizer(
    response_mode = "tree_summarize",
    use_async = True
)

doc_summary_index = DocumentSummaryIndex.from_documents(
    city_docs,
    service_context = service_context,
    response_synthesizer = response_synthesizer,
    show_progress = True,
)

Parsing nodes: 100%|██████████| 5/5 [00:00<00:00,  9.60it/s]
Summarizing documents:   0%|          | 0/5 [00:00<?, ?it/s]

current doc id: Toronto


Summarizing documents:  20%|██        | 1/5 [00:08<00:32,  8.19s/it]

current doc id: Seattle


Summarizing documents:  40%|████      | 2/5 [00:15<00:22,  7.57s/it]

current doc id: Chicago


Summarizing documents:  60%|██████    | 3/5 [00:22<00:14,  7.34s/it]

current doc id: Boston


Summarizing documents:  80%|████████  | 4/5 [00:29<00:07,  7.41s/it]

current doc id: Houston


Summarizing documents: 100%|██████████| 5/5 [00:37<00:00,  7.41s/it]
Generating embeddings: 100%|██████████| 5/5 [00:00<00:00,  5.51it/s]
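
The `tree_summarize` response mode used above combines text bottom-up until a single answer remains. A rough sketch of that idea follows. It is a simplification under stated assumptions: `fake_summarize` is a hypothetical stand-in for an LLM call, and the fixed `fan_in` grouping replaces the context-window packing a real implementation would do.

```python
def tree_summarize(chunks, summarize, fan_in=2):
    """Repeatedly summarize groups of fan_in texts until a single text remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(chunks)
    while len(level) > 1:
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]

# Hypothetical stand-in for an LLM summarization call: keep the first word of each text.
def fake_summarize(texts):
    return " ".join(text.split()[0] for text in texts)

print(tree_summarize(["alpha one", "beta two", "gamma three"], fake_summarize))
```

Each pass shrinks the number of texts, so the number of summarization rounds grows only logarithmically with the number of chunks.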


In [54]:
doc_summary_index.get_document_summary("Boston")

"The provided text is about the city of Boston and covers various aspects of the city, including its history, geography, neighborhoods, climate, demographics, economy, education system, healthcare facilities, public safety, culture, environment, religious institutions, sports, parks and recreation, government and politics, media, television and film production, transportation infrastructure, international relations, and further reading materials about the city.\n\nSome questions that this text can answer include:\n- What is the history of Boston and how did it develop over time?\n- What were some key events that took place in Boston during the American Revolution?\n- What is Boston's economic base and what industries are prominent in the city?\n- What is Boston's role in higher education and academic research?\n- What is Boston known for in terms of environmental sustainability and new investment?\n- What are some of the major institutions and colleges in Boston?\n- What were some sign

In [52]:
# Persist the index to disk
# index.storage_context.persist(persist_dir="<persist_dir>")
doc_summary_index.storage_context.persist("index")

In [53]:
# load from persistent storage
from llama_index.indices.loading import load_index_from_storage
from llama_index import StorageContext

# rebuild storage
# storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")
storage_context = StorageContext.from_defaults(persist_dir = "index")
doc_summary_index = load_index_from_storage(storage_context)

#### High-level Querying

This uses the default, embedding-based form of retrieval.

In [55]:
query_engine = doc_summary_index.as_query_engine(
    response_mode = "tree_summarize",
    use_async = True,
)

In [58]:
response = query_engine.query("What are the sports teams in Boston?")
print(len(str(response)))
print(response)

217
The sports teams in Boston include the Boston Red Sox (baseball), the Boston Celtics (basketball), the New England Patriots (American football), the Boston Bruins (ice hockey), and the New England Revolution (soccer).


#### LLM-based Retrieval

In [59]:
from llama_index.indices.document_summary import DocumentSummaryIndexLLMRetriever

retriever = DocumentSummaryIndexLLMRetriever(
    doc_summary_index,
    # index (DocumentSummaryIndex): the index to retrieve from (passed above)
    # Optional keyword arguments:
    # choice_select_prompt (Optional[BasePromptTemplate]): prompt used to select relevant summaries
    # choice_batch_size (int): number of summary nodes sent to the LLM at a time, e.g. choice_batch_size = 10
    # choice_top_k (int): number of summary nodes to retrieve, e.g. choice_top_k = 1
    # format_node_batch_fn (Callable): function that formats a batch of nodes for the LLM
    # parse_choice_select_answer_fn (Callable): function that parses the LLM response
    # service_context (ServiceContext): the service context to use

)

In [60]:
retrieved_nodes = retriever.retrieve("What are the sports teams in Boston?")

In [62]:
print(len(retrieved_nodes))
print(retrieved_nodes[0].score)
print(retrieved_nodes[0].node.get_text())

87
10.0
Boston (US: ), officially the City of Boston, is the capital and most populous city of the U.S. state of Massachusetts, and the cultural and financial center of New England in the Northeastern United States, with an area of 48.4 sq mi (125 km2) and a population of 675,647 in 2020. The Greater Boston metropolitan statistical area is the eleventh-largest in the country.Boston is one of the United States's oldest municipalities. It was founded on the Shawmut Peninsula in 1630 by Puritan settlers from Boston, Lincolnshire. During the American Revolution, Boston was the location of several key events, including the Boston Massacre, the Boston Tea Party, the hanging of Paul Revere's lantern signal in Old North Church, the Battle of Bunker Hill, and the siege of Boston. Following American independence from Great Britain, the city continued to play an important role as a port, manufacturing hub, and center for education and culture. The city has expanded beyond the original peninsula t
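
The count of 87 nodes may look surprising for a single question. It is consistent with the retriever choosing documents by their summaries and then returning every chunk of the chosen document. The toy below illustrates that two-step lookup. The data, the word-overlap score, and the helper names are assumptions made for the example, not LlamaIndex internals.

```python
# Toy document-summary index: one summary per document, plus the ids of its chunks.
summaries = {
    "Boston": "history economy and sports teams of boston",
    "Seattle": "geography climate and technology industry of seattle",
}
doc_to_nodes = {
    "Boston": ["boston-0", "boston-1", "boston-2"],
    "Seattle": ["seattle-0", "seattle-1"],
}

def score_summary(query, summary):
    """Hypothetical relevance score: shared word count (an LLM or embeddings would go here)."""
    return len(set(query.lower().split()) & set(summary.split()))

def retrieve_by_summary(query, top_k=1):
    """Choose the top_k documents by summary score and return all of their node ids."""
    ranked = sorted(summaries, key=lambda doc: score_summary(query, summaries[doc]), reverse=True)
    node_ids = []
    for doc_id in ranked[:top_k]:
        node_ids.extend(doc_to_nodes[doc_id])
    return node_ids

print(retrieve_by_summary("sports teams in boston"))
```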

In [64]:
# use retriever as part of a query engine
from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode = "tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever = retriever,
    response_synthesizer = response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Boston?")
print(len(str(response)))
print(response)

217
The sports teams in Boston include the Boston Red Sox (baseball), the Boston Celtics (basketball), the New England Patriots (American football), the Boston Bruins (ice hockey), and the New England Revolution (soccer).


#### Embedding-based Retrieval

In [66]:
from llama_index.indices.document_summary import DocumentSummaryIndexEmbeddingRetriever

retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    # similarity_top_k = 1,
)

retrieved_nodes = retriever.retrieve("What are the sports teams in Boston?")
print(len(retrieved_nodes))
print(retrieved_nodes[0].node.get_text())

87
Boston (US: ), officially the City of Boston, is the capital and most populous city of the U.S. state of Massachusetts, and the cultural and financial center of New England in the Northeastern United States, with an area of 48.4 sq mi (125 km2) and a population of 675,647 in 2020. The Greater Boston metropolitan statistical area is the eleventh-largest in the country.Boston is one of the United States's oldest municipalities. It was founded on the Shawmut Peninsula in 1630 by Puritan settlers from Boston, Lincolnshire. During the American Revolution, Boston was the location of several key events, including the Boston Massacre, the Boston Tea Party, the hanging of Paul Revere's lantern signal in Old North Church, the Battle of Bunker Hill, and the siege of Boston. Following American independence from Great Britain, the city continued to play an important role as a port, manufacturing hub, and center for education and culture. The city has expanded beyond the original peninsula throug

In [67]:
# use retriever as part of a query engine

from llama_index.query_engine import RetrieverQueryEngine

# configure response synthesizer
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever = retriever,
    response_synthesizer = response_synthesizer,
)

# query
response = query_engine.query("What are the sports teams in Boston?")
print(response)

The sports teams in Boston include the Boston Red Sox (baseball), the Boston Celtics (basketball), the New England Patriots (American football), the Boston Bruins (ice hockey), and the New England Revolution (soccer).


## Using Index to Query Data

After having established a well-structured index using LlamaIndex, the next pivotal step is querying this index to extract meaningful insights or answers to specific inquiries. 

In the simplest approach, the `as_query_engine()` method creates a query engine from your index, and the `query()` method executes a query.

By default, `index.as_query_engine()` creates a query engine with LlamaIndex's default settings.


In [None]:

# Assuming 'index' is your constructed index object
query_engine = index.as_query_engine()
response = query_engine.query("your_query")
print(response)

For complex queries that span multiple data sources, LlamaIndex provides the `sub question query engine`.

It first breaks the complex query down into sub-questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response.
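
The decompose, answer, and synthesize flow can be sketched in plain Python. Everything below is a hypothetical stand-in: the per-source "engines" just echo the question, and `decompose` and `synthesize` replace what would be LLM calls in the real engine.

```python
def answer_complex_query(query, sources, decompose, synthesize):
    """Ask each relevant source its own sub-question, then merge the answers."""
    sub_questions = decompose(query, list(sources))  # [(source_name, sub_question), ...]
    intermediate = [
        (sub_question, sources[name](sub_question))
        for name, sub_question in sub_questions
    ]
    return synthesize(query, intermediate)

# Hypothetical per-source query engines: each one only echoes the question it was asked.
sources = {
    "Boston": lambda q: f"[Boston engine] answer to: {q}",
    "Seattle": lambda q: f"[Seattle engine] answer to: {q}",
}

def decompose(query, source_names):
    """Toy decomposition: one sub-question per source named in the query."""
    return [
        (name, f"What is the population of {name}?")
        for name in source_names
        if name.lower() in query.lower()
    ]

def synthesize(query, intermediate):
    """Toy synthesis: list the intermediate answers under the original query."""
    lines = [f"- {answer}" for _, answer in intermediate]
    return "\n".join([query] + lines)

print(answer_complex_query(
    "Compare the population of Boston and Seattle", sources, decompose, synthesize
))
```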

To learn more, see the [sub question query engine documentation](https://docs.llamaindex.ai/en/stable/examples/query_engine/sub_question_query_engine.html).