In [8]:
import os
apiKey = os.environ["OPENAI_API_KEY"]  # read the key from the environment instead of hard-coding a secret

In [1]:
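# The imports in this notebook (llama_index.llms, ServiceContext, ...) use the pre-0.10 package layout
!pip3 install openai --quiet
!pip3 install "llama-index<0.10" --quiet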

__Basic Tutorial__
<br>
<br>
Building an __LLM__ Application


In [None]:
from llama_index.llms import OpenAI

response = OpenAI(api_key=apiKey).complete("Paul Graham is ")
print(response)

In the example below, OpenAI's gpt-4 model serves as the LLM, and an OpenAI embedding model is used to build a vector store over Paul Graham's essay, which lives in the data folder.
<br>
<br>

_ServiceContext_ - to configure the application instead of relying on the defaults (as in the cell above)
<br>
_SimpleDirectoryReader_ - to load data (documents) from the folder (in this case "data")
<br>
_VectorStoreIndex_ - to create a vector store from documents loaded using SimpleDirectoryReader
<br>
<br>

_llama_index.llms_ - has all the LLMs that can be used
<br>
_llama_index.embeddings_ - has all the embeddings that can be used

In [59]:
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import OpenAIEmbedding

llm = OpenAI(api_key=apiKey, temperature=0.1, model="gpt-4")
embed_model = OpenAIEmbedding(api_key=apiKey)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

In [None]:
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

In [None]:
print(documents)

## Loading Data

Data has to be loaded before an LLM can use it to answer our queries. This is analogous to the data cleaning, feature engineering, or ETL pipelines of classical ML.
<br>
<br>
There are three main stages in this pipeline:
1. Load the data
2. Transform the data
3. Index and store the data

<br>
Various ways of ingesting data

### Loaders
Data is loaded using data connectors, also known as Readers. Data connectors ingest data from different sources and format it into Document objects.
<br>
<br>

_Document_ -> It is a collection of data and metadata about that data.


In [None]:
# Using SimpleDirectoryReader

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

Data comes from many places, and not every source has a built-in reader, but LlamaHub offers a large collection of additional connectors.
<br>
https://llamahub.ai/?tab=loaders
<br>
<br>
In the example below, a database reader connects to a database and loads the results of a query as documents.

In [None]:
from llama_index import download_loader
import os

DatabaseReader = download_loader("DatabaseReader")

reader = DatabaseReader(
    scheme=os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
)

query = "SELECT * FROM users"
documents = reader.load_data(query=query)

In [None]:
# Alternatively, we can create Document objects directly instead of loading them

from llama_index.schema import Document

doc = Document(text="text")

### Transformations

Data needs to be processed and transformed before placing it in a storage system. Transformations include chunking, extracting metadata, and embedding each chunk. 
<br>
<br>
Transformations take Node objects as input and return Node objects as output (a Document is a subclass of a Node). They can be stacked and reordered.

#### High Level Transformations API

The _.from_documents()_ method of VectorStoreIndex accepts a list of Document objects and parses and chunks them for you.
<br>
<br>
Under the hood, this splits each document into Node objects, which are similar to Documents (they contain text and metadata) but keep a relationship to their parent Document.

In [None]:
# High Level Transformations API

from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
vector_index.as_query_engine()

For more customization, such as splitting the text into chunks of a given size, SentenceSplitter can be used.

In [None]:
from llama_index.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, text_splitter=text_splitter)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
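
To make chunk_size and chunk_overlap concrete, here is a minimal, library-free sketch of fixed-size chunking with overlap. It is not how SentenceSplitter works internally: the real splitter counts tokens and tries to respect sentence boundaries, while this sketch just counts whitespace-separated words. The function name split_with_overlap is made up for illustration.

```python
def split_with_overlap(text, chunk_size=8, chunk_overlap=2):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words.
    Assumption: a "token" here is simply a whitespace-separated word.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample_text = "one two three four five six seven eight nine ten"
sample_chunks = split_with_overlap(sample_text, chunk_size=4, chunk_overlap=1)
for chunk in sample_chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one. That shared context is what the overlap buys: a sentence cut at a chunk boundary is still partly visible in the next chunk.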

#### Lower-Level Transformations API

The above steps can be explicitly defined.

This can be done either by using the transformation modules (text splitters, metadata extractors, etc.) as standalone components, or by composing them in the declarative Transformation Pipeline interface.
<br>
https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/root.html

A key step is to split the documents into chunks/Node objects. The idea is to process the data into bite-size pieces that can be retrieved and fed to the LLM. Splitters can be used on their own or as part of an ingestion pipeline.


In [None]:
from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

# "..." stands in for any further transformations; replace it with real ones before running
pipeline = IngestionPipeline(transformations=[TokenTextSplitter(), ...])

nodes = pipeline.run(documents=documents)
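
Conceptually, a pipeline is just a list of transformations applied one after another, each taking a list of nodes and returning a list of nodes. The sketch below shows that idea in plain Python with dictionaries standing in for nodes. All names in it (run_toy_pipeline, split_into_sentences, add_word_count) are hypothetical, and it makes no claim about how IngestionPipeline is implemented.

```python
def split_into_sentences(nodes):
    """Toy splitter: one output node per sentence, metadata copied from the parent."""
    out = []
    for node in nodes:
        for sentence in node["text"].split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                out.append({"text": sentence, "metadata": dict(node["metadata"])})
    return out


def add_word_count(nodes):
    """Toy metadata step: record the number of words in each node."""
    for node in nodes:
        node["metadata"]["word_count"] = len(node["text"].split())
    return nodes


def run_toy_pipeline(toy_documents, transformations):
    """Feed the output of each transformation into the next one."""
    current = toy_documents
    for transform in transformations:
        current = transform(current)
    return current


toy_documents = [
    {
        "text": "Load the data. Transform the data. Index and store the data.",
        "metadata": {"source": "notes.txt"},
    }
]
toy_nodes = run_toy_pipeline(toy_documents, [split_into_sentences, add_word_count])
for node in toy_nodes:
    print(node)
```

Because every step has the same list-in, list-out shape, steps can be stacked or reordered freely, which is the property the transformations API relies on.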

### Adding Metadata

Metadata can be added to documents or nodes either manually or with automatic metadata extractors.
<br>
https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor.html


In [None]:
document = Document(
    text="text",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)

Nodes can also be created directly and passed to an index. To be inserted into a vector index, a node needs an embedding.

In [None]:
from llama_index.schema import TextNode

node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")

index = VectorStoreIndex([node1, node2])

### Documents/Nodes
Core abstractions within LlamaIndex.
<br>
<br>
You can use either documents or nodes to create a vector store.
<br>
<br>
A Document is the entire data source - for instance a PDF, an API output, or retrieved data from a database.
<br>
<br>
A Node represents a chunk of the source document.
<br>
<br> 
Both documents and nodes can contain metadata and relationships to other documents/nodes.

In [None]:
# Usage pattern
# Documents
from llama_index import Document, VectorStoreIndex

text_list = ['text1', 'text2']
documents = [Document(text=t) for t in text_list]

# build index
index = VectorStoreIndex.from_documents(documents)

In [None]:
# Nodes
from llama_index.node_parser import SentenceSplitter
from llama_index import Document, VectorStoreIndex

# load documents
text_list = ['text1', 'text2']
documents = [Document(text=t) for t in text_list]

# parse nodes (splitting data into chunks)
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# build index
index = VectorStoreIndex(nodes)

## Defining and Customizing Documents

Metadata can be included in the documents that are being read. Any information set in the metadata dictionary of a document shows up in the metadata of each node created from that document, which lets the index use it in queries and responses. By default, metadata is injected into the text for both embedding and LLM calls.

In [None]:
# Various ways to set up the dictionary

# 1. Document Constructor
document = Document(
    text="text",
    metadata={"file_name": "<doc_file_name>", "category": "<category>"},
)

# 2. After the document is created
document.metadata = {"file_name": "<doc_file_name>"}

# 3. set the filename automatically using SimpleDirectoryReader and file_metadata hook
from llama_index import SimpleDirectoryReader

filename_fn = lambda filename: {"file_name": filename}

# automatically sets the metadata of each document according to filename_fn
documents = SimpleDirectoryReader(
    "./data", file_metadata=filename_fn
).load_data()
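
The way document metadata ends up on every node can be pictured with two small dataclasses. This is only a sketch of the idea: SketchDocument, SketchNode, and to_sketch_nodes are hypothetical names, and the word-count splitting stands in for a real node parser.

```python
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class SketchNode:
    text: str
    ref_doc_id: str  # relationship back to the parent document
    metadata: dict = field(default_factory=dict)


def to_sketch_nodes(document, words_per_node=3):
    """Split a document into nodes; each node gets its own copy of the metadata."""
    words = document.text.split()
    return [
        SketchNode(
            text=" ".join(words[i:i + words_per_node]),
            ref_doc_id=document.doc_id,
            metadata=dict(document.metadata),
        )
        for i in range(0, len(words), words_per_node)
    ]


sketch_doc = SketchDocument(
    doc_id="essay-1",
    text="alpha beta gamma delta epsilon zeta eta",
    metadata={"category": "essay"},
)
sketch_nodes = to_sketch_nodes(sketch_doc)
for node in sketch_nodes:
    print(node)
```

Each node carries the parent's id and a copy of its metadata, so a retrieved chunk can always be traced back to its source and filtered by the same keys.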

### Customizing LLM Metadata Text

Some metadata keys help produce better embeddings but should not be shown to the LLM. We can exclude those keys from the LLM's view while still leaving them available to the embedding model.

In [None]:
# Excluding file_name key from being read by LLM
document.excluded_llm_metadata_keys = ["file_name"]

# Testing what the LLM can read from the metadata
from llama_index.schema import MetadataMode
print(document.get_content(metadata_mode=MetadataMode.LLM))

We can do the same for the embedding model.

In [None]:
# Excluding file_name key from being read by embedding model
document.excluded_embed_metadata_keys = ["file_name"]

# Testing what the embedding model can read from the metadata
print(document.get_content(metadata_mode=MetadataMode.EMBED))

### Customizing metadata format

How metadata is rendered into the text of each node and document when it is sent to the LLM or embedding model is controlled by these three attributes:

1. Document.metadata_seperator -> separator between each key/value pair
2. Document.metadata_template -> how each key/value pair is formatted, using the string placeholders {key} and {value}
3. Document.text_template -> what metadata looks like when joined with the text content of documents/nodes

Putting it all together:

In [6]:
from llama_index import Document
from llama_index.schema import MetadataMode

document = Document(
    text="This is a super-customized document",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    excluded_llm_metadata_keys=["file_name"],
    metadata_seperator="::",
    metadata_template="{key}=>{value}",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}\n",
)

print(
    "The LLM sees this: \n",
    document.get_content(metadata_mode=MetadataMode.LLM),
)
print(
    "The Embedding model sees this: \n",
    document.get_content(metadata_mode=MetadataMode.EMBED),
)

The LLM sees this: 
 Metadata: category=>finance::author=>LlamaIndex
-----
Content: This is a super-customized document
The Embedding model sees this: 
 Metadata: file_name=>super_secret_document.txt::category=>finance::author=>LlamaIndex
-----
Content: This is a super-customized document
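
The output above can be reproduced by hand, which helps to see how the three attributes and the excluded keys interact. The helper below is a plain-Python sketch of that composition, not the library's code; render_content is a made-up name, and it ignores special cases the real get_content may handle.

```python
def render_content(text, metadata, excluded_keys, separator,
                   metadata_template, text_template):
    """Render text plus metadata the way the three attributes describe."""
    # 1. Drop the keys this consumer (LLM or embedding model) should not see.
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    # 2. Format each pair with metadata_template and join with the separator.
    metadata_str = separator.join(
        metadata_template.format(key=k, value=v) for k, v in visible.items()
    )
    # 3. Combine the metadata string and the content with text_template.
    return text_template.format(metadata_str=metadata_str, content=text)


example_metadata = {
    "file_name": "super_secret_document.txt",
    "category": "finance",
    "author": "LlamaIndex",
}
shared = dict(
    text="This is a super-customized document",
    metadata=example_metadata,
    separator="::",
    metadata_template="{key}=>{value}",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}\n",
)

llm_view = render_content(excluded_keys=["file_name"], **shared)
embed_view = render_content(excluded_keys=[], **shared)
print(llm_view)
print(embed_view)
```

The only difference between the two views is the list of excluded keys, which is why file_name appears for the embedding model but not for the LLM.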


In [None]:
# Assume you have a list of nodes
nodes = ...

# Loop through each node
for node in nodes:
    # Access the metadata attribute
    metadata = node.metadata

    # The metadata is a dictionary, you can access the extracted titles or questions like this:
    document_title = metadata.get('document_title')
    questions_this_excerpt_can_answer = metadata.get('questions_this_excerpt_can_answer')

    print(f"Document Title: {document_title}")
    print(f"Questions this excerpt can answer: {questions_this_excerpt_can_answer}")

## Metadata Extraction

Metadata can be extracted automatically by LLMs using the Metadata Extractor modules.

1. SummaryExtractor - extracts a summary over a set of Nodes
2. QuestionsAnsweredExtractor - extracts a set of questions that each Node can answer
3. TitleExtractor - extracts a title over the context of each Node
4. EntityExtractor - extracts entities (e.g. names of places, people, things) mentioned in the content of each Node

In [None]:
from llama_index.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=5)
qa_extractor = QuestionsAnsweredExtractor(questions=3)

# assume documents are defined -> extract nodes
from llama_index.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[text_splitter, title_extractor, qa_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

# or use it in the ServiceContext

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    transformations=[text_splitter, title_extractor, qa_extractor]
)
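
An extractor is, in effect, a transformation that reads node text and writes new keys into node metadata. The sketch below swaps the LLM for a trivial heuristic (the first sentence of a document's first node becomes its title) just to show that shape. toy_title_extractor and the dictionary layout are assumptions for illustration; only the document_title key comes from the cells above.

```python
def toy_title_extractor(nodes):
    """Heuristic stand-in for an LLM-based title extractor.

    Assumption: each node is a dict with "text", "ref_doc_id" and "metadata".
    """
    titles = {}
    for node in nodes:
        # The first node seen for a document supplies that document's title.
        first_sentence = node["text"].split(".")[0].strip()
        titles.setdefault(node["ref_doc_id"], first_sentence)
    for node in nodes:
        node["metadata"]["document_title"] = titles[node["ref_doc_id"]]
    return nodes


extractor_nodes = [
    {"text": "Loading data. It comes first.", "ref_doc_id": "a", "metadata": {}},
    {"text": "Then transform it.", "ref_doc_id": "a", "metadata": {}},
    {"text": "Indexing basics. Store the nodes.", "ref_doc_id": "b", "metadata": {}},
]
for node in toy_title_extractor(extractor_nodes):
    print(node["metadata"]["document_title"], "|", node["text"])
```

Every node of a document ends up with the same document_title, which is the kind of shared context a real title extractor adds before the nodes are embedded.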

### Document Management

In [31]:
# Making a new directory for data files
!mkdir -p data_new_doc_man
# Making two new data files
!echo "this is test file: one!" > data_new_doc_man/test1.txt
!echo "this is test file: two!" > data_new_doc_man/test2.txt

In [32]:
from llama_index import SimpleDirectoryReader
# Reading the data files using SimpleDirectoryReader
# use each file's name as its document id by setting filename_as_id=True
documents = SimpleDirectoryReader("./data_new_doc_man", filename_as_id=True).load_data()

In [33]:
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.storage.docstore import (
    SimpleDocumentStore
)
from llama_index.text_splitter import SentenceSplitter

In [34]:
# The Pipeline
# SentenceSplitter - dividing the data into chunks
# OpenAIEmbedding - Creating embeddings for the data using OpenAI
# docstore - passing SimpleDocumentStore to store the documents
pipeline = IngestionPipeline(
    transformations = [
        SentenceSplitter(),
        OpenAIEmbedding(api_key=apiKey)
    ],
    docstore=SimpleDocumentStore()
)

In [35]:
# running the pipeline on documents
nodes = pipeline.run(documents=documents)

Docstore strategy set to upserts, but no vector store. Switching to duplicates_only strategy.


In [36]:
print(f"Ingested {len(nodes)} Nodes")

Ingested 2 Nodes


In [37]:
# storing the pipeline
pipeline.persist("./pipeline_storage")

In [38]:
pipeline.load("./pipeline_storage/")

In [39]:
!echo "This is a test file: three!" > data_new_doc_man/test3.txt
!echo "This is a NEW test file: one!" > data_new_doc_man/test1.txt

In [40]:
documents = SimpleDirectoryReader("./data_new_doc_man", filename_as_id=True).load_data()

In [41]:
nodes = pipeline.run(documents=documents)

In [42]:
print(f"Ingested {len(nodes)} Nodes")

Ingested 2 Nodes


In [43]:
for node in nodes:
    print(f"Node: {node.text}")

Node: This is a NEW test file: one!
Node: This is a test file: three!


In [44]:
print(len(pipeline.docstore.docs))

3
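
The numbers above (2 nodes ingested, then 2 again, with 3 documents in the docstore) follow from duplicate detection based on content hashes. Here is a library-free sketch of that idea, assuming the store simply maps document ids to a hash of their text; ingest_unseen and toy_docstore are hypothetical names, and the real docstore logic may differ in detail.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def ingest_unseen(files, toy_docstore):
    """Return ids of files whose content has not been seen before.

    toy_docstore maps doc_id -> content hash and is updated in place.
    """
    known_hashes = set(toy_docstore.values())
    to_process = []
    for doc_id, text in files.items():
        digest = content_hash(text)
        if digest in known_hashes:
            continue  # identical content was already ingested
        toy_docstore[doc_id] = digest
        known_hashes.add(digest)
        to_process.append(doc_id)
    return to_process


toy_docstore = {}
first_run = ingest_unseen(
    {"test1.txt": "this is test file: one!", "test2.txt": "this is test file: two!"},
    toy_docstore,
)
second_run = ingest_unseen(
    {
        "test1.txt": "This is a NEW test file: one!",
        "test2.txt": "this is test file: two!",
        "test3.txt": "This is a test file: three!",
    },
    toy_docstore,
)
print(first_run, second_run, len(toy_docstore))
```

On the second run only the changed test1.txt and the new test3.txt are processed, while the unchanged test2.txt is skipped, so expensive steps such as embedding are not repeated.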


In [45]:
!pip3 install redis



In [46]:
!rm -rf test_data
!mkdir -p test_data
!echo "This is a test file: one!" > test_data/test1.txt
!echo "This is a test file: two!" > test_data/test2.txt

In [47]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    "./test_data", filename_as_id=True
).load_data()

In [48]:
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import (
    DocstoreStrategy,
    IngestionPipeline,
    IngestionCache
)
from llama_index.ingestion.cache import RedisCache
from llama_index.storage.docstore import RedisDocumentStore
from llama_index.text_splitter import SentenceSplitter
from llama_index.vector_stores import RedisVectorStore

In [49]:
embed_model = OpenAIEmbedding(api_key=apiKey)

In [55]:
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="document_store"
    ),
    vector_store=RedisVectorStore(
        index_name="redis_vector_store",
        index_prefix="vector_store",
        redis_url="redis://localhost:6379",
    ),
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port("localhost", 6379), collection="redis_cache",
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

In [56]:
nodes = pipeline.run(documents=documents)
print(f"Ingested {len(nodes)} Nodes")

Ingested 2 Nodes


In [60]:
from llama_index import VectorStoreIndex, ServiceContext

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, service_context=service_context
)

In [61]:
print(
    index.as_query_engine(similarity_top_k=10).query(
        "What documents do you see?"
    )
)

I see two documents, one is located at "test_data/test1.txt" and the other is at "test_data/test2.txt".


In [62]:
!echo "This is a test file: three!" > test_data/test3.txt
!echo "This is a NEW test file: one!" > test_data/test1.txt

In [63]:
documents = SimpleDirectoryReader(
    "./test_data", filename_as_id=True
).load_data()

In [66]:
nodes = pipeline.run(documents=documents)
print(f"Ingested {len(nodes)} nodes")

Ingested 2 nodes


In [67]:
index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, service_context=service_context
)

response = index.as_query_engine(similarity_top_k = 10).query(
    "What documents do you see?"
)

print(response)

I see three documents. They are test_data/test3.txt, test_data/test2.txt, and test_data/test1.txt.
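
With a vector store attached, the upsert strategy also has to get rid of stale entries for a document whose content changed. The sketch below illustrates that delete-then-insert step with a list standing in for the vector store. The names upsert, toy_docstore, and toy_vector_store are hypothetical, the raw text is stored where a real store would keep embeddings, and nothing here is the Redis-backed implementation.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(files, toy_docstore, toy_vector_store):
    """Insert new documents and replace changed ones; skip unchanged ones.

    toy_docstore: doc_id -> content hash.
    toy_vector_store: list of (doc_id, text) entries.
    """
    changed = []
    for doc_id, text in files.items():
        digest = content_hash(text)
        if toy_docstore.get(doc_id) == digest:
            continue  # same id, same content: nothing to do
        # Remove any stale entries for this document before inserting.
        toy_vector_store[:] = [e for e in toy_vector_store if e[0] != doc_id]
        toy_vector_store.append((doc_id, text))
        toy_docstore[doc_id] = digest
        changed.append(doc_id)
    return changed


toy_docstore, toy_vector_store = {}, []
run_one = upsert(
    {"test1.txt": "This is a test file: one!", "test2.txt": "This is a test file: two!"},
    toy_docstore,
    toy_vector_store,
)
run_two = upsert(
    {
        "test1.txt": "This is a NEW test file: one!",
        "test2.txt": "This is a test file: two!",
        "test3.txt": "This is a test file: three!",
    },
    toy_docstore,
    toy_vector_store,
)
print(run_one, run_two)
print(sorted(toy_vector_store))
```

After the second run the store holds exactly three entries, and the one for test1.txt contains the new text. That matches the query engine seeing three documents rather than four.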
