In [None]:
import os
apiKey = os.getenv("OPENAI_API_KEY")  # read the key from the environment; never hardcode a secret in a notebook

In [None]:
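!pip3 install openai --quiet
# The imports used below (llama_index.llms, ServiceContext) match the pre-0.10 package layout.
!pip3 install "llama_index<0.10" --quiet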

__Basic Tutorial__
<br>
<br>
Building an __LLM__ Application


In [None]:
from llama_index.llms import OpenAI

response = OpenAI(api_key=apiKey).complete("Paul Graham is ")
print(response)

In the example below, OpenAI's gpt-4 model serves as the LLM, and an OpenAI embedding model is used to build a vector store from the Paul Graham essay in the data folder.
<br>
<br>

_ServiceContext_ - to configure the application (LLM, embedding model) instead of relying on the defaults, as the example above did
<br>
_SimpleDirectoryReader_ - to load data (documents) from the folder (in this case "data")
<br>
_VectorStoreIndex_ - to create a vector store from documents loaded using SimpleDirectoryReader
<br>
<br>

_llama_index.llms_ - has all the LLMs that can be used
<br>
_llama_index.embeddings_ - has all the embeddings that can be used

In [None]:
from llama_index.llms import OpenAI
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import OpenAIEmbedding

llm = OpenAI(api_key=apiKey, temperature=0.1, model="gpt-4")
embed_model = OpenAIEmbedding(api_key=apiKey)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

In [None]:
print(documents)

## Loading Data

Data has to be loaded before any LLM can use it to answer our queries. This is analogous to data cleaning, feature engineering, or ETL pipelines.
<br>
<br>
There are three main stages in this pipeline:
1. Load the data
2. Transform the data
3. Index and store the data

<br>
Data can be ingested in various ways, starting with loaders.

### Loaders
Loading data is done using data connectors, also known as Readers. Data connectors ingest data from different sources and format it into Document objects.
<br>
<br>

_Document_ -> a collection of data together with metadata about that data.
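
To make the idea of a Reader concrete, here is a minimal, library-free sketch of what a data connector does: it reads raw files and wraps each one in an object holding text plus metadata. `SimpleDocument` and `load_text_directory` are hypothetical names for illustration only; they are not part of LlamaIndex, and the real `SimpleDirectoryReader` handles many more file types.

```python
# Library-free sketch; SimpleDocument and load_text_directory are
# hypothetical names, not LlamaIndex APIs.
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Text plus metadata describing where the text came from."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_text_directory(directory):
    """Read every .txt file in `directory` into one SimpleDocument each."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append(
            SimpleDocument(
                text=path.read_text(encoding="utf-8"),
                metadata={"file_name": path.name},
            )
        )
    return docs


# Demo on a throwaway directory so the sketch does not depend on "./data".
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first essay", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second essay", encoding="utf-8")
    demo_docs = load_text_directory(tmp)

for d in demo_docs:
    print(d.metadata["file_name"], "->", d.text)
```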


In [None]:
# Using SimpleDirectoryReader

from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

Data comes from many places, and not every source has a built-in reader, but LlamaHub offers many additional connectors:
<br>
https://llamahub.ai/?tab=loaders
<br>
<br>
In the example below, a database reader connects to a database and loads the results of a query.

In [None]:
from llama_index import download_loader
import os

DatabaseReader = download_loader("DatabaseReader")

reader = DatabaseReader(
    scheme=os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
)

query = "SELECT * FROM users"
documents = reader.load_data(query=query)
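
The cell above needs a running database and credentials. As a rough, self-contained illustration of the same idea, the sketch below runs a query against an in-memory SQLite database and turns each row into one text string that could then be wrapped in a Document. The `rows_to_texts` helper and the `column: value` row format are assumptions made for this sketch; the notebook does not state how `DatabaseReader` formats its rows.

```python
# Sketch only: rows_to_texts is a hypothetical helper, and the
# "column: value" format is an assumption, not DatabaseReader's behaviour.
import sqlite3


def rows_to_texts(connection, query):
    """Run `query` and render each result row as a single text string."""
    cursor = connection.execute(query)
    columns = [col[0] for col in cursor.description]
    texts = []
    for row in cursor.fetchall():
        texts.append(
            ", ".join(f"{name}: {value}" for name, value in zip(columns, row))
        )
    return texts


# Build a tiny throwaway table to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])

row_texts = rows_to_texts(conn, "SELECT * FROM users ORDER BY id")
conn.close()

print(row_texts)
```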

In [None]:
# Alternatively, we can create documents directly instead of loading them

from llama_index.schema import Document

doc = Document(text="text")

### Transformations

Data needs to be processed and transformed before placing it in a storage system. Transformations include chunking, extracting metadata, and embedding each chunk. 
<br>
<br>
Transformations take Node objects as input and return Node objects as output (a Document is a subclass of a Node). Transformations can be stacked and reordered.

#### High Level Transformations API

The _.from_documents()_ method of VectorStoreIndex accepts a list of Document objects and parses and chunks them.
<br>
<br>
Under the hood, this splits each Document into Node objects, which are similar to Documents (they contain text and metadata) but keep a relationship to their parent Document.

In [None]:
# High Level Transformations API

from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = vector_index.as_query_engine()  # keep the engine so it can be queried later

For more customization, such as splitting the text into chunks of a given size, SentenceSplitter can be used.

In [None]:
from llama_index.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model, text_splitter=text_splitter)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
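
To see what `chunk_size` and `chunk_overlap` mean, the following library-free sketch splits a list of words into fixed-size chunks in which each chunk repeats the last few words of the previous one. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical helper that counts words, whereas `SentenceSplitter` measures sizes in tokens and tries to respect sentence boundaries.

```python
# Sketch only: split_with_overlap is a hypothetical helper that counts
# words; it is not how SentenceSplitter is implemented.
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Split `words` into chunks of `chunk_size` sharing `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(" ".join(chunk))
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a chunk boundary is still partly visible in the next chunk.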

#### Lower-Level Transformation API

The steps above can also be defined explicitly.

This can be done either by using the transformation modules (text splitters, metadata extractors, etc.) as standalone components, or by composing them in the declarative Transformation Pipeline interface.
<br>
https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/root.html

A key step is splitting the documents into chunks (Node objects). The idea is to break the data into bite-size pieces that can be retrieved and fed to the LLM. Splitters can be used on their own or as part of an ingestion pipeline.


In [None]:
from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

# Further transformations (e.g. metadata extractors) can be appended to this list.
pipeline = IngestionPipeline(transformations=[TokenTextSplitter()])

nodes = pipeline.run(documents=documents)
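
Conceptually, a transformation pipeline is a list of steps, each taking a list of nodes and returning a new list of nodes. The library-free sketch below shows that composition with plain dictionaries. All names here (`split_into_sentences`, `add_word_count`, `run_pipeline`) are hypothetical and only illustrate the stacking idea; they are not the `IngestionPipeline` implementation.

```python
# Sketch only: nodes are plain dicts and every function name here is
# hypothetical; this is not the IngestionPipeline implementation.
def split_into_sentences(nodes):
    """Splitting step: one output node per sentence, metadata copied over."""
    out = []
    for node in nodes:
        for sentence in node["text"].split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                out.append({"text": sentence, "metadata": dict(node["metadata"])})
    return out


def add_word_count(nodes):
    """Metadata step: attach a word count to every node."""
    return [
        {
            "text": n["text"],
            "metadata": {**n["metadata"], "word_count": len(n["text"].split())},
        }
        for n in nodes
    ]


def run_pipeline(nodes, transformations):
    """Apply each transformation in order, feeding its output to the next."""
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


source = [
    {
        "text": "Load the data. Transform the data. Index and store the data.",
        "metadata": {"file_name": "notes.txt"},
    }
]
result = run_pipeline(source, [split_into_sentences, add_word_count])

for node in result:
    print(node)
```

Because every step has the same input and output shape, steps can be reordered or removed without changing the others.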

### Adding Metadata

Metadata can be added to documents or nodes either manually or with automatic metadata extractors.
<br>
https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor.html


In [None]:
document = Document(
    text="text",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)

To be inserted into a vector index, a node needs an embedding. Nodes can also be created directly and passed to an index.

In [None]:
from llama_index.schema import TextNode

node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")

index = VectorStoreIndex([node1, node2])
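
The role of the embedding can be illustrated with a toy index: each text is mapped to a vector, and a query returns the stored texts whose vectors are most similar to the query vector. This is a sketch under loud assumptions: the word-count "embedding" stands in for a real embedding model, and `embed`, `cosine_similarity`, and `ToyVectorIndex` are hypothetical names, not LlamaIndex APIs.

```python
# Sketch only: word counts stand in for a real embedding model, and
# ToyVectorIndex is a hypothetical class, not VectorStoreIndex.
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorIndex:
    """Stores (text, vector) pairs and retrieves by similarity."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def query(self, question, top_k=1):
        q = embed(question)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(q, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]


toy_index = ToyVectorIndex(
    [
        "the cat sat on the mat",
        "stock prices fell sharply today",
        "a dog chased the cat",
    ]
)
best = toy_index.query("where did the cat sit", top_k=1)
print(best)
```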

### Documents/Nodes
Documents and Nodes are the core abstractions within LlamaIndex.
<br>
<br>
You can use either Documents or Nodes to create a vector store.
<br>
<br>
A Document is the entire data source - for instance a PDF, an API output, or retrieved data from a database.
<br>
<br>
A Node represents a chunk of the source Document.
<br>
<br>
Both Documents and Nodes can contain metadata and relationships to other Documents/Nodes.
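
The Document/Node relationship can be pictured with two small data classes: each node carries a chunk of text, a copy of the document's metadata, and the id of the document it came from. The classes and the character-based `to_nodes` helper below are hypothetical and exist only to illustrate the parent link; the real LlamaIndex classes carry more fields and richer relationship types.

```python
# Sketch only: SketchDocument, SketchNode and to_nodes are hypothetical
# names used to illustrate the parent link between a node and its document.
import uuid
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    text: str
    metadata: dict = field(default_factory=dict)
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class SketchNode:
    text: str
    metadata: dict
    source_doc_id: str  # relationship back to the parent document


def to_nodes(document, chunk_chars):
    """Cut the document text into fixed-size character chunks."""
    return [
        SketchNode(
            text=document.text[i:i + chunk_chars],
            metadata=dict(document.metadata),  # each node gets its own copy
            source_doc_id=document.doc_id,
        )
        for i in range(0, len(document.text), chunk_chars)
    ]


pdf_doc = SketchDocument(text="abcdefghij", metadata={"file_name": "report.pdf"})
pdf_nodes = to_nodes(pdf_doc, chunk_chars=4)

for node in pdf_nodes:
    print(node.text, node.metadata, node.source_doc_id == pdf_doc.doc_id)
```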

In [None]:
# Usage pattern
# Documents
from llama_index import Document, VectorStoreIndex

text_list = ['text1', 'text2']
documents = [Document(text=t) for t in text_list]

# build index
index = VectorStoreIndex.from_documents(documents)

In [None]:
# Nodes
from llama_index.node_parser import SentenceSplitter
from llama_index import Document, VectorStoreIndex

# load documents
text_list = ['text1', 'text2']
documents = [Document(text=t) for t in text_list]

# parse nodes (splitting data into chunks)
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# build index
index = VectorStoreIndex(nodes)