### Quick intro to LlamaIndex  
Sources: [1](https://lmy.medium.com/comparing-langchain-and-llamaindex-with-4-tasks-2970140edf33), [2](https://docs.llamaindex.ai/en/stable/), [3](https://github.com/run-llama/llama_index), [4](https://nanonets.com/blog/llamaindex/)  

LlamaIndex is a "data framework" for building LLM applications. It:

+ Offers data connectors to ingest your existing data sources and formats (APIs, PDFs, docs, SQL, etc.).
+ Provides ways to structure your data (indices, graphs) so that it can be used easily with LLMs.
+ Provides an advanced retrieval/query interface over your data: feed in any LLM input prompt and get back retrieved context and knowledge-augmented output.
+ Allows easy integration with your outer application framework (e.g. LangChain, Flask, Docker, ChatGPT, or anything else).

LlamaIndex provides tools for both beginner and advanced users.  
The high-level API lets beginners ingest and query their data in 5 lines of code.  
The lower-level APIs let advanced users customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules) to fit their needs.  

LlamaIndex provides the following tools:
+ Data connectors ingest your existing data from their native source and format. These could be APIs, PDFs, SQL, and (much) more.
+ Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.
+ Engines provide natural language access to your data. For example:
    + Query engines are powerful retrieval interfaces for knowledge-augmented output.
    + Chat engines are conversational interfaces for multi-message, “back and forth” interactions with your data.
+ Data agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.
+ Application integrations tie LlamaIndex back into the rest of your ecosystem. This could be LangChain, Flask, Docker, ChatGPT, or… anything else!  
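
To make the connector, index, and query-engine roles more concrete, here is a minimal toy sketch of vector retrieval in plain Python. It is not how LlamaIndex is implemented: the `embed` function is a word-count stand-in for a real embedding model, and `ToyVectorIndex` and the sample chunks are hypothetical names invented for this illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and returns the chunks closest to a query."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


# Hypothetical chunks standing in for text produced by a data connector.
chunks = [
    "Data connectors ingest PDFs, APIs and SQL tables.",
    "Chat engines support multi-message conversations.",
    "Query engines answer a single question using retrieved context.",
]
toy_index = ToyVectorIndex(chunks)
context = toy_index.retrieve("Which engine handles conversations?", top_k=1)
print(context)
```

In a real pipeline, the retrieved chunks would be placed into the LLM prompt as context, which is the "knowledge-augmented output" step that a query engine wraps.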

#### Installing Packages

In [1]:
!pip install -qU openai
!pip install -qU llama-index
!pip install -qU pydantic
!pip install -qU llama-index-llms-openai
!pip install -qU pypdf
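!pip install -qU docx2txt
# Needed for the optional Ollama and HuggingFace imports used below
!pip install -qU llama-index-llms-ollama
!pip install -qU llama-index-embeddings-huggingface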

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\Renato Rocha Souza\\AppData\\Roaming\\Python\\Python39\\site-packages\\~-dantic_core\\_pydantic_core.cp39-win_amd64.pyd'
Check the permissions.



#### Importing Packages

In [2]:
import os
import sys
import openai
import pydantic

#os.environ["OPENAI_API_KEY"] = "<the key>"
openai.api_key = os.environ["OPENAI_API_KEY"]

import llama_index

from llama_index.core import Settings

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


from llama_index.core import VectorStoreIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import load_index_from_storage
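
Reading `os.environ["OPENAI_API_KEY"]` directly raises a bare `KeyError` when the variable is missing. The sketch below shows one possible way to fail with a clearer message; `require_env` is a hypothetical helper name, not part of `openai` or LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a readable error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or set it in a "
            "notebook cell before creating the OpenAI clients."
        )
    return value


# Possible usage in the import cell (assumes the variable is set):
# openai.api_key = require_env("OPENAI_API_KEY")
```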

In [3]:
print("LlamaIndex:", llama_index.core.__version__)
print("Pydantic:", pydantic.VERSION)
print("OpenAI:", openai.__version__)

LlamaIndex: 0.12.2
Pydantic: 2.9.2
OpenAI: 1.55.3
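
As an alternative sketch, installed versions can also be read from package metadata without importing the packages, which helps when an import itself fails. `installed_version` is a hypothetical helper; the distribution names are assumed to match the packages installed above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for dist in ("llama-index-core", "pydantic", "openai"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```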


In [4]:
import logging
import sys

#logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
#logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#### Defining Models

To use [Ollama models](https://ollama.com/search), first check which ones are installed on your local machine (for example, by running `ollama list`).

In [5]:
#model="gpt-4o"
model="gpt-4o-mini"

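Settings.llm = OpenAI(temperature=0,
                      model=model,
                      #max_tokens=512,
                      # Extra sampling parameters are forwarded to the OpenAI API via additional_kwargs
                      additional_kwargs={"presence_penalty": -2, "top_p": 1},
                     )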

#Settings.llm = Ollama(model="llama3.2", request_timeout=300.0)

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

#Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

#### Defining Folders

In [6]:
print(f"Current dir: {os.getcwd()}")
DOCS_DIR = "../../Data/"
# Create the data folder (and any missing parent folders) if it does not exist yet
os.makedirs(DOCS_DIR, exist_ok=True)
# List the folder contents in sorted order
docs = sorted(os.listdir(DOCS_DIR))
print(f"Files in {DOCS_DIR}")
for doc in docs:
    print(doc)

Current dir: C:\Users\Renato Rocha Souza\Documents\Repos\GenAI4Humanists\Notebooks\LlamaIndex
Files in ../../Data/
.ipynb_checkpoints
1.pdf
Lightroom.jpeg
Sign.png
Vienna_dataset.json
Vienna_image.png
Vienna_mask.png
WarrenCommissionReport.txt
attention.png
axis_report.pdf
california_housing_train.csv
fossils.jpeg
handwritten.jpg
handwritten2.jpg
handwritten3.jpg
hdfc_report.pdf
hr.sqlite
icici_report.pdf
imageToSave.png
imageToSave2.png
kafka_metamorphosis.txt
keynote_recap.mp3
keynote_recap.mp4
knowledge_card.pdf
loftq.pdf
longlora.pdf
lyft_2021.pdf
metagpt.pdf
metra.pdf
new_rag_dataset.json
nyc_text.txt
paul_graham_essay.txt
rag_dataset.json
selfrag.pdf
sound_english.mp3
sound_german.mp3
sound_portuguese.mp3
speech.mp3
swebench.pdf
triangle.png
uber_2021.pdf
values.pdf
vr_mcl.pdf
zipformer.pdf
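
The folder mixes PDFs and text files with images, audio, and a database, so it can help to narrow the list before loading a whole directory. The sketch below filters names by extension in plain Python; `select_by_extension` and `TEXT_LIKE_EXTS` are hypothetical names, and the choice of `.pdf` and `.txt` is an assumption about which files you want to query. `SimpleDirectoryReader` also documents a `required_exts` argument that serves a similar purpose.

```python
from pathlib import PurePath

# Assumption: only these formats are of interest for text querying.
TEXT_LIKE_EXTS = {".pdf", ".txt"}


def select_by_extension(names, allowed=TEXT_LIKE_EXTS):
    """Keep only file names whose suffix (case-insensitive) is in `allowed`."""
    return sorted(n for n in names if PurePath(n).suffix.lower() in allowed)


# A few names copied from the listing above.
sample = [".ipynb_checkpoints", "1.pdf", "Lightroom.jpeg", "hr.sqlite",
          "kafka_metamorphosis.txt", "keynote_recap.mp3"]
print(select_by_extension(sample))
```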


In [7]:
documents = SimpleDirectoryReader(input_files=[f"{DOCS_DIR}1.pdf"]).load_data()
documents

[Document(id_='ebc9c08b-a1fd-40db-9340-487f0c5e594d', embedding=None, metadata={'page_label': '1', 'file_name': '1.pdf', 'file_path': '..\\..\\Data\\1.pdf', 'file_type': 'application/pdf', 'file_size': 154717, 'creation_date': '2024-05-31', 'last_modified_date': '2024-05-31'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text=' \nFood Calories List \nFrom: www.weightlossforall.com \nThe food calories list is a table of everyday foods listing their calorie content per average portion. The \nfood calories list also gives the calorie content in 100 grams so it can be compared with any other \nproducts not listed here. The table can be useful if you want to exchange a food with similar calorie \nconte

In [8]:
index = VectorStoreIndex.from_documents(documents)

In [9]:
query_engine = index.as_query_engine()
response = query_engine.query("What is the document about?")
print(response)

The document provides a food calories list that details the calorie content of various everyday foods per average portion and per 100 grams. It categorizes foods based on the five basic food groups of a balanced diet, making it useful for those following a weight loss or low-calorie program.


In [10]:
INDEX_DIR = "../../Index/VectorStoreIndex/"
# Create the index folder, including missing parent folders, then save the index there
os.makedirs(INDEX_DIR, exist_ok=True)
index.storage_context.persist(persist_dir=INDEX_DIR)

In [11]:
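# Build and persist the index only if no persisted copy exists; otherwise reload it from disk
if not os.path.exists(INDEX_DIR):
    # Note: this branch indexes every file in DOCS_DIR, not just 1.pdf
    documents = SimpleDirectoryReader(DOCS_DIR).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=INDEX_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_context)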

query_engine = index.as_query_engine()
response = query_engine.query("What is the document about?")
print(response)

The document provides a food calories list that details the calorie content of various everyday foods per average portion and per 100 grams. It categorizes foods based on the five basic food groups of a balanced diet, making it useful for those following a weight loss or low-calorie program.
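
The build-or-load pattern used in this cell can be sketched independently of LlamaIndex. The version below checks for a specific cache file rather than a directory, since a folder that exists but holds no index files would otherwise send the code down the load branch. `load_or_build` and the JSON cache format are assumptions made for this illustration only.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_file: Path, build):
    """Load a JSON-serialisable object from cache_file, or build and save it on a miss."""
    if cache_file.is_file():
        return json.loads(cache_file.read_text(encoding="utf-8")), "loaded"
    obj = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(obj), encoding="utf-8")
    return obj, "built"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "Index" / "toy_index.json"
    # First call: no cache yet, so the builder runs and the result is saved.
    first = load_or_build(cache, lambda: {"chunks": ["a", "b"]})
    # Second call: the cache exists, so the builder is never invoked.
    second = load_or_build(cache, lambda: {"chunks": ["should not run"]})

print(first[1], second[1])
```

The same idea applies to a persisted index: test for the files that the persist step actually writes, not only for the folder that contains them.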


In [12]:
response = query_engine.query("List some ingredients mentioned in the document.")
print(response)

Some ingredients mentioned include lentils, lettuce, melon, mushrooms, olives, onion, orange, peas, peach, pear, pepper, pineapple, plum, spinach, strawberries, sweetcorn, tomato, and watercress.


In [13]:
response = query_engine.query("What is the least caloric ingredient?")
print(response)

The ingredient with the least calories is cucumber, which has 3 calories per 100 grams.
