### Quick intro to LlamaIndex  
Sources: [1](https://lmy.medium.com/comparing-langchain-and-llamaindex-with-4-tasks-2970140edf33), [2](https://docs.llamaindex.ai/en/stable/), [3](https://github.com/run-llama/llama_index), [4](https://nanonets.com/blog/llamaindex/)  

LlamaIndex is a "data framework" that helps you build LLM applications. It:

+ Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.).
+ Provides ways to structure your data (indices, graphs) so that it can be easily used with LLMs.
+ Provides an advanced retrieval/query interface over your data: feed in any LLM input prompt and get back retrieved context and knowledge-augmented output.
+ Allows easy integration with your outer application framework (e.g. LangChain, Flask, Docker, ChatGPT, or anything else).

LlamaIndex provides tools for both beginner and advanced users. The high-level API lets beginners ingest and query their data in 5 lines of code, while the lower-level APIs let advanced users customize and extend any module (data connectors, indices, retrievers, query engines, reranking modules) to fit their needs.  

In more detail, its main components are:

+ Data connectors ingest your existing data from its native source and format. These could be APIs, PDFs, SQL, and (much) more.
+ Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.
+ Engines provide natural language access to your data. For example:
    + Query engines are powerful retrieval interfaces for knowledge-augmented output.
    + Chat engines are conversational interfaces for multi-message, “back and forth” interactions with your data.
+ Data agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.
+ Application integrations tie LlamaIndex back into the rest of your ecosystem. This could be LangChain, Flask, Docker, ChatGPT, or anything else.  

#### Installing Packages

In [1]:
!pip install -q openai
!pip install -q llama-index
!pip install -q pypdf
!pip install -q docx2txt
!pip install -q requests beautifulsoup4  # only needed for the optional download step

#### Importing Packages

In [17]:
import os
import openai

#os.environ["OPENAI_API_KEY"] = "<the key>"
openai.api_key = os.environ["OPENAI_API_KEY"]

from llama_index.core import VectorStoreIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import load_index_from_storage
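
Reading `os.environ["OPENAI_API_KEY"]` directly raises a bare `KeyError` when the variable is missing, which can be confusing in a notebook. A minimal sketch of a friendlier check follows. The helper name `require_env` is hypothetical; it is not part of LlamaIndex or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a readable error.

    Hypothetical helper for this notebook, not a library function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before starting the notebook."
        )
    return value


# Possible usage in the cell above (assumes the key has been exported):
# openai.api_key = require_env("OPENAI_API_KEY")
```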

In [15]:
import logging
import sys

#logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
#logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

#### Defining Models

In [24]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

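model = "gpt-3.5-turbo"
#model = "gpt-4"
#model = "gpt-4-turbo"

Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.llm = OpenAI(
    temperature=0,
    model=model,
    #max_tokens=512,
    # Extra OpenAI API parameters are lowercase and are passed through additional_kwargs.
    additional_kwargs={"presence_penalty": -2, "top_p": 1},
)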

#### Defining Folders

In [5]:
print(f"Current dir: {os.getcwd()}")
DOCS_DIR = "../Data/"
# Create the data folder (and any missing parents) if it does not exist yet.
os.makedirs(DOCS_DIR, exist_ok=True)
docs = sorted(os.listdir(DOCS_DIR))
print(f"Files in {DOCS_DIR}")
for doc in docs:
    print(doc)

Current dir: /home/renato/Documents/Repos/GenAI4Humanists/Notebooks
Files in ../Data/
1.pdf
WarrenCommissionReport.txt
axis_report.pdf
california_housing_train.csv
hdfc_report.pdf
hr.sqlite
icici_report.pdf
knowledge_card.pdf
loftq.pdf
longlora.pdf
lyft_2021.pdf
metagpt.pdf
metra.pdf
paul_graham_essay.txt
selfrag.pdf
swebench.pdf
uber_2021.pdf
values.pdf
vr_mcl.pdf
zipformer.pdf
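
The folder mixes PDFs, text files, a CSV, and a SQLite database, so it can help to see what is there by type before deciding which files to index. The sketch below uses only the standard library; the function name `group_by_extension` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(folder: str) -> dict[str, list[str]]:
    """Map each lowercase file extension to the sorted file names that use it."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            groups[path.suffix.lower()].append(path.name)
    return dict(groups)


# Possible usage with the folder defined above:
# for ext, names in group_by_extension(DOCS_DIR).items():
#     print(ext, len(names))
```

The selected names can then be passed to `SimpleDirectoryReader(input_files=[...])`, as the loading cell of this notebook does for `1.pdf`.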


#### (Optional) Downloading an example PDF file:  

In [7]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Download every link on the page whose href points to a PDF file.
i = 0
for link in soup.find_all('a'):
    href = link.get('href', '')
    if '.pdf' in href:
        i += 1
        print("Downloading file: ", i)
        # Resolve relative links against the page URL before requesting them.
        pdf_response = requests.get(urljoin(url, href))
        # The context manager closes the file even if the write fails.
        with open(f"{DOCS_DIR}{i}.pdf", 'wb') as pdf:
            pdf.write(pdf_response.content)
        print("File ", i, " downloaded")
print("All PDF files downloaded")

Downloading file:  1
File  1  downloaded
All PDF files downloaded
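
The substring test `'.pdf' in href` also matches URLs that merely contain ".pdf" somewhere, and it does not handle relative links. A stricter filter could look like the sketch below, which resolves each link against the page URL and keeps only those whose path ends in `.pdf`. The function name `pdf_links` and the example URLs in the comment are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def pdf_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep unique links whose path ends in .pdf."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip anchors without an href
        absolute = urljoin(page_url, href)
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Possible usage with the soup object from the cell above:
# hrefs = [a.get("href") for a in soup.find_all("a")]
# for link in pdf_links(url, hrefs):
#     print(link)
```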


In [8]:
documents = SimpleDirectoryReader(input_files=[f"{DOCS_DIR}1.pdf"]).load_data()
documents

[Document(id_='eeab8418-53c2-47b2-95fe-7382650feea3', embedding=None, metadata={'page_label': '1', 'file_name': '1.pdf', 'file_path': '../Data/1.pdf', 'file_type': 'application/pdf', 'file_size': 154717, 'creation_date': '2024-05-30', 'last_modified_date': '2024-05-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=' Food Calories List \nFrom : www .weightlossforall.com  \nThe food calories list is a table of everyday foods listing their calorie content per average portion. The \nfood calories list also gives the calorie content in 100 grams so it can be compared with any other \nproducts not listed here. The table can be useful if you want to exchange a food with similar calorie \ncontent when following a weight loss  low calorie program.  \nThe food 
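
Printing the raw list is hard to read. The output above shows that each `Document` carries a `metadata` dict (with keys such as `file_name` and `page_label`) and a `text` attribute, so a compact summary can be built from those alone. This is a sketch with a hypothetical helper name, `summarize_documents`; it is demonstrated on stand-in objects so that it runs without loading a PDF.

```python
from types import SimpleNamespace


def summarize_documents(documents):
    """Return (file_name, page_label, character_count) for each document."""
    rows = []
    for doc in documents:
        meta = doc.metadata
        rows.append((meta.get("file_name", "?"), meta.get("page_label", "?"), len(doc.text)))
    return rows


# Stand-in object that mimics the attributes seen in the output above.
fake_docs = [
    SimpleNamespace(metadata={"file_name": "1.pdf", "page_label": "1"}, text="Food Calories List"),
]
print(summarize_documents(fake_docs))  # [('1.pdf', '1', 18)]
```

In the notebook itself, `summarize_documents(documents)` would be called on the list returned by `load_data()`.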

In [33]:
index = VectorStoreIndex.from_documents(documents)

In [34]:
query_engine = index.as_query_engine()
response = query_engine.query("What is the document about?")
print(response)

The document provides a list of various foods along with their calorie content per average portion and per 100 grams. It includes information about different types of breads, cereals, fruits, and other food items. The list is intended to help individuals compare the calorie content of different foods and make choices that align with their dietary needs or goals, such as weight loss or maintaining a balanced diet.
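
Conceptually, a vector index stores an embedding for each chunk of text. A query engine embeds the question, retrieves the most similar chunks, and hands them to the LLM as context. The toy sketch below illustrates only the retrieval step. It uses word counts in place of a real embedding model and is not LlamaIndex's implementation; all function names and the sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real index calls an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Illustrative chunks, not taken from the PDF.
chunks = [
    "Breads and cereals such as bagel and cornflakes",
    "Fruits such as apple and banana",
    "Calories are given per average portion",
]
print(retrieve("Which fruits include banana?", chunks, top_k=1))
```

A real embedding model captures meaning rather than exact word overlap, which is why the actual query engine can answer "What is the document about?" even though few of those words appear in the PDF.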


In [35]:
INDEX_DIR = "../Index/intro/"
# makedirs also creates the parent "../Index" folder if it is missing.
os.makedirs(INDEX_DIR, exist_ok=True)
index.storage_context.persist(persist_dir=INDEX_DIR)

In [36]:
if not os.path.exists(INDEX_DIR):
    # No saved index yet: build one from every file in DOCS_DIR and save it.
    documents = SimpleDirectoryReader(DOCS_DIR).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=INDEX_DIR)
else:
    # A saved index exists: reload it from disk instead of re-embedding the documents.
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("What is the document about?")
print(response)

The document is about a food calories list, which provides the calorie content of various everyday foods per average portion and per 100 grams. It includes information on foods from different groups such as breads, cereals, and fruits. The list can be used to compare calorie content between different foods and can be helpful for those following a weight loss or low calorie program.
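
Checking only whether the folder exists treats an empty or half-written folder as a valid index. A slightly more defensive check looks for a file that the persist step writes. The sketch below assumes the default persist layout includes a `docstore.json` file; that file name is an assumption to verify by listing `INDEX_DIR` after persisting, and the helper name `index_is_persisted` is hypothetical.

```python
from pathlib import Path


def index_is_persisted(index_dir: str, marker: str = "docstore.json") -> bool:
    """Return True if index_dir contains the marker file written when persisting.

    The default marker name is an assumption about the persist layout.
    """
    return (Path(index_dir) / marker).is_file()


# Possible usage in place of os.path.exists(INDEX_DIR):
# if not index_is_persisted(INDEX_DIR):
#     ...build and persist the index...
# else:
#     ...load it from storage...
```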


In [37]:
response = query_engine.query("List some ingredients mentioned in the document?")
print(response)

The document mentions a variety of ingredients, including Bagel, Biscuit digestives, Jaffa cake, Bread (white and wholemeal), Chapatis, Cornflakes, Crackerbread, Cream crackers, Crumpets, Flapjacks, Macaroni, Muesli, Naan bread, Noodles, Pasta (normal and wholemeal), Porridge oats, Potatoes (boiled and roast), Apple, Apricot, Avocado, Banana, Blackberries, Blackcurrant, Blueberries, Cherry, Clementine, Currants, Damson, Dates, Figs, Gooseberries, Grapes, Grapefruit, Guava, Kiwi, Lemon, Lychees, Mango, Melon (Honeydew and Canteloupe), Nectarines, and Olives.
