In [1]:
%%capture
!pip install llama-index==0.10.37 html2text

In [2]:
import os

from getpass import getpass
import nest_asyncio

from dotenv import load_dotenv

nest_asyncio.apply()

load_dotenv()

True

# 📂 **Loading Data**

Preparing your data for an LLM involves an ingestion pipeline similar to ML data cleaning or traditional ETL processes.

### **Ingestion Pipeline Stages**
  - 📥 Load the data
  - 🔧 Transform the data
  - 🗃️ Index and store the data
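To make the three stages concrete, here is a minimal plain-Python sketch. It is not LlamaIndex code: the function names `load`, `transform`, and `index`, the dictionary layout, and the sample sentences are all assumptions made for illustration.

```python
# Plain-Python illustration of the three ingestion stages (hypothetical helpers).

def load(raw_texts):
    """Load: wrap each raw string in a document dict with some metadata."""
    return [
        {"id": f"doc-{i}", "text": text, "metadata": {"source": f"text-{i}"}}
        for i, text in enumerate(raw_texts)
    ]


def transform(documents, chunk_size=5):
    """Transform: split each document into chunks of at most `chunk_size` words."""
    chunks = []
    for doc in documents:
        words = doc["text"].split()
        for start in range(0, len(words), chunk_size):
            chunks.append({
                "text": " ".join(words[start:start + chunk_size]),
                "parent_id": doc["id"],            # link back to the source document
                "metadata": dict(doc["metadata"]),  # each chunk inherits metadata
            })
    return chunks


def index(chunks):
    """Index and store: map each lowercase word to the positions of chunks containing it."""
    inverted = {}
    for position, chunk in enumerate(chunks):
        for word in set(chunk["text"].lower().split()):
            inverted.setdefault(word, []).append(position)
    return inverted


raw = ["the quick brown fox jumps over the lazy dog", "the dog sleeps"]
docs = load(raw)
chunks = transform(docs)
store = index(chunks)
print(len(docs), len(chunks), store["dog"])
```

A real pipeline would typically store embeddings rather than a word lookup, but the flow of data through the three stages is the same.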


Let's start by downloading some example files.

# 📥 Load the data

To use data with an LLM, first load it using data connectors, known as `Readers` in LlamaIndex, which format data into `Document` objects containing data and metadata.

📚 **SimpleDirectoryReader**:
  - The most straightforward loader is `SimpleDirectoryReader`.
  - Built into LlamaIndex, it reads various formats (Markdown, PDFs, Word documents, PowerPoint decks, images, audio, video) from every file in a directory, creating documents.
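As a rough mental model of what a directory reader does, the sketch below walks a folder and turns each file into a document with file metadata. It is a plain-Python approximation that handles only text files, not the `SimpleDirectoryReader` implementation; `read_directory` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def read_directory(directory):
    """Return one document dict per regular file in `directory` (plain text only)."""
    documents = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories in this simplified version
        documents.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {
                "file_name": path.name,
                "file_path": str(path),
                "file_size": path.stat().st_size,
            },
        })
    return documents


# Demo on a throwaway directory containing two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.md").write_text("# Title\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("hello", encoding="utf-8")
    demo_docs = read_directory(tmp)

print([d["metadata"]["file_name"] for d in demo_docs])
```

The metadata keys mirror some of those visible on the loaded PDF pages in this notebook (`file_name`, `file_path`, `file_size`).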

In [1]:
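# NOTE: this cell imports the reader from llama_index.legacy, so load_data()
# returns llama_index.legacy.schema.Document objects. The commented-out import
# below is the llama_index.core equivalent.
# from llama_index.core import SimpleDirectoryReader
from llama_index.legacy.readers.file.base import SimpleDirectoryReader
import os

# Print the working directory and its contents to check the relative path used below.
print("pwd ", os.getcwd())
print("DIR\n", os.listdir())
# Alternative: load every file in a directory.
# documents = SimpleDirectoryReader("./data").load_data()
# /workspaces/hands-on-ai-rag-using-llamaindex-3830207/data/almanack_of_naval_ravikant.pdf
# Load a single PDF; each page becomes one Document.
documents = SimpleDirectoryReader(input_files=['../data/almanack_of_naval_ravikant.pdf']).load_data()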

pwd  /workspaces/hands-on-ai-rag-using-llamaindex-3830207/02_Fundamental_Concepts_in_LlamaIndex
DIR
 ['02_04_Indexing.ipynb', '02_07_Agents.ipynb', '02_02_Using_LLMs.ipynb', '02_01_How_LlamaIndex_Is_Organized.ipynb', '02_05_Storing.ipynb', '02_03_Loading_Data.ipynb', '02_06_Querying.ipynb']


In [2]:
len(documents)

242

In [3]:
type(documents[0])

llama_index.legacy.schema.Document

In [4]:
documents[3].__dict__

{'id_': 'fede48c1-5fcb-425d-bdc6-ad8fef497b5f',
 'embedding': None,
 'metadata': {'page_label': '4',
  'file_name': 'almanack_of_naval_ravikant.pdf',
  'file_path': '../data/almanack_of_naval_ravikant.pdf',
  'file_type': 'application/pdf',
  'file_size': 1884309,
  'creation_date': '2025-01-23',
  'last_modified_date': '2025-01-23',
  'last_accessed_date': '2025-01-23'},
 'excluded_embed_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'excluded_llm_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'relationships': {},
 'text': 'Copyright © 2020 EriC JorgEnson\nAll rights reserved.\nthE AlmAn ACk of nA vAl rA vikAnt\nA Guide to Wealth and Happiness\nisBn 978-1-5445-1422-2 Hardcover\n 978-1-5445-1421-5 Paperback\n 978-1-5445-1420-8 Ebook\nThis book has been created as a public service. It is available for \nfree download in pdf a

##### Manually Create Document Objects

In [5]:
from llama_index.core import Document

manual_document = Document(text="This is an example of a manual document")

In [6]:
manual_document.__dict__

{'id_': '772eab16-991b-49f8-af60-b499d42add4d',
 'embedding': None,
 'metadata': {},
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'relationships': {},
 'text': 'This is an example of a manual document',
 'mimetype': 'text/plain',
 'start_char_idx': None,
 'end_char_idx': None,
 'text_template': '{metadata_str}\n\n{content}',
 'metadata_template': '{key}: {value}',
 'metadata_seperator': '\n'}

##### Adding metadata

You can add metadata in the document constructor:

In [7]:
manual_document_with_metadata = Document(
    text="This is an example of a manual document",
    metadata={"filename": "made-up-file-name", "category": "imaginary-category"}
)

In [8]:
manual_document_with_metadata.__dict__

{'id_': '374d4814-df31-4a63-9cde-42d299cf6a8d',
 'embedding': None,
 'metadata': {'filename': 'made-up-file-name',
  'category': 'imaginary-category'},
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'relationships': {},
 'text': 'This is an example of a manual document',
 'mimetype': 'text/plain',
 'start_char_idx': None,
 'end_char_idx': None,
 'text_template': '{metadata_str}\n\n{content}',
 'metadata_template': '{key}: {value}',
 'metadata_seperator': '\n'}

Or after the document is created:

In [9]:
manual_document.metadata = {"filename": "made-up-file-name", "category": "imaginary-category"}

In [10]:
manual_document.__dict__

{'id_': '772eab16-991b-49f8-af60-b499d42add4d',
 'embedding': None,
 'metadata': {'filename': 'made-up-file-name',
  'category': 'imaginary-category'},
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'relationships': {},
 'text': 'This is an example of a manual document',
 'mimetype': 'text/plain',
 'start_char_idx': None,
 'end_char_idx': None,
 'text_template': '{metadata_str}\n\n{content}',
 'metadata_template': '{key}: {value}',
 'metadata_seperator': '\n'}
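The `text_template`, `metadata_template`, and `metadata_seperator` fields in the dictionaries above hint at how metadata and text are combined into a single string. The sketch below reproduces that formatting in plain Python, assuming the templates are applied with `str.format`. `render_content` and its `excluded_keys` parameter are hypothetical names, not part of the LlamaIndex API.

```python
def render_content(text, metadata, excluded_keys=(),
                   text_template="{metadata_str}\n\n{content}",
                   metadata_template="{key}: {value}",
                   metadata_separator="\n"):
    """Combine metadata and text using the template values shown in the output above."""
    metadata_str = metadata_separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
        if key not in excluded_keys  # assumed analogue of the excluded_*_metadata_keys lists
    )
    return text_template.format(metadata_str=metadata_str, content=text)


rendered = render_content(
    "This is an example of a manual document",
    {"filename": "made-up-file-name", "category": "imaginary-category"},
)
print(rendered)
```

Under these assumptions, the metadata lines come first, followed by a blank line and then the document text.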

# 🔧 Transform the data

After loading, we must process and transform the data for retrieval. This means turning the list of `Document` objects into `Node` objects.

- ✂️ Transformations include chunking, extracting metadata, and embedding each chunk.

- 🌟 Nodes are first-class citizens in LlamaIndex: you can define them directly or parse them from Documents.

- 🔄 Transformation inputs and outputs are `Node` objects (Note: `Document` is a subclass of `Node`).

- 🛠️ Nodes are "chunks" of Documents, including text, images, etc., plus metadata and relationships.

- 📊 `NodeParser` classes convert Documents into Nodes with all necessary attributes. There are [a number of](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html) `NodeParser` classes to choose from!

- 📑 High-level API: Use `.from_documents()` for automatic parsing and chunking of Document objects.

- 🔍 Under the hood, each Document is split into Node objects that keep the text and metadata and link back to their parent Document.
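The idea of chunking with overlap, while keeping a link to the parent document, can be sketched in plain Python. This is only an illustration: it counts whitespace-separated words rather than tokens and ignores sentence boundaries, so it is not how `SentenceSplitter` works internally. `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(doc_id, text, chunk_size=8, chunk_overlap=2):
    """Split text into word chunks; consecutive chunks share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    nodes = []
    for start in range(0, len(words), step):
        nodes.append({
            "text": " ".join(words[start:start + chunk_size]),
            "parent_id": doc_id,  # link back to the source document
        })
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return nodes


sample = " ".join(f"w{i}" for i in range(20))
sample_nodes = split_with_overlap("doc-0", sample)
for node in sample_nodes:
    print(node["parent_id"], "|", node["text"])
```

The overlap means the last words of one chunk reappear at the start of the next, which helps preserve context that would otherwise be cut at a chunk boundary.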


In [11]:
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=128,  # in tokens
    chunk_overlap=16,  # in tokens
    paragraph_separator="\n\n"
)

nodes = parser.get_nodes_from_documents(documents, show_progress=True)

Parsing nodes:   0%|          | 0/242 [00:00<?, ?it/s]

ValueError: Unknown document type: <class 'llama_index.legacy.schema.Document'>
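The `ValueError` above most likely occurs because `documents` was loaded through `llama_index.legacy`, while `SentenceSplitter` comes from `llama_index.core` and does not recognize legacy `Document` objects. The simplest remedy is probably to load the file with the `llama_index.core` version of `SimpleDirectoryReader`. Another option, sketched below, is to rebuild each document from its `text` and `metadata`. The helper `rebuild_documents` and the two stand-in classes are hypothetical and exist only so the sketch runs without LlamaIndex installed.

```python
from dataclasses import dataclass, field


def rebuild_documents(source_docs, document_cls):
    """Create `document_cls` instances from objects exposing `.text` and `.metadata`."""
    return [
        document_cls(text=doc.text, metadata=dict(doc.metadata))
        for doc in source_docs
    ]


# Stand-in classes (assumptions) that mimic the two incompatible Document types.
@dataclass
class LegacyDoc:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class CoreDoc:
    text: str
    metadata: dict = field(default_factory=dict)


legacy = [LegacyDoc("page one", {"page_label": "1"})]
converted = rebuild_documents(legacy, CoreDoc)
print(type(converted[0]).__name__, converted[0].metadata)
```

In this notebook, the equivalent call would be `rebuild_documents(documents, Document)` with `Document` imported from `llama_index.core`, after which the parser should accept the result. Note that this copies only text and metadata, not document IDs or relationships.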

In [None]:
type(nodes[42])

In [None]:
nodes[42].__dict__

You can also choose to construct Node objects manually.


In [None]:
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

node1 = TextNode(text="Dad is married to Mom", id_="001")

node2 = TextNode(text="Dad is Son's dad", id_="002")

## NodeRelationships

You can set relationships between nodes.

- 🌐 NodeRelationships define connections between chunks of text. They are useful for:
  - Documents organized in a hierarchical manner (e.g., book, chapter, section, subsection)
  - Maintaining sequential order
  - Other complex relationships (e.g., in legal documents, linking a clause to related clauses or cases)

- 🔍 NodeRelationships help retrieve not just the relevant section, but also related sections that might provide additional context or information.
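To illustrate how relationships can widen retrieval, the sketch below returns a node together with its previous and next neighbours. It uses plain dictionaries instead of LlamaIndex `TextNode` objects; `with_neighbors`, the `"previous"`/`"next"` keys, and the sample sections are assumptions for illustration.

```python
def with_neighbors(node_id, nodes_by_id):
    """Return the texts of the previous node, the node itself, and the next node, in order."""
    relationships = nodes_by_id[node_id]["relationships"]
    ordered_ids = [relationships.get("previous"), node_id, relationships.get("next")]
    # Missing neighbours come back as None, which is never a key, so they are skipped.
    return [nodes_by_id[i]["text"] for i in ordered_ids if i in nodes_by_id]


demo_nodes = {
    "001": {"text": "Section 1", "relationships": {"next": "002"}},
    "002": {"text": "Section 2", "relationships": {"previous": "001", "next": "003"}},
    "003": {"text": "Section 3", "relationships": {"previous": "002"}},
}
print(with_neighbors("002", demo_nodes))
```

If a retriever matched only "Section 2", following the relationships would also surface the sections before and after it as additional context.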

In [None]:
# Link the two nodes in sequence: node1 is followed by node2.
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
    node_id=node2.node_id
)

node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(
    node_id=node1.node_id
)
nodes = [node1, node2]

# A relationship can also carry its own metadata.
node2.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
    node_id=node1.node_id,
    metadata={"Romie": "Mom", "Harpreet": "Dad", "Jind": "Daughter", "Jugaad": "Son"},
)

A bit of cleanup: let's delete the files we downloaded, since we won't need them going forward.


In [None]:
!rm -rf ./data