In [None]:
%%capture
!pip install llama-index==0.10.37 html2text

In [None]:
import os

from getpass import getpass
import nest_asyncio

from dotenv import load_dotenv

nest_asyncio.apply()

load_dotenv()

# 📂 **Loading Data**

Preparing your data for an LLM involves an ingestion pipeline similar to ML data cleaning or traditional ETL processes.

### **Ingestion Pipeline Stages**
  - 📥 Load the data
  - 🔧 Transform the data
  - 🗃️ Index and store the data


Let's start by downloading some example files:

In [None]:
import requests
from pathlib import Path

# Base URL for Project Gutenberg texts
base_url = "https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"

# Directory to save the downloaded files
directory = Path("./data")  # same path the reader and the cleanup cell use later

# Create the directory if it doesn't exist
directory.mkdir(parents=True, exist_ok=True)

# Generate a list of book IDs to download
book_ids = range(1, 11)  # This will create a range from 1 to 10

# Generate URLs for each book ID
urls = [base_url.format(book_id=book_id) for book_id in book_ids]

# Download each file and save it in the specified directory
for url in urls:
    response = requests.get(url, timeout=30)  # time out instead of hanging on a stalled connection
    if response.status_code == 200:
        # Extract the filename from the URL using the book ID and create a file name
        book_id = url.split('/')[-2]  # Extracts the book ID from the URL
        filename = f"pg{book_id}.txt"
        file_path = directory / filename
        # Save the file to the specified directory
        file_path.write_text(response.text, encoding="utf-8")
        print(f"Downloaded {filename} to {file_path}")
    else:
        print(f"Failed to download {url}. HTTP status code: {response.status_code}")

# 📥 Load the data

To use data with an LLM, first load it using data connectors, known as `Readers` in LlamaIndex, which format data into `Document` objects containing data and metadata.

📚 **SimpleDirectoryReader**:
  - The most straightforward loader is `SimpleDirectoryReader`.
  - Built into LlamaIndex, it reads various formats (Markdown, PDFs, Word documents, PowerPoint decks, images, audio, video) from every file in a directory, creating documents.
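
To make the idea of a reader concrete, here is a minimal sketch of what a directory loader does conceptually: it walks a folder, reads each file, and pairs the text with some metadata. This is a toy stand-in built on the standard library only, not LlamaIndex's implementation. `ToyDocument` and `load_directory` are hypothetical names, and the sketch handles only `.txt` files.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_directory(directory):
    """Read every .txt file in `directory` into a ToyDocument (toy sketch)."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append(
            ToyDocument(
                text=path.read_text(encoding="utf-8"),
                metadata={"file_name": path.name, "file_size": path.stat().st_size},
            )
        )
    return docs


# Demo on a throwaway directory so the sketch does not depend on the downloaded books
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    toy_docs = load_directory(tmp)

print([d.metadata["file_name"] for d in toy_docs])
```

The real reader does considerably more (format detection, per-format parsing), but the output has the same basic shape: one object per file, holding text and metadata.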

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()

In [None]:
len(documents)

In [None]:
type(documents[0])

In [None]:
documents[3].__dict__

##### Manually Create Document Objects

In [None]:
from llama_index.core import Document

manual_document = Document(text="This is an example of a manual document")

In [None]:
manual_document.__dict__

##### Adding metadata

You can add metadata in the document constructor:

In [None]:
manual_document_with_metadata = Document(
    text="This is an example of a manual document",
    metadata={"filename": "made-up-file-name", "category": "imaginary-category"}
)

In [None]:
manual_document_with_metadata.__dict__

Or after the document is created:

In [None]:
manual_document.metadata = {"filename": "made-up-file-name", "category": "imaginary-category"}

In [None]:
manual_document.__dict__

# 🔧 Transform the data

After loading, the data must be processed and transformed for retrieval. This means turning the list of `Document` objects into `Node` objects.

- ✂️ Transformations include chunking, extracting metadata, and embedding each chunk.

- 🌟 Nodes are first-class citizens in LlamaIndex: you can define them directly or parse them from Documents.

- 🔄 Transformation inputs and outputs are `Node` objects (note: `Document` is a subclass of `Node`).

- 🛠️ Nodes are "chunks" of Documents, including text, images, etc., plus metadata and relationships.

- 📊 `NodeParser` classes convert Documents into Nodes with all necessary attributes. There are [a number of](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html) `NodeParser` classes you can choose from!

- 📑 High-level API: Use `.from_documents()` for automatic parsing and chunking of Document objects.

- 🔍 The underlying process splits each Document into Node objects, each keeping its text and metadata along with a link to its parent Document.
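
The chunking step described above can be pictured with a toy splitter. The sketch below is an illustration under simplifying assumptions, not how `SentenceSplitter` works internally: it treats whitespace-separated words as "tokens" and ignores sentence boundaries. `chunk_with_overlap` is a hypothetical helper whose only purpose is to show what a chunk size and a chunk overlap mean.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split whitespace tokens into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


toy_chunks = chunk_with_overlap("a b c d e f g h i j", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

Each chunk repeats the last token of the previous one, so text that straddles a boundary still appears intact in at least one chunk. A larger overlap buys more continuity at the cost of more (and more redundant) chunks.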


In [None]:
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(
    chunk_size=128,  # in tokens
    chunk_overlap=16,  # in tokens
    paragraph_separator="\n\n"
)

nodes = parser.get_nodes_from_documents(documents, show_progress=True)

In [None]:
type(nodes[42])

In [None]:
nodes[42].__dict__

You can also choose to construct Node objects manually.


In [None]:
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

node1 = TextNode(text="Dad is married to Mom", id_="001")

node2 = TextNode(text="Dad is Son's dad", id_="002")

## NodeRelationships

You can set relationships between nodes.

- 🌐 NodeRelationships assign connections between chunks of text. They are useful for:
  - Documents organized in a hierarchical manner (e.g., book, chapter, section, subsection)
  - Maintaining sequential order
  - Other complex relationships (e.g., in legal documents, links between a clause and other clauses or cases)

- 🔍 NodeRelationships help retrieve not just the relevant section, but also related sections that might provide additional context or information.
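
To see why such links help at retrieval time, consider a toy model of the idea. The sketch below uses plain dictionaries rather than LlamaIndex objects; the `prev`/`next` keys and the `with_neighbors` helper are hypothetical and only illustrate how a matched chunk can be expanded with its neighbors.

```python
# Toy model: each node is a dict with text and pointers to its neighbors (hypothetical layout)
toy_nodes = {
    "001": {"text": "Chapter 1 opens the story.", "prev": None, "next": "002"},
    "002": {"text": "Chapter 2 raises the stakes.", "prev": "001", "next": "003"},
    "003": {"text": "Chapter 3 resolves it.", "prev": "002", "next": None},
}


def with_neighbors(node_id, nodes):
    """Return the text of the previous, matched, and next nodes in reading order."""
    node = nodes[node_id]
    ids = [node["prev"], node_id, node["next"]]
    return [nodes[i]["text"] for i in ids if i is not None]


# Suppose retrieval matched node "002"; the links let us pull in its surroundings
context = with_neighbors("002", toy_nodes)
print(context)
```

If a search matches only the middle chunk, following the links returns all three chunks in order, which gives the LLM more context than the single match would.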

In [None]:
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(
    node_id=node2.node_id
)

node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(
    node_id=node1.node_id
)
nodes = [node1, node2]

node2.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
    node_id=node1.node_id,
    metadata={"Romie": "Mom", "Harpreet": "Dad", "Jind": "Daughter", "Jugaad": "Son"},
)

Finally, a bit of cleanup: let's delete the text files we downloaded, since we won't need them going forward.


In [None]:
!rm -rf ./data