# 1. Setup

## 1.1 Installing Libraries

Reference: [LlamaIndex Installation and Setup](https://docs.llamaindex.ai/en/stable/getting_started/installation/)

In [None]:
!pip install python-dotenv llama-index chromadb llama-index-vector-stores-chroma llama-index-retrievers-bm25 EbookLib html2text langchain-text-splitters

## 1.2 Importing Libraries

In [1]:
import chromadb

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, PromptTemplate, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.node_parser.relational.markdown_element import MarkdownElementNodeParser
from llama_index.core.node_parser import MarkdownNodeParser

from ebooklib import epub
import uuid
import os
from pathlib import Path
from dotenv import load_dotenv
import nest_asyncio
from enum import Enum

nest_asyncio.apply()

## 1.3 Importing Environment Variables

In [2]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
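
`os.getenv` returns `None` when a variable is missing, so a misconfigured `.env` file only surfaces later as an authentication error. Below is a minimal sketch of a fail-fast check. The helper name `require_env` and its optional `env` argument are hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str, env=None) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    # `env` defaults to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

After `load_dotenv()`, this could be used as `OPENAI_API_KEY = require_env("OPENAI_API_KEY")` in place of the plain `os.getenv` call.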

## 1.4 Setting up Embedding Model

In [3]:
embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY)

## 1.5 Setting up LLM

In [4]:
# The LlamaIndex OpenAI wrapper selects the model via `model`, not `model_name`
llm = OpenAI(api_key=OPENAI_API_KEY, model="gpt-4o-mini", temperature=0.1)

# 2. Loading Data from Directory using `SimpleDirectoryReader`

Reference: [Loaders](https://docs.llamaindex.ai/en/stable/understanding/loading/loading/)

Extracting Metadata Reference: [SimpleDirectoryReader](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/)

We can pass a function as `file_metadata`. It is called with each file's path and returns a dictionary of metadata that is attached to the resulting `Document` object for that file.

In [5]:
def extract_epub_metadata(book_path: str) -> dict:
    book_path = Path(book_path)
    if not book_path.exists():
        raise FileNotFoundError(f"EPUB file not found at path: {book_path}")
    book = epub.read_epub(str(book_path))

    return {
        "id": f"epub-{uuid.uuid4().hex}",
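        # removesuffix (Python 3.9+) drops only a literal ".epub" ending; rstrip would strip any trailing '.', 'e', 'p', 'u', 'b' characters
        "title": book.get_metadata("DC", "title")[0][0].removesuffix(".epub") if book.get_metadata("DC", "title") else "N/A",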
        "author": book.get_metadata("DC", "creator")[0][0] if book.get_metadata("DC", "creator") else "",
        "language": book.get_metadata("DC", "language")[0][0] if book.get_metadata("DC", "language") else "",
        "description": book.get_metadata("DC", "description")[0][0] if book.get_metadata("DC", "description") else "",
        "type": "epub",
        "embeddings": "openaiembeddings"
    }
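
`book.get_metadata` returns a list of `(value, attributes)` tuples, which may be empty. Note also that `str.rstrip` treats its argument as a set of characters rather than a suffix, so trimming a `.epub` extension from a title needs some care. The sketch below illustrates both points with two hypothetical helpers, `first_dc_value` and `strip_epub_suffix`, which are not part of the original notebook. It assumes Python 3.9+ for `str.removesuffix`.

```python
def first_dc_value(entries, default=""):
    """Return the text of the first metadata entry, or `default` if there is none."""
    if not entries:
        return default
    value = entries[0][0]  # each entry is a (value, attributes) tuple
    return value if value else default


def strip_epub_suffix(title: str) -> str:
    """Remove a literal trailing '.epub', leaving all other characters intact."""
    return title.removesuffix(".epub")


# rstrip removes any trailing characters from the set {'.', 'e', 'p', 'u', 'b'}
damaged = "Group.epub".rstrip(".epub")
intact = strip_epub_suffix("Group.epub")
print(damaged, intact)
```

With these helpers, a field such as the title could be written as `strip_epub_suffix(first_dc_value(book.get_metadata("DC", "title"), "N/A"))`.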

In [6]:
documents = SimpleDirectoryReader(input_dir="./data", file_metadata=extract_epub_metadata).load_data()


In [7]:
print(f"Total Documents: {len(documents)}")

Total Documents: 1


In [8]:
print(documents[0].metadata)

{'id': 'epub-cffc5c28f4df4d1da6de069631240333', 'title': 'Islamic Laws', 'author': 'Sayyid Ali Hussaini Sistani', 'language': 'en', 'description': '', 'type': 'epub', 'embeddings': 'openaiembeddings'}


# 3. Setting up Index and Vectorstore

In [9]:
# The name "IndexedVectorStore" emphasizes that the class handles both the vector store and the index

class IndexedVectorStore:
    def __init__(self):
        self.db = chromadb.PersistentClient(path="./db")
        self.chroma_collection = self.db.get_or_create_collection("transcription_project")
        self.vector_store = ChromaVectorStore(chroma_collection=self.chroma_collection)
        self.index = VectorStoreIndex.from_vector_store(
            self.vector_store,
            embed_model=embed_model,
        )

    def add_documents(self, documents: list) -> None:
        # Add the documents to the LlamaIndex and persist them
        for document in documents:
            self.index.insert(document)
        self.index.storage_context.persist(persist_dir="./db")

In [10]:
vectorstore = IndexedVectorStore()

# 4. Transforming

After the data is loaded, you need to process and transform it before putting it into a storage system. These transformations include chunking, extracting metadata, and embedding each chunk. They are necessary to make sure the data can be retrieved and used optimally by the LLM.

An `IngestionPipeline` applies a sequence of Transformations to your input data. The resulting nodes are either returned or, if a vector store is given, inserted into it.

Reference: [IngestionPipeline](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/)
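
The idea of applying transformations in sequence can be sketched in plain Python. This is a conceptual toy, not LlamaIndex's implementation: `run_pipeline`, `split_paragraphs`, and `strip_whitespace` are hypothetical names, and plain strings stand in for documents and nodes.

```python
from typing import Callable, List

# A transformation takes a list of text items and returns a new list.
Transformation = Callable[[List[str]], List[str]]


def run_pipeline(items: List[str], transformations: List[Transformation]) -> List[str]:
    """Feed the output of each transformation into the next one, in order."""
    for transform in transformations:
        items = transform(items)
    return items


def split_paragraphs(texts: List[str]) -> List[str]:
    """Split every text on blank lines and drop empty pieces."""
    return [part for text in texts for part in text.split("\n\n") if part.strip()]


def strip_whitespace(texts: List[str]) -> List[str]:
    """Trim leading and trailing whitespace from every item."""
    return [text.strip() for text in texts]


result = run_pipeline(
    ["First paragraph.\n\n  Second paragraph.  "],
    [split_paragraphs, strip_whitespace],
)
print(result)
```

The order of the list matters: each step sees only what the previous step produced, which is why the splitter comes before the embedding model in the notebook's `transformations` list.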

In [11]:
chunk_size = 512
overlap_percentage = 0.25
chunk_overlap = int(chunk_size * overlap_percentage)

print(f"Chunk Size: {chunk_size}, Overlap Percentage: {overlap_percentage}, Chunk Overlap: {chunk_overlap}")

Chunk Size: 512, Overlap Percentage: 0.25, Chunk Overlap: 128
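
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch over a list of tokens. It is an illustration under stated assumptions, not the algorithm `SentenceSplitter` uses: the real splitter also respects sentence and paragraph boundaries, so its chunks and overlaps vary in size. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


tokens = list(range(10))
chunks = sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

With the notebook's values (512 and 128), each window in this simplified model would advance by 384 tokens.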


In [12]:
transformations = [
    # MarkdownNodeParser(),
    SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, paragraph_separator="\n\n\n"),
    embed_model
]

In [13]:
pipeline = IngestionPipeline(
    transformations=transformations,
    # vector_store=vectorstore.vector_store
)

In [14]:
chunks = pipeline.run(documents=documents)

In [15]:
print(f"Total Chunks: {len(chunks)}")
print(type(chunks[0]))

Total Chunks: 850
<class 'llama_index.core.schema.TextNode'>


In [19]:
for i, doc in enumerate(chunks[:10]):
    print(f"Document {i + 1} Length: {len(doc.text)}")

Document 1 Length: 298
Document 2 Length: 216
Document 3 Length: 1617
Document 4 Length: 1702
Document 5 Length: 1665
Document 6 Length: 1245
Document 7 Length: 1516
Document 8 Length: 1515
Document 9 Length: 1124
Document 10 Length: 7
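
The lengths above are character counts (`len(doc.text)`), whereas `SentenceSplitter` measures `chunk_size` in tokens, so chunks of roughly 1,500 to 1,700 characters are consistent with a 512-token limit for English text. A small sketch for summarizing chunk lengths follows. The helper `summarize_lengths` is hypothetical, and the sample values are the ten lengths printed above.

```python
import statistics


def summarize_lengths(texts):
    """Return count, min, median, and max character length for a list of strings."""
    lengths = [len(text) for text in texts]
    if not lengths:
        raise ValueError("texts must not be empty")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Placeholder strings with the same lengths as the first ten chunks printed above.
sample_lengths = [298, 216, 1617, 1702, 1665, 1245, 1516, 1515, 1124, 7]
summary = summarize_lengths(["x" * n for n in sample_lengths])
print(summary)
```

In the notebook itself this could be called as `summarize_lengths([chunk.text for chunk in chunks])`. Very short chunks, such as the 7-character one, are often worth inspecting before indexing.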


In [56]:
print(f"Total Sentence Chunks: {len(chunks)}\n")

for i, chunk in enumerate(chunks):
    print(f"CHUNK {i}:\n{chunk.text}\n-------------------------------------------------\n-------------------------------------------------\n\n")

Total Sentence Chunks: 676

CHUNK 0:
Theological Instructions (Amuzish-e Aqa'id)

This text is a collection of 60 lessons on Islamic Theology. The topics are
organized in logical order and are full of traditions and verses of the Holy
Qur’an to support and clarify each subject. This is not only a traditional
text on theology but it approaches the modern and postmodern issues in a
traditional style and replies the doubts raised by that domain on traditional
bases.

Get PDF [4] Get EPUB [5] Get MOBI [6]
-------------------------------------------------
-------------------------------------------------


CHUNK 1:
Introduction

It is almost a truism to say that a person’s beliefs play a central role in
making up his personality. We are what we believe, and we become what we come
to believe. It is our beliefs that determine how we look at ourselves, how we
look at life, at the world around us and our own role and destiny in life. Our
beliefs determine our ideals and actions and mold the con

# 5. Recursive Retriever

In [10]:
vector_query_engine = vectorstore.index.as_query_engine()
vector_retriever = vectorstore.index.as_retriever()

In [11]:
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(recursive_retriever, response_synthesizer=response_synthesizer)

In [12]:
response = query_engine.query("What are the signs given by Allah?")

Retrieving with query id None: What are the signs given by Allah?
Retrieving text node: B. Familiar Signs

By reflecting upon the signs _(ayat)_ around us, man can arrive at an
understanding of his Creator. In Qur’anic terminology such reflection is
termed as ‘contemplation upon the signs of the Lord’. Hence everything that is
in the heavens and the earth and within man himself reflects God and channels
the heart to feel the presence of God’s guidance in the universe.

This very book that you have in your hand is a sign from the author. Is it not
true that by reading it you are becoming aware that the writer is intelligent
and has a purpose?

Have you ever thought that this work is the result of a chain of reactions
without any intent? Is it not an absurd idea that an encyclopedia of a hundred
volumes could have come into existence as an effect of an explosion which came
to occur in a metal mine, the fragments of which took the form of letters and
through ac