# 1. Setup

## 1.1 Installing Libraries

Reference: [LlamaIndex Installation and Setup](https://docs.llamaindex.ai/en/stable/getting_started/installation/)

In [21]:
!pip install python-dotenv llama-index chromadb llama-index-vector-stores-chroma EbookLib html2text

Collecting EbookLib
  Using cached EbookLib-0.18-py3-none-any.whl
Collecting html2text
  Downloading html2text-2024.2.26.tar.gz (56 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting require


[notice] A new release of pip is available: 23.2.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip



Building wheels for collected packages: html2text
  Building wheel for html2text (pyproject.toml): started
  Building wheel for html2text (pyproject.toml): finished with status 'done'
  Created wheel for html2text: filename=html2text-2024.2.26-py3-none-any.whl size=33132 sha256=c2de4fd73f9c2c2028d2fcf8495e790f793611fd9d1e195fabc00a7aa1dbdeb9
  Stored in directory: c:\users\sufiyaanusmani\appdata\local\pip\cache\wheels\23\58\7c\d9c8c4d924a1ac2b621add1b2c1d30b639629a33cfdfde6a45
Successfully built html2text
Installing collected packages: lxml, html2text, EbookLib
Successfully installed EbookLib-0.18 html2text-2024.2.26 lxml-5.3.0


## 1.2 Importing Libraries

In [1]:
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, PromptTemplate
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI
from tqdm.asyncio import tqdm
from ebooklib import epub
import uuid
import os
from pathlib import Path
from dotenv import load_dotenv

## 1.3 Importing Environment Variables

In [2]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
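
`os.getenv` returns `None` when a variable is missing, so a missing key would only surface later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv` or LlamaIndex):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail early.

    `require_env` is a hypothetical helper used for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file")
    return value
```

It could be called right after `load_dotenv()`, for example `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.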

## 1.4 Setting up Embedding Model

In [3]:
embed_model = OpenAIEmbedding(api_key=OPENAI_API_KEY)

## 1.5 Setting up LLM

In [4]:
# llama-index's OpenAI wrapper takes the model via the `model` keyword
llm = OpenAI(api_key=OPENAI_API_KEY, model="gpt-4o-mini", temperature=0.1)

# 2. Setting up VectorStore

In [5]:
class VectorStore:
    def __init__(self):
        self.db = chromadb.PersistentClient(path="./db")
        self.chroma_collection = self.db.get_or_create_collection("transcription_project")
        self.vector_store = ChromaVectorStore(chroma_collection=self.chroma_collection)
        self.index = VectorStoreIndex.from_vector_store(
            self.vector_store,
            embed_model=embed_model,
        )

    def add_documents(self, documents: list) -> None:
        # Add the documents to the LlamaIndex and persist them
        for document in documents:
            self.index.insert(document)
        self.index.storage_context.persist(persist_dir="./db")

In [6]:
vectorstore = VectorStore()

# 3. Loading Data from Directory using `SimpleDirectoryReader`

Reference: [Loaders](https://docs.llamaindex.ai/en/stable/understanding/loading/loading/)

Extracting Metadata Reference: [SimpleDirectoryReader](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/)

To attach metadata to each loaded `Document`, pass a function as `file_metadata`. `SimpleDirectoryReader` calls it with each file's path and stores the returned dictionary as that document's metadata.

In [8]:
def extract_epub_metadata(book_path: str) -> dict:
    book_path = Path(book_path)
    if not book_path.exists():
        raise FileNotFoundError(f"EPUB file not found at path: {book_path}")
    book = epub.read_epub(str(book_path))

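    return {
        "id": f"epub-{uuid.uuid4().hex}",
        # removesuffix (Python 3.9+) drops only a literal ".epub" ending; rstrip(".epub") strips any
        # trailing '.', 'e', 'p', 'u', 'b' characters and would turn "Give and Take" into "Give and Tak"
        "title": book.get_metadata("DC", "title")[0][0].removesuffix(".epub") if book.get_metadata("DC", "title") else "N/A",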
        "author": book.get_metadata("DC", "creator")[0][0] if book.get_metadata("DC", "creator") else "",
        "language": book.get_metadata("DC", "language")[0][0] if book.get_metadata("DC", "language") else "",
        "description": book.get_metadata("DC", "description")[0][0] if book.get_metadata("DC", "description") else "",
        "type": "epub",
    }

In [58]:
documents = SimpleDirectoryReader(input_dir="./data", file_metadata=extract_epub_metadata).load_data()

In [59]:
print(f"Total Documents: {len(documents)}")

Total Documents: 2


In [60]:
print(documents[0].metadata)

{'id': 'epub-1b21e77b9e224f789670b13ca27fe58e', 'title': 'Child Psychology', 'author': 'Mohamed A. Khalfan - XKP', 'language': 'en', 'description': 'This is a "household" book with 30 Chapters written for Muslim parents on the important subject of the Upbringing of Children with the application of Simple Psychology, Broad Parental Vision and Islamic Values. The Book is useful to the parents in helping their children develop a strong personality and assert it fully in the adult life to fare well in the society for a dignified survival as human life gets more complex with newer challenges and a wider spectrum of competition .\nPublisher: Tabligh Centre, Dar es Salaam\nFirst Edition: December, 2002\n-\nISLAMICMOBILITY.COM', 'type': 'epub'}


Loading a new book

In [15]:
new_book = SimpleDirectoryReader(input_files=["./data/give_and_take.epub"], file_metadata=extract_epub_metadata).load_data()
print(f"Metadata of first element: {new_book[0].metadata}")

# This way, we can load a new book and can use the same VectorStore object to add the new book to the index

Metadata of first element: {'id': 'epub-6d7d7179a80e414c9c18e6715cae457f', 'title': 'Give and Tak', 'author': 'Unknown', 'language': 'en', 'description': '', 'type': 'epub'}
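
The truncated title `'Give and Tak'` in the recorded output is characteristic of `str.rstrip(".epub")`, which strips a set of characters rather than a suffix. A small sketch of the difference, using a made-up title string:

```python
title = "Give and Take.epub"  # made-up value for illustration

# rstrip treats its argument as a set of characters and keeps stripping
# while the last character is one of '.', 'e', 'p', 'u', 'b'
stripped = title.rstrip(".epub")

# removesuffix (Python 3.9+) removes the exact suffix once, if present
suffix_removed = title.removesuffix(".epub")

print(repr(stripped))        # 'Give and Tak'
print(repr(suffix_removed))  # 'Give and Take'
```

The same truncation happens when the title has no `.epub` ending at all, because the final `e` of "Take" is itself in the stripped character set.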


# 4. Transforming

After the data is loaded, it needs to be processed and transformed before it goes into a storage system. These transformations include chunking, extracting metadata, and embedding each chunk, so that the data can be retrieved and used effectively by the LLM.

An `IngestionPipeline` applies a sequence of Transformations to the input data. The resulting nodes are returned and, if a vector store is given, inserted into it.

Reference: [IngestionPipeline](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/)

In [16]:
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=10)
    ],
    vector_store=vectorstore.vector_store,
)
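
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified word-level sketch. It is not how `SentenceSplitter` works internally: `SentenceSplitter` measures size in tokens and prefers sentence boundaries, whereas the hypothetical `chunk_words` below simply slides a fixed window over whitespace-separated words.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` words.

    Consecutive chunks share `chunk_overlap` words. Hypothetical,
    simplified stand-in for a real text splitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With `chunk_size=4` and `chunk_overlap=1`, each chunk starts with the last word of the previous one, which is the role the overlap plays in keeping context across chunk boundaries.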

In [None]:
documents = pipeline.run(documents=documents)

In [84]:
print(f"Total Documents: {len(documents)}")

Total Documents: 502


Storing documents in vectorstore

In [89]:
documents = [Document(text=doc.text, metadata=doc.metadata, id_=doc.id_) for doc in documents]
vectorstore.add_documents(documents)

Function to load and add a new book to vectorstore

In [18]:
def add_book(book_path: str):
    print(f"Loading book from path: {book_path}")
    new_book = SimpleDirectoryReader(input_files=[book_path], file_metadata=extract_epub_metadata).load_data()
    print(f"Loaded book with metadata: {new_book[0].metadata}")
    new_book = pipeline.run(documents=new_book)
    new_book = [Document(text=doc.text, metadata=doc.metadata, id_=doc.id_) for doc in new_book]
    vectorstore.add_documents(new_book)
    print("Book added successfully!")

In [19]:
add_book("./data/give_and_take.epub")

Loading book from path: ./data/give_and_take.epub


Loaded book with metadata: {'id': 'epub-b09eb4f758f54495a29b6b23763dffb3', 'title': 'Give and Tak', 'author': 'Unknown', 'language': 'en', 'description': '', 'type': 'epub'}
Book added successfully!


# 5. Query Translation

In [9]:
question_template = """You are an AI language model assistant specializing in query expansion. Your task is to generate {num_queries} diverse versions of the given user question. These variations will be used to retrieve relevant documents from a vector database, helping to overcome limitations of distance-based similarity search.

Original question: {query}

Instructions:
1. Create {num_queries} unique variations of the original question.
2. Ensure each variation maintains the core intent of the original question.
3. Use different phrasings, synonyms, or perspectives for each variation.
4. Consider potential context or implications not explicitly stated in the original question.
5. Avoid introducing new topics or drastically changing the meaning of the question.

Please provide your {num_queries} question variations, each on a new line:
"""

question_prompt = PromptTemplate(question_template)

In [10]:
def generate_query_variations(question: str):
    print(f"Generating query variations for: {question}")

    fmt_prompt = question_prompt.format(num_queries=5, query=question)
    response = llm.complete(fmt_prompt)
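    # Drop blank lines so empty strings are never sent to the retriever
    queries = [line.strip() for line in response.text.splitlines() if line.strip()]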

    print("Generated query variations:")
    for query in queries:
        print(f"  {query}")

    return queries

In [11]:
question = "Why is there only one God?"
query_variations = generate_query_variations(question)

Generating query variations for: Why is there only one God?
Generated query variations:
  1. What is the reason behind the belief in a singular God?
  2. How do monotheistic religions justify the existence of only one God?
  3. What factors contribute to the concept of a solitary deity in various faiths?
  4. Is there a specific rationale for the monotheistic view of a singular divine being?
  5. What leads different cultures and religions to uphold the idea of a sole supreme being?
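
The variations come back with list numbering (`1.`, `2.`, ...) still attached, and that prefix becomes part of the text that is embedded for search. A minimal sketch of one way to clean the lines, assuming a hypothetical helper `clean_query_lines` and that the model answers with one variation per line:

```python
import re


def clean_query_lines(raw: str) -> list[str]:
    """Drop blank lines and leading list markers such as '1. ' or '- '."""
    queries = []
    for line in raw.splitlines():
        # Remove an optional leading number ("1." or "1)") or bullet ("-", "*")
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


raw_response = (
    "  1. What is the reason behind the belief in a singular God?\n"
    "\n"
    "  2. How do monotheistic religions justify the existence of only one God?"
)
print(clean_query_lines(raw_response))
```

The pattern is deliberately simple; a question that itself begins with a number followed by a period would also lose that prefix.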


# 6. Performing Vector Search

In [12]:
top_n = 5
vector_retriever = vectorstore.index.as_retriever(similarity_top_k=top_n)

In [13]:
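def fuse_results(results_dict: dict, similarity_top_k: int = 2):
    """Fuse several ranked result lists with reciprocal rank fusion (RRF)."""
    # `k` dampens the influence of rank: each hit contributes 1 / (rank + k),
    # so a larger `k` makes top ranks count less relative to lower ones.
    k = 60.0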
    fused_scores = {}
    text_to_node = {}

    # compute reciprocal rank scores
    for nodes_with_scores in results_dict.values():
        for rank, node_with_score in enumerate(
            sorted(
                nodes_with_scores, key=lambda x: x.score or 0.0, reverse=True
            )
        ):
            text = node_with_score.node.get_content()
            text_to_node[text] = node_with_score
            if text not in fused_scores:
                fused_scores[text] = 0.0
            fused_scores[text] += 1.0 / (rank + k)

    # sort results
    reranked_results = dict(
        sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    )

    # adjust node scores
    reranked_nodes = []
    for text, score in reranked_results.items():
        reranked_nodes.append(text_to_node[text])
        reranked_nodes[-1].score = score

    return reranked_nodes[:similarity_top_k]
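
The scoring used by `fuse_results` can be followed on a toy example. The sketch below applies the same formula, `1 / (rank + k)` with zero-based ranks and `k = 60`, to plain document ids; the function name `reciprocal_rank_fusion` and the ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: float = 60.0) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids; rank 0 is the best hit in each list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (rank + k)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical result lists, one per query variation
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

The fused order is `b`, `c`, `a`, `d`: documents that appear in both lists outrank `a`, even though `a` was the top hit for one of the queries.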

In [14]:
async def retrieve_documents(
    queries: list[str],
    retrievers: list,
    top_n: int
):
    print("Retrieving documents")
    tasks = []
    task_queries = []

    # One retrieval task per (query, retriever) pair; task_queries stays aligned with tasks
    for query in queries:
        for retriever in retrievers:
            tasks.append(retriever.aretrieve(query))
            task_queries.append(query)
    task_results = await tqdm.gather(*tasks)

    # Key each result list by its query and task position so none are dropped
    results_dict = {}
    for i, (query, query_result) in enumerate(zip(task_queries, task_results)):
        results_dict[(query, i)] = query_result

    # fuse_results already truncates to top_n, so no further pruning is needed
    docs = fuse_results(results_dict, top_n)
    print(f"Retrieved {len(docs)} documents (after fusion)")

    return docs
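
With more than one retriever, each query produces several tasks, so results have to be matched to (query, retriever) pairs rather than to queries alone. A minimal sketch of that bookkeeping, using a made-up `fake_retrieve` coroutine in place of `retriever.aretrieve` and plain `asyncio.gather` in place of `tqdm.gather`:

```python
import asyncio


async def fake_retrieve(retriever_name: str, query: str) -> list[str]:
    """Stand-in for `retriever.aretrieve(query)`; returns labelled dummy hits."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"{retriever_name}:{query}"]


async def retrieve_all(queries: list[str], retriever_names: list[str]) -> dict:
    keys, tasks = [], []
    for query in queries:
        for name in retriever_names:
            keys.append((query, name))
            tasks.append(fake_retrieve(name, query))
    # asyncio.gather returns results in the same order as the awaitables passed in
    results = await asyncio.gather(*tasks)
    return dict(zip(keys, results))


results = asyncio.run(retrieve_all(["q1", "q2"], ["vector", "keyword"]))
for key, hits in results.items():
    print(key, hits)
```

Inside a notebook, where an event loop is already running, `await retrieve_all(...)` would be used instead of `asyncio.run(...)`.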

In [15]:
docs = await retrieve_documents(query_variations, [vector_retriever], top_n)

Retrieving documents


100%|██████████| 5/5 [00:02<00:00,  1.85it/s]

Retrieved 5 documents (after fusion)





In [25]:
# Printing first document
print("------- BOOK INFO -------")
print(f"Book Title: {docs[0].metadata['title']}")
print(f"Book ID   : {docs[0].metadata['id']}")
print(f"Author    : {docs[0].metadata['author']}")

print("\n------- TEXT -------")
print(docs[0].text)

------- BOOK INFO -------
Book Title: Theological Instructions (Amuzish-e Aqa'id)
Book ID   : epub-5344a60e09d24e24b20d8b03232d1684
Author    : Muhammad Taqi Misbah Yazdi

------- TEXT -------
There is no power except by Allah!” (Holy
Qur’an,[18:39](/printepub/book/export/html/123734#quran_ref_189052)).** _

### D. The Two Important Results Achieved

The result of the unity of Divine action is that nothing other than God
deserves worship, because as we have indicated before, a being does not
deserve to be worshipped by just being a creator or a lord. In other words,
Divinity _(uluhiyyah)_ is the necessary condition of lordship and creatorship.

From another angle, the result of monotheism in the latter meaning is that the
entirety of human reliance must be upon God, and in all of works He must be
trusted and solely from Him help must be requested. Man’s fear and hope ought
to be from Him, and when the sources for the completion of needs are out of
reach, one must not despair, because G

# 7. Preparing Context

In [29]:
def prepare_context(documents) -> str:
    contexts = []
    for doc in documents:
        # Be deliberate about which fields we actually send to the LLM context
        if doc.metadata["type"] == "video":
            context = f"""
                ### title: {doc.metadata["title"]}
                id: {len(contexts) + 1}
                type: video
                source_url: {doc.metadata["source_url"]}
                content:
                {doc.text}
            """
        elif doc.metadata["type"] in ("book", "epub"):
            context = f"""
                ### title: {doc.metadata["title"]}
                id: {len(contexts) + 1}
                author: {doc.metadata["author"]}
                book_id: {doc.metadata["id"]}
                type: book
                content:
                {doc.text}
            """
        else:
            # Skip unknown types so an undefined or stale `context` is never appended
            continue

        contexts.append(context)

    return "\n".join(contexts)
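
The triple-quoted f-strings in `prepare_context` carry the source-code indentation into the prompt, which is why the printed context has a deeply indented header followed by unindented body text. One way to avoid that is to assemble the block from a list of lines. The helper below is a hypothetical sketch that mirrors the book branch; it is not part of the notebook's pipeline.

```python
def format_book_context(index: int, title: str, author: str, book_id: str, text: str) -> str:
    """Render one context block with no leading indentation.

    Hypothetical helper; the field names mirror the book branch of prepare_context.
    """
    header = [
        f"### title: {title}",
        f"id: {index}",
        f"author: {author}",
        f"book_id: {book_id}",
        "type: book",
        "content:",
    ]
    return "\n".join(header + [text.strip()])


# Made-up values for illustration
print(format_book_context(1, "Example Title", "Example Author", "epub-123", "First line.\nSecond line."))
```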

In [31]:
context = prepare_context(docs)
print(context)


                ### title: Theological Instructions (Amuzish-e Aqa'id)
                id: 1
                author: Muhammad Taqi Misbah Yazdi
                book_id: epub-5344a60e09d24e24b20d8b03232d1684
                type: book
                content:
                There is no power except by Allah!” (Holy
Qur’an,[18:39](/printepub/book/export/html/123734#quran_ref_189052)).** _

### D. The Two Important Results Achieved

The result of the unity of Divine action is that nothing other than God
deserves worship, because as we have indicated before, a being does not
deserve to be worshipped by just being a creator or a lord. In other words,
Divinity _(uluhiyyah)_ is the necessary condition of lordship and creatorship.

From another angle, the result of monotheism in the latter meaning is that the
entirety of human reliance must be upon God, and in all of works He must be
trusted and solely from Him help must be requested. Man’s fear and hope ought
to be from Him, and when the so

# 8. Generate Answer

In [32]:
answer_template = """You are a knowledgeable AI assistant tasked with answering questions based on the provided context. Your goal is to provide a comprehensive, accurate, and well-structured response using Chain-of-Thought reasoning.

Context:
{context_str}

Question: {query_str}

Instructions:
1. Carefully analyze the given context and question.
2. Use Chain-of-Thought reasoning to break down your answer into clear steps:
   a. First, identify the key components of the question, such as sub-problems that need to be explained before an answer can be derived
   b. Then, for each component, explain your thought process as you analyze the relevant information from the context.
   c. Show how you're connecting different pieces of information to form your conclusion.
3. Provide a detailed answer using only the information from the context.
4. If the context doesn't contain enough information to fully answer the question, state this clearly and explain why.
5. Organize your response with appropriate headings and subheadings for clarity.
6. Use bullet points or numbered lists where applicable to improve readability.
7. If relevant, include brief examples or analogies to illustrate key points.
8. After your detailed Chain-of-Thought reasoning, summarize your main points at the end of the response.
9. After your response, add a "References" section listing the full contexts that you used to arrive at your answer. Provide as much detail as available from each context (e.g., book title, author, full text of the relevant contexts). For video sources, include the URL of the video.

For the references, use the format:

# References:
(for each context:)
## Context Id: title
Context excerpt (print as it is)

Please format your entire response in markdown for optimal readability.
"""

answer_prompt = PromptTemplate(answer_template)

In [33]:
def generate_answer(question: str, context: str) -> tuple[str, str]:
    llm_answer_prompt = answer_prompt.format(context_str=context, query_str=question)

    print("QUESTION SENT TO LLM:")
    print(llm_answer_prompt)

    # The prompt already contains the retrieved context, so call the LLM directly;
    # a query engine would run a second retrieval with the whole prompt as the query
    answer = llm.complete(llm_answer_prompt)

    print(f"LLM Output (answer generation): \n{answer}")

    return str(llm_answer_prompt), str(answer)

In [35]:
llm_answer_prompt, answer_md = generate_answer(question, context)

QUESTION SENT TO LLM:
You are a knowledgeable AI assistant tasked with answering questions based on the provided context. Your goal is to provide a comprehensive, accurate, and well-structured response using Chain-of-Thought reasoning.

Context:

                ### title: Theological Instructions (Amuzish-e Aqa'id)
                id: 1
                author: Muhammad Taqi Misbah Yazdi
                book_id: epub-5344a60e09d24e24b20d8b03232d1684
                type: book
                content:
                There is no power except by Allah!” (Holy
Qur’an,[18:39](/printepub/book/export/html/123734#quran_ref_189052)).** _

### D. The Two Important Results Achieved

The result of the unity of Divine action is that nothing other than God
deserves worship, because as we have indicated before, a being does not
deserve to be worshipped by just being a creator or a lord. In other words,
Divinity _(uluhiyyah)_ is the necessary condition of lordship and creatorship.

From another angle

In [36]:
print(answer_md)

# Why is there only one God?

## Analysis:

### Monotheism in the Context:
- The context discusses the concept of monotheism, emphasizing the belief in one God.
- It explains the reasons behind the development of polytheism and the basis of monotheistic beliefs.
- The text highlights the importance of worshiping and relying solely on God, rejecting the worship of multiple deities.

### Key Components to Address:
1. **Development of Polytheism:**
   - Causes for the emergence of polytheism.
   - Basis of polytheism and why assuming multiple gods is not valid.

2. **Monotheistic Beliefs:**
   - Why the belief in one God is essential.
   - Criticism of the idea of multiple gods creating and governing the universe.

## Explanation:

### Development of Polytheism:
1. **Causes for Polytheism:**
   - Polytheism arose from human tendencies towards tangible representations of divinity.
   - Idols and symbols were created for worship, leading to the belief in multiple gods.
   - Egotistic intere