## Setup and Import Libraries

In [2]:
import os
import pandas as pd
import numpy as np
import nest_asyncio
from llama_index.core.schema import Document
from llama_index.llms.openai import OpenAI
from llama_index.core import (
    SimpleDirectoryReader, VectorStoreIndex, Settings, 
    StorageContext, load_index_from_storage
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine.retriever_query_engine import RetrieverQueryEngine
from llama_index.core.response.notebook_utils import display_response

from trulens_eval.feedback import GroundTruthAgreement
from trulens_eval import TruLlama, Tru
from trulens_eval import Feedback, FeedbackMode
from trulens_eval import OpenAI as fOpenAI
from utils import build_sentence_window_index, get_sentence_window_query_engine, get_prebuilt_trulens_recorder
from copy import deepcopy
from dotenv import load_dotenv

import warnings
warnings.filterwarnings('ignore')

In [3]:
load_dotenv()

True

In [4]:
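# load_dotenv() has already populated os.environ; fail early with a clear message if a key is missing.
for key in ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY"):
    if not os.getenv(key):
        raise EnvironmentError(f"{key} is not set; add it to your .env file")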

In [5]:
documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

In [6]:
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

<class 'list'> 

41 

<class 'llama_index.core.schema.Document'>
Doc ID: b53d0cdb-4fe4-4390-af33-5e9c25758015
Text: PAGE 1 Founder, DeepLearning.AI Collected Insights from Andrew
Ng How to  Build Your Career in AI A Simple Guide


In [7]:
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [8]:
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

## Auto-Merging Retrieval Setup

In [9]:
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)

In [10]:
nodes = node_parser.get_nodes_from_documents([document])

In [11]:
leaf_nodes = get_leaf_nodes(nodes)
print(leaf_nodes[30].text)

Of course, I also encourage learning driven by curiosity. If something interests you, go ahead 
and learn it regardless of how useful it might turn out to be!  Maybe this will lead to a creative 
spark or technical breakthrough.
How much math do you need to know to be a machine learning engineer?


In [12]:
nodes_by_id = {node.node_id: node for node in nodes}

parent_node = nodes_by_id[leaf_nodes[30].parent_node.node_id]
print(parent_node.text)

On some days, maybe you’ll end up studying for an 
hour or longer.

PAGE 12
Should You 
Learn Math to 
Get a Job in AI? 
CHAPTER 3
LEARNING

PAGE 13
Should you Learn Math to Get a Job in AI? CHAPTER 3
Is math a foundational skill for AI? It’s always nice to know more math! But there’s so much to 
learn that, realistically, it’s necessary to prioritize. Here’s how you might go about strengthening 
your math background.
To figure out what’s important to know, I find it useful to ask what you need to know to make 
the decisions required for the work you want to do. At DeepLearning.AI, we frequently ask, 
“What does someone need to know to accomplish their goals?” The goal might be building a 
machine learning model, architecting a system, or passing a job interview.
Understanding the math behind algorithms you use is often helpful, since it enables you to 
debug them. But the depth of knowledge that’s useful changes over time. As machine learning 
techniques mature and become more reliabl
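
The leaf/parent relationship shown above can be illustrated with a small, library-free sketch. It is not how `HierarchicalNodeParser` is implemented (the real parser works on tokens and respects sentence boundaries); it simply splits a string into nested character chunks and records parent/child links. `ToyNode`, `build_hierarchy`, and `toy_leaf_nodes` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional
import itertools


@dataclass
class ToyNode:
    node_id: int
    text: str
    parent_id: Optional[int] = None
    child_ids: list = field(default_factory=list)


def build_hierarchy(text, chunk_sizes):
    """Split `text` into nested character chunks, largest size first."""
    counter = itertools.count()
    nodes = {}

    def split(segment, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            piece = segment[start:start + size]
            node = ToyNode(next(counter), piece, parent_id)
            nodes[node.node_id] = node
            if parent_id is not None:
                nodes[parent_id].child_ids.append(node.node_id)
            # Recurse until the smallest chunk size has been applied.
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node.node_id)

    split(text, 0, None)
    return nodes


def toy_leaf_nodes(nodes):
    """Leaves are the nodes that have no children."""
    return [n for n in nodes.values() if not n.child_ids]


# Toy sizes chosen so the tree is easy to count by hand (assumption, not the notebook's sizes).
toy_nodes = build_hierarchy("x" * 64, chunk_sizes=[32, 8, 2])
toy_leaves = toy_leaf_nodes(toy_nodes)
print(len(toy_nodes), len(toy_leaves))
```

Only the leaves are embedded and indexed in the notebook; the larger ancestors stay in the docstore so they can be swapped in at retrieval time.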

## Building the Index

In [13]:
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

In [14]:
Settings.node_parser = node_parser

In [16]:
if not os.path.exists("./merging_index"):
    storage_context = StorageContext.from_defaults()
    storage_context.docstore.add_documents(nodes)

    automerging_index = VectorStoreIndex(
        leaf_nodes,
        storage_context=storage_context,
        embed_model=Settings.embed_model,
        node_parser=Settings.node_parser
    )
    
    automerging_index.storage_context.persist(persist_dir="./merging_index")
else:
    storage_context = StorageContext.from_defaults(persist_dir="./merging_index")
    automerging_index = load_index_from_storage(storage_context)

Loading llama_index.core.storage.kvstore.simple_kvstore from ./merging_index\docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./merging_index\index_store.json.


## Define Retriever

In [17]:
automerging_retriever = automerging_index.as_retriever(
    similarity_top_k=12
)

In [18]:
retriever = AutoMergingRetriever(
    automerging_retriever, 
    automerging_index.storage_context, 
    verbose=True
)
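
As a rough mental model of what the auto-merging step does, the sketch below replaces retrieved sibling nodes with their parent whenever a large enough fraction of that parent's children was retrieved, and repeats until nothing changes. This is a simplified illustration, not the library's code: `auto_merge`, the dictionary-based tree, and the `0.5` threshold are assumptions made for the example.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, ratio_thresh=0.5):
    """Swap groups of retrieved siblings for their parent when enough were hit."""
    merged = list(retrieved_ids)
    changed = True
    while changed:
        changed = False
        # Group the currently selected nodes by their parent.
        hits_by_parent = defaultdict(list)
        for node_id in merged:
            parent = parent_of.get(node_id)
            if parent is not None:
                hits_by_parent[parent].append(node_id)
        for parent, hits in hits_by_parent.items():
            # Merge only when the retrieved share of children exceeds the threshold.
            if len(hits) / len(children_of[parent]) > ratio_thresh:
                merged = [n for n in merged if n not in hits]
                if parent not in merged:
                    merged.append(parent)
                changed = True
    return merged


# Toy two-level tree: root R -> parents A, B -> four leaves each.
children_of = {
    "R": ["A", "B"],
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

partial = auto_merge(["a1", "a2", "a3", "b1"], parent_of, children_of)
full = auto_merge(["a1", "a2", "a3", "b1", "b2", "b3"], parent_of, children_of)
print(partial, full)
```

In the first call only `A` crosses the threshold, so `b1` stays as a leaf; in the second call both parents are merged and then merged again into the root.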

In [19]:
rerank = SentenceTransformerRerank(top_n=6, model="BAAI/bge-reranker-base")
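
The retrieve-then-rerank pattern used here (fetch `similarity_top_k` candidates, keep the best `top_n` after rescoring) can be sketched without any model. The scoring function below is a toy word-overlap measure standing in for the cross-encoder reranker; `toy_score` and `rerank_top_n` are hypothetical helpers, not part of LlamaIndex.

```python
def toy_score(query, passage):
    """Fraction of query words that also appear in the passage (stand-in for a reranker score)."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank_top_n(query, passages, top_n):
    """Rescore every candidate against the query and keep the `top_n` highest."""
    ranked = sorted(passages, key=lambda p: toy_score(query, p), reverse=True)
    return ranked[:top_n]


candidates = [
    "math helps you debug algorithms",
    "networking builds a professional community",
    "networking in ai opens career opportunities",
]
best = rerank_top_n("networking in ai", candidates, top_n=2)
print(best)
```

Retrieving more candidates than are finally kept gives the reranker room to promote passages that the embedding similarity ranked low.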

## Define Query Engine

In [20]:
# Pass the AutoMergingRetriever (not the base vector retriever) so that merging is actually applied.
auto_merging_engine = RetrieverQueryEngine.from_args(
    retriever, node_postprocessors=[rerank]
)

In [21]:
auto_merging_response = auto_merging_engine.query(
    "What is the importance of networking in AI?"
)

In [22]:
display_response(auto_merging_response)

**`Final Response:`** Networking in AI is crucial as it helps in building a strong professional community and support system. By connecting with others in the field, individuals can gain valuable insights, advice, and opportunities that can propel their careers forward. Additionally, networking allows for collaboration, influence, and the exchange of ideas, which are essential in a rapidly evolving field like AI.

## Putting It All Together

In [26]:
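def build_automerging_index(
    documents, llm, embed_model="BAAI/bge-small-en-v1.5",
    save_dir="merging_index", chunk_sizes=None,
):
    # `llm` is kept for interface compatibility; building the index only needs embeddings.
    # Honor the `embed_model` argument instead of silently using Settings.embed_model.
    if isinstance(embed_model, str):
        embed_model = HuggingFaceEmbedding(model_name=embed_model)

    if not os.path.exists(save_dir):
        # Parse into a hierarchy only when the index has to be built from scratch.
        chunk_sizes = chunk_sizes or [2048, 512, 128]
        node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
        nodes = node_parser.get_nodes_from_documents(documents)
        leaf_nodes = get_leaf_nodes(nodes)

        # Store every node (parents included) so the retriever can look up parents later.
        storage_context = StorageContext.from_defaults()
        storage_context.docstore.add_documents(nodes)

        # Only the leaf nodes are embedded and indexed.
        automerging_index = VectorStoreIndex(
            leaf_nodes,
            storage_context=storage_context,
            embed_model=embed_model,
        )
        automerging_index.storage_context.persist(persist_dir=save_dir)
    else:
        storage_context = StorageContext.from_defaults(persist_dir=save_dir)
        automerging_index = load_index_from_storage(
            storage_context, embed_model=embed_model
        )

    return automerging_index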

In [27]:
def get_automerging_query_engine(
    automerging_index, similarity_top_k=12,
    rerank_top_n=6,
):
    base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
    retriever = AutoMergingRetriever(
        base_retriever, automerging_index.storage_context, verbose=True
    )
    
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )
    
    auto_merging_engine = RetrieverQueryEngine.from_args(
        retriever, node_postprocessors=[rerank]
    )
    
    return auto_merging_engine

In [28]:
index = build_automerging_index(
    [document],
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    save_dir="./merging_index",
)

Loading llama_index.core.storage.kvstore.simple_kvstore from ./merging_index\docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from ./merging_index\index_store.json.


In [29]:
query_engine = get_automerging_query_engine(index, similarity_top_k=6)

## TruLens Evaluation

In [35]:
eval_questions = []
with open('generated_questions/generated_questions.text', 'r') as file:
    for line in file:
        # Strip the trailing newline and surrounding whitespace
        item = line.strip()
        eval_questions.append(item)

In [36]:
def run_evals(eval_questions, tru_recorder, query_engine):
    for question in eval_questions:
        with tru_recorder as recording:
            response = query_engine.query(question)

In [37]:
Tru().reset_database()

Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]


### Two layers

In [38]:
auto_merging_index_0 = build_automerging_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="BAAI/bge-small-en-v1.5",
    save_dir="merging_index_0",
    chunk_sizes=[2048,512],
)

Loading llama_index.core.storage.kvstore.simple_kvstore from merging_index_0\docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from merging_index_0\index_store.json.


In [39]:
auto_merging_engine_0 = get_automerging_query_engine(
    auto_merging_index_0,
    similarity_top_k=12,
    rerank_top_n=6,
)

In [40]:
tru_recorder = get_prebuilt_trulens_recorder(
    auto_merging_engine_0,
    app_id ='app_0'
)

instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.embeddings.multi_modal_base.MultiModalEmbedding'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.base.embeddings.base.BaseEmbedding'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.schema.TransformComponent'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.schema.BaseComponent'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'pydantic.main.BaseModel'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base

In [41]:
run_evals(eval_questions, tru_recorder, auto_merging_engine_0)

> Merging 2 nodes into parent node.
> Parent node id: e3f5c9f4-c409-4f64-858e-7be67de22a96.
> Parent node text: PAGE 20
Working on projects requires making tough choices about what to build and how to go 
abou...

> Merging 1 nodes into parent node.
> Parent node id: 1afef9d1-9a84-49b0-8249-210b770884c2.
> Parent node text: PAGE 7
These phases apply in a wide 
range of professions, but AI 
involves unique elements.
For ...

> Merging 1 nodes into parent node.
> Parent node id: 338fbbf6-ba03-4af0-8595-202730001dd4.
> Parent node text: PAGE 15
One of the most important skills of an AI architect is the ability to identify ideas that...

> Merging 1 nodes into parent node.
> Parent node id: ced4f2af-8d42-48ba-aa99-cc29516fbf2f.
> Parent node text: PAGE 22
Over the course of a career, you’re likely to work on projects in succession, each growin...

> Merging 1 nodes into parent node.
> Parent node id: e4124c3b-91d5-45da-b44d-f2809ea02620.
> Parent node text: PAGE 18
It goes without saying t

In [42]:
Tru().get_leaderboard(app_ids=[])

| app_name | app_version | Answer Relevance | latency | total_cost |
|---|---|---|---|---|
| app_0 | base | 1.0 | 6.17851 | 0.004783 |


In [44]:
# Tru().run_dashboard()

### Three layers

In [45]:
auto_merging_index_1 = build_automerging_index(
    documents,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),
    embed_model="BAAI/bge-small-en-v1.5",
    save_dir="merging_index_1",
    chunk_sizes=[2048,512,128],
)

In [46]:
auto_merging_engine_1 = get_automerging_query_engine(
    auto_merging_index_1,
    similarity_top_k=12,
    rerank_top_n=6,
)

In [47]:
tru_recorder = get_prebuilt_trulens_recorder(
    auto_merging_engine_1,
    app_id ='app_1'
)

instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.embeddings.multi_modal_base.MultiModalEmbedding'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.base.embeddings.base.BaseEmbedding'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.schema.TransformComponent'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'llama_index.core.schema.BaseComponent'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base <class 'pydantic.main.BaseModel'>
instrumenting <class 'llama_index.embeddings.huggingface.base.HuggingFaceEmbedding'> for base

In [48]:
run_evals(eval_questions, tru_recorder, auto_merging_engine_1)

> Merging 4 nodes into parent node.
> Parent node id: d727ef1e-d83d-40cb-bdc2-aba030c0a29e.
> Parent node text: PAGE 20
Working on projects requires making tough choices about what to build and how to go 
abou...

> Merging 2 nodes into parent node.
> Parent node id: e90ceb02-069e-441c-a58a-cb1e95eb32f7.
> Parent node text: But when committing to a direction means making a costly investment or entering a one-
way door (...

> Merging 2 nodes into parent node.
> Parent node id: 53a34ee8-5897-403c-84d2-5cb3acf7937b.
> Parent node text: PAGE 20
Working on projects requires making tough choices about what to build and how to go 
abou...



In [49]:
Tru().get_leaderboard(app_ids=[])

| app_name | app_version | Answer Relevance | Context Relevance | latency | total_cost |
|---|---|---|---|---|---|
| app_0 | base | 1.0 | 0.166667 | 6.17851 | 0.004783 |
| app_1 | base | 1.0 | | 4.997449 | 0.002347 |


In [51]:
# Tru().run_dashboard()