# Advanced Evaluation with TruEra

Using TruLens, we can create fine-grained evaluations for complex queries.

Using the sub-question sentence-window engine from previous notebooks, let's create a custom evaluation framework.

In [1]:
#!pip install trulens-eval==0.12.0 llama-index==0.8.29post1 sentence-transformers transformers pypdf

## Query Engine Construction

In [2]:
import os
import openai

os.environ["OPENAI_API_KEY"] = "..."
openai.api_key = os.environ["OPENAI_API_KEY"]

In [3]:
!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf


In [4]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

In [5]:
# Merge into a single large document rather than one document per page
from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

In [6]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceWindowNodeParser

# Create the sentence-window node parser with its default settings (written out explicitly)
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
sentence_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    node_parser=node_parser,
)
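
To make the sentence-window idea concrete, here is a minimal pure-Python sketch of what such a parser produces: one node per sentence, with the surrounding sentences stored in metadata. It is an illustration only, not the `SentenceWindowNodeParser` implementation; the helper name `build_sentence_windows` and the naive regex splitter are assumptions made for this example.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Return one dict per sentence, each carrying a window of neighbours."""
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Take up to `window_size` sentences on each side of sentence i.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "original_text": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return nodes


nodes = build_sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
for node in nodes:
    print(node["text"], "|", node["metadata"]["window"])
```

Only the single sentence is embedded and matched at retrieval time; the wider window is kept alongside it so that it can be handed to the LLM later.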

In [7]:
from llama_index import VectorStoreIndex

sentence_index = VectorStoreIndex.from_documents(
    [document], service_context=sentence_context
)

In [8]:
from llama_index.indices.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

sentence_window_engine = sentence_index.as_query_engine(
    similarity_top_k=6,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base"),
    ],
)
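
The two postprocessors above run in order: each retrieved sentence is first swapped for its stored window, then the candidates are reranked and cut down to `top_n`. The sketch below mimics that flow on plain dicts. Both helper names are hypothetical, and the word-overlap scorer is only a stand-in for the cross-encoder model that a real reranker would use.

```python
def replace_with_window(nodes, target_metadata_key="window"):
    """Swap each node's text for the value stored under the metadata key, if present."""
    return [
        {**node, "text": node["metadata"].get(target_metadata_key, node["text"])}
        for node in nodes
    ]


def rerank(query, nodes, top_n=2):
    """Keep the top_n nodes by a toy score (stand-in for a cross-encoder)."""
    query_words = set(query.lower().split())

    def score(node):
        # Count how many query words appear in the node text.
        return len(query_words & set(node["text"].lower().split()))

    return sorted(nodes, key=score, reverse=True)[:top_n]


retrieved = [
    {"text": "Reefs decline.", "metadata": {"window": "Coral reefs decline as oceans warm."}},
    {"text": "Fisheries vary.", "metadata": {"window": "Fisheries yields vary by decade."}},
    {"text": "Tides rise.", "metadata": {}},
]
expanded = replace_with_window(retrieved)
top = rerank("coral reefs oceans", expanded, top_n=2)
```

Retrieving six candidates and keeping two after reranking trades a little extra compute for a smaller, more relevant context.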

In [9]:
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

sentence_sub_engine = SubQuestionQueryEngine.from_defaults(
  [QueryEngineTool(
    query_engine=sentence_window_engine,
    metadata=ToolMetadata(name="climate_report", description="Climate Report on Oceans.")
  )],
  service_context=sentence_context,
  verbose=False,
)

In [10]:
import nest_asyncio
nest_asyncio.apply()

## Custom Eval Functions

In [35]:
from trulens_eval import Feedback, FeedbackMode, Select, Tru, TruLlama, feedback

tru = Tru()

tru.reset_database()

Deleted 130 rows.
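
The feedback functions defined in the next cell are provider built-ins. A fully custom evaluation could instead be an ordinary Python function that returns a score between 0 and 1. The sketch below is one hypothetical example, `term_overlap`, which measures how many of the question's longer words reappear in the answer. Wrapping it as `Feedback(term_overlap).on_input_output()` is an assumption about how `trulens_eval` accepts plain callables, so that line is left as a comment.

```python
def term_overlap(question: str, response: str) -> float:
    """Fraction of the question's longer words that also appear in the response."""
    question_terms = {word.strip(".,?!").lower() for word in question.split()}
    # Ignore short words such as "how" or "do" (crude stop-word filter).
    question_terms = {term for term in question_terms if len(term) > 3}
    if not question_terms:
        return 0.0
    response_terms = {word.strip(".,?!").lower() for word in response.split()}
    return len(question_terms & response_terms) / len(question_terms)


score = term_overlap(
    "How do coral reefs protect coasts?",
    "Coral reefs buffer coasts from waves.",
)
print(score)

# Assumed usage with trulens_eval (not verified here):
# f_term_overlap = Feedback(term_overlap).on_input_output()
```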


In [36]:
import numpy as np

# Initialize the OpenAI-based feedback provider.
# Named `provider` so it does not shadow the `openai` module imported earlier.
provider = feedback.OpenAI()

# Helpfulness of the final answer.
f_helpfulness = Feedback(provider.helpfulness).on_output()

# Question/answer relevance between the overall question and the final answer.
f_qa_relevance = Feedback(provider.relevance).on_input_output()

# Question/statement relevance between the question and each context chunk.
# Sub-question calls expose their context under a different path in the record,
# so that feedback is defined separately.
f_qs_relevance_subquestions = (
    Feedback(provider.qs_relevance)
    .on_input()
    .on(Select.Record.calls[0].rets.source_nodes[:].node.text)
    .aggregate(np.mean)
)

f_qs_relevance = (
    Feedback(provider.qs_relevance)
    .on_input()
    .on(Select.Record.calls[0].args.prompt_args.context_str)
    .aggregate(np.mean)
)

✅ In helpfulness, input text will be set to *.__record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to *.__record__.main_output or `Select.RecordOutput` .
✅ In Subquestion Context Relevance, input question will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Subquestion Context Relevance, input statement will be set to *.__record__.calls[0].rets.source_nodes[:].node.text .
✅ In Context Relevance, input question will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to *.__record__.calls[0].args.prompt_args.context_str .
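
A selector such as `calls[0].rets.source_nodes[:].node.text` can be read as a path into the recorded call tree: the `[:]` step fans out over every source node, the feedback function scores each chunk, and the aggregator folds the scores into one number. The toy below walks a hand-written dict in the same way. The record layout, `toy_relevance`, and the use of `statistics.mean` in place of `np.mean` are all assumptions for illustration.

```python
from statistics import mean

# Hand-written stand-in for a recorded call (not a real TruLens record).
record = {
    "calls": [
        {
            "rets": {
                "source_nodes": [
                    {"node": {"text": "ocean warming alters carbon uptake"}},
                    {"node": {"text": "how to fold a pocket square"}},
                ]
            }
        }
    ]
}


def select_chunk_texts(rec):
    """Mirror the path calls[0].rets.source_nodes[:].node.text on a plain dict."""
    return [source["node"]["text"] for source in rec["calls"][0]["rets"]["source_nodes"]]


def toy_relevance(question, statement):
    """Hypothetical scorer: 1.0 if any question word occurs in the statement, else 0.0."""
    question_words = set(question.lower().split())
    return 1.0 if question_words & set(statement.lower().split()) else 0.0


def aggregate_relevance(question, rec, aggregate=mean):
    """Score every selected chunk, then reduce the scores to a single value."""
    scores = [toy_relevance(question, text) for text in select_chunk_texts(rec)]
    return aggregate(scores)


result = aggregate_relevance("ocean carbon", record)
print(result)
```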


In [37]:
# Run the recorder in deferred mode so that all sub-questions are logged before evaluation starts.
# This gives smoother handling of the evals and more consistent logging at high volume.
# Of the two qs_relevance definitions, only the one whose selector resolves for a given record produces a score.
tru_recorder = TruLlama(
    sentence_sub_engine,
    app_id="CustomEvalTest",
    feedbacks=[f_qa_relevance, f_qs_relevance, f_qs_relevance_subquestions, f_helpfulness],
    feedback_mode=FeedbackMode.DEFERRED
)
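
The idea of registering two context-relevance definitions and letting only the applicable one produce a score can be sketched without TruLens. In the toy below, records are logged first and evaluated afterwards, and a feedback whose selector does not resolve on a record is skipped. The function `run_deferred` and the record shapes are hypothetical; the real evaluator's behaviour may differ in detail.

```python
def run_deferred(records, feedbacks):
    """Evaluate already-logged records, skipping feedbacks whose selector fails."""
    results = []
    for record in records:
        row = {}
        for name, (selector, score_fn) in feedbacks.items():
            try:
                selected = selector(record)
            except (KeyError, IndexError):
                # The path does not exist in this record, so this feedback does not apply.
                continue
            row[name] = score_fn(selected)
        results.append(row)
    return results


# Two logged records with context stored in different places (assumed shapes).
logged = [
    {"context_str": "ocean text"},
    {"source_nodes": ["chunk a", "chunk b"]},
]
feedbacks = {
    "context": (lambda rec: rec["context_str"], len),
    "subquestion_context": (lambda rec: rec["source_nodes"], len),
}
results = run_deferred(logged, feedbacks)
print(results)
```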

In [38]:
# subset of questions from the "Advanced RAG" notebook
questions = [
  "Based on the provided text, discuss the impact of human activities on the natural carbon dynamics of estuaries, shelf seas, and other intertidal and shallow-water habitats. Provide examples from the text to support your answer.",
  "Analyze the combined effects of exploitation and multi-decadal climate fluctuations on global fisheries yields. How do these factors make it difficult to assess the impacts of global climate change on fisheries yields? Use specific examples from the text to support your analysis.",
  "Based on the study by Gutiérrez-Rodríguez, A.G., et al., 2018, what potential benefits do seaweeds have in the field of medicine, specifically in relation to cancer treatment?",
  "According to the research conducted by Haasnoot, M., et al., 2020, how does the uncertainty in Antarctic mass-loss impact the coastal adaptation strategy of the Netherlands?",
  "Based on the context, explain how the decline in warm water coral reefs is projected to impact the services they provide to society, particularly in terms of coastal protection.",
  "Tell me something about the intricacies of tying a tie.",
]

In [40]:
for question in questions:
  with tru_recorder as recording:
    sentence_sub_engine.query(question)

In [41]:
tru.start_evaluator()

<Thread(Thread-77 (runloop), started 14354604032)>

⚡ Feedback task starting: relevance for app CustomEvalTest, record record_hash_1588b559040ebb69e2d6366147df0fb9
⚡ Feedback task starting: qs_relevance for app CustomEvalTest, record record_hash_1588b559040ebb69e2d6366147df0fb9
⚡ Feedback task starting: qs_relevance for app CustomEvalTest, record record_hash_1588b559040ebb69e2d6366147df0fb9
⚡ Feedback task starting: helpfulness for app CustomEvalTest, record record_hash_1588b559040ebb69e2d6366147df0fb9
⚡ Feedback task starting: relevance for app CustomEvalTest, record record_hash_9e0615cec3fbceb8a75651022098dbae
⚡ Feedback task starting: qs_relevance for app CustomEvalTest, record record_hash_9e0615cec3fbceb8a75651022098dbae
⚡ Feedback task starting: qs_relevance for app CustomEvalTest, record record_hash_9e0615cec3fbceb8a75651022098dbae
Could not locate *.calls[0].args.prompt_args.context_str in app/record.
⚡ Feedback task starting: helpfulness for app CustomEvalTest, record record_hash_9e0615cec3fbceb8a75651022098dbae
⚡ Feedback task 

1-10 rating regex failed to match on: 'I apologize, but as an AI language model, I am unable to assess the helpfulness, insightfulness, and appropriateness of a submission.'
1-10 rating regex failed to match on: 'I apologize for the confusion. Since there is no specific submission mentioned, I cannot provide a rating.'
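
The two warnings above show the grader model replying with an apology instead of a number, so no rating could be extracted. A defensive parser would return `None` in that case rather than raise. The sketch below is a hypothetical `parse_rating` helper and not the regex TruLens uses internally.

```python
import re
from typing import Optional


def parse_rating(text: str) -> Optional[float]:
    """Extract the first standalone integer from 1 to 10 and scale it to 0.1-1.0."""
    match = re.search(r"\b(10|[1-9])\b", text)
    if match is None:
        # No usable rating, e.g. the model refused or apologised.
        return None
    return int(match.group(1)) / 10


print(parse_rating("Rating: 8"))
print(parse_rating("I apologize, but as an AI language model, I am unable to assess this."))
```

Treating a missing rating as `None` keeps refusals out of the aggregate instead of silently counting them as a low score.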


In [None]:
# launches on http://localhost:8501/
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
npx: installed 22 in 3.398s

Go to this url and submit the ip given here. your url is: https://whole-tires-travel.loca.lt

  Submit this IP Address: 35.230.82.227



<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>