# 📓 LlamaIndex Quickstart

In this quickstart you will create a simple LlamaIndex app and learn how to log it and get feedback on an LLM response.

You'll also learn how to use feedback functions as guardrails by filtering the retrieved context.

For evaluation, we will leverage the RAG triad of groundedness, context relevance and answer relevance.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/trulens_eval/examples/quickstart/llama_index_quickstart.ipynb)

## Setup

### Install dependencies
Let's install the dependencies for this notebook if we don't have them already.

In [None]:
# pip install trulens_eval llama_index openai

### Add API keys
For this quickstart, you will need an OpenAI key. The OpenAI key is used for embeddings, completion and evaluation.

In [1]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."

### Import from TruLens

In [2]:
from trulens_eval import Tru
tru = Tru()
tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


### Download data

This example uses the text of Paul Graham’s essay, [“What I Worked On”](https://paulgraham.com/worked.html), and is the canonical llama-index example.

The easiest way to get it is to [download it via this link](https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt) and save it in a folder called data. You can do so with the following command:

In [3]:
import os
import urllib.request

url = "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
file_path = 'data/paul_graham_essay.txt'

if not os.path.exists('data'):
    os.makedirs('data')

if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)


### Create Simple LLM Application

This example uses LlamaIndex which internally uses an OpenAI LLM.

In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.chunk_size = 128
Settings.chunk_overlap = 16
Settings.llm = OpenAI()

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=3)

### Send your first request

In [5]:
response = query_engine.query("What did the author do growing up?")
print(response)

The author worked on writing and programming outside of school before college.


## Initialize Feedback Function(s)

In [6]:
from trulens_eval.feedback.provider import OpenAI
from trulens_eval import Feedback
import numpy as np

# Initialize provider class
provider = OpenAI()

# select context to be used in feedback. the location of context is app specific.
from trulens_eval.app import App
context = App.select_context(query_engine)

# Define a groundedness feedback function
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name = "Groundedness")
    .on(context.collect()) # collect context chunks into a list
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name = "Answer Relevance")
    .on_input_output()
)
# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name = "Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.query.rets.source_nodes[:].node.text .
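
The context relevance feedback above is evaluated once per retrieved chunk, and `.aggregate(np.mean)` reduces those per-chunk scores to a single value for the record. A minimal sketch of that reduction follows; the scores are made up for illustration, the helper name `aggregate_context_relevance` is hypothetical, and it uses the standard library's `statistics.mean` in place of `np.mean` so that it runs without extra dependencies.

```python
from statistics import mean


def aggregate_context_relevance(chunk_scores):
    """Average per-chunk relevance scores into one record-level score."""
    if not chunk_scores:
        raise ValueError("no context chunks were scored")
    return mean(chunk_scores)


# Hypothetical scores for three retrieved chunks (illustrative values only).
example_scores = [0.8, 0.2, 0.2]
record_score = aggregate_context_relevance(example_scores)
print(round(record_score, 2))
```

A single highly relevant chunk can be outweighed by several irrelevant ones, which is why a low average is a hint that the retriever is returning noise.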


## Instrument app for logging with TruLens

In [7]:
from trulens_eval import TruLlama
tru_query_engine_recorder = TruLlama(query_engine,
    app_id='LlamaIndex_App1',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance])

In [8]:
# Use the recorder as a context manager to log the query and run feedback
with tru_query_engine_recorder as recording:
    query_engine.query("What did the author do growing up?")

## Use guardrails

Feedback results are useful for more than guiding iteration on an app: we can also use them directly as guardrails at inference time. Here we show how to use the context relevance score as a guardrail that filters out irrelevant context before it is passed to the LLM. This can both reduce hallucination and improve efficiency.

Below, you can see the TruLens feedback display of the context relevance of each chunk retrieved by our RAG app.

In [9]:
last_record = recording.records[-1]

from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, "Context Relevance")

Wouldn't it be great if we could automatically filter out context chunks with relevance scores below 0.5?

We can do so with the TruLens guardrail `WithFeedbackFilterNodes`. All we have to do is wrap the original query engine with it, passing in the feedback function and threshold we want to use, to create a new filtered query engine.

In [10]:
from trulens_eval.guardrails.llama import WithFeedbackFilterNodes

# note: feedback function used for guardrail must only return a score, not also reasons
f_context_relevance_score = Feedback(provider.context_relevance)

filtered_query_engine = WithFeedbackFilterNodes(query_engine, feedback=f_context_relevance_score, threshold=0.5)
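
Conceptually, the guardrail scores each retrieved chunk against the question and drops the ones that fall below the threshold. The sketch below illustrates that idea in plain Python. It is not the `WithFeedbackFilterNodes` implementation: `filter_by_relevance` and `toy_score` are hypothetical names, the word-overlap score is a stand-in for the LLM-based `provider.context_relevance`, and keeping chunks that score exactly at the threshold is an assumption.

```python
def filter_by_relevance(question, chunks, score_fn, threshold=0.5):
    """Keep only chunks whose score is at or above the threshold (assumed inclusive)."""
    return [chunk for chunk in chunks if score_fn(question, chunk) >= threshold]


def toy_score(question, chunk):
    # Hypothetical stand-in for an LLM relevance score:
    # the fraction of question words that also appear in the chunk.
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)


question = "author writing programming"
chunks = [
    "the author spent time writing stories and programming",
    "the weather in florence was cold",
]
kept = filter_by_relevance(question, chunks, toy_score, threshold=0.5)
print(kept)
```

Because every chunk has to be scored before the LLM is called, a guardrail like this trades extra evaluation calls for a cleaner prompt.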

Then we can operate as normal:

In [11]:
tru_recorder = TruLlama(filtered_query_engine,
    app_id='LlamaIndex_App1_Filtered',
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness])

with tru_recorder as recording:
    llm_response = filtered_query_engine.query("What did the author do growing up?")

display(llm_response)

Response(response='The author focused on writing and programming outside of school before college. Specifically, the author wrote short stories, which were described as having characters with strong feelings but lacking in plot.', source_nodes=[NodeWithScore(node=TextNode(id_='a98829e7-c59e-4906-9ec8-d1a84ab231e4', embedding=None, metadata={'file_path': '/Users/jreini/Desktop/development/trulens/trulens_eval/examples/quickstart/data/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2024-07-03', 'last_modified_date': '2024-07-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='01d1924b-a1ae-4a1b-a728-02c0d2076cdd', node_type=<Obje

## See the power of context filters!

If we inspect the context relevance of our retrieval now, we see only relevant context chunks!

In [12]:
last_record = recording.records[-1]

from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, "Context Relevance")

In [13]:
tru.get_leaderboard()

| app_id | Groundedness | Context Relevance | Answer Relevance | latency | total_cost |
| --- | --- | --- | --- | --- | --- |
| LlamaIndex_App1_Filtered | 1.0 | 0.8 | 0.8 | 1.0 | 0.005268 |
| LlamaIndex_App1 | 0.8 | 0.4 | 0.8 | 1.0 | 0.000713 |


## Retrieve records and feedback

In [14]:
# The record of the app invocation can be retrieved from the `recording`:

rec = recording.get() # use .get if only one record
# recs = recording.records # use .records if multiple

display(rec)

Record(record_id='record_hash_9f960b879e7fbb4c48a58b1cdfb87b3f', app_id='LlamaIndex_App1_Filtered', cost=Cost(n_requests=5, n_successful_requests=15, n_classes=0, n_tokens=3537, n_stream_chunks=0, n_prompt_tokens=3493, n_completion_tokens=44, cost=0.005267500000000001), perf=Perf(start_time=datetime.datetime(2024, 7, 3, 5, 47, 20, 778856), end_time=datetime.datetime(2024, 7, 3, 5, 47, 22, 824169)), ts=datetime.datetime(2024, 7, 3, 5, 47, 22, 824425), tags='-', meta=None, main_input='What did the author do growing up?', main_output='The author focused on writing and programming outside of school before college. Specifically, the author wrote short stories, which were described as having characters with strong feelings but lacking in plot.', main_error=None, calls=[RecordAppCall(call_id='74117bff-ce0f-4fd8-9f34-d299e7a8d80f', stack=[RecordAppCallMethod(path=Lens().app, method=Method(obj=Obj(cls=trulens_eval.guardrails.llama.WithFeedbackFilterNodes, id=14295036320, init_bindings=None), na

In [None]:
tru.run_dashboard()

In [15]:
# The results of the feedback functions can be retrieved from
# `Record.feedback_results` or by using the `wait_for_feedback_results` method.
# If retrieved directly, the results are `Future` instances (see
# `concurrent.futures`). You can use `as_completed` to wait until they have
# finished evaluating, or use the utility method:

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)

# See more about wait_for_feedback_results:
# help(rec.wait_for_feedback_results)

Answer Relevance 0.8
Context Relevance 0.8
Groundedness 1.0
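
For readers unfamiliar with `concurrent.futures`, here is a minimal standard-library sketch of the `as_completed` pattern mentioned in the comment above. It does not touch TruLens: `slow_score` is a hypothetical stand-in for a feedback function, the delays are arbitrary, and the scores simply reuse the values printed above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_score(name, score, delay):
    # Hypothetical stand-in for a feedback function that waits on an LLM call.
    time.sleep(delay)
    return name, score


jobs = [
    ("Groundedness", 1.0, 0.03),
    ("Answer Relevance", 0.8, 0.01),
    ("Context Relevance", 0.8, 0.02),
]

results = {}
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(slow_score, *job) for job in jobs]
    # as_completed yields each future as soon as it finishes,
    # regardless of the order in which the jobs were submitted.
    for future in as_completed(futures):
        name, score = future.result()
        results[name] = score

print(results)
```

The utility method shown in the cell above saves you from writing this loop yourself when all you need is the finished results.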


In [16]:
records, feedback = tru.get_records_and_feedback(app_ids=[])

records.head()

|  | app_id | app_json | type | record_id | input | output | tags | record_json | cost_json | perf_json | ts | Groundedness | Answer Relevance | Context Relevance | Groundedness_calls | Answer Relevance_calls | Context Relevance_calls | latency | total_tokens | total_cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | LlamaIndex_App1 | `{"tru_class_info": {"name": "TruLlama", "modul...` | `RetrieverQueryEngine(llama_index.core.query_en...` | record_hash_d8d7b57e2f4927576f58e09f26469c42 | "What did the author do growing up?" | "The author worked on writing and programming ... | - | `{"record_id": "record_hash_d8d7b57e2f4927576f5...` | `{"n_requests": 2, "n_successful_requests": 3, ...` | `{"start_time": "2024-07-03T05:47:14.007165", "...` | 2024-07-03T05:47:15.029467 | 0.8 | 0.8 | 0.4 | `[{'args': {'source': ['I remember taking the b...` | `[{'args': {'prompt': 'What did the author do g...` | `[{'args': {'question': 'What did the author do...` | 1 | 487 | 0.000713 |
| 1 | LlamaIndex_App1_Filtered | `{"tru_class_info": {"name": "TruLlama", "modul...` | `WithFeedbackFilterNodes(trulens_eval.guardrail...` | record_hash_9f960b879e7fbb4c48a58b1cdfb87b3f | "What did the author do growing up?" | "The author focused on writing and programming... | - | `{"record_id": "record_hash_9f960b879e7fbb4c48a...` | `{"n_requests": 5, "n_successful_requests": 15,...` | `{"start_time": "2024-07-03T05:47:20.778856", "...` | 2024-07-03T05:47:22.824425 | 1.0 | 0.8 | 0.8 | `[{'args': {'source': ["What I Worked On\n\nFeb...` | `[{'args': {'prompt': 'What did the author do g...` | `[{'args': {'question': 'What did the author do...` | 1 | 3537 | 0.005268 |


In [17]:
tru.get_leaderboard(app_ids=[])

| app_id | Answer Relevance | Context Relevance | Groundedness | latency | total_cost |
| --- | --- | --- | --- | --- | --- |
| LlamaIndex_App1_Filtered | 0.8 | 0.8 | 1.0 | 2.0 | 0.005268 |
| LlamaIndex_App1 | 0.8 | 0.4 | 0.8 | 2.0 | 0.000713 |


## Explore in a Dashboard

In [None]:
tru.run_dashboard() # open a local streamlit app to explore

# tru.stop_dashboard() # stop if needed

Alternatively, you can run `trulens-eval` from a command line in the same folder to start the dashboard.