# Quickstart

In this quickstart you will create a simple LlamaIndex app and learn how to log it and get feedback on an LLM response.

## Setup

### Install dependencies
Install the dependencies for this notebook if you don't have them already.

In [1]:
%load_ext autoreload

In [2]:
#!pip install trulens-eval
#!pip install llama_index==0.6.31

### Add API keys
For this quickstart, you will need OpenAI and Hugging Face API keys.

In [3]:
#import os
#os.environ["OPENAI_API_KEY"] = "..."
#os.environ["HUGGINGFACE_API_KEY"] = "..."
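
Before running the cells below, it can help to confirm that both keys are visible to the Python process. A minimal sketch using only the standard library; the helper name `missing_keys` is hypothetical and is not part of TruLens:

```python
import os

# Key names taken from the cell above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HUGGINGFACE_API_KEY")

def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the required key names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.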

### Import from LlamaIndex and TruLens

In [4]:
# Imports main tools:
from trulens_eval import TruLlama, Feedback, Tru, feedback
tru = Tru()


No .env found in /home/shayak/code/trulens/trulens_eval/examples/frameworks/llama_index or its parents. You may need to specify secret keys in another manner.


### Create Simple LLM Application

This example uses LlamaIndex which internally uses an OpenAI LLM.

In [5]:
# LlamaIndex starter example from: https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html
# To run this, download Paul Graham's essay 'What I Worked On' into data/ from https://github.com/jerryjliu/llama_index/blob/main/examples/paul_graham_essay/data/paul_graham_essay.txt

from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

### Send your first request

In [6]:
#response = query_engine.query("What did the author do growing up?")
#print(response)

## Initialize Feedback Function(s)

In [7]:
import numpy as np

# Initialize the Hugging Face and OpenAI feedback function collection classes:
hugs = feedback.Huggingface()
openai = feedback.OpenAI()

# Define a language match feedback function using HuggingFace.
f_lang_match = Feedback(hugs.language_match).on_input_output()
# By default this will check language match on the main app input and main app
# output.

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(openai.relevance).on_input_output()

# Question/statement relevance between question and each context chunk.
f_qs_relevance = Feedback(openai.qs_relevance).on_input().on(
    TruLlama.select_source_nodes().node.text
).aggregate(np.min)

✅ In language_match, input text1 will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In language_match, input text2 will be set to *.__record__.main_output or `Select.RecordOutput` .
✅ In relevance, input prompt will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In relevance, input response will be set to *.__record__.main_output or `Select.RecordOutput` .
✅ In qs_relevance, input question will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In qs_relevance, input statement will be set to *.__record__.app.query.rets.source_nodes[:].node.text .
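
The `aggregate(np.min)` call above reduces the per-chunk relevance scores to a single value, so one irrelevant context chunk pulls the whole score down. A small illustration with made-up scores: the numbers are hypothetical, and the built-in `min` stands in for `np.min`, which reduces a flat list of floats the same way.

```python
from statistics import mean

# Hypothetical per-chunk relevance scores in the range 0 to 1.
chunk_scores = [0.9, 0.8, 0.2]

# Worst-case aggregation: the least relevant chunk determines the result.
worst_case = min(chunk_scores)

# Mean aggregation, shown only for comparison.
average = mean(chunk_scores)

print(f"min aggregate:  {worst_case:.2f}")
print(f"mean aggregate: {average:.2f}")
```

Choosing the minimum is the stricter option: it flags a record whenever any retrieved chunk is off-topic, whereas a mean can hide a single poor chunk among good ones.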


In [8]:
golden_set = [
    {"query": "What was the author's undergraduate major?", "response": "He didn't choose a major, and customized his courses."},
    {"query": "What company did the author start in 1995?", "response": "Viaweb, to make software for building online stores."},
    {"query": "Where did the author move in 1998 after selling Viaweb?", "response": "California, after Yahoo acquired Viaweb."},
    {"query": "What did the author do after leaving Yahoo in 1999?", "response": "He focused on painting and tried to improve his art skills."},
    {"query": "What program did the author start with Jessica Livingston in 2005?", "response": "Y Combinator, to provide seed funding for startups."}
]
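
Each entry pairs a query with the response you expect. As a rough sketch of how such a set can be consulted (an illustration only, not necessarily how the ground-truth feedback function looks up answers internally; `expected_response` and `golden_set_example` are hypothetical names):

```python
# A one-entry copy of the golden set format used above.
golden_set_example = [
    {"query": "What company did the author start in 1995?",
     "response": "Viaweb, to make software for building online stores."},
]

def expected_response(golden, query):
    """Return the reference response for an exactly matching query, else None."""
    for pair in golden:
        if pair["query"] == query:
            return pair["response"]
    return None

print(expected_response(golden_set_example, "What company did the author start in 1995?"))
print(expected_response(golden_set_example, "Where was the author born?"))
```

A query that is not in the set has no reference answer, so any agreement score for it has nothing to compare against.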



In [9]:
f_groundtruth = Feedback(openai.GroundTruthAgreement(golden_set).agreement_measure).on_input_output()

✅ In agreement_measure, input prompt will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In agreement_measure, input response will be set to *.__record__.main_output or `Select.RecordOutput` .


## Instrument the query engine for logging with TruLens

In [10]:
tru_query_engine = TruLlama(query_engine,
    app_id='LlamaIndex_App1',
    feedbacks=[f_lang_match, f_qa_relevance, f_qs_relevance, f_groundtruth])

✅ app LlamaIndex_App1 -> default.sqlite
✅ feedback def. feedback_definition_hash_893bf600540de189bca2d4c2d84731a4 -> default.sqlite
✅ feedback def. feedback_definition_hash_661a9eeb1b5c8d40fac0332aa866b848 -> default.sqlite
✅ feedback def. feedback_definition_hash_a7ff177399bdf82c0027aa7f2e4e3783 -> default.sqlite
✅ feedback def. feedback_definition_hash_a8adb28134a95ba7ff167af8b9832562 -> default.sqlite


In [11]:
# Instrumented query engine can operate like the original:
llm_response = tru_query_engine.query("Where was the author born?")

print(llm_response)


The author does not mention where they were born in the context information provided.
✅ record record_hash_0f9ef2099fa55956ff64769faf30e761 from LlamaIndex_App1 -> default.sqlite
✅ feedback feedback_result_hash_94525ec12e1b0c1057c0d82e65c185e4 on record_hash_0f9ef2099fa55956ff64769faf30e761 -> default.sqlite
✅ feedback feedback_result_hash_45fd96bdd2aeea3b3b39bca1f1396f47 on record_hash_0f9ef2099fa55956ff64769faf30e761 -> default.sqlite
✅ feedback feedback_result_hash_9c20340b5e4edaf800510b8e34ffbf52 on record_hash_0f9ef2099fa55956ff64769faf30e761 -> default.sqlite
✅ feedback feedback_result_hash_b9b6b806cbdb76edf799290703d6757f on record_hash_0f9ef2099fa55956ff64769faf30e761 -> default.sqlite


In [12]:
# Run and evaluate on the ground truth questions
for pair in golden_set:
    llm_response = tru_query_engine.query(pair['query'])
    print(llm_response)
    


The author's undergraduate major was philosophy.
✅ record record_hash_2e38582d67e232f205b4ecf9dba54e4e from LlamaIndex_App1 -> default.sqlite
He didn't choose a major, and customized his courses.
5
0.5
✅ feedback feedback_result_hash_f2d147ac97195592cb45d97ab882c5b0 on record_hash_2e38582d67e232f205b4ecf9dba54e4e -> default.sqlite
✅ feedback feedback_result_hash_60c97d19a95badfe4a2d326f27b39d9f on record_hash_2e38582d67e232f205b4ecf9dba54e4e -> default.sqlite
✅ feedback feedback_result_hash_07d226d3787225e7f62bb830ea491f06 on record_hash_2e38582d67e232f205b4ecf9dba54e4e -> default.sqlite
✅ feedback feedback_result_hash_b6b8feddf8957703433645d102350e68 on record_hash_2e38582d67e232f205b4ecf9dba54e4e -> default.sqlite

The author started the company Viaweb in 1995.
✅ record record_hash_3ab4955890239c83eab9a222c596fadb from LlamaIndex_App1 -> default.sqlite
Viaweb, to make software for building online stores.
10
1.0
✅ feedback feedback_result_hash_34abbeb1367a0f5cd77aa1823f6e507b on reco

Waiting for {'error': 'Model papluca/xlm-roberta-base-language-detection is currently loading', 'estimated_time': 44.49275207519531} (44.49275207519531) second(s).
Waiting for {'error': 'Model papluca/xlm-roberta-base-language-detection is currently loading', 'estimated_time': 44.49275207519531} (44.49275207519531) second(s).


4
0.4

The program the author started with Jessica Livingston in 2005 was Y Combinator, an angel investment firm.
✅ record record_hash_22620c71a0f55f534f509d10e357fcdb from LlamaIndex_App1 -> default.sqlite
Y Combinator, to provide seed funding for startups.


Waiting for {'error': 'Model papluca/xlm-roberta-base-language-detection is currently loading', 'estimated_time': 44.49275207519531} (44.49275207519531) second(s).
Waiting for {'error': 'Model papluca/xlm-roberta-base-language-detection is currently loading', 'estimated_time': 44.49275207519531} (44.49275207519531) second(s).


10
1.0


## Explore in a Dashboard

In [None]:
tru.run_dashboard() # open a local streamlit app to explore

# tru.stop_dashboard() # stop if needed

### Leaderboard

Understand how your LLM application is performing at a glance. Once you've set up logging and evaluation in your application, you can view key performance statistics, including cost and average feedback value, across all of your LLM apps using the chain leaderboard. As you iterate on new versions of your LLM application, you can compare their performance across all of the quality metrics you've set up.

Note: Average feedback values are returned and displayed in a range from 0 (worst) to 1 (best).
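
The run log above shows pairs such as `5` followed by `0.5` and `10` followed by `1.0`, which is consistent with a rating out of 10 being divided by 10. A sketch of that normalization, under the assumption that this is what the logged pairs represent (`normalize_rating` is a hypothetical helper, not a TruLens function):

```python
def normalize_rating(rating, scale=10):
    """Map a rating on a 0..scale range onto 0..1, clamping out-of-range values."""
    return max(0.0, min(1.0, rating / scale))

# Raw ratings as they appear in the run log above.
for raw in (4, 5, 10):
    print(raw, "->", normalize_rating(raw))
```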

![Chain Leaderboard](https://www.trulens.org/Assets/image/Leaderboard.png)

To dive deeper on a particular chain, click "Select Chain".

### Understand chain performance with Evaluations
 
To learn more about the performance of a particular chain or LLM, select it to view its evaluations at the record level. LLM quality is assessed through feedback functions: extensible methods for determining the quality of LLM responses that can be applied to any downstream LLM task. Out of the box we provide a number of feedback functions for assessing model agreement, sentiment, relevance and more.

The evaluations tab provides record-level metadata and feedback on the quality of your LLM application.

![Evaluations](https://www.trulens.org/Assets/image/Leaderboard.png)

### Deep dive into full chain metadata

Click on a record to dive deep into all of the details of your chain stack and underlying LLM, captured by tru_query_engine.

![Explore a Chain](https://www.trulens.org/Assets/image/Chain_Explore.png)

If you prefer the raw format, you can quickly get it using the "Display full chain json" or "Display full record json" buttons at the bottom of the page.

Note: Feedback functions evaluated in deferred mode can be seen on the "Progress" page of the TruLens dashboard.

## Or view results directly in your notebook

tru.get_records_and_feedback(app_ids=[])[0] # pass an empty list of app_ids to get all