<a href="https://colab.research.google.com/github/smjune/ipynb/blob/main/1_Llama_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📓 Llama-Index Practice

In this section, we will create a simple Llama Index app and learn how to log and get feedback on an LLM response. Additionally, by evaluating our Llama Index app using TruLens, we will see how TruLens assesses an LLM app.  

Through this process, we hope to gain a clear understanding of the role and functionalities of the Llama Index, as well as the specific inputs and outputs of each method.  

<img src="https://miro.medium.com/v2/resize:fit:1400/0*CrywD0tloiK9dgy_.png" width="600">

This section is divided into three main stages:

### I. Build a Query Engine  
### II. Use Database Management Method  
### III. Initialize Evaluation Metrics and Evaluate Query Engine  

Let's first look at what is needed to build a Query Engine.

## I. Building a Query Engine

The task of this section is to create a Query Engine that takes a user's query, retrieves related content, and returns a final summary.  

Ignore the `Vector DB` in the picture. We don't use an external vector database in this practice. Instead, we will use local memory as a simple vector store.


<img src="https://i.imgur.com/bR4xaBd.png">
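
To make the retrieve-then-summarize flow concrete, here is a minimal, library-free sketch. It is not how Llama Index works internally: the hypothetical `toy_retrieve` ranks passages by shared words rather than embeddings, and the hypothetical `toy_synthesize` only stitches passages together where a real query engine would call an LLM.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_retrieve(query, passages, top_k=2):
    """Rank passages by the number of words shared with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(p)), p) for p in passages]
    # Sort by score, highest first; drop passages with no overlap at all.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]

def toy_synthesize(query, retrieved):
    """Stand-in for the LLM call: report which passages were retrieved."""
    if not retrieved:
        return "No relevant context found."
    return f"Q: {query} | Context used: " + " / ".join(retrieved)

def toy_query_engine(query, passages):
    """Retrieve first, then synthesize from the retrieved passages only."""
    return toy_synthesize(query, toy_retrieve(query, passages))

passages = [
    "The first programs I tried writing were on the IBM 1401.",
    "I studied painting in Florence.",
]
print(toy_query_engine("What were the first programs?", passages))
```

The point of the sketch is the shape of the pipeline: the synthesizer only ever sees what the retriever hands over.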

To accomplish this task, the following steps will be taken.
<br/>

#### 1. Install and Import Libraries
#### 2. Download Data
#### 3. Make Query Engine from Index Directly
#### 4. Conduct Activity


### 1. Install and Import Libraries

```Python
! pip install trulens_eval llama_index openai --quiet
! pip install packaging==23.2 streamlit==1.35.0 --quiet
```
```Python
import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # Insert your OpenAI API key

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document

from trulens_eval import Tru
# Note: this `OpenAI` is the TruLens feedback provider, not the `openai` client class.
from trulens_eval.feedback.provider import OpenAI
from trulens_eval import Feedback
from trulens_eval.app import App

import numpy as np

import textwrap
import openai  # used later through `openai.chat.completions.create`
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



You can check if the `OPENAI_API_KEY` is properly set as an environment variable by using the following command.

```Python
! echo $OPENAI_API_KEY
```
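
Echoing the key prints the whole secret into the notebook output. As an alternative, the sketch below checks the variable from Python and shows only the last few characters. `mask_secret` is a hypothetical helper written for this example, not part of any library.

```python
import os

def mask_secret(value, visible=4):
    """Return the secret with all but the last `visible` characters hidden."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Reads the variable without printing the full key.
print(mask_secret(os.environ.get("OPENAI_API_KEY")))
```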

In [None]:
### YOUR CODE HERE ###



### 2. Download Data

Previously, we learned that the Query Engine receives the following elements as input and output.
<br/>
#### Input: **User Query**, **Vector Database**
#### Output: **Summary of Retrieved Passages**
<br/>

Here, the User Query is an element that we can decide. However, since the Vector Database must contain information related to the Query, we need to consider how to configure the Vector Database before proceeding with the practice.

The Vector Database can be divided into two parts: the storage that allows the query engine to retrieve related passages, and the data containing the related passages. In this practice, we will use local memory as data storage instead of an external Vector Database, so the Vector Database will be referred to as a Vector Store from now on. For the data, we will use a simple txt file provided by Llama Index.

Before we begin the practical exercises, let's first take a look at the txt file we will use. The code below downloads the txt file with the `wget` command.

```Python
!wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -P data/
```

In [None]:
### YOUR CODE HERE ###



If there are no issues with the download, you should be able to see that the `paul_graham_essay.txt` file has been downloaded.

This example uses the text of Paul Graham’s essay, [“What I Worked On”](https://paulgraham.com/worked.html), and is the canonical llama-index example.


Let’s print the first fifteen lines of the txt file. This will allow us to see that the text file deals with the experiences and reflections of one person's life.


```Python
path_to_txt = '/path/to/paul_graham_essay.txt' #change this path

with open(path_to_txt, 'r', encoding='utf-8') as file:
    for i in range(15):
        line = file.readline()
        if not line:
            break
        print(line.strip())
print("\n...")
```

In [None]:
! pwd

In [None]:
### YOUR CODE HERE ###



### 3. Make Query Engine from Index Directly

Now let's create a query engine using the above information.

First, since Llama Index cannot directly handle text files in their raw form, we must convert them into Documents and Nodes. This can be accomplished with the following code.

```Python
documents = SimpleDirectoryReader("data").load_data()
```

SimpleDirectoryReader takes the directory path containing data that will go into the Vector Store as input and converts all the files within that directory into Document objects. The data types that can be converted into Documents are as follows.


*   .txt - text file  
*   .csv - comma-separated values  
*   .docx - Microsoft Word  
*   .epub - EPUB ebook format  
*   .hwp - Hangul Word Processor  
*   .ipynb - Jupyter Notebook  
*   .jpeg, .jpg - JPEG image  
*   .mbox - MBOX email archive  
*   .md - Markdown  
*   .mp3, .mp4 - audio and video  
*   .pdf - Portable Document Format  
*   .png - Portable Network Graphics  
*   .ppt, .pptm, .pptx - Microsoft PowerPoint  

If you want to know your current path, you can use the following command.
```Python
! pwd
```

In [None]:
### YOUR CODE HERE ###



You can view the text contained in the `documents` using the following command:

```Python
for i in range(len(documents)):
  print(documents[i].text)
```

In [None]:
### YOUR CODE HERE ###



Once the Document is complete, we need to use this data to create an index.

The index is created using the `VectorStoreIndex.from_documents()` method, which takes a list of `Document` objects as input. (To build an index from a list of Node objects instead, pass them to the `VectorStoreIndex` constructor.)


```Python
index = VectorStoreIndex.from_documents(documents)
```

If connected to an external vector database, Llama Index follows the index algorithm of that vector database. If there is no connection to an external vector database, it stores the data in the local memory in the form of a dictionary.
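
To picture what "a dictionary in local memory" means, here is a toy sketch of an in-memory vector store. `ToyVectorStore` is a hypothetical class for illustration only; it assumes embeddings are plain lists of floats and uses cosine similarity for search, and it does not mirror the actual Llama Index implementation.

```python
import math

class ToyVectorStore:
    """Keeps node_id -> embedding in a plain dict and searches by cosine similarity."""

    def __init__(self):
        self.embedding_dict = {}

    def add(self, node_id, embedding):
        self.embedding_dict[node_id] = list(embedding)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, query_embedding, top_k=2):
        """Return the top_k (node_id, score) pairs, best match first."""
        scored = [(node_id, self._cosine(query_embedding, emb))
                  for node_id, emb in self.embedding_dict.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

store = ToyVectorStore()
store.add("node-a", [1.0, 0.0])
store.add("node-b", [0.0, 1.0])
store.add("node-c", [0.7, 0.7])
print(store.query([1.0, 0.1], top_k=2))
```

Everything lives in the Python process, so the data disappears when the runtime restarts unless it is saved explicitly.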

In [None]:
### YOUR CODE HERE ###



Now, let’s look at how the index is structured.  

Our index is built based on documents. However, as mentioned in the previous presentation, when Llamaindex stores data in the database, it breaks down the document into nodes. Therefore, let’s first check if the document has been divided into nodes.

```Python
node_id = index.index_struct.nodes_dict

for key, value in node_id.items():
    print(value)
    node_example_id = value
    break
    
print("The number of nodes: ", len(node_id.values()))

print(node_id.values())
```

In [None]:
### YOUR CODE HERE ###



By looking at this, we can check the id of the first node and the number of nodes the document has been divided into.

Next, let’s see how each node stores its text and how it has been transformed into an embedding.

```Python
index._storage_context.docstore.docs[node_example_id].text
```

```Python
index.vector_store.data.embedding_dict[node_example_id]
```

In [None]:
### YOUR CODE HERE ###



As you can see here, each node is automatically converted into embeddings, becoming vectors when the index is created.

Next, let’s verify why the total number of nodes is as shown. When splitting a document into nodes, Llamaindex uses a class called SentenceSplitter. This class allows us to divide the text data into predefined lengths.

```Python
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
#you can change chunk_size, chunk_overlap

nodes = parser.get_nodes_from_documents(documents)

print(len(nodes))
```
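
The roles of `chunk_size` and `chunk_overlap` can be seen in a simplified sliding-window sketch. This is an assumption-laden illustration, not the `SentenceSplitter` algorithm: it treats whitespace-separated words as tokens and ignores sentence boundaries, and `chunk_with_overlap` is a hypothetical function.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

words = "a b c d e f g h i j".split()
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a smaller `chunk_size` or a larger `chunk_overlap` produces more chunks.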

In [None]:
### YOUR CODE HERE ###



Each node has an attribute called `text`, allowing us to see what text data each node contains. Additionally, you can see that the texts overlap by the value of the `chunk_overlap`.

When checking the length of each text data, we can see that they are composed of different numbers of characters. This is because the `SentenceSplitter` class divides `documents` not by the number of characters but by tokens, which are a type of sub-word.

```Python
print(nodes[0].text)
print("\n----------------------\n")
print(nodes[1].text)
print("\n----------------------\n")
print(len(nodes[0].text), len(nodes[1].text))
```

In [None]:
### YOUR CODE HERE ###



We can use the default tokenizer used by Llamaindex’s `SentenceSplitter` to directly check how many tokens each text consists of.

By running the code below, we can bring in the tokenizer of `gpt-3.5-turbo`, which is used by default in `SentenceSplitter`, and perform tokenization to see the results and check how many tokens are generated.

When we look at the tokenization results, we will see that they differ from the `chunk_size` we set. This is because, when creating nodes, the amount of text data to be tokenized is determined by considering the metadata as well, in order to compose nodes of consistent size including the amount of metadata.

```Python
import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

encoded_text = enc.encode(nodes[0].text)
print(f"Encoded text: {encoded_text}")
print(f"Encoded text length: {len(encoded_text)}")

decoded_text = enc.decode(encoded_text)
print(f"Decoded text: \n\n{decoded_text}")
```

In [None]:
### YOUR CODE HERE ###



You can see that the result here matches the number we saw earlier.  

If you want to change the default `chunk_size` or `chunk_overlap` when you create the `index`, you can use the `transformations` argument of `VectorStoreIndex.from_documents()`.

```Python
text_splitter = SentenceSplitter(chunk_size=200, chunk_overlap=50)

index = VectorStoreIndex.from_documents(documents=documents, transformations=[text_splitter])
```
```Python
node_id = index.index_struct.nodes_dict

for key, value in node_id.items():
    print(key, value)
    node_example_id = value
    break

print("The number of nodes: ", len(node_id.values()))

print(node_id.values())
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



In this practice, however, we will use the default `SentenceSplitter`.

```Python
index = VectorStoreIndex.from_documents(documents)
```

In [None]:
### YOUR CODE HERE ###



Once the index is created, Llama Index provides a way to immediately create a query engine using the index. With the following code, we can create a query engine using the index.

```Python
query_engine = index.as_query_engine()
```

In [None]:
### YOUR CODE HERE ###



### 4. Conduct Activity

Now let's actually use the query engine. We will check the results of query engine using the following questions.

#### 4-1. Answerable Questions from Data

In this activity, we will check how query engine will answer **a question related to the given data, and the information needed for the correct answer is directly provided in the data**.

#### 4-2. Unrelated Question to the Data

In this activity, we will check how query engine will answer **a question that has absolutely nothing to do with the given data**.

#### 4-3. Question Requiring Complex Reasoning

In this activity, we will check how query engine will answer **a question that is hard to answer, which means that question is related to the given data, but the information needed for the correct answer is not directly given in the data**.

### 4-1. Answerable Questions from Data
First, let's ask the query engine a question that it can answer from the data.

For example, we could ask about the first programs the author of this data tried writing. The answer to this question is in the second sentence of the txt file.

#### Question: **What is the first programs the author tried writing?**

#### In paul_graham_essay: **The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."**

Intuitively, it is expected that the query engine will generate the correct answer, which is IBM 1401. Let's verify that.

```Python
text_chunk = "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\""

if text_chunk in documents[0].text:
    print("True")
else:
    print("False")
```

```Python
response = query_engine.query("What is the first programs the author tried writing?")

print(response)

```

In [None]:
### YOUR CODE HERE ###



True


In [None]:
### YOUR CODE HERE ###



The first programs the author tried writing were on the IBM 1401 using an early version of Fortran.


<img src="https://xe.obg.co.kr/files/attach/images/4199/352/004/7e2189cf45a27f9b9cda5fef28c1dd5f.gif">

As expected, the results are accurate.  


### 4-2. Unrelated Question to the Data

This time, we will check whether the query engine performs its intended function well.  
A query engine is **a generic interface that allows users to ask questions about external data**.  
Therefore, **it should not be able to answer questions that cannot be answered through external data**.

The reason is that if the query engine can answer questions that are completely unrelated to the external data, it implies that the query engine can freely use parameterized knowledge. This, in turn, means that the answers from the query engine may contain information that is incorrect or not up-to-date.

<img src="https://media0.giphy.com/media/ANbD1CCdA3iI8/200w.gif?cid=6c09b952ooe7fzryithl047ty5npdt1xd50tlu9gcpqnzh87&ep=v1_gifs_search&rid=200w.gif&ct=g" width="200" height="200"><img src="https://st2.depositphotos.com/4421345/11492/v/950/depositphotos_114921728-stock-illustration-public-speaking-robot.jpg" width="200">

So, let’s pass on a question that has nothing to do with the original data but that the query engine’s LLM can answer.  
For example, we could use a question like this.

#### Question: **How many countries participated in the production of the space station?**
#### Answer: **15**

```Python
response = query_engine.query("How many countries participated in the production of the space station?")

print(response)
```

In [None]:
### YOUR CODE HERE ###



<img src="https://img.danawa.com/images/descFiles/6/164/5163558_1_16652362092043267.gif" width="200">

As expected, it doesn't produce the correct answer. To understand why, we can look up which LLM the query engine uses in the official LlamaIndex documentation.

LlamaIndex uses `gpt-3.5-turbo` as the LLM of the response synthesizer. So, we can ask `gpt-3.5-turbo` directly to answer the question.

Let's check this with some simple code. We can use the following code to call the OpenAI LLM separately.  


```Python
def generate_answer(question):
    messages = [
        {
            "role": "user",
            "content": question,
        },
    ]
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=messages,
    )

    return response.choices[0].message.content
```
```Python
question = "How many countries participated in the production of the space station?"

print(generate_answer(question))
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



In fact, as the example above shows, the OpenAI LLM already knows the answer to this question.  
However, since the relevant information cannot be found in the txt file that the query engine refers to, the query engine does not produce the correct answer, even though its LLM knows it.  

<img src="https://i.imgur.com/sxSRQks.png">

From this, we can understand that the LLM in the response synthesizer of the query engine relies on the retrieved passages produced by the retriever and sets aside its own parametrized knowledge.  

### 4-3. Question Requiring Complex Reasoning

Now, let’s explore more deeply. There is a question related to the data that is hard to answer. We will see how the query engine responds to such questions.

The question we will ask in this section is who the author of `paul_graham_essay.txt` is. Because the original text file does not mention the author's name directly, this is a complex question.

Question: **Who is the author?**

Let's check whether the writer's name really is absent from the original text file.

```Python
path_to_txt = '/path/to/paul_graham_essay.txt'
wrapper = textwrap.TextWrapper(width=80)

with open(path_to_txt, 'r', encoding='utf-8') as file:
    for line in file:
        if 'paul graham' in line.lower():
            formatted_text = wrapper.fill(line.strip())
            print(formatted_text.split('.')[-1])
```

In [None]:
### YOUR CODE HERE ###



This is the only sentence in the original txt file that mentions 'Paul Graham'.  
Now, let's read it ourselves and consider whether it actually specifies the author's name.

<br/>

Someone might infer that the author's name is Paul Graham from this sentence, but I think it's not certain.

In my opinion, this sentence does not explicitly state the author's name, so it does not provide enough useful information.

<br/>

However, opinions on this may vary, so let's verify it ourselves.

```Python
response_complex = query_engine.query("Who is the author?")

print(response_complex)
```

In [None]:
### YOUR CODE HERE ###



<img src="https://i3.ruliweb.net/ori/21/07/30/17af799f631524cc2.gif">

Hmm... The query engine answers better than we think.  
Then, let's see what data were referred to by the query engine.
<br/>

<img src="https://i.imgur.com/sxSRQks.png">

The query engine made using `as_query_engine` uses a retriever generated directly from the index by the `as_retriever` method.   
Therefore, if we look at the results of this retriever, we can see which retrieved passages the query engine has seen.

```Python
retriever = index.as_retriever()

ret_passages = retriever.retrieve("Who is the author?")

for i in range(len(ret_passages)):
  print("###Retrieved Passage\n", ret_passages[i].text)
  print("\n\n\n")
```

In [None]:
### YOUR CODE HERE ###



These are quite long passages, so it is hard to check whether they contain any evidence for the query engine's response.  
Can you find any supporting evidence for the answer?  

Let's see if 'Paul Graham' is in there.

```Python
passages = ''

for i in range(len(ret_passages)):
  passages += ret_passages[i].text

if 'paul graham' in passages.lower():
    print("The word 'Paul Graham' is found in the passages.")
else:
    print("The word 'Paul Graham' is not found in the passages.")
```

In [None]:
### YOUR CODE HERE ###



<img src="https://i2.ruliweb.net/ori/21/07/24/17ad8e756ba54811a.gif" width="400">

???

The retriever couldn't find the only sentence containing the words Paul Graham. Then how did the query engine know that the answer was Paul Graham?

To verify this, let’s use the query engine’s LLM as we did in the previous section and ask the same question. We will use a prompt as similar as possible to the one used in the LlamaIndex response synthesizer, but with an additional instruction to provide the reason for the LLM’s answer.  

The prompt used by the original Response Synthesizer is as follows.


<img src = "https://i.imgur.com/ZXUDAJi.jpeg" width="450">  


<br/>

Therefore, we will write the code as shown below to check the results of the LLM.


```Python
ret_context = ""
for ret_result in ret_passages:
  ret_context += ret_result.text

question = f"""Context information is below.
---------------------
{ret_context}
---------------------
Given the context information and not prior knowledge,
answer the query and the reasons of answer.
Query: Who is the author?
Answer:
"""

print(generate_answer(question))
```


In [None]:
### YOUR CODE HERE ###



Well... it seems like query engine can **capture useful information not only in direct evidence but also in indirect content by using its prior knowledge, i.e., parametrized knowledge**.  

Therefore, we can consider that the query engine has the ability to provide correct answers even when the given query is not explicitly answered in the data.  

However, the problem is that **without such an explanation, the reasoning process cannot be trusted** because it is impossible to know whether the LLM used reliable information in its reasoning.

<br/>

Here’s what we can infer from this:

The query engine has the capability to answer not only questions directly addressed in the data **but also questions that can be indirectly inferred from the data.**  
However, in such cases, **the reliability of the answers is compromised because we cannot verify the information used and the reasoning steps.**   

To address this issue, we can use one of the following two methods.  


#### **1.** Force the response synthesizer to utilize only the information in the given retrieved passages.
#### **2.** When the response synthesizer generates a summary, make sure that the reason is also generated.
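
Both methods can be approximated at the prompt level. The sketch below builds a prompt that restricts the model to the context and optionally asks for supporting quotes. The wording of the instructions and the helper `build_grounded_prompt` are assumptions made for illustration; a prompt can encourage this behavior but cannot guarantee it.

```python
def build_grounded_prompt(context, query, ask_for_reasons=True):
    """Assemble a prompt that restricts the model to `context`."""
    instructions = [
        "Given the context information and not prior knowledge, answer the query.",
        "If the context does not contain the answer, say that you cannot answer.",
    ]
    if ask_for_reasons:
        # Method 2: request the evidence together with the answer.
        instructions.append(
            "After the answer, quote the sentences from the context that support it."
        )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        + "\n".join(instructions)
        + f"\nQuery: {query}\nAnswer: "
    )

print(build_grounded_prompt("I worked on Lisp.", "Who is the author?"))
```

The resulting string can be passed to the `generate_answer` function defined earlier in this practice.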

## II. Use Database Management Method

Now, let’s consider a scenario. If we have new data and want to create a query engine using it, we could recreate the `document`, `index`, and `query engine` as we did above.

However, this is quite cumbersome and time-consuming. In addition, if you want to insert new data for the query engine to use, the above method is very expensive because a new index must be generated for every insert operation.

So, Llama Index provides the following functions to manage data smoothly.

### 1. Insert

We can insert new data to the existing query engine.

To check this, we will first ask a question that the existing data cannot answer, and then see what happens after inserting the new data.

In this section, the following data and question will be used.

Inserted Data: **"Natural diamonds were (and are) formed (thousands of million years ago) in the upper mantle of Earth in metallic melts at temperatures of 900–1,400 °C and at pressures of 5–6 GPa."**  
Question: **What are the temperature and air pressure conditions under which natural diamonds are produced?**

```Python
data_text = "Natural diamonds were (and are) formed (thousands of million years ago) in the upper mantle of Earth in metallic melts at temperatures of 900–1,400 °C and at pressures of 5–6 GPa."
question = "What are the temperature and air pressure conditions under which natural diamonds are produced?"

res = query_engine.query(question)

print(res)
```

In [None]:
### YOUR CODE HERE ###



<img src = "https://i.pinimg.com/originals/15/8b/ed/158bed9819e4fccf7e18a5eeeaf79c6b.png" width = 200>

Hmm… the query engine answers quite well. It seems the response synthesizer is using its parameterized knowledge. However, there are no specific details like 900–1,400°C and pressures of 5–6 GPa.

Anyway, it's time to insert the data.

```Python
docu = Document(text=data_text, id_="new_doc_id")

index.insert(docu)

new_query_engine = index.as_query_engine()
```

```Python
new_res = new_query_engine.query(question)

print(new_res)
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



It's as expected. Query engine is generating an appropriate answer with detail using information in new `Document` object.

Then, what happens if we change the information in that data?

### 2. Update

If a Document is already present within an index, you can "update" a `Document` with the same doc `id_`. The `update_ref_doc` method receives a single `Document` object and updates the value of data that has the same `id_`.

Or you can "refresh" all documents at once. The `refresh_ref_docs` method takes a list of `Document` objects; for each one it updates the stored data with the same `id_`, or inserts the document if there is no data with that `id_`.

**`update_ref_doc`**  
**Input**: single `Document` object  
**Output**: None

**`refresh_ref_docs`**  
**Input**: list of `Document` objects  
**Output**: a boolean list, indicating which documents in the input have been refreshed in the index.

You will see `False` in that boolean list if text of document doesn't change.  

In both methods, we can set `delete_from_docstore` to `True` or `False`. A detailed description of this will be given in Delete.
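
The difference between updating and refreshing can be modeled with a toy class. This is a simplified sketch that assumes change detection by comparing a hash of the text; `ToyDocIndex` and its methods are hypothetical and are not the Llama Index API.

```python
import hashlib

class ToyDocIndex:
    """Tracks documents by id and remembers a hash of each document's text."""

    def __init__(self):
        self.texts = {}   # doc_id -> text
        self.hashes = {}  # doc_id -> hash of the text

    @staticmethod
    def _hash(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def insert(self, doc_id, text):
        self.texts[doc_id] = text
        self.hashes[doc_id] = self._hash(text)

    def update(self, doc_id, text):
        """Replace an existing document: delete the old entry, insert the new one."""
        self.texts.pop(doc_id, None)
        self.hashes.pop(doc_id, None)
        self.insert(doc_id, text)

    def refresh(self, docs):
        """Insert new docs, update changed ones, and report which were touched."""
        refreshed = []
        for doc_id, text in docs:
            if doc_id not in self.hashes:
                self.insert(doc_id, text)
                refreshed.append(True)
            elif self.hashes[doc_id] != self._hash(text):
                self.update(doc_id, text)
                refreshed.append(True)
            else:
                refreshed.append(False)  # text unchanged, nothing to do
        return refreshed

toy = ToyDocIndex()
toy.insert("new_doc_id", "900-1,400 C")
print(toy.refresh([("new_doc_id", "900-1,400 C"),
                   ("new_doc_id", "2,000-6,000 C"),
                   ("other_doc", "hello")]))
```

In this toy run the first entry is skipped because its text is unchanged, the second is updated, and the third is inserted as a new document.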


<br/>

Let's check by changing the inserted data a little bit.  

I changed the temperature and pressure values in the data, so check it out for yourself.  

```Python
docu.text = "Natural diamonds were (and are) formed (thousands of million years ago) in the upper mantle of Earth in metallic melts at temperatures of 2,000–6,000 °C and at pressures of 8–9 GPa."

output = index.update_ref_doc(
    docu,
    update_kwargs={"delete_kwargs": {"delete_from_docstore": True}},
)

print(output)

query_engine_update = index.as_query_engine()

res_update = query_engine_update.query(question)

print(res_update)
```

```Python
docu.text = "Natural diamonds were (and are) formed (thousands of million years ago) in the upper mantle of Earth in metallic melts at temperatures of 6,000–8,000 °C and at pressures of 12–15 GPa."

output = index.refresh_ref_docs(
    [docu]
)

print(output)

query_engine_refresh = index.as_query_engine()

res_refresh = query_engine_refresh.query(question)

print(res_refresh)
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



You can check the list of `doc_id`s in the index. So, if you want to experiment with other data, you can verify that your data was inserted properly by looking at that list.

```Python
print(index.ref_doc_info.keys())
```

In [None]:
### YOUR CODE HERE ###



### 3. Delete

We wrapped the new information above with a `Document` object and inserted it into the index.

At that time, we identified the `Document` object by its `doc_id`. We can confirm what this `doc_id` is as follows.

```Python
id = docu.doc_id

print(id)
```

In [None]:
### YOUR CODE HERE ###



Using this `doc_id`, we can delete the corresponding data with the `delete_ref_doc` method of the index.

Let's delete the data related to diamonds and see how the query engine answers the same question.

The value of `delete_from_docstore` determines whether the data with that id is actually removed from the database, or removed only from the index while remaining in the database. Even if it is removed only from the index, the query engine can no longer use the data.


```Python
index.delete_ref_doc(id, delete_from_docstore=True)

query_engine_delete = index.as_query_engine()

res_delete = query_engine_delete.query(question)

print(res_delete)

```
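
The effect of `delete_from_docstore` can be pictured with a toy model that keeps the searchable index and the document storage separate. `ToyIndexWithDocstore` is a hypothetical class for illustration and is much simpler than the real index.

```python
class ToyIndexWithDocstore:
    """Separates what the retriever can see (index) from what is stored (docstore)."""

    def __init__(self):
        self.index_ids = set()  # ids the retriever is allowed to search
        self.docstore = {}      # doc_id -> text

    def insert(self, doc_id, text):
        self.index_ids.add(doc_id)
        self.docstore[doc_id] = text

    def delete(self, doc_id, delete_from_docstore=False):
        self.index_ids.discard(doc_id)
        if delete_from_docstore:
            self.docstore.pop(doc_id, None)

    def searchable_texts(self):
        """Only documents still listed in the index can reach the query engine."""
        return [self.docstore[d] for d in sorted(self.index_ids)]

toy = ToyIndexWithDocstore()
toy.insert("new_doc_id", "diamond facts")
toy.delete("new_doc_id", delete_from_docstore=False)
print(toy.searchable_texts())        # nothing left to retrieve
print("new_doc_id" in toy.docstore)  # the text itself is still stored
```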

In [None]:
### YOUR CODE HERE ###



If the answer still contains the detailed conditions for natural diamonds, check the index to see whether the data was deleted properly.

The result of `index.ref_doc_info.keys()` has to contain only one doc_id.

```Python
print(index.ref_doc_info.keys())
```

In [None]:
### YOUR CODE HERE ###



## III. Evaluate Query Engine

In this section, we will define evaluation metrics and go through the process of evaluating the performance of the query engine, in order to become familiar with the TruLens API, which is used to measure the performance of RAG later on.
<br/>

<img src="https://www.trulens.org/assets/images/RAG_Triad.jpg" width="600">

For evaluation, we will leverage the "hallucination triad" of groundedness, context relevance and answer relevance.

Simply put, using the three metrics described above, we can verify whether the summaries generated by the query engine are supported by the retrieved passages and related to the query. Detailed explanations of the evaluation metric and TruLens will be covered in the subsequent RAG Practice.
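
To see which pair of texts each metric compares, here is a crude stand-in that scores by word overlap. This is not how TruLens computes the metrics (TruLens asks an LLM to judge); the functions `overlap` and `toy_triad` are hypothetical and only illustrate the inputs of each metric and the averaging over retrieved passages.

```python
import re
from statistics import mean

def words(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(source, target):
    """Fraction of the words in `target` that also appear in `source` (0.0 to 1.0)."""
    target_words = words(target)
    if not target_words:
        return 0.0
    return len(words(source) & target_words) / len(target_words)

def toy_triad(query, contexts, answer):
    return {
        # Is each retrieved passage about the query? Averaged over passages.
        "context_relevance": mean(overlap(c, query) for c in contexts),
        # Is the answer supported by the retrieved passages taken together?
        "groundedness": overlap(" ".join(contexts), answer),
        # Does the answer address the query?
        "answer_relevance": overlap(answer, query),
    }

scores = toy_triad(
    query="first programs",
    contexts=["The first programs ran on the IBM 1401.", "I liked painting."],
    answer="The first programs ran on the IBM 1401.",
)
print(scores)
```

Context relevance looks at query and passages, groundedness at passages and answer, and answer relevance at query and answer.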

The purpose of this section is to become familiar with the inputs and outputs used in creating evaluation metrics with TruLens.

This section is composed of the following three stages:
 <br/>
#### 1. Initialize Feedback Functions
#### 2. Make Instrument app for Logging with TruLens
#### 3. Check Records and Feedback


### 1. Initialize Feedback Functions

In this section, we will define the metrics used for evaluation. The evaluation metrics can be implemented using the `Feedback` class of TruLens.

```Python
tru = Tru()

provider = OpenAI()

context = App.select_context(query_engine)

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name = "Groundedness")
    .on(context.collect())
    .on_output()
)

f_answer_relevance = (
    Feedback(provider.relevance, name = "Answer Relevance")
    .on_input_output()
)

f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name = "Context Relevance")
    .on_input()
    .on(context.collect())
    .aggregate(np.mean)
)
```

In [None]:
### YOUR CODE HERE ###



### 2. Make Instrument app for Logging with TruLens

Next, it is necessary for TruLens to recognize the defined evaluation metric and to apply the query engine to the TruLens API. This can be accomplished with the code below. Here, the evaluation results can be checked using the `app_id`.

```Python
from trulens_eval import TruLlama
tru_query_engine_recorder = TruLlama(query_engine,
    app_id='LlamaIndex_App1',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance])
```
```Python
with tru_query_engine_recorder as recording:
    query_engine.query("What did the author do growing up?")
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



### 3. Check Records and Feedback

Now, let's see which values of the query engine were evaluated by the TruLens API and how the evaluation turned out. First, using the code below, we can view the input, intermediate values, and output of the query engine as recognized by TruLens.

```Python
eval_record = recording.get()

print("###Query\n", eval_record.main_input)
print("""      ↓
      ↓
      ↓
      ↓   Retriever retrieved relevant passages.
      ↓
      ↓
      ↓""")
for i in range(len(eval_record.calls[0].rets)):
  print("###Retrieved Passages \n", eval_record.calls[0].rets[i]['node']['text'], "\n\n")
print("""      ↓
      ↓
      ↓
      ↓   Query engine summarize retrieved passages.
      ↓
      ↓
      ↓""")
print("###Summary of Query Engine \n", eval_record.main_output)
```

In [None]:
### YOUR CODE HERE ###



If it's confirmed that there are no issues with the values of the query engine received by TruLens, now let's check the actual results.
```Python
tru.run_dashboard()
```

In [None]:
### YOUR CODE HERE ###



The TruLens dashboard can be accessed through an external URL.  

You can also check the scores without accessing the dashboard, as follows.

```Python
tru.get_leaderboard(app_ids=["LlamaIndex_App1"])
```
```Python
rec = recording.get()

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)
```

In [None]:
### YOUR CODE HERE ###



In [None]:
### YOUR CODE HERE ###



```Python
records, feedback = tru.get_records_and_feedback(app_ids=["LlamaIndex_App1"])

records.head()
```

In [None]:
### YOUR CODE HERE ###



You can see the evidence for the LLM's evaluation.  

If you evaluate the query engine again without changing the `app_id` of the recorder, you will see one more row in the results of the previous cell.  

To see the evidence for that row, change `i` to the index of the row.

```Python
i=0 #Index of the evaluation that you want to check

print(records['Answer Relevance_calls'][i][0]['ret']) #Answer Relevance score

print("\n\n")
print("-------------------------------------")

for idx, sentence in enumerate(records['Groundedness_calls'][i][0]['args']['statement'].split('. ')):
  print(f"STATEMENT {idx}", sentence) #Query Engine answer
print("\n")

for j in range(len(records['Groundedness_calls'][i])):
  print(records['Groundedness_calls'][i][j]['meta']['reasons']) #Groundedness evidence
  print(records['Groundedness_calls'][i][j]['ret']) #Groundedness score
  print("-------------------------------------")

print("\n\n")

for j in range(len(records['Context Relevance_calls'][i])):
  print(records['Context Relevance_calls'][i][j]['meta']['reason']) #Context Relevance evidence
  print(records['Context Relevance_calls'][i][j]['ret']) #Context Relevance score
  print("-------------------------------------")
```

In [None]:
### YOUR CODE HERE ###



If everything is okay, then stop the dashboard.

```Python
tru.stop_dashboard()
```

In [None]:
### YOUR CODE HERE ###

