# RAG evaluation using TruLens
This notebook sets up a RAG pipeline using LangChain, LlamaCpp, Qdrant (a vector database), and TruLens. The goal is to retrieve astrophysics-related information from the database, use an LLM (OLMo-7B-Instruct) to generate answers, and evaluate the quality of those answers with TruLens.

## 1. Import required libraries and dependencies

In [1]:
# !pip install openai trulens
# !pip install --no-deps trulens-apps-langchain
# !pip install "trulens-providers-openai>=1.0.0"  # quoted so the shell does not treat >= as a redirect

In [2]:
import os
import numpy as np
import pandas as pd
from datetime import datetime

from pathlib import Path

from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_qdrant import Qdrant
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.output_parsers import StrOutputParser

from trulens.apps.app import instrument
from trulens.apps.langchain import TruChain
from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI

import warnings
warnings.filterwarnings("ignore")


In [3]:
# # Set the OpenAI API key for authentication (Replace with your actual API key)
# os.environ["OPENAI_API_KEY"] = "<API key>"

In [4]:
# import the function download_olmo_model from the ssec_tutorials repository
from ssec_tutorials import download_olmo_model
help(download_olmo_model)

Help on function download_olmo_model in module ssec_tutorials.setup:

download_olmo_model(model_file: 'str | None' = None, force=False) -> 'Path'
    Download the OLMO model from the Hugging Face model hub.
    
    Parameters
    ----------
    model_file : str | None, optional
        The name of the model file to download, by default None
    force : bool, optional
        Whether to force the download even if the file already exists, by default False
    
    Returns
    -------
    pathlib.Path
        The path to the downloaded model file



In [5]:
# initialize a TruLens session and reset its database:
session = TruSession()
session.reset_database()

🦑 Initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `TruSession` to prevent this.


Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]


In [6]:
# simple document formatting function
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
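
To see what `format_docs` produces, here is a minimal sketch that runs it on two stand-in documents. `FakeDoc` is a hypothetical placeholder for LangChain's `Document` class; it carries only the `page_content` attribute that the function reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (assumption)."""
    page_content: str


def format_docs(docs):
    # join the text of each document, separated by a blank line
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDoc("First chunk."), FakeDoc("Second chunk.")]
formatted = format_docs(docs)
print(formatted)
```

The blank line between chunks keeps them visually separate when they are later pasted into the prompt.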

## 2. Set up the vector database (Qdrant)

In [7]:
# load a sentence embedding model for text representation
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")

In [8]:
# Connect to Qdrant
qdrant_path = Path("/workspaces/Rubin-RAG/resources/rubin_qdrant")
qdrant_collection = "rubin_telescope"

# client.close()
client = QdrantClient(path=str(qdrant_path))

lcqdrant = Qdrant(client=client, 
                  collection_name=qdrant_collection, 
                  embeddings=embedding)
# lcqdrant = Qdrant.from_existing_collection(
#     collection_name=qdrant_collection, embedding=embedding, path=qdrant_path
# )

In [9]:
# set up a retriever for fetching relevant documents
retriever = lcqdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
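
The `search_type="mmr"` setting asks for maximal marginal relevance: results are chosen to be relevant to the query but not redundant with each other. The toy sketch below illustrates the idea in plain Python. It is not the LangChain or Qdrant implementation; the vectors, the `mmr_select` helper, and the 0.5 trade-off value are assumptions made for illustration.

```python
import math


def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR sketch: return the indices of k selected documents."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.2],    # most relevant
    [1.0, 0.22],   # near-duplicate of the first document
    [1.0, -0.5],   # less relevant, but different from the first
]
picked = mmr_select(query, docs, k=2)
top_by_similarity = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))[:2]
print(picked, top_by_similarity)
```

In this toy data, plain similarity ranking keeps the near-duplicate, while the MMR sketch replaces it with the more diverse third document.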

## 3. Set up the LLM (OLMo-7B-Instruct)

In [10]:
# download the OLMo model
model_path = download_olmo_model()

Model already exists at /home/mambauser/.cache/ssec_tutorials/OLMo-7B-Instruct-Q4_K_M.gguf


In [11]:
# set up the LangChain llama.cpp model object: we are using the `OLMo-7B-Instruct` model.
# llama-cpp-python is the Python binding for the llama.cpp C++ library, as mentioned in previous modules.
olmo = LlamaCpp(
    model_path=str(model_path),  # the path to the OLMo model in GGUF file format
    callbacks=[StreamingStdOutCallbackHandler()],  # stream generated tokens to stdout
    temperature=0.8,  # set the randomness of the model's output
    n_ctx=4096,  # size of the context window in tokens (prompt plus generated text)
    max_tokens=512,  # set limit for the length of the generated text
    verbose=False,  # determines whether the model should print out debug information
    echo=False,  # determines whether the input prompt should be included in the output
)
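
A rough way to reason about these two limits, assuming the context window is shared by the prompt and the generated tokens: reserving room for a full-length answer leaves the remainder for the prompt (instructions, retrieved context, and question). The `fits` helper is hypothetical and token counts would come from the model's tokenizer, which is not shown here.

```python
n_ctx = 4096       # context window size used above
max_tokens = 512   # generation limit used above

# tokens left for the prompt if a full-length answer must still fit
prompt_budget = n_ctx - max_tokens


def fits(prompt_token_count):
    # hypothetical check: does a prompt of this size leave room for the answer?
    return prompt_token_count <= prompt_budget


print(prompt_budget, fits(3000), fits(4000))
```

With `k=2` retrieved chunks the prompt is usually small, but long forum questions plus long chunks can approach this budget.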

## 4. RAG pipeline

In [12]:
# Define a standalone RAG helper class (the chain in the cells below does not use it)
class RAG:
    # Retrieve relevant text from the Qdrant vector store
    @instrument
    def retrieve(self, query: str) -> list:
        docs = retriever.invoke(query)
        return [doc.page_content for doc in docs]  # return them as a list of text chunks

rag = RAG()

In [13]:
# Define a prompt template
custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="You are an astrophysics expert. "
             "Please answer the question on astrophysics based on the following context:\n\n"
             "Context: {context}\n"
             "Question: {question}\n"
)
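
To check how the placeholders are filled, the sketch below renders a template of the same shape with plain `str.format`, which uses the same `{name}` placeholder style. The context and question strings are made-up examples.

```python
template = (
    "You are an astrophysics expert. "
    "Please answer the question on astrophysics based on the following context:\n\n"
    "Context: {context}\n"
    "Question: {question}\n"
)

# fill both placeholders with example values (assumptions for illustration)
prompt_text = template.format(
    context="(retrieved text chunks go here)",
    question="What is the Rubin Observatory?",
)
print(prompt_text)
```

Note the trailing space after the first sentence: adjacent string literals are concatenated as-is, so without it the two sentences run together.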

In [14]:
# RAG chain definition
rag_chain = (
    {
        "context": retriever | format_docs,  # retrieve and format the context
        "question": RunnablePassthrough() # Pass the user’s question directly
    }
    | custom_prompt
    | olmo 
    | StrOutputParser()  # Parse and extract only the final model output
)
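
The data flow of this chain can be sketched with ordinary functions. Every function below is a hypothetical stand-in (no LangChain, no model); the sketch only shows the order of the steps: the dict fans the question out to two branches, the prompt is filled, the model is called, and the result is returned as a string.

```python
def fake_retriever(question):
    # hypothetical stand-in for `retriever`: always returns two text chunks
    return ["Chunk A about Rubin.", "Chunk B about LSST."]


def join_chunks(chunks):
    # plays the role of format_docs
    return "\n\n".join(chunks)


def build_prompt(inputs):
    # plays the role of custom_prompt: fill both placeholders
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\n"


def fake_llm(prompt):
    # hypothetical stand-in for the OLMo model
    return f"(stub answer for a {len(prompt)}-character prompt)"


def run_chain(question):
    # step 1: both branches receive the same input question
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,  # passthrough: forwarded unchanged
    }
    # steps 2-4: prompt -> model -> plain string
    return str(fake_llm(build_prompt(inputs)))


answer = run_chain("What is LSST?")
print(answer)
```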

## 5. TruLens Feedback Evaluation pipeline

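I will initialize a feedback provider backed by OpenAI. TruLens uses the provider's language model as a judge: each feedback score below is based on an OpenAI-generated judgment of the response. The provider needs an OpenAI API key (see the commented-out cell in section 1).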

In [None]:
# initialize provider class
provider = OpenAI()

In [16]:
# select context to be used in feedback
context = TruChain.select_context(rag_chain)

### 5.1 Feedback Functions

In [17]:
# define a groundedness feedback function
# that checks if the answer is factually supported by retrieved documents.
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(context.collect())  # collect context chunks into a list
    .on_output()
)

✅ In Groundedness, input source will be set to __record__.app.first.steps__.context.first.invoke.rets[:].page_content.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


In [18]:
# define a question/answer relevance function
# that evaluates how relevant the RAG's answer is to the question
f_answer_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .


In [19]:
# define a context relevance feedback function 
# evaluates how relevant the context is to the question
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.first.steps__.context.first.invoke.rets[:].page_content .
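
As I read the selector output above (`rets[:].page_content`), the context relevance function is applied once per retrieved chunk, and `.aggregate(np.mean)` then averages those per-chunk scores into one value per record. The sketch below mimics that with a made-up scorer; `stub_relevance` is a hypothetical word-overlap measure, not what the OpenAI provider computes.

```python
from statistics import mean


def stub_relevance(question, chunk):
    # hypothetical scorer: fraction of question words that appear in the chunk
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)


question = "forced photometry cadence"
chunks = [
    "forced photometry is run nightly",
    "the butler stores datasets",
]

# one score per retrieved chunk, then a single aggregated score
per_chunk = [stub_relevance(question, chunk) for chunk in chunks]
context_relevance = mean(per_chunk)
print(per_chunk, context_relevance)
```

One consequence of averaging is that a single irrelevant chunk pulls the score down even when the other chunk is a good match.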


### 5.2 Set up a TruLens evaluation recorder
Next, we will create a TruLens evaluation recorder that monitors and logs the performance of our RAG system.

In [20]:
tru_recorder = TruChain(
    rag_chain,
    app_name="RubinChat",
    app_version="v1",
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness],
)

instrumenting <class 'langchain_core.runnables.base.RunnableParallel'> for base <class 'langchain_core.runnables.base.RunnableParallel'>
	instrumenting invoke
	instrumenting ainvoke
	instrumenting stream
	instrumenting astream
instrumenting <class 'langchain_core.runnables.base.RunnableParallel'> for base <class 'langchain_core.runnables.base.RunnableSerializable'>
	instrumenting invoke
	instrumenting ainvoke
	instrumenting stream
	instrumenting astream
instrumenting <class 'langchain_core.runnables.base.RunnableParallel'> for base <class 'langchain_core.load.serializable.Serializable'>
instrumenting <class 'langchain_core.vectorstores.base.VectorStoreRetriever'> for base <class 'langchain_core.vectorstores.base.VectorStoreRetriever'>
	instrumenting invoke
	instrumenting ainvoke
	instrumenting stream
	instrumenting astream
	instrumenting _get_relevant_documents
	instrumenting get_relevant_documents
	instrumenting aget_relevant_documents
	instrumenting _aget_relevant_documents
instrumen

## 6. Test and Evaluate RAG Responses
### 6.1 Load questions from the LSST community forum dataset


In [21]:
# load the CSV file
lsst_forum_responses_path = "data/input/lsst_forum_responses.csv"
qa_df = pd.read_csv(lsst_forum_responses_path)
print(qa_df.shape)
# print(qa_df.columns)
qa_df.head()

(20663, 13)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,55,Lasair watchlist with large search radius vers...,2517,"Hi, I want to define Lasair filters for a give...",3/7/25,Thanks Roy. \n I’m experimenting with filters ...,3/10/25,,,False,False,False,False
1,55,Lasair watchlist with large search radius vers...,2517,"Hi, I want to define Lasair filters for a give...",3/7/25,I think the maximum radius is 1000 arc seconds...,3/10/25,,,False,False,False,False
2,55,Lasair watchlist with large search radius vers...,2517,"Hi, I want to define Lasair filters for a give...",3/7/25,"Thanks Roy, \n it looks like the maximum defau...",3/10/25,,,False,False,False,False
3,55,Lasair watchlist with large search radius vers...,2517,"Hi, I want to define Lasair filters for a give...",3/7/25,Hi Ismael \nBoth watchlist and watchmap are im...,3/8/25,,,False,False,False,True
4,10,Science Pipeline Release 29.0.0 Status and Dis...,1185,We are starting the science pipelines release ...,3/5/25,Release candidate v29_0_0_rc1 is now available...,3/5/25,LSSTDM,LSSTDM,False,False,False,False


In [22]:
# I don't need all the rows, only the ones whose answer was accepted
accepted_qa_df = qa_df[qa_df["is_accepted_answer"] == True]

# accepted_qa_df = accepted_qa_df[accepted_qa_df["category_id"] == 23]
accepted_qa_df.reset_index(drop=True, inplace=True)

print(accepted_qa_df.shape)
accepted_qa_df.head()

(513, 13)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,55,Lasair watchlist with large search radius vers...,2517,"Hi, I want to define Lasair filters for a give...",3/7/25,Hi Ismael \nBoth watchlist and watchmap are im...,3/8/25,,,False,False,False,True
1,6,Where to find the Vera C. Rubin Observatory na...,440,There used to be a the Vera C. Rubin Observat...,3/1/25,"Hi Meg, thanks for this. \n I can confirm this...",3/1/25,LSSTDM,CST,True,True,True,True
2,6,Rubin Data Rights for Rubin source IDs and pla...,2316,I am currently writing a proposal for a NASA R...,2/25/25,Hi @kkruszynska – thanks for this question. ...,2/25/25,LSSTDM,CST,True,False,True,True
3,6,RubinObservatory.org site issue with Chrome on...,229,"Hello, Rubin website team- \n This is to repor...",2/6/25,"Hi @TomLoredo , thanks for checking out the ...",2/6/25,LSSTDM,CST,True,True,True,True
4,60,UK IDAC offline for maintenance during 4th--5t...,245,Apologies for short notice. Due to planned mai...,2/4/25,The UK IDAC is now back online. Apologies for ...,2/6/25,,,True,False,True,True


In [23]:
# that is still a lot of rows; for testing purposes, I will sample 5 of them
random_seed = 42

sampled_accepted_qa_df = accepted_qa_df.sample(n=5, random_state=random_seed)
sampled_accepted_qa_df.reset_index(drop=True, inplace=True)

print(sampled_accepted_qa_df.shape)
sampled_accepted_qa_df.head()

(5, 13)


Unnamed: 0,category_id,question_header,question_author_id,question,question_date,answer,answer_date,community_role,community_visual_badge,is_moderator,is_admin,is_staff,is_accepted_answer
0,6,Single_frame task produces different results e...,443,"Hi, \nI’m following this tutorial: The LSST S...",8/11/22,Quick comment on the code: \n \n \n \n petarz...,8/20/22,LSSTDM,LSSTDM,False,False,False,True
1,6,How to make a C++ object iterable in Python,14,I have the following C++ class : \n class CcdI...,9/28/15,After several iteration with @ktl and @rowe...,10/1/15,LSSTDM,LSSTDM,False,False,False,True
2,34,For how long will forced photometry be run on ...,381,Question on how forced photometry will be run ...,2/18/20,I take this to mean that a DIASource which is ...,2/19/20,LSSTDM,LSSTDM,False,False,False,True
3,49,Listing available repos,247,"Hi there, \n Is there some way I find out what...",1/24/24,Hi James \nmaybe \n dafButler.Butler.get_known...,1/24/24,,,False,False,False,True
4,6,Issue building FFTW 3.3.3 in LSST Stack with t...,73,I’m having trouble building FFTW with texinfo ...,9/8/15,This has now been fixed and 3.3.4 is the curre...,9/10/15,LSSTDM,LSSTDM,False,False,False,True


### 6.2 Test the RAG pipeline

In [25]:
# run the RAG pipeline on all the sampled Q&A pairs
responses = []

for _, row in sampled_accepted_qa_df.iterrows():
    question = row["question"]
    true_answer = row["answer"]

    print("\n\nQuestion: ", question)
    # retrieve the context separately so it can be stored with the results
    # (named retrieved_docs so it does not shadow the TruLens `context` selector)
    retrieved_docs = retriever.invoke(question)
    # run the RAG pipeline with TruLens recording
    with tru_recorder as recording:
        rag_response = rag_chain.invoke(question)

    # store the results in a dict and append it to the list
    responses.append({
        "question": question,
        "context": format_docs(retrieved_docs),
        "true_answer": true_answer,
        "RAG_generated_answer": rag_response
    })

# convert the responses into a DataFrame
responses_df = pd.DataFrame(responses)



Question:  Hi, 
I’m following this tutorial:  The LSST Science Pipelines — LSST Science Pipelines  and I’ve ran the first step “single_frame” task a few times. Each time it runs it produces different results: if I go through all calexps in the output collection ( butler.registry.queryDatasets("calexp", collections=collection) ), and look at their sky coverage (calexp width, height and WCS mapping), and then find the total coverage of the whole collection (max and min ra, dec coordinates), I get different results each time it runs. And I’m starting it like this (verbatim what’s in the tutorial): 
 pipetask run -b $RC2_SUBSET_DIR/SMALL_HSC/butler.yaml \
             -p $RC2_SUBSET_DIR/pipelines/DRP.yaml#singleFrame \
             -i HSC/RC2/defaults \
             -o u/$USER/single_frame \
             --register-dataset-types
 
 What could be the explanation for this behavior? 
 Thanks, 
Petar


\`\`\`go

Answer:

The behavior you observe is expected as the tutorial uses different input files each time it runs. This is done to ensure that the results produced by the pipeline are not affected by the order in which the input data is processed. The WCS mapping of the output images can also change based on the input coordinates and size, as these values are randomly chosen within certain boundaries for each execution of the task. Therefore, it's normal for the sky coverage to vary with each run.

It is important to note that the tutorial does not explicitly ensure that the results will always have a specific minimum sky coverage, but you can make an assumption based on the default input parameters used in the tutorial (e.g., sky coverage = 0.05% or 5%). If you need a more consistent set of output images for further analysis, consider running the task multiple times with different input data.

In summary, the tutorial's purpose is to demonstrate the flexibility and scalability of the 

In [26]:
# get records and feedback from TruLens
records, feedback = session.get_records_and_feedback()

# select required columns from the records df and merge with the ground truth df
records_selected = records[["input"] + feedback]
full_trulens_results_df = responses_df.merge(records_selected, 
                                             left_on="question", 
                                             right_on="input", 
                                             how="left")

full_trulens_results_df.drop(columns=["input"], inplace=True)
full_trulens_results_df = full_trulens_results_df[["question", "true_answer", "context", "RAG_generated_answer", "Answer Relevance", "Groundedness", "Context Relevance"]]
# full_trulens_results_df = full_trulens_results_df[full_trulens_results_df["app_id"] == ""]
full_trulens_results_df.rename(columns={"Answer Relevance": "trulens_Answer_Relevance", "Groundedness":"trulens_Groundedness", "Context Relevance":"trulens_Context_Relevance"}, inplace=True)
full_trulens_results_df.head()

Unnamed: 0,question,true_answer,context,RAG_generated_answer,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance
0,"Hi, \nI’m following this tutorial: The LSST S...",Quick comment on the code: \n \n \n \n petarz...,Draft\nLVV-P106: Data Management Acceptance Te...,```go\n\nAnswer:\n\nThe behavior you observe i...,0.0,0.0,0.5
1,I have the following C++ class : \n class CcdI...,After several iteration with @ktl and @rowe...,"In most cases, the SWIG files from the current...",\nAnswer:\nTo make the CcdImageList iterable i...,0.666667,0.0,0.5
2,Question on how forced photometry will be run ...,I take this to mean that a DIASource which is ...,DPDD | LSE-163 | Latest Revision 2023-07-10\n1...,\nAnswer: Forced photometry measurements with ...,1.0,0.555556,0.666667
3,"Hi there, \n Is there some way I find out what...",Hi James \nmaybe \n dafButler.Butler.get_known...,3 Overview\nThe Butler is implemented as thr...,\nAnswer:\nThe 'butler' object created in line...,1.0,0.5,0.5
4,I’m having trouble building FFTW with texinfo ...,This has now been fixed and 3.3.4 is the curre...,LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use ...,\nAnswer: The known issue with Texinfo and FFT...,,,


### 6.3 TruLens Evaluation

In [27]:
session.get_leaderboard()


app_name,app_version,Answer Relevance,Context Relevance,Groundedness,latency,total_cost
RubinChat,v1,0.666667,0.541667,0.263889,221.441845,0.0
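
The leaderboard values appear to be plain means of the per-record feedback scores. The sketch below recomputes them from the scores shown in the results table in section 6.2. The fifth record has empty score cells in that table, so it is left out here; that the leaderboard skips missing scores the same way is an assumption, although the numbers agree.

```python
from statistics import mean

# per-record scores copied from the results table (records 0-3)
answer_relevance = [0.0, 0.666667, 1.0, 1.0]
groundedness = [0.0, 0.0, 0.555556, 0.5]
context_relevance = [0.5, 0.5, 0.666667, 0.5]

leaderboard = {
    "Answer Relevance": round(mean(answer_relevance), 6),
    "Context Relevance": round(mean(context_relevance), 6),
    "Groundedness": round(mean(groundedness), 6),
}
print(leaderboard)
```

Read this way, the low Groundedness average is driven by the first two records, which both scored 0.0.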


In [28]:
# a more interactive UI
session.run_dashboard()

Starting dashboard ...


Dashboard started at http://localhost:55775 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## 7. Save the results

In [29]:
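timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
results_dir = Path("data/results")
results_dir.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
file_name = results_dir / f"trulens_results_{timestamp}.csv"
full_trulens_results_df.to_csv(file_name, index=False)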