# Run Evaluations on our RAG chatbot! 

<div align="center">
    <p style="text-align:left">
        <img alt="phoenix logo" src="https://repository-images.githubusercontent.com/564072810/f3666cdf-cb3e-4056-8a25-27cb3e6b5848" width="800"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</div>

## Let's get started! 

In [1]:
%pip install -qqq "arize-phoenix==11.21.0" "openai>=1" nest_asyncio

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from getpass import getpass
import phoenix as px
import nest_asyncio

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

# Allow nested event loops so the evals can run concurrently inside the notebook.
nest_asyncio.apply()



<img alt="Document Retrieval Evaluation Image" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-docs-images/documentRelevanceDiagram.png" width="1000"/>

In [3]:
from phoenix.session.evaluation import get_retrieved_documents
retrieved_documents_df = get_retrieved_documents(px.Client(), project_name="our-rag-project", timeout=None)
retrieved_documents_df

context.span_id,document_position,context.trace_id,input,reference
a11dcc2f37759fa2,0,2f4acf21a7d6e2492d809861978bbf71,course title,The purpose of this document is to capture fre...
a11dcc2f37759fa2,1,2f4acf21a7d6e2492d809861978bbf71,course title,"Yes, we will keep all the materials after the ..."
a11dcc2f37759fa2,2,2f4acf21a7d6e2492d809861978bbf71,course title,GitHub - DataTalksClub data-engineering-zoomca...
a11dcc2f37759fa2,3,2f4acf21a7d6e2492d809861978bbf71,course title,You can start by installing and setting up all...
a11dcc2f37759fa2,4,2f4acf21a7d6e2492d809861978bbf71,course title,"Yes, even if you don't register, you're still ..."
68a6f6eb226a7d51,0,e63e8853d1a83a9e6af773d79ea3f225,prerequisites,GitHub - DataTalksClub data-engineering-zoomca...
68a6f6eb226a7d51,1,e63e8853d1a83a9e6af773d79ea3f225,prerequisites,You can start by installing and setting up all...
68a6f6eb226a7d51,2,e63e8853d1a83a9e6af773d79ea3f225,prerequisites,Solution:\nCheck if you’re on the Developer Pl...
def40c892a1aa7e7,0,c2afc1c738bb824d43e684500bc25aa6,grading system leaderboard,When you set up your account you are automatic...
def40c892a1aa7e7,1,c2afc1c738bb824d43e684500bc25aa6,grading system leaderboard,After you submit your homework it will be grad...


In [4]:
from phoenix.session.evaluation import get_qa_with_reference

queries_df = get_qa_with_reference(px.Client(), project_name="our-rag-project", timeout=None)
queries_df

context.span_id,input,output,reference
953950d641a5a79b,what is the course title of this course,The course title is **Data Engineering Zoomcam...,The purpose of this document is to capture fre...
eac8b87bf408d7d1,What are the prerequisites for the course?,The prerequisites for the **Data Engineering Z...,GitHub - DataTalksClub data-engineering-zoomca...
74e5940a662d6a64,How does the grading system and leaderboard work?,The grading system for the **Data Engineering ...,When you set up your account you are automatic...
f1b85458d1c8739f,Who created the Data Engineering Zoomcamp and ...,I couldn't find specific information on who cr...,Copy the file found in the Java example: data-...
7819d69b24d70940,What version of Python will be officially requ...,The officially recommended version of Python f...,A generator is a function in python that retur...
9e9913ae3f4dd513,"Does the course cover AWS S3 in detail, or onl...",The information available did not specify deta...,The purpose of this document is to capture fre...
d5888d47c37c426d,Where can I find the Zoom link for weekly offi...,You can find the link to join the weekly offic...,The zoom link is only published to instructors...
5c47afa5a725ff1d,How do I request a refund if I can’t continue ...,The available information did not specify the ...,The purpose of this document is to capture fre...
4cbb4d9dfabd34dd,What is the expected weekly schedule of live l...,The details regarding the specific weekly sche...,We will probably have some calls during the Ca...
0486baa9a11c8c1d,Why does the course use GCP rather than other ...,The course primarily uses **Google Cloud Platf...,"For uniformity at least, but you’re not restri..."


In [5]:
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)

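# Use GPT-4 as the LLM judge behind all three evaluators.
eval_model = OpenAIModel(model="gpt-4")
# Judges whether each retrieved document is relevant to its query.
relevance_evaluator = RelevanceEvaluator(eval_model)
# Judges whether the answer is supported by the reference text.
hallucination_evaluator = HallucinationEvaluator(eval_model)
# Judges whether the answer correctly addresses the question.
qa_evaluator = QAEvaluator(eval_model)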

In [6]:
retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
retrieved_documents_relevance_df

run_evals |██████████| 48/48 (100.0%) | ⏳ 00:07<00:00 |  6.48it/s

context.span_id,document_position,label,score,explanation
a11dcc2f37759fa2,0,unrelated,0,The question asks for the 'course title'. Howe...
a11dcc2f37759fa2,1,unrelated,0,The question is asking for the title of a cour...
a11dcc2f37759fa2,2,unrelated,0,The question is asking for a 'course title'. T...
a11dcc2f37759fa2,3,unrelated,0,The question is asking for the 'course title'....
a11dcc2f37759fa2,4,unrelated,0,The question is asking for the title of a cour...
68a6f6eb226a7d51,0,unrelated,0,The question is asking for 'prerequisites' but...
68a6f6eb226a7d51,1,relevant,1,The question is asking for 'prerequisites'. Th...
68a6f6eb226a7d51,2,relevant,1,The question is 'prerequisites' which is vague...
def40c892a1aa7e7,0,unrelated,0,The question is asking about a grading system ...
def40c892a1aa7e7,1,relevant,1,The question asks about the grading system lea...
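
One way to read the relevance results is to aggregate the binary `score` column per retriever span. The sketch below is not part of the original notebook. It rebuilds only the ten rows visible in the output above into a hypothetical `sample_relevance_df`, so it runs without a Phoenix server, and then computes the fraction of relevant documents per span and whether each span retrieved at least one relevant document. The same `groupby` should carry over to `retrieved_documents_relevance_df`, assuming it keeps the `context.span_id` / `document_position` index shown above.

```python
import pandas as pd

# Rows copied from the visible part of the output above (span id, position, score).
# The real dataframe has more rows; this sample is for illustration only.
rows = [
    ("a11dcc2f37759fa2", 0, 0),
    ("a11dcc2f37759fa2", 1, 0),
    ("a11dcc2f37759fa2", 2, 0),
    ("a11dcc2f37759fa2", 3, 0),
    ("a11dcc2f37759fa2", 4, 0),
    ("68a6f6eb226a7d51", 0, 0),
    ("68a6f6eb226a7d51", 1, 1),
    ("68a6f6eb226a7d51", 2, 1),
    ("def40c892a1aa7e7", 0, 0),
    ("def40c892a1aa7e7", 1, 1),
]

# Hypothetical stand-in for retrieved_documents_relevance_df.
sample_relevance_df = pd.DataFrame(
    rows, columns=["context.span_id", "document_position", "score"]
).set_index(["context.span_id", "document_position"])

scores_by_span = sample_relevance_df.groupby(level="context.span_id")["score"]

# Fraction of retrieved documents judged relevant, per retriever span.
precision_per_span = scores_by_span.mean()

# Whether the retriever returned at least one relevant document for the query.
hit_per_span = scores_by_span.max().astype(bool)

print(precision_per_span)
print(hit_per_span)
```

In this sample, the "course title" span has no relevant documents at all, which points at the retriever (or the rewritten query) rather than the answer-generation step.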


In [7]:
hallucination_eval_df, qa_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_evaluator],
    provide_explanation=True,
    concurrency=20,
)
hallucination_eval_df

run_evals |██████████| 48/48 (100.0%) | ⏳ 00:38<00:00 |  1.25it/s


context.span_id,label,score,explanation
953950d641a5a79b,factual,0,The query asks for the title of the course. Th...
eac8b87bf408d7d1,factual,0,The answer is factual because it accurately li...
74e5940a662d6a64,factual,0,The answer provided is factual. It accurately ...
f1b85458d1c8739f,factual,0,The answer is factual because it accurately re...
7819d69b24d70940,hallucinated,1,The answer states that the officially recommen...
9e9913ae3f4dd513,hallucinated,1,The answer is hallucinated because the referen...
d5888d47c37c426d,factual,0,The answer is factual because it accurately re...
5c47afa5a725ff1d,factual,0,The answer is factual because it correctly sta...
4cbb4d9dfabd34dd,hallucinated,1,The answer is hallucinated because it introduc...
0486baa9a11c8c1d,factual,0,The answer is factual because it accurately re...
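
A quick summary of these labels is often more useful than scanning rows. The sketch below is an addition, not part of the original notebook. It copies only the ten labels visible in the output above into a hypothetical `sample_hallucination_df` and computes the share of answers flagged as hallucinated. Note that in this output a score of 1 means "hallucinated", so a lower mean is better. The same two lines should work on `hallucination_eval_df`, assuming it has the `label` and `score` columns shown.

```python
import pandas as pd

# Labels copied from the visible part of the output above, in row order.
# The real dataframe may contain more rows; this sample is for illustration only.
labels = [
    "factual", "factual", "factual", "factual", "hallucinated",
    "hallucinated", "factual", "factual", "hallucinated", "factual",
]

# Hypothetical stand-in for hallucination_eval_df.
sample_hallucination_df = pd.DataFrame({"label": labels})
sample_hallucination_df["score"] = (
    sample_hallucination_df["label"] == "hallucinated"
).astype(int)

# Share of answers per label.
label_share = sample_hallucination_df["label"].value_counts(normalize=True)

# Fraction of answers flagged as hallucinated (score 1 = hallucinated).
hallucination_rate = sample_hallucination_df["score"].mean()

print(label_share)
print(f"hallucination rate: {hallucination_rate:.0%}")
```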




In [8]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

# Attach the evaluation results to the traced spans and documents in Phoenix.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_eval_df),
    DocumentEvaluations(
        eval_name="Retrieval Relevance", dataframe=retrieved_documents_relevance_df
    ),
)