<a href="https://colab.research.google.com/github/sheikita/Moka/blob/master/Eval_Driven_Development_%E2%80%94_AI_Builders_Summit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="red" size="4"><b>This is a public notebook. Make a copy before running to keep your API keys safe.</b></font>

# Eval Driven Development

In this workshop we'll cover the basics of setting up an evaluation pipeline, then use it to make changes iteratively and measure how each change affects performance.


## Setup
First we need to get set up with Humanloop and OpenAI.

1. You'll need an OpenAI API key to get started. Go to https://platform.openai.com/api-keys and create a new API key. Save the key, as you'll need it both in this Colab and in Humanloop.


2. Humanloop is an LLM evals platform. Create a free account at https://app.humanloop.com: sign up and verify your email. Skip the initial onboarding and you'll be prompted to enter an OpenAI API key. Enter the key you created.

3. Generate a Humanloop API Key by going to Organisation Settings and then API keys on Humanloop.

4. Add your Humanloop and OpenAI API keys to this Google Colab by clicking the key icon in the left sidebar. The code below reads them from secrets named `humanloop-key` and `openai-key`.

Once you have Humanloop set up, you can clone the template project from the Humanloop Library.

5. Go to the Library in the top left of Humanloop, find the MedQA template, and clone it to the top level of your workspace.

In [None]:
# Install the requirements
!pip install humanloop chromadb openai pandas httpx==0.27.2 "protobuf<5.0.0" -q

This tutorial demonstrates how to take an existing RAG pipeline and use Humanloop to evaluate it. At the end of the tutorial you'll understand how to:

1. Run an Eval on your RAG pipeline.
2. Set up detailed logging with SDK decorators.
3. Log to Humanloop manually.

In this tutorial we'll first implement a simple RAG pipeline to do Q&A over medical documents without Humanloop. Then we'll add Humanloop and use it for evals. Our RAG system will have three parts:

* Dataset: A version of the MedQA dataset from Hugging Face.
* Retriever: Chroma as a simple local vector DB.
* Prompt: Managed in code, populated with the user's question and retrieved context.

__Note__: This is an abridged version of our [Evaluate a RAG app](https://humanloop.com/docs/v5/tutorials/rag-evaluation) tutorial. It focuses on evaluating the pipeline.


In [None]:
# Import the Requirements
import os
import json
import inspect

import chromadb
from openai import OpenAI
from humanloop import Humanloop
import pandas as pd

from google.colab import userdata
openai = OpenAI(api_key=userdata.get('openai-key'))
humanloop = Humanloop(api_key=userdata.get("humanloop-key"))

<h3> Prepare the vector database</h3>


In [None]:
!ls /content/textbooks.parquet || wget https://github.com/humanloop/humanloop-cookbook/raw/refs/heads/main/assets/sources/textbooks.parquet

/content/textbooks.parquet


In [None]:
# Use ChromaDB, a vector database, to store the content

chroma = chromadb.Client()
collection = chroma.get_or_create_collection(name="MedQA")
knowledge_base = pd.read_parquet("/content/textbooks.parquet")
knowledge_base = knowledge_base.sample(100, random_state=42)
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)

In [None]:
pd.read_parquet("/content/textbooks.parquet")

| | id | title | content | contents |
|---|---|---|---|---|
| 0 | Anatomy_Gray_0 | Anatomy_Gray | What is anatomy? Anatomy includes those struct... | Anatomy_Gray. What is anatomy? Anatomy include... |
| 1 | Anatomy_Gray_1 | Anatomy_Gray | Observation and visualization are the primary ... | Anatomy_Gray. Observation and visualization ar... |
| 2 | Anatomy_Gray_2 | Anatomy_Gray | How can gross anatomy be studied? The term ana... | Anatomy_Gray. How can gross anatomy be studied... |
| 3 | Anatomy_Gray_3 | Anatomy_Gray | This includes the vasculature, the nerves, the... | Anatomy_Gray. This includes the vasculature, t... |
| 4 | Anatomy_Gray_4 | Anatomy_Gray | Each of these approaches has benefits and defi... | Anatomy_Gray. Each of these approaches has ben... |
| ... | ... | ... | ... | ... |
| 125842 | Surgery_Schwartz_14344 | Surgery_Schwartz | feedback. However, the evidence base upon whic... | Surgery_Schwartz. feedback. However, the evide... |
| 125843 | Surgery_Schwartz_14345 | Surgery_Schwartz | College of Physicians Council of Associates; F... | Surgery_Schwartz. College of Physicians Counci... |
| 125844 | Surgery_Schwartz_14346 | Surgery_Schwartz | This review of 10 articles published between 2... | Surgery_Schwartz. This review of 10 articles p... |
| 125845 | Surgery_Schwartz_14347 | Surgery_Schwartz | a systematic review. J Surg Educ. 2015;72(6):1... | Surgery_Schwartz. a systematic review. J Surg ... |


# Create the RAG System

The RAG system has two components: a retriever that queries the Chroma collection we created, and a call to an LLM that uses the prompt template below.

In [None]:
MODEL = "gpt-4o"
TEMPERATURE = 0
TEMPLATE = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Make sure you rely only on the following retrieved information to answer your question.
Retrieved data:
{{retrieved_data}}

---

Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.
```
<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
```
""",
    }
]
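
The `{{...}}` placeholders in the template are filled in later by `humanloop.prompts.populate_template`. As a rough mental model only, the substitution can be sketched in plain Python. This is not Humanloop's implementation; `fill_placeholders` and the demo template are hypothetical names used for illustration.

```python
import re

# Hypothetical stand-in for template population; for illustration only.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(template: list[dict], inputs: dict) -> list[dict]:
    """Return a copy of the chat template with {{name}} replaced by inputs[name]."""

    def substitute(text: str) -> str:
        # Unknown placeholders are left untouched so missing inputs are easy to spot.
        return PLACEHOLDER.sub(
            lambda match: str(inputs.get(match.group(1), match.group(0))), text
        )

    return [{**message, "content": substitute(message["content"])} for message in template]


demo_template = [{"role": "system", "content": "Question: {{question}}\n- {{option_A}}"}]
demo_messages = fill_placeholders(
    demo_template, {"question": "What is anatomy?", "option_A": "A science"}
)
print(demo_messages[0]["content"])
```

Leaving unmatched placeholders in place is a design choice for this sketch: a stray `{{retrieved_data}}` in the final prompt is an obvious sign that an input was never supplied.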

In [None]:
# Retriever
def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=10)
    # Results are ordered by distance, so index 0 is the closest match.
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


# Generator
def call_model(**inputs):
    # Populate the Prompt template
    messages = humanloop.prompts.populate_template(TEMPLATE, inputs)

    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        presence_penalty=0,
        frequency_penalty=0,
        messages=messages,
    )
    return chat_completion.choices[0].message.content

# Full RAG pipeline
def ask_question(**inputs) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""

    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Call LLM
    return call_model(**inputs)
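
The retriever indexes `response["documents"][0]` because `collection.query` returns one list of documents per query text. The sketch below uses a hand-written dictionary in that shape (the documents are made up, not real retrieval output) and a hypothetical `top_k_context` helper to show how the first `k` results could be joined into one context string, should you later want to pass more than one document to the model.

```python
def top_k_context(response: dict, k: int = 3, separator: str = "\n\n") -> str:
    """Join the first k documents returned for the first query into one string."""
    documents = response["documents"][0]  # results for the first (and only) query text
    return separator.join(documents[:k])


# Hand-written example in the shape of a query response; not real retrieval output.
fake_response = {
    "ids": [["doc_a", "doc_b", "doc_c"]],
    "documents": [["First chunk.", "Second chunk.", "Third chunk."]],
}
print(top_k_context(fake_response, k=2))
```

Whether more context helps is exactly the kind of question an eval run can answer: change `k`, rerun the evaluation, and compare.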

In [None]:
# Example Question
print(ask_question(
    question="A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?",
    option_A="Disclose the error to the patient but leave it out of the operative report",
    option_B="Disclose the error to the patient and put it in the operative report",
    option_C="Tell the attending that he cannot fail to disclose this mistake",
    option_D="Report the physician to the ethics committee",
    option_E="Refuse to dictate the operative report",
))
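
The prompt asks for three sections separated by `---`, so the raw completion can be split back into its parts when you want to inspect or score only the chosen option. A minimal sketch, assuming the model follows the requested format; `parse_answer` and the sample completion are hypothetical and not part of the Humanloop SDK.

```python
import re

# A separator is a line containing only "---", optionally padded with spaces or tabs.
SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", flags=re.MULTILINE)


def parse_answer(completion: str) -> dict:
    """Split a completion into chosen option, explanation, and quoted evidence."""
    # Drop a surrounding code fence in case the model copies it from the prompt.
    text = completion.strip().strip("`").strip()
    parts = [part.strip() for part in SEPARATOR.split(text)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 sections, got {len(parts)}")
    option, explanation, evidence = parts
    return {"option": option, "explanation": explanation, "evidence": evidence}


# Made-up completion used only to exercise the parser.
sample = (
    "Tell the attending that he cannot fail to disclose this mistake\n"
    "---\n"
    "Errors must be disclosed.\n"
    "---\n"
    "Some quoted line."
)
print(parse_answer(sample)["option"])
```

Raising on a malformed completion, rather than guessing, makes format failures visible when you review logs.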


# Make it frictionless to look at your data

To do eval-driven development, we need to be able to inspect all of our data easily. For each step of the pipeline we need to know exactly what the model saw, which tools were called, and what the retriever fed to the model.

To enable this observability, we can add logging with Humanloop to our pipeline:


In [None]:
# The Humanloop decorator logs the function's inputs and outputs to a folder on Humanloop each time it is called
@humanloop.tool(path="Medical QA/Retrieval")
def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=10)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc

# The Humanloop decorator logs the function's inputs and outputs each time it is called
@humanloop.prompt(path="Medical QA/Call LLM")
def call_model(**inputs):
    # Populate the Prompt template
    messages = humanloop.prompts.populate_template(TEMPLATE, inputs)

    # Call OpenAI to get response
    chat_completion = openai.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        presence_penalty=0,
        frequency_penalty=0,
        messages=messages,
    )
    return chat_completion.choices[0].message.content


# The attributes field acts as metadata so that Humanloop knows when we change part of our pipeline
@humanloop.flow(
    path="Medical QA/Pipeline",
    attributes={
        "prompt": {
            "template": TEMPLATE,
            "model_name": MODEL,
            "temperature": TEMPERATURE,
        },
        "tool": {
            "name": "retrieval_tool_v3",
            "description": "Retrieval tool for MedQA.",
            "source_code": inspect.getsource(retrieval_tool),
        },
    },
)
def ask_question(**inputs) -> str:
    """Ask a question and get an answer using a simple RAG pipeline"""

    # Retrieve context
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    # Call LLM
    return call_model(**inputs)
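
Conceptually, a logging decorator wraps a function, calls it, and records what went in and what came out. The sketch below illustrates that idea with the standard library only. It is not how the Humanloop SDK is implemented, and `log_calls`, `LOGS`, and `shout` are hypothetical names; the real decorators send logs to Humanloop rather than to an in-memory list.

```python
import functools

LOGS: list[dict] = []  # in-memory stand-in for a remote log store


def log_calls(path: str):
    """Decorator factory: record the inputs and output of each call under `path`."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            output = func(*args, **kwargs)
            LOGS.append({"path": path, "args": args, "kwargs": kwargs, "output": output})
            return output

        return wrapper

    return decorator


@log_calls(path="Demo/Echo")
def shout(text: str) -> str:
    """Toy function standing in for a pipeline step."""
    return text.upper()


shout("hello")
print(LOGS)
```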

In [None]:
humanloop.evaluations.run(
    name="RAG Pipeline Evaluation",
    file={
        "path": "Medical QA/Pipeline",
        "callable": ask_question,
    },
    dataset={
        "path": "Medical QA/MedQA",
    },
    evaluators=[
        {"path": "Medical QA/Levenshtein Distance"},
        {"path": "Medical QA/Exact Match"}
    ],
)
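
The two evaluators referenced in this run come with the cloned template and execute on Humanloop. To give a feel for what such code evaluators compute, here is a sketch of exact match and Levenshtein (edit) distance in plain Python. The function names are hypothetical, and the template's actual evaluator code may differ, for example in how it normalizes text or which part of the output it compares against the target.

```python
def exact_match(output: str, target: str) -> bool:
    """True when output and target are identical after trimming whitespace."""
    return output.strip() == target.strip()


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep when equal
                )
            )
        previous = current
    return previous[-1]


print(exact_match("Option B ", "Option B"))
print(levenshtein("kitten", "sitting"))
```

Exact match is strict and easy to read; edit distance gives partial credit when the answer is close but not identical, which is useful while the output format is still unstable.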

In [None]:
humanloop.evaluations.run(
    name="RAG Pipeline Evaluation 2",
    file={
        "path": "Medical QA/Pipeline",
        "callable": ask_question,
    },
    dataset={
        "path": "Medical QA/MedQA",
    },
    evaluators=[
        {"path": "Medical QA/Levenshtein Distance"},
        {"path": "Medical QA/Exact Match"},
        {"path": "Medical QA/Context Relevancy"}
    ],
)