This example shows how to integrate eyeball-plus-plus into a simple task that answers questions based on a given context.

In [None]:
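# openai is pinned below 1.0 because this example calls openai.ChatCompletion, which openai 1.0 removed.
%pip install eyeball_pp "openai<1"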

Define your LLM task

In [6]:
import eyeball_pp
import openai

openai.api_key = "YOUR_API_KEY_HERE"

# A sample_rate of 1 records every call to the ask function.
# In production you may want a lower value, e.g. 0.1 to record only 10% of the calls.
eyeball_pp.set_config(sample_rate=1)

@eyeball_pp.record_task(args_to_record=["context", "question"])
def ask(context: str, question: str) -> str:
    # You can write arbitrary code here. The eval framework only cares about
    # the inputs and output of this function: the `context` and `question`
    # arguments are recorded as inputs, and the return value as the output.

    system = """
    You are trying to answer a question strictly using the information provided in the context. Reply I don't know if you don't know the answer.
    """

    prompt = f"""
    Context: {context}
    Question: {question}
    """

    # Eval params can be set when you evaluate this agent with different
    # parameters, e.g. different models, providers, or hyperparameters such as temperature.
    model = eyeball_pp.get_eval_param("model") or "gpt-3.5-turbo"
    temperature = eyeball_pp.get_eval_param("temperature") or 0.5

    # Any LLM can be used here. This example uses the OpenAI API, but you
    # could use Anthropic Claude or an open source LLM instead.
    response = openai.ChatCompletion.create(  # type: ignore
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    # The answer text lives in the first choice's message.
    output = response["choices"][0]["message"]["content"]  # type: ignore
    return output
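
A note on the `or` fallback used for `model` and `temperature` above: in Python, `or` falls through on any falsy value, not only on `None`. The sketch below illustrates this with a hypothetical stand-in, `get_param`, which is not part of eyeball_pp and is assumed to return `None` when no override is set. If a temperature of `0` is a value you want to evaluate, an explicit `None` check may be safer.

```python
from typing import Dict, Optional

# Hypothetical stand-in for an eval-param lookup; NOT the eyeball_pp API.
_params: Dict[str, float] = {"temperature": 0.0}


def get_param(name: str) -> Optional[float]:
    """Return the override for `name`, or None when no override is set."""
    return _params.get(name)


# `or` treats 0.0 as "missing" and silently falls back to the default.
with_or = get_param("temperature") or 0.5

# An explicit None check keeps a legitimate 0.0 override.
override = get_param("temperature")
with_none_check = override if override is not None else 0.5

print(with_or)          # 0.5
print(with_none_check)  # 0.0
```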

In [7]:
# Run the task with a few different inputs
answer1 = ask(context="The quick brown fox jumps over the lazy dog",
    question="What color is the fox?",
)
print(answer1)

answer2 = ask(context="The lazy dog which is not brown jumps over the quick brown fox",
    question="What color is the dog?",
)
print(answer2)

I don't know.
I don't know.
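
With `sample_rate=1`, both calls above are recorded. To make the idea concrete, here is a minimal sketch of how a recording decorator with sampling could work. It is an illustration only, not how eyeball_pp is implemented: `record_calls`, `RECORDS`, and `echo_task` are hypothetical names, and the records are kept in memory rather than persisted.

```python
import functools
import inspect
import random
from typing import Any, Callable, Dict, List

# Hypothetical in-memory store; a real tool would persist records somewhere.
RECORDS: List[Dict[str, Any]] = []


def record_calls(args_to_record: List[str], sample_rate: float = 1.0) -> Callable:
    """Hypothetical decorator: record the named arguments and the return
    value for a sampled fraction of calls."""

    def decorator(fn: Callable) -> Callable:
        signature = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            output = fn(*args, **kwargs)
            # random.random() is in [0, 1), so sample_rate=1 records every
            # call and sample_rate=0 records none.
            if random.random() < sample_rate:
                bound = signature.bind(*args, **kwargs)
                bound.apply_defaults()
                inputs = {name: bound.arguments[name] for name in args_to_record}
                RECORDS.append(
                    {"task": fn.__name__, "inputs": inputs, "output": output}
                )
            return output

        return wrapper

    return decorator


@record_calls(args_to_record=["context", "question"], sample_rate=1.0)
def echo_task(context: str, question: str) -> str:
    # Stand-in for an LLM call so the sketch runs offline.
    return f"{question} -> {context}"


echo_task(context="The quick brown fox", question="What color is the fox?")
echo_task("The lazy dog", "What color is the dog?")
print(len(RECORDS))  # 2
```

Binding the arguments through the function signature lets the sketch record inputs by name whether they were passed positionally or as keywords.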


In [8]:
# Rerun recorded examples with different eval params
for input_vars in eyeball_pp.rerun_recorded_examples(
    {"model": "gpt-4", "temperature": 0.7}
):
    ask(input_vars["context"], input_vars["question"])


Will rerun 2 inputs for task:`ask`

Rerunning input #0:
context="The quick brown fox jumps over the lazy dog"
question="What color is the fox?"

Using eval params: {'model': 'gpt-4', 'temperature': 0.7}

Rerunning input #1:
context="The lazy dog which is not brown jumps over the quick brown fox"
question="What color is the dog?"

Using eval params: {'model': 'gpt-4', 'temperature': 0.7}
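
The rerun loop works because the task reads its parameters through a lookup instead of hard-coding them. One way such a mechanism could be built is sketched below, purely as an assumption about the general pattern and not as eyeball_pp's actual implementation: a generator activates the eval params while it yields recorded inputs, and the task picks them up on each call. All names here (`rerun_examples`, `get_param`, `describe`) are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Optional

# Hypothetical module-level state: the params active for the current rerun.
_active_params: Optional[Dict[str, Any]] = None


def get_param(name: str) -> Optional[Any]:
    """Return the active override for `name`, or None outside a rerun."""
    if _active_params is None:
        return None
    return _active_params.get(name)


def rerun_examples(
    recorded_inputs: List[Dict[str, Any]], params: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Yield each recorded input while `params` are active."""
    global _active_params
    _active_params = params
    try:
        for inputs in recorded_inputs:
            yield inputs
    finally:
        # Clear the overrides once the loop finishes or is abandoned.
        _active_params = None


def describe(context: str) -> str:
    # Same fallback pattern as the task above: override first, then default.
    model = get_param("model") or "default-model"
    return f"{model}: {context}"


recorded = [{"context": "fox"}, {"context": "dog"}]
outputs = [
    describe(item["context"])
    for item in rerun_examples(recorded, {"model": "other-model"})
]
print(outputs)          # ['other-model: fox', 'other-model: dog']
print(describe("cat"))  # default-model: cat
```

Outside the loop the lookup returns `None`, so the task falls back to its defaults, which matches how `ask` behaves when it is called directly.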


In [9]:
# Compare the recorded checkpoints using an objective
eyeball_pp.compare_recorded_checkpoints(
    task_objective="This agent tries to answer questions given a context. Verify that the agent answers the question correctly and that the answer is only based on the context."
)

Comparing 2 inputs for task:`ask`

Input #0 - Running 2 comparison(s)
[improvement] task output improved from checkpoint 2023-07-28T06:42:12.770845  to 2023-07-28T06:42:14.798089 (model=gpt-4, temperature=0.7)
[improvement] task output improved from checkpoint 2023-07-28T06:25:17.238118 (model=gpt-4, temperature=0.7) to 2023-07-28T06:42:12.770845

Input #1 - Running 2 comparison(s)
[neutral] task output is the same between checkpoints 2023-07-28T06:42:10.440686  & 2023-07-28T06:42:13.841368 (model=gpt-4, temperature=0.7)
[improvement] task output improved from checkpoint 2023-07-28T06:25:16.603537 (model=gpt-4, temperature=0.7) to 2023-07-28T06:42:10.440686

Summary:
---------
Your most sucessful re-runs:
2023-07-28T06:42:13.841295: 2/2 successes

Your most sucessful params:
default: 3/4 successes
model=gpt-4,temperature=0.7: 2/4 successes
