# Eyeball Simulation Demo
This notebook shows how `eyeball_pp` can be used during development. It walks through eyeball's recording, rerunning, and evaluation systems in three scenarios.

### 0. Setup
Let's install the required modules and set up eyeball and openai.

In [None]:
%pip install eyeball_pp openai pyyaml rich

In [1]:
import eyeball_pp
import openai

openai.api_key = "your-openai-key"

# A sample_rate of 1 means that every call to the ask function is recorded.
# In production you might lower this, e.g. to 0.1, to record only about 10% of the calls.
eyeball_pp.set_config(sample_rate=1)

def _execute_completion(system_msg: str, prompt: str, model: str) -> str:
    """Convenience method for executing a chat completion.

    Uses the pre-1.0 `openai.ChatCompletion` interface, so it requires `openai<1`.
    """
    return openai.ChatCompletion.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt},
        ],
    )["choices"][0]["message"]["content"]
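
To make the `sample_rate` setting concrete, here is a minimal sketch of probabilistic sampling. It is not eyeball_pp's implementation; `should_record` is a hypothetical helper that only illustrates what recording roughly 10% of calls means.

```python
import random


def should_record(sample_rate: float, rng: random.Random) -> bool:
    """Hypothetical helper: decide whether a single call gets recorded."""
    # rng.random() is uniform on [0.0, 1.0), so a rate of 1 always records
    # and a rate of 0 never does.
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded so the sketch is reproducible
recorded = sum(should_record(0.1, rng) for _ in range(10_000))
print(f"Recorded {recorded} of 10000 calls")
```

With a rate of 0.1 the count lands near 1,000 rather than exactly on it, which is worth keeping in mind when estimating how many production calls will be available for later reruns.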

### 1. First Run: 3 Failures
This run records the inputs for the first time. All 3 answers fail evaluation.

In [2]:
@eyeball_pp.record_task(args_to_record=["context", "question"])
def ask(context: str, question: str) -> str:

    # Eval params can be set when you want to evaluate this agent with
    # different parameters, e.g. different models, providers, or hyperparameters like temperature.
    model = eyeball_pp.get_eval_param("model") or "gpt-3.5-turbo"
    
    system = """
    You are trying to answer a question strictly using the information provided in the context. Reply "I don't know" if you don't know the answer.
    """

    prompt = f"""
    Context: {context}
    Question: {question}
    """

    return _execute_completion(system, prompt, model)
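
As a rough mental model of what a recording decorator does, the sketch below captures selected arguments and the return value of each call in an in-memory list. It is an illustration under stated assumptions, not eyeball_pp's code: `record_calls`, `RECORDED`, and `echo_fox` are hypothetical names, and how the real library stores its recordings is not covered here.

```python
import functools
import inspect
from typing import Any, Callable

# Hypothetical in-memory store, used only for this sketch.
RECORDED: list[dict[str, Any]] = []


def record_calls(args_to_record: list[str]) -> Callable:
    """Hypothetical decorator factory: record the named arguments and the output."""

    def decorator(func: Callable) -> Callable:
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Bind positional and keyword arguments to parameter names so
            # they can be looked up by name regardless of how they were passed.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            output = func(*args, **kwargs)
            RECORDED.append(
                {
                    "task": func.__name__,
                    "inputs": {name: bound.arguments[name] for name in args_to_record},
                    "output": output,
                }
            )
            return output

        return wrapper

    return decorator


@record_calls(args_to_record=["context", "question"])
def echo_fox(context: str, question: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return "brown" if "brown fox" in context else "I don't know"


echo_fox("The quick brown fox jumps over the lazy dog", question="What color is the fox?")
print(RECORDED[0]["inputs"]["question"])
```

Recording inputs by parameter name is what makes a later rerun possible: the stored dictionary can be passed straight back to the function as keyword arguments.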


In [3]:
sample_inputs = [
    {
        "context": "The quick brown fox jumps over the lazy dog", 
        "question": "What color is the fox?"
    }, {
        "context": "The lazy green dog jumps over the quick brown fox", 
        "question": "What color is the dog?"
    }, {
        "context": "Peter Piper picked a peck of pickled peppers", 
        "question": "How would you describe the peppers?"
    }
]

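def _print_answers(sample_inputs):
    """Run `ask` on each input and print the context, question, and answer."""
    # `sample` avoids shadowing the built-in `input`.
    for sample in sample_inputs:
        answer = ask(sample["context"], sample["question"])
        print(f"Context  : {sample['context']}")
        print(f"Question : {sample['question']}")
        print(f"Answer   : {answer}\n")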

In [4]:
_print_answers(sample_inputs)

Context  : The quick brown fox jumps over the lazy dog
Question : What color is the fox?
Answer   : I don't know.

Context  : The lazy green dog jumps over the quick brown fox
Question : What color is the dog?
Answer   : I don't know.

Context  : Peter Piper picked a peck of pickled peppers
Question : How would you describe the peppers?
Answer   : I don't know.



In [5]:
from eyeball_pp import Criteria

eyeball_pp.evaluate_system(
    task_objective="This agent tries to answer questions given a context. Verify that the agent answers the question correctly and that the answer is only based on the context.",
    grading_criteria=[Criteria.CORRECTNESS, Criteria.RELEVANCE]
)

Comparing 3 inputs for task:`ask`

Input #0 - Scoring 1 checkpoints
Scored 2023-08-09T02:08:21.924439: 0.00 ({
  "evaluations": [
    {
      "name": "Criteria.CORRECTNESS",
      "rating": "No",
      "reason": "The response is incorrect. The color of the dog is clearly stated in the context as 'green'."
    },
    {
      "name": "Criteria.RELEVANCE",
      "rating": "No",
      "reason": "The response is not relevant to the context. The color of the dog is clearly stated in the context, but the agent responded with 'I don't know.'"
    }
  ]
})

Input #1 - Scoring 1 checkpoints
Scored 2023-08-09T02:08:21.436529: 0.00 ({
  "evaluations": [
    {
      "name": "Criteria.CORRECTNESS",
      "rating": "No",
      "reason": "The response is incorrect. The color of the fox is clearly stated in the context as 'brown'."
    },
    {
      "name": "Criteria.RELEVANCE",
      "rating": "No",
      "reason": "The response is not relevant to the context. The color of the fox is mentioned in the

### 2. Second Run: 2 Successes, 1 Failure
This run tweaks the `ask` function by adding "Think step by step." to the system prompt, and then reruns all recorded examples. Two of the three requests now succeed; the fox question still gets "I don't know."
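
A success or failure tally can be read off evaluation payloads shaped like the ones `evaluate_system` prints. The sketch below assumes a run counts as a success only when every criterion is rated "Yes". That rule, the made-up payload, and the helper names `passed` and `failed_criteria` are assumptions for illustration, not eyeball_pp's documented scoring.

```python
import json

# Made-up payload that follows the shape of the evaluation output in this notebook.
payload = """
{
  "evaluations": [
    {"name": "Criteria.CORRECTNESS", "rating": "Yes", "reason": "example reason"},
    {"name": "Criteria.RELEVANCE", "rating": "No", "reason": "example reason"}
  ]
}
"""


def passed(evaluation_json: str) -> bool:
    """Hypothetical rule: a run passes only if every criterion is rated Yes."""
    evaluations = json.loads(evaluation_json)["evaluations"]
    return all(item["rating"] == "Yes" for item in evaluations)


def failed_criteria(evaluation_json: str) -> list[str]:
    """Return the names of the criteria that were not rated Yes."""
    evaluations = json.loads(evaluation_json)["evaluations"]
    return [item["name"] for item in evaluations if item["rating"] != "Yes"]


print(passed(payload))
print(failed_criteria(payload))
```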

In [6]:
@eyeball_pp.record_task(args_to_record=["context", "question"])
def ask(context: str, question: str) -> str:

    # Eval params can be set when you want to evaluate this agent with
    # different parameters, e.g. different models, providers, or hyperparameters like temperature.
    model = eyeball_pp.get_eval_param("model") or "gpt-3.5-turbo"
    
    system = """
    You are trying to answer a question strictly using the information provided in the context. Think step by step. Reply "I don't know" if you don't know the answer.
    """

    prompt = f"""
    Context: {context}
    Question: {question}
    """

    return _execute_completion(system, prompt, model)


In [7]:
recorded_inputs = eyeball_pp.rerun_recorded_examples()
_print_answers(recorded_inputs)

Will rerun 3 inputs for task:`ask`

Rerunning input #0:
context="The quick brown fox jumps over the lazy dog"
question="What color is the fox?"

Context  : The quick brown fox jumps over the lazy dog
Question : What color is the fox?
Answer   : I don't know.


Rerunning input #1:
context="The lazy green dog jumps over the quick brown fox"
question="What color is the dog?"

Context  : The lazy green dog jumps over the quick brown fox
Question : What color is the dog?
Answer   : The color of the dog is green.


Rerunning input #2:
context="Peter Piper picked a peck of pickled peppers"
question="How would you describe the peppers?"

Context  : Peter Piper picked a peck of pickled peppers
Question : How would you describe the peppers?
Answer   : The peppers are pickled.



In [8]:
eyeball_pp.evaluate_system(
    task_objective="This agent tries to answer questions given a context. Verify that the agent answers the question correctly and that the answer is only based on the context.",
    grading_criteria=[Criteria.CORRECTNESS, Criteria.RELEVANCE]
)

Comparing 3 inputs for task:`ask`

Input #0 - Scoring 2 checkpoints
Scored 2023-08-09T02:09:13.078008: 0.00 ({
  "evaluations": [
    {
      "name": "Criteria.CORRECTNESS",
      "rating": "No",
      "reason": "The response is incorrect. The color of the fox is clearly stated in the context as 'brown'."
    },
    {
      "name": "Criteria.RELEVANCE",
      "rating": "No",
      "reason": "The response is not relevant to the context. The color of the fox is mentioned in the context, but the response does not refer to it."
    }
  ]
})
Using cached score for 2023-08-09T02:08:21.436529: {'task_output': OutputScore(score=0.0, message='{\n  "evaluations": [\n    {\n      "name": "Criteria.CORRECTNESS",\n      "rating": "No",\n      "reason": "The response is incorrect. The color of the fox is clearly stated in the context as \'brown\'."\n    },\n    {\n      "name": "Criteria.RELEVANCE",\n      "rating": "No",\n      "reason": "The response is not relevant to the context. The color of th

### 3. Third Run: 3 Successes
This run reruns the recorded examples with the `gpt-4` model, passed as an eval param, instead of the default `gpt-3.5-turbo`. All three requests succeed.
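
One way to picture how a rerun can swap the model without touching `ask` is a context-scoped override with a fallback default. The sketch uses the standard `contextvars` module; `eval_params`, `get_param`, and `pick_model` are hypothetical names, and this is not necessarily how eyeball_pp implements `get_eval_param`.

```python
import contextvars
from contextlib import contextmanager
from typing import Iterator, Optional

# Holds the active overrides. The empty default dict is never mutated.
_PARAMS = contextvars.ContextVar("params", default={})


@contextmanager
def eval_params(**params: str) -> Iterator[None]:
    """Hypothetical helper: apply overrides for the duration of a with-block."""
    token = _PARAMS.set(params)
    try:
        yield
    finally:
        # Restore whatever was active before, even if the body raised.
        _PARAMS.reset(token)


def get_param(name: str) -> Optional[str]:
    """Return the override for `name`, or None when no override is active."""
    return _PARAMS.get().get(name)


def pick_model() -> str:
    # Same fallback pattern as in `ask`: use the override if present.
    return get_param("model") or "gpt-3.5-turbo"


print(pick_model())  # gpt-3.5-turbo
with eval_params(model="gpt-4"):
    print(pick_model())  # gpt-4
print(pick_model())  # gpt-3.5-turbo
```

Keeping the default inside the task function means normal calls behave the same whether or not an evaluation is in progress.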

In [9]:
recorded_inputs = eyeball_pp.rerun_recorded_examples(
    {"model": "gpt-4"}
)
_print_answers(recorded_inputs)

Will rerun 3 inputs for task:`ask`

Rerunning input #0:
context="The quick brown fox jumps over the lazy dog"
question="What color is the fox?"

Using eval params: {'model': 'gpt-4'}
Context  : The quick brown fox jumps over the lazy dog
Question : What color is the fox?
Answer   : The fox is brown.


Rerunning input #1:
context="The lazy green dog jumps over the quick brown fox"
question="What color is the dog?"

Using eval params: {'model': 'gpt-4'}
Context  : The lazy green dog jumps over the quick brown fox
Question : What color is the dog?
Answer   : The dog is green.


Rerunning input #2:
context="Peter Piper picked a peck of pickled peppers"
question="How would you describe the peppers?"

Using eval params: {'model': 'gpt-4'}
Context  : Peter Piper picked a peck of pickled peppers
Question : How would you describe the peppers?
Answer   : The peppers are described as pickled.



In [10]:
eyeball_pp.evaluate_system(
    task_objective="This agent tries to answer questions given a context. Verify that the agent answers the question correctly and that the answer is only based on the context.",
    grading_criteria=[Criteria.CORRECTNESS, Criteria.RELEVANCE]
)

Comparing 3 inputs for task:`ask`

Input #0 - Scoring 3 checkpoints
Scored 2023-08-09T02:11:32.003474: 1.00 ({
  "evaluations": [
    {
      "name": "Criteria.CORRECTNESS",
      "rating": "Yes",
      "reason": "The response is correct and accurate. The fox is indeed described as brown in the context."
    },
    {
      "name": "Criteria.RELEVANCE",
      "rating": "Yes",
      "reason": "The response is relevant as it directly answers the question using the information provided in the context."
    }
  ]
})
Using cached score for 2023-08-09T02:09:13.078008: {'task_output': OutputScore(score=0.0, message='{\n  "evaluations": [\n    {\n      "name": "Criteria.CORRECTNESS",\n      "rating": "No",\n      "reason": "The response is incorrect. The color of the fox is clearly stated in the context as \'brown\'."\n    },\n    {\n      "name": "Criteria.RELEVANCE",\n      "rating": "No",\n      "reason": "The response is not relevant to the context. The color of the fox is mentioned in the 