# Eyeball Simulation Demo
This notebook shows how `eyeball_pp` can be used during development. It walks through eyeball's recording, rerunning, and evaluation systems in three scenarios.

### 0. Setup
Let's install the required modules and set up eyeball and openai.

In [None]:
%pip install eyeball_pp openai pyyaml rich

In [1]:
import eyeball_pp
import openai

openai.api_key = "your-openai-key"

# Setting a sample_rate of 1 means that every call to the ask function will be recorded.
# In production you might want a lower value, such as 0.1, to record only 10% of the calls.
eyeball_pp.set_config(sample_rate=1)

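def _execute_completion(system_msg: str, prompt: str, model: str) -> str:
    """Convenience method for executing a chat completion and returning its text."""
    # Note: `openai.ChatCompletion` is the legacy interface (openai < 1.0).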
    return openai.ChatCompletion.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": prompt},
        ],
    )["choices"][0]["message"]["content"]
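
To make the effect of `sample_rate` concrete, here is a minimal sketch of rate-based sampling in general. It is not `eyeball_pp`'s implementation; `should_record` is a hypothetical helper used only for illustration.

```python
import random


def should_record(sample_rate: float, rng: random.Random) -> bool:
    """Hypothetical helper: return True for roughly `sample_rate` of all calls."""
    # rng.random() is uniform in [0.0, 1.0), so a rate of 1 always records
    # and a rate of 0 never does.
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded so the illustration is repeatable
recorded = sum(should_record(0.1, rng) for _ in range(10_000))
print(f"recorded {recorded} of 10000 calls")
```

With a rate of 0.1, about one call in ten is kept, which is the trade-off the comment above describes: less storage and overhead in exchange for fewer recorded examples.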

### 1. First Run: 3 Failures
This run records the inputs for the first time. All 3 answers are failures.

In [2]:
@eyeball_pp.record_task(input_names=["context", "question"])
def ask(context: str, question: str) -> str:

    model = eyeball_pp.get_eval_param("model") or "gpt-3.5-turbo"
    
    system = """
    You are trying to answer a question strictly using the information provided in the context. Reply "I don't know" if you don't know the answer.
    """

    prompt = f"""
    Context: {context}
    Question: {question}
    """

    return _execute_completion(system, prompt, model)
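
To illustrate what recording the named inputs of a task could involve, here is a minimal offline sketch. Everything in it (`RECORDED`, `record_inputs`, `echo_question`) is hypothetical and only mimics the shape of the decorator above; it makes no claim about how `eyeball_pp` stores or serializes its recordings.

```python
import functools
import inspect
from typing import Any, Callable, Dict, List

# Hypothetical in-memory store; a real recorder would persist this somewhere.
RECORDED: List[Dict[str, Any]] = []


def record_inputs(input_names: List[str]) -> Callable:
    """Hypothetical decorator that records the named arguments and the output."""

    def decorator(fn: Callable) -> Callable:
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Map positional and keyword arguments onto parameter names.
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            output = fn(*args, **kwargs)
            RECORDED.append({
                "inputs": {name: bound.arguments[name] for name in input_names},
                "output": output,
            })
            return output

        return wrapper

    return decorator


@record_inputs(input_names=["context", "question"])
def echo_question(context: str, question: str) -> str:
    # Stand-in for a model call so the sketch runs without an API key.
    return f"Asked: {question}"


echo_question("The quick brown fox", question="What color is the fox?")
print(RECORDED[0]["inputs"])
```

Binding arguments through the function signature means the recorded names are the same whether the caller passes them positionally or by keyword.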


In [3]:
sample_inputs = [
    {
        "context": "The quick brown fox jumps over the lazy dog", 
        "question": "What color is the fox?"
    }, {
        "context": "The lazy green dog jumps over the quick brown fox", 
        "question": "What color is the dog?"
    }, {
        "context": "Peter Piper picked a peck of pickled peppers", 
        "question": "How would you describe the peppers?"
    }
]

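def _answer_and_print(context, question):
    """Ask the question and print the context, question and answer."""
    answer = ask(context, question)
    # Use the function's own arguments rather than a global loop variable.
    print(f"Context  : {context}")
    print(f"Question : {question}")
    print(f"Answer   : {answer}\n")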

In [4]:
for input in sample_inputs:
    _answer_and_print(**input)

Context  : The quick brown fox jumps over the lazy dog
Question : What color is the fox?
Answer   : I don't know.

Context  : The lazy green dog jumps over the quick brown fox
Question : What color is the dog?
Answer   : I don't know.

Context  : Peter Piper picked a peck of pickled peppers
Question : How would you describe the peppers?
Answer   : I don't know.



In [5]:
from eyeball_pp import Criteria

eyeball_pp.evaluate_system(
    grading_criteria=[Criteria.CORRECTNESS, Criteria.RELEVANCE]
)

Evaluating 3 inputs for task:`ask`


Inputs: 100%|██████████|3/3


Summary:
---------

### 2. Second Run: 2 Successes, 1 Failure
This run tweaks the `ask` function by changing the "system" prompt, and then reruns all recorded examples. As a result, 2 of the 3 requests succeed.
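
The idea of rerunning recorded inputs against a changed function can be sketched without any model calls. The snippet below is an illustration only: `recorded_examples`, `ask_v1`, `ask_v2` and `rerun` are hypothetical stand-ins, not part of `eyeball_pp`.

```python
from typing import Callable, Dict, List

# Hypothetical recorded inputs, standing in for what an earlier run stored.
recorded_examples: List[Dict[str, str]] = [
    {
        "context": "The lazy green dog jumps over the quick brown fox",
        "question": "What color is the dog?",
    },
]


def ask_v1(context: str, question: str) -> str:
    # Stand-in for the original prompt: always gives up.
    return "I don't know."


def ask_v2(context: str, question: str) -> str:
    # Stand-in for the tweaked prompt: finds the answer in the context.
    return "green" if "green dog" in context else "I don't know."


def rerun(fn: Callable[..., str], examples: List[Dict[str, str]]) -> List[str]:
    """Call `fn` once per recorded example and collect the outputs."""
    return [fn(**example) for example in examples]


before = rerun(ask_v1, recorded_examples)
after = rerun(ask_v2, recorded_examples)

# Indices of examples whose output changed between the two versions.
changed = [i for i, (b, a) in enumerate(zip(before, after)) if b != a]
print(changed)
```

Because the inputs are fixed, any difference between `before` and `after` can be attributed to the change in the function rather than to a change in the data.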

In [8]:
@eyeball_pp.record_task(input_names=["context", "question"])
def ask(context: str, question: str) -> str:

    model = eyeball_pp.get_eval_param("model") or "gpt-3.5-turbo"
    
    system = """
    You are trying to answer a question strictly using the information provided in the context. Think step by step. Reply "I don't know" if you don't know the answer.
    """

    prompt = f"""
    Context: {context}
    Question: {question}
    """

    return _execute_completion(system, prompt, model)


In [9]:
recorded_inputs = eyeball_pp.rerun_recorded_examples()
for input in recorded_inputs:
    _answer_and_print(input["context"], input["question"])

Will rerun 3 inputs for task:`ask`

Rerunning input #0:
context="The quick brown fox jumps over the lazy dog"
question="What color is the fox?"

Context  : The quick brown fox jumps over the lazy dog
Question : What color is the fox?
Answer   : I don't know.


Rerunning input #1:
context="The lazy green dog jumps over the quick brown fox"
question="What color is the dog?"

Context  : The lazy green dog jumps over the quick brown fox
Question : What color is the dog?
Answer   : The color of the dog is green.


Rerunning input #2:
context="Peter Piper picked a peck of pickled peppers"
question="How would you describe the peppers?"

Context  : Peter Piper picked a peck of pickled peppers
Question : How would you describe the peppers?
Answer   : The peppers are pickled.



In [10]:
eyeball_pp.evaluate_system(
    task_objective="This agent tries to answer questions given a context. Verify that the agent answers the question correctly and that the answer is only based on the context.",
    grading_criteria=[Criteria.CORRECTNESS, Criteria.RELEVANCE]
)

Evaluating 3 inputs for task:`ask`

Input #0 - Grading 2 checkpoints
Scored 2023-08-09T16:45:55.605154: 1.0
Using cached score for 2023-08-09T16:34:19.644654: 0.0

Input #0 - Running 1 comparison(s)
[improvement] task output improved from checkpoint 2023-08-09T16:34:19.644654 to 2023-08-09T16:45:55.605154

Input #1 - Grading 2 checkpoints
Scored 2023-08-09T16:45:55.012159: 0.0
Using cached score for 2023-08-09T16:34:19.010248: 0.0

Input #1 - Running 1 comparison(s)
[neutral] task output is the same between checkpoints 2023-08-09T16:34:19.010248 & 2023-08-09T16:45:55.012159

Input #2 - Grading 2 checkpoints
Scored 2023-08-09T16:45:56.185086: 1.0
Using cached score for 2023-08-09T16:34:20.204640: 0.0

Input #2 - Running 1 comparison(s)
[improvement] task output improved from checkpoint 2023-08-09T16:34:20.204640 to 2023-08-09T16:45:56.185086

Summary:
---------
Your most sucessful re-runs:
2023-08-09T16:45:55.011848: 3/3 successes

Your most sucessful p

### 3. Third Run: 3 Successes
This run reruns the recorded examples with the `gpt-4` model instead of the default `gpt-3.5-turbo`. All 3 requests succeed.
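
The `get_eval_param("model") or "gpt-3.5-turbo"` line in `ask` relies on a simple fallback: use the rerun's parameter when one is set, otherwise the default. A minimal sketch of that pattern follows, assuming a plain dictionary as the parameter store; `_eval_params`, `get_param`, `eval_params` and `pick_model` are hypothetical names, and this is not how `eyeball_pp` itself is implemented.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

# Hypothetical store holding the parameters of the current rerun.
_eval_params: Dict[str, str] = {}


def get_param(name: str) -> Optional[str]:
    """Return the parameter for the current rerun, or None if it is not set."""
    return _eval_params.get(name)


@contextmanager
def eval_params(params: Dict[str, str]) -> Iterator[None]:
    """Temporarily apply `params`, restoring the previous values afterwards."""
    previous = dict(_eval_params)
    _eval_params.update(params)
    try:
        yield
    finally:
        _eval_params.clear()
        _eval_params.update(previous)


def pick_model() -> str:
    # Same fallback idiom as in `ask`: None is falsy, so the default applies.
    return get_param("model") or "gpt-3.5-turbo"


default_model = pick_model()
with eval_params({"model": "gpt-4"}):
    overridden_model = pick_model()
restored_model = pick_model()
print(default_model, overridden_model, restored_model)
```

Keeping the override outside the task function means the same `ask` code serves both normal calls and reruns with different parameters.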

In [11]:
recorded_inputs = eyeball_pp.rerun_recorded_examples(
    {"model": "gpt-4"}
)
for input in recorded_inputs:
    _answer_and_print(**input)

Will rerun 3 inputs for task:`ask`

Rerunning input #0:
context="The quick brown fox jumps over the lazy dog"
question="What color is the fox?"

Using eval params: {'model': 'gpt-4'}
Context  : The quick brown fox jumps over the lazy dog
Question : What color is the fox?
Answer   : The fox is brown.


Rerunning input #1:
context="The lazy green dog jumps over the quick brown fox"
question="What color is the dog?"

Using eval params: {'model': 'gpt-4'}
Context  : The lazy green dog jumps over the quick brown fox
Question : What color is the dog?
Answer   : The dog is green.


Rerunning input #2:
context="Peter Piper picked a peck of pickled peppers"
question="How would you describe the peppers?"

Using eval params: {'model': 'gpt-4'}
Context  : Peter Piper picked a peck of pickled peppers
Question : How would you describe the peppers?
Answer   : The peppers are described as pickled.



In [12]:
eyeball_pp.evaluate_system(
    task_objective="This agent tries to answer questions given a context. Verify that the agent answers the question correctly and that the answer is only based on the context.",
    grading_criteria=[Criteria.CORRECTNESS, Criteria.RELEVANCE]
)

Evaluating 3 inputs for task:`ask`

Input #0 - Grading 3 checkpoints
Scored 2023-08-09T16:54:17.211499: 1.0
Using cached score for 2023-08-09T16:45:55.012159: 0.0
Using cached score for 2023-08-09T16:34:19.010248: 0.0

Input #0 - Running 2 comparison(s)
[improvement] task output improved from checkpoint 2023-08-09T16:45:55.012159 to 2023-08-09T16:54:17.211499 (model=gpt-4)
Using cached comparison result for 1ac98f2a029fafe8d8d8caa05df2d12402770d5e92e7eb61297ca2f0a7719aa8
[neutral] task output is the same between checkpoints 2023-08-09T16:34:19.010248 & 2023-08-09T16:45:55.012159

Input #1 - Grading 3 checkpoints
Scored 2023-08-09T16:54:18.785326: 1.0
Using cached score for 2023-08-09T16:45:55.605154: 1.0
Using cached score for 2023-08-09T16:34:19.644654: 0.0

Input #1 - Running 2 comparison(s)
[neutral] task output is the same between checkpoints 2023-08-09T16:45:55.605154 & 2023-08-09T16:54:18.785326 (model=gpt-4)
Using cached comparison result for f2a