**NOTE**: You will need to create a `.env` file containing the entry `OPENAI_API_KEY=<your key here>`, where the value is your OpenAI API key.
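
How the project reads this file is not shown in this notebook. As a quick sanity check before running the cells, the stdlib-only sketch below parses a `.env`-style file and reports whether `OPENAI_API_KEY` is present. The helper name `read_env_file` is hypothetical and is not part of `llm_eval`.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values  # no file: nothing to report
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


env = read_env_file(".env")
print("OPENAI_API_KEY set:", bool(env.get("OPENAI_API_KEY")))
```

This only checks that the entry exists; it does not validate the key against the OpenAI API.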

# Evaluating OpenAI `4.0` and `4o-mini` against three evals

This example shows how the EvalHarness is most commonly used: evaluating a large number of test cases (Evals) against any number of LLMs (Candidates).

For demonstration purposes, we'll show how to use the EvalHarness to evaluate OpenAI `4.0` and `4o-mini` against three fictitious evals. The candidates and evals in this example are defined in yaml files.

An EvalHarness is a class that encapsulates a set of Evals and a set of Candidates, and runs the Evals against each of the Candidates. It also provides options for multiprocessing, asynchronous requests (to the underlying LLMs), and a callback mechanism for saving files or logging/handling errors.
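
Conceptually, this amounts to a nested loop: for each candidate, run every eval and hand each result to the callback. The sketch below illustrates only that idea with plain functions as stand-ins. `run_harness`, the toy candidates, and the toy evals are all hypothetical; the sketch says nothing about how `EvalHarness` is actually implemented and omits multiprocessing and async requests.

```python
from typing import Callable, Optional

# Stand-ins: a "candidate" maps a prompt to a response, and an "eval" is a
# (prompt, expected_substring) pair. These are NOT the llm_eval classes.
Candidate = Callable[[str], str]
ToyEval = tuple[str, str]


def run_harness(
    evals: list[ToyEval],
    candidates: list[Candidate],
    callback: Optional[Callable[[dict], None]] = None,
) -> list[list[dict]]:
    """Run every eval against every candidate; one inner list per candidate."""
    results = []
    for candidate in candidates:
        candidate_results = []
        for prompt, expected in evals:
            response = candidate(prompt)
            result = {
                "prompt": prompt,
                "response": response,
                "passed": expected in response,
            }
            if callback is not None:
                callback(result)  # e.g. print, log, or save the result
            candidate_results.append(result)
        results.append(candidate_results)
    return results


def upper(prompt: str) -> str:
    return prompt.upper()


def echo(prompt: str) -> str:
    return prompt


toy_results = run_harness(
    evals=[("hello", "HELLO"), ("abc", "abc")],
    candidates=[upper, echo],
    callback=lambda r: print(r["prompt"], "->", r["passed"]),
)
```

The callback fires once per (candidate, eval) pair, which mirrors the one `print_result` block per pair in the output of the run further down.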

In this example, the Eval objects are added to the EvalHarness via `add_evals_from_yamls(...)`, which loads all yaml files matching the path pattern provided, and the Candidate objects are added one at a time via `add_candidate_from_yaml(...)`, which takes the path to a single yaml file. The corresponding methods `add_candidates_from_yamls(...)` and `add_eval_from_yaml()` load candidates in bulk and a single eval, respectively. Eval and Candidate objects can also be added to the EvalHarness directly through `add_evals()` and `add_candidates()`.
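
The eval-loading call in the cell below is given a pattern such as `examples/evals/*.yaml`. Assuming such patterns follow standard glob semantics (an assumption; the harness internals are not shown here), the following sketch shows which files a `*.yaml` pattern would pick up in a throwaway directory. The helper `matching_yaml_files` is hypothetical.

```python
import glob
import os
import tempfile


def matching_yaml_files(pattern: str) -> list[str]:
    """Return the sorted paths that match a glob pattern."""
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # create two yaml files and one unrelated file
    for name in ("eval_a.yaml", "eval_b.yaml", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    matched = [
        os.path.basename(p)
        for p in matching_yaml_files(os.path.join(tmp, "*.yaml"))
    ]

print(matched)  # ['eval_a.yaml', 'eval_b.yaml']
```

Running the same helper on the real pattern is a cheap way to confirm that the number of files matches the `# of Evals` count reported by the harness.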

In [1]:
# EvalHarness runs evals asynchronously, so nest_asyncio is needed to avoid errors
# when running the evals in a notebook
# !pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()  # needed for running async in jupyter notebook

# set path to the root directory of the project
import os
os.chdir('..')

In [2]:
import time
from examples.utils import print_result
from llm_eval.eval import EvalHarness

harness = EvalHarness(callback=print_result)
harness.add_evals_from_yamls('examples/evals/*.yaml')
harness.add_candidate_from_yaml('examples/candidates/openai_4.0.yaml')
harness.add_candidate_from_yaml('examples/candidates/openai_4o-mini.yaml')

print("# of Evals: ", len(harness.evals))
print("# of Candidates: ", len(harness.candidates))

print("Starting eval_harness")
start = time.time()
results = harness()  # run the evals
end = time.time()
print(f"Total time: {end - start}")

# of Evals:  3
# of Candidates:  2
Starting eval_harness
Num Checks: 2
Num Successful Checks: 2
---
Num Checks: 3
Num Successful Checks: 2
---
Num Checks: 3
Num Successful Checks: 3
---
Num Checks: 2
Num Successful Checks: 2
---
Num Checks: 3
Num Successful Checks: 2
---
Num Checks: 3
Num Successful Checks: 3
---
Total time: 27.229626178741455


---

The following is a dictionary representation of the second result (`[1]`) from the first candidate (`[0]`).

In [3]:
results[0][1].to_dict()

{'eval_obj': {'input': [{'role': 'system',
    'content': 'You are a helpful assistant.'},
   {'role': 'user',
    'content': 'Create a python function called `mask_emails` that uses regex to mask all emails. For each email in the format of `x@y.z`, the local part (`x`) should be masked with [MASKED], but the domain (`@y.z`) should be retained. Use type hints and docstrings.'},
   {'role': 'system',
    'content': 'Here is the function to mask emails:\n\n```python\nimport re\n\ndef mask_emails(value: str) -> str:\n    """\n    Masks all emails in the input string.\n    """\n    return re.sub(\n        r\'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\\\.([a-zA-Z]{2,})\',\n        r\'[MASKED]@\\\\2.\\\\3\',\n        value\n    )\n```\n'},
   {'role': 'user',
    'content': 'Create a set of assertion statements that test the function.'}],
  'checks': [{'value': 'assert mask_emails(', 'check_type': 'CONTAINS'},
   {'check_type': 'PYTHON_CODE_BLOCKS_PRESENT'},
   {'code_setup': 'import re\n',
    '

---

The following code contains an example of how to summarize the eval results.

The EvalHarness returns a list of lists. The outer list has one item per candidate, and each item is the list of eval results for that candidate. So if 5 candidates were evaluated, the `results` object would be a list of 5 items (each of which is also a list). If there were 10 evals (evaluated against the 5 candidates), each inner list would contain 10 `EvalResult` objects.
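
To make that shape concrete, here is a small sketch that uses placeholder strings in place of real result objects (5 candidates, 10 evals). The variable names are illustrative only.

```python
# Placeholder strings stand in for the per-eval result objects.
num_candidates, num_evals = 5, 10
nested = [
    [f"c{c}-e{e}" for e in range(num_evals)]
    for c in range(num_candidates)
]

# Outer index = candidate, inner index = eval.
second_result_first_candidate = nested[0][1]
print(second_result_first_candidate)  # c0-e1

# Transpose to compare all candidates on the same eval.
by_eval = [list(group) for group in zip(*nested)]
print(by_eval[1])  # ['c0-e1', 'c1-e1', 'c2-e1', 'c3-e1', 'c4-e1']

# Flatten into (candidate_index, eval_index, result) rows, e.g. for a DataFrame.
flat = [
    (c, e, result)
    for c, row in enumerate(nested)
    for e, result in enumerate(row)
]
print(len(flat))  # 50
```

The transposed view is handy when the question is "which candidates failed this particular eval" rather than "how did this candidate do overall".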

In [4]:
import pandas as pd

results_summary = []
# each outer list in results corresponds to a candidate
for cand_obj, cand_results in zip(harness.candidates, results):
    candidate_name = cand_obj.metadata['name']

    num_characters = sum(len(r.response) for r in cand_results)
    total_time = sum(r.total_time_seconds for r in cand_results)
    avg_chars_per_second = num_characters / total_time
    avg_cost = sum(r.response_metadata['total_cost'] for r in cand_results) / len(cand_results)
    num_checks = sum(len(r.check_results) for r in cand_results)
    num_successful_checks = sum(r.num_successful_checks for r in cand_results)
    percent_success = num_successful_checks / num_checks
    results_summary.append({
        'name': candidate_name,
        'Avg chars per second': avg_chars_per_second,
        'Avg cost': avg_cost,
        '# checks': num_checks,
        '# checks passed': num_successful_checks,
        '% checks passed': percent_success,
    })
    print(f"Results for {candidate_name}:")
    print(f"  {num_successful_checks}/{num_checks} ({percent_success:.1%}) successful checks")

pd.DataFrame(results_summary).style.format({
    'Avg chars per second': '{:.1f}',
    'Avg cost': '{:.4f}',
    '% checks passed': '{:.1%}',
})

Results for OpenAI GPT-4.0-Turbo (2024-04-09):
  7/8 (87.5%) successful checks
Results for OpenAI GPT-4o-mini:
  7/8 (87.5%) successful checks


| | name | Avg chars per second | Avg cost | # checks | # checks passed | % checks passed |
|---|---|---|---|---|---|---|
| 0 | OpenAI GPT-4.0-Turbo (2024-04-09) | 85.6 | 0.0138 | 8 | 7 | 87.5% |
| 1 | OpenAI GPT-4o-mini | 296.8 | 0.0002 | 8 | 7 | 87.5% |


---

# Running a single Eval against a single Candidate

A less common scenario, which might be useful when generating evals or debugging, is running a single Eval against a single Candidate. Eval objects are callable and can be executed by passing a candidate.

In [5]:
from llm_eval.candidates import OpenAICandidate
from llm_eval.eval import Eval
from llm_eval.openai import user_message

candidate = OpenAICandidate(model='gpt-4o-mini')
eval_obj = Eval(
    input=[
        user_message("Create a python function called `mask_emails` that uses regex to mask all emails.")
    ],
    checks=[
        # checks can be defined as dictionaries or as objects (e.g. ContainsCheck, PythonCodeBlocksPresentCheck)
        {'check_type': 'CONTAINS', 'value': 'def mask_emails'},
        {'check_type': 'PYTHON_CODE_BLOCKS_PRESENT'},
    ],
)
result = eval_obj(candidate)
print(result)

<llm_eval.eval.EvalResult object at 0x116fcb510>
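
For intuition about the two checks used above, the sketch below re-implements them in a simplified form. It assumes that `CONTAINS` is a plain substring test and that `PYTHON_CODE_BLOCKS_PRESENT` looks for at least one fenced `python` code block in the response. Both are assumptions about the library's behavior, and the function names are hypothetical.

```python
import re

# Built indirectly so that this example can itself sit inside a fenced block.
FENCE = "`" * 3


def contains_check(response: str, value: str) -> bool:
    """Assumed behavior of CONTAINS: substring match on the response."""
    return value in response


def python_code_blocks_present(response: str) -> bool:
    """Assumed behavior: at least one fenced python block in the response."""
    pattern = re.escape(FENCE) + r"python\n.*?" + re.escape(FENCE)
    return re.search(pattern, response, flags=re.DOTALL) is not None


sample_response = (
    "Here is the function:\n\n"
    f"{FENCE}python\ndef mask_emails(text):\n    return text\n{FENCE}\n"
)
checks = [
    contains_check(sample_response, "def mask_emails"),
    python_code_blocks_present(sample_response),
]
print(f"{sum(checks)}/{len(checks)} checks passed")  # 2/2 checks passed
```

Trying a draft check against a hand-written response like this can help when authoring new evals, before spending API calls on a real candidate.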


In [6]:
result.to_dict()

{'eval_obj': {'input': [{'role': 'user',
    'content': 'Create a python function called `mask_emails` that uses regex to mask all emails.'}],
  'checks': [{'value': 'def mask_emails', 'check_type': 'CONTAINS'},
   {'check_type': 'PYTHON_CODE_BLOCKS_PRESENT'}]},
 'candidate_obj': {'candidate_type': 'OPENAI', 'model': 'gpt-4o-mini'},
 'response': 'You can use the `re` module in Python to create a function that masks emails in a given text. The function will replace the part of the email address before the "@" symbol with asterisks (`*`), while leaving the domain visible. Here\'s how you can implement the `mask_emails` function:\n\n```python\nimport re\n\ndef mask_emails(text):\n    # Regex pattern to match email addresses\n    email_pattern = r"([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})"\n    \n    # Function to replace the email address with masked version\n    def replace_email(match):\n        # Extract the email parts\n        username = match.group(1)\n        domain = mat

---