**NOTE**: You will need to create a `.env` file containing the entry `OPENAI_API_KEY=<your key here>`, where the value is your OpenAI API key.
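
To confirm the entry is readable before running the notebook, a quick check along these lines may help. This is a stdlib-only sketch: `read_env_file` is a hypothetical helper that is not part of `llm_eval`, it handles only simple `KEY=VALUE` lines, and the library may load the file through its own mechanism.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines (hypothetical helper, not part of llm_eval)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


env = read_env_file()
if not env.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY not found in .env")
```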

# Evaluating OpenAI 3.5 and 4.0 against two evals

This example shows the most common use of the EvalHarness: evaluating a large number of test cases (Evals) against one or more LLMs (Candidates).

For demonstration purposes, we'll show how to use the EvalHarness to evaluate OpenAI 3.5 and 4.0 against two fictitious evals. The candidates and evals in this example are defined in yaml files.

An EvalHarness is a class that encapsulates a set of Evals and a set of Candidates, and runs the Evals against each of the Candidates. It also provides options for multiprocessing, asynchronous requests (to the underlying LLMs), and a callback mechanism for saving files or logging/handling errors.

In this example, the Eval objects are added to the EvalHarness via `add_evals_from_yamls(...)`, which loads all yaml files matching the path provided, and each Candidate is added via `add_candidate_from_yaml(...)`, which takes the path to a single yaml file as a string. Candidates can likewise be loaded in bulk via `add_candidates_from_yamls(...)`, and a single Eval can be loaded via `add_eval_from_yaml(...)`. Eval and Candidate objects can also be added to the EvalHarness directly through `add_evals()` and `add_candidates()`.

In [1]:
# EvalHarness runs evals asynchronously, so we need to install nest_asyncio to avoid errors
# when running the evals in a notebook
!pip install nest_asyncio


In [3]:
import time
from llm_eval.eval import EvalHarness, EvalResult
import nest_asyncio

nest_asyncio.apply()  # needed for running async in jupyter notebook

def print_result(result: EvalResult) -> None:
    """
    This function is used as a callback and prints the results of each evaluation.

    The callback can also be used, for example, to save the results to a file. If you're
    running a large number of evaluations, you may want to save the results to a file
    periodically in case there are issues/errors before the entire EvalHarness completes.
    """
    print(result)
    print('---')

harness = EvalHarness(callback=print_result)
harness.add_evals_from_yamls('evals/*.yaml')
harness.add_candidate_from_yaml('candidates/openai_3.5.yaml')
harness.add_candidate_from_yaml('candidates/openai_4.0.yaml')

print("# of Evals: ", len(harness.evals))
print("# of Candidates: ", len(harness.candidates))

print("Starting eval_harness")
start = time.time()
results = harness()  # run the evals
end = time.time()
print(f"Total time: {end - start}")

# of Evals:  2
# of Candidates:  2
Starting eval_harness


EvalResult:
    Candidate:                  OpenAI GPT-3.5-Turbo (0125)
    Eval:                        Fibonacci Sequence
    # of Prompts Tested:         2
    Cost:                       $0.0007
    Total Response Time:         9.5 seconds
    # of Response Characters:    1,248
    Characters per Second:       131.2
    # of Checks:                 5
    # of Successful Checks:      4
    % of Successful Checks:      80.0%
    # of Code Blocks Generated:  2
    # of Successful Code Blocks: 2
    # of Code Tests Defined:     1
    # of Successful Code Tests:  0
---
EvalResult:
    Candidate:                  OpenAI GPT-3.5-Turbo (0125)
    Eval:                        Python Function to Mask Emails
    # of Prompts Tested:         2
    Cost:                       $0.0007
    Total Response Time:         8.4 seconds
    # of Response Characters:    1,428
    Characters per Second:       170.5
    # of Checks:                 6
    # of Successful Checks:      5
    % of Successful C
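
The callback docstring above mentions saving results to a file as they arrive. One way that might look is sketched below, assuming `result.to_dict()` (used later in this notebook) returns data that `json` can serialize. The `save_result` function and the `eval_results.jsonl` path are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path

RESULTS_FILE = Path("eval_results.jsonl")  # hypothetical output path


def save_result(result) -> None:
    """Append one result as a JSON line so completed evals survive a later failure."""
    record = result.to_dict()
    with RESULTS_FILE.open("a", encoding="utf-8") as f:
        # default=str is a fallback for any value json cannot serialize natively.
        f.write(json.dumps(record, default=str) + "\n")
```

It would be passed the same way as `print_result`, i.e. `EvalHarness(callback=save_result)`. Appending one line per result keeps earlier results intact even if a later eval raises.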

---

The following code contains an example of how to summarize the eval results.

The EvalHarness returns a list of lists. The outer list has one item per candidate, and each item is the list of eval results for that candidate. So if 5 candidates were evaluated, the `results` object would be a list of 5 items (each of which is itself a list). If 10 evals were evaluated against those 5 candidates, each inner list would contain 10 `EvalResult` objects.
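
The shape described above can be illustrated with stand-in data. In this sketch plain strings take the place of real `EvalResult` objects, and the assumption that each inner list follows the order in which the evals were added is mine, not something the notebook states.

```python
# Stand-in data only: strings take the place of real EvalResult objects.
candidates = ["candidate A", "candidate B"]
evals = ["eval 1", "eval 2", "eval 3"]

# Same shape as described above: one inner list per candidate, one entry per eval.
results = [[f"{c} / {e}" for e in evals] for c in candidates]

assert len(results) == len(candidates)
assert all(len(inner) == len(evals) for inner in results)

# results[i][j] is the result of eval j for candidate i (ordering is an assumption).
print(results[1][0])  # → candidate B / eval 1

# Flatten when per-candidate grouping is not needed.
flat = [r for cand_results in results for r in cand_results]
print(len(flat))  # → 6
```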

In [4]:
import pandas as pd

results_summary = []
# each outer list in results corresponds to a candidate
for cand_obj, cand_results in zip(harness.candidates, results):
    candidate_name = cand_obj.metadata['name']
    avg_chars_per_second = sum(r.characters_per_second for r in cand_results) / len(cand_results)
    avg_cost = sum(r.cost for r in cand_results) / len(cand_results)
    num_checks = sum(r.num_checks for r in cand_results)
    num_successful_checks = sum(r.num_successful_checks for r in cand_results)
    percent_success = num_successful_checks / num_checks
    num_code_blocks_generated = sum(r.num_code_blocks for r in cand_results)
    num_code_blocks_successful = sum(r.get_num_code_blocks_successful() for r in cand_results)
    percent_code_blocks_successful = num_code_blocks_successful / num_code_blocks_generated
    results_summary.append({
        'name': candidate_name,
        'Avg chars per second': avg_chars_per_second,
        'Avg cost': avg_cost,
        '# checks': num_checks,
        '# checks passed': num_successful_checks,
        '% checks passed': percent_success,
        '# code blocks generated': num_code_blocks_generated,
        '# blocks successfully executed': num_code_blocks_successful,
        '% blocks successfully executed': percent_code_blocks_successful,
    })
    print(f"Results for {candidate_name}:")
    print(f"  {num_successful_checks}/{num_checks} ({percent_success:.1%}) successful checks")
    print(f"  {num_code_blocks_successful}/{num_code_blocks_generated} ({percent_code_blocks_successful:.1%}) successful code blocks")  # noqa

pd.DataFrame(results_summary).style.format({
    'Avg chars per second': '{:.1f}',
    'Avg cost': '{:.4f}',
    '% checks passed': '{:.1%}',
    '% blocks successfully executed': '{:.1%}',
})

Results for OpenAI GPT-3.5-Turbo (0125):
  9/11 (81.8%) successful checks
  3/4 (75.0%) successful code blocks
Results for OpenAI GPT-4.0-Turbo (2024-04-09):
  9/11 (81.8%) successful checks
  3/4 (75.0%) successful code blocks


| | name | Avg chars per second | Avg cost | # checks | # checks passed | % checks passed | # code blocks generated | # blocks successfully executed | % blocks successfully executed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | OpenAI GPT-3.5-Turbo (0125) | 150.9 | 0.0007 | 11 | 9 | 81.8% | 4 | 3 | 75.0% |
| 1 | OpenAI GPT-4.0-Turbo (2024-04-09) | 92.2 | 0.0423 | 11 | 9 | 81.8% | 4 | 3 | 75.0% |
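
The summary loop divides by `num_checks` and `num_code_blocks_generated`, which would raise `ZeroDivisionError` if a candidate's evals defined no checks or produced no code blocks. A small guard could avoid that. The helper name `safe_ratio` is hypothetical, and returning NaN is only one reasonable choice.

```python
import math


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan
```

In the loop above it would replace the two divisions, for example `percent_success = safe_ratio(num_successful_checks, num_checks)`.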


---

# Running a single Eval against a single Candidate

A less common scenario, which might be useful when generating evals or debugging, is running a single Eval against a single Candidate. Eval objects are callable and can be executed by passing a candidate.

In [5]:
from llm_eval.candidates import OpenAICandidate
from llm_eval.eval import Eval

candidate = OpenAICandidate({'parameters': {'model_name': 'gpt-3.5-turbo-1106'}})
eval_obj = Eval(prompt_sequence={
    'prompt': "Create a python function called `mask_emails` that uses regex to mask all emails.",
    'checks': [
        {'check_type': 'CONTAINS', 'value': 'def mask_emails'},
        {'check_type': 'PYTHON_CODE_BLOCKS_PRESENT'},
    ],
})
result = eval_obj(candidate)
print(result)

EvalResult:
    # of Prompts Tested:         1
    Cost:                       $0.0002
    Total Response Time:         2.7 seconds
    # of Response Characters:    346
    Characters per Second:       130.2
    # of Checks:                 2
    # of Successful Checks:      2
    % of Successful Checks:      100.0%
    # of Code Blocks Generated:  1


In [6]:
result.to_dict()

{'eval_obj': {'prompt_sequence': [{'prompt': 'Create a python function called `mask_emails` that uses regex to mask all emails.',
    'checks': [{'value': 'def mask_emails', 'check_type': 'CONTAINS'},
     {'check_type': 'PYTHON_CODE_BLOCKS_PRESENT'}]}]},
 'candidate_obj': {'metadata': {'parameters': {'model_name': 'gpt-3.5-turbo-1106'}},
  'candidate_type': 'OPENAI'},
 'responses': ['```python\nimport re\n\ndef mask_emails(text):\n    email_pattern = r\'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b\'\n    masked_text = re.sub(email_pattern, \'[email]\', text)\n    return masked_text\n\n# Example usage\ntext = "You can contact me at john.doe@example.com or jane_smith@gmail.com"\nmasked_text = mask_emails(text)\nprint(masked_text)\n```'],
 'total_time_seconds': 2.658125400543213,
 'num_code_blocks': 1,
 'cost': 0.000175,
 'timestamp': '2024-06-25 16:23:21 UTC',
 'results': [[{'value': True,
    'success': True,
    'metadata': {'check_type': 'CONTAINS',
     'check_value': 'def

---