# Evals

An Eval defines a test scenario: an input prompt plus a set of "checks" that are used to evaluate the LLM's response. Evals can be defined with a dictionary/YAML, or with `llm_eval` Python classes.

Here's a simple example that tests the LLM's ability to write a Python function from a description. (A more detailed version of this Eval can be found in `./evals/mask_emails.yaml`.)

## Defining a single Eval and single prompt

In [1]:
# set path to the root directory of the project
import os
os.chdir('..')

In [2]:
from textwrap import dedent
from llm_eval.eval import Eval
from llm_eval.checks import RegexCheck, PythonCodeBlockTests
from llm_eval.openai import user_message

eval = Eval(
    metadata={
        # metadata is a dictionary that can contain any key-value string pairs
        'name': 'Mask Emails',
        'description': 'Evaluates the ability of a model to mask emails in text',
    },
    input=[
        {
            'role': 'user',
            'content': dedent(
                """
                Create a python function called `mask_emails` that takes a single string and masks all
                emails in that string.

                For each email in the format of `x@y.z`, the local part (`x`) should be masked with
                [MASKED].
                """
            ),
        }
    ],
    checks=[
        RegexCheck(pattern=r'def mask_emails\([a-zA-Z_]+(\: str)?\)( -> str)?\:'),
        PythonCodeBlockTests(
            code_tests=[
                'assert mask_emails("no email") == "no email"',
                'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"',
            ],
        ),
    ],
)
eval.to_dict()

{'input': [{'role': 'user',
   'content': '\nCreate a python function called `mask_emails` that takes a single string and masks all\nemails in that string.\n\nFor each email in the format of `x@y.z`, the local part (`x`) should be masked with\n[MASKED].\n'}],
 'checks': [{'pattern': 'def mask_emails\\([a-zA-Z_]+(\\: str)?\\)( -> str)?\\:',
   'check_type': 'REGEX'},
  {'code_tests': ['assert mask_emails("no email") == "no email"',
    'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"'],
   'check_type': 'PYTHON_CODE_BLOCK_TESTS'}],
 'metadata': {'name': 'Mask Emails',
  'description': 'Evaluates the ability of a model to mask emails in text'}}
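To get a feel for what the `RegexCheck` pattern accepts, it can be tried directly with Python's `re` module. This is only a sketch: it assumes the check behaves like `re.search` over the response text, which this notebook does not confirm, and the sample signatures are made up for illustration.

```python
import re

# Pattern copied from the RegexCheck defined above.
PATTERN = r'def mask_emails\([a-zA-Z_]+(\: str)?\)( -> str)?\:'

# Hypothetical response fragments mapped to whether we expect a match.
samples = {
    'def mask_emails(text):': True,
    'def mask_emails(text: str) -> str:': True,
    'def mask_emails(text, other):': False,  # two parameters
    'def mask_email(text):': False,          # wrong function name
}

# Assumption: the check is equivalent to re.search on the response.
results = {s: re.search(PATTERN, s) is not None for s in samples}

for signature, matched in results.items():
    print(f'{matched!s:5}  {signature}')
```

The pattern accepts exactly one parameter, with optional `: str` and `-> str` annotations, so a response that adds a second parameter or renames the function would not match.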

The `PythonCodeBlockTests` check extracts all of the code blocks generated by the LLM and runs them in the background in an isolated environment. It tracks the number of code blocks that were generated and the number that executed successfully. The check also takes a list of `code_tests`, where the user can write Python code: either single statements (assertions, or statements that resolve to booleans) or functions that return boolean values. The code defined in `code_tests` runs in the same isolated environment and can test any function, variable, or class created by the code that the LLM generated.

See `./evals/mask_emails.yaml` for additional examples, and the documentation for `PythonCodeBlockTests` in `llm_eval/checks.py`.
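The extract, run, and test flow described for `PythonCodeBlockTests` can be sketched with the standard library alone. This is not the `llm_eval` implementation: `extract_code_blocks` and `run_blocks_and_tests` are hypothetical helpers, only assertion-style tests are handled, and `exec` in a shared dictionary stands in for the isolated environment (it provides no real isolation, so it should not be used on untrusted output).

```python
import re

FENCE = '`' * 3  # markdown code fence, built programmatically


def extract_code_blocks(markdown: str) -> list[str]:
    """Return the bodies of all fenced (optionally python-tagged) blocks."""
    pattern = re.compile(FENCE + r'(?:python)?[ \t]*\n(.*?)' + FENCE, re.DOTALL)
    return [m.group(1) for m in pattern.finditer(markdown)]


def run_blocks_and_tests(response: str, code_tests: list[str]) -> dict:
    """Execute each code block, then each test, in one shared namespace."""
    namespace: dict = {}
    blocks = extract_code_blocks(response)
    num_successful = 0
    for block in blocks:
        try:
            exec(block, namespace)  # stand-in for an isolated environment
            num_successful += 1
        except Exception:
            pass  # a failing block is counted as unsuccessful
    test_results = []
    for test in code_tests:
        try:
            exec(test, namespace)  # a failed assert raises AssertionError
            test_results.append(True)
        except Exception:
            test_results.append(False)
    return {
        'num_code_blocks': len(blocks),
        'num_code_blocks_successful': num_successful,
        'code_test_results': test_results,
    }


# Hypothetical LLM response whose function returns the text unchanged.
response = (
    'Here is the function:\n'
    f'{FENCE}python\n'
    'def mask_emails(text):\n'
    '    return text\n'
    f'{FENCE}\n'
)
outcome = run_blocks_and_tests(
    response,
    [
        'assert mask_emails("no email") == "no email"',
        'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"',
    ],
)
print(outcome)
```

Because the tests run in the same namespace as the generated code, they can refer to `mask_emails` by name without importing anything.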

The checks in the Eval above can also be defined as dictionaries, using the `check_type` key with values such as `REGEX` and `PYTHON_CODE_BLOCK_TESTS`. These values are "registered", which lets the corresponding classes be instantiated at runtime. Built-in registration types can be found in the `CheckType` enum.

For example, here is the class definition for `RegexCheck`:

```
@Check.register(CheckType.REGEX)
class RegexCheck(Check):
    ...
```

Users can create their own custom checks and use the same registration system to register their checks, for example:

```
@Check.register('my-custom-check')
class CustomXYZCheck(Check):
    ...
```
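For readers unfamiliar with the pattern, a registration decorator of this kind can be sketched as follows. This is a generic illustration, not the `llm_eval` source: `BaseCheck`, `from_dict`, and the `threshold` parameter are hypothetical names, and the real `Check.register` may differ in its details.

```python
class BaseCheck:
    """Hypothetical base class that keeps a name -> subclass registry."""

    _registry: dict = {}

    @classmethod
    def register(cls, check_type: str):
        """Return a class decorator that stores the class under `check_type`."""
        def decorator(subclass):
            if check_type in cls._registry:
                raise ValueError(f'{check_type!r} is already registered')
            cls._registry[check_type] = subclass
            return subclass
        return decorator

    @classmethod
    def from_dict(cls, data: dict):
        """Instantiate the registered class named by the `check_type` key."""
        params = dict(data)  # copy so the caller's dictionary is not mutated
        check_type = params.pop('check_type')
        return cls._registry[check_type](**params)


@BaseCheck.register('my-custom-check')
class CustomXYZCheck(BaseCheck):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold


# A dictionary (for example, loaded from YAML) is enough to build the object.
check = BaseCheck.from_dict({'check_type': 'my-custom-check', 'threshold': 0.9})
print(type(check).__name__, check.threshold)
```

The registry is what allows a plain dictionary to name the class to build, which is why the same definitions can live in YAML files.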


Eval objects can be created directly from dictionaries, as shown below. However, the most common pattern is to define Evals as YAML files, then load and run many of them with the `EvalHarness`, which is described in another notebook.

In [3]:
Eval(**eval.to_dict())

<llm_eval.eval.Eval at 0x1155be9d0>

---

## Creating Candidates

A Candidate is a wrapper around an LLM/service. Candidates use a registration system similar to the one for Checks, which allows them to be instantiated dynamically from dictionaries.

In [4]:
from llm_eval.candidates import OpenAICandidate

candidate = OpenAICandidate(
    model='gpt-4o-mini',
    metadata={'name': 'OpenAI 4o-mini'},
    parameters={
        'temperature': 0.1,
    },
)
candidate.to_dict()

{'metadata': {'name': 'OpenAI 4o-mini'},
 'parameters': {'temperature': 0.1},
 'candidate_type': 'OPENAI',
 'model_name': 'gpt-4o-mini'}

---

## Running a single Eval against a single Candidate

This example runs a single Eval against a single Candidate. However, as mentioned above, the most common pattern is to define Evals as YAML files, then load and run many of them with the `EvalHarness`, which is described in another notebook.

The Eval object is callable, and takes a single candidate.

In [5]:
result = eval(candidate)
print(result)

<llm_eval.eval.EvalResult object at 0x114b35090>


In [9]:
result.to_dict()

{'eval': {'input': [{'role': 'user',
    'content': '\nCreate a python function called `mask_emails` that takes a single string and masks all\nemails in that string.\n\nFor each email in the format of `x@y.z`, the local part (`x`) should be masked with\n[MASKED].\n'}],
  'checks': [{'pattern': 'def mask_emails\\([a-zA-Z_]+(\\: str)?\\)( -> str)?\\:',
    'check_type': 'REGEX'},
   {'code_tests': ['assert mask_emails("no email") == "no email"',
     'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"'],
    'check_type': 'PYTHON_CODE_BLOCK_TESTS'}],
  'metadata': {'name': 'Mask Emails',
   'description': 'Evaluates the ability of a model to mask emails in text'}},
 'candidate': {'metadata': {'name': 'OpenAI 4o-mini'},
  'parameters': {'temperature': 0.1},
  'candidate_type': 'OPENAI',
  'model_name': 'gpt-4o-mini'},
 'response': 'You can create a Python function called `mask_emails` that uses regular expressions to find and mask email addresses in the specified format

In [6]:
print(f"Num Checks: {result.num_checks}")
print(f"Num Passed: {result.num_successful_checks}")
print(f"Percent Passed: {result.perc_successful_checks:.1%}")

Num Checks: 2
Num Passed: 1
Percent Passed: 50.0%
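The summary does not say which check failed. One way to investigate by hand, as a sketch that does not use `llm_eval` at all, is to re-run the Eval's two assertions and its regex against the function from the model response (printed by the `print(result.response)` cell in this notebook). The assumption that the regex check behaves like `re.search` is not confirmed by the notebook.

```python
import re


# Function copied from the model response shown in this notebook.
def mask_emails(text):
    email_pattern = r'\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})\b'

    def mask_email(match):
        return '[MASKED]@' + match.group(2) + '.' + match.group(3)

    return re.sub(email_pattern, mask_email, text)


# The two code tests from the Eval, rewritten as (input, expected) pairs.
cases = [
    ('no email', 'no email'),
    ('my email is a@b.c', 'my email is [MASKED]@b.c'),
]
outcomes = [mask_emails(given) == expected for given, expected in cases]

# The RegexCheck pattern applied to the signature from the response
# (assumption: the check is equivalent to re.search).
signature_ok = re.search(
    r'def mask_emails\([a-zA-Z_]+(\: str)?\)( -> str)?\:',
    'def mask_emails(text):',
) is not None

print(f'signature matches: {signature_ok}')
print(f'code test outcomes: {outcomes}')
```

Under these assumptions the signature matches and the first assertion holds, while the second fails: the generated pattern requires a top-level domain of at least two letters, and `a@b.c` has only one. That is consistent with 1 of 2 checks passing, although how `PythonCodeBlockTests` decides its overall pass or fail is not shown here.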


In [7]:
result.response_metadata

{'prompt_tokens': 59,
 'completion_tokens': 469,
 'total_tokens': 528,
 'prompt_cost': 8.85e-06,
 'completion_cost': 0.0002814,
 'total_cost': 0.00029025000000000003,
 'completion_characters': 1750}
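The cost fields are consistent with simple per-token pricing. The rates below are not taken from `llm_eval` or from a price list; they are back-calculated from the output above (cost divided by token count), so treat them as assumptions.

```python
import math

# Token counts copied from the response metadata above.
prompt_tokens = 59
completion_tokens = 469

# Assumed rates in dollars per token, inferred from the output above.
PROMPT_RATE = 0.15 / 1_000_000      # 8.85e-06 / 59
COMPLETION_RATE = 0.60 / 1_000_000  # 0.0002814 / 469

prompt_cost = prompt_tokens * PROMPT_RATE
completion_cost = completion_tokens * COMPLETION_RATE
total_cost = prompt_cost + completion_cost

print(f'prompt_cost:     {prompt_cost:.4g}')
print(f'completion_cost: {completion_cost:.4g}')
print(f'total_cost:      {total_cost:.6g}')
```

The long tail of digits in `total_cost` in the notebook output is ordinary floating-point rounding from adding the two costs, not extra precision.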

In [8]:
print(result.response)

You can create a Python function called `mask_emails` that uses regular expressions to find and mask email addresses in the specified format. Below is an implementation of this function:

```python
import re

def mask_emails(text):
    # Define a regex pattern to match emails in the format x@y.z
    email_pattern = r'\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})\b'
    
    # Function to replace the local part of the email with [MASKED]
    def mask_email(match):
        return '[MASKED]@' + match.group(2) + '.' + match.group(3)
    
    # Substitute the emails in the text using the mask_email function
    masked_text = re.sub(email_pattern, mask_email, text)
    
    return masked_text

# Example usage
input_text = "Please contact us at support@example.com or sales@domain.org."
masked_text = mask_emails(input_text)
print(masked_text)
```

### Explanation:
1. **Regular Expression**: The regex pattern `r'\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})\b'` is used to matc

---