We've seen how to define Evals and Candidates in dictionaries and YAML files. Now we'll show how a single dictionary (or YAML file) can define multiple Evals and/or multiple Candidates.

Here are common use-cases:

- Evaluating the effectiveness of the same prompt across multiple system messages.
- Comparing different prompts against the same set of Checks.
- Evaluating the same LLM across multiple model parameters (e.g. temperature).

## Example: Evaluating different system messages across the same prompt

Below we've defined an Eval similar to the one in the previous notebook. It tests an LLM's ability to create a Python function called `mask_emails`, and defines specific Checks that allow us to measure success. This Eval also specifies the `system_message` that we want to use for this particular prompt.

In [1]:
eval_dict = {
    'metadata': {
        'uuid': 'f1b3b3b0-0b7b-4b1b-8b1b-0b1b0b1b0b1b',
        'name': 'Mask Emails',
        'description': 'Evaluates the ability of a model to mask emails in text',
    },
    'system_message': 'You are a helpful AI assistant.',
    'prompt_sequence': [
        {
            'prompt':
                """
                Create a python function called `mask_emails` that takes a single string and masks all
                emails in that string.

                For each email in the format of `x@y.z`, the local part (`x`) should be masked with
                [MASKED].
                """,
            'checks': [
                {
                    'check_type': 'REGEX',
                    'pattern': 'def mask_emails\\([a-zA-Z_]+(\\: str)?\\)( -> str)?\\:',
                },
                {
                    'check_type': 'PYTHON_CODE_BLOCK_TESTS',
                    'code_tests': [
                        'assert mask_emails("no email") == "no email"',
                        'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"',
                    ],
                },
            ],
        },
    ],
}
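
To see what the `REGEX` check's pattern accepts, the following minimal sketch applies the same pattern with Python's standard `re` module. It only illustrates the pattern itself. How `llm_eval` applies the pattern internally (for example, searching anywhere in the response versus matching from the start) is an assumption here, and the sample signatures are made up for illustration.

```python
import re

# Same pattern as in the REGEX check above, written as a raw string.
PATTERN = r'def mask_emails\([a-zA-Z_]+(\: str)?\)( -> str)?\:'

# Hypothetical function signatures an LLM might produce.
samples = [
    'def mask_emails(text: str) -> str:',  # typed parameter and return type
    'def mask_emails(input_string):',      # no type hints
    'def mask_emails(text, mask):',        # two parameters
]

# Assumption: the check passes if the pattern is found anywhere in the response.
matches = {s: re.search(PATTERN, s) is not None for s in samples}
for sample, matched in matches.items():
    print(f'{matched!s:<5} {sample}')
```

The pattern accepts exactly one parameter, with or without `str` type hints, so a response that defines additional parameters would not satisfy it.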

If we add this eval to the EvalHarness, we can see that the number of evals is `1`, as expected.

In [2]:
from llm_eval.eval import EvalHarness

harness = EvalHarness()
harness.add_evals(eval_dict)

print("# of Evals: ", len(harness.evals))
print("# of Candidates: ", len(harness.candidates))

# of Evals:  1
# of Candidates:  0


As mentioned, we set the system message:

In [3]:
print(eval_dict['system_message'])
print(harness.evals[0].system_message)

You are a helpful AI assistant.
You are a helpful AI assistant.


However, `system_message` can also be a list of system messages. Here we provide two, which generates two separate Evals that share all other information and differ only by the system message. This is useful for comparing the effectiveness of different system messages across the same prompt.

In [4]:
eval_dict['system_message'] = [
    'You are an expert python AI assistant and your goal is to generate very high quality code.',
    'Instead of responding with the answer, write one haiku about the wisdom of the question, and another about the solution.',
]

harness = EvalHarness()  # define a new harness
harness.add_evals(eval_dict)

print("# of Evals: ", len(harness.evals))
print("# of Candidates: ", len(harness.candidates))

# of Evals:  2
# of Candidates:  0


Now we have two Evals instead of one.

In [5]:
print("Eval 1")
print(harness.evals[0].metadata['uuid'])
print(harness.evals[0].system_message)
print(harness.evals[0].prompt_sequence[0].prompt[0:50] + '...')
print('---')
print("Eval 2")
print(harness.evals[1].metadata['uuid'])
print(harness.evals[1].system_message)
print(harness.evals[1].prompt_sequence[0].prompt[0:50] + '...')

Eval 1
f1b3b3b0-0b7b-4b1b-8b1b-0b1b0b1b0b1b
You are an expert python AI assistant and your goal is to generate very high quality code.
Create a python function called `mask_emails` that...
---
Eval 2
f1b3b3b0-0b7b-4b1b-8b1b-0b1b0b1b0b1b
Instead of responding with the answer, write one haiku about the wisdom of the question, and another about the solution.
Create a python function called `mask_emails` that...
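
Conceptually, a list under `system_message` fans out into one Eval definition per message. The sketch below mimics that behavior with plain dictionaries. `expand_system_messages` is a hypothetical helper written for illustration; it is not part of `llm_eval`, whose actual implementation may differ.

```python
from copy import deepcopy


def expand_system_messages(eval_dict: dict) -> list[dict]:
    """Return one eval dict per system message (hypothetical helper, not llm_eval API)."""
    messages = eval_dict.get('system_message')
    # A single string (or no system message) yields exactly one eval.
    if not isinstance(messages, list):
        return [deepcopy(eval_dict)]
    expanded = []
    for message in messages:
        variant = deepcopy(eval_dict)  # copy so the variants don't share nested objects
        variant['system_message'] = message
        expanded.append(variant)
    return expanded


example = {
    'metadata': {'name': 'Mask Emails'},
    'system_message': ['System Message 1', 'System Message 2'],
    'prompt_sequence': [{'prompt': 'Create a python function called `mask_emails`.'}],
}
variants = expand_system_messages(example)
print(len(variants))
print([v['system_message'] for v in variants])
```

Note that in the harness output above, both Evals carry the same `uuid` and prompt, so the system message is the only field that distinguishes them.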


In [6]:
from llm_eval.eval import EvalResult
import nest_asyncio

nest_asyncio.apply()  # needed for running async in jupyter notebook

def print_callback(result: EvalResult) -> None:
    print(result)
    print('---')

harness.add_candidates_from_yamls('candidates/openai_3.5.yaml')
harness.callback = print_callback
results = harness()

EvalResult:
    Candidate:                  OpenAI GPT-3.5-Turbo (0125)
    Eval:                        Mask Emails
    # of Prompts Tested:         1
    Cost:                       $0.0003
    Total Response Time:         3.4 seconds
    # of Response Characters:    635
    Characters per Second:       188.1
    # of Checks:                 2
    # of Successful Checks:      2
    % of Successful Checks:      100.0%
    # of Code Blocks Generated:  1
    # of Successful Code Blocks: 1
    # of Code Tests Defined:     2
    # of Successful Code Tests:  2
---
EvalResult:
    Candidate:                  OpenAI GPT-3.5-Turbo (0125)
    Eval:                        Mask Emails
    # of Prompts Tested:         1
    Cost:                       $0.0001
    Total Response Time:         1.6 seconds
    # of Response Characters:    139
    Characters per Second:       88.1
    # of Checks:                 2
    # of Successful Checks:      0
    % of Successful Checks:      0.0%
    # of Code

We can see in the results above that the LLM's response for the second Eval did not contain any code blocks. We can verify this by viewing the response for each Eval.

In [7]:
from IPython.display import display, Markdown

# the outer list corresponds to the candidates (we only have one candidate, ChatGPT 3.5)
# the inner list corresponds to the evals (we have two evals in this case)
first_eval_result = results[0][0]
display(Markdown(first_eval_result.responses[0]))

Here is the implementation of the `mask_emails` function:

```python
import re

def mask_emails(input_string):
    def mask_email(match):
        local_part = match.group(1)
        return "[MASKED]@" + match.group(2)

    email_pattern = r"(\S+?)@(\S+?\.\S+)"
    masked_string = re.sub(email_pattern, mask_email, input_string)
    
    return masked_string

# Test the function
input_string = "Please contact me at john.doe@example.com for further information."
masked_output = mask_emails(input_string)
print(masked_output)
```

You can use this function to mask emails in a given string by replacing the local part with `[MASKED]`.

In [8]:
second_eval_result = results[0][1]
display(Markdown(second_eval_result.responses[0]))

Inquiry of code,
Protecting emails with care,
Wisdom in question.

Emails masked with care,
Local part hidden from view,
Solution is clear.
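
The difference between the two results comes down to whether the response contains a runnable code block. The sketch below shows one way a check like `PYTHON_CODE_BLOCK_TESTS` could work: pull fenced Python blocks out of a Markdown response, execute them, and run each assertion. It is an illustration under stated assumptions, not `llm_eval`'s implementation. `extract_code_blocks` and `run_code_tests` are hypothetical names, the sample response is made up, and a real harness would need sandboxing before executing model-generated code.

```python
import re

FENCE = '`' * 3  # the Markdown code-fence delimiter, built from single backticks


def extract_code_blocks(response: str) -> list[str]:
    """Return the bodies of fenced python code blocks in a Markdown response."""
    pattern = re.compile(FENCE + r'python\n(.*?)' + FENCE, re.DOTALL)
    return pattern.findall(response)


def run_code_tests(response: str, code_tests: list[str]) -> list[bool]:
    """Execute all code blocks, then report pass/fail for each test statement."""
    namespace: dict = {}
    for block in extract_code_blocks(response):
        exec(block, namespace)  # WARNING: never run untrusted code outside a sandbox
    results = []
    for test in code_tests:
        try:
            exec(test, namespace)
            results.append(True)
        except Exception:  # AssertionError, or NameError when no function was defined
            results.append(False)
    return results


# Hypothetical response containing one code block.
code_response = (
    'Here is the function:\n\n'
    + FENCE + 'python\n'
    + 'import re\n\n'
    + 'def mask_emails(text):\n'
    + '    return re.sub(r"\\S+?@(\\S+?\\.\\S+)", r"[MASKED]@\\1", text)\n'
    + FENCE + '\n'
)
# A response with no code block, like the haiku above.
haiku_response = 'Inquiry of code,\nProtecting emails with care,\nWisdom in question.'

tests = [
    'assert mask_emails("no email") == "no email"',
    'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"',
]
print(run_code_tests(code_response, tests))
print(run_code_tests(haiku_response, tests))
```

With no code block to execute, `mask_emails` is never defined, so every test for the haiku-style response fails.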

---

## Example: Comparing different prompts using the same Checks

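The framework also allows users to compare different prompts against the same set of Checks by using the `prompt_comparison` key. Each entry in `prompts` generates a separate Eval, and every generated Eval uses the same `checks`. The optional `prompt_parameters` hold values (here, `few_shot_examples`) that are shared between the prompts.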

In [9]:
eval_dict = {
    'metadata': {
        'name': 'Mask Emails',
    },
    'system_message': 'You are a helpful AI assistant.',
    'prompt_comparison': {
        # optional 'parameters' to share between prompts
        'prompt_parameters' : {
            'few_shot_examples': """
                Few shot example 1: X
                Few shot example 2: Y
                Few shot example 3: Z
                """,
        },
        'prompts': [
            """
            Prompt A

            {few_shot_examples}
            """,
            """
            {few_shot_examples}

            Prompt B
            """,
        ],
        'checks': [
            {
                'check_type': 'REGEX',
                'pattern': 'def mask_emails\\([a-zA-Z_]+(\\: str)?\\)( -> str)?\\:',
            },
            {
                'check_type': 'PYTHON_CODE_BLOCK_TESTS',
                'code_tests': [
                    'assert mask_emails("no email") == "no email"',
                    'assert mask_emails("my email is a@b.c") == "my email is [MASKED]@b.c"',
                ],
            },
        ],
    },
}

In [10]:
harness = EvalHarness()  # define a new harness
harness.add_evals(eval_dict)

print("# of Evals: ", len(harness.evals))
print("# of Candidates: ", len(harness.candidates))
print('---')
print("Eval 1")
print(harness.evals[0].metadata['name'])
print(harness.evals[0].prompt_sequence[0].prompt[0:50] + '...')
print(harness.evals[0].prompt_sequence[0].checks[0])
print('---')
print("Eval 2")
print(harness.evals[1].metadata['name'])
print(harness.evals[1].system_message)
print(harness.evals[1].prompt_sequence[0].prompt[0:50] + '...')
print(harness.evals[1].prompt_sequence[0].checks[0])

# of Evals:  2
# of Candidates:  0
---
Eval 1
Mask Emails
Prompt A


    Few shot example 1: X
    Few shot ...
RegexCheck(pattern='def mask_emails\([a-zA-Z_]+(\: str)?\)( -> str)?\:', metadata={})
---
Eval 2
Mask Emails
You are a helpful AI assistant.
Few shot example 1: X
    Few shot example 2: Y
  ...
RegexCheck(pattern='def mask_emails\([a-zA-Z_]+(\: str)?\)( -> str)?\:', metadata={})
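
The output suggests that `{few_shot_examples}` in each prompt is replaced with the shared value from `prompt_parameters`. A minimal sketch of that substitution, using only `textwrap.dedent` and `str.format`, follows. `render_prompts` is a hypothetical helper, and the exact whitespace handling in `llm_eval` differs (the output above retains some indentation).

```python
from textwrap import dedent


def render_prompts(prompts: list[str], parameters: dict[str, str]) -> list[str]:
    """Fill shared parameters into each prompt template (hypothetical helper)."""
    # Dedent the parameter values and the templates separately so that indentation
    # from the surrounding Python source does not leak into the final prompt.
    clean_params = {key: dedent(value).strip() for key, value in parameters.items()}
    return [dedent(template).strip().format(**clean_params) for template in prompts]


prompt_parameters = {
    'few_shot_examples': """
        Few shot example 1: X
        Few shot example 2: Y
        Few shot example 3: Z
        """,
}
prompts = [
    """
    Prompt A

    {few_shot_examples}
    """,
    """
    {few_shot_examples}

    Prompt B
    """,
]
rendered = render_prompts(prompts, prompt_parameters)
for prompt in rendered:
    print(prompt)
    print('---')
```

Keeping the few-shot examples in one place means the two prompts differ only in where the examples appear, which is the variable being compared.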


---

Multiple system messages can also be combined with `prompt_comparison`. Setting two system messages on the Eval above, which compares two prompts, generates four Evals:

In [11]:
eval_dict['system_message'] = [
    'System Message 1',
    'System Message 2',
]
harness = EvalHarness()  # define a new harness
harness.add_evals(eval_dict)

print("# of Evals: ", len(harness.evals))

# of Evals:  4
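
The count of 4 is the product of the two lists: 2 system messages times 2 prompts. Assuming the framework builds one Eval per combination, `itertools.product` reproduces that count. The tuple layout and ordering below are purely illustrative and are not taken from `llm_eval`.

```python
from itertools import product

system_messages = ['System Message 1', 'System Message 2']
prompts = ['Prompt A', 'Prompt B']

# One (system_message, prompt) pair per Eval; assumed to mirror how the Evals are generated.
combinations = list(product(system_messages, prompts))
for system_message, prompt in combinations:
    print(f'{system_message} | {prompt}')
print('# of combinations:', len(combinations))
```

Because the number of Evals grows multiplicatively, adding a third system message to this example would yield six Evals per Candidate.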
