# Running a Test Suite on an External Function

## Context
Vellum Test Suites provide a framework for performing quantitative evaluation on AI applications at scale. You can use them to measure the quality of Prompts, Workflows, and even custom functions defined outside of Vellum in your codebase!

This example details how to use Vellum Test Suites to run evals on an external function.


## Prerequisites
1. A Vellum account
2. A Vellum API key, which can be created at [https://app.vellum.ai/api-keys](https://app.vellum.ai/api-keys)
3. The `vellum-ai` pip package. We'll also use `getpass` in this notebook to enter your Vellum API key without echoing it; it ships with the Python standard library, so it does not need to be installed.




In [1]:
!pip install vellum-ai


In [2]:
from getpass import getpass

VELLUM_API_KEY = getpass()

 ········
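
When the notebook runs unattended (for example in CI), an interactive prompt is not available. The sketch below is one possible alternative, not part of the Vellum SDK: it reads the key from an environment variable and falls back to `getpass` only when the variable is missing. The helper name `load_api_key` and the variable name `VELLUM_API_KEY` are assumptions made for illustration.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def load_api_key(
    env_var: str = "VELLUM_API_KEY",  # assumed variable name, chosen for this sketch
    environ: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, prompting only when it is absent."""
    env = os.environ if environ is None else environ
    key = env.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive, non-echoing prompt.
        key = prompt(f"{env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    return key


# Demonstration with an explicit mapping so nothing is read from the real environment.
demo_key = load_api_key(environ={"VELLUM_API_KEY": "example-key"})
```

In the notebook itself you would call `load_api_key()` with no arguments and assign the result to `VELLUM_API_KEY`.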


## Test Suite Set Up
To run evals on your external function, you must first configure a Test Suite through the Vellum web application at [https://app.vellum.ai/test-suites](https://app.vellum.ai/test-suites).

Note that the Test Suite's "Execution Interface" must match that of the function that you'd like to evaluate. For example, if your function looks like:

```python
def my_function(arg_1: str, arg_2: str) -> str:
    pass
```

Then you will want your Test Suite's Execution Interface to look like this:
![Test Suite Execution Interface](images/test-suite-execution-interface.png)

## Getting Started

Now that everything is set up, it's time to write some code! First, we define the function whose output we want to evaluate. It receives the input values of a Test Case and returns a list of named outputs. The cells that follow then invoke the Test Suite against it.

The example function here simply concatenates its input values, but this code could do anything, including calling a Vellum Workflow or a Prompt Chain built with a third-party library.

In [3]:
from vellum.types.named_test_case_variable_value_request import NamedTestCaseVariableValueRequest, NamedTestCaseStringVariableValueRequest
from vellum.types.test_case_variable_value import TestCaseVariableValue

def external_execution(inputs: list[TestCaseVariableValue]) -> list[NamedTestCaseVariableValueRequest]:
    output_value = "".join([variable.value for variable in inputs])
    output = NamedTestCaseStringVariableValueRequest(
        type="STRING",
        value=output_value,
        name="output"
    )
    return [output]
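
If the function you want to evaluate takes named arguments, as `my_function(arg_1, arg_2)` does in the Test Suite Set Up section, a thin adapter can map each input variable to a keyword argument. The following is a minimal sketch under stated assumptions: `InputVariable` is a hypothetical stand-in for the SDK's input type (assumed to expose `name` and `value`), the output is a plain dict rather than a `NamedTestCaseStringVariableValueRequest`, and `make_adapter` is a name invented here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InputVariable:
    """Hypothetical stand-in for a Test Case input; assumes `name` and `value` fields."""
    name: str
    value: str


def my_function(arg_1: str, arg_2: str) -> str:
    # Placeholder body for the function under evaluation.
    return f"{arg_1}, {arg_2}!"


def make_adapter(fn: Callable[..., str]) -> Callable[[list[InputVariable]], list[dict]]:
    """Wrap `fn` so it accepts a list of named inputs and returns a list of named outputs."""
    def adapter(inputs: list[InputVariable]) -> list[dict]:
        # Map each variable name to the matching keyword argument of `fn`.
        kwargs = {variable.name: variable.value for variable in inputs}
        # Plain dict standing in for the SDK's output request object.
        return [{"type": "STRING", "name": "output", "value": fn(**kwargs)}]
    return adapter


adapted = make_adapter(my_function)
example_output = adapted([InputVariable("arg_1", "Hello"), InputVariable("arg_2", "world")])
print(example_output)
```

With the real SDK types, the adapter body would build the output object the same way `external_execution` does above. The mapping only works if the variable names in the Execution Interface match the function's parameter names.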

In [8]:
TEST_SUITE_ID = input()

 9580c9c2-ed5d-4206-a103-d6e487f6a54b


In [9]:
from vellum.client import Vellum
from vellum.lib.test_suites import VellumTestSuite


# Create a new VellumTestSuite object
client = Vellum(api_key=VELLUM_API_KEY)
test_suite = VellumTestSuite(test_suite_id=TEST_SUITE_ID, client=client)

## Running Evals

Here is where we actually trigger the Test Suite and pass in our executable function.

In [10]:
# Run the external execution
results = test_suite.run_external(executable=external_execution)

In [11]:
# Filter down to a specific metric and a specific output that it produces.
results.get_metric_outputs("Exact Match", "score")

[TestSuiteRunMetricNumberOutput(value=1.0, name='score', type='NUMBER'),
 TestSuiteRunMetricNumberOutput(value=0.0, name='score', type='NUMBER'),
 TestSuiteRunMetricNumberOutput(value=1.0, name='score', type='NUMBER')]

## Operating on the Results

Above, we use the `get_metric_outputs` function to retrieve every `score` output of the `Exact Match` metric.

Note that under the hood, this function calls `wait_until_complete` to wait until the Test Suite Run has finished running.
You can also call `wait_until_complete` explicitly ahead of time if you like.

`get_metric_outputs` is the primary way to interact with the outputs of a specified metric. With it, you can
perform a variety of assertions to enforce whatever quality thresholds you like.

If you want to operate directly on the raw executions for ultimate flexibility, use `results.all_executions`.
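
To make the idea of quality thresholds concrete without depending on the SDK, here is a small sketch that works on a plain list of scores. The function name `evaluate_thresholds` and its default thresholds are assumptions for illustration; the scores are the three `score` values shown in the output above.

```python
from statistics import mean


def evaluate_thresholds(
    scores: list[float],
    pass_score: float = 1.0,     # a score at or above this counts as a pass
    min_pass_rate: float = 0.5,  # fraction of passing cases that must be exceeded
    min_mean: float = 0.5,       # mean score that must be exceeded
) -> dict[str, bool]:
    """Return one boolean per quality check for the given scores."""
    if not scores:
        raise ValueError("no scores to evaluate")
    pass_rate = sum(score >= pass_score for score in scores) / len(scores)
    return {
        "all_pass": all(score >= pass_score for score in scores),
        "pass_rate_ok": pass_rate > min_pass_rate,
        "mean_ok": mean(scores) > min_mean,
    }


# Scores copied from the `get_metric_outputs` output shown earlier.
checks = evaluate_thresholds([1.0, 0.0, 1.0])
print(checks)
```

In a CI job, you could fail the build whenever any of the checks you care about is `False`.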

In [12]:
def print_result(msg: str, result: bool) -> None:
    print(msg, "Yes" if result else "No")

# Example of asserting that every Test Case passes
all_test_cases_pass = all([result.value == 1.0 for result in results.get_metric_outputs("exact-match", "score")])
print_result("Do all Test Cases pass?", all_test_cases_pass)

# Example of asserting that more than 50% of results have a score at or above a specified threshold
num_test_cases_passing = results.get_count_metric_outputs("exact-match", "score", predicate=lambda x: x.value >= 0.5)
num_test_cases_total = results.get_count_metric_outputs("exact-match", "score")
percent_test_cases_passing = num_test_cases_passing / num_test_cases_total
print_result(f"{percent_test_cases_passing * 100}% of Test Cases pass. Acceptable?", percent_test_cases_passing > 0.5)

# Example of asserting that the average score is greater than a specified threshold
avg_score_acceptable = results.get_mean_metric_output("exact-match", "score") > 0.5
print_result("Is the average score acceptable?", avg_score_acceptable)

# Example of asserting that the min score is greater than a specified threshold
min_score_acceptable = results.get_min_metric_output("exact-match", "score") > 0.5
print_result("Is the minimum score acceptable?", min_score_acceptable)

# Example of asserting that the max score is greater than a specified threshold
max_score_acceptable = results.get_max_metric_output("exact-match", "score") > 0.75
print_result("Is the maximum score acceptable?", max_score_acceptable)

# Print out all results
results.all_executions

Do all Test Cases pass? No
66.66666666666666% of Test Cases pass. Acceptable? Yes
Is the average score acceptable? Yes
Is the minimum score acceptable? No
Is the maximum score acceptable? Yes


[VellumTestSuiteRunExecution(id='95431327-d864-4cc1-bd47-50430c39ebeb', test_case_id='99971a73-429d-4a28-9003-afbe5cadb868', outputs=[TestSuiteRunExecutionOutput_String(name='output', value='Hello, world!', output_variable_id='c3f48fd5-6df7-4116-bd69-fb624d8d7d88', type='STRING')], metric_results=[TestSuiteRunExecutionMetricResult(metric_id='c4ac96a5-2101-4e1e-8dfb-3fccdc1ebde0', outputs=[TestSuiteRunMetricNumberOutput(value=1.0, name='score', type='NUMBER'), TestSuiteRunMetricNumberOutput(value=1.0, name='normalized_score', type='NUMBER')], metric_label='Exact Match', metric_definition=TestSuiteRunExecutionMetricDefinition(id='9a8a4c32-0258-41be-beac-063628fe50e6', label='Exact Match', name='exact-match'))]),
 VellumTestSuiteRunExecution(id='363833cf-4b0f-4993-876f-ed578d792414', test_case_id='3fdb81b7-2147-42c8-92b2-b8f322ad9853', outputs=[TestSuiteRunExecutionOutput_String(name='output', value='Failingtest', output_variable_id='c3f48fd5-6df7-4116-bd69-fb624d8d7d88', type='STRING')],