# Vijil Evaluate at-a-glance

Vijil Evaluate is Vijil’s flagship evaluation service, which enables AI developers to evaluate the trustworthiness of an LLM, generative AI application, or agent. Using the Evaluate API, developers can test an AI system for reliability, security, and safety under benign and hostile conditions in minutes.

This notebook gives an overview of the major features in Vijil Evaluate through our Python client.

## Getting Started
To set up your local environment, follow the steps [here](https://docs.vijil.ai/setup.html) to install the Python client and get an API key for the Evaluate Platform.

Once the Python client is installed, you can instantiate a client and store an API key for the provider that hosts your agent.

In [2]:
from vijil import Vijil

from dotenv import load_dotenv
load_dotenv()

client = Vijil()
client.api_keys.create(
    name="openai-test", 
    model_hub="openai", 
    api_key="sk+++" # replace with your own api key
)

{'id': '3bf0b10e-4d49-4d08-a626-3c1693b41922',
 'name': 'openai-test',
 'hub': 'openai',
 'rate_limit_per_interval': 60,
 'rate_limit_interval': 10,
 'display_value': 'sk*++',
 'hub_config': None}
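
The cell above calls `load_dotenv()` but then passes the key as a literal. A minimal sketch of reading the key from the environment instead is shown below. The helper `build_api_key_request` and the variable name `OPENAI_API_KEY` are assumptions made for illustration, not part of the Vijil client. Only the keyword names `name`, `model_hub`, and `api_key` come from the call above.

```python
import os


def build_api_key_request(name, model_hub, env_var, environ=None):
    """Build keyword arguments for client.api_keys.create from an environment variable.

    Hypothetical helper: it only assembles a dict and never talks to the API.
    """
    environ = os.environ if environ is None else environ
    api_key = environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set {env_var} (for example in your .env file) before registering the key."
        )
    return {"name": name, "model_hub": model_hub, "api_key": api_key}


# Intended usage, assuming OPENAI_API_KEY is defined in your .env file:
# client.api_keys.create(**build_api_key_request("openai-test", "openai", "OPENAI_API_KEY"))
```

Keeping the key out of the notebook source means it is less likely to end up in version control or shared output.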

You are now ready to kick off an evaluation job!

## Trust Score
Vijil evaluates LLMs, AI applications, and agents for task-worthiness (along 5 dimensions of performance) and trustworthiness (along 8 dimensions of trust). For each dimension of trust, we assess vulnerability to several attack vectors and the propensity to violate areas of compliance. Each attack vector is treated as one evaluation module. Results are summarized into a Vijil Trust Score.

The following command kicks off a full trust evaluation job on GPT-4o-mini, setting temperature at 0.

In [3]:
evaluation = client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},  # matches the temperature setting described above
    harnesses=["trust_score"],
)

Error: 400 Client Error: Bad Request for url: https://evaluate-api.vijil.ai/v1/evaluations based on response: {"detail":"Resource not found: No harness config found by the name vijil.harnesses.security_Small and version 1.2.44."}


To keep tabs on the progress of the job, you can use the `get_status` command or the UI. After the evaluation finishes, use the command again to retrieve the Trust Score for the LLM you tested.

In [45]:
client.evaluations.get_status(evaluation_id=evaluation["id"])

{'id': '7eb423be-373c-4b2e-81ec-7abf196b51dd',
 'name': 'OpenAI-gpt-4o-mini-04/11/2025, 10:20:35',
 'tags': [''],
 'status': 'COMPLETED',
 'cause': None,
 'total_test_count': 711,
 'completed_test_count': 711,
 'error_test_count': 0,
 'total_response_count': 711,
 'completed_response_count': 622,
 'error_response_count': 89,
 'total_generation_time': '96.000000',
 'average_generation_time': '5.6540084388185654',
 'score': 0.7941188448718497,
 'status_counts': {'probes': {'COMPLETED': 97},
  'tests': {'GENERATED': 711},
  'responses': {'ERROR': 89, 'COMPLETED': 622}},
 'hub': 'openai',
 'model': 'gpt-4o-mini',
 'url': '',
 'created_at': 1744392036,
 'created_by': '887ef7e6-565b-454e-8dae-277643d6dbab',
 'completed_at': 1744392138,
 'team_id': 'ef6bdaec-f563-487c-b036-674c912da053',
 'restart_count': 0,
 'metadata': None,
 'completion_tokens': 73507,
 'prompt_tokens': 83806,
 'total_tokens': 157313}
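
The status payload already supports some quick arithmetic. The sketch below derives a response error rate and an elapsed time from a subset of the fields shown above. `summarize_status` is a hypothetical helper, and it assumes `created_at` and `completed_at` are Unix timestamps in seconds, which the values above appear to be.

```python
def summarize_status(status):
    """Derive a few convenience numbers from a get_status payload (hypothetical helper)."""
    total = status["total_response_count"]
    errors = status["error_response_count"]
    completed_at = status.get("completed_at")
    return {
        "is_finished": status["status"] == "COMPLETED",
        # Fraction of responses that ended in an error.
        "response_error_rate": errors / total if total else 0.0,
        # Assumes both timestamps are Unix seconds.
        "wall_clock_seconds": (completed_at - status["created_at"]) if completed_at else None,
        "score": status.get("score"),
    }


# Subset of the payload shown above.
status = {
    "status": "COMPLETED",
    "total_response_count": 711,
    "error_response_count": 89,
    "score": 0.7941188448718497,
    "created_at": 1744392036,
    "completed_at": 1744392138,
}

summary = summarize_status(status)
print(summary["wall_clock_seconds"])             # 102
print(round(summary["response_error_rate"], 3))  # 0.125
```

A non-trivial error rate such as the one here (89 of 711 responses) is worth checking before comparing scores across runs.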

To get summarized scores at different levels of granularity, you can use `client.evaluations.summarize`. To get prompt-response level logs, you can use `client.evaluations.describe`.

## Benchmarks
To quickly test an LLM or agent on well-known benchmarks, Vijil Evaluate provides 21 benchmarks across reliability (e.g. [OpenLLM](https://huggingface.co/open-llm-leaderboard), [OpenLLM v2](https://huggingface.co/collections/open-llm-leaderboard/open-llm-leaderboard-2-660cdb7601eba6852431fffc)), security (e.g. [garak](https://garak.ai/), [CyberSecEval 3](https://ai.meta.com/research/publications/cyberseceval-3-advancing-the-evaluation-of-cybersecurity-risks-and-capabilities-in-large-language-models/)), and safety (e.g. [StrongReject](https://arxiv.org/abs/2402.10260), [JailbreakBench](https://arxiv.org/abs/2404.01318)).

The following command lists available benchmarks in Evaluate.

In [4]:
client.harnesses.list(type="benchmark")

|   | id | name | description |
| --- | --- | --- | --- |
| 0 | HarmBench | Evaluate LLMs across harmful behaviours | Using the HarmBench benchmark to evaluate LLMs... |
| 1 | GSM8k | GSM8k | Grade school math word problems created by hum... |
| 2 | winobias | Professional Bias | Checks if the model has gender stereotypes abo... |
| 3 | HellaSwag | HellaSwag | Trivial for humans but difficult for state-of-... |
| 4 | bbq | Question Answering Bias | Using the BBQ benchmark, Measure bias in quest... |
| 5 | strong_reject | Strong Reject | Harness for Strong Reject prompts, including o... |
| 6 | do_not_answer | Do Not Answer | Using the DoNotAnswer benchmark to capture a r... |
| 7 | advglue | Adversarial GLUE | An adversarial version of GLUE benchmark that ... |
| 8 | Winogrande | Winogrande | An an adversarial and difficult Winograd bench... |
| 9 | realtoxicityprompts | Real Toxicity Prompts | Gets the model to output toxic responses. |
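
Because harnesses are requested by the value in the `id` column, it can help to check names against the listing before submitting a job. The sketch below does that with plain Python. `missing_harnesses` is a hypothetical helper, and the IDs are copied from the ten rows shown above, which may be only part of the full catalog.

```python
def missing_harnesses(requested, available_ids):
    """Return the requested harness names that do not appear in the listing."""
    available = set(available_ids)
    return [name for name in requested if name not in available]


# IDs copied from the listing above; the full catalog may contain more entries.
listed_ids = [
    "HarmBench", "GSM8k", "winobias", "HellaSwag", "bbq",
    "strong_reject", "do_not_answer", "advglue", "Winogrande",
    "realtoxicityprompts",
]

print(missing_harnesses(["do_not_answer", "CyberSecEval3"], listed_ids))
# prints ['CyberSecEval3']
```

A name flagged here is not necessarily invalid. It only means the name was absent from the rows you checked against. Note that the comparison is case-sensitive, matching the mixed casing of the IDs above.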


To run one or more benchmarks, simply supply the name(s) inside the `harnesses` parameter.

In [None]:
client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},
    harnesses=["CyberSecEval3","do_not_answer"],
)

{'id': '29d604ff-2c3b-4a91-8a79-5fbb6af888bb', 'status': 'CREATED'}
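
The `create` call returns right away with status `CREATED`, so a script usually needs to wait before reading scores. The following is a minimal polling sketch. `wait_for_completion` is a hypothetical helper. It assumes only that the status callable accepts `evaluation_id=` and returns a dict with a `status` key, as `client.evaluations.get_status` does in this notebook. `COMPLETED` is the only finished state shown here, so pass any other terminal states (such as failure states) you know of through `terminal_statuses`.

```python
import time


def wait_for_completion(
    get_status,
    evaluation_id,
    poll_seconds=10.0,
    timeout_seconds=1800.0,
    terminal_statuses=("COMPLETED",),
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll get_status until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status(evaluation_id=evaluation_id)
        if status["status"] in terminal_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Evaluation {evaluation_id} still has status {status['status']!r} "
                f"after {timeout_seconds} seconds."
            )
        sleep(poll_seconds)


# Intended usage with the client from this notebook:
# final_status = wait_for_completion(client.evaluations.get_status, evaluation["id"])
# print(final_status["score"])
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays. In normal use the defaults are fine.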

## Custom Harness

Besides a variety of pre-configured harnesses, you can also create your own harnesses in order to obtain a trust score specific to your organization.

You can create a custom policy adherence harness that checks whether your model adheres to its system prompt or an organizational policy. To do this, you need a system prompt specified as a string, and an optional organizational policy provided as a `.txt` or `.pdf` file. If you don't provide a policy file, we will create a harness based only on the provided system prompt. To request a policy adherence harness, set the `category` argument to `["AGENT_POLICY"]`.

The following example uses the `harnesses.create` function to create a harness that tests adherence to the NIST [AI Risk Management](https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf) framework.

In [54]:
harness_creation_job = client.harnesses.create(
    name="NIST AI RMF harness",
    system_prompt="You are a helpful assistant.", 
    policy_file_path="nist.ai.100-1.pdf", # download this file from the link above and store it first
    category=["AGENT_POLICY"]
)
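
The rules for a policy adherence harness (a required system prompt, an optional `.txt` or `.pdf` policy file, and the `AGENT_POLICY` category) can be checked locally before calling the API. The sketch below is one way to do that. `policy_harness_kwargs` is a hypothetical helper, and leaving out `policy_file_path` when there is no policy file is an assumption about how the client expects the prompt-only case to be expressed.

```python
from pathlib import Path

ALLOWED_POLICY_SUFFIXES = {".txt", ".pdf"}


def policy_harness_kwargs(name, system_prompt, policy_file_path=None):
    """Assemble keyword arguments for client.harnesses.create (hypothetical helper)."""
    if not system_prompt.strip():
        raise ValueError("system_prompt must be a non-empty string")
    kwargs = {
        "name": name,
        "system_prompt": system_prompt,
        "category": ["AGENT_POLICY"],
    }
    if policy_file_path is not None:
        suffix = Path(policy_file_path).suffix.lower()
        if suffix not in ALLOWED_POLICY_SUFFIXES:
            raise ValueError(f"Policy file must be .txt or .pdf, got {suffix!r}")
        kwargs["policy_file_path"] = str(policy_file_path)
    return kwargs


# Intended usage with the client from this notebook:
# client.harnesses.create(**policy_harness_kwargs(
#     "NIST AI RMF harness", "You are a helpful assistant.", "nist.ai.100-1.pdf"))
```

The check looks only at the file extension. It does not confirm that the file exists or can be parsed.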

You can use the `get_status` command to check the status of a harness creation job.

In [60]:
client.harnesses.get_status(harness_id=harness_creation_job['harness_config_id'])

{'created_by': '887ef7e6-565b-454e-8dae-277643d6dbab',
 'created_at': 1744392229,
 'id': 'e3ca1392-398e-40f2-a6f6-7b2632062316',
 'harness_config_id': 'c527607c-58c5-4d78-a3fb-d1b25cd93f7c',
 'harness_name': 'NIST AI RMF harness',
 'team_id': 'ef6bdaec-f563-487c-b036-674c912da053',
 'status': 'COMPLETED',
 'status_message': 'Harness NIST AI RMF harness created successfully!',
 'agent_system_prompt': 'You are a helpful assistant.',
 'started_at': None,
 'completed_at': None,
 'harness_config_version': '1.0.0'}

Once the harness is created, you can [run an evaluation](evaluations.md#create-an-evaluation) with it:

In [62]:
client.evaluations.create(
    harnesses=[harness_creation_job['harness_config_id']],
    model_hub="openai",
    model_name="gpt-4o-mini",
    harness_version="1.0.0",
)

{'id': '8125a5f5-5322-4a29-b75e-f0e2aeb21e4e', 'status': 'CREATED'}

For agents with knowledge bases (category `KNOWLEDGE_BASE`) and/or tools (category `FUNCTION_ROUTE`) attached to them, you can also create custom harnesses by specifying pointers to the knowledge base or tools. See [this tutorial](./harness_create.ipynb) for more details on how to do this.