# Vijil Evaluate at a Glance

Vijil Evaluate is Vijil’s flagship evaluation service, which enables AI developers to evaluate the trustworthiness of an LLM, generative AI application, or agent. Using the Evaluate API, developers can test an AI system under benign and hostile conditions in minutes for reliability, security, and safety.

This notebook gives an overview of the major features in Vijil Evaluate through our Python client.

## Getting Started
To set up your local environment, follow the steps [here](https://docs.vijil.ai/setup.html) to install the Python client and get an API key for the Evaluate Platform.

Once the Python client is installed, you can instantiate a client and store an API key for the provider that hosts your agent.

In [None]:
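from vijil import Vijil

from dotenv import load_dotenv

# Load environment variables from a local .env file, if one exists
load_dotenv()

client = Vijil()

# Store the model provider's API key under a name you can recognize later
client.api_keys.create(
    name="openai-test",
    model_hub="openai",
    api_key="sk+++"  # replace with your own API key
)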
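If you would rather not paste a provider key into the notebook, one option is to read it from the environment that `load_dotenv()` has already populated. The sketch below is only an illustration: `require_env` is a hypothetical helper, and the variable name `OPENAI_API_KEY` is an assumption about how you named the entry in your `.env` file.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Add it to your .env file or export it in your shell."
        )
    return value


# Hypothetical usage with the client created above; OPENAI_API_KEY is an
# assumed variable name, not something the Vijil client requires:
# client.api_keys.create(
#     name="openai-test",
#     model_hub="openai",
#     api_key=require_env("OPENAI_API_KEY"),
# )
```

Failing early with a named variable tends to be easier to debug than an authentication error returned later by the evaluation job.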

You are now ready to kick off an evaluation job!

## Trust Score
Vijil evaluates LLMs, AI applications, and agents for task-worthiness (along 5 dimensions of performance) and trustworthiness (along 8 dimensions of trust). For each dimension of trust, we assess vulnerability to several attack vectors and the propensity to violate areas of compliance. Each attack vector is treated as one evaluation module. Results are summarized into a Vijil Trust Score.

The following command kicks off a full trust evaluation job on GPT-4o-mini, setting temperature at 0.

In [12]:
evaluation = client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},  # temperature 0, as described above
    harnesses=["trust_score"]
)

To keep tabs on the progress of the job, you can use the `get_status` command or the UI. After the evaluation finishes, run the command again to retrieve the Trust Score (the `score` field) for the LLM you tested.

In [20]:
client.evaluations.get_status(evaluation_id=evaluation["id"])

{'id': 'c46cffb3-89cc-4def-943d-2ecccccf33b4',
 'name': 'OpenAI-gpt-4o-mini-04/04/2025, 15:09:24',
 'tags': [''],
 'status': 'COMPLETED',
 'cause': None,
 'total_test_count': 711,
 'completed_test_count': 711,
 'error_test_count': 0,
 'total_response_count': 711,
 'completed_response_count': 622,
 'error_response_count': 89,
 'total_generation_time': '35.000000',
 'average_generation_time': '3.5344585091420534',
 'score': 0.7845085135235873,
 'status_counts': {'probes': {'COMPLETED': 97},
  'tests': {'GENERATED': 711},
  'responses': {'ERROR': 89, 'COMPLETED': 622}},
 'hub': 'openai',
 'model': 'gpt-4o-mini',
 'url': '',
 'created_at': 1743804564,
 'created_by': '887ef7e6-565b-454e-8dae-277643d6dbab',
 'completed_at': 1743804619,
 'team_id': 'ef6bdaec-f563-487c-b036-674c912da053',
 'restart_count': 0,
 'metadata': None,
 'completion_tokens': 73072,
 'prompt_tokens': 83806,
 'total_tokens': 156878}
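Rather than re-running `get_status` by hand, you can wrap it in a small polling loop. The following is a minimal sketch, not part of the Vijil client: `wait_for_status` is a hypothetical helper that accepts any zero-argument function returning a status dictionary like the one above. Only the `COMPLETED` status is confirmed by the output shown here; if your jobs can end in other states, pass them in `terminal_states` so the loop does not wait for the timeout.

```python
import time
from typing import Any, Callable, Dict, Iterable


def wait_for_status(
    fetch_status: Callable[[], Dict[str, Any]],
    terminal_states: Iterable[str] = ("COMPLETED",),
    poll_seconds: float = 10.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until its 'status' field is terminal."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("status") in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still in state {status.get('status')!r} "
                f"after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)


# Hypothetical usage with the client and evaluation from the cells above:
# final = wait_for_status(
#     lambda: client.evaluations.get_status(evaluation_id=evaluation["id"])
# )
# print(final["score"])
```

Passing the fetch function in as an argument keeps the loop independent of the client, so the same helper can be reused for other status calls.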

To get summarized scores at different levels of granularity, you can use `client.evaluations.summarize`. To get prompt-response level logs, you can use `client.evaluations.describe`.
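Before requesting summaries, it can be worth reading a couple of health indicators straight from the status dictionary returned by `get_status`. The sketch below uses only fields that appear in the output above; `run_health` is a hypothetical helper, and treating `created_at` and `completed_at` as Unix timestamps in seconds is an assumption based on their values.

```python
from typing import Any, Dict


def run_health(status: Dict[str, Any]) -> Dict[str, float]:
    """Derive simple health indicators from a get_status dictionary."""
    total = status["total_response_count"]
    errors = status["error_response_count"]
    return {
        # Share of responses that ended in an error
        "response_error_rate": errors / total if total else 0.0,
        # Elapsed time, assuming both timestamps are Unix seconds
        "wall_clock_seconds": float(status["completed_at"] - status["created_at"]),
    }


# Values copied from the status output shown earlier
example_status = {
    "total_response_count": 711,
    "error_response_count": 89,
    "created_at": 1743804564,
    "completed_at": 1743804619,
}

health = run_health(example_status)
print(f"{health['response_error_rate']:.1%} of responses errored")
print(f"Job took {health['wall_clock_seconds']:.0f} seconds")
```

For the run above, 89 of 711 responses errored (about 12.5%) and the job took 55 seconds. A high error rate is worth keeping in mind when you interpret the score.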

## Benchmarks
To quickly test an LLM or agent on well-known benchmarks, Vijil Evaluate offers 21 benchmarks across reliability (e.g. [OpenLLM](https://huggingface.co/open-llm-leaderboard), [OpenLLM v2](https://huggingface.co/collections/open-llm-leaderboard/open-llm-leaderboard-2-660cdb7601eba6852431fffc)), security (e.g. [garak](https://garak.ai/), [CyberSecEval 3](https://ai.meta.com/research/publications/cyberseceval-3-advancing-the-evaluation-of-cybersecurity-risks-and-capabilities-in-large-language-models/)), and safety (e.g. [StrongReject](https://arxiv.org/abs/2402.10260), [JailbreakBench](https://arxiv.org/abs/2404.01318)).

The following command lists available benchmarks in Evaluate.

In [2]:
client.harnesses.list(type="benchmark")

| | id | name | description |
|---|---|---|---|
| 0 | ARC | ARC (AI2 Reasoning Challenge) | Grade-school science questions (7,787) written... |
| 1 | SAP | Semi-automatic Attack Prompts | Use the SAP benchmark (https://arxiv.org/abs/2... |
| 2 | strong_reject | Strong Reject | Harness for Strong Reject prompts, including o... |
| 3 | ETHICS_original | ETHICS Benchmark | Probes from the ETHICS benchmark. |
| 4 | realtoxicityprompts | Real Toxicity Prompts | Gets the model to output toxic responses. |
| 5 | socialstigma_qa | Social Stigmas | Using the SocialStigmaQA benchmark to capture ... |
| 6 | TruthfulQA | TruthfulQA | Measures a model's propensity to reproduce fal... |
| 7 | do_not_answer | Do Not Answer | Using the DoNotAnswer benchmark to capture a r... |
| 8 | Winogrande | Winogrande | An an adversarial and difficult Winograd bench... |
| 9 | garak | garak | Probes from the open-source garak package. |


To run one or more benchmarks, supply their identifiers (for example `strong_reject`) in the `harnesses` parameter.

In [None]:
client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},
    harnesses=["CyberSecEval3","strong_reject"]
)
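A misspelled harness identifier is easy to miss until the job is submitted, so you may want to compare your request against the listing first. The sketch below is a plain exact-match check, independent of the Vijil client: `unknown_harness_ids` is a hypothetical helper, and the identifiers are the ten rows shown in the listing above, so in practice you would pass the full set of ids returned by your own `client.harnesses.list` call. Whether the service itself treats identifiers as case-sensitive is not stated here; the check assumes it does.

```python
from typing import Iterable, List


def unknown_harness_ids(requested: Iterable[str], available_ids: Iterable[str]) -> List[str]:
    """Return the requested ids that do not appear in the available ids."""
    available = set(available_ids)
    return [harness_id for harness_id in requested if harness_id not in available]


# Ids copied from the rows displayed above (possibly a partial listing)
listed_ids = [
    "ARC", "SAP", "strong_reject", "ETHICS_original", "realtoxicityprompts",
    "socialstigma_qa", "TruthfulQA", "do_not_answer", "Winogrande", "garak",
]

# "truthfulqa" differs from the listed "TruthfulQA" only by case
missing = unknown_harness_ids(["strong_reject", "truthfulqa"], listed_ids)
if missing:
    print("Not found in listing:", missing)
```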

## Custom Harness

Besides a variety of pre-configured harnesses, you can also create your own harnesses in order to obtain a trust score specific to your organization.

You can create a custom policy adherence harness that checks whether your model adheres to its system prompt or an organizational policy. To do this, you need a system prompt specified as a string, and an optional organizational policy provided as a `.txt` or `.pdf` file. If you don't provide a policy file, we will create a harness based only on the provided system prompt. To request a policy adherence harness, set the `category` argument to `["AGENT_POLICY"]`.

The following example uses the `harnesses.create` function to create a harness that tests adherence to the NIST [AI Risk Management](https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf) framework.

In [None]:
harness_creation_job = client.harnesses.create(
    harness_name="NIST AI RMF harness",
    system_prompt="You are a helpful assistant.", 
    policy_file_path="nist.ai.100-1.pdf", # download this file from the link above and store it first
    category=["AGENT_POLICY"]
)
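Because the policy must be a `.txt` or `.pdf` file that already exists locally, a quick pre-flight check can catch a wrong path or file type before the harness creation job is submitted. This is a minimal sketch; `check_policy_file` is a hypothetical helper and not part of the Vijil client, which may perform its own validation.

```python
from pathlib import Path

# File types named in the text above
ALLOWED_POLICY_SUFFIXES = {".txt", ".pdf"}


def check_policy_file(path: str) -> Path:
    """Return the path if it is an existing .txt or .pdf file, else raise."""
    policy = Path(path)
    if policy.suffix.lower() not in ALLOWED_POLICY_SUFFIXES:
        raise ValueError(
            f"Policy file must be .txt or .pdf, got {policy.suffix or 'no extension'!r}"
        )
    if not policy.is_file():
        raise FileNotFoundError(f"Policy file not found: {policy}")
    return policy


# Hypothetical usage with the call shown above:
# policy_path = check_policy_file("nist.ai.100-1.pdf")
# client.harnesses.create(..., policy_file_path=str(policy_path), ...)
```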

You can use the `get_status` command to check the status of a harness creation job.

In [None]:
client.harnesses.get_status(harness_id=harness_creation_job['harness_config_id'])

The `harness_config_version` starts at 1.0.0 for the first harness created under a given name. If you create another harness with the same name, Vijil automatically increments the version, e.g. from 1.0.0 to 1.0.1. In the above example, we assume that `NIST AI RMF harness` is a new harness name, so its version is 1.0.0.
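To make the numbering concrete, the snippet below reproduces the 1.0.0 to 1.0.1 step as a patch-level increment. It is only an illustration of that single example: `next_patch_version` is a hypothetical function, and how Vijil handles the minor and major components is not described here.

```python
def next_patch_version(version: str) -> str:
    """Increment the last component of a 'major.minor.patch' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


print(next_patch_version("1.0.0"))  # prints 1.0.1
```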

Once the harness is created, you can [run an evaluation](evaluations.md#create-an-evaluation) with it:

In [None]:
client.evaluations.create(
    harnesses=[harness_creation_job['harness_config_id']],
    model_hub="your_model_hub",
    model_name="your_model"
)

For agents with knowledge bases and/or tools attached to them, we also allow creating custom harnesses by specifying pointers to the knowledge base/tools. See [this tutorial](./harness_create.ipynb) for more details on how to do this.