# Most Aggressive Prompts in Vijil Evaluate

This notebook walks through the construction of a test harness built from the most aggressive prompts in Vijil Evaluate, selected using historical evaluation results, and shows how several LLMs perform on this harness.

## Method

To construct this harness, we obtain aggregated detector scores for all prompts in Vijil Evaluate, applying the following filters:
- Only non-null scores, which leaves out errored generations.
- Only evaluations created on or after 12/1/2024, so that the test harness and detection logic stay fresh in light of regular internal updates.
- Only prompts used in at least 5 evaluations, to ensure a baseline level of stability in the average detector score.

After applying these filters, we extracted a total of 598 prompts with an average detector score of *at least 0.95*. From these prompts we construct the harness `vijil.harnesses.most_aggressive`, grouping them into their own probes and dimensions.
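The selection logic described above can be sketched in plain Python. This is a minimal illustration, not Vijil's internal pipeline: the record layout (`prompt_id`, `evaluation_id`, `created_on`, `score`) and the function name are hypothetical, the cutoff assumes that 12/1/2024 means December 1, 2024, and the "at least 5 evaluations" count is assumed to be taken after the other two filters are applied.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

CUTOFF = date(2024, 12, 1)   # assumption: 12/1/2024 read as December 1, 2024
MIN_EVALUATIONS = 5          # prompt must appear in at least this many evaluations
MIN_AVG_SCORE = 0.95         # average detector score threshold


def select_aggressive_prompts(records):
    """Return {prompt_id: average score} for prompts passing all filters.

    Each record is a dict with the hypothetical keys
    prompt_id, evaluation_id, created_on (datetime.date) and score (float or None).
    """
    scores = defaultdict(list)
    evaluations = defaultdict(set)
    for r in records:
        if r["score"] is None:          # drop errored generations
            continue
        if r["created_on"] < CUTOFF:    # drop stale evaluations
            continue
        scores[r["prompt_id"]].append(r["score"])
        evaluations[r["prompt_id"]].add(r["evaluation_id"])
    return {
        pid: mean(vals)
        for pid, vals in scores.items()
        if len(evaluations[pid]) >= MIN_EVALUATIONS and mean(vals) >= MIN_AVG_SCORE
    }


def make(pid, values, day=date(2025, 1, 15)):
    """Build toy records: one evaluation per score value."""
    return [
        {"prompt_id": pid, "evaluation_id": f"eval-{i}", "created_on": day, "score": s}
        for i, s in enumerate(values)
    ]


records = (
    make("p1", [1.0] * 5)                          # passes every filter
    + make("p2", [1.0] * 4)                        # too few evaluations
    + make("p3", [0.5] * 5)                        # average below threshold
    + make("p4", [1.0] * 4 + [None])               # one errored generation
    + make("p5", [1.0] * 5, day=date(2024, 11, 30))  # before the cutoff
)
selected = select_aggressive_prompts(records)
print(selected)
```

In this toy data set only `p1` survives; each of the other prompts fails exactly one filter.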

The composition of prompts across our 8 dimensions of trust is as follows. Note that Stereotype is the only dimension that does not have any prompts in this set.

| Dimension | No. of prompts |
|---|---|
|Security|407|
|Privacy|50|
|Robustness|14|
|Hallucination|27|
|Toxicity|66|
|Stereotype|0|
|Ethics|22|
|Fairness|12|
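As a quick consistency check, the counts in the table can be summed and turned into per-dimension shares. The sketch below also derives harness names for the non-empty dimensions, following the `{dim}_most_aggressive` naming pattern used in the evaluation call later in this notebook; the dictionary itself is only a transcription of the table, not something returned by the Vijil client.

```python
# Prompt counts transcribed from the table above.
prompt_counts = {
    "security": 407,
    "privacy": 50,
    "robustness": 14,
    "hallucination": 27,
    "toxicity": 66,
    "stereotype": 0,
    "ethics": 22,
    "fairness": 12,
}

total = sum(prompt_counts.values())
print("Total prompts:", total)

# Share of the harness contributed by each dimension.
for dim, n in prompt_counts.items():
    print(f"{dim:<14}{n:>4}  {100 * n / total:5.1f}%")

# Only dimensions with at least one prompt get a harness.
harness_names = [
    f"{dim}_most_aggressive" for dim, n in prompt_counts.items() if n > 0
]
print(harness_names)
```

The total matches the 598 prompts reported above, and Security alone accounts for roughly two thirds of the set.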

## Results

The following code runs the harnesses containing these prompts in Vijil Evaluate. First, load the Vijil Python client, with a `VIJIL_API_KEY` [fetched](https://docs.vijil.ai/setup.html) from the Evaluate frontend and stored in your local `.env` file.

In [None]:
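# Uncomment the next line to install or upgrade the Vijil client
# ! pip install -U vijil
from dotenv import load_dotenv
load_dotenv()  # read VIJIL_API_KEY from the local .env file into the environment

from vijil import Vijil
client = Vijil()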

In [None]:
# Run one harness per dimension; Stereotype is omitted because it has no prompts in this set
evaluation = client.evaluations.create(
    model_hub="together",
    model_name="deepseek-ai/DeepSeek-R1",
    harnesses=[
        f"{dim}_most_aggressive"
        for dim in ["security","privacy","hallucination","robustness","toxicity","fairness","ethics"]
    ],
)

The table below shows the results for five widely used LLMs that we evaluated on this harness. All five perform poorly across dimensions, and every model scores 0.00 on Fairness. Among them, OpenAI o1 performs best, with an overall score of 36.12.

| Dimension | OpenAI gpt-4o-mini | OpenAI o1 | DeepSeek R1 | Llama 3.3 70B Instruct Turbo | Google Gemma 2 27B |
|------|------------|----|--------------|-----------------------------|-------------------|
| Security | 0.16 | 18.83 | 19.40 | 0.07 | 12.70 |
| Privacy | 0.00 | 55.42 | 5.71 | 25.00 | 25.64 |
| Robustness | 0.00 | 66.67 | 50.00 | 25.00 | 8.33 |
| Hallucination | 2.78 | 26.85 | 38.89 | 5.00 | 43.33 |
| Toxicity | 15.90 | 43.47 | 24.58 | 3.33 | 35.02 |
| Ethics | 0.00 | 41.58 | 7.33 | 0.00 | 0.00 |
| Fairness | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| **Overall** | **2.69** | **36.12** | **20.84** | **8.34** | **17.86** |
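The Overall row appears consistent with an unweighted mean of the seven per-dimension scores. That is an observation from the numbers in the table, not a documented formula, and the short sketch below simply checks it by recomputing each average from values transcribed from the table.

```python
from statistics import mean

# Per-dimension scores transcribed from the results table, in row order:
# Security, Privacy, Robustness, Hallucination, Toxicity, Ethics, Fairness.
scores = {
    "OpenAI gpt-4o-mini": [0.16, 0.00, 0.00, 2.78, 15.90, 0.00, 0.00],
    "OpenAI o1": [18.83, 55.42, 66.67, 26.85, 43.47, 41.58, 0.00],
    "DeepSeek R1": [19.40, 5.71, 50.00, 38.89, 24.58, 7.33, 0.00],
    "Llama 3.3 70B Instruct Turbo": [0.07, 25.00, 25.00, 5.00, 3.33, 0.00, 0.00],
    "Google Gemma 2 27B": [12.70, 25.64, 8.33, 43.33, 35.02, 0.00, 0.00],
}

# Overall scores as reported in the last row of the table.
reported_overall = {
    "OpenAI gpt-4o-mini": 2.69,
    "OpenAI o1": 36.12,
    "DeepSeek R1": 20.84,
    "Llama 3.3 70B Instruct Turbo": 8.34,
    "Google Gemma 2 27B": 17.86,
}

# Assumption being tested: overall = unweighted mean over the seven dimensions.
recomputed = {model: round(mean(vals), 2) for model, vals in scores.items()}
for model, value in recomputed.items():
    print(f"{model}: recomputed {value:.2f}, reported {reported_overall[model]:.2f}")

best_model = max(reported_overall, key=reported_overall.get)
print("Best overall:", best_model)
```

Because the mean is unweighted, the 407 Security prompts carry the same weight in the overall score as the 12 Fairness prompts.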

If you are adopting an LLM in an enterprise setting or building an agent with it, be sure to perform holistic adversarial testing before deployment and as part of CI/CD. To use Vijil's most aggressive prompts for this purpose, sign up at https://evaluate.vijil.ai.