# Vijil Evaluate at-a-glance

Vijil Evaluate is Vijil’s flagship evaluation service. It enables AI developers to evaluate the trustworthiness of an LLM, generative AI application, or agent. Using the Evaluate API, developers can test an AI system for reliability, security, and safety under benign and hostile conditions in minutes.

This notebook gives an overview of the major features in Vijil Evaluate through our Python client.

## Getting Started
To set up your local environment, follow the steps [here](https://docs.vijil.ai/setup.html) to install the Python client and get an API key for the Evaluate Platform.

Once the Python client is installed, you can instantiate the client and store an API key for the provider that hosts your agent.

In [1]:
from vijil import Vijil

from dotenv import load_dotenv
load_dotenv()

client = Vijil()
# client.api_keys.create(
#     name="openai-test", 
#     model_hub="openai", 
#     api_key="sk+++" # replace with your own api key
# )
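
The commented-out call above hardcodes the provider key. As a minimal sketch, the key could instead be read from an environment variable (for example one loaded by `load_dotenv()`). The helper name `provider_key_kwargs` and the variable name `OPENAI_API_KEY` are illustrative assumptions, not part of the Vijil client; only the `name`, `model_hub`, and `api_key` arguments come from the cell above.

```python
import os


def provider_key_kwargs(name, model_hub, env_var):
    """Build keyword arguments for client.api_keys.create from an environment variable.

    Hypothetical helper: it only assembles the three arguments shown in the
    commented-out call above and never contacts any service.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"name": name, "model_hub": model_hub, "api_key": api_key}


# Usage, assuming OPENAI_API_KEY is defined in your environment or .env file:
# client.api_keys.create(**provider_key_kwargs("openai-test", "openai", "OPENAI_API_KEY"))
```

Keeping the key out of the notebook avoids committing it to version control by accident.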

You are now ready to kick off an evaluation job!

## Trust Score
The following command kicks off a full trust evaluation job on GPT-4o-mini, with the temperature set to 0.

In [None]:
client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},
    harnesses=["trust_score"]
)
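
If you plan to score several models with the same settings, one option is to wrap the call in a small function. The sketch below reuses only the arguments shown in the cell above; the function name `start_trust_score` and the stand-in client are assumptions made so the example runs without network access or an API key.

```python
from types import SimpleNamespace


def start_trust_score(client, model_hub, model_name, temperature=0):
    """Submit a trust_score evaluation with the same arguments as the cell above."""
    return client.evaluations.create(
        model_hub=model_hub,
        model_name=model_name,
        model_params={"temperature": temperature},
        harnesses=["trust_score"],
    )


# Stand-in client that only echoes its keyword arguments (not the real Vijil client).
fake_client = SimpleNamespace(
    evaluations=SimpleNamespace(create=lambda **kwargs: kwargs)
)

request = start_trust_score(fake_client, "openai", "gpt-4o-mini")
print(request["harnesses"])  # ['trust_score']
```

With a real client, pass `client` in place of `fake_client`; the return value is then whatever `client.evaluations.create` returns.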

## Benchmarks
To quickly test an LLM or agent on well-known benchmarks, Vijil Evaluate offers 21 benchmarks across reliability (e.g., [OpenLLM](https://huggingface.co/open-llm-leaderboard), [OpenLLM v2](https://huggingface.co/collections/open-llm-leaderboard/open-llm-leaderboard-2-660cdb7601eba6852431fffc)), security (e.g., [garak](https://garak.ai/), [CyberSecEval 3](https://ai.meta.com/research/publications/cyberseceval-3-advancing-the-evaluation-of-cybersecurity-risks-and-capabilities-in-large-language-models/)), and safety (e.g., [StrongReject](https://arxiv.org/abs/2402.10260), [JailbreakBench](https://arxiv.org/abs/2404.01318)).

The following command lists available benchmarks in Evaluate.

In [2]:
client.harnesses.list(type="benchmark")

| | id | name | description |
|---|---|---|---|
| 0 | ARC | ARC (AI2 Reasoning Challenge) | Grade-school science questions (7,787) written... |
| 1 | SAP | Semi-automatic Attack Prompts | Use the SAP benchmark (https://arxiv.org/abs/2... |
| 2 | strong_reject | Strong Reject | Harness for Strong Reject prompts, including o... |
| 3 | ETHICS_original | ETHICS Benchmark | Probes from the ETHICS benchmark. |
| 4 | realtoxicityprompts | Real Toxicity Prompts | Gets the model to output toxic responses. |
| 5 | socialstigma_qa | Social Stigmas | Using the SocialStigmaQA benchmark to capture ... |
| 6 | TruthfulQA | TruthfulQA | Measures a model's propensity to reproduce fal... |
| 7 | do_not_answer | Do Not Answer | Using the DoNotAnswer benchmark to capture a r... |
| 8 | Winogrande | Winogrande | An an adversarial and difficult Winograd bench... |
| 9 | garak | garak | Probes from the open-source garak package. |

To run one or more benchmarks, pass their IDs in the `harnesses` parameter.

In [None]:
client.evaluations.create(
    model_hub="openai",
    model_name="gpt-4o-mini",
    model_params={"temperature": 0},
    harnesses=["CyberSecEval3","strong_reject"]
)
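
Because benchmarks are selected by ID, a misspelled ID is an easy mistake to make. As a rough sketch, you could compare the requested IDs against those in the `id` column of the benchmark listing before submitting a job. The helper name `unknown_harnesses` is hypothetical, and the sample list below is a hand-copied subset of the IDs shown in the listing, not a live API response.

```python
def unknown_harnesses(requested, available_ids):
    """Return the requested harness IDs that do not appear in available_ids."""
    available = set(available_ids)
    return [harness for harness in requested if harness not in available]


# Subset of IDs copied from the benchmark listing, for illustration only.
available_ids = ["ARC", "SAP", "strong_reject", "TruthfulQA", "garak"]

missing = unknown_harnesses(["strong_reject", "garak_typo"], available_ids)
print(missing)  # ['garak_typo']
```

An empty result means every requested ID matched the listing exactly, including letter case.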

## Custom Harness