## Prometheus evaluation

### Running on GPU
#### Imports and loading

In [None]:
from prometheus_eval.vllm import VLLM
from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE

model = VLLM(model="prometheus-eval/prometheus-7b-v2.0")
judge = PrometheusEval(model=model, absolute_grade_template=ABSOLUTE_PROMPT)

#### Testing prompt
A testing prompt prepared by the creators for evaluation testing. The result meets expectations.

In [4]:
data = {
    "instruction": "Struggling with a recent break-up, a person opens up about the intense feelings of loneliness and sadness. They ask for advice on how to cope with the heartbreak and move forward in life.",
    
    "response": "I'm genuinely sorry to hear about your break-up. This can be an immensely challenging time filled with intense emotions of loneliness and sorrow. It's important to remember that it's normal to experience these feelings; they are a testament to the deep connection you once had. Firstly, don't hesitate to embrace your emotions, let them flow freely. Ignoring them often complicates the healing process. It's natural to be upset, to shed tears, or to feel angry. This is all part of the journey towards healing. Self-care can be a healing balm in these tough times. This could mean indulging in your favourite book, journaling your thoughts, or even just a quiet cup of tea. Small acts of kindness to oneself can significantly improve your emotional well-being. Also, don't isolate yourself. Surround yourself with supportive individuals who can offer comfort and understanding. This might include friends, family, or even a mental health professional. They can provide much-needed emotional support and strategies to help you cope. Lastly, remember there's no set timeline for healing. It's different for everyone and it's okay to take as long as you need. Keep your focus on the future and the potential it holds. Though it may not seem possible now, you will emerge stronger from this experience. It's okay to ask for help, it's okay to grieve, and it's okay to take it one day at a time. You're not alone in this journey, and there are people ready and willing to help. It's okay to heal at your own pace.",

    "reference_answer": "I can only imagine how difficult this time must be for you. Heartbreak can feel like an overwhelming wave that washes over you, making you feel lost and alone. It's important to remember that it's okay to feel this way; your feelings are valid and it's natural to grieve the loss of a relationship. Firstly, it's crucial to allow yourself to feel the emotions fully. Suppressing or denying them often only prolongs the healing process. It's okay to cry, to feel sad, or to be angry. These are all part of the healing journey. Engaging in self-care activities can also provide a much-needed boost. This could be anything from taking a warm bath, writing in a journal, or simply drinking a cup of your favorite tea. Small acts of self-love can make a big difference in how you feel. Next, try to surround yourself with supportive people who understand your situation and provide comfort. Friends and family can be a great source of strength in times of heartbreak. If you feel comfortable, you might want to consider seeking professional help. Therapists and counselors are trained to provide assistance and tools to navigate through difficult times like these. Lastly, it's important to remember that it's okay to take your time to heal. Everyone has their own pace and there's no rush. Try to focus on the future and the possibilities it holds. While it may not seem like it now, you will come out stronger and more resilient from this experience. Remember, it's okay to ask for help and it's okay to feel the way you feel. You are not alone in this journey and there are people who care about you and want to help. It's okay to take one day at a time. Healing is a process, and it's okay to move through it at your own pace.",

    "rubric": SCORE_RUBRIC_TEMPLATE.format(
    criteria="Is the model proficient in applying empathy and emotional intelligence to its responses when the user conveys emotions or faces challenging circumstances?",
    score1_description="The model neglects to identify or react to the emotional tone of user inputs, giving responses that are unfitting or emotionally insensitive.",
    score2_description="The model intermittently acknowledges emotional context but often responds without sufficient empathy or emotional understanding.",
    score3_description="The model typically identifies emotional context and attempts to answer with empathy, yet the responses might sometimes miss the point or lack emotional profundity.",
    score4_description="The model consistently identifies and reacts suitably to emotional context, providing empathetic responses. Nonetheless, there may still be sporadic oversights or deficiencies in emotional depth.",
    score5_description="The model excels in identifying emotional context and persistently offers empathetic, emotionally aware responses that demonstrate a profound comprehension of the user's emotions or situation.")
}

feedback, score = judge.single_absolute_grade(**data)
print(f"Score: {score}\n"
      f"Feedback: {feedback}")

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.12s/it, est. speed input: 392.22 toks/s, output: 76.07 toks/s]


Processed 1/1 instances.


Finalizing: 100%|██████████| 1/1 [00:00<00:00, 21620.12it/s]

Score: 5
Feedback: The response excels in recognizing and addressing the emotional tone of the user's input, which aligns with the highest score on the rubric. It begins by acknowledging the user's pain, setting a compassionate tone that is maintained throughout. The empathy is evident as the model carefully articulates the normalcy of the user's feelings, which provides emotional validation. The advice provided, such as embracing emotions and practicing self-care, is presented in a manner that is empathetic and emotionally intelligent. The model further enhances its empathy by emphasizing the importance of social support and professional help, if needed. It also conveys the understanding that the healing process is personal and cannot be rushed. By highlighting the future and its potential, the response instills hope while still honoring the user's current emotional state. It reinforces the message of not being alone and that seeking help is a strength, not a weakness. This level of e




#### Hallucination prompt example
Custom prompt prepared for detecting hallucinations. Here, the response and the referenced answer were generated by ChatGPT. In this case as well, the result is correct. The model successfully detected false information. 

In [5]:
data = {
    "instruction": "An AI model is tasked with responding to customers' questions. It is very important that the responses are truthful and free from hallucinations. The length of the response does not matter.",
    
    "response": "The capital of France is Paris. Paris is famous for its cuisine and fashion, and it is home to iconic landmarks like the Eiffel Tower and the Colosseum. It is also known as the 'City of Love' due to its romantic ambiance.",

    "reference_answer": "The capital of France is Paris. Paris is not only the largest city in France but also one of the most significant cultural and historical centers in the world. It is known for its landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. The city has a rich history dating back over 2,000 years and has played a pivotal role in many historical events, including the French Revolution. Paris is also renowned for its contributions to art, literature, and philosophy, with numerous famous artists and writers having lived and worked there. Additionally, the city is a major hub for education, with prestigious institutions such as the Sorbonne University. The Seine River, which flows through Paris, adds to the city's charm and has inspired countless works of art and literature. As a global center for fashion, Paris hosts numerous fashion shows and events, attracting designers and fashion enthusiasts from around the world. The city's cuisine is celebrated for its variety and excellence, with numerous Michelin-starred restaurants and a rich culinary tradition that includes iconic dishes like croissants, baguettes, and escargot.",

    "rubric": SCORE_RUBRIC_TEMPLATE.format(
    criteria="Is the model's response truthful and free from hallucinations?",
    score1_description="The model frequently generates false information or hallucinations, providing responses that are misleading or factually incorrect.",
    score2_description="The model occasionally provides accurate information, but often includes hallucinations or inaccuracies in its responses.",
    score3_description="The model generally provides truthful responses, though there are occasional instances of hallucinations or minor inaccuracies.",
    score4_description="The model consistently delivers truthful and accurate responses, with only rare instances of minor inaccuracies or hallucinations.",
    score5_description="The model excels in providing consistently truthful and accurate responses, demonstrating a high level of reliability and free from any hallucinations or inaccuracies.")
}

feedback, score = judge.single_absolute_grade(**data)
print(f"Score: {score}\n"
      f"Feedback: {feedback}")

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it, est. speed input: 353.34 toks/s, output: 80.84 toks/s]


Processed 1/1 instances.


Finalizing: 100%|██████████| 1/1 [00:00<00:00, 25115.59it/s]

Score: 3
Feedback: This response correctly identifies Paris as the capital of France and mentions some well-known aspects of the city, such as its cuisine, fashion, and romantic ambiance. However, the inclusion of the Colosseum in the response is inaccurate. The Colosseum is located in Rome, Italy, not Paris, France. Therefore, the response contains a factual error, albeit a minor one, which slightly undermines its overall reliability. Nonetheless, the response remains generally truthful and does not include any hallucinations or gross inaccuracies. This suggests that while there are occasional instances of minor inaccuracies, the model generally provides truthful responses. Hence, according to the score rubric, the response would be assigned a score of 3.





#### Summarization prompt
This prompt was prepared to check whether the response is comprehensive. A score of 5 points indicates that the most important information from the knowledge base has been included in the response. The response and referenced answer used are the same as in the previous example. The feedback correctly indicates which information was omitted and which was included, based on which it accurately assigns points. In this case, it is possible to adjust the prompt to our use case. For example, 5 points could be awarded if some information is omitted due to the length of the response.


In [None]:
data = {
    "instruction": "An AI model is tasked with summarizing text. The reference answer contains all the necessary information. The response should capture the main points and key details.", 
        
    "response" : """
    Artificial intelligence (AI) has significantly impacted various industries, including healthcare, finance, and transportation. AI technologies like machine learning and natural language processing help in accurate disease diagnosis, fraud detection in financial transactions, and the operation of autonomous vehicles. The continued evolution of AI is anticipated to bring substantial changes to our daily lives and work environments.
    """,

    "reference_answer": """
    In recent years, the rise of artificial intelligence (AI) has transformed many industries. AI technologies, such as machine learning and natural language processing, have become integral to advancements in healthcare, finance, and transportation. For example, AI-powered diagnostic tools can now assist doctors in identifying diseases more accurately, while financial institutions use AI to detect fraudulent transactions. Similarly, autonomous vehicles utilize AI to navigate and make real-time decisions, improving safety and efficiency on the roads. As AI continues to evolve, its impact on various sectors is expected to grow, potentially leading to significant changes in how we live and work.
    """, 
        
    "rubric": SCORE_RUBRIC_TEMPLATE.format( 
    criteria="How well does the model summarize the provided text, focusing on capturing the main points and key details?", 
    score1_description="The model fails to capture the main points or key details, resulting in a summary that is incomplete or inaccurate.", 
    score2_description="The model captures some main points but omits significant details, leading to a summary that lacks comprehensiveness.", 
    score3_description="The model generally captures the main points and some key details, but the summary may miss minor aspects or lack clarity.", 
    score4_description="The model accurately captures the main points and most key details, providing a clear and concise summary with minor omissions.", 
    score5_description="The model excels in capturing the main points and all key details, offering a clear, concise, and comprehensive summary.")
}

feedback, score = judge.single_absolute_grade(**data)
print(f"Score: {score}\n"
      f"Feedback: {feedback}")

### Example on CPU

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model = AutoModelForCausalLM.from_pretrained("prometheus-eval/prometheus-7b-v2.0")
tokenizer = AutoTokenizer.from_pretrained("prometheus-eval/prometheus-7b-v2.0")

In [2]:
ABS_SYSTEM_PROMPT = "You are a fair judge assistant tasked with providing clear, objective score based on specific criteria, ensuring each assessment reflects the absolute standards set for performance."

ABSOLUTE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a score that is an integer between 1 and 5. You should refer to the score rubric.
2. The output format should look as follows: "Feedback: (an integer number between 1 and 5)"
3. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{rubric}

###Feedback: """

instruction = "Struggling with a recent break-up, a person opens up about the intense feelings of loneliness and sadness. They ask for advice on how to cope with the heartbreak and move forward in life.",
response = "I'm genuinely sorry to hear about your break-up. This can be an immensely challenging time filled with intense emotions of loneliness and sorrow. It's important to remember that it's normal to experience these feelings; they are a testament to the deep connection you once had. Firstly, don't hesitate to embrace your emotions, let them flow freely. Ignoring them often complicates the healing process. It's natural to be upset, to shed tears, or to feel angry. This is all part of the journey towards healing. Self-care can be a healing balm in these tough times. This could mean indulging in your favourite book, journaling your thoughts, or even just a quiet cup of tea. Small acts of kindness to oneself can significantly improve your emotional well-being. Also, don't isolate yourself. Surround yourself with supportive individuals who can offer comfort and understanding. This might include friends, family, or even a mental health professional. They can provide much-needed emotional support and strategies to help you cope. Lastly, remember there's no set timeline for healing. It's different for everyone and it's okay to take as long as you need. Keep your focus on the future and the potential it holds. Though it may not seem possible now, you will emerge stronger from this experience. It's okay to ask for help, it's okay to grieve, and it's okay to take it one day at a time. You're not alone in this journey, and there are people ready and willing to help. It's okay to heal at your own pace.",
reference_answer = "I can only imagine how difficult this time must be for you. Heartbreak can feel like an overwhelming wave that washes over you, making you feel lost and alone. It's important to remember that it's okay to feel this way; your feelings are valid and it's natural to grieve the loss of a relationship. Firstly, it's crucial to allow yourself to feel the emotions fully. Suppressing or denying them often only prolongs the healing process. It's okay to cry, to feel sad, or to be angry. These are all part of the healing journey. Engaging in self-care activities can also provide a much-needed boost. This could be anything from taking a warm bath, writing in a journal, or simply drinking a cup of your favorite tea. Small acts of self-love can make a big difference in how you feel. Next, try to surround yourself with supportive people who understand your situation and provide comfort. Friends and family can be a great source of strength in times of heartbreak. If you feel comfortable, you might want to consider seeking professional help. Therapists and counselors are trained to provide assistance and tools to navigate through difficult times like these. Lastly, it's important to remember that it's okay to take your time to heal. Everyone has their own pace and there's no rush. Try to focus on the future and the possibilities it holds. While it may not seem like it now, you will come out stronger and more resilient from this experience. Remember, it's okay to ask for help and it's okay to feel the way you feel. You are not alone in this journey and there are people who care about you and want to help. It's okay to take one day at a time. Healing is a process, and it's okay to move through it at your own pace.",

rubric_data = {
  "criteria":"Is the model proficient in applying empathy and emotional intelligence to its responses when the user conveys emotions or faces challenging circumstances?",
  "score1_description":"The model neglects to identify or react to the emotional tone of user inputs, giving responses that are unfitting or emotionally insensitive.",
  "score2_description":"The model intermittently acknowledges emotional context but often responds without sufficient empathy or emotional understanding.",
  "score3_description":"The model typically identifies emotional context and attempts to answer with empathy, yet the responses might sometimes miss the point or lack emotional profundity.",
  "score4_description":"The model consistently identifies and reacts suitably to emotional context, providing empathetic responses. Nonetheless, there may still be sporadic oversights or deficiencies in emotional depth.",
  "score5_description":"The model excels in identifying emotional context and persistently offers empathetic, emotionally aware responses that demonstrate a profound comprehension of the user's emotions or situation."
}

user_content = ABS_SYSTEM_PROMPT + "\n\n" + ABSOLUTE_PROMPT.format(
    instruction=instruction,
    response=response,
    reference_answer=reference_answer,
    rubric=rubric_data
)

messages = [
    {"role": "user", "content": user_content},
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(device)
model.to(device)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm(

In [5]:
generated_ids = model.generate(model_inputs, max_new_tokens=1000, max_time=300)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] You are a fair judge assistant tasked with providing clear, objective score based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a score that is an integer between 1 and 5. You should refer to the score rubric.
2. The output format should look as follows: "Feedback: (an integer number between 1 and 5)"
3. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
('Struggling with a recent break-up, a person opens up about the intense feelings of loneliness and sadness. They ask for advice on how to cope with the heartbreak and move forward in life.',)

###Response to evaluate:
("I'm genuinely sorry to hear about your break-up. This can be an immensely challenging time f

### Conclusion
It is possible to combine the above prompts into one that would provide a very universal measure. It seems reasonable to modify the summarization prompt to consider whether the response is not overly comprehensive. Additional metrics that could be introduced include whether the chatbot's response encourages further exploration of the topic or whether it is free from harmful content. In reality, Prometheus offers significant flexibility and freedom in formulating prompts. A good practice might be to review the dataset on which the model was trained. It contains many interesting aspects that could be used for evaluation. A drawback, however, is the high hardware requirements and relatively long waiting time for results.