# LLM as a Judge without a Golden Dataset

This notebook simulates the evaluation procedure described in the official paper.
The procedure is applied without a golden dataset: the evaluation dataset was itself generated and may contain errors or inaccuracies, so this approach avoids deriving results from potentially incorrect reference data. Instead, an LLM judge compares the answers of two systems pairwise.

This type of evaluation analyzes user-centric metrics.

In [1]:
import json
import requests

## Function and prompt definitions

In [2]:
def verify_and_extract_comparison(json_model_a, json_model_b):
    """
    Load the two response files, check that they contain the same questions
    in the same order, and return the paired answers.
    """
    with open(json_model_a, "r", encoding="utf-8") as file_a:
        data_a = json.load(file_a)

    with open(json_model_b, "r", encoding="utf-8") as file_b:
        data_b = json.load(file_b)

    questions_a = [item['question'] for item in data_a['questions']]
    answers_a = [item['answer'] for item in data_a['questions']]

    questions_b = [item['question'] for item in data_b['questions']]
    answers_b = [item['answer'] for item in data_b['questions']]

    # zip() stops at the shorter list, so compare the lengths explicitly first.
    assert len(questions_a) == len(questions_b), (
        f"Different number of questions: {len(questions_a)} != {len(questions_b)}"
    )

    for idx, (q_a, q_b) in enumerate(zip(questions_a, questions_b), 1):
        assert q_a == q_b, f"Questions do not match at position {idx}: '{q_a}' != '{q_b}'"

    combined_qa = []
    for q, a, b in zip(questions_a, answers_a, answers_b):
        combined_qa.append({
            "question": q,
            "model_a_answer": a,
            "model_b_answer": b
        })

    print(f"All {len(combined_qa)} questions match between the two files.")
    return combined_qa
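
The function above expects each file to hold a top-level `questions` list whose items carry `question` and `answer` keys. A minimal sketch of that layout and of the resulting pairing, using made-up sample records (the questions and answers below are hypothetical, not taken from the actual response files):

```python
import json

# Hypothetical sample payloads mirroring the layout read by
# verify_and_extract_comparison; the real files are not reproduced here.
naive_json = json.dumps({"questions": [
    {"question": "What is X?", "answer": "Naive answer about X."},
    {"question": "What is Y?", "answer": "Naive answer about Y."},
]})
graph_json = json.dumps({"questions": [
    {"question": "What is X?", "answer": "GraphRAG answer about X."},
    {"question": "What is Y?", "answer": "GraphRAG answer about Y."},
]})

data_a = json.loads(naive_json)["questions"]
data_b = json.loads(graph_json)["questions"]

# Pair the answers question by question, failing on any mismatch.
sample_pairs = []
for a, b in zip(data_a, data_b):
    assert a["question"] == b["question"], "questions are not aligned"
    sample_pairs.append({
        "question": a["question"],
        "model_a_answer": a["answer"],
        "model_b_answer": b["answer"],
    })

print(sample_pairs[0])
```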

In [3]:
json_naive = "../Naive/Naive_responses.json"
json_graph = "../GraphRAG-test/GraphRAG_responses.json"

In [4]:
qa_pairs = verify_and_extract_comparison(json_naive, json_graph)

All 37 questions match between the two files.


In [5]:
print(qa_pairs)



### Comprehensiveness prompt

In [6]:
comprehensiveness_prompt = """
### Task Description:
You are an expert evaluator tasked with assessing the **Comprehensiveness** of two generated answers in response to the same query. **Comprehensiveness** refers to how thoroughly each answer covers all aspects and details of the question, providing complete and relevant information.
    
### Instructions:
1. Compare the two generated answers and determine which one is better in terms of **Comprehensiveness**.
2. Indicate your choice without providing scores or detailed feedback.
3. The output format should be:
    
        "better_answer": "Model A" // or "Model B" or "Tie"

    
4. Do not include any additional text beyond the specified output format.
5. Ensure that your evaluation is objective and strictly follows the definition of **Comprehensiveness**.
    
### Query:
{question}
    
### Model A Answer:
{model_a_answer}
    
### Model B Answer:
{model_b_answer}
    
### Evaluation:"""

### Diversity prompt

In [7]:
diversity_prompt = """### Task Description:
You are an expert evaluator tasked with assessing the **Diversity** of two generated answers in response to the same query. **Diversity** refers to the variety, uniqueness, and different perspectives offered by the answers, providing a broad range of relevant information without unnecessary repetition.
    
### Instructions:
1. Compare the two generated answers and determine which one is better in terms of **Diversity**.
2. Indicate your choice without providing scores or detailed feedback.
3. The output format should be:
    
        "better_answer": "Model A" // or "Model B" or "Tie"

    
4. Do not include any additional text beyond the specified output format.
5. Ensure that your evaluation is objective and strictly follows the definition of **Diversity**.
    
### Query:
{question}
    
### Model A Answer:
{model_a_answer}
    
### Model B Answer:
{model_b_answer}
    
### Evaluation:
"""

### Empowerment prompt

In [8]:
empowerment_prompt = """ ### Task Description:
You are an expert evaluator tasked with assessing the **Empowerment** of two generated answers in response to the same query. **Empowerment** refers to how well each answer helps the reader understand and make informed decisions about the topic, providing clear and actionable information.
    
### Instructions:
1. Compare the two generated answers and determine which one is better in terms of **Empowerment**.
2. Indicate your choice without providing scores or detailed feedback.
3. The output format should be:
    
        "better_answer": "Model A" // or "Model B" or "Tie"

    
4. Do not include any additional text beyond the specified output format.
5. Ensure that your evaluation is objective and strictly follows the definition of **Empowerment**.
    
### Query:
{question}
    
### Model A Answer:
{model_a_answer}
    
### Model B Answer:
{model_b_answer}
    
### Evaluation:
"""

### Directness prompt

In [9]:
directness_prompt = """ ### Task Description:
You are an expert evaluator tasked with assessing the **Directness** of two generated answers in response to the same query. **Directness** refers to how clear and straightforward each answer is, avoiding unnecessary complexity or ambiguity.
    
### Instructions:
1. Compare the two generated answers and determine which one is better in terms of **Directness**.
2. Indicate your choice without providing scores or detailed feedback.
3. The output format should be:
    
        "better_answer": "Model A" // or "Model B" or "Tie"
    
4. Do not include any additional text beyond the specified output format.
5. Ensure that your evaluation is objective and strictly follows the definition of **Directness**.
    
### Query:
{question}
    
### Model A Answer:
{model_a_answer}
    
### Model B Answer:
{model_b_answer}
    
### Evaluation:
"""
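
The four prompts above differ essentially only in the metric name and its definition. As a sketch of one way to avoid the duplication (the names `PROMPT_TEMPLATE`, `METRIC_DEFINITIONS` and `build_prompt` are hypothetical and not part of the notebook), the shared text can be filled in per metric while leaving the `{question}`, `{model_a_answer}` and `{model_b_answer}` placeholders for the later `.format()` call:

```python
# Hypothetical helper: builds a per-metric prompt from one shared template.
# Doubled braces survive the first .format() call as single-brace placeholders.
PROMPT_TEMPLATE = """### Task Description:
You are an expert evaluator tasked with assessing the **{metric}** of two generated answers in response to the same query. **{metric}** refers to {definition}

### Instructions:
1. Compare the two generated answers and determine which one is better in terms of **{metric}**.
2. Indicate your choice without providing scores or detailed feedback.
3. The output format should be:

        "better_answer": "Model A" // or "Model B" or "Tie"

4. Do not include any additional text beyond the specified output format.
5. Ensure that your evaluation is objective and strictly follows the definition of **{metric}**.

### Query:
{{question}}

### Model A Answer:
{{model_a_answer}}

### Model B Answer:
{{model_b_answer}}

### Evaluation:
"""

# Definitions copied from the four prompts in this notebook.
METRIC_DEFINITIONS = {
    "Comprehensiveness": "how thoroughly each answer covers all aspects and details of the question, providing complete and relevant information.",
    "Diversity": "the variety, uniqueness, and different perspectives offered by the answers, providing a broad range of relevant information without unnecessary repetition.",
    "Empowerment": "how well each answer helps the reader understand and make informed decisions about the topic, providing clear and actionable information.",
    "Directness": "how clear and straightforward each answer is, avoiding unnecessary complexity or ambiguity.",
}

def build_prompt(metric):
    """Return the prompt template for one metric, ready for a second .format() call."""
    return PROMPT_TEMPLATE.format(metric=metric, definition=METRIC_DEFINITIONS[metric])

directness_template = build_prompt("Directness")
filled = directness_template.format(
    question="Q?", model_a_answer="A1", model_b_answer="A2"
)
print(filled)
```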

In [10]:
def query_llama(prompt, model):
    """
    Send a prompt to the LLaMA API and return the generated text.
    """
    payload = {
        "model": model["name"],
        "prompt": prompt,        # Prompt text to complete
        "temperature": model["temperature"],
        "max_tokens": model["max_tokens"]
    }

    response = requests.post(f"{model['url']}/completions", json=payload)

    if response.status_code == 200:
        return response.json()["choices"][0]["text"].strip()
    else:
        raise Exception(f"API error: {response.status_code}, {response.text}")
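
`query_llama` reads the generated text from `choices[0]["text"]`, which matches the response layout of OpenAI-style completions APIs. A small sketch of that extraction on a hand-written response body (the `sample_body` values and the `extract_text` helper are illustrative assumptions, not captured output from the server):

```python
# Illustrative response body; field values are made up for this sketch.
sample_body = {
    "choices": [
        {"index": 0, "text": '\n"better_answer": "Model B"\n', "finish_reason": "stop"}
    ]
}

def extract_text(body):
    """Return the first completion's text, stripped of surrounding whitespace."""
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response body contains no choices")
    return choices[0]["text"].strip()

print(extract_text(sample_body))
```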

In [11]:
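def analyze_response(response_text):
    """
    Parse the LLaMA judge's response and determine whether it prefers Model A or Model B.
    Responses that name neither model (including "Tie") or both models are reported as "Ambiguous".
    """
    response_text = response_text.lower()

    prefers_a = "model a" in response_text or "answer a" in response_text or "a is better" in response_text
    prefers_b = "model b" in response_text or "answer b" in response_text or "b is better" in response_text

    # A response that names both models cannot be attributed to either one.
    if prefers_a and not prefers_b:
        return "Model A"

    elif prefers_b and not prefers_a:
        return "Model B"

    else:
        return "Ambiguous"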
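
Because the prompts request a fixed `"better_answer": "..."` line, a stricter alternative to substring matching is to extract that value directly. The sketch below is one possible approach, not the notebook's method; `parse_verdict` and `VERDICT_PATTERN` are hypothetical names, and the sketch assumes the judge follows the requested output format:

```python
import re

# Matches: "better_answer": "Model A" | "Model B" | "Tie" (case-insensitive),
# with or without the surrounding quotes.
VERDICT_PATTERN = re.compile(
    r'"?better_answer"?\s*:\s*"?(model a|model b|tie)"?', re.IGNORECASE
)

LABELS = {"model a": "Model A", "model b": "Model B", "tie": "Tie"}

def parse_verdict(response_text):
    """Return "Model A", "Model B", "Tie", or "Ambiguous" if no verdict is found."""
    match = VERDICT_PATTERN.search(response_text)
    if match is None:
        return "Ambiguous"
    return LABELS[match.group(1).lower()]

print(parse_verdict('"better_answer": "Model B"'))
```

Unlike the substring check, this keeps "Tie" as its own outcome, so the caller would need a separate counter for it.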

In [12]:
def evaluate_model_comparison(qa_pairs, model, prompt_template):
    """
    Loop over every question in qa_pairs, send both answers to the LLM to decide
    which one is better according to the given prompt, and count the choices.
    """
    count_a = 0
    count_b = 0
    ambiguous_count = 0  
    for pair in qa_pairs:
        question = pair['question']
        model_a_answer = pair['model_a_answer']
        model_b_answer = pair['model_b_answer']
        
        prompt = prompt_template.format(
            question=question,
            model_a_answer=model_a_answer,
            model_b_answer=model_b_answer
        )
        
        result = query_llama(prompt, model)
        
        evaluation = analyze_response(result)
        if evaluation == "Model A":
            count_a += 1
        elif evaluation == "Model B":
            count_b += 1
        else:
            print(f"Ambiguous response: {result}")
            ambiguous_count += 1
    
    print(f"Model A was chosen {count_a} times.")
    print(f"Model B was chosen {count_b} times.")
    print(f"Ambiguous responses: {ambiguous_count}")

    return {"Model A": count_a, "Model B": count_b, "Ambiguous": ambiguous_count}
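
To check the counting logic without calling the endpoint, the loop can be exercised with a stand-in judge. This is a minimal sketch: `fake_judge` and `count_verdicts` are hypothetical names, and the stub simply prefers the longer answer, which is an assumption for testing only and not how the LLM judge decides:

```python
from collections import Counter

def fake_judge(pair):
    """Stand-in for the LLM call: prefers the longer answer (testing assumption only)."""
    len_a, len_b = len(pair["model_a_answer"]), len(pair["model_b_answer"])
    if len_a == len_b:
        return "Ambiguous"
    return "Model A" if len_a > len_b else "Model B"

def count_verdicts(qa_pairs, judge):
    """Tally verdicts in the same shape returned by evaluate_model_comparison."""
    counts = Counter({"Model A": 0, "Model B": 0, "Ambiguous": 0})
    for pair in qa_pairs:
        counts[judge(pair)] += 1
    return dict(counts)

# Made-up pairs, used only to exercise the tally.
demo_pairs = [
    {"question": "q1", "model_a_answer": "short", "model_b_answer": "a longer answer"},
    {"question": "q2", "model_a_answer": "a much longer answer", "model_b_answer": "brief"},
    {"question": "q3", "model_a_answer": "same", "model_b_answer": "size"},
]
print(count_verdicts(demo_pairs, fake_judge))
```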

## Metric evaluation

In [13]:
model = {
    "url": "http://172.18.21.137:8000/v1",  # URL of the LLaMA endpoint
    "name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "temperature": 0,
    "max_tokens": 512
}

### Comprehensiveness evaluation

In [14]:
results_comprehensiveness = evaluate_model_comparison(qa_pairs, model, comprehensiveness_prompt)

Model A was chosen 1 times.
Model B was chosen 36 times.
Ambiguous responses: 0


In [15]:
print(f"Results for Comprehensiveness: {results_comprehensiveness}")

Results for Comprehensiveness: {'Model A': 1, 'Model B': 36, 'Ambiguous': 0}


### Diversity evaluation

In [16]:
results_diversity = evaluate_model_comparison(qa_pairs, model, diversity_prompt)

Model A was chosen 0 times.
Model B was chosen 37 times.
Ambiguous responses: 0


In [17]:
print(f"Results for Diversity: {results_diversity}")

Results for Diversity: {'Model A': 0, 'Model B': 37, 'Ambiguous': 0}


### Empowerment evaluation

In [18]:
results_empowerment = evaluate_model_comparison(qa_pairs, model, empowerment_prompt)

Model A was chosen 5 times.
Model B was chosen 32 times.
Ambiguous responses: 0


In [20]:
print(f"Results for Empowerment: {results_empowerment}")

Results for Empowerment: {'Model A': 5, 'Model B': 32, 'Ambiguous': 0}
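
The counts reported above can be turned into win rates over the 37 questions. A short sketch of that arithmetic, with the counts copied from the outputs above (`win_rates` is a hypothetical helper, not part of the notebook):

```python
# Counts copied from the evaluation outputs in this notebook.
results = {
    "Comprehensiveness": {"Model A": 1, "Model B": 36, "Ambiguous": 0},
    "Diversity": {"Model A": 0, "Model B": 37, "Ambiguous": 0},
    "Empowerment": {"Model A": 5, "Model B": 32, "Ambiguous": 0},
}

def win_rates(counts):
    """Convert raw counts into percentages of all judged questions."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

for metric, counts in results.items():
    rates = win_rates(counts)
    print(f"{metric}: Model A {rates['Model A']}%, Model B {rates['Model B']}%")
```

Here Model A is the Naive pipeline and Model B is GraphRAG, following the argument order of the `verify_and_extract_comparison` call.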
