# Evaluating the global queries

In [37]:
import json
import re
import requests

## Prompts and functions

In [38]:
llama3_correctness_eval_prompt_template = """### Task Description:
You are given a query, a generated answer, a reference answer (which receives a score of 5), and a scoring rubric representing the evaluation criteria for correctness.

Instructions:
1. Provide detailed feedback that assesses the correctness of the generated answer strictly based on the scoring rubric.
2. After writing the feedback, assign a score from 1 to 5 according to the rubric.
3. The output format should be:
   - Feedback: (your detailed feedback)
   - [RESULT] (1-5) or Score: (1-5)
4. Do not include any additional text beyond the feedback and the score.
5. Focus your evaluation only on the content present in both the generated answer and the reference answer. Do not penalize for missing information not present in the generated answer.

### Query:
{query}

### Generated Answer:
{generated_answer}

### Reference Answer (Score 5):
{reference_answer}

### Scoring Rubric for Correctness:
- **Score 1**: The generated answer is completely incorrect and does not relate to the query or the reference answer.
- **Score 2**: The generated answer has significant inaccuracies and fails to correctly address the main points of the query or the reference answer.
- **Score 3**: The generated answer is partially correct but contains notable errors or misconceptions.
- **Score 4**: The generated answer is mostly correct with minor inaccuracies.
- **Score 5**: The generated answer is entirely correct and aligns perfectly with the reference answer in terms of factual accuracy.

### Feedback:"""

In [39]:
llama3_completness_eval_prompt_template = """ ### Task Description:
You are given a query, a generated answer, a reference answer (which receives a score of 5), and a scoring rubric representing the evaluation criteria for completeness.

Instructions:
1. Provide detailed feedback that assesses the completeness of the generated answer strictly based on the scoring rubric.
2. After writing the feedback, assign a score from 1 to 5 according to the rubric.
3. The output format should be:
   - Feedback: (your detailed feedback)
   - Score: (1-5)
4. Do not include any additional text beyond the feedback and the score.
5. Focus on whether the generated answer includes all relevant information present in the reference answer.

### Query:
{query}

### Generated Answer:
{generated_answer}

### Reference Answer (Score 5):
{reference_answer}

### Scoring Rubric for Completeness:
- **Score 1**: The generated answer provides minimal or no relevant information.
- **Score 2**: The generated answer includes some relevant points but misses most key information.
- **Score 3**: The generated answer covers several relevant points but lacks important details.
- **Score 4**: The generated answer includes most key information but may miss minor details.
- **Score 5**: The generated answer thoroughly covers all relevant information from the reference answer.

### Feedback: """

In [40]:
llama3_relevance_eval_prompt_template = """ ### Task Description:
You are an expert evaluator tasked with assessing a generated answer in response to a specific query. Evaluate the answer based on the criterion of **Relevance**.

Instructions:
1. Provide detailed feedback that assesses the relevance of the generated answer strictly based on the provided rubric.
2. After writing the feedback, assign a score from 1 to 5 according to the rubric.
3. The output format should be:

**Feedback**: (your detailed feedback)
**Score**: (1-5)

4. Do not include any additional text beyond the feedback and the score.
5. Focus only on the content present in the generated answer and the query. Do not penalize for missing information that is not required by the query.

### Query:

{query}

### Generated Answer:

{generated_answer}

### Evaluation Criterion and Rubric:

**Relevance**

- **Score 1**: The answer is completely irrelevant to the query. It does not address any aspect of the question.
- **Score 2**: The answer has minimal relation to the query but lacks significant pertinent information.
- **Score 3**: The answer is partially relevant. It addresses some aspects of the query but omits key points.
- **Score 4**: The answer is mostly relevant. It covers most key points of the query with few irrelevant details.
- **Score 5**: The answer is highly relevant. It fully addresses the query directly and completely without including irrelevant information.

### Evaluation: """

In [68]:
def extract_score(feedback):
    """
    Extract the score (1 to 5) from the feedback using a regex.
    Handles the '[RESULT] X', 'Score: X' and '**Score**: X' formats.
    """
    match = re.search(r'(?:\[RESULT\]|Score:|\*\*Score\*\*:)\s*(\d+)', feedback)
    if match:
        return int(match.group(1))
    else:
        print(f"Could not extract the score from the feedback:\n{feedback}\n")
        return None
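
As a quick check of the pattern used in `extract_score`, the sketch below runs it against the three output formats the prompts ask for, plus two failure cases. The helper name `parse_score`, the sample strings, and the 1-5 range check are assumptions added for illustration; they are not part of the notebook.

```python
import re

# Same pattern as extract_score; the 1-5 range check below is an added assumption.
SCORE_PATTERN = re.compile(r'(?:\[RESULT\]|Score:|\*\*Score\*\*:)\s*(\d+)')

def parse_score(feedback):
    """Return the score as an int in 1..5, or None if absent or out of range."""
    match = SCORE_PATTERN.search(feedback)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 5 else None

# Made-up feedback strings, one per supported format plus two failure cases.
samples = [
    "Feedback: mostly right. [RESULT] 4",
    "- Feedback: partial.\n- Score: 3",
    "**Feedback**: on topic.\n**Score**: 5",
    "Feedback that was cut off before any score",
    "Score: 7",
]
parsed = [parse_score(s) for s in samples]
print(parsed)
```

Rejecting values outside 1-5 is a design choice: it keeps a malformed judge output from silently entering the score distribution.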

In [42]:
def create_evaluation_prompt(query, generated_answer, reference_answer, criterion):
    # Pick the prompt template that matches the criterion
    if criterion == 'correctness':
        prompt_template = llama3_correctness_eval_prompt_template
    elif criterion == 'completeness':
        prompt_template = llama3_completness_eval_prompt_template
    elif criterion == 'relevance':
        prompt_template = llama3_relevance_eval_prompt_template
    else:
        # Without this branch an unknown criterion fails with UnboundLocalError
        raise ValueError(f"Unknown criterion: {criterion!r}")

    # The relevance template has no {reference_answer} placeholder;
    # str.format ignores the unused keyword argument
    return prompt_template.format(
        query=query,
        generated_answer=generated_answer,
        reference_answer=reference_answer
    )
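
The template selection in `create_evaluation_prompt` can also be written as a dictionary lookup. The following is a minimal sketch under stated assumptions: `TEMPLATES`, `build_prompt`, and the short template strings are hypothetical stand-ins for the three long templates defined earlier. It also shows that `str.format` ignores keyword arguments a template does not use, which is why passing `reference_answer` to the relevance template does no harm.

```python
# Hypothetical, shortened stand-ins for the three prompt templates.
TEMPLATES = {
    "correctness": "Q: {query}\nA: {generated_answer}\nRef: {reference_answer}",
    "completeness": "Q: {query}\nA: {generated_answer}\nRef: {reference_answer}",
    "relevance": "Q: {query}\nA: {generated_answer}",  # no reference placeholder
}

def build_prompt(query, generated_answer, reference_answer, criterion):
    """Look up the template for the criterion and fill in its placeholders."""
    try:
        template = TEMPLATES[criterion]
    except KeyError:
        raise ValueError(f"Unknown criterion: {criterion!r}") from None
    # str.format ignores keyword arguments the template does not reference.
    return template.format(
        query=query,
        generated_answer=generated_answer,
        reference_answer=reference_answer,
    )

relevance_prompt = build_prompt("What is X?", "X is Y.", "X is Z.", "relevance")
print(relevance_prompt)
```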

In [43]:
def verifica_e_estrai(json_generated, json_reference):
    """
    Check that the questions in the two JSON files match and, if they do,
    extract the questions, the generated answers and the reference answers.

    :param json_generated: Path of the JSON file with the generated answers.
    :param json_reference: Path of the JSON file with the reference answers.
    :return: Dictionary with questions, generated answers and reference answers.
    :raises: AssertionError if the questions do not match.
    """

    # Load the JSON file with the answers generated by the model
    with open(json_generated, "r", encoding="utf-8") as file:
        generated_data = json.load(file)

    # Load the JSON file with the reference answers
    with open(json_reference, "r", encoding="utf-8") as file:
        reference_data = json.load(file)

    # Extract the questions and answers from the two files
    generated_questions = [item['question'] for item in generated_data['questions']]
    generated_answers = [item['answer'] for item in generated_data['questions']]

    reference_questions = [item['question'] for item in reference_data['questions']]
    reference_answers = [item['answer'] for item in reference_data['questions']]

    # zip() stops at the shorter list, so compare the lengths first
    assert len(generated_questions) == len(reference_questions), (
        f"Different number of questions: {len(generated_questions)} != {len(reference_questions)}"
    )

    # Check that the questions match between the two files
    for gq, rq in zip(generated_questions, reference_questions):
        assert gq == rq, f"The questions do not match: {gq} != {rq}"

    # Everything matches: return the questions and the answers
    print("All questions match between the two files.")
    return {
        "questions": generated_questions,
        "generated_answers": generated_answers,
        "reference_answers": reference_answers
    }
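
The input layout that `verifica_e_estrai` expects can be read off the keys it accesses: a top-level `"questions"` list whose items carry `"question"` and `"answer"`. The sketch below writes two tiny files in that layout to a temporary directory and repeats the pairwise comparison. The file names and contents are made up for illustration.

```python
import json
import os
import tempfile

# Layout inferred from the keys verifica_e_estrai reads; the contents are made up.
generated = {"questions": [{"question": "What is X?", "answer": "X is Y."}]}
reference = {"questions": [{"question": "What is X?", "answer": "X is Z."}]}

with tempfile.TemporaryDirectory() as tmp:
    # Write both payloads to disk.
    paths = {}
    for name, payload in (("generated", generated), ("reference", reference)):
        paths[name] = os.path.join(tmp, f"{name}.json")
        with open(paths[name], "w", encoding="utf-8") as f:
            json.dump(payload, f)

    # Reload them, as the notebook function does with its two paths.
    loaded = {}
    for name, path in paths.items():
        with open(path, "r", encoding="utf-8") as f:
            loaded[name] = json.load(f)

gen_q = [item["question"] for item in loaded["generated"]["questions"]]
ref_q = [item["question"] for item in loaded["reference"]["questions"]]

# Compare lengths as well as contents, since zip() truncates silently.
questions_match = len(gen_q) == len(ref_q) and all(g == r for g, r in zip(gen_q, ref_q))
print(questions_match)
```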

In [44]:
def query_llama(prompt, model):
    """
    Send a prompt to the LLaMA API and return the feedback.
    """
    payload = {
        "model": model["name"],  # Name of the LLaMA model
        "prompt": prompt,        # Prompt to send
        "temperature": model["temperature"],
        "max_tokens": model["max_tokens"]
    }

    # Send the POST request to the endpoint
    response = requests.post(f"{model['url']}/completions", json=payload)

    if response.status_code == 200:
        # Return the generated text
        return response.json()["choices"][0]["text"].strip()
    else:
        # On error, raise an exception with the status code and response body
        raise Exception(f"API error: {response.status_code}, {response.text}")
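
One possible reason a feedback ends without a score is that generation stopped at `max_tokens`. The sketch below parses a response body of the shape `query_llama` reads and flags that case. It assumes the server follows the OpenAI completions schema, where `finish_reason` is `"length"` when the token limit was hit; whether this endpoint reports that field is not confirmed by the notebook. `parse_completion` and both sample bodies are hypothetical.

```python
def parse_completion(response_json):
    """Return (text, truncated) from an OpenAI-style completions response body."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    first = choices[0]
    text = first.get("text", "").strip()
    # Assumption: finish_reason == "length" means max_tokens was reached.
    truncated = first.get("finish_reason") == "length"
    return text, truncated

# Made-up response bodies, shaped like the fields query_llama reads.
complete = {"choices": [{"text": " Feedback: fine. Score: 4 ", "finish_reason": "stop"}]}
cut_off = {"choices": [{"text": "Feedback: the answer covers", "finish_reason": "length"}]}

print(parse_completion(complete))
print(parse_completion(cut_off))
```

A caller could use the flag to retry with a larger `max_tokens` instead of recording a missing score.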

In [45]:
def esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, criterion):
    """
    Run the evaluation process:
    - Check the JSON files
    - Iterate over all questions to build the prompt and get the feedback
    - Save the results to a JSON file
    """
    # 1. Check and extract
    dati_estratti = verifica_e_estrai(json_generated, json_reference)
    generated_questions = dati_estratti["questions"]
    generated_answers = dati_estratti["generated_answers"]
    reference_answers = dati_estratti["reference_answers"]

    # 2. List that collects the results
    results = []

    for i in range(len(generated_questions)):
        # Build the prompt for the given criterion
        prompt = create_evaluation_prompt(generated_questions[i], generated_answers[i], reference_answers[i], criterion)

        # Get the feedback from the LLaMA API
        feedback = query_llama(prompt, model)

        # Extract the score from the feedback (None if it cannot be parsed)
        score = extract_score(feedback)

        # Append the result
        result = {
            "question": generated_questions[i],
            "generated_answer": generated_answers[i],
            "reference_answer": reference_answers[i],
            f"{criterion}_feedback": feedback,
            f"{criterion}_score": score
        }
        results.append(result)

    # Save the results to a JSON file
    with open(nome_output_json, "w") as file:
        json.dump(results, file, indent=4)

    print(f"Results saved to {nome_output_json}")

In [46]:
from collections import Counter

def estrai_statistiche_punteggi(json_file, criterion):
    """
    Extract the scores from a JSON file and compute the distribution of the scores (1, 2, 3, 4, 5).

    :param json_file: Path of the JSON file with the answers and the scores.
    :param criterion: The evaluation criterion ('correctness', 'completeness', etc.).
    :return: Dictionary with the score distribution.
    """
    # Load the JSON file
    with open(json_file, "r") as file:
        data = json.load(file)

    # Build the score key from the criterion
    score_key = f"{criterion}_score"

    # Extract the scores, skipping entries whose score could not be parsed
    scores = [item[score_key] for item in data if item.get(score_key) is not None]

    # Count how many 1s, 2s, 3s, 4s and 5s there are
    score_distribution = Counter(scores)

    # Print the score statistics
    print(f"Score distribution for criterion '{criterion}':")
    for score in range(1, 6):
        print(f"Score {score}: {score_distribution.get(score, 0)} occurrences")

    # Return the counts
    return dict(score_distribution)
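
A distribution of this shape can be reduced to a single number for comparing runs. The following is a minimal sketch, assuming a dictionary that maps integer scores to counts, as `estrai_statistiche_punteggi` returns. The helper `mean_score` and the example distribution are made up for illustration.

```python
def mean_score(distribution):
    """Weighted mean of a {score: count} mapping, or None if it is empty."""
    total = sum(distribution.values())
    if total == 0:
        return None
    return sum(score * count for score, count in distribution.items()) / total

# Made-up distribution: two 3s, one 4 and one 5.
example = {3: 2, 4: 1, 5: 1}
example_mean = mean_score(example)
print(example_mean)
```

Entries without a parsed score are absent from the distribution, so the mean covers only the scored answers.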

In [47]:
model = {
    "url": "http://172.18.21.137:8000/v1",  # URL of the LLaMA endpoint
    "name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "temperature": 0,
    "max_tokens": 512
}

## Naive

In [69]:
json_generated = "../Naive/Naive_responses.json"
json_reference = "../DatasetCreation/Global_questions.json"

try:
    risultati = verifica_e_estrai(json_generated, json_reference)
    print("Data extracted correctly.")
except AssertionError as e:
    print(f"Error: {e}")

All questions match between the two files.
Data extracted correctly.


In [49]:
nome_output_json = "naive_results/naive_results_correctness.json"

In [50]:
esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, 'correctness')

All questions match between the two files.
Results saved to naive_results/naive_results_correctness.json


In [51]:
stat = estrai_statistiche_punteggi(nome_output_json, 'correctness')
print(stat)

Score distribution for criterion 'correctness':
Score 1: 4 occurrences
Score 2: 12 occurrences
Score 3: 18 occurrences
Score 4: 3 occurrences
Score 5: 0 occurrences
{1: 4, 4: 3, 3: 18, 2: 12}


In [52]:
nome_output_json = "naive_results/naive_results_completeness.json"

In [53]:
esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, 'completeness')

All questions match between the two files.
Results saved to naive_results/naive_results_completeness.json


In [54]:
stat = estrai_statistiche_punteggi(nome_output_json, 'completeness')
print(stat)

Score distribution for criterion 'completeness':
Score 1: 7 occurrences
Score 2: 11 occurrences
Score 3: 17 occurrences
Score 4: 2 occurrences
Score 5: 0 occurrences
{2: 11, 3: 17, 4: 2, 1: 7}


In [70]:
nome_output_json = "naive_results/naive_results_relevance.json"

In [71]:
esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, 'relevance')

All questions match between the two files.
Results saved to naive_results/naive_results_relevance.json


In [72]:
stat = estrai_statistiche_punteggi(nome_output_json, 'relevance')
print(stat)

Score distribution for criterion 'relevance':
Score 1: 1 occurrences
Score 2: 2 occurrences
Score 3: 14 occurrences
Score 4: 14 occurrences
Score 5: 6 occurrences
{4: 14, 5: 6, 3: 14, 2: 2, 1: 1}


## GraphRAG

In [73]:
json_generated = "../GraphRAG-test/GraphRAG_responses.json"
json_reference = "../DatasetCreation/Global_questions.json"

try:
    risultati = verifica_e_estrai(json_generated, json_reference)
    print("Data extracted correctly.")
except AssertionError as e:
    print(f"Error: {e}")

All questions match between the two files.
Data extracted correctly.


In [59]:
nome_output_json = "gr_results/gr_results_correctness.json"

In [60]:
esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, 'correctness')

All questions match between the two files.
Results saved to gr_results/gr_results_correctness.json


In [61]:
stat = estrai_statistiche_punteggi(nome_output_json, 'correctness')
print(stat)

Score distribution for criterion 'correctness':
Score 1: 0 occurrences
Score 2: 3 occurrences
Score 3: 16 occurrences
Score 4: 18 occurrences
Score 5: 0 occurrences
{3: 16, 4: 18, 2: 3}


In [62]:
nome_output_json = "gr_results/gr_results_completeness.json"

In [63]:
esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, 'completeness')

All questions match between the two files.
Could not extract the score from the feedback:
(Please provide detailed feedback based on the scoring rubric)
The generated answer provides a comprehensive overview of how Chronos approaches time-series forecasting compared to traditional statistical models. It highlights the use of a deep transformer network, probabilistic forecasting, and pre-training and fine-tuning capabilities. However, it lacks specific details about how Chronos tokenizes time-series data into discrete values and utilizes multiple future trajectory samples based on historical context, which are key aspects of its approach as mentioned in the reference answer. Additionally, the generated answer does not explicitly compare Chronos to traditional statistical models like ARIMA or SARIMA, which is a crucial point for understanding its distinctiveness. Despite these omissions, the generated answer covers several relevant points and demonstrates a good understanding of Chr

In [64]:
stat = estrai_statistiche_punteggi(nome_output_json, 'completeness')
print(stat)

Score distribution for criterion 'completeness':
Score 1: 0 occurrences
Score 2: 0 occurrences
Score 3: 14 occurrences
Score 4: 22 occurrences
Score 5: 0 occurrences
{4: 22, 3: 14}
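
This distribution sums to 36, while every other run in this notebook sums to 37, which is consistent with the one feedback whose score could not be extracted in the completeness run. The sketch below shows one way to list such entries from a results list of the shape written by `esegui_valutazione_completa`. The helper `unscored_items` and the sample records are hypothetical.

```python
def unscored_items(results, criterion):
    """Return (index, question) pairs for entries whose score is missing or None."""
    key = f"{criterion}_score"
    return [
        (index, item["question"])
        for index, item in enumerate(results)
        if item.get(key) is None
    ]

# Made-up records shaped like the saved results.
sample_results = [
    {"question": "Q1", "completeness_feedback": "... Score: 4", "completeness_score": 4},
    {"question": "Q2", "completeness_feedback": "cut off", "completeness_score": None},
    {"question": "Q3", "completeness_feedback": "... Score: 3", "completeness_score": 3},
]
missing = unscored_items(sample_results, "completeness")
print(missing)
```

The returned questions can then be re-evaluated or scored by hand before comparing distributions across systems.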


In [74]:
nome_output_json = "gr_results/gr_results_relevance.json"

In [75]:
esegui_valutazione_completa(json_generated, json_reference, nome_output_json, model, 'relevance')

All questions match between the two files.
Results saved to gr_results/gr_results_relevance.json


In [76]:
stat = estrai_statistiche_punteggi(nome_output_json, 'relevance')
print(stat)

Score distribution for criterion 'relevance':
Score 1: 0 occurrences
Score 2: 0 occurrences
Score 3: 0 occurrences
Score 4: 2 occurrences
Score 5: 35 occurrences
{5: 35, 4: 2}
