# **Evaluating CosmoGemma using an LLM as a judge**

### Load the test sample, which holds the question, the reference answer, and the predicted answer for each entry

In [1]:
import json
with open('arxiv_astrophco_qa_pairs_2023_testing.json', 'r') as f:
    test = json.load(f)
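
Before grading, it can help to confirm that every entry actually carries the three fields used later in this notebook (`Question`, `REF_ANS`, `PRED_ANS`). The following is a minimal sketch, not part of the original notebook: the helper name `find_incomplete_entries` and the toy `sample` list are assumptions made for illustration, and in practice the loaded `test` list would be passed in instead.

```python
# Hypothetical helper (not in the original notebook): report entries that
# are missing one of the fields the grading loop relies on.
REQUIRED_KEYS = ("Question", "REF_ANS", "PRED_ANS")


def find_incomplete_entries(entries, required=REQUIRED_KEYS):
    """Return (index, missing_keys) for each entry lacking a non-empty required field."""
    problems = []
    for i, entry in enumerate(entries):
        missing = []
        for key in required:
            value = entry.get(key)
            # Treat absent, None, and whitespace-only values as missing.
            if value is None or not str(value).strip():
                missing.append(key)
        if missing:
            problems.append((i, missing))
    return problems


# Toy data standing in for the loaded `test` list (illustrative only).
sample = [
    {"Question": "Q1?", "REF_ANS": "older than 10 Gyr", "PRED_ANS": "100--200 My."},
    {"Question": "Q2?", "REF_ANS": "yes", "PRED_ANS": "   "},
    {"Question": "Q3?", "REF_ANS": "no"},
]
print(find_incomplete_entries(sample))
```

An empty list means every entry is complete; otherwise the reported indices can be skipped or inspected before any judge calls are made.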

### Inspect a sample entry

In [2]:
index = 3
print("Question:", test[index]['Question'], "\n")
print("Reference answer from llama3.1:", test[index]['REF_ANS'], "\n")
print("Predicted answer from CosmoGemma:", test[index]['PRED_ANS'], "\n")

Question: What is the typical age of large-scale stellar disks in S0 galaxies that reside in denser environments? 

Reference answer from llama3.1: older than 10 Gyr 

Predicted answer from CosmoGemma: 100--200 My. 



### Use LangChain and Ollama to run llama3.1:70b-instruct-q2_K as the judge that grades CosmoGemma's predictions (any other model can be used, but accuracy is sensitive to the model choice and requires extensive testing)

In [3]:
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain_community.chat_models import ChatOllama


response_schemas = [
    ResponseSchema(name="task", description="whether the two provided responses of a given question carry the exact meaning?"),
    ResponseSchema(name="output", description="graded output either CORRECT or WRONG")]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions(only_json=True)

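# temperature=0.0 keeps grading as repeatable as possible; format='json' asks Ollama for JSON output.
# Optional extra arguments: keep_alive='16h', num_thread=8
llm = ChatOllama(model="llama3.1:70b-instruct-q2_K", temperature=0.0, format='json')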
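
The schema above asks the judge for a JSON object with a `task` field and an `output` field that is either CORRECT or WRONG. A model may still return variants such as `"correct"` or a value with trailing whitespace, which an exact string comparison would silently count as wrong. The sketch below shows one way to normalize and validate such a reply using only the standard library. The helper `parse_grade` is hypothetical and is not part of LangChain or of the original notebook.

```python
import json

# The only two grades the schema allows.
VALID_GRADES = {"CORRECT", "WRONG"}


def parse_grade(raw_text):
    """Parse a judge reply and return 'CORRECT' or 'WRONG'; raise ValueError otherwise."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or "output" not in data:
        raise ValueError("reply has no 'output' field")
    # Normalize case and surrounding whitespace before checking the grade.
    grade = str(data["output"]).strip().upper()
    if grade not in VALID_GRADES:
        raise ValueError(f"unexpected grade: {data['output']!r}")
    return grade


# Illustrative reply text, not real model output.
print(parse_grade('{"task": "compare answers", "output": "correct "}'))
```

Raising `ValueError` for anything outside the two allowed grades keeps malformed replies in the "failed parsing" bucket instead of counting them as wrong answers.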

### Prompt template for grading a pair of answers

In [4]:
TEMPLATE = """You are a cosmologist and a judge. Your task is to grade whether the following two responses "{ref_ans}" and "{pred_ans}" to this given question "{question}" carry exactly the same meaning.

You MUST obey the following criteria:
- Give either CORRECT or WRONG as a response, with no other detail or explanation. Respond with just one word.
- Please follow the recommended JSON format below.
- Please ensure that the output is a valid JSON object.

{format_instructions}"""


prompt = ChatPromptTemplate.from_template(template=TEMPLATE)


### Loop through all answers to test performance

In [5]:
N = 0        # number of predictions graded CORRECT
failed = 0   # number of judge replies that could not be parsed
total = 2    # run a few examples as a demo, or use len(test) to run through the whole test sample
for i in range(total):
    print("Evaluation", i)
    messages = prompt.format_messages(ref_ans=test[i]['REF_ANS'],
                                      pred_ans=test[i]['PRED_ANS'],
                                      question=test[i]['Question'],
                                      format_instructions=format_instructions)

    response = llm.invoke(messages)
    try:
        output_dict = output_parser.parse(response.content)

        print("Task : " + output_dict['task'], "\n")
        print("Reference Answer :", test[i]['REF_ANS'], "\n")
        print("Predicted Answer :", test[i]['PRED_ANS'], "\n")

        print("Graded output :", output_dict['output'], "\n")
        if output_dict['output'] == "CORRECT":
            N += 1
    except Exception as err:  # malformed JSON or missing keys in the judge's reply
        print("failed", i, err)
        failed += 1
# Accuracy is computed over all `total` samples, so failed parses count against it.
print("Accuracy=", N / total)
print("Failed parsing", failed)  # check how many runs failed during parsing

Evaluation 0
Task : whether the two provided responses of a given question carry the exact meaning? 

Reference Answer : Yes, they can, with accuracies comparable to or even better than those of the stage-III type surveys neglecting the effect of massive neutrinos. 

Predicted Answer : Yes, it can. 

Graded output : CORRECT 

Evaluation 1
Task : Grading whether the two responses carry the exact meaning 

Reference Answer : Yes, they can contaminate the search for a primordial local signal by f_NL>10. 

Predicted Answer : yes 

Graded output : CORRECT 

Accuracy= 1.0
Failed parsing 0
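
The accuracy printed above divides by `total`, so any reply that fails parsing lowers the score just as a WRONG grade would. When parse failures are frequent, it may be useful to report both views side by side. The following is a small sketch under that assumption. The helper `summarize` is hypothetical, and the second call uses made-up counts purely to show the difference between the two numbers.

```python
def summarize(n_correct, n_failed, n_total):
    """Return accuracy over all samples and over successfully parsed samples only."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    n_parsed = n_total - n_failed
    return {
        # Same definition as the notebook: failed parses count as incorrect.
        "accuracy_all": n_correct / n_total,
        # Accuracy restricted to replies the parser could read.
        "accuracy_parsed": n_correct / n_parsed if n_parsed else float("nan"),
        "parse_failure_rate": n_failed / n_total,
    }


# Counts from the two-example demo run above (N=2, failed=0, total=2).
print(summarize(n_correct=2, n_failed=0, n_total=2))
# Made-up counts, for illustration only.
print(summarize(n_correct=6, n_failed=2, n_total=10))
```

With only two examples the result says little about CosmoGemma itself; setting `total = len(test)` in the loop gives a more meaningful estimate.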
