# **Generating a Q&A dataset from arXiv articles (https://www.kaggle.com/datasets/Cornell-University/arxiv)**

### Select only articles from Cosmology and Nongalactic Astrophysics (astro-ph.CO). Articles from 2018-2022 are used for fine-tuning and from 2023 for testing. The Q&A Dataset will be generated from the abstracts of the selected articles. 

In [1]:
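import json

data = []  # abstracts of the selected articles
n1 = 0     # number of selected articles
n2 = 0     # number of all articles in the snapshot
# The snapshot holds one JSON record per line, so it is parsed line by line.
with open('./arxiv-metadata-oai-snapshot.json') as file:
    for line in file:
        features = json.loads(line)
        year = int(features['update_date'][:4])
        # The exact string match keeps only articles whose sole category is astro-ph.CO.
        # Note: the strict inequalities below select update years 2019-2022;
        # use 2018 <= year to include 2018 as well.
        #if features['categories'] == "astro-ph.CO" and year == 2023: # This is for testing
        if features['categories'] == "astro-ph.CO" and 2018 < year < 2023: # This is for fine-tuning
            data.append(features['abstract'])
            n1 += 1
        n2 += 1

print("Total number of selected articles", n1)
print("Total number of all articles", n2)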

Total number of selected articles 3497
Total number of all articles 2560035
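
The selection cell above matches the category string exactly and uses strict year bounds. As a minimal sketch of a variant, the helper below treats `categories` as a space-separated list (an assumption about the snapshot format) and uses inclusive year bounds. The name `is_selected`, its parameters, and the sample records are hypothetical; applying it to the snapshot would give different counts from those printed above.

```python
def is_selected(record, category="astro-ph.CO", first_year=2018, last_year=2022):
    """Return True if `record` lists `category` and was updated in [first_year, last_year]."""
    # Assumption: 'categories' is a space-separated string, e.g. "astro-ph.CO gr-qc".
    categories = record["categories"].split()
    year = int(record["update_date"][:4])
    return category in categories and first_year <= year <= last_year


# Hypothetical records, invented only to exercise the helper.
samples = [
    {"categories": "astro-ph.CO", "update_date": "2018-05-01"},
    {"categories": "astro-ph.CO gr-qc", "update_date": "2021-11-30"},
    {"categories": "hep-th", "update_date": "2020-01-15"},
    {"categories": "astro-ph.CO", "update_date": "2023-02-10"},
]
selected = [is_selected(r) for r in samples]
print(selected)
```

Calling `is_selected(record, first_year=2023, last_year=2023)` would select the test split instead.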


### Inspect a sample abstract

In [2]:
data[0]

'  At present, dwarf spheroidal galaxies satellites of the Milky Way may\nrepresent the best astrophysical objects for dark matter (DM) searches with\ngamma-ray telescopes. They present the highest mass-to-light ratios known in\nthe Universe. Furthermore, many of them are near enough from the Earth to be\nable to yield high predicted DM annihilation fluxes that might be observed by\ncurrent gamma-ray instruments like MAGIC. The picture has become even better\nwith the recent discovery of new dwarfs. These new objects are expected to\nyield even higher DM annihilation fluxes, since most of them are nearer than\nthe previously known dwarfs and are even more DM dominated systems. Here a\ntentative list of the best candidates is given. The observational results\nobtained with MAGIC from the Draco dwarf as well as the observation of other\ndwarfs carried out by other Cherenkov telescopes are presented as well.\nFinally, we discuss the detection prospects of such kind of objects in the\ncont
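
The abstract shown above has leading spaces and hard line breaks (`\n`) inside sentences. A minimal sketch of one way to normalize such text before prompting is given below; the helper name `clean_abstract` is hypothetical and not part of the original notebook. Dropping non-ASCII characters also removes symbols such as Greek letters, so that step is a trade-off rather than a requirement.

```python
def clean_abstract(text):
    """Drop non-ASCII characters and collapse all whitespace runs to single spaces."""
    # encode(...).decode(...) returns a str again, not a bytes object.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # split() with no arguments splits on any whitespace, including newlines.
    return " ".join(ascii_text.split())


raw = (
    "  At present, dwarf spheroidal galaxies satellites of the Milky Way may\n"
    "represent the best astrophysical objects for dark matter (DM) searches with\n"
    "gamma-ray telescopes."
)
cleaned = clean_abstract(raw)
print(cleaned)
```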

### Use LangChain and Ollama to run the llama3.1:8b-instruct-fp16 model and generate a QA pair from a given abstract

In [3]:
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain_community.chat_models import ChatOllama


# The two fields the model must return for every abstract
response_schemas = [
    ResponseSchema(name="Question", description="the generated question from the provided context"),
    ResponseSchema(name="Answer", description="the corresponding answer from the provided context")]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
# Text describing the expected JSON layout; it is appended to the prompt
format_instructions = output_parser.get_format_instructions(only_json=True)

# temperature=0.0 for more repeatable output; format='json' requests JSON output from Ollama
llm = ChatOllama(model="llama3.1:8b-instruct-fp16", temperature=0.0, format='json')#, keep_alive='16h')
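
The schema above asks the model for a JSON object with a `Question` and an `Answer` field. As a rough illustration of that shape, the sketch below validates a reply with the standard library only. It is a stand-in for illustration, not the LangChain parser; `parse_qa`, `REQUIRED_KEYS`, and the sample reply are hypothetical and the reply is not actual model output.

```python
import json

REQUIRED_KEYS = ("Question", "Answer")


def parse_qa(raw_reply):
    """Return a dict with non-empty Question and Answer strings, or None if the reply is unusable."""
    try:
        parsed = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    for key in REQUIRED_KEYS:
        value = parsed.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return {key: parsed[key].strip() for key in REQUIRED_KEYS}


# Hypothetical reply written by hand for this example.
reply = '{"Question": "Why are dwarf spheroidal galaxies used in dark matter searches?", "Answer": "They have very high mass-to-light ratios."}'
qa = parse_qa(reply)
print(qa)
print(parse_qa("not json"))
```

Returning `None` instead of raising lets a long generation loop skip a malformed reply and continue with the next abstract.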

### Prompt template to generate a QA pair

In [15]:
TEMPLATE = """You are a cosmologist. Your task is to generate a meaningful question and an answer using the following provided "{context}" from a cosmology and nongalactic astrophysics article.

You MUST obey the following criteria:
- No preamble.
- Restrict the question to the context information provided.
- Do NOT create a question that cannot be answered from the context.
- Phrase the question so that it does NOT refer to the specific context.
- For instance, do NOT use phrases like 'given the provided context' or 'in this work' in the question, or 'according to the text' in the answer, because if the question is asked elsewhere it would not come with that specific context. Replace these terms with specific details.
- Please do NOT repeat the provided context.
- Please only generate a question and an answer, without any introductory sentence such as "Here is the generated question and answer:".
- Please follow the recommended JSON format below.
- Please ensure that the output is a valid JSON object.
{format_instructions}"""

prompt = ChatPromptTemplate.from_template(template=TEMPLATE)
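
The template has two placeholders, `{context}` and `{format_instructions}`. As a small sketch of how such placeholders are filled, the example below uses plain `str.format` on a shortened, hypothetical template; it assumes the chat template uses the same brace-style substitution, which is the usual default but is not verified here. It also shows why the context should be a `str`: a `bytes` value is inserted as its `b'...'` representation.

```python
# Shortened, hypothetical template with the same two placeholders.
PREVIEW_TEMPLATE = 'Context: "{context}"\n{format_instructions}'

# Hand-written stand-in for the real format instructions.
instructions = '{"Question": string, "Answer": string}'

# Braces inside the substituted value are inserted verbatim, not re-interpreted.
rendered = PREVIEW_TEMPLATE.format(context="Dwarf galaxies", format_instructions=instructions)
print(rendered)

# Passing bytes embeds the bytes repr, including the b'...' wrapper.
rendered_bytes = PREVIEW_TEMPLATE.format(context=b"Dwarf galaxies", format_instructions=instructions)
print(rendered_bytes)
```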


### Loop through all abstracts and generate QA pairs

In [18]:
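responses = []
# Only the first two abstracts are processed here; use range(len(data)) for the full set.
for i in range(2):
    # Drop non-ASCII characters; encode() returns a bytes object
    abstract = data[i].encode('ascii', errors='ignore')
    print ("################### Abstract", i)
    print (i, abstract)
    print ("Generated QA pair ---->")
    # Decode back to str so the prompt does not contain the b'...' bytes representation
    messages = prompt.format_messages(context=abstract.decode('ascii'), format_instructions=format_instructions)
    response = llm.invoke(messages)
    # Parse the JSON reply into a dict with 'Question' and 'Answer' keys
    output_dict = output_parser.parse(response.content)
    print ("Question: " + output_dict['Question'])
    print ("Answer: " + output_dict['Answer'])
    responses.append(output_dict)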


# Saving dataset
#import numpy as np
#np.save("responses_astrophco-23-QAfull.npy", np.array(responses))
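
The commented-out `np.save` call would store the list of dictionaries as a pickled object array. A plain-text alternative is JSON Lines, with one QA pair per line. The sketch below is one possible approach, not the notebook's method; the helper names, the file name `qa_pairs.jsonl`, and the sample pair are hypothetical.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hand-written sample pair standing in for the generated responses.
sample_responses = [
    {"Question": "Which telescope observed the Draco dwarf?", "Answer": "MAGIC."},
]

# A temporary directory keeps the example from touching the working directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "qa_pairs.jsonl")
    save_jsonl(sample_responses, out_path)
    restored = load_jsonl(out_path)

print(restored == sample_responses)
```

A line-oriented file can also be appended to after each abstract, so a long run that stops partway keeps the pairs generated so far.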


################### Abstract 0
0 b'  At present, dwarf spheroidal galaxies satellites of the Milky Way may\nrepresent the best astrophysical objects for dark matter (DM) searches with\ngamma-ray telescopes. They present the highest mass-to-light ratios known in\nthe Universe. Furthermore, many of them are near enough from the Earth to be\nable to yield high predicted DM annihilation fluxes that might be observed by\ncurrent gamma-ray instruments like MAGIC. The picture has become even better\nwith the recent discovery of new dwarfs. These new objects are expected to\nyield even higher DM annihilation fluxes, since most of them are nearer than\nthe previously known dwarfs and are even more DM dominated systems. Here a\ntentative list of the best candidates is given. The observational results\nobtained with MAGIC from the Draco dwarf as well as the observation of other\ndwarfs carried out by other Cherenkov telescopes are presented as well.\nFinally, we discuss the detection prospects of