# Fine-tuning Gemma

### Select only articles from Cosmology and Nongalactic Astrophysics (astro-ph.CO). Articles from 2018-2022 are used for fine-tuning and articles from 2023 for testing. The Q&A dataset is generated from the abstracts of the selected articles.
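The selection and the train/test split by year can be sketched as below. This is a minimal sketch, not the code used to build the dataset. It assumes each metadata record is a dict with a space-separated `categories` string, an `update_date` in `YYYY-MM-DD` form, and an `abstract` field; the helper name `select_astro_ph_co` and the toy records are hypothetical.

```python
def select_astro_ph_co(records, first_year, last_year):
    """Return abstracts of astro-ph.CO records dated within [first_year, last_year]."""
    selected = []
    for rec in records:
        # Assumption: `categories` is a space-separated string of arXiv categories.
        if "astro-ph.CO" not in rec["categories"].split():
            continue
        # Assumption: `update_date` is formatted as YYYY-MM-DD.
        year = int(rec["update_date"][:4])
        if first_year <= year <= last_year:
            selected.append(rec["abstract"])
    return selected


# Toy records for illustration only (not real arXiv entries).
records = [
    {"categories": "astro-ph.CO hep-ph", "update_date": "2019-05-01", "abstract": "A"},
    {"categories": "astro-ph.GA", "update_date": "2020-01-15", "abstract": "B"},
    {"categories": "astro-ph.CO", "update_date": "2023-03-10", "abstract": "C"},
]
train_abstracts = select_astro_ph_co(records, 2018, 2022)  # fine-tuning split
test_abstracts = select_astro_ph_co(records, 2023, 2023)   # testing split
```

Whether cross-listed articles should count, or only those with astro-ph.CO as the primary category, is a choice the heading does not settle; the sketch accepts any article that lists the category.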

In [9]:
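import os
# "! export" only affects a throwaway subshell; set the variables in the kernel process instead.
os.environ["KAGGLE_USERNAME"] = "sultanhhassan"
os.environ["KAGGLE_KEY"] = "<your-kaggle-api-key>"  # placeholder: keep the real key out of the notebook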

In [10]:
import os
os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
# Avoid memory fragmentation on JAX backend.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

In [11]:
import json

In [12]:
import keras
import keras_nlp
import numpy as np

In [13]:
ls

QA_dataset_generation.ipynb
README.md
arxiv_astrophco_qa_pairs_2018_2022_finetuning.json
arxiv_astrophco_qa_pairs_2023_testing.json
arxiv_filtered_astrophco-18-22.json
arxiv_filtered_astrophco-23.json
evaluation_LLM_as_judge.ipynb
fine_tune_Gemma.ipynb


In [14]:
with open('arxiv_astrophco_qa_pairs_2018_2022_finetuning.json', 'r') as f:
  data = json.load(f)

In [15]:
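# Format each QA pair as one Instruction/Response training string.
data_list = []
for item in data:
    text = "Instruction:\n" + item['Question'] + "\n\nResponse:\n" + item['REF_ANS']
    # Drop non-ASCII characters, then decode so each entry is a str rather than bytes.
    data_list.append(text.encode('ascii', errors='ignore').decode('ascii'))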
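A model fine-tuned on this layout is usually prompted with the same layout and an empty response, so that it continues from `Response:`. The sketch below builds both forms from one template. The names `QA_TEMPLATE` and `build_text` are hypothetical and the question/answer strings are toy values.

```python
# Same Instruction/Response layout as the training strings above.
QA_TEMPLATE = "Instruction:\n{instruction}\n\nResponse:\n{response}"


def build_text(question, answer=""):
    """Return a training string, or an inference prompt when `answer` is empty."""
    text = QA_TEMPLATE.format(instruction=question, response=answer)
    # Mirror the notebook: drop non-ASCII characters and return a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


train_text = build_text("What is dark matter?", "A non-luminous matter component.")
prompt_text = build_text("What is dark matter?")  # ends right after "Response:\n"
```

A string like `prompt_text` is what would be passed to the fine-tuned model's text generation call at test time.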

### See first abstract

In [16]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma2_2b_en")
gemma_lm.summary()

ValueError: Preset gemma2_2b_en has no config.json. Make sure the URI or directory you are trying to load is a valid KerasNLP preset and and that you have permissions to read/download from this location.

### Use LangChain and Ollama to run the llama3.1:8b-instruct-fp16 model to generate a QA pair from a given abstract

In [4]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain_community.chat_models import ChatOllama


response_schemas = [
    ResponseSchema(name="Question", description="the generated question from the provided context"),
    ResponseSchema(name="Answer", description="the corresponding answer from the provided context")]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions(only_json=True)

llm = ChatOllama(model="llama3.1:8b-instruct-fp16", temperature=0.0, format='json')#, keep_alive='16h')
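The two response schemas ask the model for a JSON object with a `Question` key and an `Answer` key. The stdlib-only sketch below shows the kind of check this implies for a raw model reply. It is an illustration under that assumption, not LangChain's implementation; `parse_qa` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = ("Question", "Answer")


def parse_qa(raw):
    """Parse a raw JSON reply and make sure both expected keys are present."""
    obj = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {key: str(obj[key]) for key in REQUIRED_KEYS}


parsed = parse_qa('{"Question": "Q?", "Answer": "A."}')
```

Raising `ValueError` for both malformed JSON and missing keys lets a caller handle every bad reply with a single `except` clause.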

### Prompt template to generate a QA pair

In [5]:
TEMPLATE = """You are a cosmologist. Your task is to generate a meaningful question and an answer using the following provided "{context}" from a cosmology and nongalactic astrophysics article.

You MUST obey the following criteria:
- No preamble.
- Restrict the question to the context information provided.
- Do NOT create a question that cannot be answered from the context.
- Phrase the question so that it does NOT refer to specific context.
- For instance, do NOT use phrases like 'given the provided context' or 'in this work' in the question or 'according to the text' in the answer, because if the question is asked elsewhere it would not be provided specific context. Replace these terms with specific details.
- Please do NOT repeat the provided context.
- Please only generate a question and an answer, without any sentence in advance such as "Here is the generated question and answer:".
- Please follow the recommended JSON format below.
- Please ensure that the output is a valid JSON object.
{format_instructions}"""

prompt = ChatPromptTemplate.from_template(template=TEMPLATE)


### Loop through all abstracts and generate QA pairs

In [6]:
responses = []
for i in range(2):  # demo: generate QA pairs for the first 2 abstracts only
    # Drop non-ASCII characters; the result is bytes, hence the b'...' in the printed output.
    abstract = data[i].encode('ascii', errors='ignore')
    print("################### Abstract", i)
    print(abstract)
    print("################### Generated QA pair")
    # Decode back to str so the prompt does not contain a literal b'...' wrapper.
    messages = prompt.format_messages(context=abstract.decode('ascii'), format_instructions=format_instructions)
    response = llm.invoke(messages)
    output_dict = output_parser.parse(response.content)
    print("Question: " + output_dict['Question'])
    print("Answer: " + output_dict['Answer'])
    responses.append(output_dict)

################### Abstract 0
b'  At present, dwarf spheroidal galaxies satellites of the Milky Way may\nrepresent the best astrophysical objects for dark matter (DM) searches with\ngamma-ray telescopes. They present the highest mass-to-light ratios known in\nthe Universe. Furthermore, many of them are near enough from the Earth to be\nable to yield high predicted DM annihilation fluxes that might be observed by\ncurrent gamma-ray instruments like MAGIC. The picture has become even better\nwith the recent discovery of new dwarfs. These new objects are expected to\nyield even higher DM annihilation fluxes, since most of them are nearer than\nthe previously known dwarfs and are even more DM dominated systems. Here a\ntentative list of the best candidates is given. The observational results\nobtained with MAGIC from the Draco dwarf as well as the observation of other\ndwarfs carried out by other Cherenkov telescopes are presented as well.\nFinally, we discuss the detection prospects of s
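Over thousands of abstracts, an occasional reply that is not valid JSON would abort a loop like the one above. One way to guard against that is to record failures and continue, as sketched below. `generate_pairs` is a hypothetical helper, and `fake_llm` is a stand-in for the real model call so that the example runs without Ollama.

```python
import json


def generate_pairs(abstracts, generate_fn):
    """Collect QA pairs, recording the index and error of any reply that fails to parse."""
    pairs, failed = [], []
    for i, abstract in enumerate(abstracts):
        try:
            reply = json.loads(generate_fn(abstract))
            pairs.append({"Question": reply["Question"], "Answer": reply["Answer"]})
        except (ValueError, KeyError, TypeError) as err:
            failed.append((i, str(err)))  # keep going; retry or inspect these later
    return pairs, failed


def fake_llm(abstract):
    """Hypothetical stand-in for the model: returns JSON text, or garbage for one input."""
    if "bad" in abstract:
        return "not json"
    return json.dumps({"Question": "What is studied?", "Answer": abstract})


pairs, failed = generate_pairs(["dwarf galaxies", "bad reply"], fake_llm)
```

The `failed` list keeps the abstract index, so those abstracts can be retried or skipped without rerunning the whole loop.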

In [7]:
# Save the 2018-2022 fine-tuning dataset
#with open('Arxiv_astroph.CO_QA_pairs_2018_2022.json', 'w') as f:
#    json.dump(responses, f)
# Save the 2023 testing dataset
#with open('Arxiv_astroph.CO_QA_pairs_2023.json', 'w') as f:
#    json.dump(responses, f)
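Saving and reloading the generated pairs is a plain JSON round trip. The sketch below writes to a temporary directory so it does not touch the real dataset files; the file name `qa_pairs_demo.json` is hypothetical, and the single pair is the one generated for the first abstract.

```python
import json
import os
import tempfile

pairs = [
    {
        "Question": "What type of astrophysical objects are considered to have "
                    "the highest mass-to-light ratios in the Universe?",
        "Answer": "Dwarf spheroidal galaxies",
    }
]

# Write to a throwaway location rather than the real dataset path.
path = os.path.join(tempfile.mkdtemp(), "qa_pairs_demo.json")
with open(path, "w") as f:
    json.dump(pairs, f, indent=2)

# Read the file back and confirm nothing was lost.
with open(path, "r") as f:
    loaded = json.load(f)
```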

### Check the first QA pair

In [8]:
responses[0]['Question']

'What type of astrophysical objects are considered to have the highest mass-to-light ratios in the Universe?'

In [9]:
responses[0]['Answer']

'Dwarf spheroidal galaxies'