# LlamaModel Code

The LlamaModel is a continuation of Mallory Helfenbein's (NASA HQ intern 2023) [ReviewerExtractor codeV2](https://github.com/ninoc/ReviewerExtractor/tree/main/codeV2). You must create an ADS account and obtain an [API token](https://ui.adsabs.harvard.edu/user/settings/token). We pass a list of researcher names to codeV2, which searches ADS by first author, gathers each researcher's abstracts from 2003 to 2030, and returns the top 10 words, bigrams, and trigrams. From these n-grams, we create a combined top words list.

We used the llama3-70b-8192 model in groqCloud. You must create a groqCloud account and obtain an [API key](https://console.groq.com/keys). The model takes the combined top words for each researcher and determines their expertise, chosen from the [AAS keywords](https://journals.aas.org/keywords-2013/). We give the model a fixed prompt and the list of AAS topics. The model is first prompted for the general topics and then, for each topic, for the associated subtopics.

### Citing this code:

Part of this code is the second version of an expertise-finding tool developed by Helfenbein et al. 2023.

It uses the NASA ADS API to query for articles (refereed or not) in the "Astronomy" database (cite ADS). Please cite "Wu & Lendahl et al. 2024" and refer to the README file in the GitHub repository.


## Import Packages

*   pandas 1.5.3
*   groqCloud api
*   mount to google drive
*   nltk for n-grams
*   ads api
*   TextAnalysis.py
*   stopwords.txt (to create meaningful N-grams)
*   ADSsearcherpkg.py
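
As a rough illustration of why the stop-word list matters for the n-grams, the sketch below counts n-grams using only the standard library. It is not the implementation in `TextAnalysis.py`, which relies on nltk; the helper `top_ngrams`, the tiny stop-word set, and the sample text are all assumptions made for this example.

```python
from collections import Counter

def top_ngrams(text, n, stopwords, k=10):
    """Return the k most common n-grams as ((word, ...), count) tuples."""
    # Lowercase, split on whitespace, strip punctuation, drop stop words
    tokens = [w.strip(".,;:()").lower() for w in text.split()]
    tokens = [w for w in tokens if w and w not in stopwords]
    # Slide a window of length n over the filtered tokens
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

# Hypothetical inputs, for illustration only
stopwords = {"the", "of", "in", "we", "a", "and"}
abstract = "We study the solar wind. The solar wind in the heliosphere and the solar corona."

print(top_ngrams(abstract, 1, stopwords, k=2))
print(top_ngrams(abstract, 2, stopwords, k=1))
```

Without the stop-word filter, words such as "the" would dominate the unigram counts and break up meaningful bigrams like "solar wind".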

In [None]:
#We designed the code to work with Pandas 1.5.3. If the session restarts, make sure to run this cell again to import the correct version.
import pandas as pd
print(pd.__version__)

#If the Pandas version differs from 1.5.3, run the following:
!pip install pandas==1.5.3 --user

In [None]:
#connect to your google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#imports for n-grams
import requests
from urllib.parse import urlencode, quote_plus
import numpy as np
import sys

import nltk
nltk.download('punkt')
nltk.download('wordnet')

In [None]:
#install ads package
!pip install ads

In [None]:
#create a folder in your drive. In this case we named it SMS. You must store the TextAnalysis.py, stopwords.txt, ADSsearcherpkg.py, and the SMS Data in this folder
path_stop= '/content/drive/MyDrive/SMS/'
stop_file='stopwords.txt'
stop_dir=path_stop+stop_file
sys.path.append(path_stop)

In [None]:
#For the TextAnalysis File, please refer to M. Volze et al. 2023
import TextAnalysis as TA
import ADSsearcherpkg as AP

In [None]:
#Grab the data. Large inputs are best run in chunks; here we used chunks of 1000 rows.
SMS_file = pd.read_csv('/content/drive/MyDrive/SMS/SMS Input.csv') #import the total SMS data
first_1000 = SMS_file[0:1000] #change range to get a subset of the data (ex: the first 1000 rows you want to run through the llama model)

file_name = '/content/drive/MyDrive/SMS/0-1000.csv'
first_1000.to_csv(file_name, index=False) #save this file to a csv in your folder
first_1000
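
If the input spans several chunks, the slice boundaries can be generated up front rather than edited by hand for each run. The helper below is only a sketch: `chunk_bounds` is a hypothetical name, and each `(start, stop)` pair is meant to be used as `SMS_file[start:stop]`.

```python
def chunk_bounds(n_rows, size=1000):
    """Return (start, stop) pairs that cover n_rows in chunks of `size`."""
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]

# Example: 2,350 rows split into chunks of 1000
for start, stop in chunk_bounds(2350):
    # In the notebook this would be SMS_file[start:stop], saved under a matching name
    print(f"{start}-{stop}.csv")
```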

In [None]:
token = '' #Insert your ADS API token

## Reviewer Extractor

Import a file of researcher names as a csv file (a column labeled "Name" and formatted Last Name, First Name) and run the names through ADS Search. This will search by first author, gather their abstracts from 2003 to 2030, and return the top n-grams.
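
For orientation, a first-author search restricted to these years might be expressed against the ADS search endpoint roughly as follows. This is a sketch of the request only, not the code in `ADSsearcherpkg.py`; the helper `build_ads_query`, the selection of returned fields, and the row limit are assumptions, and no network call is made.

```python
from urllib.parse import urlencode, quote_plus

ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"

def build_ads_query(name, start_year=2003, end_year=2030, rows=200):
    """Build the URL for a first-author ADS search (hypothetical helper)."""
    params = {
        # first_author matches papers whose first author is `name`
        "q": f'first_author:"{name}" year:{start_year}-{end_year}',
        "fq": "database:astronomy",          # restrict to the Astronomy database
        "fl": "bibcode,title,abstract,aff",  # fields to return (assumed selection)
        "rows": rows,
    }
    return ADS_SEARCH_URL + "?" + urlencode(params, quote_via=quote_plus)

url = build_ads_query("Lastname, Firstname")
print(url)
# The request itself would carry the token in a header, for example:
# requests.get(url, headers={"Authorization": "Bearer " + token})
```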

In [None]:
#save the SMS chunk of data aside
file_name = '/content/drive/MyDrive/SMS/0-1000.csv'
sample_df = pd.read_csv(file_name)
sample_df

In [None]:
#ADS Search from codeV2
datf=AP.run_file_names(filename=file_name,
               token=token, stop_dir=stop_dir)

datf

In [None]:
#Save the people with no ADS search results to a separate Excel file
ads_no_results = sample_df[~sample_df["Name"].isin(datf["Input Author"])]
ads_no_results.to_excel(path_stop+'ADS_no_results1.xlsx', index=False)

In [None]:
#Combined top words function
import ast
import itertools
def topwords(top10words, top10bigrams, top10trigrams):
    '''
    Takes in the top 10 words, bigrams, and trigrams (lists of (n-gram, count) tuples) and returns a combined list
    '''
    # Inputs read back from a file arrive as strings; parse them as literals without executing arbitrary code
    if isinstance(top10words, str):
        top10words = ast.literal_eval(top10words)
    if isinstance(top10bigrams, str):
        top10bigrams = ast.literal_eval(top10bigrams)
    if isinstance(top10trigrams, str):
        top10trigrams = ast.literal_eval(top10trigrams)

    # Keep only the n-gram text and drop the counts
    words = [word for word, _ in top10words]
    bigrams = [' '.join(gram) for gram, _ in top10bigrams]
    trigrams = [' '.join(gram) for gram, _ in top10trigrams]

    # Flatten the three lists into one
    return list(itertools.chain(words, bigrams, trigrams))

Combine the n-grams into a single list using the function above. Also record the number of papers each researcher wrote as first author from 2003 to 2030.
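
As a worked example, the snippet below applies the same flattening to one made-up row. `combine` is a condensed stand-in for `topwords` above, and the n-gram counts are invented for illustration.

```python
import ast
from itertools import chain

def combine(words, bigrams, trigrams):
    """Condensed stand-in for topwords(): flatten the three n-gram lists."""
    # Columns read back from a file arrive as strings, so parse them first
    words, bigrams, trigrams = (
        ast.literal_eval(x) if isinstance(x, str) else x
        for x in (words, bigrams, trigrams)
    )
    return list(chain(
        (w for w, _ in words),
        (" ".join(g) for g, _ in bigrams),
        (" ".join(g) for g, _ in trigrams),
    ))

# Invented n-gram counts; the first argument is a string, as after a CSV round trip
row = combine(
    "[('galaxy', 12), ('redshift', 9)]",
    [(("star", "formation"), 5)],
    [(("active", "galactic", "nuclei"), 3)],
)
print(row)
```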

In [None]:
#Gather the top combined words and number of papers
datf["Top Combined Words"] = datf.apply(lambda x: topwords(x["Top 10 Words"], x["Top 10 Bigrams"], x['Top 10 Trigrams']), axis=1)
datf["Number of Papers"] = datf["Bibcode"].str.split(",").str.len()
datf

In [None]:
#drop duplicates and reset the index
datf = datf.astype(str) #convert datf columns to type string so you can drop duplicates
datf.drop_duplicates(inplace=True)
datf.reset_index(drop=True, inplace=True)
datf

In [None]:
#Save the ADS search results file. This is the file you'll need to run through the llama model
datf.to_excel(path_stop+'ADS_results1.xlsx', index=False)

## (START HERE IF CODE ERRORS AFTER RUNNING THE LLAMA MODEL) Re-Import ADS Search Results

In [None]:
#Skip this if this is the first time you are running the code
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
path_stop= '/content/drive/MyDrive/SMS/'
output_file = path_stop+"output_0-1000.csv" #make sure you have the right output_file name
sample_df = pd.read_csv('/content/drive/MyDrive/SMS/0-1000.csv') #make sure file name is correct

In [None]:
#Re-import ADS Search results file here if the code errors. Skip this if this is the first time you are running the code.
df = pd.read_excel(path_stop+'ADS_results1.xlsx')
datf = df[178:] #slice where the code stopped
datf
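
The slice index does not have to be read off by eye. Assuming the output CSV holds one header row followed by one row per completed researcher, the number of data rows gives the position at which to resume. `completed_rows` is a hypothetical helper, not part of the original notebook.

```python
import csv
import os
import tempfile

def completed_rows(path):
    """Count the data rows (header excluded) already written to the output CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)

# Demo on a throwaway file; the second field contains a newline, as model responses do
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv",
                                 delete=False, encoding="utf-8") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["Input Author", "Topics with Explanation"])
    writer.writerow(["Doe, Jane", "[stars]\nEvidence: ..."])
    writer.writerow(["Roe, Richard", "[galaxies]\nEvidence: ..."])

done = completed_rows(tmp.name)
os.remove(tmp.name)
print(done)
```

In the notebook, `done = completed_rows(output_file)` could then feed both `datf = df[done:]` and the `num_iter` starting value. Counting with `csv.reader` rather than counting lines matters because the model's explanations span several lines inside a single quoted field.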

## Preparing the Llama Model

*   Run the dictionary of topics and their subtopics
*   Run the topic and subtopic prompts for the model
*   extract_topics function to extract the expertise from the model's response
*   generate_subtopics function to generate the subtopics with the model






In [None]:
# @title Run the Subtopics Dictionary
d = {
"physical data and processes":
"""
acceleration of particles
accretion, accretion disks
asteroseismology
astrobiology
astrochemistry
astroparticle physics
atomic data
atomic processes
black hole physics
chaos
conduction
convection
dense matter
diffusion
dynamo
elementary particles
equation of state
gravitation
gravitational lensing: strong
gravitational lensing: weak
gravitational lensing: micro
gravitational waves
hydrodynamics
instabilities
line: formation
line: identification
line: profiles
magnetic fields
magnetic reconnection
magnetohydrodynamics (MHD)
masers
molecular data
molecular processes
neutrinos
nuclear reactions, nucleosynthesis, abundances
opacity
plasmas
polarization
radiation: dynamics
radiation mechanisms: general
radiation mechanisms: non-thermal
radiation mechanisms: thermal
radiative transfer
relativistic processes
scattering
shock waves
solid state: refractory
solid state: volatile
turbulence
waves
""",
"astronomical instrumentation methods and techniques":
"""
atmospheric effects
balloons
instrumentation: adaptive optics
instrumentation: detectors
instrumentation: high angular resolution
instrumentation: interferometers
instrumentation: miscellaneous
instrumentation: photometers
instrumentation: polarimeters
instrumentation: spectrographs
light pollution
methods: analytical
methods: data analysis
methods: laboratory: atomic
methods: laboratory: molecular
methods: laboratory: solid state
methods: miscellaneous
methods: numerical
methods: observational
methods: statistical
site testing
space vehicles
space vehicles: instruments
techniques: high angular resolution
techniques: image processing
techniques: imaging spectroscopy
techniques: interferometric
techniques: miscellaneous
techniques: photometric
techniques: polarimetric
techniques: radar astronomy
techniques: radial velocities
techniques: spectroscopic
telescopes
""",
"astronomical databases":
"""
astronomical databases: miscellaneous
atlases
catalogs
surveys
virtual observatory tools
""",
"astrometry and celestial mechanics":
"""
astrometry
celestial mechanics
eclipses
ephemerides
occultations
parallaxes
proper motions
reference systems
time
""",
"the sun":
"""
Sun: abundances
Sun: activity
Sun: atmosphere
Sun: chromosphere
Sun: corona
Sun: coronal mass ejections (CMEs)
Sun: evolution
Sun: faculae, plages
Sun: filaments, prominences
Sun: flares
Sun: fundamental parameters
Sun: general
Sun: granulation
Sun: helioseismology
Sun: heliosphere
Sun: infrared
Sun: interior
Sun: magnetic fields
Sun: oscillations
Sun: particle emission
Sun: photosphere
Sun: radio radiation
Sun: rotation
(Sun:) solar–terrestrial relations
(Sun:) solar wind
(Sun:) sunspots
Sun: transition region
Sun: UV radiation
Sun: X-rays, gamma rays
""",
"planetary systems":
"""
comets: general
comets: individual (…, …)
Earth
interplanetary medium
Kuiper belt: general
Kuiper belt objects: individual (…, …)
meteorites, meteors, meteoroids
minor planets, asteroids: general
minor planets, asteroids: individual (…, …)
Moon
Oort Cloud
planets and satellites: atmospheres
planets and satellites: aurorae
planets and satellites: composition
planets and satellites: detection
planets and satellites: dynamical evolution and stability
planets and satellites: formation
planets and satellites: fundamental parameters
planets and satellites: gaseous planets
planets and satellites: general
planets and satellites: individual (…, …)
planets and satellites: interiors
planets and satellites: magnetic fields
planets and satellites: oceans
planets and satellites: physical evolution
planets and satellites: rings
planets and satellites: surfaces
planets and satellites: tectonics
planets and satellites: terrestrial planets
protoplanetary disks
planet–disk interactions
planet–star interactions
zodiacal dust
""",
"stars":
"""
stars: abundances
stars: activity
stars: AGB and post-AGB
stars: atmospheres
(stars:) binaries (including multiple): close
(stars:) binaries: eclipsing
(stars:) binaries: general
(stars:) binaries: spectroscopic
(stars:) binaries: symbiotic
(stars:) binaries: visual
stars: black holes
(stars:) blue stragglers
(stars:) brown dwarfs
stars: carbon
stars: chemically peculiar
stars: chromospheres
(stars:) circumstellar matter
stars: coronae
stars: distances
stars: dwarf novae
stars: early-type
stars: emission-line, Be
stars: evolution
stars: flare
stars: formation
stars: fundamental parameters
stars: general
(stars:) gamma-ray burst: general
(stars:) gamma-ray burst: individual (…, …)
(stars:) Hertzsprung–Russell and C–M diagrams
stars: horizontal-branch
stars: imaging
stars: individual (…, …)
stars: interiors
stars: jets
stars: kinematics and dynamics
stars: late-type
stars: low-mass
stars: luminosity function, mass function
stars: magnetars
stars: magnetic field
stars: massive
stars: mass-loss
stars: neutron
(stars:) novae, cataclysmic variables
stars: oscillations (including pulsations)
stars: peculiar (except chemically peculiar)
(stars:) planetary systems
stars: Population II
stars: Population III
stars: pre-main sequence
stars: protostars
(stars:) pulsars: general
(stars:) pulsars: individual (…, …)
stars: rotation
stars: solar-type
(stars:) starspots
stars: statistics
(stars:) subdwarfs
(stars:) supergiants
(stars:) supernovae: general
(stars:) supernovae: individual (…, …)
stars: variables: Cepheids
stars: variables: delta Scuti
stars: variables: general
stars: variables: RR Lyrae
stars: variables: S Doradus
stars: variables: T Tauri, Herbig Ae/Be
(stars:) white dwarfs
stars: winds, outflows
stars: Wolf–Rayet
""",
"interstellar medium (ism) nebulae":
"""
ISM: abundances
ISM: atoms
ISM: bubbles
ISM: clouds
(ISM:) cosmic rays
(ISM:) dust, extinction
(ISM:) evolution
ISM: general
(ISM:) HII regions
(ISM:) Herbig–Haro objects
ISM: individual objects (…, …) (except planetary nebulae)
ISM: jets and outflows
ISM: kinematics and dynamics
ISM: lines and bands
ISM: magnetic fields
ISM: molecules
(ISM:) planetary nebulae: general
(ISM:) planetary nebulae: individual (…, …)
(ISM:) photon-dominated region (PDR)
ISM: structure
ISM: supernova remnants
""",
"the galaxy":
"""
Galaxy: abundances
Galaxy: bulge
Galaxy: center
Galaxy: disk
Galaxy: evolution
Galaxy: formation
Galaxy: fundamental parameters
Galaxy: general
(Galaxy:) globular clusters: general
(Galaxy:) globular clusters: individual (…, …)
Galaxy: halo
(Galaxy:) local interstellar matter
Galaxy: kinematics and dynamics
Galaxy: nucleus
(Galaxy:) open clusters and associations: general
(Galaxy:) open clusters and associations: individual (…, …)
(Galaxy:) solar neighborhood
Galaxy: stellar content
Galaxy: structure
""",
"galaxies":
"""
galaxies: abundances
galaxies: active
(galaxies:) BL Lacertae objects: general
(galaxies:) BL Lacertae objects: individual (…, …)
galaxies: bulges
galaxies: clusters: general
galaxies: clusters: individual (…, …)
galaxies: clusters: intracluster medium
galaxies: distances and redshifts
galaxies: dwarf
galaxies: elliptical and lenticular, cD
galaxies: evolution
galaxies: formation
galaxies: fundamental parameters
galaxies: general
galaxies: groups: general
galaxies: groups: individual (…, …)
galaxies: halos
galaxies: high-redshift
galaxies: individual (…, …)
galaxies: interactions
(galaxies:) intergalactic medium
galaxies: irregular
galaxies: ISM
galaxies: jets
galaxies: kinematics and dynamics
(galaxies:) Local Group
galaxies: luminosity function, mass function
(galaxies:) Magellanic Clouds
galaxies: magnetic fields
galaxies: nuclei
galaxies: peculiar
galaxies: photometry
(galaxies:) quasars: absorption lines
(galaxies:) quasars: emission lines
(galaxies:) quasars: general
(galaxies:) quasars: individual (…, …)
(galaxies:) quasars: supermassive black holes
galaxies: Seyfert
galaxies: spiral
galaxies: starburst
galaxies: star clusters: general
galaxies: star clusters: individual (…, …)
galaxies: star formation
galaxies: statistics
galaxies: stellar content
galaxies: structure
""",
"cosmology":
"""
(cosmology:) cosmic background radiation
(cosmology:) cosmological parameters
cosmology: miscellaneous
cosmology: observations
cosmology: theory
(cosmology:) dark ages, reionization, first stars
(cosmology:) dark matter
(cosmology:) dark energy
(cosmology:) diffuse radiation
(cosmology:) distance scale
(cosmology:) early universe
(cosmology:) inflation
(cosmology:) large-scale structure of universe
(cosmology:) primordial nucleosynthesis
""",
"resolved and unresolved sources as a function of wavelength":
"""
gamma rays: diffuse background
gamma rays: galaxies
gamma rays: galaxies: clusters
gamma rays: general
gamma rays: ISM
gamma rays: stars
infrared: diffuse background
infrared: galaxies
infrared: general
infrared: ISM
infrared: planetary systems
infrared: stars
radio continuum: galaxies
radio continuum: general
radio continuum: ISM
radio continuum: planetary systems
radio continuum: stars
radio lines: galaxies
radio lines: general
radio lines: ISM
radio lines: planetary systems
radio lines: stars
submillimeter: diffuse background
submillimeter: galaxies
submillimeter: general
submillimeter: ISM
submillimeter: planetary systems
submillimeter: stars
ultraviolet: galaxies
ultraviolet: general
ultraviolet: ISM
ultraviolet: planetary systems
ultraviolet: stars
X-rays: binaries
X-rays: bursts
X-rays: diffuse background
X-rays: galaxies
X-rays: galaxies: clusters
X-rays: general
X-rays: individual (…, …)
X-rays: ISM
X-rays: stars
"""
}
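
Each value in `d` is a newline-separated block of text, which is convenient for pasting into a prompt but less so for checking a model answer. A sketch of turning one block into a list and validating a choice against it follows; the one-entry dictionary is an abbreviated stand-in for `d`, and `subtopic_list` and `is_valid_subtopic` are hypothetical helpers.

```python
# Abbreviated stand-in for the full dictionary above
d_small = {
    "astronomical databases": """
astronomical databases: miscellaneous
atlases
catalogs
surveys
virtual observatory tools
""",
}

def subtopic_list(topic, table):
    """Split a topic's text block into a clean list of subtopics."""
    return [line.strip() for line in table[topic].splitlines() if line.strip()]

def is_valid_subtopic(topic, subtopic, table):
    """True if `subtopic` appears verbatim (ignoring case) under `topic`."""
    allowed = {s.lower() for s in subtopic_list(topic, table)}
    return subtopic.strip().lower() in allowed

print(subtopic_list("astronomical databases", d_small))
print(is_valid_subtopic("astronomical databases", "Surveys", d_small))
print(is_valid_subtopic("astronomical databases", "telescopes", d_small))
```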

In [None]:
#Llama model prompts
prompt_topic = """
You are a scientist determining the areas of expertise of a person based on a list of top words from their abstracts pulled from the NASA Astrophysics Data System (ADS). You will be provided a list of top words to determine the topic or topics the person is an expert in. The topics are listed below.

Topics list:
physical data and processes
astronomical instrumentation methods and techniques
astronomical databases
astrometry and celestial mechanics
the sun
planetary systems
stars
interstellar medium (ism) nebulae
the galaxy
galaxies
cosmology
resolved and unresolved sources as a function of wavelength

Now please determine accurately the topic or topics based on these top words and provide evidence. You MUST ONLY choose from the topics list above. Choose to the best of your ability and do not leave the response as none. Please list out the topics in a python list format first. Here are some examples of the expected format, [galaxies, the sun, cosmology], [astronomical databases], [interstellar medium (ism) nebulae, astronomical instrumentation methods and techniques, resolved and unresolved sources as a function of wavelength, stars]. Note these examples contain topics chosen ONLY from the topics list and do not include the ` character. Then provide evidence by explaining which of the top words correspond with the topic. Here are the top words:
"""

def prompt_subtopic(subtopic):
  prompt = """
  You are a scientist determining the specific areas of expertise of a person based on a list of top words from their abstracts pulled from the NASA Astrophysics Data System (ADS) and their associated general topic. You will be provided a list of top words and the general topic to determine the subtopic or subtopics the person is an expert in. The subtopics are listed below.

  Subtopics list:
  """ + subtopic + """
  Now please determine accurately the subtopic or subtopics based on these top words and general topic. You MUST ONLY choose from the subtopics list above. Choose to the best of your ability and do not leave the response as none. Please list out the topic and subtopics in this format: [topic - subtopic1|subtopic2|etc]. Some examples of the expected format are [physical data and processes - astrobiology|diffusion|gravitational lensing: strong], [astrometry and celestial mechanics - occultations], [stars - (stars:) circumstellar matter|stars: flare|stars: interiors|(stars:) supernovae: general]. Note these examples contain subtopics chosen ONLY from the subtopics list for each topic.
  Then provide evidence by explaining which of the top words correspond with the subtopics. Here are the top words and the general topic:
  """
  return prompt

In [None]:
###IMPORTANT FUNCTIONS###
import re

def extract_topics(text, topic):
  '''
  Returns the topics as a list (topic == "t") or the raw "topic - subtopic1|subtopic2" string (any other value)
  '''
  # Extract the contents of the first [...] in the model's response
  match = re.search(r'\[(.*?)\]', text)
  if not match:
    # An empty list for topics keeps the caller from looping over the characters of 'None'
    return [] if topic == "t" else 'None'
  list_string = match.group(1)
  if topic == "t":
    return [item.strip().lower().replace('`', '') for item in list_string.split(',')]
  return list_string


def generate_subtopics(topwords, topic):
  '''
  Uses the llama model to generate the subtopics for a given topic
  '''
  try:
    subtopics = d[topic]
  except KeyError:
    # The model returned a topic that is not a key of the subtopics dictionary
    return "["+topic + " - not listed]"

  client = Groq()
  completion = client.chat.completions.create(
      model="llama3-70b-8192",
      messages=[
          {
            "role": "user",
            "content": prompt_subtopic(subtopics) + "Top Words: " + str(topwords) + " Topic: " + topic
          }
        ],
      temperature=0,
      max_tokens=3000,
      top_p=1,
      stream=True,
      stop=None,
    )
  response = "".join(chunk.choices[0].delta.content or "" for chunk in completion)
  return response
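
To make the parsing step concrete, the snippet below runs the same regular expression on two made-up responses. The response strings are invented, and `first_bracket` is a stand-in name for the extraction performed inside `extract_topics`.

```python
import re

def first_bracket(text):
    """Return the contents of the first [...] in text, or None if absent."""
    match = re.search(r"\[(.*?)\]", text)
    return match.group(1) if match else None

# Invented topic response
topic_reply = "[galaxies, The Sun]\nEvidence: 'redshift' points to galaxies ..."
topics = [t.strip().lower() for t in first_bracket(topic_reply).split(",")]
print(topics)

# Invented subtopic response
sub_reply = "[the sun - Sun: flares|(Sun:) solar wind] Evidence: ..."
topic, subs = first_bracket(sub_reply).split(" - ", 1)
print(topic, subs.split("|"))
```

The pipe separator in the subtopic format matters because several AAS subtopics contain commas themselves (for example "accretion, accretion disks"), so splitting on commas would cut them apart.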

## Run the Llama Model
In the following code, you will create and name a new CSV file. Then run the llama model; the CSV file is updated after each iteration. The model reads the combined top words for a given researcher and returns the topics that match those words. Each topic is then run through the model again, together with the top words, to gather the subtopics (the generate_subtopics function is called).
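
The two-pass flow can be sketched without calling the API by passing the model in as a function. Everything here is illustrative: `classify`, `fake_model`, and the canned replies are assumptions, and in the notebook the role of `ask` is played by the Groq chat-completion calls.

```python
import re

def classify(topwords, ask):
    """Pass 1: ask for topics. Pass 2: ask for the subtopics of each topic."""
    reply = ask("topics", str(topwords))
    match = re.search(r"\[(.*?)\]", reply)
    topics = [t.strip().lower() for t in match.group(1).split(",")] if match else []
    subtopics = []
    for topic in topics:
        sub_reply = ask("subtopics", f"Top Words: {topwords} Topic: {topic}")
        sub_match = re.search(r"\[(.*?)\]", sub_reply)
        subtopics.append(sub_match.group(1) if sub_match else "None")
    return topics, subtopics

def fake_model(stage, content):
    """Stand-in for the LLM: returns canned text in the expected format."""
    if stage == "topics":
        return "[stars, cosmology] Evidence: ..."
    topic = content.rsplit("Topic: ", 1)[1]
    return f"[{topic} - example subtopic] Evidence: ..."

topics, subtopics = classify(["supernova", "dark energy"], fake_model)
print(topics)
print(subtopics)
```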

In [None]:
!pip install groq
#Insert your groqCloud API key after the equals sign (text placed after it on the same line becomes part of the value)
%env GROQ_API_KEY=

In [None]:
# @title Run this just once! If the code errors do not run this cell again!
# Opens a new csv file with headers
import csv

output_file = path_stop+"output_0-1000.csv" #name the output file
with open(output_file, 'a', newline='', encoding='utf-8') as dynamic_csv_file:
  csv_writer = csv.writer(dynamic_csv_file)
  csv_writer.writerow(["Input Author", "Affiliations", "Combined Top Words", "Topics with Explanation", "Subtopics with Explanation", "Topics", "Subtopics", "Number of Papers"])

In [None]:
from groq import Groq
import re
import csv

#Insert your groqCloud API key after the equals sign (text placed after it on the same line becomes part of the value)
%env GROQ_API_KEY=
client = Groq()
num_iter = 0 #if the code errors, set this to the row number where it stopped before running this cell again

for topwords in datf["Top Combined Words"]:
  completion = client.chat.completions.create(
      model="llama3-70b-8192",
      messages=[
          {
              "role": "user",
              "content": prompt_topic + str(topwords)
          }
      ],
      temperature=0,
      max_tokens=3000,
      top_p=1,
      stream=True,
      stop=None,
  )

  #Topics with explanation
  response = "".join(chunk.choices[0].delta.content or "" for chunk in completion)
  #Extract only the topic
  topics = extract_topics(response, "t")
  #Subtopics with explanation
  full_st = []
  #Extract only the subtopics
  st = []

  for topic in topics:
    subtopics = generate_subtopics(topwords, topic)
    full_st.append(subtopics)
    st.append(extract_topics(subtopics, "s"))

  #Get input author, affiliations, and number of papers
  author = datf["Input Author"][num_iter]
  aff = datf["Affiliations"][num_iter]
  papers = datf["Number of Papers"][num_iter]

  #Write to csv
  with open(output_file, 'a', newline='', encoding='utf-8') as dynamic_csv_file:
    csv_writer = csv.writer(dynamic_csv_file)
    csv_writer.writerow([author, aff, topwords, response, full_st, topics, st, papers])
    num_iter += 1
    print("Completed Row " + str(num_iter))

## Format and Save the Results
Read in the output CSV file. Then format the results in the following cell and export them as an Excel file.

In [None]:
output_file = path_stop+"output_0-1000.csv" #make sure the output file name is correct
output_df = pd.read_csv(output_file)
output_df

In [None]:
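#Format the model results
import ast
def format_output(output):
  '''
  Turns a stringified list of "topic - subtopic1|subtopic2" entries into "topic:[subtopics]|...|Number of Papers: "
  '''
  lst = ast.literal_eval(output)
  formatted = ""  # named to avoid shadowing the built-in format()
  for item in lst:
    try:
      topics = item.split(' - ')[0]
      subtopic_split = item.split(' - ')[1]
      subtopics = subtopic_split.split('|')
      formatted += topics+':'+str(subtopics)+'|'
    except (IndexError, AttributeError):
      # Skip entries without a "topic - subtopics" pair, such as 'None'
      pass
  formatted += "Number of Papers: "
  return formatted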

output_df["Expertise"] = output_df["Subtopics"].apply(format_output)
output_df["Expertise"] = output_df['Expertise'] + output_df['Number of Papers'].astype(str)
output_df
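
For a sense of the resulting string, the sketch below applies the same formatting to one invented "Subtopics" cell. `format_cell` is a condensed stand-in for `format_output` combined with the paper count, and the cell contents and count are made up.

```python
import ast

def format_cell(cell, n_papers):
    """Condensed stand-in for format_output plus the appended paper count."""
    parts = []
    for item in ast.literal_eval(cell):
        topic, sep, subs = item.partition(" - ")
        if not sep:
            continue  # skip entries such as 'None' that lack "topic - subtopics"
        parts.append(f"{topic}:{subs.split('|')}")
    return "|".join(parts + [f"Number of Papers: {n_papers}"])

cell = "['stars - stars: flare|stars: rotation', 'None']"
print(format_cell(cell, 7))
```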

In [None]:
#Merge the SMS chunk of data with the output file
final_df = sample_df.merge(output_df, left_on="Name", right_on="Input Author")
final_df.rename(columns={"Expertise_x": "Labeled Expertise", "Expertise_y": "Model Expertise"}, inplace=True) #rename columns
final_df

In [None]:
#Export the final merged dataframe to an excel file
final_df.to_excel(path_stop+"final_0-1000.xlsx", index=False) #rename the file after each run