# RADx-rad Publication Relevance Classification Using Likert Scores and LLMs

This notebook classifies scientific publications for relevance to RADx-rad program objectives using Large Language Models (LLMs). 

Publications referencing RADx-rad grant numbers are evaluated to determine how closely they align with:

- **Funding Opportunity Announcements (FOAs)** objectives (based on full-text descriptions)
- **dbGaP Study Objectives** (based on study titles and abstracts)

LLMs assess relevance by comparing each publication's title and abstract against FOA descriptions and dbGaP study details, assigning a relevance score on a 5-point [Likert scale](https://www.scribbr.com/methodology/likert-scale/). Each LLM evaluation also includes a rationale explaining the assigned score.


## Workflow Overview

1. **Data Integration**
   - Load and merge preprocessed FOA, dbGaP, and publication datasets.

2. **Likert-Based Classification**
   - Evaluate publications against FOA and grant objectives using LLMs.
   - Annotate publications with Likert scores and detailed rationales.

**Author:** Peter W. Rose ([pwrose@ucsd.edu](mailto:pwrose@ucsd.edu))  
**Date:** 2025-03-13

In [1]:
import os
from pathlib import Path
import glob
import json
import time
import pandas as pd

import llm_utils

## Specify the LLM Model
Run this notebook once per model to evaluate the publications with different LLMs.
Uncomment the desired model for each run.

In [2]:
# model = "meta-llama/Llama-3.3-70B-Instruct"
model = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
# model = "gpt-4o-mini"

In [3]:
FOA_DIR = "../derived_data/foas" # Path to full-text FOA documents
DBGAP_ABSTRACTS = "../derived_data/dbgap_abstracts.csv" # dbGaP data
RADX_RAD_ALL = "../derived_data/publications_pubmed_raw.csv" # List of publications that mention RADx-rad grant numbers
LAST_UPDATE = "2025-06-02" # Date publications were last retrieved from PubMed

LENGTH_THRESHOLD = 100 # Minimum combined length of title + abstract, in characters
ANNOTATION = "annotation_full_text_5" # Name of subfolder to save Likert scoring annotations
RESULTS = "../results"

## 1. **Data Integration**
Load and merge the preprocessed FOA, dbGaP, and publication data.

### Load Funding Opportunity Announcement Documents for the RADx-rad Program

In [4]:
foas = pd.concat(
    (pd.read_csv(f).assign(sub_project=f.stem) for f in Path(FOA_DIR).glob("*.csv")),
    ignore_index=True
)
foas

Unnamed: 0,id,name,url,sub_project,summary
0,RFA-OD-20-017,Emergency Awards RADx-RAD: Screening for COVID...,https://grants.nih.gov/grants/guide/rfa-files/...,SCENT,EXPIRED/nNational Institutes of Health (NIH)/n...
1,RFA-OD-20-015,Emergency Awards: RADx-rad Wastewater Detectio...,https://grants.nih.gov/grants/guide/rfa-files/...,Wastewater,EXPIRED/nNational Institutes of Health (NIH)/n...
2,RFA-OD-20-016,Emergency Awards: RADx-RAD Multimodal COVID-19...,https://grants.nih.gov/grants/guide/rfa-files/...,Multimodal Surveillance,EXPIRED/nNational Institutes of Health (NIH)/n...
3,RFA-OD-20-022,Emergency Awards: Chemosensory Testing as a CO...,https://grants.nih.gov/grants/guide/rfa-files/...,Chemosensory Testing,EXPIRED/nNational Institutes of Health (NIH)/n...
4,RFA-OD-20-014,Emergency Awards: Automatic Detection and Trac...,https://grants.nih.gov/grants/guide/rfa-files/...,Automatic Detection & Tracing,EXPIRED/nNational Institutes of Health (NIH)/n...
5,RFA-OD-20-023,Emergency Awards: RADx-rad Predicting Viral-As...,https://grants.nih.gov/grants/guide/rfa-files/...,PreVAIL kIds,EXPIRED/nNational Institutes of Health (NIH)/n...
6,RFA-OD-20-021,Emergency Awards RADx-RAD: Novel Biosensing fo...,https://grants.nih.gov/grants/guide/rfa-files/...,Novel Biosensing and VOC,EXPIRED/nNational Institutes of Health (NIH)/n...
7,RFA-OD-20-018,Emergency Awards: Exosome-based Non-traditiona...,https://grants.nih.gov/grants/guide/rfa-files/...,Exosome,EXPIRED/nNational Institutes of Health (NIH)/n...


### Load dbGaP Titles and Abstracts for the RADx-rad Program

In [5]:
dbgap_abstracts = pd.read_csv(DBGAP_ABSTRACTS, dtype=str, keep_default_na=False)
dbgap_abstracts.rename(columns={ "title": "dbgap_title", "description": "dbgap_abstract"}, inplace=True)
dbgap_abstracts.head(1)

Unnamed: 0,project_num,dbgap_accession,research_initiative,sub_project,project_serial_num,dbgap_title,focus,dbgap_abstract
0,1U01HL152410-01,phs002522.v1.p1,RADx-rad,Novel Biosensing and VOC,HL152410,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,"This proposal describes the design, fabricatio..."


### Load List of Publications that mention RADx-rad Grant Numbers

In [6]:
publications = pd.read_csv(RADX_RAD_ALL, usecols=["pm_id", "pmc_id", "doi", "title", "abstract", "keywords", "journal", "year", "authors", "article_type", "project_serial_num", "award_type", "supplement", "sub_project"], dtype=str, keep_default_na=False)

In [7]:
print(f"Number of publications: {publications.shape[0]}")

Number of publications: 689


### Map the Publications to corresponding dbGaP Studies

In [8]:
publications = publications.merge(dbgap_abstracts, on=["project_serial_num", "sub_project"])
print(f"Number of publications: {publications.shape[0]}")
publications.drop_duplicates(inplace=True)
publications.head()

Number of publications: 801


Unnamed: 0,pm_id,pmc_id,doi,title,abstract,keywords,authors,journal,year,article_type,project_serial_num,award_type,supplement,sub_project,project_num,dbgap_accession,research_initiative,dbgap_title,focus,dbgap_abstract
0,39773555,PMC11827850,doi:10.1172/JCI188222,CXCL12 ameliorates neutrophilia and disease se...,"Neutrophils, particularly low-density neutroph...",Angiotensin-Converting Enzyme 2|Animals|COVID-...,"Zheng, Jian|Dhakal, Hima|Qing, Enya|Shrestha, ...",The Journal of clinical investigation,2025,Journal Article,TR003787,U18,False,SCENT,1U18TR003787-01,phs002563.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,The COVID-19 pandemic has caused unprecedented...
1,39880798,PMC11800485,doi:10.1093/femsec/fiaf010,Distinct bacteria display genus and species-sp...,Lichens are complex symbiotic systems where fu...,Bacteria|Colombia|Fungi|Lichens|Microbiota|Phy...,"Chaib De Mares, Maryam|Arciniegas Castro, Emer...",FEMS microbiology ecology,2025,Journal Article,DA053941,U01,False,Wastewater,1U01DA053941-01,phs002525.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,"The University of Miami (UM), with three prima..."
2,39948059,PMC11874078,doi:10.1002/wnan.70004,Nanoparticle Contrast Agents for Photon-Counti...,The clinical availability of photon-counting c...,Animals|Contrast Media|Humans|Nanoparticles|Ph...,"Devkota, Laxman|Bhavane, Rohan|Badea, Cristian...",Wiley interdisciplinary reviews. Nanomedicine ...,2025,Journal Article|Review,HD105593,R61,False,PreVAIL kIds,1R61HD105593-01,phs002585.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,This work is directed at characterizing pediat...
3,40101747,PMC11952872,doi:10.1093/jimmun/vkaf006,A genetically modulated Toll-like receptor-tol...,Dysregulated innate immune responses contribut...,LYST|MIS-C|SARS-CoV-2|genetic variants|hyperin...,"Khan, Rehan|Ji, Weizhen|Guzman Rivera, Jeisac|...","Journal of immunology (Baltimore, Md. : 1950)",2025,Journal Article,HD105593,R61,False,PreVAIL kIds,1R61HD105593-01,phs002585.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,This work is directed at characterizing pediat...
4,39724424,PMC11995861,doi:10.1007/s00216-024-05720-z,Bioreactor contamination monitoring using off-...,Metabolically active cells emit volatile organ...,Animals|Aspergillus fumigatus|Biomanufacturing...,"Linderholm, Angela L|Bhandari, Manohar P|Borra...",Analytical and bioanalytical chemistry,2025,Journal Article,TR003795,U18,False,SCENT,1U18TR003795-01,phs002600.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,The data herein combines GC-MS and GC-DMS anal...
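
The row count rises from 689 to 801 after the merge, which is the pattern a one-to-many join produces when a merge key matches more than one dbGaP row. The notebook does not state the cause, so the following is only a minimal sketch of how one might check for such keys. The helper `fan_out_keys` and the toy rows are hypothetical and use plain dictionaries rather than the notebook's DataFrames.

```python
from collections import Counter


def fan_out_keys(rows: list[dict], key_columns: tuple[str, ...]) -> dict[tuple, int]:
    """Return the merge keys that occur more than once, with their row counts."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Toy stand-in for dbgap_abstracts; the values are illustrative, not real data.
dbgap_rows = [
    {"project_serial_num": "AA000001", "sub_project": "SCENT", "dbgap_accession": "phs-a"},
    {"project_serial_num": "AA000001", "sub_project": "SCENT", "dbgap_accession": "phs-b"},
    {"project_serial_num": "BB000002", "sub_project": "Exosome", "dbgap_accession": "phs-c"},
]

# Every publication that carries a repeated key is duplicated once per match.
print(fan_out_keys(dbgap_rows, ("project_serial_num", "sub_project")))
```

With the real data, `dbgap_abstracts.to_dict("records")` would supply rows in this shape.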


In [9]:
publications[publications["pmc_id"] == "PMC10784670"]

Unnamed: 0,pm_id,pmc_id,doi,title,abstract,keywords,authors,journal,year,article_type,project_serial_num,award_type,supplement,sub_project,project_num,dbgap_accession,research_initiative,dbgap_title,focus,dbgap_abstract
177,38222877,PMC10784670,doi:10.1016/j.conctc.2023.101246,Moana: Alternate surveillance for COVID-19 in ...,"Objective: Create a longitudinal, multi-modal ...",,Morgan ER|Dillard D|Lofgren E|Maddison BK|Rikl...,Contemp Clin Trials Commun.,2023,Journal Article,MD016526,R01,False,Multimodal Surveillance,1R01MD016526-01,phs002551.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,Marshallese Pacific Islanders bear a dispropor...


### Map the publications to the corresponding FOAs

In [10]:
publications = publications.merge(foas, on="sub_project")
print(f"Number of publications: {publications.shape[0]}")
publications.head()

Number of publications: 801


Unnamed: 0,pm_id,pmc_id,doi,title,abstract,keywords,authors,journal,year,article_type,...,project_num,dbgap_accession,research_initiative,dbgap_title,focus,dbgap_abstract,id,name,url,summary
0,39773555,PMC11827850,doi:10.1172/JCI188222,CXCL12 ameliorates neutrophilia and disease se...,"Neutrophils, particularly low-density neutroph...",Angiotensin-Converting Enzyme 2|Animals|COVID-...,"Zheng, Jian|Dhakal, Hima|Qing, Enya|Shrestha, ...",The Journal of clinical investigation,2025,Journal Article,...,1U18TR003787-01,phs002563.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,The COVID-19 pandemic has caused unprecedented...,RFA-OD-20-017,Emergency Awards RADx-RAD: Screening for COVID...,https://grants.nih.gov/grants/guide/rfa-files/...,EXPIRED/nNational Institutes of Health (NIH)/n...
1,39880798,PMC11800485,doi:10.1093/femsec/fiaf010,Distinct bacteria display genus and species-sp...,Lichens are complex symbiotic systems where fu...,Bacteria|Colombia|Fungi|Lichens|Microbiota|Phy...,"Chaib De Mares, Maryam|Arciniegas Castro, Emer...",FEMS microbiology ecology,2025,Journal Article,...,1U01DA053941-01,phs002525.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,"The University of Miami (UM), with three prima...",RFA-OD-20-015,Emergency Awards: RADx-rad Wastewater Detectio...,https://grants.nih.gov/grants/guide/rfa-files/...,EXPIRED/nNational Institutes of Health (NIH)/n...
2,39948059,PMC11874078,doi:10.1002/wnan.70004,Nanoparticle Contrast Agents for Photon-Counti...,The clinical availability of photon-counting c...,Animals|Contrast Media|Humans|Nanoparticles|Ph...,"Devkota, Laxman|Bhavane, Rohan|Badea, Cristian...",Wiley interdisciplinary reviews. Nanomedicine ...,2025,Journal Article|Review,...,1R61HD105593-01,phs002585.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,This work is directed at characterizing pediat...,RFA-OD-20-023,Emergency Awards: RADx-rad Predicting Viral-As...,https://grants.nih.gov/grants/guide/rfa-files/...,EXPIRED/nNational Institutes of Health (NIH)/n...
3,40101747,PMC11952872,doi:10.1093/jimmun/vkaf006,A genetically modulated Toll-like receptor-tol...,Dysregulated innate immune responses contribut...,LYST|MIS-C|SARS-CoV-2|genetic variants|hyperin...,"Khan, Rehan|Ji, Weizhen|Guzman Rivera, Jeisac|...","Journal of immunology (Baltimore, Md. : 1950)",2025,Journal Article,...,1R61HD105593-01,phs002585.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,This work is directed at characterizing pediat...,RFA-OD-20-023,Emergency Awards: RADx-rad Predicting Viral-As...,https://grants.nih.gov/grants/guide/rfa-files/...,EXPIRED/nNational Institutes of Health (NIH)/n...
4,39724424,PMC11995861,doi:10.1007/s00216-024-05720-z,Bioreactor contamination monitoring using off-...,Metabolically active cells emit volatile organ...,Animals|Aspergillus fumigatus|Biomanufacturing...,"Linderholm, Angela L|Bhandari, Manohar P|Borra...",Analytical and bioanalytical chemistry,2025,Journal Article,...,1U18TR003795-01,phs002600.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,The data herein combines GC-MS and GC-DMS anal...,RFA-OD-20-017,Emergency Awards RADx-RAD: Screening for COVID...,https://grants.nih.gov/grants/guide/rfa-files/...,EXPIRED/nNational Institutes of Health (NIH)/n...


In [11]:
publications[publications["pmc_id"] == "PMC10784670"]

Unnamed: 0,pm_id,pmc_id,doi,title,abstract,keywords,authors,journal,year,article_type,...,project_num,dbgap_accession,research_initiative,dbgap_title,focus,dbgap_abstract,id,name,url,summary
177,38222877,PMC10784670,doi:10.1016/j.conctc.2023.101246,Moana: Alternate surveillance for COVID-19 in ...,"Objective: Create a longitudinal, multi-modal ...",,Morgan ER|Dillard D|Lofgren E|Maddison BK|Rikl...,Contemp Clin Trials Commun.,2023,Journal Article,...,1R01MD016526-01,phs002551.v1.p1,RADx-rad,Rapid Acceleration of Diagnostics - Radical (R...,COVID-19,Marshallese Pacific Islanders bear a dispropor...,RFA-OD-20-016,Emergency Awards: RADx-RAD Multimodal COVID-19...,https://grants.nih.gov/grants/guide/rfa-files/...,EXPIRED/nNational Institutes of Health (NIH)/n...


## 2. **Likert-Based Classification**
Run the LLM to evaluate publications against FOA and grant objectives.

In [12]:
prompt = '''
You are a researcher analyzing scientific publications. You will be provided with a FUNDING OPPORTUNITY description, a GRANT ABSTRACT, and a PUBLICATION ABSTRACT. 
Your task is to determine whether the publication is related to the research objectives of the funding opportunity and/or the objectives stated in the grant abstract. 
If the publication abstract is missing, use the title but only if it clearly matches the objectives.
Please provide your answer as a JSON object following these guidelines:

Rank your response based on the following Likert scale:
1: Strongly disagree
2: Disagree
3: Neither agree nor disagree
4: Agree
5: Strongly agree

- result: integer 1 to 5 on the Likert scale to indicate if the publication is related to the funding opportunity and/or the grant description.
- explanation: A clear, concise, and specific explanation of how you arrived at your conclusion. The explanation should describe why the publication is or is not relevant to the research objectives in the funding opportunity and grant's abstract.

Example JSON response:

```json
{
  "result": 5,
  "explanation": "The publication directly addresses the specific research aims and methodology outlined in the funding opportunity and grant's abstract."
}
```
'''

The `response_format` below is a JSON schema that directs the LLM to return an object with exactly two fields: an integer `result` and a string `explanation`.

In [13]:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Likert-5",
        "schema": {
            "type": "object",
            "properties": {
                "result": {"type": "integer"},
                "explanation": {"type": "string"},
            },
            "required": ["result", "explanation"],
            "additionalProperties": False
        },
        "strict": True
    }
}
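
The schema above fixes the field names and types but does not restrict `result` to the 1 to 5 range of the Likert scale. A minimal sketch of a check after parsing, assuming the reply arrives as a JSON string, could look like this. The function `parse_likert_response` is hypothetical and not part of `llm_utils`.

```python
import json


def parse_likert_response(response: str) -> tuple[int, str]:
    """Parse an LLM reply and check that it holds a valid 1-5 Likert score."""
    data = json.loads(response)
    result = data["result"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(result, bool) or not isinstance(result, int):
        raise ValueError(f"result must be an integer, got {result!r}")
    if not 1 <= result <= 5:
        raise ValueError(f"result {result} is outside the 1-5 Likert range")
    explanation = data.get("explanation", "")
    return result, explanation


# Illustrative reply, not actual model output.
example = '{"result": 4, "explanation": "Addresses the wastewater surveillance aims."}'
score, why = parse_likert_response(example)
print(score, why)
```

Raising an error instead of substituting a default score keeps a malformed reply from being recorded as a genuine rating.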

In [14]:
def eval_publication(row, model): 
    pm_id = row["pm_id"]
    dbgap_accession = row["dbgap_accession"]
    doi = row["doi"].replace("doi:", "").replace("/", "_")

    identifier = f"{pm_id}_{doi}_{dbgap_accession}"

    # Remove model prefixes such as meta-llama from the model string
    simple_model = model.split('/', 1)[-1]
    result_path = os.path.join(RESULTS, simple_model, ANNOTATION)
    os.makedirs(result_path, exist_ok=True)

    # Skip records that already have a result file; delete a stale result
    # if the record is now below the length threshold
    result_file = os.path.join(result_path, f"{identifier}.csv")
    if os.path.isfile(result_file):
        if (len(row["title"]) + len(row['abstract'])) < LENGTH_THRESHOLD:
            print(f"Remove {identifier}. Length of title + abstract < {LENGTH_THRESHOLD} characters")
            os.remove(result_file)
        return

    print(f"processing: {identifier}")

    if (len(row["title"]) + len(row['abstract'])) < LENGTH_THRESHOLD:
        print(f"Skipping {pm_id}. Length of title + abstract < {LENGTH_THRESHOLD} characters")
        return

    # If there is no abstract, use the title as the abstract to avoid LLM failures.
    publication_abstract = row['abstract']
    if publication_abstract == "":
        publication_abstract = row['title']

    # Create the context for the LLM
    context = f"FUNDING OPPORTUNITY:\n{row['summary']}\n\nGRANT ABSTRACT:\n{row['dbgap_title']}\n{row['dbgap_abstract']}\n\nPUBLICATION ABSTRACT:\n{row['title']}\n{publication_abstract}"
    start = time.time()
    #print(context)
    
    # Run the LLM
    response, usage = llm_utils.run_prompt(prompt, context, model, environment="development", response_format=response_format)
    end = time.time()

    # Parse the response
    data = json.loads(response)
    result = data.get("result", 1)  # integer default, matching the schema type
    explanation = data.get("explanation", "")

    # Get token usage and calculate cost  
    completion_tokens = usage["completion_tokens"]
    prompt_tokens = usage["prompt_tokens"]

    cost = llm_utils.get_token_cost(prompt_tokens, model, "input")
    cost += llm_utils.get_token_cost(completion_tokens, model, "output")

    # Collect result data
    elapsed_time = f"{end-start:.1f}"
    result_data = {
        'result': [result],
        'explanation': [explanation],
        'model': [simple_model],
        'prompt_tokens': [prompt_tokens],
        'completion_tokens': [completion_tokens],
        'elapsed_time': [elapsed_time],
        'cost': [cost]
    }

    # Append the result data to each row 
    pub_data = row.to_dict()
    merged_data = pub_data | result_data

    # Create and save the DataFrame
    df = pd.DataFrame(merged_data)
    df.to_csv(result_file, index=False)
    return

In [15]:
publications.apply(eval_publication, model=model, axis=1)

processing: 39793745_10.1016_j.actbio.2025.01.006_phs002583.v1.p1
processing: 38598791_10.1513_AnnalsATS.202310-896PS_phs002689.v1.p1
Skipping 38598791. Length of title + abstract < 100 characters
processing: 39078251_10.1513_AnnalsATS.202403-322PS_phs002689.v1.p1
Skipping 39078251. Length of title + abstract < 100 characters
processing: 37738417_10.1093_infdis_jiad405_phs002603.v1.p1
Skipping 37738417. Length of title + abstract < 100 characters
processing: 37738417_10.1093_infdis_jiad405_phs002603.v1.p1
Skipping 37738417. Length of title + abstract < 100 characters
processing: 37699143_10.1164_rccm.202308-1493ED_phs002689.v1.p1
Skipping 37699143. Length of title + abstract < 100 characters
processing: 36795031_10.2215_CJN.0000000000000089_phs002657.v1.p1
Skipping 36795031. Length of title + abstract < 100 characters
processing: 37159952_10.1513_AnnalsATS.202301-090RL_phs002689.v1.p1
Skipping 37159952. Length of title + abstract < 100 characters
processing: 36356276_10.2105_AJPH.2022.

0      None
1      None
2      None
3      None
4      None
       ... 
796    None
797    None
798    None
799    None
800    None
Length: 801, dtype: object
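
Because `eval_publication` writes one CSV file per publication and study, a later step has to gather those files before the scores can be summarized. The following is a minimal sketch of that step using only the standard library. The helpers `collect_annotations` and `score_counts` are hypothetical, and the sketch assumes each file has the `result` column written above.

```python
import csv
from collections import Counter
from pathlib import Path

# The FOA full text is stored in every row, so allow large CSV fields.
csv.field_size_limit(10_000_000)


def collect_annotations(result_dir: str) -> list[dict]:
    """Read every per-publication CSV in result_dir into one list of rows."""
    rows = []
    for csv_file in sorted(Path(result_dir).glob("*.csv")):
        with open(csv_file, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


def score_counts(rows: list[dict]) -> Counter:
    """Count how often each Likert score occurs in the collected rows."""
    return Counter(int(row["result"]) for row in rows)
```

Calling `collect_annotations` on the folder built from `RESULTS`, the simplified model name, and `ANNOTATION`, then passing the rows to `score_counts`, gives the score distribution for one model run.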