# Summarize an article from PubMedCentral (PMC)
This notebook retrieves the PDF file of an article from PMC and extracts key information for the RADx-rad project.

It can use either the public OpenAI API or the internal SDSC LLM API.

In [1]:
# Enter the PMC ID of the article (note: not all articles have an associated PDF file)

# pmcid = "PMC10190252"
# pmcid = "PMC10463275"
pmcid = "PMC8402658"
# pmcid = "PMC840265" # this example is a test for an invalid PMC ID
# pmcid = "PMC8854333"
# pmcid = "PMC9386735"
# pmcid = "PMC9725778"

# The following articles cause a timeout error:
#    InternalServerError: <html><body><h1>504 Gateway Time-out</h1>
#    The server didn't respond in time.
# pmcid = "PMC10121104"
# pmcid = "PMC7954844"
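
A quick format check can catch malformed IDs before any download is attempted. The sketch below assumes that a PMC ID is the prefix `PMC` followed by digits; the helper name `is_wellformed_pmcid` is hypothetical and not part of `utils`. It cannot detect an ID that is well-formed but does not exist, such as the invalid test case above.

```python
import re

# Assumed format: the literal prefix "PMC" followed by one or more digits.
_PMCID_PATTERN = re.compile(r"PMC\d+")


def is_wellformed_pmcid(candidate: str) -> bool:
    """Return True if the string looks like a PMC ID (format check only)."""
    return _PMCID_PATTERN.fullmatch(candidate) is not None
```

An ID that passes this check can still fail at download time, so the retrieval step should handle errors as well.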

## Prompts to summarize the RADx-rad articles
Modify to your specific needs.

In [2]:
system_message = """
You are a researcher working on COVID diagnostic and surveillance methods. Your job is to summarize the content of an article.
Ignore the "References" and "Acknowledgment" sections. Evaluate the article using the following aspects: 
Objective and Scope, Methodology, Biorecognition Elements, Key Findings, Limitations, Potential Applications, and Study Type. 
"""

assistant_message = """
Present the result in JSON format as in the example below without additional text output. Include the pmcid, which is the first word in the input. 
The sub-criteria in the example below are just a guide. Add or remove sub-criteria for each aspect based on the content.
For the Study Type, answer with yes or no. If none of the Study Types apply, use the "Other" sub-category and describe the study.

Example:
{
"Article":<pmcid>,
"Title":<title>,
"Objective and Scope": {
  "Objective":<text>,
  "Scope":<text>
},
"Methodology": {
  "Design":<text>,
  "Testing":<text>,
  "Validation":<text>,
  <other sub-criteria as needed>:<text>
},
"Key Findings": {
  "Sensitivity":<text>,
  "Specificity":<text>,
  "Limit of Detection (LOD)":<text>,
  "Variant detection":<text>,
  "Cross-reactivity with other viruses":<text>,
  <other sub-criteria as needed>:<text>
},
"Limitations": {
  "Testing":<text>,
  "Environmental Factors":<text>,
  <other sub-criteria as needed>:<text>
},
"Potential Applications": {
  "Point-of-Care Testing":<text>,
  "Non-Invasive Diagnostics":<text>,
  <other sub-criteria as needed>:<text>
},
"Study Type": {
  "Diagnostic Method Development": yes or no,
  "Wastewater Surveillance": yes or no,
  "COVID or SARS-CoV-2 related": yes or no,
  "Other":<specify other types of study as text>
}
}
"""

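Because the prompt asks for JSON covering a fixed set of aspects, the model's reply can be checked for completeness once it arrives. The following is a minimal sketch, assuming the reply is a single JSON object with the aspect names from the system message as top-level keys; `EXPECTED_ASPECTS` and `missing_aspects` are hypothetical names, not part of `utils`.

```python
import json

# Aspect names taken from the system message above.
EXPECTED_ASPECTS = [
    "Objective and Scope",
    "Methodology",
    "Biorecognition Elements",
    "Key Findings",
    "Limitations",
    "Potential Applications",
    "Study Type",
]


def missing_aspects(response_text: str) -> list[str]:
    """Parse the reply as JSON and list the expected aspects it lacks.

    Raises json.JSONDecodeError if the reply is not valid JSON.
    """
    summary = json.loads(response_text)
    return [aspect for aspect in EXPECTED_ASPECTS if aspect not in summary]
```

An empty list means every aspect is present; a `json.JSONDecodeError` signals that the model added extra text or returned malformed JSON.
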
## Imports

In [3]:
import os
from dotenv import load_dotenv
from openai import OpenAI
import utils

In [4]:
# Load environment variables
# API keys and other configuration parameters are stored in the .env file.
# To create the .env file, copy the env_template file to .env and set your API keys.
ENV_PATH = "../.env"
load_dotenv(ENV_PATH, override=True)

True
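
`load_dotenv` returning `True` only indicates that the file was found and variables were set, not that every required key is present. A small check such as the sketch below can report what is missing; the helper name `missing_env_vars` is hypothetical, and the variable names are the ones read later in this notebook.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variables read by this notebook when service == "SDSC_LLM".
SDSC_LLM_VARS = ["SDSC_LLM_MODEL", "SDSC_LLM_API_KEY", "SDSC_LLM_BASE_URL"]
```

Calling `missing_env_vars(SDSC_LLM_VARS)` right after loading the `.env` file gives an early, readable error source instead of a failed API call later.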

## Set up API key and create client

In [5]:
# Choose a service
# service = "OPENAI" 
service = "SDSC_LLM"

In [6]:
if service == "OPENAI":
    # https://platform.openai.com/docs/models
    MODEL = os.environ.get("OPENAI_MODEL")
    API_KEY = os.environ.get("OPENAI_API_KEY")
    client = OpenAI(api_key=API_KEY)

elif service == "SDSC_LLM":
    # https://ai.meta.com/blog/meta-llama-3/
    MODEL = os.environ.get("SDSC_LLM_MODEL")
    API_KEY = os.environ.get("SDSC_LLM_API_KEY")
    BASE_URL = os.environ.get("SDSC_LLM_BASE_URL")
    client = OpenAI(api_key=API_KEY, base_url=BASE_URL)

else:
    # Fail early instead of leaving MODEL and client undefined
    raise ValueError(f"Unknown service: {service}")

In [7]:
# Available models
models = utils.get_available_models(client)
print(models)

['meta-llama/Meta-Llama-3.1-70B-Instruct']
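
The implementation of `utils.get_available_models` is not shown in this notebook. One plausible version, sketched here as an assumption rather than the actual helper, iterates over the result of the OpenAI client's `models.list()` call and collects the model IDs; the function name `list_model_ids` is hypothetical.

```python
def list_model_ids(client):
    """Return the sorted IDs of the models the endpoint reports.

    Assumes `client` is an OpenAI-compatible client whose `models.list()`
    result can be iterated to yield objects with an `id` attribute.
    """
    return sorted(model.id for model in client.models.list())
```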


In [8]:
print(f"Using the default model: {MODEL}")

Using the default model: meta-llama/Meta-Llama-3.1-70B-Instruct


## Download article from PMC

In [9]:
text = utils.load_pdf_from_pmc(pmcid)

# Prepend PMC Id
text = pmcid + ":" + text

# Print input metrics
num_tokens = utils.get_token_count(text, MODEL)
print(f"Number of characters:   {len(text)}")
print(f"Number of tokens (GPT): {num_tokens}")
print(f"Cost for input tokens:  {utils.get_token_cost(num_tokens, MODEL)} $US")

Number of characters:   62651
Number of tokens (GPT): 15898
Cost for input tokens:  0.0 $US
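
The input metrics can also be used to check whether an article fits into the model's context window before it is sent. The sketch below is illustrative only: both helper names are hypothetical, and the context window size and output reservation must be taken from the documentation of the service actually used.

```python
def chars_per_token(num_chars: int, num_tokens: int) -> float:
    """Average number of characters represented by one token."""
    return num_chars / num_tokens


def fits_context(num_tokens: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the input plus the reserved output budget fits in the window."""
    return num_tokens + reserved_for_output <= context_window


# Metrics printed above for PMC8402658
ratio = round(chars_per_token(62651, 15898), 2)

# The window size (128_000) and output budget (4_096) are assumed values.
ok = fits_context(15898, context_window=128_000, reserved_for_output=4_096)
```

Longer articles that do not fit would need to be truncated or split before summarization.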


## Create the summary in JSON format

In [10]:
%%time
response = utils.prompt_gpt(client, MODEL, text, system_message, assistant_message)
print(response)

{
"Article": "PMC8402658",
"Title": "Monitoring SARS-CoV-2 Populations in Wastewater by Amplicon Sequencing and Using the Novel Program SAM Refiner",
"Objective and Scope": {
  "Objective": "To develop a computational workflow for monitoring SARS-CoV-2 populations in wastewater using amplicon sequencing and a novel program called SAM Refiner",
  "Scope": "The study focuses on the development of a method for tracking SARS-CoV-2 variants in wastewater using amplicon sequencing and SAM Refiner, with a specific focus on the spike gene"
},
"Methodology": {
  "Design": "The study used a combination of amplicon sequencing and computational analysis to track SARS-CoV-2 variants in wastewater",
  "Testing": "The method was tested on wastewater samples from a Missouri sewershed",
  "Validation": "The results were validated by comparing the outputs of SAM Refiner with known variant lineages and polymorphisms"
},
"Biorecognition Elements": {
  "Target": "SARS-CoV-2 spike gene",
  "Primers": "Loci-spec