# 02.1 - Information Extraction using OpenAI

In this notebook, we explore how OpenAI LLMs can be used to extract information relevant to infectious disease modeling, such as categorical keywords (e.g., diseases, treatments, populations), from publication titles and abstracts. This information will be used later for publication search, clustering, and related tasks.

The OpenAI Platform requires an API key to access the web service (https://platform.openai.com/docs/quickstart). To avoid inadvertently sharing a personal API key, the key should be set as an environment variable (`OPENAI_API_KEY`) in the Jupyter kernel used by this notebook rather than written into the notebook itself. Instructions for adding environment variables to a notebook kernel can be found at https://stackoverflow.com/a/53595397/763176.
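Before calling the API, it can help to confirm that the kernel actually sees the key. The following is a minimal sketch, assuming the key is stored under `OPENAI_API_KEY` (the variable the `openai` client reads by default); the helper name `api_key_status` is hypothetical and not part of any library. It reports only whether the key is present, so the key never ends up in saved notebook output.

```python
import os


def api_key_status(env=os.environ, name="OPENAI_API_KEY"):
    """Return a short status message without revealing the key itself."""
    key = env.get(name, "")
    if not key:
        return f"{name} is not set"
    # Report only the length so the key never appears in notebook output.
    return f"{name} is set ({len(key)} characters)"


print(api_key_status())
```

If the message says the variable is not set, restart the kernel after adding the variable to the kernel configuration.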

In [None]:
%pip install --upgrade --quiet openai

Load the publications from the database, skipping any publications without abstracts.

In [1]:
from tinydb import TinyDB

db = TinyDB("db.json")
table = db.table("articles")

articles = table.all()
print(f"loaded {len(articles)} articles")

articles = [x for x in articles if x["abstract"] != "No abstract available."]
print(f"retaining {len(articles)} articles")

loaded 1398 articles
retaining 1320 articles


In [2]:
KEYWORD_PROMPT_TEMPLATE = """
Your goal is to identify important keywords in scientific paper abstracts.
For the abstract below, identify all diseases, treatments, interventions, and vectors mentioned.
List the keywords identified in a JSON array, with each item in the array including keyword_type and value.
The only valid keyword types are disease, treatment, intervention, and vector.
Only return the JSON array.

abstract:
{abstract}
"""

In [4]:
from openai import OpenAI

client = OpenAI()

article = articles[0]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": KEYWORD_PROMPT_TEMPLATE.format(abstract=article["abstract"]),
        },
    ],
)

print(completion.choices[0].message.content)

```json
[
    {
        "keyword_type": "disease",
        "value": "respiratory conditions"
    },
    {
        "keyword_type": "treatment",
        "value": "prescription medication for respiratory conditions"
    }
]
```
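The reply above is wrapped in a Markdown code fence even though the prompt asks for only the JSON array, so it cannot be passed to `json.loads` directly. The following is a minimal sketch of one way to handle this, assuming the reply is either bare JSON or JSON inside a single fence; the helper name `parse_json_reply` is hypothetical, and the sample string is a shortened copy of the output above.

```python
import json

FENCE = "`" * 3  # the three-backtick Markdown fence marker


def parse_json_reply(text):
    """Parse a model reply that may be wrapped in a Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (e.g. the one ending in "json").
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence and anything after it.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


sample = (
    FENCE
    + 'json\n[{"keyword_type": "disease", "value": "respiratory conditions"}]\n'
    + FENCE
)
keywords = parse_json_reply(sample)
print(keywords[0]["value"])
```

`json.loads` still raises `json.JSONDecodeError` if the reply is not valid JSON, which is worth catching when processing many abstracts.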


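Using Structured Outputs (https://platform.openai.com/docs/guides/structured-outputs), the response can be constrained to a schema defined by a Pydantic model, so it can be parsed directly instead of being extracted from free text.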

In [None]:
from pydantic import BaseModel


class Keyword(BaseModel):
    # Field names match the prompt, which asks for keyword_type and value.
    keyword_type: str
    value: str


class KeywordResults(BaseModel):
    keywords: list[Keyword]


completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": KEYWORD_PROMPT_TEMPLATE.format(abstract=article["abstract"]),
        }
    ],
    response_format=KeywordResults,
)

print(completion.choices[0].message.parsed)

In [14]:
MODEL_CLASSIFICATION_PROMPT_TEMPLATE = """
Given the following scientific publication abstract,
identify if the publication references an infectious disease modeling technique.
Begin your answer with YES or NO.
If YES, also return the name of the technique or techniques used.

abstract:
{abstract}
"""

In [15]:
for i in range(10, 15):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": MODEL_CLASSIFICATION_PROMPT_TEMPLATE.format(abstract=articles[i]["abstract"]),
            }
        ],
    )

    print(articles[i]["abstract"])
    print(completion.choices[0].message.content)

Diet profoundly influences the composition of an animal's microbiome, especially in holometabolous insects, offering a valuable model to explore the impact of diet on gut microbiome dynamics throughout metamorphosis. Here, we use monarch butterflies (Danaus plexippus), specialist herbivores that feed as larvae on many species of chemically well-defined milkweed plants (Asclepias sp.), to investigate the impacts of development and diet on the composition of the gut microbial community. While a few microbial taxa are conserved across life stages of monarchs, the microbiome appears to be highly dynamic throughout the life cycle. Microbial diversity gradually diminishes throughout the larval instars, ultimately reaching its lowest point during the pupal stage and then recovering again in the adult stage. The microbial composition then undergoes a substantial shift upon the transition from pupa to adult, with female adults having significantly different microbial communities than the eggs t