# 02.1 - Information Extraction using OpenAI

In this notebook, we explore how OpenAI LLMs can be used to extract information relevant to infectious disease modeling, such as categorical keywords (e.g., diseases, treatments, populations), from publication titles and abstracts. This information will be used later for tasks such as publication search and clustering.

In [None]:
%pip install --upgrade --quiet openai

In [None]:
%pip install --upgrade --quiet python-dotenv

In [None]:
import dotenv
from genscai import paths

dotenv.load_dotenv(paths.root / "../.env")
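
The OpenAI client used later in this notebook reads its API key from the `OPENAI_API_KEY` environment variable, so a missing or empty `.env` entry only surfaces at the first request. A minimal sketch of a fail-fast check is shown below; `require_setting` is a hypothetical helper, not part of `genscai` or `dotenv`, and the demo call uses a stand-in mapping with a placeholder value so that it runs without a real key.

```python
import os
from collections.abc import Mapping


def require_setting(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return a non-empty setting from env, or raise a clear error (hypothetical helper)."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to the .env file or the shell environment")
    return value


# Demo with a stand-in mapping and a placeholder value (not a real key).
demo_env = {"OPENAI_API_KEY": "placeholder-value"}
print(require_setting("OPENAI_API_KEY", demo_env))
```

In the notebook itself, calling `require_setting("OPENAI_API_KEY")` right after `load_dotenv` would check the real environment.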

Load the publications from the training data JSON file, skipping any publications without abstracts.

In [None]:
import json
from genscai import paths

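with open(paths.data / "training_modeling_papers.json", "r") as f:
    papers = json.load(f)

# Keep only publications that have a non-empty abstract.
papers = [paper for paper in papers if paper.get("abstract")]

len(papers)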

In [None]:
KEYWORD_PROMPT_TEMPLATE = """
Your goal is to identify important keywords in scientific paper abstracts.
For the abstract below, identify all diseases, treatments, interventions, and vectors mentioned.
List the keywords identified in a JSON array, with each item in the array including keyword_type and value.
The only valid keyword types are disease, treatment, intervention, and vector.
Only return the JSON array.

abstract:
{abstract}
"""

In [None]:
from openai import OpenAI

client = OpenAI()

article = papers[0]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": KEYWORD_PROMPT_TEMPLATE.format(abstract=article["abstract"]),
        },
    ],
)

print(completion.choices[0].message.content)
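
The prompt asks for a bare JSON array, but a plain chat completion returns free text, and models sometimes wrap JSON in a Markdown code fence. The sketch below shows one way the reply could be turned into Python objects; `parse_keyword_reply` is a hypothetical helper, and the sample reply is a made-up string rather than real model output.

```python
import json
import re

# Markdown code-fence delimiter (three backticks), built programmatically.
FENCE = "`" * 3
FENCED_REPLY = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_keyword_reply(reply: str) -> list[dict]:
    """Parse a JSON array of keywords from a raw model reply (hypothetical helper)."""
    text = reply.strip()
    # Drop a surrounding Markdown code fence if the model added one.
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of keywords")
    return data


# Made-up reply for illustration only.
sample_reply = FENCE + 'json\n[{"keyword_type": "disease", "value": "malaria"}]\n' + FENCE
print(parse_keyword_reply(sample_reply))
```

A reply that is not valid JSON still raises an error here, so the caller needs to handle that case.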

The same extraction using Structured Outputs, which constrains the response to a schema defined with Pydantic models: https://platform.openai.com/docs/guides/structured-outputs

In [None]:
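from typing import Literal

from pydantic import BaseModel


class Keyword(BaseModel):
    # Field names match the prompt; Literal restricts keyword_type to the four valid types.
    keyword_type: Literal["disease", "treatment", "intervention", "vector"]
    value: str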


class KeywordResults(BaseModel):
    keywords: list[Keyword]


completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": KEYWORD_PROMPT_TEMPLATE.format(abstract=article["abstract"]),
        }
    ],
    response_format=KeywordResults,
)

print(completion.choices[0].message.parsed)
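
For later search and clustering, it may help to collapse the extracted keywords into one deduplicated list per keyword type. The sketch below assumes the parsed keywords have already been flattened into `(type, value)` pairs; `group_keywords` is a hypothetical helper, and the example pairs are invented for illustration.

```python
from collections import defaultdict


def group_keywords(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (keyword type, value) pairs into {type: sorted unique values}."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for keyword_type, value in pairs:
        # Normalize case and whitespace so near-duplicates collapse into one entry.
        grouped[keyword_type.strip().lower()].add(value.strip().lower())
    return {key: sorted(values) for key, values in grouped.items()}


# Invented example pairs, not real extraction results.
example_pairs = [
    ("disease", "Malaria"),
    ("vector", "Anopheles gambiae"),
    ("disease", "malaria "),
    ("intervention", "bed nets"),
]
print(group_keywords(example_pairs))
```

Lowercasing is a simple normalization choice; it would also merge terms that differ only by case, which may not always be desirable for acronyms.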

In [None]:
MODEL_CLASSIFICATION_PROMPT_TEMPLATE = """
Given the following scientific publication abstract,
identify if the publication references an infectious disease modeling technique.
Answer YES or NO.
If YES, also return the name of the technique or techniques used.

abstract:
{abstract}
"""

In [None]:
for paper in papers[5:10]:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": MODEL_CLASSIFICATION_PROMPT_TEMPLATE.format(abstract=paper["abstract"]),
            }
        ],
    )

    print(paper["abstract"])
    print(completion.choices[0].message.content)
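
The classification reply is free text that should begin with YES or NO, optionally followed by technique names. A minimal sketch of splitting such a reply into a boolean and the remaining text is shown below; `parse_classification` is a hypothetical helper, and the sample replies are invented rather than actual model output.

```python
import re

# YES or NO at the start, optional punctuation or whitespace, then the rest of the reply.
REPLY_PATTERN = re.compile(r"^(yes|no)\b[\s.,:;-]*(.*)$", re.IGNORECASE | re.DOTALL)


def parse_classification(reply: str) -> tuple[bool, str]:
    """Split a reply into (is_modeling_paper, text naming the techniques)."""
    text = reply.strip()
    match = REPLY_PATTERN.match(text)
    if not match:
        raise ValueError(f"reply does not start with YES or NO: {text[:40]!r}")
    is_yes = match.group(1).lower() == "yes"
    # Any text after NO is ignored, since no technique is expected in that case.
    techniques = match.group(2).strip() if is_yes else ""
    return is_yes, techniques


# Invented sample replies for illustration.
for sample in ["YES\nSIR compartmental model", "No.", "yes: agent-based model"]:
    print(parse_classification(sample))
```

Replies that do not start with YES or NO raise a `ValueError`, so unexpected output is caught instead of being misclassified.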