# 02.1 - Information Extraction using OpenAI

In this notebook, we explore ways that OpenAI LLMs can be used to extract information relevant to infectious disease modeling, such as categorical keywords (e.g., diseases, treatments, populations), from publication titles and abstracts. This information will be used later for tasks such as publication search and clustering.

The OpenAI Platform requires an API key to access the web service (https://platform.openai.com/docs/quickstart). To avoid inadvertently sharing a personal API key, do not paste it into the notebook. Instead, add it as an environment variable (the OpenAI client reads `OPENAI_API_KEY` by default) to the Jupyter kernel used by this notebook. Instructions for adding environment variables to a notebook kernel can be found at https://stackoverflow.com/a/53595397/763176.
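
Before calling the API, it can help to confirm that the kernel actually sees the key. The sketch below assumes the key is stored in `OPENAI_API_KEY`, the variable the OpenAI Python client reads by default; the helper name `get_api_key` is hypothetical and not part of any library.

```python
import os

def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the given environment mapping, or raise a clear error."""
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to the notebook kernel's environment"
        )
    return key

# Example with a stand-in mapping instead of the real environment.
# Only a masked prefix is printed so the key never appears in notebook output.
masked = get_api_key({"OPENAI_API_KEY": "sk-example"})[:3] + "..."
print(masked)
```

Calling `get_api_key()` with no arguments checks the real kernel environment and fails early with a readable message rather than at the first API request.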

In [None]:
%pip install --upgrade openai

Load the publications from the database, skipping any publications without abstracts.

In [None]:
from tinydb import TinyDB, Query

db = TinyDB('db.json')
table = db.table('articles')

articles = table.all()
print(f'loaded {len(articles)} articles')

articles = [x for x in articles if x['abstract'] != 'No abstract available.']
print(f'retaining {len(articles)} articles')
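
The filter above assumes every record has an `abstract` field holding a string. Whether the database contains records that violate this is not known here, so the following is only a defensive sketch: the helper `has_abstract` and the sample records are hypothetical and purely illustrative.

```python
PLACEHOLDER = "No abstract available."

def has_abstract(article):
    """True if the record carries a real, non-empty abstract."""
    abstract = (article.get("abstract") or "").strip()
    return bool(abstract) and abstract != PLACEHOLDER

# Illustrative records only; real records come from the 'articles' table.
sample = [
    {"title": "A", "abstract": "Malaria transmission model."},
    {"title": "B", "abstract": "No abstract available."},
    {"title": "C"},                    # missing field
    {"title": "D", "abstract": None},  # explicit null
]
kept = [a for a in sample if has_abstract(a)]
print([a["title"] for a in kept])
```

With this helper, the list comprehension in the cell above could be written as `[x for x in articles if has_abstract(x)]`.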

In [None]:
KEYWORD_PROMPT_TEMPLATE = """
Your goal is to identify important keywords in scientific paper abstracts.
For the abstract below, identify all diseases, treatments, interventions, and vectors mentioned.
List the keywords identified in a JSON array, with each item in the array including keyword_type and value.
The only valid keyword types are disease, treatment, intervention, and vector.
Only return the JSON array.

abstract:
{abstract}
"""

In [None]:
from openai import OpenAI

client = OpenAI()

# Use the first article as a test case.
article = articles[0]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": KEYWORD_PROMPT_TEMPLATE.format(
                abstract=article['abstract']
            )
        }
    ]
)

print(completion.choices[0].message.content)
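
The response printed above is plain text, so it has to be parsed before the keywords can be used. Models sometimes wrap JSON in a Markdown code fence even when asked to return only the array. The sketch below handles that case and drops items whose type is not one of the four named in the prompt. The helper name `parse_keywords` is hypothetical, and the sample response is made up for illustration.

```python
import json

VALID_TYPES = {"disease", "treatment", "intervention", "vector"}
FENCE = "`" * 3  # Markdown code-fence marker

def parse_keywords(text):
    """Parse a JSON array of keywords, tolerating a surrounding Markdown fence."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (possibly tagged "json") and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    items = json.loads(text)
    # Keep only well-formed items with a keyword type allowed by the prompt.
    return [
        item for item in items
        if isinstance(item, dict) and item.get("keyword_type") in VALID_TYPES
    ]

# Made-up response text, standing in for completion.choices[0].message.content.
sample = (
    FENCE + "json\n"
    '[{"keyword_type": "disease", "value": "malaria"},'
    ' {"keyword_type": "other", "value": "x"}]\n'
    + FENCE
)
print(parse_keywords(sample))
```

Malformed JSON still raises `json.JSONDecodeError` here, which is one motivation for constraining the output format on the API side.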

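Using Structured Outputs (https://platform.openai.com/docs/guides/structured-outputs), the response can be constrained to a schema defined as a Pydantic model, so the result is returned as parsed objects rather than free-form text.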

In [None]:
from pydantic import BaseModel

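class Keyword(BaseModel):
    # One extracted keyword; `type` is expected to be one of the types named in the prompt.
    type: str
    value: str

class KeywordResults(BaseModel):
    # Top-level container for all keywords extracted from one abstract.
    keywords: list[Keyword]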

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": KEYWORD_PROMPT_TEMPLATE.format(
                abstract=article['abstract']
            )
        }
    ],
    response_format=KeywordResults
)

print(completion.choices[0].message.parsed)
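
For later search and clustering, the extracted keywords are easier to work with when grouped by type and normalized. The following is a minimal sketch, assuming that lowercasing and trimming whitespace is an acceptable normalization. The helper `group_keywords` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def group_keywords(pairs):
    """Group (keyword_type, value) pairs into {type: sorted unique values}."""
    grouped = defaultdict(set)
    for keyword_type, value in pairs:
        # Normalize case and whitespace so near-duplicates collapse.
        grouped[keyword_type.strip().lower()].add(value.strip().lower())
    return {k: sorted(v) for k, v in grouped.items()}

# Illustrative pairs only, not actual model output.
pairs = [
    ("disease", "Malaria"),
    ("vector", "Anopheles gambiae"),
    ("disease", "malaria "),
]
print(group_keywords(pairs))
```

With the parsed result above, the pairs could be built as `[(k.type, k.value) for k in completion.choices[0].message.parsed.keywords]`.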