We have a dataset (DCC) in which certain labels are underrepresented. We want to generate 
new samples synthetically using GPT-4, with the following approach:
1. Take the existing samples for each document type and present them to GPT-4.
2. Ask it to generate new, similar sentences, with the token labels provided in BIO format.
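
The two steps above could be sketched as follows. This is only an illustration: the function name `build_messages`, the prompt wording, and the `token<TAB>tag` layout are assumptions, not the prompt actually used in this notebook.

```python
def build_messages(examples, n_new=5):
    """Build chat messages asking for n_new BIO-labelled sentences.

    examples: list of sentences, each a list of (token, tag) pairs.
    The prompt text below is a hypothetical example.
    """
    # Render each example sentence as one "token<TAB>tag" line per token.
    blocks = ["\n".join(f"{tok}\t{tag}" for tok, tag in sent) for sent in examples]
    system = (
        "You generate synthetic clinical sentences in the language of the examples. "
        "Return one token per line as 'token<TAB>BIO-tag', "
        "with a blank line between sentences."
    )
    user = (
        f"Here are {len(examples)} labelled examples:\n\n"
        + "\n\n".join(blocks)
        + f"\n\nGenerate {n_new} new sentences in the same format, "
        "varying the medical concepts but keeping the label types."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed as the `messages` argument of a chat completion request.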

We care specifically about the following labels:
* Experiencer: Other
* Temporality: Hypothetical

The GPT model's task is to generate new sentences that resemble the input sentences but vary the medical concepts. 

For instance: 
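
To illustrate what BIO-labelled tokens look like, here is a minimal sketch that converts character-span annotations into per-token BIO tags. The helper `spans_to_bio`, the whitespace tokenization, and the use of the meta-annotation value as the tag name are assumptions for illustration; the sentence and the `spasme` span (characters 62 to 68) come from document `DL1139` in the dataset.

```python
import re

def spans_to_bio(text, spans):
    """Return (token, tag) pairs for whitespace tokens.

    spans: list of (start, end, label) character offsets, end exclusive.
    Hypothetical helper; the tag naming is an assumption.
    """
    tagged = []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if m.start() < end and m.end() > start:  # token overlaps the span
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if m.start() <= start else "I-") + label
                break
        tagged.append((m.group(), tag))
    return tagged

sentence = ("De flexiecontractuur lijkt daarom meer een gevolg te zijn "
            "van spasme van de flexoren.")
for token, tag in spans_to_bio(sentence, [(62, 68, "recent")]):
    print(f"{token}\t{tag}")
```

Every token is tagged `O` except `spasme`, which is tagged `B-recent`.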



In [11]:
import os, sys, re
import json, dotenv

import openai
import asyncio
from openai import AsyncOpenAI, OpenAI


dotenv.load_dotenv()

True

In [2]:
openai.api_key = os.getenv("OPENAI_KEY")

In [3]:
# Use a context manager so the file handle is closed after loading.
with open('../data/emc-dcc_ann.json', encoding='utf-8') as f:
    DCC = json.load(f)

In [17]:
docs = DCC['projects'][0]['documents']

In [30]:
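# Keep every document that has at least one concept labelled
# Temporality=hypothetical or Experiencer=other.
relevant_docs = []
for doc in docs:
    for concept in doc['annotations']:
        meta = concept['meta_anns']
        if (meta['Temporality']['value'] == 'hypothetical'
                or meta['Experiencer']['value'] == 'other'):
            relevant_docs.append(doc)
            break  # one matching concept is enough for this document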
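
To check how scarce the target labels actually are, one could count the values of each meta-annotation. This is a sketch that assumes every concept carries the `meta_anns` structure used in the filter above; `count_meta_values` is a hypothetical helper name.

```python
from collections import Counter

def count_meta_values(docs, meta_name):
    """Count how often each value of one meta-annotation occurs across docs."""
    counts = Counter()
    for doc in docs:
        for concept in doc["annotations"]:
            counts[concept["meta_anns"][meta_name]["value"]] += 1
    return counts
```

For example, `count_meta_values(docs, 'Temporality')` and `count_meta_values(docs, 'Experiencer')` give the label distributions for the two meta-annotations of interest.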

In [33]:
relevant_docs[0]

{'id': 106,
 'name': 'DL1139',
 'text': "De flexiecontractuur lijkt daarom meer een gevolg te zijn van spasme van de flexoren.\nPatient geeft eveneens aan psychische klachten te hebben en worstelt op het moment met vroegere trauma's met zijn ouders, waarvoor hij steun krijgt in het Boumanshuis.\nGezien de duidelijke spasmen van de flexoren, lijkt het mij geindiceerd om ten eerst conservatieve behandeling te starten m.b.v. fysiotherapie, spalken en analgesie.\n",
 'annotations': [{'id': 107,
   'user': 'emc_dcc',
   'cui': 15,
   'value': 'spasme',
   'start': 62,
   'end': 68,
   'validated': True,
   'correct': True,
   'deleted': False,
   'alternative': False,
   'killed': False,
   'meta_anns': {'Negation': {'name': 'Negation',
     'validated': True,
     'acc': 1.0,
     'value': 'not negated'},
    'Temporality': {'name': 'Temporality',
     'validated': True,
     'acc': 1.0,
     'value': 'recent'},
    'Experiencer': {'name': 'Experiencer',
     'validated': True,
     'acc':

In [16]:
OAI_ASYNC_CLIENT = AsyncOpenAI(api_key=os.getenv("OPENAI_KEY"), max_retries=2)
OAI_CLIENT = OpenAI(api_key=os.getenv("OPENAI_KEY"), max_retries=2)
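
The asynchronous client makes it possible to send several generation requests concurrently. The sketch below is one possible way to do that; the helper name `generate_many`, the `max_concurrent` limit, and the default model name are assumptions, and the client is passed in as an argument so the function does not depend on the globals above.

```python
import asyncio

async def generate_many(client, prompts, model="gpt-4", max_concurrent=4):
    """Send one chat request per prompt and return the replies in input order.

    client: an AsyncOpenAI-style client.
    prompts: list of message lists, one per request.
    """
    semaphore = asyncio.Semaphore(max_concurrent)  # cap the requests in flight

    async def one(messages):
        async with semaphore:
            response = await client.chat.completions.create(
                model=model, messages=messages
            )
            return response.choices[0].message.content

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(m) for m in prompts))
```

Inside a notebook, where an event loop is already running, this would be called as `await generate_many(OAI_ASYNC_CLIENT, prompts)`; in a plain script, wrap the call in `asyncio.run(...)`.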

In [15]:
stream = OAI_CLIENT.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for part in stream:
    print(part.choices[0].delta.content or "")


This
 is
 a
 test
.

