We have a dataset (DCC) in which certain labels are underrepresented. We want to generate 
new samples synthetically using GPT-4. We will use the following approach:
1. We take the existing samples for each document type and present these to GPT-4.
2. We ask it to generate new sentences like them, where the token labels are provided in the BIO format.

We care specifically about the following labels:
* Experiencer: Other
* Temporality: Hypothetical

The task of the GPT model is to generate new sentences that are similar to the input sentences but with variations of the medical concepts. 
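
As a minimal sketch of how labelled concepts can be marked before a sample is shown to the model, assuming each annotation carries `start` and `end` character offsets into the document text: the helper name `mark_concepts` and the example sentence are hypothetical. The bars are inserted from the last span to the first, so that an insertion never shifts the offsets of spans that still have to be processed.

```python
def mark_concepts(text, spans):
    """Wrap each (start, end) character span of `text` in vertical bars."""
    # Work from the last span to the first: inserting bars near the end of the
    # string does not move the offsets of the spans that come before it.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + '|' + text[start:end] + '|' + text[end:]
    return text


example = "Moeder heeft diabetes en vader heeft astma."
spans = [(13, 21), (37, 42)]  # 'diabetes' and 'astma'
print(mark_concepts(example, spans))
```

The marked sentence can then be sent to the model after a fixed prefix such as `VOORBEELDTEKST: `.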




# Definitions
The definitions are taken from the ConText/ConTextD papers:

## Negation

This property has two values, ‘Negated’ or ‘Not negated’. A clinical condition or term is labeled as ‘Negated’ if there is evidence in the text suggesting that the condition does not occur or exist, e.g., ‘There was no sign of sinus infection’, otherwise it is ‘Not negated’.

## Temporality

The temporality property places a condition along a time line. There are three possible values for this property: ‘Recent’, ‘Historical’, and ‘Hypothetical’. A condition is considered ‘Recent’ if it is maximally 2 weeks old. Conditions that developed more than 2 weeks ago are labeled as ‘Historical’. A condition is labeled as ‘Hypothetical’ if it is not ‘Recent’ or ‘Historical’, e.g., ‘patient should return if she develops fever’ [13].

**Adaptation**: *'Hypothetical' is specifically about (theoretical) concepts, concepts that are not (yet) realized, i.e. concepts that may materialize in the future. 'Historical' and 'Recent' can be used for realized concepts, in which we also include their negations. I.e. if a concept is explicitly denied historically or recently, we can label it as 'Historical' or 'Recent' respectively.*
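
The temporality rule can be written down as a small decision function. This is only an illustration of the definition, not part of the pipeline: it assumes the onset date of a realized condition is known, whereas in clinical text it has to be inferred by the annotator. The function name `temporality` and the lower-case label strings are assumptions.

```python
from datetime import date, timedelta


def temporality(note_date, onset_date=None):
    """Return 'hypothetical', 'recent' or 'historical' for a condition.

    `onset_date` is None for a condition that has not (yet) been realized.
    """
    if onset_date is None:
        return 'hypothetical'
    # 'Recent' covers conditions that are at most two weeks old.
    if note_date - onset_date <= timedelta(weeks=2):
        return 'recent'
    return 'historical'


note = date(2023, 12, 18)
print(temporality(note, date(2023, 12, 10)))
print(temporality(note, date(2023, 6, 1)))
print(temporality(note))
```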

## Experiencer

Clinical text may refer to subjects other than the actual patient. The experiencer property describes whether the patient experienced the condition or someone else. For simplicity, we have defined only two possible values for this property: ‘Patient’ or ‘Other’, where ‘Other’ refers to anyone but the actual patient, e.g., ‘Mother is recently diagnosed with cancer’.

In [None]:
%load_ext autoreload
%autoreload 2

import os, sys, re
import json, dotenv
import pprint

import openai
import asyncio
from openai import AsyncOpenAI, OpenAI
from tqdm import tqdm
from collections import defaultdict

import pandas as pd

import datetime

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from utils import preprocess_dcc_for_robbert

dotenv.load_dotenv()

In [None]:
run_temporality = False
run_experiencer = False
#run_historical = True
experiencer_file = '../data/synth_experiencer_gpt_4_1106_preview_20231214.parquet'
hypothetical_file = '../data/synth_temporality_gpt_4_1106_preview_20231218.parquet'

In [None]:
openai.api_key = os.getenv("OPENAI_KEY")

In [None]:
# read the annotated corpus; the context manager closes the file handle again
with open('../data/emc-dcc_ann.json') as f:
    DCC = json.load(f)

In [None]:
update_dcc = DCC.copy()
for c in update_dcc['projects'][0]['documents']:
    c['source'] = 'EMC_DCC_ORIGINAL'
with open('../data/emc-dcc_ann_ORIGNAL.json', 'w') as f:
    json.dump(update_dcc, f, indent=2)

In [None]:
docs = update_dcc['projects'][0]['documents']

In [None]:
class_counts = {'Negation': defaultdict(int),
                'Temporality': defaultdict(int),
                'Experiencer': defaultdict(int)}

for doc in docs:
    for ann in doc['annotations']:
        for _class, val in ann['meta_anns'].items():
            class_counts[_class][val['value']] += 1

In [None]:
pprint.pprint(class_counts, indent=2)

In [None]:
#len(relevant_docs_hypothetical), len(relevant_docs_experiencer)
#relevant_docs_hypothetical[0]['text'][0:110]
#[(d['start'],d['end'], d['id']) for d in relevant_docs_hypothetical[0]['annotations'] 
#        if d['meta_anns']['Temporality']['value']=='hypothetical']
#relevant_docs_hypothetical[0]['name']

In [None]:
corrections = [
    {'doc_id': 'DL1616', 'annotation_id': 1873, 'meta': 'Experiencer', 'value': 'patient'},
    {'doc_id': 'DL1139', 'annotation_id': 108, 'meta': 'Experiencer', 'value': 'patient'},
    {'doc_id': 'GP2799', 'annotation_id': 8210, 'meta': 'Experiencer', 'value': 'patient'},
    {'doc_id': 'SP1476', 'annotation_id': 15532, 'meta': 'Experiencer', 'value': 'patient'},
    {'doc_id': 'DL1567', 'annotation_id': 1694, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL1711', 'annotation_id': 2232, 'meta': 'Temporality', 'value': 'recent'},
    {'doc_id': 'DL1812', 'annotation_id': 2232, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL1814', 'annotation_id': 2703, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'GP1395', 'annotation_id': 4538, 'meta': 'Temporality', 'value': 'recent'},
    {'doc_id': 'DL2111', 'annotation_id': 3779, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL2100', 'annotation_id': 3716, 'meta': 'Temporality', 'value': 'historical'},  
    {'doc_id': 'DL2072', 'annotation_id': 3627, 'meta': 'Temporality', 'value': 'historical'},    
    {'doc_id': 'DL2067', 'annotation_id': 3606, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL1931', 'annotation_id': 3113, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL1812', 'annotation_id': 2689, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL1507', 'annotation_id': 1467, 'meta': 'Temporality', 'value': 'historical'},
    {'doc_id': 'DL1167', 'annotation_id': 212, 'meta': 'Temporality', 'value': 'historical'},
 ]

In [None]:
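# update the docs
# note: this is a shallow copy, so the document dicts are shared with `docs`
# and the corrections below are applied to those shared dicts in place
update_docs = docs.copy()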
for c in corrections:
    for d in update_docs:        
        if d['name']==c['doc_id']:
            d['source'] = 'EMC_DCC_ORIGINAL_ADJUSTED'
            for a in d['annotations']:
                if a['id']==c['annotation_id']:
                    a['meta_anns'][c['meta']]['value'] = c['value']
# put updated docs in DCC
DCC['projects'][0]['documents'] = update_docs
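
An alternative sketch of the correction step that leaves the input untouched, assuming document names are unique within the project. The function name `apply_corrections` and the toy data are hypothetical; the dictionary keys follow the structure used above.

```python
import copy


def apply_corrections(documents, corrections, source_tag='EMC_DCC_ORIGINAL_ADJUSTED'):
    """Return a corrected deep copy of `documents`; the input is not modified."""
    corrected = copy.deepcopy(documents)
    by_name = {d['name']: d for d in corrected}  # assumes unique document names
    for c in corrections:
        doc = by_name.get(c['doc_id'])
        if doc is None:
            continue  # unknown document: nothing to correct
        doc['source'] = source_tag
        for ann in doc['annotations']:
            if ann['id'] == c['annotation_id']:
                ann['meta_anns'][c['meta']]['value'] = c['value']
    return corrected


toy_docs = [{'name': 'DL1', 'source': 'EMC_DCC_ORIGINAL',
             'annotations': [{'id': 5, 'meta_anns': {'Temporality': {'value': 'recent'}}}]}]
toy_corrections = [{'doc_id': 'DL1', 'annotation_id': 5,
                    'meta': 'Temporality', 'value': 'historical'}]
fixed = apply_corrections(toy_docs, toy_corrections)
print(fixed[0]['annotations'][0]['meta_anns']['Temporality']['value'])
```

The lookup by name avoids the nested loop over all documents for every correction, and the deep copy keeps the original annotations available for comparison.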

In [None]:
# write DCC back to json 
with open('../data/emc-dcc_ann_ADJ.json', 'w') as f:
    json.dump(DCC, f, indent=2)

In [None]:
relevant_docs_hypothetical = []
for i, doc in enumerate(docs):
    for concept in doc['annotations']:
        if (concept['meta_anns']['Temporality']['value']=='hypothetical'):
            doc['index'] = i
            relevant_docs_hypothetical.append(doc)
            break
        
relevant_docs_experiencer = []
for i, doc in enumerate(docs):
    for concept in doc['annotations']:
        if (concept['meta_anns']['Experiencer']['value']=='other'):
            doc['index'] = i
            relevant_docs_experiencer.append(doc)
            break

In [None]:
OAI_ASYNC_CLIENT = AsyncOpenAI(api_key=os.getenv("OPENAI_KEY"), max_retries=2)
OAI_CLIENT = OpenAI(api_key=os.getenv("OPENAI_KEY"), max_retries=2)

In [229]:
SYSTEM_PROMPT_HYPOTHETICAL = """
    Je bent een kritische assistent die mij helpt om nieuwe tekst te bedenken.
    De tekst moet voldoen aan de volgende eisen:
    - de tekst moet semantisch correct zijn en vergelijkbaar zijn met de voorbeeldtekst die ik je geef.
    - de voorbeeldtekst wordt voorafgegaan door de term VOORBEELDTEKST
    - in de voorbeeldzin worden 1 of meer concepten benoemd die hypothetisch zijn, het is belangrijk
    dat deze concepten in de nieuwe zin ook hypothetisch zijn, het mogen ook andere concepten zijn. 
    Een voorbeeld van een hypothetische concept = 'een voorafgaand trauma kan niet worden herinnerd', waarin 'trauma' het concept is.
    Een ander voorbeeld = 'ter uitsluiting van epifysaire dysplasie' waarin 'epifysaire dysplasie' het concept is.
    - de concepten die je moet vervangen zijn aangegeven met verticale streepjes, dus |concept|.
    - het domein is medisch dus gebruik medische concepten.
    - probeer de medische concepten te varieren, dus gebruik niet steeds dezelfde concepten.
    - geef als antwoord ALLEEN de nieuw gegenereerde zinnen, voorafgaand met de term NIEUWE_TEKST
    - in de NIEUWE_TEKST, plaats de concepten die hypothetisch zijn tussen verticale streepjes, dus '|', 
    dus bijvoorbeeld: 'ter uitsluiting van |epifysaire dysplasie|'
    
    In case you have doubts, I explain it in English:
    'Hypothetical' is specifically about (theoretical) concepts, which means concepts that are not (yet) realized OR    
    concepts that may have occurred in the past. 'Historical' and 'Recent' can be used for realized concepts, in which we also include their negations. 
    I.e. if a concept is explicitly denied historically or recently, we can label it as 'Historical' or 'Recent' respectively.
"""

SYSTEM_PROMPT_HYPOTHETICAL_CHECK = """
    Je bent een kritische assistent die mij helpt om nieuwe text te beoordelen.
    
    Je krijgt een tekst. Deze tekst bevatten 1 of meerdere concepten die zijn omsloten met verticale streepjes, dus |concept|.
    
    Het is jouw taak om te beoordelen of de concepten in de tekst verwijzen naar een hypothetische situatie.
    
    LET OP: het kan per concept verschillen of het verwijst naar een hypothetische situatie, 
    een situatie in het verleden, of een situatie in het heden.
        
    In case you have doubts, I explain it in English:
    'Hypothetical' is specifically about (theoretical) concepts, which means concepts that are not (yet) realized OR
    concepts that may have occurred in the past. 'Historical' and 'Recent' can be used for realized concepts, in which we also include their negations. 
    I.e. if a concept is explicitly denied historically or recently, we can label it as 'Historical' or 'Recent' respectively.
    
    De output die je geeft is beperkt tot 'ja' of 'nee' per concept, en wordt gegeven in de vorm van een dictionary:
    {0: 'ja', 1: 'nee', ...} 
    
    Hierin is 0, 1, ... de index van de concepten in de tekst.
    Wat betreft de index, begin altijd met 0, en tel op voor elk concept.
"""

SYSTEM_PROMPT_EXPERIENCER = """
    Je bent een kritische assistent die mij helpt om nieuwe text te bedenken.
    Deze text moeten voldoen aan de volgende eisen:
    - het moet semantisch correct zijn en vergelijkbaar zijn met de text die ik je geef.
    - de voorbeeldtext wordt voorafgegaan door de term VOORBEELDTEKST
    - in de voorbeeldtext worden 1 of meer concepten benoemd die verwijzen naar een persoon anders dan de patient, het is belangrijk
    dat deze concepten in de nieuwe zin ook verwijzen naar iemand anders dan de patient (zoals een familielid), 
    het mogen ook andere medische concepten zijn.
    - de concepten die je moet vervangen zijn aangegeven met verticale streepjes, dus |concept|.
    - Een voorbeeld van een concept wat verwijst naar een ander persoon dan de patient =
    'Een zusje van #Name# is elders operatief behandeld in verband met recidiverende patella luxaties', waarin 'luxaties' het concept is, en er 
    wordt verwezen naar de zus van de patient.    
    - het domein is medisch dus gebruik medische concepten.
    - probeer de medische concepten te varieren, dus gebruik niet steeds dezelfde concepten.
    - varieer de ziektebeelden
    - varieer de opmaak van de text, dus gebruik niet steeds dezelfde opmaak.
    - geef als antwoord ALLEEN de nieuw gegenereerde text, voorafgaand met de term NIEUWE_TEKST
    - in de NIEUWE_TEKST, plaats alleen de concepten die verwijzen naar een ander persoon dan de patient tussen verticale streepjes |, 
    dus bijvoorbeeld: 'Een zusje van #Name# is elders operatief behandeld in verband met recidiverende patella |luxaties|'
"""

SYSTEM_PROMPT_EXPERIENCER_CHECK = """
    Je bent een kritische assistent die mij helpt om nieuwe text te beoordelen.
    
    Je krijgt een tekst. Deze tekst bevatten 1 of meerdere concepten die zijn omsloten met verticale streepjes, dus |concept|.
    
    Het is jouw taak om te beoordelen of de concepten in de tekst verwijzen naar een persoon anders dan de patient (zoals een familielid, of een behandelend arts).
    LET OP: het gaat in de tekst om de verwijzing naar een persoon ANDERS dan de patient.
    LET OP: de tekst als geheel heeft betrekking op de patient.
    LET OP: het kan per concept verschillen of het verwijst naar een persoon anders dan de patient.
        
    De output die je geeft is beperkt tot 'ja' of 'nee' per concept, en wordt gegeven in de vorm van een dictionary:
    {0: 'ja', 1: 'nee', ...} 
    
    Hierin is 0, 1, ... de index van de concepten in de tekst.
    Wat betreft de index, begin altijd met 0, en tel op voor elk concept.

"""
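
The check prompts ask for a reply in the form of a Python dictionary literal. A minimal sketch of parsing such a reply without executing it, using `ast.literal_eval` from the standard library: the helper name `parse_checker_output` is hypothetical, and returning `None` for anything unexpected is a design choice of this sketch.

```python
import ast


def parse_checker_output(raw):
    """Parse a reply such as "{0: 'ja', 1: 'nee'}" into a dict, else return None."""
    try:
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return None  # the reply is not a Python literal
    if not isinstance(parsed, dict):
        return None
    # Only integer concept indices with a 'ja' or 'nee' verdict are accepted.
    if not all(isinstance(k, int) and v in ('ja', 'nee') for k, v in parsed.items()):
        return None
    return parsed


print(parse_checker_output("{0: 'ja', 1: 'nee'}"))
print(parse_checker_output("Sorry, ik weet het niet"))
```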

In [None]:
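def get_chat_res(USER_TEXT='Good day', 
                 SYSTEM_PROMPT=SYSTEM_PROMPT_HYPOTHETICAL, 
                 n=10,
                 MODEL="gpt-4"):
    """Request `n` chat completions for one system prompt and one user message."""
    return OAI_CLIENT.chat.completions.create(
            model=MODEL,
            n=n,
            # temperature 0 makes sampling (near-)greedy, so the n choices
            # tend to be very similar to each other
            temperature=0.,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": USER_TEXT},
            ],
            stream=False,
        )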

In [223]:
GPT_VERSION = 'gpt-4-1106-preview'
CURRENT_DATE = datetime.datetime.now().strftime("%Y%m%d")

In [None]:
#re_extract = re.compile(r'NIEUWE_ZIN\:(.*)')
if run_temporality:
    nieuwe_zinnen_hypothetisch = []
    for i, doc in tqdm(enumerate(relevant_docs_hypothetical)):
        EXAMPLE = doc['text'].replace('|', ' ')
        # add | vertical bars around the concept that needs to be replaced;
        # insert from the last span to the first so that earlier offsets stay valid
        LOCS = [(d['start'],d['end']) for d in doc['annotations'] 
                    if d['meta_anns']['Temporality']['value']=='hypothetical']
        for loc in sorted(LOCS, reverse=True):
            EXAMPLE = EXAMPLE[:loc[0]] + '|' + EXAMPLE[loc[0]:loc[1]] + '|' + EXAMPLE[loc[1]:]
        
        res = get_chat_res(SYSTEM_PROMPT=SYSTEM_PROMPT_HYPOTHETICAL, 
                        n=10,
                        MODEL=GPT_VERSION, # gpt-3.5-turbo-instruct-0914
                        USER_TEXT="VOORBEELDTEKST: " + EXAMPLE)

        for j, _res in enumerate(res.choices):
            txt = _res.message.content
            nieuwe_zinnen_hypothetisch.append((
                doc['name'],
                'hypothetical',
                j,
                txt[txt.find('NIEUWE_TEKST')+12:].strip()))    
    
    hypothetical_df = pd.DataFrame(nieuwe_zinnen_hypothetisch, columns=['doc_id', 'class_value', 'synth_num', 'text'])
    hypothetical_df.to_parquet(f'../data/synth_temporality_{GPT_VERSION.replace("-", "_")}_{CURRENT_DATE}.parquet')
else:
    hypothetical_df = pd.read_parquet(hypothetical_file)
    hypothetical_df['class_value'] = 'hypothetical'
    hypothetical_df = hypothetical_df.assign(text=hypothetical_df.text.str.lstrip(to_strip=':'))
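
The generation cells cut the reply at a fixed offset after `NIEUWE_TEKST`. When the marker is missing, `str.find` returns -1 and the slice silently starts at character 11 of the reply. A small sketch of a more defensive extraction, with the hypothetical helper name `extract_new_text`:

```python
MARKER = 'NIEUWE_TEKST'


def extract_new_text(reply, marker=MARKER):
    """Return the text after `marker`, or None when the marker is missing."""
    _, found, tail = reply.partition(marker)
    if not found:
        return None
    # Drop a colon and whitespace directly after the marker.
    return tail.lstrip(': \n\t').strip()


print(extract_new_text("NIEUWE_TEKST: ter uitsluiting van |artrose|"))
print(extract_new_text("Ik kan hier geen tekst voor maken."))
```

Replies for which the helper returns `None` can be dropped or regenerated instead of ending up in the dataset as truncated text.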

In [None]:
#re_extract = re.compile(r'NIEUWE_ZIN\:(.*)')
if run_experiencer:
    nieuwe_zinnen_experiencer = []
    for i, doc in tqdm(enumerate(relevant_docs_experiencer)):
        EXAMPLE = doc['text'].replace('|', ' ')
        # add | vertical bars around the concept that needs to be replaced;
        # insert from the last span to the first so that earlier offsets stay valid
        LOCS = [(d['start'],d['end']) for d in doc['annotations'] 
                    if d['meta_anns']['Experiencer']['value']=='other']
        for loc in sorted(LOCS, reverse=True):
            EXAMPLE = EXAMPLE[:loc[0]] + '|' + EXAMPLE[loc[0]:loc[1]] + '|' + EXAMPLE[loc[1]:]
        
        res = get_chat_res(SYSTEM_PROMPT=SYSTEM_PROMPT_EXPERIENCER, 
                        n=10,
                        MODEL=GPT_VERSION, # gpt-3.5-turbo-instruct-0914
                        USER_TEXT="VOORBEELDTEKST: " + EXAMPLE)

        for j, _res in enumerate(res.choices):
            txt = _res.message.content
            nieuwe_zinnen_experiencer.append((
                doc['name'],
                'other',
                j,
                txt[txt.find('NIEUWE_TEKST')+12:].strip()))
    experiencer_df = pd.DataFrame(nieuwe_zinnen_experiencer, columns=['doc_id', 'class_value', 'synth_num', 'text'])
    experiencer_df.to_parquet(f'../data/synth_experiencer_{GPT_VERSION.replace("-", "_")}_{CURRENT_DATE}.parquet')
else:
    experiencer_df = pd.read_parquet(experiencer_file)
    experiencer_df['class_value'] = 'other'


In [None]:
# run the checks
import ast

def check_by_gpt(TXT: str='', 
                 N_checks: int=7,
                 SYSTEM_PROMPT: str=SYSTEM_PROMPT_EXPERIENCER_CHECK
                 ):
    """Majority-vote check of the |concept| markers in TXT.

    Returns False if every concept is rejected, an 'ERROR-...' string if the
    checker replies are inconsistent, and otherwise TXT with the bars removed
    around the rejected concepts.
    """
    maj_vote = N_checks//2

    RES = get_chat_res(SYSTEM_PROMPT=SYSTEM_PROMPT,
                        USER_TEXT="VOORBEELDTEKST: " + TXT,
                        n=N_checks,
                        MODEL=GPT_VERSION)

    no_count = defaultdict(int)
    list_lens = []
    for _res in RES.choices:
        # parse the dictionary literal without executing arbitrary code
        _d = ast.literal_eval(_res.message.content.strip())
        for k, v in _d.items():
            if v=='nee':
                no_count[k] += 1
        list_lens.append(len(_d))

    # all checker replies must report the same number of concepts
    if len(set(list_lens))>1:
        return 'ERROR-checker concept count mismatch'

    # concepts that a majority of the checker replies rejected, in text order
    rejected = sorted(k for k, v in no_count.items() if v>maj_vote)
    if len(rejected)==0:
        return TXT
    if len(rejected)==list_lens[0]:
        # every concept is rejected: flag the whole text for removal
        return False

    spans = [r.span() for r in re.finditer(r'(\|.*?\|)', TXT)]
    if len(spans)<=max(_d.keys()):
        return 'ERROR-checker concept count mismatch'
    print(spans, no_count)
    # strip the bars of the rejected concepts from left to right; every
    # removal shifts the remaining text two characters to the left
    for rcount, k in enumerate(rejected):
        r = spans[k]
        TXT = TXT[:r[0]-rcount*2]+\
            TXT[r[0]+1-rcount*2:r[1]-1-rcount*2]+\
            TXT[r[1]-rcount*2:]
    return TXT
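
The voting rule can be isolated in a pure function, which makes it easy to test without calling the API. This is a sketch under the assumption that each checker reply has already been parsed into a dictionary of `{concept_index: 'ja' | 'nee'}`. The name `rejected_concepts` is hypothetical.

```python
from collections import Counter


def rejected_concepts(votes, n_checks):
    """Indices of concepts that more than half of the checker runs rejected."""
    no_count = Counter()
    for vote in votes:
        for idx, verdict in vote.items():
            if verdict == 'nee':
                no_count[idx] += 1
    threshold = n_checks // 2
    return sorted(idx for idx, n in no_count.items() if n > threshold)


# Seven runs: concept 0 gets three 'nee' votes, concept 1 gets seven.
votes = [{0: 'ja', 1: 'nee'}] * 4 + [{0: 'nee', 1: 'nee'}] * 3
print(rejected_concepts(votes, n_checks=7))
```

With seven runs the threshold is three, so a concept is only rejected when at least four runs answer `nee`.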
    

In [None]:
from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

In [None]:
texts_checked = [None]*experiencer_df.shape[0]
texts = experiencer_df['text'].values
for k, text in tqdm(enumerate(texts), total=len(texts)):
    if texts_checked[k] is None:
        texts_checked[k]=check_by_gpt(text, 
                                    N_checks=7, 
                                    SYSTEM_PROMPT=SYSTEM_PROMPT_EXPERIENCER_CHECK)
experiencer_df['checked_text'] = texts_checked

c1 = experiencer_df['checked_text']!=False 
c2 = experiencer_df['checked_text']!='ERROR-checker concept count mismatch'

experiencer_df = experiencer_df[c1 & c2]

# check token length
experiencer_df['token_len'] = experiencer_df['checked_text'].astype(str)\
                            .progress_apply(lambda x: len(x.split()))

# check number of r'\|.*?\|' in the text
experiencer_df['n_concepts'] = experiencer_df['checked_text'].astype(str)\
                            .progress_apply(lambda x: len(re.findall(r'\|.*?\|', x)))
                            
c = experiencer_df['n_concepts']>0
experiencer_df = experiencer_df[c]

experiencer_df = experiencer_df.drop(['text'], axis=1)
experiencer_df = experiencer_df.rename(columns={'checked_text': 'text'})
experiencer_df.to_parquet(f'../data/synth_experiencer_{GPT_VERSION.replace("-", "_")}_{CURRENT_DATE}.parquet')

In [230]:
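# keep the results of an interrupted run (e.g. after a rate-limit error),
# otherwise start from an empty list with one slot per text
if 'texts_checked_hyp' not in globals() or len(texts_checked_hyp)!=hypothetical_df.shape[0]:
    texts_checked_hyp = [None]*hypothetical_df.shape[0]
texts = hypothetical_df['text'].values
for k, text in tqdm(enumerate(texts), total=len(texts)):
    if texts_checked_hyp[k] is None:
        texts_checked_hyp[k]=check_by_gpt(text, 
                                    N_checks=7, 
                                    SYSTEM_PROMPT=SYSTEM_PROMPT_HYPOTHETICAL_CHECK)
hypothetical_df['checked_text'] = texts_checked_hyp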

c1 = hypothetical_df['checked_text']!=False 
c2 = hypothetical_df['checked_text']!='ERROR-checker concept count mismatch'

hypothetical_df = hypothetical_df[c1 & c2]

# check token length
hypothetical_df['token_len'] = hypothetical_df['checked_text'].astype(str)\
                            .progress_apply(lambda x: len(x.split()))

# check number of r'\|.*?\|' in the text
hypothetical_df['n_concepts'] = hypothetical_df['checked_text'].astype(str)\
                            .progress_apply(lambda x: len(re.findall(r'\|.*?\|', x)))
                            
c = hypothetical_df['n_concepts']>0
hypothetical_df = hypothetical_df[c]

hypothetical_df = hypothetical_df.drop(['text'], axis=1)
hypothetical_df = hypothetical_df.rename(columns={'checked_text': 'text'})
hypothetical_df.to_parquet(f'../data/synth_hypothetical_{GPT_VERSION.replace("-", "_")}_{CURRENT_DATE}.parquet')

  8%|▊         | 54/690 [00:03<00:43, 14.75it/s]

[(132, 145), (615, 628), (667, 681), (762, 776), (996, 1007), (1161, 1175), (1381, 1395)] defaultdict(<class 'int'>, {0: 7, 1: 7, 4: 7})


  8%|▊         | 56/690 [00:11<02:44,  3.85it/s]

[(146, 163), (233, 256), (635, 658), (688, 711), (802, 820), (1049, 1068), (1148, 1157), (1215, 1234), (1343, 1362), (1429, 1438)] defaultdict(<class 'int'>, {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7, 7: 7, 8: 3})

  9%|▊         | 60/690 [00:25<10:39,  1.02s/it]

[(241, 268), (735, 749), (806, 827), (1208, 1217), (1225, 1237), (1371, 1395), (1433, 1450), (1597, 1606), (1664, 1688)] defaultdict(<class 'int'>, {0: 7, 3: 7, 4: 7, 6: 7, 7: 7})


 11%|█         | 77/690 [00:57<16:48,  1.65s/it]

[(210, 237), (464, 481)] defaultdict(<class 'int'>, {0: 7})


 21%|██        | 146/690 [02:41<13:05,  1.44s/it]

[(12, 26), (433, 455)] defaultdict(<class 'int'>, {0: 7})


 22%|██▏       | 150/690 [03:40<13:12,  1.47s/it]  


RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4-turbo-preview in organization org-tLLesEF9T1Z2d0ddTZMQq3Yc on tokens per day (TPD): Limit 500000, Used 499969, Requested 506. Please try again in 1m22.08s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
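
The run above stopped on a rate-limit error. One common way to make a long loop more resilient is to retry the failing call with exponential backoff. The wrapper below is a generic sketch, not something the notebook uses: the name `call_with_backoff` and all of its parameters are hypothetical. In this notebook one would pass `retry_on=(openai.RateLimitError,)`. Note that a daily token quota, as in the message above, may require a longer wait than a few retries provide.

```python
import random
import time


def call_with_backoff(fn, *args, retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on a `retry_on` error wait and try again."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            # wait base_delay, 2*base_delay, 4*base_delay, ... plus some jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


# Demonstration with a function that fails twice before it succeeds.
attempts = {'n': 0}

def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise RuntimeError('rate limit')
    return 'ok'

print(call_with_backoff(flaky, sleep=lambda seconds: None))
```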

### Extract the spans from the synthetic set and add them to the original dataset with an additional label

In [212]:
def put_in_dict(data, original_documents, Class_name='Experiencer', Class_value=None):
    new_documents = original_documents.copy()
    
    if isinstance(data, pd.DataFrame):
        data = [(r.doc_id, r.class_value, r.synth_num, r.text)  for r in data.itertuples()]
    
    for i, (name, Class_value_spec, subid, text) in enumerate(data):
        Class_value_spec = Class_value_spec if Class_value_spec is not None else Class_value
        
        clean_text = text.replace('|', '')
        _doc = {
            'id': i,
            'source': 'synthetic',
            'source_version': f"{GPT_VERSION}|{CURRENT_DATE}",     
            'name': f"{name}|synth|{Class_name}|{subid}",
            'text': clean_text,
            'annotations' : []
        }
        
        try:
            # a concept is any run of characters between two vertical bars
            for concept_count, match in enumerate(re.finditer(r'\|[^|]+\|', text)):
                start, end = match.span()
                # every earlier concept had two bars that are absent from clean_text
                start_clean = start - concept_count*2
                end_clean = end - 2 - concept_count*2
                _doc['annotations'].append(
                            {
                                'id': 1,
                                'user': 'emc_dcc_synth',
                                'cui': 1,
                                'start': start_clean,
                                'end': end_clean,
                                'value': text[start+1:end-1],
                                'validated': False,
                                'correct': True,
                                'alternative': False,
                                'killed': False,
                                'meta_anns': {
                                    Class_name: {'value': Class_value_spec,
                                                    'name': Class_name,
                                                    'validated': False,
                                                    'acc': 1.0
                                                    },
                                }            
                            }
                        )         
        except Exception:
            # report the offending sample and continue with the next one
            print(i, text)
        new_documents.append(_doc)
    return new_documents
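
The offset arithmetic can be checked in isolation. The sketch below converts bar-marked text into clean text plus character spans. The name `strip_bars` is hypothetical, and the sketch assumes that bars always come in pairs and accepts any characters except a bar inside a concept.

```python
import re


def strip_bars(marked):
    """Return (clean_text, spans) for text whose concepts are wrapped in |bars|."""
    spans = []
    for n, match in enumerate(re.finditer(r'\|([^|]+)\|', marked)):
        # Each earlier concept contributed two bars that are gone in the clean text.
        start = match.start() - 2 * n
        spans.append((start, start + len(match.group(1))))
    return marked.replace('|', ''), spans


clean, found = strip_bars("Moeder heeft |diabetes| en vader heeft |astma|.")
print(clean)
print([clean[s:e] for s, e in found])
```

Slicing the clean text with the returned spans should give back exactly the concept strings, which is a cheap sanity check to run over the synthetic set before it is merged into the dataset.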

In [None]:
#experiencer_df = pd.read_parquet('../data/synth_experiencer_gpt_4_1106_preview_20231214.parquet')
#hypothetical_df = pd.read_parquet('../data/synth_temporality_gpt_4_1106_preview_20231218.parquet')

In [None]:
#experiencer_df = experiencer_df[['doc_id', 'class_value', 'synth_num', 'text']]
#hypothetical_df = hypothetical_df[['doc_id', 'class_value', 'synth_num', 'text']]

In [None]:
new_docs_experiencer = put_in_dict(experiencer_df, update_docs, 
                                   Class_name='Experiencer', 
                                   Class_value='other')
new_docs_temporality = put_in_dict(hypothetical_df, new_docs_experiencer,
                                   Class_name='Temporality', 
                                   Class_value='hypothetical')

In [None]:
len(new_docs_temporality), len(new_docs_experiencer)

## Write new samples to the dataset

In [None]:
DCC['projects'][0]['documents'] = new_docs_temporality

In [None]:
# write DCC back to json 
with open('../data/emc-dcc_ann_Augmented.json', 'w', encoding='utf-8') as f:
    json.dump(DCC, f, indent=2)

## Write to DCC_df

In [None]:
texts, labels, ids = preprocess_dcc_for_robbert.get_tuples_from_medcat_json('../data/emc-dcc_ann_Augmented.json')

In [None]:
dataset, errors = preprocess_dcc_for_robbert.get_dataset(texts, labels, ids)

# index error:
# only one label, label for word at the end

# Mismatch:
# \# preceding, part of compound

len(errors)

In [None]:
dataset[100]

In [None]:
df, ids = preprocess_dcc_for_robbert.get_dataframe(dataset)
print(f"\tProcessed {len(set(ids))} files")
df.to_csv("../data/DCC_df.csv", index=False, sep="\t")

In [None]:
df.Experiencer.value_counts()

In [None]:
df.Temporality.value_counts()

In [None]:
df.Negation.value_counts()

## Translate English corpora

* English -- BioScope; [HF](https://huggingface.co/datasets/bigbio/bioscope), [src](https://rgai.inf.u-szeged.hu/downloads)
* English -- [Genia](http://www.geniaproject.org/genia-corpus/term-corpus)
* English -- Sherlock, SFU review corpus
