We have a dataset (DCC) with a shortage of certain labels. We want to generate 
new samples synthetically using GPT-4. We will use the following approach:
1. We take the existing samples for each document type and present them to GPT-4.
2. We ask it to generate new sentences like them, with the token labels provided in the BIO format.

We care specifically about the following labels:
* Experiencer: Other
* Temporality: Historical, Hypothetical

The task of the GPT model is to generate new sentences that are similar to the input sentences but with variations of the medical concepts. 
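
To make the BIO format concrete, here is a minimal sketch that turns character spans into per-token B/I/O tags. It assumes whitespace tokenization, and `spans_to_bio` is a hypothetical helper that is not part of the `utils` package; the example sentence is taken from the Negation definition.

```python
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int]]) -> List[Tuple[str, str]]:
    """Assign a B/I/O tag to every whitespace token, given concept character spans."""
    tagged = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for span_start, span_end in spans:
            if start < span_end and end > span_start:  # token overlaps the span
                # the first overlapping token opens the concept, the rest continue it
                tag = "B" if start <= span_start else "I"
                break
        tagged.append((token, tag))
    return tagged


sentence = "There was no sign of sinus infection"
concept_start = sentence.index("sinus infection")
concept = (concept_start, concept_start + len("sinus infection"))
print(spans_to_bio(sentence, [concept]))
# [('There', 'O'), ('was', 'O'), ('no', 'O'), ('sign', 'O'), ('of', 'O'), ('sinus', 'B'), ('infection', 'I')]
```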




# Definitions
The definitions are taken from the ConText/ConTextD papers:

## Negation

This property has two values, ‘Negated’ or ‘Not negated’. A clinical condition or term is labeled as ‘Negated’ if there is evidence in the text suggesting that the condition does not occur or exist, e.g., ‘There was no sign of sinus infection’, otherwise it is ‘Not negated’.

## Temporality

The temporality property places a condition along a time line. There are three possible values for this property: ‘Recent’, ‘Historical’, and ‘Hypothetical’. A condition is considered ‘Recent’ if it is maximally 2 weeks old. Conditions that developed more than 2 weeks ago are labeled as ‘Historical’. A condition is labeled as ‘Hypothetical’ if it is not ‘Recent’ or ‘Historical’, e.g., ‘patient should return if she develops fever’ [13].

**Adaptation**: *'Hypothetical' is specifically about (theoretical) concepts, concepts that are not (yet) realized, i.e. concepts that may materialize in the future. 'Historical' and 'Recent' can be used for realized concepts, in which we also include their negations. I.e. if a concept is explicitly denied historically or recently, we can label it as 'Historical' or 'Recent' respectively.*
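
The decision rule above can be sketched as a small function. This is only an illustration under assumptions: the age of a concept is assumed to be known in days, `None` is used for concepts that are not (yet) realized, and `temporality_label` is a hypothetical name.

```python
from typing import Optional

RECENT_WINDOW_DAYS = 14  # "maximally 2 weeks old"


def temporality_label(days_since_onset: Optional[int]) -> str:
    """Map a concept to 'recent', 'historical' or 'hypothetical'.

    days_since_onset is None for concepts that are not (yet) realized.
    Explicitly denied concepts are treated like realized ones (see the adaptation).
    """
    if days_since_onset is None:
        return "hypothetical"
    if days_since_onset <= RECENT_WINDOW_DAYS:
        return "recent"
    return "historical"


print(temporality_label(3), temporality_label(30), temporality_label(None))
# recent historical hypothetical
```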

## Experiencer

Clinical text may refer to subjects other than the actual patient. The experiencer property describes whether the patient experienced the condition or someone else. For simplicity, we have defined only two possible values for this property: ‘Patient’ or ‘Other’, where ‘Other’ refers to anyone but the actual patient, e.g., ‘Mother is recently diagnosed with cancer’.

ASL#: Experiencer, check labels **->** Temporality(Hypothetical), check labels **->** Temporality(Historical), check labels



In [None]:
%load_ext autoreload
%autoreload 2

import os, sys, re
import json, dotenv
import pprint

import openai
import asyncio
from openai import AsyncOpenAI, OpenAI
from tqdm import tqdm
from collections import defaultdict

import pandas as pd

import datetime

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from utils import preprocess_dcc_for_robbert
from utils import active_synthesis
from utils import synthesis_prompts

dotenv.load_dotenv()

import matplotlib.pyplot as plt
import numpy as np

In [None]:
def get_class_counts(docs: list) -> dict:
    """Count the values of every meta-annotation class over all annotations in docs."""
    class_counts = {
                    'Negation':    defaultdict(int),
                    'Temporality': defaultdict(int),
                    'Experiencer': defaultdict(int)
                   }

    for doc in docs:
        for ann in doc['annotations']:
            for _class, val in ann['meta_anns'].items():
                class_counts[_class][val['value']] += 1
    return class_counts

In [None]:
run_hypothetical = False
run_experiencer = False
run_historical = True
run_patient = False
run_negation = True

experiencer_file = '../data/synth_experiencer_gpt_4_1106_preview_20240207.parquet'
hypothetical_file = '../data/synth_temporality_gpt_4_1106_preview_20240209_checked.parquet'
historical_file = '../data/synth_historical_gpt_4_1106_preview_20240213_checked.parquet'
patient_file = '../data/synth_patient_gpt_4_1106_preview_20240214_checked.parquet'

ASL=1

DCC_file = '../data/emc-dcc_ann_ADJ.json'

In [None]:
openai.api_key = os.getenv("OPENAI_KEY")

In [None]:
# use a context manager so the file handle is closed after loading
with open(DCC_file, encoding='utf-8') as f:
    DCC = json.load(f)

In [None]:
#  First load 
#update_dcc = DCC.copy()
#for c in update_dcc['projects'][0]['documents']:
#    c['source'] = 'EMC_DCC_ORIGINAL'
#with open('../data/emc-dcc_ann_ORIGNAL.json', 'w') as f:
#    json.dump(update_dcc, f, indent=2)
#docs = update_dcc['projects'][0]['documents']
docs = DCC['projects'][0]['documents']

In [None]:
pprint.pprint(get_class_counts(docs))

In [None]:
# TODO: Check all to_remove_ jsons
with open('../artifacts/to_remove_temporality_base_medroberta_5_32_64__centeredVal_temporality_ASL_1.json', encoding='utf-8') as f:
    remove_dict = json.load(f)

In [None]:
docs = active_synthesis.remove_flagged_annotations(docs, remove_dict)

In [None]:
#Correction_of_original = active_synthesis.Annotation_correction_original
#Correction_of_synthetic = active_synthesis.Annotation_correction_synthetic

# update the docs
'''
update_docs = docs.copy()
for c in Correction_of_original:
    for d in update_docs:        
        if d['name']==c['doc_id']:
            d['source'] = 'EMC_DCC_ORIGINAL_ADJUSTED'
            for a in d['annotations']:
                if a['id']==c['annotation_id']:
                    a['meta_anns'][c['meta']]['value'] = c['value']
# put updated docs in DCC

update_docs_ = update_docs.copy()
for c in Correction_of_synthetic:
    for d in update_docs_:        
        if c['doc_id'] in d['name']:
            d['source'] = 'synthetic_ADJUSTED'
            for a in d['annotations']:
                if a['start']==c['start']:
                    print(a['start'], c['start'])
                    a['meta_anns'][c['meta']]['value'] = c['value']
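'''
# The correction code above is disabled (it sits inside a string), so continue
# with the documents as loaded; later cells refer to update_docs and update_docs_
update_docs = docs
update_docs_ = docs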

In [None]:
# change the source to "synthetic" whenever the "name" contains "synth"
#for doc in docs:
#    if 'synth' in doc['name']:
#        doc['source'] = 'synthetic'

In [None]:
source_counts = defaultdict(int)
for d in docs:
    source_counts[d['source']] += 1

pprint.pprint(source_counts)

In [None]:
DCC['projects'][0]['documents'] = update_docs_

In [None]:
pprint.pprint(get_class_counts(update_docs_))

In [None]:
# write DCC back to json 
with open('../data/emc-dcc_ann_ADJ.json', 'w') as f:
    json.dump(DCC, f, indent=2)

In [None]:
# minority classes
def select_docs(documents: list, meta_name: str, value: str) -> list:
    """Return the documents with at least one annotation whose `meta_name` has `value`."""
    selected = []
    for i, doc in enumerate(documents):
        for concept in doc['annotations']:
            # annotations without this meta-annotation are skipped
            meta_ann = (concept.get('meta_anns') or {}).get(meta_name) or {}
            if meta_ann.get('value') == value:
                doc['index'] = i  # position in the full document list
                selected.append(doc)
                break
    return selected

relevant_docs_hypothetical = select_docs(update_docs_, 'Temporality', 'hypothetical')
relevant_docs_other = select_docs(update_docs_, 'Experiencer', 'other')
relevant_docs_historical = select_docs(update_docs_, 'Temporality', 'historical')

relevant_docs_patient_names = [d.split("_")[0] for d 
                               in active_synthesis.Experiencer_patient_ASL1]

relevant_docs_patient = [doc for doc in update_docs_
                         if doc['name'] in relevant_docs_patient_names]


In [None]:
print(f"{len(relevant_docs_patient)} patient docs for upsampling")
print(f"{len(relevant_docs_historical)} historical docs for upsampling")
print(f"{len(relevant_docs_other)} other docs for upsampling")
print(f"{len(relevant_docs_hypothetical)} hypothetical docs for upsampling")


In [None]:
OAI_ASYNC_CLIENT = AsyncOpenAI(api_key=os.getenv("OPENAI_KEY"), max_retries=2)
OAI_CLIENT = OpenAI(api_key=os.getenv("OPENAI_KEY"), max_retries=2)

In [None]:
SYSTEM_PROMPT_HYPOTHETICAL = synthesis_prompts.SYSTEM_PROMPT_HYPOTHETICAL
SYSTEM_PROMPT_HYPOTHETICAL_CHECK = synthesis_prompts.SYSTEM_PROMPT_HYPOTHETICAL_CHECK

SYSTEM_PROMPT_EXPERIENCER = synthesis_prompts.SYSTEM_PROMPT_EXPERIENCER
SYSTEM_PROMPT_EXPERIENCER_CHECK = synthesis_prompts.SYSTEM_PROMPT_EXPERIENCER_CHECK

SYSTEM_PROMPT_HISTORICAL = synthesis_prompts.SYSTEM_PROMPT_HISTORICAL
SYSTEM_PROMPT_HISTORICAL_CHECK = synthesis_prompts.SYSTEM_PROMPT_HISTORICAL_CHECK

SYSTEM_PROMPT_PATIENT = synthesis_prompts.SYSTEM_PROMPT_PATIENT
SYSTEM_PROMPT_PATIENT_CHECK = synthesis_prompts.SYSTEM_PROMPT_PATIENT_CHECK

In [None]:
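def get_chat_res(USER_TEXT='Good day', 
                 SYSTEM_PROMPT="Please be kind in 20 years", 
                 n = 5,
                 MODEL="gpt-4"):
    """Request n chat completions, including token log-probabilities, for one prompt pair."""
    return OAI_CLIENT.chat.completions.create(
            model=MODEL,
            n = n,
            # NOTE: at temperature 0 the n choices are likely to be near-identical;
            # raise it if varied samples or independent votes are wanted
            temperature=0.,
            logprobs=True,
            messages=[
                        {"role": "system",
                        "content": SYSTEM_PROMPT
                        },
                        {"role": "user", 
                        "content": USER_TEXT
                        }],
            stream=False,
        )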

In [None]:
prmpt = 'What is the fastest bird alive, and how can it be so fast?'
# adjacent string literals are joined; note the trailing space after "you"
sys_prmpt = ('You are a bird expert, and you '
             'are talking to a friend who is not a bird expert.')

test_response = get_chat_res(USER_TEXT=prmpt,
                             SYSTEM_PROMPT=sys_prmpt,
                             n=10,
                             MODEL="gpt-3.5-turbo")

In [None]:
log_probs = []
for i, r in enumerate(test_response.choices):
    _log_probs = []
    for t in r.logprobs.content:
        _log_probs.append(t.logprob)
    log_probs.append(_log_probs)

In [None]:
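fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(14, 5))
for i, sample_log_probs in enumerate(log_probs):
    probs = np.exp(sample_log_probs)
    ax[0].hist(probs, bins=20, alpha=1,
               histtype='step', linewidth=2, label=f"sample {i}")
    # one row of points per sample: y must have the same length as x
    ax[1].scatter(x=probs, 
                  y=[i for _ in range(len(probs))],
                  label=f"sample {i}",
                  alpha=0.2)
ax[0].set_xlabel("token probability")
ax[1].set_xlabel("token probability")
ax[1].set_ylabel("sample")
plt.legend()
plt.suptitle("Token probabilities", fontsize=16)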
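
The token log-probabilities can also be condensed into a single confidence number per sample. A minimal sketch, assuming the geometric mean of the token probabilities is an acceptable summary; `mean_token_probability` is a hypothetical helper, not part of the notebook's `utils` package.

```python
import math
from typing import List


def mean_token_probability(logprobs: List[float]) -> float:
    """Geometric mean of the token probabilities, i.e. exp(mean log-probability)."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(sum(logprobs) / len(logprobs))


# two tokens with probability 0.5 each give a mean token probability of 0.5
print(round(mean_token_probability([math.log(0.5), math.log(0.5)]), 6))
```

Applied to the cell above, this would be `[mean_token_probability(lp) for lp in log_probs]`.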

In [None]:
GPT_VERSION = 'gpt-4-1106-preview'
CURRENT_DATE = datetime.datetime.now().strftime("%Y%m%d")

In [None]:
def generate_examples(DocList: list[dict], 
                      SystemPrompt: str, 
                      GPT_version: str='gpt-4-1106-preview',
                      write_out: bool=True,
                      N_samples: int=5,
                      Class_name: str=None)-> pd.DataFrame:
    """Ask the model for N_samples synthetic variants of every document in DocList."""
    CURRENT_DATE = datetime.datetime.now().strftime("%Y%m%d")
    MARKER = 'NIEUWE_TEKST'
    new_examples = []
    for doc in tqdm(DocList, total=len(DocList)):
        # replace existing pipes by spaces: the length stays the same, so offsets stay valid
        EXAMPLE = doc['text'].replace('|', ' ')
        # add | vertical bars around the concept that needs to be replaced
        # NOTE: the selection is hard-coded to Temporality == 'hypothetical',
        # whatever Class_name is passed
        LOCS = [(d['start'],d['end']) for d in doc['annotations'] 
                    if d['meta_anns'].get('Temporality', {}).get('value')=='hypothetical']
        # insert from the last span to the first, so earlier offsets are not shifted
        for start, end in sorted(LOCS, reverse=True):
            EXAMPLE = EXAMPLE[:start] + '|' + EXAMPLE[start:end] + '|' + EXAMPLE[end:]
        
        res = get_chat_res(SYSTEM_PROMPT=SystemPrompt, 
                        n=N_samples,
                        MODEL=GPT_version, # gpt-3.5-turbo-instruct-0914
                        USER_TEXT="VOORBEELDTEKST: " + EXAMPLE)

        for j, _res in enumerate(res.choices):
            txt = _res.message.content
            # keep what follows the marker; keep the full reply if the marker is missing
            marker_pos = txt.find(MARKER)
            if marker_pos >= 0:
                txt = txt[marker_pos + len(MARKER):]
            new_examples.append((
                doc['name'],
                Class_name,
                j,
                txt.strip()))    
    
    new_examples_df = pd.DataFrame(new_examples, columns=['doc_id', 'class_value', 'synth_num', 'text'])
    
    if write_out:
        new_examples_df.to_parquet(f'../data/synth_{Class_name}_{GPT_version.replace("-", "_")}_{CURRENT_DATE}_unchecked.parquet')
        
    return new_examples_df
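
Wrapping concepts in vertical bars means inserting characters at known offsets, and every insertion shifts the text that follows it. A minimal sketch of one way to handle this, by working from the last span to the first; `mark_spans` is a hypothetical helper and the sentence is the example from the Temporality definition.

```python
from typing import List, Tuple


def mark_spans(text: str, spans: List[Tuple[int, int]], marker: str = "|") -> str:
    """Wrap each (start, end) character span of `text` in `marker`."""
    # Work from the last span to the first so that inserting characters
    # never shifts the offsets of spans that still have to be processed.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + marker + text[start:end] + marker + text[end:]
    return text


sentence = "patient should return if she develops fever"
start = sentence.index("fever")
print(mark_spans(sentence, [(start, start + len("fever"))]))
# patient should return if she develops |fever|
```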

In [None]:
#re_extract = re.compile(r'NIEUWE_ZIN\:(.*)')
if run_hypothetical:
    print("Running hypothetical")
    hypothetical_df = generate_examples(relevant_docs_hypothetical, 
                                        SYSTEM_PROMPT_HYPOTHETICAL, 
                                        GPT_VERSION, 
                                        write_out=True, 
                                        Class_name='hypothetical')
#else:
#    hypothetical_df = pd.read_parquet(hypothetical_file)

In [None]:
#re_extract = re.compile(r'NIEUWE_ZIN\:(.*)')
if run_experiencer:
    print("Running experiencer")
    experiencer_df = generate_examples(relevant_docs_other,
                                       SYSTEM_PROMPT_EXPERIENCER, 
                                       GPT_VERSION, 
                                       write_out=True, 
                                       Class_name='experiencer')
#else:
#    experiencer_df = pd.read_parquet(experiencer_file)

In [None]:
if run_historical:
    print("Running historical")
    historical_df = generate_examples(relevant_docs_historical, 
                                      SYSTEM_PROMPT_HISTORICAL, 
                                      GPT_VERSION, 
                                      write_out=True, 
                                      Class_name='historical')
#else:
#    try:
#        historical_df = pd.read_parquet(historical_file)
#    except:
#        pass

In [None]:
if run_patient:
    print("Running patient")
    patient_df = generate_examples(relevant_docs_patient,
                                   SYSTEM_PROMPT_PATIENT, 
                                   GPT_VERSION, 
                                   write_out=True, 
                                   Class_name='patient')
#else:
#    try:
#        patient_df = pd.read_parquet(patient_file)
#    except:
#        pass

In [None]:
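import ast
from typing import Union

# run the checks
def check_by_gpt(TXT: str='', 
                 N_checks: int=3,
                 SYSTEM_PROMPT: str="Respect humans",
                 GPT_VERSION: str='gpt-4-1106-preview'
                 )->Union[str, bool]:
    """Let the model vote N_checks times on every marked concept in TXT.

    Returns the (possibly adjusted) text, False if the text should be dropped,
    or an 'ERROR-...' string if the replies disagree on the number of concepts.
    """
    # N_checks is no longer overridden inside the function (it was hard-coded to 7)
    maj_vote = N_checks//2  # a concept is rejected when more than half of the votes say 'nee'

    RES = get_chat_res(SYSTEM_PROMPT=SYSTEM_PROMPT,
                        USER_TEXT="VOORBEELDTEKST: " + TXT,
                        n=N_checks,
                        MODEL=GPT_VERSION)

    no_count = defaultdict(int)
    list_lens = []
    _d = {}
    for _res in RES.choices:
        txt = _res.message.content
        try:
            # the reply should be a dict literal; literal_eval parses it without executing code
            parsed = ast.literal_eval(txt)
        except (ValueError, SyntaxError, TypeError):
            continue
        if not isinstance(parsed, dict):
            continue
        _d = parsed
        
        for k, v in _d.items():
            if v=='nee':
                no_count[k] += 1
        
        list_lens.append(len(_d.keys())) 

    if len(set(list_lens))>1:
        return 'ERROR-checker concept count mismatch'
    if len(list_lens)==0:
        return False  # no parsable reply at all

    # approach: concepts rejected by a majority lose their markers;
    # if all concepts are rejected we flag the TXT for removal
    rejected = sorted(k for k, v in no_count.items() if v>maj_vote)
    if len(rejected)==0:
        return TXT
    if len(rejected)==len(_d.keys()):
        return False

    spans = [r.span() for r in re.finditer(r'(\|.*?\|)', TXT)]
    # the dict keys are used as indices into spans
    if len(spans)<max(_d.keys()) or rejected[-1]>=len(spans):
        return 'ERROR-checker concept count mismatch'
    # remove the bars from left to right; every removal shifts the rest two characters to the left
    for rcount, k in enumerate(rejected):
        start, end = (p - rcount*2 for p in spans[k])
        TXT = TXT[:start] + TXT[start+1:end-1] + TXT[end:]
    return TXT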
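
The voting step in isolation can be sketched as follows. This is an illustration under assumptions: every checker reply is taken to be a dict from concept index to a verdict, where `'nee'` rejects the concept (the `'ja'` value is assumed here), and `rejected_concepts` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def rejected_concepts(votes: List[Dict[int, str]], reject: str = "nee") -> List[int]:
    """Return the concept indices rejected by a strict majority of the replies."""
    threshold = len(votes) // 2  # strictly more than half of the replies must reject
    counts = Counter(k for vote in votes for k, v in vote.items() if v == reject)
    return sorted(k for k, n in counts.items() if n > threshold)


votes = [{0: "ja", 1: "nee"}, {0: "ja", 1: "nee"}, {0: "nee", 1: "ja"}]
print(rejected_concepts(votes))
# [1]
```

With an even number of replies a tie does not reject a concept, because the comparison is strict.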
    

In [None]:
from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

In [None]:
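from typing import Iterator, Optional

def perform_synthetic_check(SyntDf: pd.DataFrame, 
                            NumChecks=3, 
                            SystemPrompt="Don't kill us, please",
                            GPT_VERSION ="gpt-4-1106-preview",
                            text_checked: Optional[list]=None)-> Iterator:
    """Yield one check result per row; rows with an entry in text_checked are not re-checked."""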
    if (text_checked is not None):
        assert len(text_checked)==SyntDf.shape[0], \
        "text_checked should have the same length as SyntDf"
    else:
        text_checked = [None]*SyntDf.shape[0]
    texts = SyntDf['text'].values
    
    for k, text in enumerate(texts):
        if text_checked[k] is None:
            yield check_by_gpt(text, 
                            N_checks=NumChecks, 
                            GPT_VERSION=GPT_VERSION,
                            SYSTEM_PROMPT=SystemPrompt)
        else:
            yield text_checked[k]


def apply_checks(SyntDf: pd.DataFrame, 
                Class_name: str="Experiencer",
                text_checked: list=None,
                write_out: bool=True)-> pd.DataFrame:
            
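    # work on copies so the caller's DataFrame is left untouched and the column
    # assignments below do not act on a slice (pandas SettingWithCopyWarning)
    SyntDf = SyntDf.copy()
    SyntDf['checked_text'] = text_checked

    c1 = SyntDf['checked_text']!=False 
    c2 = SyntDf['checked_text']!='ERROR-checker concept count mismatch'

    SyntDf = SyntDf[c1 & c2].copy()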

    # check token length
    SyntDf['token_len'] = SyntDf['checked_text'].astype(str)\
                                .progress_apply(lambda x:
                                    len(x.split()))

    # check number of r'\|.*?\|' in the text
    SyntDf['n_concepts'] = SyntDf['checked_text'].astype(str)\
                                .progress_apply(lambda x:
                                    len(re.findall(r'\|.*?\|', x)))
                                
    c = SyntDf['n_concepts']>0
    SyntDf = SyntDf[c]

    SyntDf = SyntDf.drop(['text'], axis=1)
    SyntDf = SyntDf.rename(columns={'checked_text': 'text'})
    if write_out:
        SyntDf.to_parquet(f'../data/synth_{Class_name}_{GPT_VERSION.replace("-", "_")}_{CURRENT_DATE}_checked.parquet')
    return SyntDf

In [None]:
# historical_df only exists when the historical run is enabled
if run_historical:
    checked_historical = [None]*historical_df.shape[0]

In [None]:
if run_historical:
    print("Checking historical")    
    checker = perform_synthetic_check(historical_df, 
                                        NumChecks=3, 
                                        SystemPrompt=SYSTEM_PROMPT_HISTORICAL_CHECK, 
                                        GPT_VERSION=GPT_VERSION,
                                        text_checked=checked_historical)
    for k, res in tqdm(enumerate(checker), total=historical_df.shape[0]):
        checked_historical[k] = res    
    
    
    print("Apply checks")
    historical_df_checked = apply_checks(historical_df, 
                                        Class_name='historical', 
                                        text_checked=checked_historical,
                                        write_out=True)

In [None]:
# patient_df only exists when the patient run is enabled
if run_patient:
    checked_patient = [None]*patient_df.shape[0]

In [None]:
if run_patient:
    print("Checking patient")    
    checker = perform_synthetic_check(patient_df, 
                                        NumChecks=3, 
                                        SystemPrompt=SYSTEM_PROMPT_PATIENT_CHECK, 
                                        GPT_VERSION=GPT_VERSION,
                                        text_checked=checked_patient)
    for k, res in tqdm(enumerate(checker), total=patient_df.shape[0]):
        checked_patient[k] = res    
    
    
    print("Apply checks")
    patient_df_checked = apply_checks(patient_df, 
                                        Class_name='patient', 
                                        text_checked=checked_patient,
                                        write_out=True)

### Extract the spans from the synthetic set and add them to the original dataset with an additional label
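
The offset bookkeeping behind this step can be sketched on its own. The sketch assumes that the `|` markers are balanced and not nested; `extract_marked_spans` is a hypothetical helper, and the sentence is the example from the Experiencer definition.

```python
import re
from typing import List, Tuple


def extract_marked_spans(marked: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Strip the '|' markers and return (clean_text, [(start, end, value), ...])."""
    spans = []
    for n, match in enumerate(re.finditer(r"\|([^|]+)\|", marked)):
        start, end = match.span()
        # every earlier concept lost its two markers in the clean text
        clean_start = start - 2 * n
        clean_end = end - 2 - 2 * n
        spans.append((clean_start, clean_end, match.group(1)))
    return marked.replace("|", ""), spans


clean, spans = extract_marked_spans("Mother is recently diagnosed with |cancer|")
print(clean)
print(spans)
# Mother is recently diagnosed with cancer
# [(34, 40, 'cancer')]
```

Slicing the clean text with the returned offsets gives back the concept, which is a cheap sanity check to run on every synthetic document.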

In [None]:
def put_in_dict(data, original_documents, Class_name='Experiencer', Class_value=None):
    max_id = max([int(k['id']) for d in original_documents
                           for k in d['annotations'] if k['id'] is not None])
    
    new_documents = original_documents.copy()
    
    if isinstance(data, pd.DataFrame):
        data = [(r.doc_id, r.class_value, r.synth_num, r.text)  for r in data.itertuples()]
    
    for i, (name, Class_value_spec, subid, text) in enumerate(data): # start=max_id+1
        Class_value_spec = Class_value_spec if Class_value_spec is not None else Class_value
        
        if i == 0:
            true_i = i + max_id + 1
        else:
            true_i = true_i + 1
        
        clean_text = text.replace('|', '')
        _doc = {
            'id': true_i,
            'source': 'synthetic',
            'source_version': f"{GPT_VERSION}|{CURRENT_DATE}",     
            'name': f"{name}|synth|{Class_name}|{subid}|{CURRENT_DATE}",
            'text': clean_text,
            'annotations' : []
        }
        
        try:
            # NOTE: the clean offsets are only correct if every '|' in the text
            # belongs to a concept matched by this pattern
            for concept_count, match in enumerate(re.finditer(r'\|[A-zÀ-ÿ\s]+\|', text)):
                start, end = match.span()
                # each earlier concept lost its two '|' markers in clean_text
                start_clean = start - concept_count*2
                end_clean = end - 2 - concept_count*2
                true_i = true_i + 1
                _doc['annotations'].append(
                            {
                                'id': true_i,
                                'user': 'emc_dcc_synth',
                                'cui': 1,
                                'start': start_clean,
                                'end': end_clean,
                                'value': text[start+1:end-1],
                                'validated': False,
                                'correct': True,
                                'alternative': False,
                                'killed': False,
                                'meta_anns': {
                                    Class_name: {'value': Class_value_spec,
                                                    'name': Class_name,
                                                    'validated': False,
                                                    'acc': 1.0
                                                    },
                                }            
                            }
                        )         
        except:
            print(i, text)
        new_documents.append(_doc)
    return new_documents

In [None]:
if run_experiencer:
    experiencer_df = experiencer_df_checked[['doc_id', 'class_value', 'synth_num', 'text']]
if run_hypothetical:
    hypothetical_df = hypothetical_df_checked[['doc_id', 'class_value', 'synth_num', 'text']]
if run_historical:
    historical_df = historical_df_checked[['doc_id', 'class_value', 'synth_num', 'text']]
if run_patient:
    patient_df = patient_df_checked[['doc_id', 'class_value', 'synth_num', 'text']].copy()
    patient_df['class_value'] = 'patient'

In [None]:
# get class count of update_docs_
pprint.pprint(get_class_counts(update_docs_))

In [None]:
new_docs = update_docs_.copy()
if run_experiencer:
    new_docs = put_in_dict(experiencer_df, new_docs.copy(), 
                                    Class_name='Experiencer', 
                                    Class_value='other')
if run_hypothetical:
    new_docs = put_in_dict(hypothetical_df, new_docs.copy(),
                                   Class_name='Temporality', 
                                   Class_value='hypothetical')
if run_historical:
    new_docs = put_in_dict(historical_df, new_docs.copy(),
                                   Class_name='Temporality', 
                                   Class_value='historical')
    
if run_patient:
    new_docs = put_in_dict(patient_df, new_docs.copy(),
                                   Class_name='Experiencer', 
                                   Class_value='patient')

In [None]:
pprint.pprint(get_class_counts(new_docs))

In [None]:
# max_id is the highest annotation id, which is not necessarily the number of annotations
max_id = max([int(k['id']) for d in update_docs_
                           for k in d['annotations'] 
                           if k['id'] is not None])

print(f"Highest annotation id in original DCC: {max_id}") 

max_id = max([int(k['id']) for d in new_docs
                           for k in d['annotations'] 
                           if k['id'] is not None])

print(f"Highest annotation id in synthetic augmented DCC: {max_id}")

## Write new samples to the dataset

In [None]:
DCC['projects'][0]['documents'] = new_docs

In [None]:
fn = '../data/emc-dcc_ann_Augmented_ASL1_Historical.json'

# write DCC back to json 
with open(fn, 'w', encoding='utf-8') as f:
    json.dump(DCC, f, indent=2)

In [None]:
fn = '../data/emc-dcc_ann_ADJ.json'

# write DCC back to json 
with open(fn, 'w', encoding='utf-8') as f:
    json.dump(DCC, f, indent=2)

## Write to DCC_df

In [None]:
texts, labels, ids = preprocess_dcc_for_robbert.get_tuples_from_medcat_json(fn)

In [None]:
dataset, errors = preprocess_dcc_for_robbert.get_dataset(texts, labels, ids)

# index error:
# only one label, label for word at the end

# Mismatch:
# '#' preceding, part of compound

len(errors)

In [None]:
df, ids = preprocess_dcc_for_robbert.get_dataframe(dataset)
print(f"\tProcessed {len(set(ids))} files")
df.to_csv("../data/DCC_df_ASL1_Historical.csv", index=False, sep="\t")

In [None]:
df.Experiencer.value_counts()

In [None]:
df.Temporality.value_counts()

In [None]:
df.Negation.value_counts()

## Translate English corpora

* English -- BioScope; [HF](https://huggingface.co/datasets/bigbio/bioscope), [src](https://rgai.inf.u-szeged.hu/downloads)
* English -- [Genia](http://www.geniaproject.org/genia-corpus/term-corpus)
* English -- Sherlock, SFU review corpus
