# 02 - Generating Dataset 1

This notebook creates a dataset that is used to test the performance of the package and to run a series of experiments.

Separating dataset creation from its use makes it easier to reuse the same dataset across the different tests and experiments.

In [2]:
# Reload functions every time
%load_ext autoreload 
%autoreload 2

In [3]:
import json
import os
import sys

# Add the src directory to sys.path so that the privacy_fingerprint
# package can be imported.
# Note: this assumes the current working directory is the folder containing this notebook.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir, "src")))
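The path setup above does not complain if the notebook is started from a different directory; the later import simply fails. A minimal sketch of a stricter variant is shown below. The helper name `add_src_to_path` is hypothetical and is not part of `privacy_fingerprint`.

```python
import sys
from pathlib import Path


def add_src_to_path(notebook_dir=None):
    """Append <parent of notebook_dir>/src to sys.path and return that path.

    Hypothetical helper: raises if the src directory does not exist, which
    usually means the working directory is not the notebook folder.
    """
    base = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    src_dir = base.resolve().parent / "src"
    if not src_dir.is_dir():
        raise FileNotFoundError(f"Expected source directory at {src_dir}")
    if str(src_dir) not in sys.path:
        sys.path.append(str(src_dir))
    return src_dir
```

Calling `add_src_to_path()` with no argument uses the current working directory, matching the assumption stated in the cell above.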

In [4]:
from privacy_fingerprint.common.config import (
    load_global_config_from_file,
    load_experiment_config_from_file,
    load_experiment_config,
)
import privacy_fingerprint.generate.synthea as synthea
import privacy_fingerprint.generate.language_model as llm
import privacy_fingerprint.extract.aws_comprehend as aws



In [8]:
# Example config files are available in the configs directory.
# These files will need to be customised with your API keys.

load_global_config_from_file("../configs/global_configs.yaml")
load_experiment_config_from_file("../configs/experiment_config.yaml")

# Config options can be modified inline. To keep this notebook/experiment small
# the number of records will be changed to 100.
expt_config = load_experiment_config()
# expt_config.synthea.encounter_type = "Encounter for symptom"
expt_config.synthea.num_records = 100  # 100_000 used to create dataset1
load_experiment_config(expt_config.model_dump())

ExperimentConfig(synthea=ExperimentSyntheaConfig(county='Hampshire', encounter_type='Encounter Inpatient', num_records=100, extra_config={}, records_per_patient=1, ethnicity_types=['White - British', 'White - Irish', 'White - Any other White background', 'Mixed - White and Black Caribbean', 'Mixed - White and Black African', 'Mixed - White and Asian', 'Mixed - Any other mixed background', 'Asian or Asian British - Indian', 'Asian or Asian British - Pakistani', 'Asian or Asian British - Bangladeshi', 'Asian or Asian British - Any other Asian background', 'Black or Black British - Caribbean', 'Black or Black British - African', 'Black or Black British - Any other Black background', 'Other Ethnic Groups - Chinese', 'Other Ethnic Groups - Any other ethnic group']), openai=ExperimentOpenAPIConfig(model='text-davinci-003', max_tokens=256, temperature=0.7, prompt='Describe this patient as if you were a medical doctor.'), scoring=ScoringConfig(encoding_scheme='one-hot', max_columns=30))

In [6]:
expt_config.synthea.encounter_type

'Encounter Inpatient'

In [7]:
# The Synthea output will be saved to a directory
output_dir = "../experiments/02_generate_dataset_inpatients"
os.makedirs(output_dir, exist_ok=True)
export_directory = os.path.join(output_dir, "synthea")

In [9]:
# CAUTION: Given the number of records, running this cell will be extremely slow.
import datetime
print(datetime.datetime.now())

# Generate structured records
synthea_records = synthea.generate_records(export_directory)

with open(os.path.join(output_dir, "synthea_dataset.json"), "w") as fp:
    json.dump(synthea_records, fp)
    
print(datetime.datetime.now())

2023-11-19 10:35:57.405413
Encounter Inpatient
2023-11-19 10:36:45.115142


A modified version of the cell above was run to create dataset1: Synthea generated 100,000 records, but only 1,000 of them were imported. The records generated in our run are available separately from this repository.
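The modified code is not included in this notebook. One way such a cap could be applied, assuming the generated records can be consumed as an iterable, is sketched below. `take_first` is a hypothetical helper, and the generator of dictionaries is only a stand-in for the Synthea output.

```python
from itertools import islice


def take_first(records, limit):
    """Return at most `limit` items from any iterable of records."""
    return list(islice(records, limit))


# Stand-in for 100,000 generated records (assumption: any iterable works).
generated = ({"record_id": i} for i in range(100_000))
subset = take_first(generated, 1000)
print(len(subset))
```

Because `islice` stops consuming the iterable once the limit is reached, the remaining records are never loaded into memory.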

In [14]:
# If using a previously generated set of records they can be loaded as follows:

with open(os.path.join(output_dir, "synthea_dataset.json")) as fp:
    synthea_records = json.load(fp)
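Each expensive step in this notebook follows the same pattern: generate the results, save them as JSON, and reload them in later sessions. A minimal sketch of a helper that combines both halves is given below, assuming the results are JSON-serialisable. The name `load_or_create` is hypothetical and is not part of `privacy_fingerprint`.

```python
import json
import os


def load_or_create(path, create_fn):
    """Load JSON from `path` if it exists; otherwise call `create_fn`,
    save its result to `path`, and return that result."""
    if os.path.exists(path):
        with open(path) as fp:
            return json.load(fp)
    result = create_fn()
    with open(path, "w") as fp:
        json.dump(result, fp)
    return result
```

For example, `load_or_create(os.path.join(output_dir, "synthea_dataset.json"), lambda: synthea.generate_records(export_directory))` would only run Synthea when no saved file is present.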

The structured records from Synthea can then be converted to free-text clinical notes.

In [None]:
clinical_note_generator = llm.LMGenerator()
llm_results = list(clinical_note_generator.generate_text(synthea_records))

with open(os.path.join(output_dir, "llm_dataset.json"), "w") as fp:
    json.dump(llm_results, fp)

In [None]:
# If using a previously generated set of records they can be loaded as follows:

with open(os.path.join(output_dir, "llm_dataset.json")) as fp:
    llm_results = json.load(fp)

In [None]:
# The NER step, which uses Amazon Comprehend Medical, is the most expensive step.
# The cost can be estimated with the following function:

print(f"Estimated cost is ${aws.calculate_ner_cost(llm_results)}")
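How `aws.calculate_ner_cost` arrives at its estimate is not shown in this notebook. As a rough illustration only, the sketch below assumes that billing is based on the number of characters sent, counted in fixed-size units and rounded up per note, and that each note is a plain string. The function name `estimate_cost_by_characters`, the unit size, and the price per unit are all placeholder assumptions that should be checked against current AWS pricing.

```python
import math


def estimate_cost_by_characters(notes, price_per_unit=0.01, unit_size=100):
    """Estimate cost assuming billing per `unit_size` characters per note.

    Both defaults are placeholder assumptions, not confirmed prices.
    """
    units = sum(math.ceil(len(note) / unit_size) for note in notes)
    return units * price_per_unit


# Two made-up notes of 250 and 100 characters.
example_notes = ["a" * 250, "b" * 100]
print(f"Estimated cost is ${estimate_cost_by_characters(example_notes):.2f}")
```

Running an estimate like this on a small sample first gives a sense of scale before the full dataset is submitted.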

In [None]:
aws_extract = aws.ComprehendExtractor()
ner_records = [aws_extract.extract_record(r) for r in llm_results]

with open(os.path.join(output_dir, "ner_dataset.json"), "w") as fp:
    json.dump(ner_records, fp)

In [None]:
# If using a previously generated set of records they can be loaded as follows:

with open(os.path.join(output_dir, "ner_dataset.json")) as fp:
    ner_records = json.load(fp)

With the raw NER results generated, the experiments themselves are carried out in separate notebooks.