# Generative example

This example uses the LangChain framework with Ollama to generate synthetic clinical notes from synthetic structured data.

Running this notebook requires a separate install of [Ollama](https://ollama.ai/). The model used is `llama2:latest` (also named `llama2:7b-chat`) from the [Ollama model library](https://ollama.ai/library/llama2/tags).

Open a terminal and run `ollama pull llama2` to download the model.
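
Before running the rest of the notebook, it can help to confirm that the model is available locally. The sketch below is not part of the original notebook. It shells out to `ollama list` and assumes the output is a whitespace-separated table with a header row starting with `NAME` and the model name in the first column. The function names `parse_model_names` and `model_is_pulled` are hypothetical.

```python
import shutil
import subprocess


def parse_model_names(listing):
    """Extract model names from text shaped like `ollama list` output.

    Assumes a header row starting with NAME and one model per line,
    with the model name as the first whitespace-separated field.
    """
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields or fields[0].upper() == "NAME":
            continue  # skip blank lines and the header row
        names.append(fields[0])
    return names


def model_is_pulled(model="llama2"):
    """Return True if `model` appears in the local Ollama model list."""
    if shutil.which("ollama") is None:
        return False  # the Ollama CLI is not on PATH
    try:
        result = subprocess.run(
            ["ollama", "list"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    if result.returncode != 0:
        return False
    # "llama2" should match tagged names such as "llama2:latest"
    return any(
        name == model or name.startswith(model + ":")
        for name in parse_model_names(result.stdout)
    )


print(model_is_pulled("llama2"))
```

If this prints `False`, either Ollama is not installed or the model still needs to be pulled.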

## Import libraries

In [2]:
notebook_name = 'explore_llama'

In [3]:
# Load libraries
import sys
import os
import pandas as pd
import numpy as np
from pathlib import Path

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# Import the variables that have been set in the init.py file in the root directory
# These include a constant called PROJECT_ROOT, which stores the absolute path to that directory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
import init
PROJECT_ROOT = os.getenv("PROJECT_ROOT")

# Add the src folder to sys path, so that the application knows to look there for libraries
sys.path.append(str(Path(PROJECT_ROOT) / 'src'))

In [23]:
import json

from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate

## Load input from json

In [5]:
input_path = "../../privfp-poc-zk/experiments/02_generate_dataset/synthea_dataset_inpatient.json"

with open(input_path) as file:
    data = json.load(file)

batch = []

for i in range(5):
    batch.append({"data": json.dumps(data[i])})

In [6]:
# Note - each batch item is its own dict, whose keys match the {} placeholders in the prompt template
print(type(batch[0]))
batch[1]

<class 'dict'>


{'data': '{"name": "Mrs. Katina Trantow", "NHS number": "118 767 7799", "address": "348 Huels Hollow, Emsworth, PO10 7BR", "date of birth": "1962-11-03", "marital status": "married", "ethnicity": "White - Irish", "gender": "female", "visit type": "Encounter Inpatient", "visit date": "2000-12-15T07:37:49Z", "provider": {"doctor": "Jaquelyn Bernier", "facility": "Havant Community Health"}, "visit reason": "Appendicitis", "conditions": ["History of appendectomy"]}'}

## Load input from dict

In [7]:
FIHR_DIR = '../../privfp-poc-zk/experiments/02_generate_dataset'

# with open(FIHR_DIR + '/' + 'all_inpatient_visit_dict.json') as f:
#    all_inpatient_visit_dict = json.load(f)
   
#    # ed.head()


with open(FIHR_DIR + '/' + 'long_visit_dict.json') as f:
   long_visit_dict = json.load(f)

print(len(long_visit_dict.keys()))
   
ex = list(long_visit_dict.keys())[0]
ex =  "urn:uuid:6ac2bf6b-6a8c-51e9-a2dd-30eb780264c5"

18


In [8]:
import re
import datetime

fhir_bundle_dir = FIHR_DIR + '/synthea/fhir'

for key, visit in long_visit_dict.items():
    encounter_started = visit['Encounter']['Encounter Started']
    # Keys look like "urn:uuid:<uuid>"; drop the 9-character prefix to get the bare UUID
    uuid = key[9:]

    # Find the Synthea FHIR bundle whose filename contains this UUID
    patient_file = [file for file in os.listdir(fhir_bundle_dir) if re.search(re.escape(uuid), file, re.IGNORECASE)]
    with open(fhir_bundle_dir + '/' + patient_file[0]) as f:
        patient_fhir = json.load(f)

    # This code assumes the first entry in the bundle is the Patient resource
    patient = patient_fhir['entry'][0]['resource']
    patient_dict = {}
    patient_dict['name'] = ' '.join(patient['name'][0]['given']) + ' ' + patient['name'][0]['family']
    patient_dict['birthDate'] = patient['birthDate']
    patient_dict['gender'] = patient['gender']
    # Approximate age in whole years at the start of the encounter
    patient_dict['age'] = int((datetime.datetime.fromisoformat(encounter_started).date() - datetime.datetime.strptime(patient_dict['birthDate'], '%Y-%m-%d').date()).days / 365.25)
    # Replace the patient reference with the demographic details
    visit['Patient'] = patient_dict
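
The age calculation above divides a day count by 365.25, which is an approximation: on the exact date of a birthday it can come out one year low. A minimal sketch of a calendar-based alternative follows. The function name `age_on` is hypothetical, and the dates are taken from the example record shown later in this notebook.

```python
import datetime


def age_on(birth_date, on_date):
    """Whole years completed between birth_date and on_date."""
    years = on_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in on_date's year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


birth = datetime.date.fromisoformat("1945-07-26")
encounter = datetime.datetime.fromisoformat("2020-08-06 13:08:58").date()
print(age_on(birth, encounter))  # 75

# On a first birthday the day-count approximation gives 0, the calendar method gives 1
first_birthday_days = (datetime.date(2002, 1, 1) - datetime.date(2001, 1, 1)).days
print(int(first_birthday_days / 365.25), age_on(datetime.date(2001, 1, 1), datetime.date(2002, 1, 1)))  # 0 1
```

For synthetic data the difference rarely matters, but the calendar comparison avoids the edge case at no extra cost.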

In [9]:
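batch = []

# Serialise each visit record (not just its key) so the prompt receives the patient data
for key in list(long_visit_dict.keys())[:5]:
    batch.append({"data": json.dumps(long_visit_dict[key])})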

## Load LLM

The model used to run inference can be easily swapped out using LangChain and Ollama!

Simply open a terminal and run `ollama pull <model_name:tag>` to retrieve any model from the Ollama model library and pass `<model_name:tag>` as the new argument when instantiating the LLM.

For example, if you wanted to use Mistral instead of Llama2, you would need to run `ollama pull mistral` in a terminal and set `model="mistral"` below.

In [24]:
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = Ollama(model="llama2", callback_manager=callback_manager)

## Construct prompt

In [11]:
template = """[INST]
<<SYS>>
Your goal is to give your judgement about a patient's condition, and to make a plan for their admission to the hospital. 
You may be given an extensive list of lab results or observations. Do not restate the results of observations as a list. However, you should refer to them in giving your judgement.
<</SYS>>

### INSTRUCTIONS ### 

Format the output as JSON with the following keys:

"history_of_present_illness": a detailed narrative of the patient's current complaint

"physical_examination": findings from a detailed examination of the patient, as a narrative within which you comment on vital signs, observations and procedures

"assessment_and_plan": interpretations, diagnoses, and proposed management plan

Here is some information about the patient: 

{data}
[/INST]
"""

prompt = PromptTemplate.from_template(template)
chain = prompt | llm

In [12]:
# Note: the template could define further input variables via {} placeholders, e.g. {demographics}.
# Each one would need a matching key in the input dict passed to the prompt.
prompt.input_variables

['data']
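
`input_variables` lists the placeholder names found in the template. As a rough standard-library illustration of the same idea (this is not LangChain's implementation), `string.Formatter().parse` can extract `{}` field names, which makes it easy to check that an input dict supplies every placeholder. The helper names and the two-placeholder template below are made up for the example.

```python
from string import Formatter


def placeholder_names(template):
    """Return the distinct {field} names in a format-style template, in order."""
    names = []
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def missing_keys(template, item):
    """List the placeholders that the input dict does not supply."""
    return [name for name in placeholder_names(template) if name not in item]


example_template = "Demographics: {demographics}\nVisit record: {data}"
print(placeholder_names(example_template))             # ['demographics', 'data']
print(missing_keys(example_template, {"data": "{}"}))  # ['demographics']
```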

## Run on single item

In [20]:
long_visit_dict[list(long_visit_dict.keys())[i]]

{'Patient': {'name': 'Shizuko29 Rozella39 Morar593',
  'birthDate': '1945-07-26',
  'gender': 'female',
  'age': 75},
 'Encounter': {'Encounter id': 'urn:uuid:441802e5-48e2-d381-7fbd-6c0d528a0566',
  'Encounter Started': '2020-08-06 13:08:58',
  'Encounter Ended': '2020-09-01 13:23:58',
  'Encounter Duration': 2247300,
  'Hospital Staff': 'Dr. Guillermo498 Fay398',
  'Type of admission': 'Admission to hospice (procedure)'},
 'Condition': {},
 'Procedure': {'1': {'Text': 'Certification procedure (procedure)',
   'Started': '2020-08-06T13:08:58+01:00',
   'Ended': '2020-08-06T13:23:58+01:00'},
  '2': {'Text': 'Notifications (procedure)',
   'Started': '2020-08-06T13:08:58+01:00',
   'Ended': '2020-08-06T13:23:58+01:00'},
  '3': {'Text': 'Initial patient assessment (procedure)',
   'Started': '2020-08-06T13:08:58+01:00',
   'Ended': '2020-08-06T13:23:58+01:00'},
  '4': {'Text': 'Development of individualized plan of care (procedure)',
   'Started': '2020-08-06T13:08:58+01:00',
   'Ended':

In [25]:
for key in list(long_visit_dict.keys())[:5]:
    llm_input = {'data': json.dumps(long_visit_dict[key])}
    # Unpack the dict so {data} receives the JSON string rather than the dict's repr
    formatted_prompt = prompt.format_prompt(**llm_input)
    response = llm.invoke(formatted_prompt.to_string())
    long_visit_dict[key]['Admission_note'] = response

The output you provided is a JSON object that represents a collection of observations for a patient. Here's a breakdown of the structure:

* `data`: This is the main data structure in the output, which contains a list of observation objects. Each observation object has several properties, including:
	+ `Text`: This is the text of the observation, represented as a string.
	+ `Started`: This is the start time of the observation, represented as a timestamp in ISO 8601 format (e.g., "2022-06-15T19:37:52+01:00").
	+ `Ended`: This is the end time of the observation, also represented as a timestamp in ISO 8601 format (e.g., "2022-06-15T19:52:52+01:00").
* `Observation`: This is an array that contains a single observation object. The `Observation` object has the same properties as the ` Observation` object described above.
* `Patient`: This is an object that represents the patient for whom the observations were made. It has several properties, including:
	+ `id`: This is the unique identifier 

KeyboardInterrupt: 

In [11]:
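# Unpack the batch item so {data} receives the JSON string rather than the dict's repr
formatted_prompt = prompt.format_prompt(**batch[0])
type(formatted_prompt.to_string())

# for a batch of items
# llm.generate([formatted_prompt.to_string()])

# for a single item (invoke replaces the deprecated direct call llm(...))
response = llm.invoke(formatted_prompt.to_string())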


{
"history_of_present_illness": "The patient is a 45-year-old male who presents with sudden onset of severe chest pain radiating to the left arm, shortness of breath, and lightheadedness. The patient reports that the symptoms began approximately 1 hour ago and have worsened over time. The patient has no history of cardiac disease or other significant medical conditions. The patient is conscious and alert, but appears anxious and agitated.",
"physical_examination": "On examination, the patient's vital signs are as follows: blood pressure 180/100 mmHg, pulse rate 120 bpm, respiratory rate 24 bpm, oxygen saturation 90% on room air. The patient has a distended abdomen with tenderness to palpation in the epigastric area. There is a faint murmur heard on the left side of the chest. The patient's neck is supple and there is no jugular vein distension. The patient's cardiac auscultation reveals a grade 2/6 systolic murmur at the apex of the heart.",
"assessment_and_plan": "Based on the patient

In [25]:
from utils.write_to_json import write_to_json

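try:
    response_json = json.loads(response)
except (json.JSONDecodeError, TypeError):
    # The model output is not always valid JSON (it may be truncated or wrapped in prose)
    print("Can't convert to json")
else:
    write_to_json(response_json, notebook_name)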
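
Chat models sometimes wrap the requested JSON in introductory or closing prose, which makes `json.loads` fail on the full response. One simple mitigation, sketched here with a hypothetical helper name, is to parse only the text between the first `{` and the last `}`. It is a heuristic and will not rescue truncated or otherwise malformed JSON.

```python
import json


def extract_json_object(text):
    """Return the JSON object embedded in text, or None if nothing parses."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None  # no brace-delimited span to try
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


wrapped = 'Here is the note:\n{"assessment_and_plan": "Admit for observation."}\nI hope this helps.'
print(extract_json_object(wrapped))
# {'assessment_and_plan': 'Admit for observation.'}
```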

In [21]:
os.listdir(output_path)
# output_file = output_path + '/llm_dataset.json'

# with open(output_file, 'w') as f:
#         json.dump(response, f)

['experiment_1.1_20231006_0850.json',
 'experiment_1.1_20231008_2049.json',
 'experiment_1.1_20231008_2106.json',
 '__init__.py',
 'experiment_1.1.3_20231029_1823.json',
 'experiment_1.1.1_20231027_1639.json',
 'experiment_1.0_20231002_2120.json',
 'experiment_1.1.2_20231027_1747.json',
 'experiment_1.1.1_20231022_1938.json']
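
The `write_to_json` helper comes from the project's `utils` package and its source is not shown in this notebook. Judging only from the filenames listed above, the outputs appear to follow a `<name>_<YYYYMMDD>_<HHMM>.json` pattern. The following is a hypothetical stand-in written under that assumption; the function name `write_timestamped_json`, the `output_dir` parameter, and the filename pattern are guesses rather than the project's actual implementation.

```python
import datetime
import json
import tempfile
from pathlib import Path


def write_timestamped_json(obj, name, output_dir="."):
    """Write obj as JSON to <output_dir>/<name>_<YYYYMMDD>_<HHMM>.json; return the path."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M")
    out_path = Path(output_dir) / f"{name}_{stamp}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(obj, f, indent=2)
    return out_path


# Usage: write to a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    written = write_timestamped_json({"note": "example"}, "explore_llama", tmp)
    print(written.name)
```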

## Run inference on batch

In [26]:
outputs = chain.batch(batch)

{
"history_of_present_illness": "The patient is a 45-year-old male who presents with a 2-day history of worsening headache, nausea, and vomiting. The patient states that the symptoms started suddenly and have been getting worse over time. He also reports feeling dizzy and disoriented, with difficulty walking and maintaining balance. The patient has no significant medical history or recent travel. He is currently working as a manual laborer and denies any trauma or recent injuries.",
"physical_examination": "On examination, the patient is pale and sweaty. Vital signs show a temperature of 38.5 degrees Celsius, pulse rate of 120 beats per minute, respiratory rate of 24 breaths per minute, and blood pressure of 100/60 mmHg. The patient's headache is severe and localized to the frontal region, with no other focal neurological signs. There are no signs of dehydration or electrolyte imbalance. The patient's movements are slow and deliberate, with a slight tremor in both hands. The patient's 

In [37]:
outputs[0]

'{\n"history_of_present_illness": "The patient is a 45-year-old male who presents with a 2-day history of worsening headache, nausea, and vomiting. The patient states that the symptoms started suddenly and have been getting worse over time. He also reports feeling dizzy and disoriented, with difficulty walking and maintaining balance. The patient has no significant medical history or recent travel. He is currently working as a manual laborer and denies any trauma or recent injuries.",\n"physical_examination": "On examination, the patient is pale and sweaty. Vital signs show a temperature of 38.5 degrees Celsius, pulse rate of 120 beats per minute, respiratory rate of 24 breaths per minute, and blood pressure of 100/60 mmHg. The patient\'s headache is severe and localized to the frontal region, with no other focal neurological signs. There are no signs of dehydration or electrolyte imbalance. The patient\'s movements are slow and deliberate, with a slight tremor in both hands. The patie

In [48]:
# def clean_json_string(input_string):
#     # Regex to find control characters except new line, carriage return, and tab
#     control_char_regex = r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]'
    
#     # Replace found control characters with an empty string
#     cleaned_string = re.sub(control_char_regex, '', input_string)
    
#     return cleaned_string

# # json.loads(clean_json_string(outputs[0]))

# def escape_json_string(input_string):
#     # Escapes control characters like newline and tab that are within string values
#     escaped_string = input_string.replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t')
#     return escaped_string

def remove_control_characters(input_string):
    # Regex to match all ASCII control characters, including newline, carriage return and tab
    control_chars = re.compile(r'[\x00-\x1F\x7F]')

    # Remove the control characters
    return control_chars.sub('', input_string)

def fix_json_trailing_comma(json_string):
    # Regular expression to find trailing commas before a closing brace or bracket
    fixed_json = re.sub(r',(\s*[}\]])', r'\1', json_string)
    return fixed_json

json.loads(fix_json_trailing_comma(remove_control_characters(outputs[0])))
# json.loads(remove_control_characters(outputs[0]))



{'history_of_present_illness': 'The patient is a 45-year-old male who presents with a 2-day history of worsening headache, nausea, and vomiting. The patient states that the symptoms started suddenly and have been getting worse over time. He also reports feeling dizzy and disoriented, with difficulty walking and maintaining balance. The patient has no significant medical history or recent travel. He is currently working as a manual laborer and denies any trauma or recent injuries.',
 'physical_examination': "On examination, the patient is pale and sweaty. Vital signs show a temperature of 38.5 degrees Celsius, pulse rate of 120 beats per minute, respiratory rate of 24 breaths per minute, and blood pressure of 100/60 mmHg. The patient's headache is severe and localized to the frontal region, with no other focal neurological signs. There are no signs of dehydration or electrolyte imbalance. The patient's movements are slow and deliberate, with a slight tremor in both hands. The patient's 
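
Note that `remove_control_characters` also deletes newlines and tabs that sit inside string values, so text on either side of a line break gets joined together. An alternative worth knowing is that `json.loads` accepts `strict=False`, which allows raw control characters inside strings. The comparison below uses a made-up string, not real model output, and repeats the trailing-comma regex from the cell above so that it runs on its own.

```python
import json
import re

# Made-up model output: a raw newline inside a string value, plus a trailing comma
raw = '{\n"note": "Line one.\nLine two.",\n}'


def fix_json_trailing_comma(json_string):
    # Same regex as in the cell above: drop commas before a closing brace or bracket
    return re.sub(r',(\s*[}\]])', r'\1', json_string)


# Approach 1: strip every control character first (the newline inside the value is lost)
stripped = re.sub(r'[\x00-\x1F\x7F]', '', raw)
parsed_stripped = json.loads(fix_json_trailing_comma(stripped))
print(parsed_stripped)
# {'note': 'Line one.Line two.'}

# Approach 2: keep the text and let the parser tolerate control characters
parsed_lenient = json.loads(fix_json_trailing_comma(raw), strict=False)
print(parsed_lenient)
# {'note': 'Line one.\nLine two.'}
```

Which approach is preferable depends on whether line breaks within a note section should be preserved downstream.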


In [59]:
cleaned_outputs = []
encounter_keys = list(long_visit_dict.keys())

for i in range(5):
    print(i)
    try:
        response_json = json.loads(fix_json_trailing_comma(remove_control_characters(outputs[i])))
    except json.JSONDecodeError:
        # Skip outputs that are still not valid JSON after cleaning
        print('Could not parse item ' + str(i))
        continue
    cleaned_outputs.append({encounter_keys[i]: response_json})
    





0
1
2
Could not parse item 2
3
4
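
The run above shows one item failing to parse. A natural follow-up is to record which indices failed so that only those items need to be regenerated. A minimal sketch, using a hypothetical helper `parse_all` and made-up strings in place of real model output:

```python
import json


def parse_all(raw_outputs):
    """Parse each string as JSON; return (parsed_by_index, failed_indices)."""
    parsed, failed = {}, []
    for index, text in enumerate(raw_outputs):
        try:
            parsed[index] = json.loads(text)
        except json.JSONDecodeError:
            failed.append(index)
    return parsed, failed


example_outputs = ['{"a": 1}', '{"a": 2}', 'Sure! Here is the note:', '{"a": 4}']
parsed, failed = parse_all(example_outputs)
print(failed)  # [2]
```

The failed indices could then be used to select the matching inputs for another generation attempt.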


In [62]:
# cleaned_outputs already holds parsed JSON, so it can be written directly
if cleaned_outputs:
    write_to_json(cleaned_outputs, notebook_name)
else:
    print("No parsed outputs to write")

In [19]:
# output_path = "../../data"
# os.listdir(output_path)
# output_file = output_path + '/llm_dataset.json'

# with open(output_file, 'w') as f:
#         json.dump(batch, f)

list