# Synthea COVID-19 Preprocessing

## Synthea COVID-19 Module Analysis

This notebook provides an analysis of data generated by Synthea's COVID-19 module. The analysis runs on Synthea's CSV output.

Code in this notebook depends on pandas, NumPy, matplotlib, seaborn and OmegaConf, plus the local `analysis` module loaded below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime
from omegaconf import OmegaConf
import math

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 1
%aimport analysis

# Global configs
yaml_cfg = """
seed: 42
predict_target: outcome # outcome/LOS
"""
config = OmegaConf.create(yaml_cfg)
print(config.seed)

Read in all of the data

In [None]:
data_path = "./raw_data/"
conditions = pd.read_csv(data_path + "conditions.csv")
patients = pd.read_csv(data_path + "patients.csv")
observations = pd.read_csv(data_path + "observations.csv")
care_plans = pd.read_csv(data_path + "careplans.csv")
encounters = pd.read_csv(data_path + "encounters.csv")
devices = pd.read_csv(data_path + "devices.csv")
supplies = pd.read_csv(data_path + 'supplies.csv')
procedures = pd.read_csv(data_path + "procedures.csv")
medications = pd.read_csv(data_path + "medications.csv")

Grab the IDs of patients that have been diagnosed with COVID-19

In [None]:
covid_patient_ids = conditions[conditions.CODE == 840539006].PATIENT.unique()
covid_patient_ids[0:3]

This grabs every patient with a negative SARS-CoV-2 test. This will include patients who tested negative up front as well as patients who tested negative after leaving the hospital.

In [None]:
negative_covid_patient_ids = observations[(observations.CODE == '94531-1') & (observations.VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
negative_covid_patient_ids[0:3]

Grabs IDs for all patients that died in the simulation. This will be more than just COVID-19 deaths.

In [None]:
deceased_patients = patients[patients.DEATHDATE.notna()].Id

Grabs IDs for patients that have completed the care plan for isolation at home.

In [None]:
completed_isolation_patients = care_plans[(care_plans.CODE == 736376001) & (care_plans.STOP.notna()) & (care_plans.REASONCODE == 840539006)].PATIENT

Survivors are the union of those who have completed isolation at home or have a negative SARS-CoV-2 test.

In [None]:
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)

Grab IDs for patients with admission due to COVID-19

In [None]:
inpatient_ids = encounters[(encounters.REASONCODE == 840539006) & (encounters.CODE == 1505002)].PATIENT

The number of inpatient survivors

In [None]:
np.intersect1d(inpatient_ids, survivor_ids).shape

The number of inpatient non-survivors

In [None]:
np.intersect1d(inpatient_ids, deceased_patients).shape

In [None]:
inpatient_ids.shape
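
The set operations used in the cells above can be tried on small arrays first. The sketch below uses made-up patient IDs (not Synthea output) to show what `np.union1d` and `np.intersect1d` return; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical patient IDs, for illustration only.
completed_isolation = np.array(["p1", "p2", "p3"])
negative_test = np.array(["p3", "p4"])
inpatients = np.array(["p2", "p4", "p9"])

# Union: sorted, de-duplicated IDs that appear in either array.
survivors = np.union1d(completed_isolation, negative_test)

# Intersection: inpatients who are also counted as survivors.
inpatient_survivors = np.intersect1d(inpatients, survivors)

print(survivors)
print(inpatient_survivors)
```

Both functions return sorted arrays of unique values, so repeated IDs in the inputs do not inflate the counts read from `.shape`.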

## Health outcomes experienced by COVID-19 patients

The following table shows different health outcomes that were experienced by COVID-19 patients during the course of the disease.

In [None]:
analysis.outcome_table(inpatient_ids, survivor_ids, deceased_patients, conditions)

## Lab values for COVID-19 patients

The following code presents lab values taken for COVID-19 patients. Values are separated into survivors and non-survivors.

The first block of code selects the lab observations to analyze. The hand-picked subset of LOINC codes is commented out; every observation code is currently kept as a candidate feature.

In [None]:
# lab_obs = observations[(observations.CODE == '48065-7') | (observations.CODE == '26881-3') | 
#                           (observations.CODE == '2276-4') | (observations.CODE == '89579-7') |
#                           (observations.CODE == '2532-0') | (observations.CODE == '731-0') |
#                           (observations.CODE == '14804-9')
#                       ] # TODO more lab tests values
labtest_features = observations.CODE.unique().tolist()
# lab_obs[0:3]
# labtest_features[0:3]
lab_obs = observations[observations.CODE.isin(labtest_features)] # TODO

Select COVID-19 conditions out of all conditions in the simulation

In [None]:
covid_conditions = conditions[conditions.CODE == 840539006]
covid_conditions[0:3]

Merge the COVID-19 conditions with the patients

In [None]:
covid_patients = covid_conditions.merge(patients, how='left', left_on='PATIENT', right_on='Id')
covid_patients[0:3]

Add an attribute to the DataFrame indicating whether this is a survivor or not.

In [None]:
covid_patients['outcome'] = covid_patients.PATIENT.isin(survivor_ids)
covid_patients
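
As a rough illustration of the merge and labelling steps, the sketch below repeats them on two tiny hand-made tables. The column names mirror the Synthea CSVs, but every value is invented.

```python
import pandas as pd

# Toy tables; values are made up for illustration.
conditions = pd.DataFrame({"PATIENT": ["a", "b"], "CODE": [840539006, 840539006]})
patients = pd.DataFrame({"Id": ["a", "b", "c"], "GENDER": ["F", "M", "F"]})
survivor_ids = ["a"]

# A left merge keeps one row per condition and attaches the patient columns.
merged = conditions.merge(patients, how="left", left_on="PATIENT", right_on="Id")

# True when the patient ID appears in survivor_ids, False otherwise.
merged["outcome"] = merged["PATIENT"].isin(survivor_ids)

print(merged[["PATIENT", "GENDER", "outcome"]])
```

Note that `outcome` is `False` for every patient missing from `survivor_ids`, whether or not that patient died.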

Reduce the columns on the DataFrame to ones needed

In [None]:
covid_patients = covid_patients[['PATIENT', 'START', 'STOP', 'BIRTHDATE', 'DEATHDATE', 'outcome', 'GENDER']] # TODO other columns to consider
covid_patients[0:3]

Calculate attributes needed to support the plot. Also coerce all lab values into a numeric data type.

In [None]:
# def larger_date(a, b):
#     if a.dt.days>b.dt.days: return a
#     return b

covid_patients_obs = covid_patients.merge(lab_obs, on='PATIENT')
covid_patients_obs['START'] = pd.to_datetime(covid_patients_obs.START)
covid_patients_obs['STOP'] = pd.to_datetime(covid_patients_obs.STOP)
covid_patients_obs['BIRTHDATE'] = pd.to_datetime(covid_patients_obs.BIRTHDATE)
covid_patients_obs['DEATHDATE'] = pd.to_datetime(covid_patients_obs.DEATHDATE)
covid_patients_obs['DATE'] = pd.to_datetime(covid_patients_obs.DATE)
# covid_patients_obs['lab_days'] = covid_patients_obs.DATE - covid_patients_obs.START
# covid_patients_obs['days'] = covid_patients_obs.lab_days / np.timedelta64(1, 'D')
# covid_patients_obs['los_days'] = covid_patients_obs.STOP - covid_patients_obs.DATE

# !!!!NOTICE
# covid_patients_obs['LOS'] = (larger_date(covid_patients_obs.STOP, covid_patients_obs.DEATHDATE) - covid_patients_obs.DATE) / np.timedelta64(1, 'D')

# LOS is measured in days, matching the commented-out drafts above and below.
covid_patients_obs['temp_a'] = (covid_patients_obs['STOP'] - covid_patients_obs['DATE']) / np.timedelta64(1, 'D')
covid_patients_obs['temp_b'] = (covid_patients_obs['DEATHDATE'] - covid_patients_obs['DATE']) / np.timedelta64(1, 'D')

covid_patients_obs['LOS'] = np.nanmax(covid_patients_obs[['temp_a', 'temp_b']].values, axis=1)

covid_patients_obs.drop(columns=['temp_a', 'temp_b'], inplace=True)


covid_patients_obs['VALUE'] = pd.to_numeric(covid_patients_obs['VALUE'], errors='coerce')
# covid_patients_obs['VALUE'] = covid_patients_obs['VALUE']
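# Age in years. Day-based arithmetic avoids the ambiguous 'Y' timedelta unit,
# which recent pandas versions no longer accept.
covid_patients_obs['AGE'] = (covid_patients_obs.DATE - covid_patients_obs.BIRTHDATE).dt.days / 365.25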

# covid_patients_obs['days']
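
The length-of-stay logic can be checked on a few rows. This is a minimal sketch that assumes days as the unit and uses invented dates. It swaps `np.nanmax` for `DataFrame.max(axis=1)`, which also skips missing values but returns NaN quietly when both inputs are missing instead of emitting an all-NaN warning.

```python
import numpy as np
import pandas as pd

# Invented dates: one discharged patient, one deceased, one with neither date.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-01"]),
    "STOP": pd.to_datetime(["2020-03-11", None, None]),
    "DEATHDATE": pd.to_datetime([None, "2020-03-06", None]),
})

# Days from the observation to the end of the condition and to death.
to_stop = (df["STOP"] - df["DATE"]) / np.timedelta64(1, "D")
to_death = (df["DEATHDATE"] - df["DATE"]) / np.timedelta64(1, "D")

# Row-wise maximum that ignores missing values.
los = pd.concat([to_stop, to_death], axis=1).max(axis=1)
print(los.tolist())
```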

In [None]:
# for index, v in covid_patients_obs.iterrows():
#     if pd.isna(v.LOS):
#         # print(v['DEATHDATE'])
#         covid_patients_obs.loc[index, 'LOS'] = (v['DEATHDATE'] - v['DATE']) / np.timedelta64(1, 'D')

In [None]:
df_train = covid_patients_obs
# df_train.to_csv('train.csv')

In [None]:
# labtest_features are already defined

demographic_features = ['AGE', 'GENDER']

target_features = ['outcome', 'LOS']


In [None]:
df_train.rename(columns={'PATIENT': 'PATIENT_ID'}, inplace=True)
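# Encode gender as M -> 1, F -> 0. Assigning the result back is more reliable
# than calling replace(..., inplace=True) on a selected column.
df_train['GENDER'] = df_train['GENDER'].map({'M': 1, 'F': 0})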
# df_train['outcome'].replace(True, 0, inplace=True)
# df_train['outcome'].replace(False, 1, inplace=True)


In [None]:
# print(df_train['outcome'].describe())
# print(len(df_train['PATIENT_ID'].unique()))
# print(df_train)

# df_train.to_csv('train.csv')
# Encode the label: survivor (True) -> 0, everyone else (False) -> 1.
df_train['outcome'] = df_train['outcome'].map({False: 1, True: 0})
print('outcome', df_train['outcome'].describe())
# print(df_train[(df_train['outcome'] == False)])
# print(df_train[(df_train['outcome'] == True)])

print('LOS', df_train['LOS'].describe())
# datetime_is_numeric was removed in pandas 2.0; plain describe() runs on both old and new versions.
print('DEATHDATE', df_train['DEATHDATE'].describe())
print('STOP', df_train['STOP'].describe())
print('DATE', df_train['DATE'].describe())
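
The label encoding applied in the last two cells can be sketched on a toy frame. The values below are invented, and `Series.map` is used here as one straightforward way to express the mapping.

```python
import pandas as pd

# Invented rows: outcome is True for survivors before encoding.
df = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "outcome": [True, False, True],
})

# M -> 1, F -> 0
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})

# Survivor (True) -> 0, non-survivor (False) -> 1
df["outcome"] = df["outcome"].map({True: 0, False: 1})

print(df)
```

After this step a value of 1 in `outcome` marks a patient who is not in the survivor list, so the positive class is the adverse outcome.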


In [None]:
df_train = df_train[['PATIENT_ID', 'DATE', 'START', 'AGE', 'GENDER', 'BIRTHDATE', 'outcome', 'LOS' , 'CODE', 'VALUE']]
# NOTICE: DEATHDATE and STOP are dropped here. If they were part of the pivot index, rows with a
# missing value in either column would be discarded, leaving only patients with a single outcome.
# TODO: LOS issue
# df_train.to_csv('train.csv')

# print(df_train['outcome'].describe())

df_train = df_train.pivot_table(index = ['PATIENT_ID', 'DATE', 'START', 'AGE', 'GENDER', 'BIRTHDATE', 'outcome', 'LOS'], columns = 'CODE', values = 'VALUE', aggfunc = 'mean').reset_index()

# print('------')
# print(df_train['outcome'].describe())
# print('------')
# print(df_train)
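
To make the long-to-wide step concrete, here is a small sketch with invented lab rows. It also shows, under the assumption that `pivot_table` groups with its default settings, why a column containing missing dates should stay out of the index.

```python
import pandas as pd

# Invented long-format observations: one row per (patient, date, lab code).
long_df = pd.DataFrame({
    "PATIENT_ID": ["a", "a", "a", "b"],
    "DATE": pd.to_datetime(["2020-03-01"] * 3 + ["2020-03-02"]),
    "DEATHDATE": pd.to_datetime([None, None, None, "2020-03-10"]),
    "CODE": ["731-0", "731-0", "2276-4", "731-0"],
    "VALUE": [1.0, 3.0, 100.0, 2.0],
})

# One row per (patient, date), one column per lab code; duplicates are averaged.
wide = long_df.pivot_table(
    index=["PATIENT_ID", "DATE"], columns="CODE", values="VALUE", aggfunc="mean"
).reset_index()

# With DEATHDATE in the index, rows whose DEATHDATE is missing drop out.
with_death = long_df.pivot_table(
    index=["PATIENT_ID", "DATE", "DEATHDATE"], columns="CODE", values="VALUE", aggfunc="mean"
).reset_index()

print(wide)
print(with_death)
```

In the second table only patient `b` remains, because patient `a` has no `DEATHDATE`.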

In [None]:
# print(len(df_train['PATIENT_ID'].unique()))
# df_train.to_csv('a.csv')

# df_train['outcome'].describe()
# df_train
# print(df_train['outcome'].describe())
# print('!!!------------------------------------------------')
# print(df_train[(df_train['outcome'] == 0)])
# print('aaa------------------------------------------------')
# print(df_train[(df_train['outcome'] == 0)])

In [None]:
cols = df_train.columns.tolist()
selected_labtest_features = [f for f in cols if f in labtest_features]
print(type(cols))
print(type(labtest_features))
print(len(selected_labtest_features))
labtest_features = selected_labtest_features

print(labtest_features)

In [None]:
# merge lab tests of the same (patient_id, date)
df_train = df_train.groupby(['PATIENT_ID', 'DATE'], dropna=True, as_index = False).mean()
df_train['outcome'].describe()

In [None]:
# save features' statistics information
def calculate_statistic_info(df, features):
    statistic_info = {}
    len_df = len(df)
    for e in features:
        h = {}
        h['count'] = int(df[e].count())
        h['missing'] = float((100-df[e].count()*100/len_df))
        # print(h['missing'],'% missing')
        h['mean'] = float(df[e].mean())
        h['max'] = float(df[e].max())
        h['min'] = float(df[e].min())
        h['median'] = float(df[e].median())
        h['std'] = float(df[e].std())
        statistic_info[e] = h
    return statistic_info

labtest_statistic_info = calculate_statistic_info(df_train, labtest_features)

groupby_patientid_df = df_train.groupby(['PATIENT_ID'], dropna=True, as_index = False).mean()
# print(groupby_patientid_df)
demographic_statistic_info = calculate_statistic_info(groupby_patientid_df, demographic_features)

statistic_info = labtest_statistic_info | demographic_statistic_info

df_train['outcome'].describe()

In [None]:
# filter features
selected_labtest_features = []
for f in labtest_statistic_info:
    if labtest_statistic_info[f]['missing'] < 70:
        selected_labtest_features.append(f)
print(len(selected_labtest_features))
labtest_features = selected_labtest_features
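
The missing-rate filter can also be written in vectorized form. The sketch below uses two invented lab columns and the same 70% threshold; it illustrates the idea and is not a drop-in replacement for `calculate_statistic_info`.

```python
import numpy as np
import pandas as pd

# Invented lab columns with different amounts of missing data.
df = pd.DataFrame({
    "lab_a": [1.0, 2.0, np.nan, 4.0],
    "lab_b": [np.nan, np.nan, np.nan, 5.0],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Keep the columns that are less than 70% missing.
kept = missing_pct[missing_pct < 70].index.tolist()

print(missing_pct.to_dict())
print(kept)
```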


In [None]:
demographic_statistic_info

In [None]:
df_train = df_train[(df_train['LOS'] >= 0)]
# TODO: limit the upper bound of LOS

df_train['outcome'].describe()

In [None]:
# normalize data
def normalize_data(df, features, statistic_info):
    df_features = df[features]
    df_features = df_features.apply(lambda x: (x - statistic_info[x.name]['mean']) / (statistic_info[x.name]['std']+1e-12))
    # print(df_features)
    df = pd.concat([df[['PATIENT_ID', 'DATE', 'outcome', 'LOS']], df_features], axis=1)
    return df
df_train = normalize_data(df_train, demographic_features + labtest_features, statistic_info)
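
The normalization is a z-score with a small constant added to the denominator to avoid division by zero. A minimal sketch on an invented `AGE` column:

```python
import pandas as pd

# Invented ages, for illustration only.
df = pd.DataFrame({"AGE": [30.0, 50.0, 70.0]})

# Same layout as the statistic_info dictionary: one dict of stats per feature.
stats = {"AGE": {"mean": float(df["AGE"].mean()), "std": float(df["AGE"].std())}}

# z = (x - mean) / (std + epsilon)
z = (df["AGE"] - stats["AGE"]["mean"]) / (stats["AGE"]["std"] + 1e-12)

print(z.tolist())
```

pandas computes `std()` as the sample standard deviation (`ddof=1`), so the result differs slightly from a population z-score on small tables.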

In [None]:
def calculate_data_existing_length(data):
    res = 0
    for i in data:
        if not pd.isna(i):
            res += 1
    return res
# Assumes the elements of data are already sorted by time
def our_fill(data, mean=0):
    data_len = len(data)
    data_exist_len = calculate_data_existing_length(data)
    if data_len == data_exist_len:
        return data
    elif data_exist_len == 0:
        for i in range(data_len):
            data[i] = mean
        return data
    if pd.isna(data[0]):
        # Only reached when at least one value is present
        # This part guarantees that data[0] is not missing
        not_na_pos = 0
        for i in range(data_len):
            if not pd.isna(data[i]):
                not_na_pos = i
                break
        for i in range(not_na_pos):
            data[i] = data[not_na_pos]
    for i in range(1, data_len):
        if pd.isna(data[i]):
            data[i] = data[i-1]
    return data
# print(df_train)
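
The filling strategy above (forward-fill, back-fill the leading gap from the first observed value, fall back to a default when nothing was observed) can be expressed with built-in pandas methods. The helper name `fill_series` is hypothetical, and this is a sketch of an equivalent approach rather than the function used in this notebook.

```python
import numpy as np
import pandas as pd

def fill_series(values, mean=0.0):
    """Fill gaps in a time-ordered sequence.

    Forward-fill first, then back-fill any leading gap, then fall back
    to `mean` if the whole sequence was missing.
    """
    s = pd.Series(values, dtype="float64")
    return s.ffill().bfill().fillna(mean).to_numpy()

partly_missing = fill_series([np.nan, 2.0, np.nan, 5.0, np.nan])
all_missing = fill_series([np.nan, np.nan])

print(partly_missing)
print(all_missing)
```

Because the features are normalized before filling, the default `mean=0` corresponds to the feature's average value.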

In [None]:
# fill missing data using our strategy and convert to time series records
grouped = df_train.groupby('PATIENT_ID')

all_x_demographic = []
all_x_labtest = []
all_y = []

for name, group in grouped:
    sorted_group = group.sort_values(by=['DATE'], ascending=True)
    # print(df_train)
    patient_demographic = []
    patient_labtest = []
    patient_y = []
    for f in labtest_features+demographic_features:
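        # Assign the result back: in-place edits of `.values` are fragile because
        # pandas may hand out a copy (or a read-only view) of the column.
        sorted_group[f] = our_fill(sorted_group[f].to_numpy(copy=True))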
    for _, v in sorted_group.iterrows():
        if config.predict_target == 'outcome':
            patient_y.append(v[config.predict_target])
        elif config.predict_target == 'LOS':
            if v['outcome'] == 1:
                patient_y.append(70-v['LOS'])
            else:
                patient_y.append(v['LOS'])
        demo = []
        lab = []
        for f in demographic_features:
            demo.append(v[f])
        for f in labtest_features:
            lab.append(v[f])
        patient_labtest.append(lab)
        patient_demographic.append(demo)
    all_x_demographic.append(patient_demographic[-1])
    all_x_labtest.append(patient_labtest)
    if config.predict_target == 'outcome':
        all_y.append(patient_y[-1])
    elif config.predict_target == 'LOS':
        all_y.append(patient_y)
        

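# all_x_demographic: one list of static (demographic) features per patient
# all_x_labtest: per patient, one list of lab features for each visit
# all_y: one outcome label per patient, or one length-of-stay value per visit for each patient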

In [None]:
# save the dataset in pickle format (the ./processed_data/ directory must already exist)
pd.to_pickle(all_x_demographic, './processed_data/train_x_demographic.pkl')
pd.to_pickle(all_x_labtest, './processed_data/train_x_labtest.pkl')
pd.to_pickle(all_y, f'./processed_data/train_y_{config.predict_target}.pkl')
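
A quick round trip confirms that nested Python lists survive pickling unchanged. The sketch below writes to a temporary directory instead of `./processed_data/`, and the records are invented.

```python
import os
import tempfile

import pandas as pd

# Invented records: two patients with two and one visit respectively.
records = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "processed_data")
    os.makedirs(out_dir, exist_ok=True)   # create the output directory if needed
    path = os.path.join(out_dir, "train_x_labtest.pkl")

    pd.to_pickle(records, path)           # write the nested list
    restored = pd.read_pickle(path)       # read it back

print(restored == records)
```

Calling `os.makedirs(..., exist_ok=True)` before saving avoids a failure when the output directory has not been created yet.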

In [None]:
df_y = pd.DataFrame({'y':all_y})
df_y.describe()