# MIMIC-IV: Prepare EMR data and link to CXR studies
1. Load the dataset with mortality data and other metadata and calculate the time between the start of ventilation and the event of death (if applicable). *Exclude patients with `time_to_death` < 24 hours.* Column `alive96h` indicates if a patient was still alive 96 hours after intubation. Column `over72h` indicates if a patient was still intubated at 72 hours.
2. Load CXR metadata. Combine date and time of CXR studies and convert to `datetime`.
3. Link each ICU stay to a radiology study. Create a list of the corresponding DICOM images.
4. Load the dataset with clinical features and add columns `over72h` and `alive96h` (together with `duration` and `log_duration`) from the metadata.

Create 3 files:
* `cxr-image-list.csv` contains the list of images
* `mimic-metadata.csv` contains metadata, including CXR study IDs
* `mimic-ft98.csv` contains clinical features for the whole cohort

In [1]:
import numpy as np
import pandas as pd
import datetime as dt

# To show all columns in a dataframe
pd.options.display.max_info_columns=250
pd.options.display.max_columns=500

# To make pretty plots
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-ticks')
sns.set_style('ticks')
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['xtick.labelsize'] = 16
plt.rcParams['ytick.labelsize'] = 16

%matplotlib inline

**Load the dataset with mortality data and other metadata**

In [2]:
df_meta = pd.read_csv("../data/mimic/ft63_invasive_procedureevents_based_cohort_mortality_v2.csv")
print(df_meta.shape)
df_meta.head()

(12719, 12)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,subject_id,hadm_id,hosp_intime,hosp_outtime,icu_intime,icu_outtime,deathtime
0,30000670,2182-04-14 07:45:00,2182-04-15 10:00:00,1575,0,13134463,28333727,2182-04-10 22:25:00,2182-04-19 15:56:00,2182-04-10 22:42:19,2182-04-19 00:37:36,
1,30000974,2119-06-21 19:30:00,2119-07-07 13:10:00,22660,1,19407684,29905273,2119-06-21 19:09:00,2119-07-08 18:45:00,2119-06-21 23:57:00,2119-07-08 19:32:46,
2,30001939,2151-04-06 16:55:00,2151-04-15 15:40:00,12885,1,19023641,25083387,2151-03-18 12:42:00,2151-04-15 18:10:00,2151-04-06 13:22:49,2151-04-15 19:25:07,2151-04-15 18:10:00
3,30002055,2171-09-26 14:28:00,2171-09-29 08:55:00,3987,0,10887901,28942534,2171-09-26 12:10:00,2171-10-29 14:45:00,2171-09-26 13:42:00,2171-10-09 09:50:58,
4,30003299,2169-08-22 01:51:00,2169-08-28 12:02:00,9251,1,12093201,23308326,2169-08-22 00:46:00,2169-09-13 15:15:00,2169-08-22 00:48:13,2169-08-29 13:54:47,


**Convert timestamps to `datetime`**

In [3]:
date_cols = ["starttime", "endtime", "hosp_intime", "hosp_outtime", "icu_intime", "icu_outtime", "deathtime"]
df_meta[date_cols] = df_meta[date_cols].apply(pd.to_datetime)

**Calculate the time to death from the start of ventilation, in hours**

In [4]:
# Time from the start of ventilation to death, in hours (NaN if no death was recorded)
df_meta["time_to_death"] = (df_meta.deathtime - df_meta.starttime) / pd.Timedelta(hours=1)

# Alive at 96 h: no recorded death, or death later than 96 h after intubation
df_meta["alive96h"] = (df_meta.time_to_death.isna() | (df_meta.time_to_death > 96)).astype(int)
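
As a quick sanity check of the `alive96h` logic, here is a minimal sketch on invented toy data (the three rows and their timestamps are made up for illustration and are not taken from MIMIC-IV). It assumes, as in the cell above, that a missing `deathtime` means the patient counts as alive at 96 hours.

```python
import pandas as pd

# Toy data (hypothetical): death after 18 h, death after 9 days, no recorded death
toy = pd.DataFrame({
    "starttime": pd.to_datetime(["2150-01-01 08:00", "2150-01-01 08:00", "2150-01-01 08:00"]),
    "deathtime": pd.to_datetime(["2150-01-02 02:00", "2150-01-10 08:00", None]),
})

# Hours from the start of ventilation to death; NaN when no death is recorded
toy["time_to_death"] = (toy.deathtime - toy.starttime) / pd.Timedelta(hours=1)

# Alive at 96 h if there is no recorded death or death occurred later than 96 h
toy["alive96h"] = (toy.time_to_death.isna() | (toy.time_to_death > 96)).astype(int)

print(toy[["time_to_death", "alive96h"]])
```

The first toy patient would also be removed by the 24-hour exclusion applied below.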

**Convert MV `duration` to hours**

In [5]:
df_meta.duration = df_meta.duration / 60

**Add log duration**

In [6]:
df_meta["log_duration"] = np.log(df_meta.duration)

**Exclude patients with `time_to_death` < 24 hours**

In [7]:
stays_to_excl = df_meta[df_meta.time_to_death < 24].stay_id
len(stays_to_excl)

67

In [8]:
df_meta.drop(df_meta[df_meta.stay_id.isin(stays_to_excl)].index, inplace=True)

## Linking to CXR data

**Load the dataset with CXR metadata**

In [10]:
df_cxr = pd.read_csv("../data/mimic-cxr-2.0.0-metadata.csv")
print(df_cxr.shape)
df_cxr.head()

(377110, 12)


Unnamed: 0,dicom_id,subject_id,study_id,PerformedProcedureStepDescription,ViewPosition,Rows,Columns,StudyDate,StudyTime,ProcedureCodeSequence_CodeMeaning,ViewCodeSequence_CodeMeaning,PatientOrientationCodeSequence_CodeMeaning
0,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,10000032,50414267,CHEST (PA AND LAT),PA,3056,2544,21800506,213014.531,CHEST (PA AND LAT),postero-anterior,Erect
1,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,10000032,50414267,CHEST (PA AND LAT),LATERAL,3056,2544,21800506,213014.531,CHEST (PA AND LAT),lateral,Erect
2,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,10000032,53189527,CHEST (PA AND LAT),PA,3056,2544,21800626,165500.312,CHEST (PA AND LAT),postero-anterior,Erect
3,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,10000032,53189527,CHEST (PA AND LAT),LATERAL,3056,2544,21800626,165500.312,CHEST (PA AND LAT),lateral,Erect
4,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,10000032,53911762,CHEST (PORTABLE AP),AP,2705,2539,21800723,80556.875,CHEST (PORTABLE AP),antero-posterior,


**Retain only scans with frontal view**

In [11]:
df_cxr.drop(df_cxr[(df_cxr.ViewCodeSequence_CodeMeaning!="antero-posterior") & 
                   (df_cxr.ViewCodeSequence_CodeMeaning!="postero-anterior")
                  ].index, inplace=True)
df_cxr.reset_index(drop=True, inplace=True)

**Combine study date and time and convert to `datetime`**

In [12]:
df_cxr.StudyDate = pd.to_datetime(df_cxr.StudyDate, format="%Y%m%d")

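# StudyTime is a float HHMMSS.fff: truncate the fractional seconds and zero-pad to 6 digits
# (rounding could turn e.g. 235959.7 into an invalid "235960")
df_cxr.StudyTime = df_cxr.StudyTime.apply(lambda x: str(int(x)).zfill(6))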
df_cxr.StudyTime = pd.to_datetime(df_cxr.StudyTime, format="%H%M%S")

df_cxr["StudyDateTime"] = df_cxr.apply(lambda x: 
                                       pd.Timestamp.combine(x.StudyDate.date(),
                                                            x.StudyTime.time()), 
                                       axis=1)
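
The row-wise `apply` above can be slow on a table of this size. A possible vectorized alternative is sketched below on two toy rows whose values are copied from the `head()` output. It assumes `StudyDate` is an integer `YYYYMMDD` and `StudyTime` a float `HHMMSS.fff`, and it truncates fractional seconds; `toy_cxr`, `date_str` and `time_str` are illustrative names.

```python
import pandas as pd

# Two toy rows in the same format as the CXR metadata
toy_cxr = pd.DataFrame({
    "StudyDate": [21800506, 21800723],
    "StudyTime": [213014.531, 80556.875],
})

# Build "YYYYMMDD" and zero-padded "HHMMSS" strings, then parse them in one call
date_str = toy_cxr.StudyDate.astype(str)
time_str = toy_cxr.StudyTime.astype(int).astype(str).str.zfill(6)
toy_cxr["StudyDateTime"] = pd.to_datetime(date_str + time_str, format="%Y%m%d%H%M%S")

print(toy_cxr.StudyDateTime)
```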

**Unique studies and times**

In [13]:
cxr_studies = df_cxr[["subject_id", "study_id", "StudyDateTime"]].copy()
cxr_studies.drop_duplicates(subset=["subject_id", "study_id"], inplace=True)

**Link each ICU stay to the latest CXR study taken within 24 hours after the start of ventilation**

In [14]:
def get_study_id(tmp):
    """Return the latest study_id within 24 h after the start of ventilation, or None."""
    tmp_cxr = cxr_studies[cxr_studies.subject_id == tmp.subject_id]
    # Studies strictly after intubation and strictly before intubation + 24 h
    in_window = ((tmp_cxr.StudyDateTime > tmp.starttime) &
                 (tmp_cxr.StudyDateTime < tmp.starttime + pd.Timedelta(hours=24)))
    if in_window.any():
        return tmp_cxr[in_window].sort_values(by="StudyDateTime", ascending=False).iloc[0].study_id
    return None

In [15]:
df_meta["study_id"] = df_meta.apply(get_study_id, axis=1)
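
The same linking rule can also be written without a per-row function. The sketch below uses invented toy tables (`stays`, `studies`) rather than the real data, and is one possible formulation, not the notebook's method: join on `subject_id`, keep studies inside the 24-hour window, and keep the latest one per stay.

```python
import pandas as pd

# Hypothetical toy data: two stays and four studies
stays = pd.DataFrame({
    "stay_id": [1, 2],
    "subject_id": [100, 200],
    "starttime": pd.to_datetime(["2150-01-01 08:00", "2150-03-01 12:00"]),
})
studies = pd.DataFrame({
    "subject_id": [100, 100, 100, 200],
    "study_id": [11, 12, 13, 21],
    "StudyDateTime": pd.to_datetime(["2150-01-01 09:00", "2150-01-01 20:00",
                                     "2150-01-03 09:00", "2150-02-27 12:00"]),
})

# All (stay, study) pairs for the same patient
pairs = stays.merge(studies, on="subject_id", how="left")

# Keep studies strictly inside (starttime, starttime + 24 h)
in_window = ((pairs.StudyDateTime > pairs.starttime) &
             (pairs.StudyDateTime < pairs.starttime + pd.Timedelta(hours=24)))

# Latest qualifying study per stay
latest = (pairs[in_window]
          .sort_values("StudyDateTime")
          .drop_duplicates(subset="stay_id", keep="last")[["stay_id", "study_id"]])

# Stays without a qualifying study get NaN
linked = stays.merge(latest, on="stay_id", how="left")
print(linked)
```

Here stay 1 is linked to study 12 (the later of the two studies in its window), while stay 2 has no study in its window and is left unlinked.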

**Create a list of relevant images**

In [16]:
cxr_images = df_cxr[df_cxr.study_id.isin(df_meta.study_id.dropna())][["dicom_id", "subject_id", "study_id"]]
cxr_images.reset_index(drop=True, inplace=True)

cxr_images.to_csv("../data/cxr-image-list.csv", index=False)

## Clinical features only

**Load the dataset with clinical features**

In [17]:
df = pd.read_csv("../data/mimic/ft98_mimic_new.csv")
print(df.shape)
df.head()

(12719, 103)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,admission_location,insurance,language,ethnicity,marital_status,gender,age,hours_in_hosp_before_intubation,weight,height,co2_total_max,co2_total_avg,co2_total_min,ph_max,ph_avg,ph_min,lactate_max,lactate_avg,lactate_min,pao2fio2ratio,heart_rate_max,heart_rate_avg,heart_rate_min,mbp_max,mbp_avg,mbp_min,mbp_ni_max,mbp_ni_avg,mbp_ni_min,resp_rate_max,resp_rate_avg,resp_rate_min,temp_max,temp_avg,temp_min,spo2_max,spo2_avg,spo2_min,glucose_max,glucose_avg,glucose_min,vasopressin,epinephrine,dobutamine,norepinephrine,phenylephrine,dopamine,count_of_vaso,fio2_max,fio2_avg,fio2_min,peep_max,peep_avg,peep_min,plateau_pressure_max,plateau_pressure_avg,plateau_pressure_min,rrt,sinus_rhythm,neuroblocker,congestive_heart_failure,cerebrovascular_disease,dementia,chronic_pulmonary_disease,rheumatic_disease,mild_liver_disease,diabetes_without_cc,diabetes_with_cc,paraplegia,renal_disease,malignant_cancer,severe_liver_disease,metastatic_solid_tumor,aids,SOFA,respiration,coagulation,liver,cardiovascular,cns,renal,apsiii,hr_score,mbp_score,temp_score,resp_rate_score,pao2_aado2_score,hematocrit_score,wbc_score,creatinine_score,uo_score,bun_score,sodium_score,albumin_score,bilirubin_score,glucose_score,acidbase_score,gcs_score
0,30000670,2182-04-14 07:45:00,2182-04-15 10:00:00,26.25,0,EMERGENCY ROOM,Medicare,ENGLISH,BLACK/AFRICAN AMERICAN,DIVORCED,M,69,81,51.7,173.0,38.0,37.0,36.0,7.39,7.39,7.39,2.2,2.2,2.2,305.0,83.0,67.52,56.0,95.0,79.88,68.0,95.0,79.88,68.0,20.5,15.442308,13.0,36.61,35.7875,35.0,100.0,99.923077,98.0,179.0,165.5,149.0,0,0,0,0,0,0,0,50.0,41.111111,40.0,6.4,4.64,0.0,18.0,14.2,11.0,0,1.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,4,2.0,0.0,,0.0,1,1.0,47,1.0,7.0,2.0,8.0,0.0,3.0,0.0,0.0,7.0,7.0,0.0,,,0.0,12.0,0.0
1,30000974,2119-06-21 19:30:00,2119-07-07 13:10:00,377.666667,1,EMERGENCY ROOM,Medicare,ENGLISH,WHITE,SINGLE,F,92,0,55.0,157.0,25.0,25.0,25.0,7.44,7.44,7.44,2.4,2.4,2.4,252.0,91.0,82.5,69.0,78.0,66.478261,52.0,53.0,52.5,52.0,34.0,29.108696,20.0,38.22,37.406667,37.0,100.0,98.318182,97.0,159.0,130.5,102.0,0,0,0,1,0,0,1,50.0,50.0,50.0,5.0,5.0,5.0,17.0,15.5,14.0,0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0.0,0.0,0.0,3.0,3,0.0,76,0.0,15.0,0.0,6.0,0.0,3.0,5.0,0.0,5.0,11.0,0.0,6.0,0.0,0.0,12.0,13.0
2,30001939,2151-04-06 16:55:00,2151-04-15 15:40:00,214.75,1,EMERGENCY ROOM,Medicaid,ENGLISH,WHITE,SINGLE,M,47,460,42.0,175.0,20.0,19.142857,18.0,7.28,7.224286,7.18,1.3,1.25,1.2,92.5,108.0,87.821429,62.0,98.0,76.431034,58.0,98.0,78.15,57.0,34.0,21.55,14.0,36.67,36.325,35.89,100.0,96.964286,92.0,158.0,112.4,91.0,0,0,0,1,0,0,1,100.0,88.333333,80.0,7.0,4.454545,0.0,31.0,31.0,31.0,0,1.0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,13,4.0,1.0,0.0,4.0,4,0.0,132,1.0,15.0,2.0,9.0,14.0,3.0,5.0,3.0,5.0,2.0,2.0,11.0,0.0,0.0,12.0,48.0
3,30002055,2171-09-26 14:28:00,2171-09-29 08:55:00,66.45,0,WALK-IN/SELF REFERRAL,Medicare,ENGLISH,BLACK/AFRICAN AMERICAN,MARRIED,M,69,2,58.8,178.0,,,,,,,,,,,137.0,116.5,90.0,150.0,83.25,44.0,65.0,53.25,44.0,26.0,19.58,10.0,35.0,33.95,33.2,100.0,99.190476,84.0,271.0,229.111111,167.0,0,1,0,1,0,0,2,30.0,30.0,30.0,19.0,11.933333,8.0,30.0,23.833333,18.0,0,0.0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,5,,,,4.0,1,0.0,62,7.0,10.0,20.0,0.0,15.0,,,,5.0,,,,,3.0,2.0,0.0
4,30003299,2169-08-22 01:51:00,2169-08-28 12:02:00,154.183333,1,EMERGENCY ROOM,Other,ENGLISH,WHITE,SINGLE,M,26,1,120.0,178.0,29.0,24.888889,21.0,7.4,7.335556,7.27,4.0,2.777778,1.5,280.0,133.0,119.5,101.0,122.0,93.071429,70.0,,,,18.0,17.105263,12.0,37.44,36.971667,36.39,100.0,98.555556,96.0,185.0,152.166667,130.0,0,0,0,0,0,0,0,50.0,48.333333,40.0,5.0,5.0,5.0,25.0,23.6,22.0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0.0,0.0,,0.0,3,0.0,48,7.0,7.0,0.0,0.0,0.0,3.0,0.0,0.0,4.0,0.0,0.0,,,0.0,12.0,15.0


**Convert timestamps to `datetime`**

In [18]:
df[["starttime", "endtime"]] = df[["starttime", "endtime"]].apply(pd.to_datetime)

**Exclude patients with `time_to_death` < 24 hours**

In [19]:
df.drop(df[df.stay_id.isin(stays_to_excl)].index, inplace=True)

**Check that MV `duration` and `over72h` match the metadata, then drop them**

In [20]:
assert (df_meta.duration.round(3) == df.duration.round(3)).all()
assert (df_meta.over72h == df.over72h).all()

In [21]:
df.drop(columns=["duration", "over72h"], inplace=True)

**Make sure that `stay_id`, `starttime` and `endtime` columns are consistent**

In [22]:
assert df.stay_id.nunique() == df_meta.stay_id.nunique()
assert set(df.stay_id) == set(df_meta.stay_id)

In [23]:
assert set(df.starttime) == set(df_meta.starttime)
assert set(df.endtime) == set(df_meta.endtime)

**Merge datasets**

In [24]:
df = df.merge(df_meta[["stay_id", "starttime", "endtime", "duration", "log_duration", "over72h", "alive96h"]], 
              on=["stay_id", "starttime", "endtime"])
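
One optional safeguard for a merge like this is the `validate` argument of `DataFrame.merge`, which raises an error if the join keys are not unique on either side. A minimal sketch on invented toy frames (`features` and `meta` are illustrative names and values, not the notebook's data):

```python
import pandas as pd

# Hypothetical toy tables sharing a unique key
features = pd.DataFrame({"stay_id": [1, 2, 3], "age": [69, 92, 47]})
meta = pd.DataFrame({"stay_id": [1, 2, 3], "over72h": [0, 1, 1], "alive96h": [1, 1, 0]})

# validate="one_to_one" raises MergeError if stay_id is duplicated in either table
merged = features.merge(meta, on="stay_id", how="inner", validate="one_to_one")

# An inner join on consistent keys should keep every row
assert len(merged) == len(features)
print(merged)
```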

## Outlier values

**Change very high `plateau_pressure_max` values to NaN**

In [28]:
df.loc[df.plateau_pressure_max > 1000, "plateau_pressure_max"] = np.nan

**Change placeholder `glucose_max` values (999999) to NaN**

In [29]:
df.loc[df.glucose_max == 999999.0, "glucose_max"] = np.nan

**Change negative `hours_in_hosp_before_intubation` to 0**

In [30]:
df.loc[df.hours_in_hosp_before_intubation < 0, "hours_in_hosp_before_intubation"] = 0
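
The three clean-up rules above can be checked on a small toy frame. This sketch uses invented values and the same thresholds as the cells above. It relies on `Series.mask`, `Series.replace` and `Series.clip` instead of `.loc` assignment, which is a stylistic alternative rather than a change in logic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the second row carries all three kinds of bad values
toy = pd.DataFrame({
    "plateau_pressure_max": [18.0, 1500.0, 31.0],
    "glucose_max": [179.0, 999999.0, 158.0],
    "hours_in_hosp_before_intubation": [81, -3, 460],
})

# Implausibly high plateau pressure becomes NaN
toy["plateau_pressure_max"] = toy.plateau_pressure_max.mask(toy.plateau_pressure_max > 1000)

# The 999999 placeholder becomes NaN
toy["glucose_max"] = toy.glucose_max.replace(999999.0, np.nan)

# Negative waiting times are set to 0
toy["hours_in_hosp_before_intubation"] = toy.hours_in_hosp_before_intubation.clip(lower=0)

print(toy)
```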

**Save datasets**

In [31]:
df_meta.to_csv("../data/mimic-metadata.csv", index=False)

In [32]:
df.to_csv("../data/mimic-ft98.csv", index=False)

# eICU: Format the dataset

In [33]:
def format_df(df):
    """Rename eICU columns to the names used for MIMIC-IV above and round to 3 decimals."""
    df = df.rename(columns={"patientunitstayid": "stay_id",
                            "vent_start": "starttime",
                            "vent_end": "endtime",
                            "hr_min": "heart_rate_min",
                            "hr_max": "heart_rate_max",
                            "resp_min": "resp_rate_min",
                            "resp_max": "resp_rate_max",
                            "mbp_arterial_max": "mbp_max",
                            "cadiovascular": "cardiovascular",  # key spelled as in the input file
                            "vent_duration": "duration"})
    return df.round(3)

In [34]:
df1 = pd.read_csv("../data/eicu/ft17_eicu_new.csv")
print(df1.shape)
df1 = format_df(df1)
df1.head()

(21185, 22)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,ph_max,spo2_min,heart_rate_min,heart_rate_max,resp_rate_min,resp_rate_max,temp_min,temp_max,glucose_max,glucose_min,co2_total_max,co2_total_min,mbp_max,mbp_ni_min,apsiii,peep_max,peep_min
0,2127890,1853,4506,44.217,0,,89.0,107.0,196.0,16.0,49.0,35.9,37.4,187.0,80.0,,,93.0,52.0,96.0,8.0,5.0
1,2519150,95,4175,68.0,0,,88.0,96.0,116.0,13.0,28.0,37.4,38.4,194.0,106.0,,,90.0,60.0,39.0,10.0,5.0
2,919705,3012,5367,39.25,0,7.51,85.0,58.0,73.0,15.0,20.0,35.8,36.4,288.0,219.0,,,129.0,61.0,35.0,15.0,5.0
3,1554681,44,1724,28.0,0,7.4,91.0,87.0,113.0,0.0,23.0,36.8,37.4,,,,,88.0,56.0,71.0,,
4,260998,89,1937,30.8,0,7.25,75.0,109.0,121.0,28.0,35.0,36.3,37.3,278.0,92.0,,,82.5,43.0,140.0,8.0,8.0


In [35]:
df2 = pd.read_csv("../data/eicu/eicu_features_v2.csv", index_col=0)
print(df2.shape)
df2 = format_df(df2)
df2.drop(columns=["peep_max", "peep_avg", "peep_min", "temp_avg"], inplace=True)
df2.head()

(21185, 19)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,apsiii,resp_rate_min,ph_max,temp_max,co2_total_avg,co2_total_min,fio2_min,plateau_pressure_max,height,vasopressor
0,177241,259,1836,26.283,0,,35.0,7.34,36.9,,,60.0,29.0,167.6,0.0
1,188948,2638,6630,66.533,0,65.0,36.0,7.24,37.2,,,60.0,25.0,172.7,0.0
2,224432,3573,7952,72.983,1,75.0,28.0,7.17,40.7,,,40.0,26.0,185.4,0.0
3,257535,134,6545,106.85,1,67.0,28.0,7.3,37.947,,,70.0,,170.2,1.0
4,349218,1332,6699,89.45,1,116.0,26.0,7.461,37.2,,,30.0,23.0,154.9,1.0


In [36]:
df1 = df1.merge(df2)
df1.shape

(21185, 27)

In [37]:
df2 = pd.read_csv("../data/eicu/supp_peep.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 8)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,peep_max,peep_avg,peep_min
0,153487,2114,19442,288.8,1,18.0,7.857,5.0
1,166995,1576,3798,37.033,0,16.0,8.667,5.0
2,175631,1166,11933,179.45,1,30.0,21.625,8.0
3,181480,173,10378,170.083,1,18.0,14.25,12.0
4,190821,865,5398,75.55,1,22.0,19.5,12.0


In [38]:
df1 = df1.merge(df2)
df1.shape

(21185, 28)

In [39]:
df2 = pd.read_csv("../data/eicu/supp_temp_avg.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 6)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,temp_avg
0,1132606,168,1859,28.183,0,36.783
1,2951298,44,2612,42.8,0,36.657
2,471290,137,2640,41.717,0,36.367
3,2126939,231,2418,36.45,0,36.933
4,3233209,474,2879,40.083,0,32.81


In [40]:
df1 = df1.merge(df2)
df1.shape

(21185, 29)

In [41]:
df2 = pd.read_csv("../data/eicu/eicu_aps_subscores.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 17)


Unnamed: 0,stay_id,hr_score,mbp_score,temp_score,resp_rate_score,pao2_aado2_score,hematocrit_score,wbc_score,creatinine_score,uo_score,bun_score,sodium_score,albumin_score,bilirubin_score,glucose_score,acidbase_score,gcs_score
0,3059098,,,,,,,,,,,,,,,,
1,1557074,,,,,,,,,,,,,,,,
2,1573088,,,,,,,,,,,,,,,,
3,2916549,,,,,,,,,,,,,,,,
4,3330465,,,,,,,,,,,,,,,,


In [42]:
df1 = df1.merge(df2)
df1.shape

(21185, 45)

In [43]:
df2 = pd.read_csv("../data/eicu/eicu_sofa_subscores.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 8)


Unnamed: 0,stay_id,SOFA,respiration,coagulation,liver,cardiovascular,cns,renal
0,958647,17,3,2,4,3,1,4
1,652524,18,4,4,2,3,3,2
2,965324,0,0,0,0,0,0,0
3,3205487,17,4,1,1,4,4,3
4,2419145,17,3,2,4,4,0,4


In [44]:
df1 = df1.merge(df2)
df1.shape

(21185, 52)

In [45]:
df2 = pd.read_csv("../data/eicu/eicu_hospitalid.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 2)


Unnamed: 0,stay_id,hospitalid
0,1617649,256
1,1822518,275
2,2292860,338
3,2905612,413
4,3123057,420


In [46]:
df1 = df1.merge(df2)
df1.shape

(21185, 53)

In [47]:
df2 = pd.read_csv("../data/eicu/eicu_hospital_info.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 5)


Unnamed: 0,stay_id,hospitalid,numbedscategory,region,teachingstatus
0,967715,182,100 - 249,South,False
1,3134154,428,100 - 249,South,False
2,2377619,337,,,False
3,977869,184,250 - 499,South,False
4,966290,184,250 - 499,South,False


In [48]:
df2.numbedscategory = df2.numbedscategory.astype('category')
# Shorten the bed-count categories to size labels and put them in size order.
# Results are assigned back because the `inplace` argument was removed in pandas 2.0.
df2.numbedscategory = df2.numbedscategory.cat.rename_categories({"<100": "S", "100 - 249": "M",
                                                                 "250 - 499": "L", ">= 500": "XL"})
df2.numbedscategory = df2.numbedscategory.cat.reorder_categories(["S", "M", "L", "XL"])
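
If the hospital size should later be compared or sorted by magnitude, an ordered categorical may be convenient. The sketch below is an optional variant on invented toy values. Marking the category as `ordered=True` is an assumption that goes beyond what the cell above does.

```python
import pandas as pd

# Hypothetical toy values in the same format as numbedscategory (None = missing)
beds = pd.Series(["100 - 249", "<100", ">= 500", None, "250 - 499"])

size_order = ["<100", "100 - 249", "250 - 499", ">= 500"]
labels = {"<100": "S", "100 - 249": "M", "250 - 499": "L", ">= 500": "XL"}

# Declare the categories in size order, then shorten the labels
beds = (beds.astype(pd.CategoricalDtype(categories=size_order, ordered=True))
            .cat.rename_categories(labels))

print(beds.cat.categories.tolist())
print((beds >= "L").tolist())  # ordered categoricals support comparisons
```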


In [49]:
df1 = df1.merge(df2)
df1.shape

(21185, 56)

In [50]:
df2 = pd.read_csv("../data/eicu/eicu_some_ft.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 15)


Unnamed: 0,stay_id,starttime,endtime,duration,over72h,lactate_max,lactate_min,lactate_avg,resp_rate_max,resp_rate_min,resp_rate_avg,plateau_pressure_max,plateau_pressure_avg,plateau_pressure_min,age
0,147985,539,2506,32.783,0,12.6,4.0,7.933,29.0,0.0,13.809,26.0,18.6,11.0,78
1,153409,11,3132,52.017,0,14.1,10.9,12.233,31.0,19.0,24.965,31.0,28.308,23.0,58
2,157895,384,7366,116.367,1,2.9,2.9,2.9,36.0,8.0,18.808,15.0,14.5,14.0,50
3,158050,304,54314,900.167,1,4.9,2.0,3.45,22.0,17.0,20.25,22.0,20.25,18.0,72
4,171281,460,5375,81.917,1,3.5,3.2,3.35,20.0,10.0,14.889,25.0,22.8,20.0,28


In [51]:
df1 = df1.merge(df2)
df1.shape

(21185, 63)

In [52]:
df2 = pd.read_csv("../data/eicu/eicu_hospitaladmitoffset.csv")
print(df2.shape)
df2 = format_df(df2)
df2.head()

(21185, 2)


Unnamed: 0,stay_id,hospitaladmitoffset
0,2145961,-202
1,2961463,-4
2,3097292,-4
3,1134760,-95
4,1661026,-10


In [53]:
df2.loc[df2.hospitaladmitoffset > 0, "hospitaladmitoffset"] = 0

In [54]:
df1 = df1.merge(df2)
df1.shape

(21185, 64)

In [55]:
df1["hours_in_hosp_before_intubation"] = ((df1.starttime - df1.hospitaladmitoffset) / 60).astype(int)
df1.drop("hospitaladmitoffset", axis=1, inplace=True)
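
To make the arithmetic in the cell above concrete, here is a small sketch on invented values. It assumes that both `starttime` and `hospitaladmitoffset` are minutes relative to ICU admission, with hospital admission before ICU admission stored as a negative offset. Positive offsets are set to 0, as in the earlier cell.

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "starttime": [1853, 95],            # minutes from ICU admission to start of ventilation
    "hospitaladmitoffset": [-202, 30],  # minutes from ICU admission to hospital admission
})

# Positive offsets (hospital admission after ICU admission) are treated as 0
offset = toy.hospitaladmitoffset.clip(upper=0)

# Minutes in hospital before intubation, converted to whole hours (truncated)
toy["hours_in_hosp_before_intubation"] = ((toy.starttime - offset) / 60).astype(int)

print(toy)
```

For the first row, (1853 + 202) / 60 = 34.25, which is truncated to 34 hours.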

**Drop `vasopressor` column**

In [56]:
df1.drop(columns=["vasopressor"], inplace=True)

**Add log duration**

In [57]:
df1["log_duration"] = np.log(df1.duration)
df1.shape

(21185, 64)

## Outlier values

**Change very high `plateau_pressure_max` values to NaN**

In [59]:
df1.loc[df1.plateau_pressure_max > 1000, "plateau_pressure_max"] = np.nan

762      1500.0
16750    9999.0
Name: plateau_pressure_max, dtype: float64

**Change string values in `age`**

In [68]:
df1.loc[df1.age == "> 89", "age"] = 90
df1.age = df1.age.astype(int)
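
A minimal sketch of the same conversion on an invented toy series. It assumes the column was read as strings because of the `> 89` entries, and maps those to 90 as in the cell above.

```python
import pandas as pd

# Hypothetical toy ages as read from CSV (strings, top-coded at "> 89")
age = pd.Series(["78", "> 89", "50"])

# Replace the top-coded label with 90, then convert everything to integers
age = age.replace("> 89", "90").astype(int)

print(age.tolist())
```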

In [69]:
df1.columns

Index(['stay_id', 'starttime', 'endtime', 'duration', 'over72h', 'ph_max',
       'spo2_min', 'heart_rate_min', 'heart_rate_max', 'resp_rate_min',
       'resp_rate_max', 'temp_min', 'temp_max', 'glucose_max', 'glucose_min',
       'co2_total_max', 'co2_total_min', 'mbp_max', 'mbp_ni_min', 'apsiii',
       'peep_max', 'peep_min', 'co2_total_avg', 'fio2_min',
       'plateau_pressure_max', 'height', 'peep_avg', 'temp_avg', 'hr_score',
       'mbp_score', 'temp_score', 'resp_rate_score', 'pao2_aado2_score',
       'hematocrit_score', 'wbc_score', 'creatinine_score', 'uo_score',
       'bun_score', 'sodium_score', 'albumin_score', 'bilirubin_score',
       'glucose_score', 'acidbase_score', 'gcs_score', 'SOFA', 'respiration',
       'coagulation', 'liver', 'cardiovascular', 'cns', 'renal', 'hospitalid',
       'numbedscategory', 'region', 'teachingstatus', 'lactate_max',
       'lactate_min', 'lactate_avg', 'resp_rate_avg', 'plateau_pressure_avg',
       'plateau_pressure_min', 'age', 'ho

**Save the dataset**

In [70]:
df1.to_csv("../data/eicu-ft58.csv", index=False)