# 1. Description

This notebook performs the subgrouping algorithm shown in Figure S1 of the paper "Prediction of the ICU mortality based on the missing events."

# 2. Before running...

Before proceeding, please set up your Python environment. This program requires the following libraries.

In [1]:
import pandas as pd # 1.2.1
import itertools

Then place the input files in the appropriate directories so that the program can find them. To check that the input files are set up correctly, run the cells below. If they are, the cells will finish without errors.

In [2]:
input_ids = set(pd.read_csv("ids/005_non_sepsis_aps_23170.csv", header=None).iloc[:,0].tolist())
len(input_ids) # 23170

23170

In [3]:
eICU_file = {}
eICU_file['apacheApsVar'] = pd.read_csv('data/apacheApsVar.csv')
eICU_file['apachePatientResult'] = pd.read_csv('data/apachePatientResult.csv')
eICU_file['apachePredVar'] = pd.read_csv('data/apachePredVar.csv')
eICU_file['patient'] = pd.read_csv('data/patient.csv')

To prepare the file "005_non_sepsis_aps_23170.csv", please follow the notebook 1_the_sepsis_group_and_non_sepsis_group.ipynb.<br>
<br>
To obtain the eICU files, please see "https://www.usa.philips.com/healthcare/solutions/enterprise-telehealth/eri" or "https://eicu-crd.mit.edu/gettingstarted/access/".
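
A quick pre-flight check can confirm that every expected file is in place before the larger CSVs are loaded. The sketch below is only an illustration: the helper name `missing_inputs` is hypothetical, and the relative paths are simply those read by cells In [2] and In [3].

```python
from pathlib import Path

# Relative paths read by cells In [2] and In [3] of this notebook.
EXPECTED_FILES = [
    "ids/005_non_sepsis_aps_23170.csv",
    "data/apacheApsVar.csv",
    "data/apachePatientResult.csv",
    "data/apachePredVar.csv",
    "data/patient.csv",
]


def missing_inputs(base_dir=".", expected=EXPECTED_FILES):
    """Return the expected paths that do not exist under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [p for p in expected if not (base / p).is_file()]


# Report anything that is missing relative to the current working directory.
missing = missing_inputs()
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```

An empty list means all five files are present under the working directory; anything else names the paths that still need to be put in place.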

# 3. Class definition

In [4]:
class status():
    def __init__(self, df):
        # df
        self.df = df
        self.df_thistime = pd.DataFrame([], columns=df.columns)
        self.df_next = df
        # ID
        self.ids_all = set(df.patientunitstayid)
        self.ids_thistime = set([])
        self.ids_next = set(self.ids_all)  # copy, so that remove() does not also shrink ids_all
        # parameter
        self.target = []
        
        
    def remove(self, target):
        self.target += target
        # Update ID
        tmp = set(self.df_next.drop(target, axis=1).where(self.df_next>=0).dropna().patientunitstayid)
        self.ids_thistime |= tmp
        self.ids_next -= tmp
        
        # df thistime
        self.df_thistime = self.df.drop(self.target, axis=1)
        self.df_thistime = self.df_thistime.query("patientunitstayid in @ self.ids_thistime")
        
        # df_next
        self.df_next = self.df_next.drop(target, axis=1)
        self.df_next = self.df_next.query("patientunitstayid in @ self.ids_next")
        
        
    def get_next(self, depth):
        # The first and last columns are not parameters (ID and outcome label)
        parameters = self.df_next.columns[1:-1]
        
        # depth : the number of parameters to be excluded at once
        combinations = [list(i) for i in itertools.combinations(parameters, depth)]
        
        # # of non-NaN-records 
        num_non_nan = pd.DataFrame({
            "__".join(comb) : [len(self.df_next.drop(comb, axis=1).where(self.df_next>=0).dropna())]
            for comb in combinations
        }).T
        
        # "1 <= # of non-NaN-records < # of pids" is ideal.
        tf = num_non_nan.applymap(lambda x : 1 <= x <= len(self.df_next) - 1)        
        if tf.any().any():
            tmp = num_non_nan[tf]
            return tmp.idxmax()[0]
        
        else:
            # "# of non-NaN-records == # of pids" is acceptable.
            tf = num_non_nan.applymap(lambda x : 1 <= x <= len(self.df_next))
            if tf.any().any():
                tmp = num_non_nan[tf]
                return tmp.idxmax()[0]
            
            else:
                # If only one parameter remains, return "nan"
                if len(parameters) == 1:
                    return "nan"

                # If there's only NaN records, return ""
                else:
                    return ""
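
To make the selection rule in `get_next` concrete, the sketch below applies the same completeness test (`where(df >= 0).dropna()`) to a toy table with `depth=1`. It rests on a few assumptions: the column names and values are invented, the helper `count_complete` is hypothetical, and negative values are treated as missing only because the class above filters on `>= 0`.

```python
import numpy as np
import pandas as pd

# Toy data (invented): an ID column, three parameters, and an outcome label.
toy = pd.DataFrame({
    "patientunitstayid": [1, 2, 3, 4],
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [-1.0, -1.0, 2.0, -1.0],
    "c": [5.0, 6.0, 7.0, np.nan],
    "label": [0, 1, 0, 1],
})


def count_complete(df, dropped):
    """Count rows with no NaN and no negative value once `dropped` columns are removed."""
    kept = df.drop(columns=dropped)
    return len(kept.where(kept >= 0).dropna())


# Same convention as get_next: the first and last columns are not parameters.
parameters = toy.columns[1:-1]

# For each single parameter, count the complete rows left after excluding it.
counts = pd.Series({p: count_complete(toy, [p]) for p in parameters})

# The parameter whose exclusion rescues the most rows is chosen.
best = counts.idxmax()
print(counts.to_dict())
print(best)
```

Here, excluding `b` leaves two complete rows, while excluding `a` or `c` leaves one each, so `b` is selected. `get_next` additionally prefers counts between 1 and `len(df) - 1`, which the count of 2 out of 4 rows satisfies.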
                

# 4. Prepare DataFrame

## 4.1. Definition of variables used in this study

In [5]:
eICU_parm = {}

eICU_parm['apacheApsVar'] = [
    'patientunitstayid',
    'intubated',
    'vent',
    'dialysis',
    'eyes',
    'motor',
    'verbal',
    'meds',
    'urine',
    'wbc',
    'temperature',
    'respiratoryrate',
    'sodium',
    'heartrate',
    'meanbp',
    'ph',
    'hematocrit',
    'creatinine',
    'albumin',
    'pao2',
    'pco2',
    'bun',
    'glucose',
    'bilirubin',
    'fio2'
]

eICU_parm['apachePatientResult'] = [
    'patientunitstayid',
    'apachescore',
    'predictedicumortality',
    'predictediculos',
    'predictedhospitalmortality',
    'predictedhospitallos',
    'preopmi',
    'preopcardiaccath',
    'ptcawithin24h',
    'predventdays'
]

eICU_parm['apachePredVar'] = [
    'patientunitstayid',
    'gender',
    'teachtype',
    'bedcount',
    'graftcount',
    'age',
    'thrombolytics',
    'aids',
    'hepaticfailure',
    'lymphoma',
    'metastaticcancer',
    'leukemia',
    'immunosuppression',
    'cirrhosis',
    'ima',
    'midur',
    'ventday1',
    'oobventday1',
    'oobintubday1',
    'diabetes'
]

eICU_parm['patient'] = [
    'patientunitstayid',
    'hospitalid',
    'admissionheight',
    'hospitaladmitoffset',
    'admissionweight'
]

## 4.2. DataFrame

In [6]:
#========================================
#  select columns and ids for each file
#========================================
eICU_df = {}
eICU_df['apacheApsVar'] = eICU_file['apacheApsVar'][eICU_parm['apacheApsVar']].query("patientunitstayid in @ input_ids") 
eICU_df['apachePatientResult'] = eICU_file['apachePatientResult'][eICU_parm['apachePatientResult']].query("patientunitstayid in @ input_ids") 
eICU_df['apachePredVar'] = eICU_file['apachePredVar'][eICU_parm['apachePredVar']].query("patientunitstayid in @ input_ids") 
eICU_df['patient'] = eICU_file['patient'][eICU_parm['patient']].query("patientunitstayid in @ input_ids") 


#========================================
#  make column names unique
#========================================
#  (column name -> filename + column name)
eICU_df['apacheApsVar'].columns = [
    'apacheApsVar_' + parm if not parm=="patientunitstayid" else "patientunitstayid"
    for parm in eICU_df['apacheApsVar'].columns
]
eICU_df['apachePatientResult'].columns = [
    'apachePatientResult_' + parm if not parm=="patientunitstayid" else "patientunitstayid"
    for parm in eICU_df['apachePatientResult'].columns
]
eICU_df['apachePredVar'].columns = [
    'apachePredVar_' + parm if not parm=="patientunitstayid" else "patientunitstayid"
    for parm in eICU_df['apachePredVar'].columns
]
eICU_df['patient'].columns = [
    'patient_' + parm if not parm=="patientunitstayid" else "patientunitstayid"
    for parm in eICU_df['patient'].columns
]


#========================================
#  Make X
#========================================
# 1st column : key (patientunitstayid)
key = pd.DataFrame(list(input_ids), columns=["patientunitstayid"])

# 2nd~ column : parameters
key_X = pd.merge(key, eICU_df['apacheApsVar'], on="patientunitstayid")
key_X = pd.merge(key_X, eICU_df['apachePatientResult'], on="patientunitstayid")
key_X = pd.merge(key_X, eICU_df['apachePredVar'], on="patientunitstayid")
key_X = pd.merge(key_X, eICU_df['patient'], on="patientunitstayid")


#========================================
#  Make X_y (df)
#========================================
# Last column : DEAD(=1) or ALIVE(=0)
y = eICU_file["apachePatientResult"][['patientunitstayid', 'actualicumortality']].replace('ALIVE',0).replace('EXPIRED',1)
key_X_y = pd.merge(key_X, y, on="patientunitstayid")


#========================================
# Rename
#========================================
df = key_X_y
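
The renaming and merging steps above follow one pattern per table: prefix every column except the key, then join on `patientunitstayid`. The sketch below shows that pattern on two invented toy tables; the helper name `prefix_columns` is hypothetical and not part of the notebook. Because `pd.merge` defaults to an inner join, an ID that is missing from any one table drops out of the result.

```python
import pandas as pd

KEY = "patientunitstayid"


def prefix_columns(df, table_name, key=KEY):
    """Return a copy with every column except `key` renamed to '<table_name>_<column>'."""
    return df.rename(columns={c: f"{table_name}_{c}" for c in df.columns if c != key})


# Toy tables (invented values); stay 3 is absent from the second table.
aps = pd.DataFrame({KEY: [1, 2, 3], "urine": [100.0, -1.0, 250.0]})
patient = pd.DataFrame({KEY: [1, 2], "admissionweight": [70.0, 82.5]})

merged = pd.merge(
    prefix_columns(aps, "apacheApsVar"),
    prefix_columns(patient, "patient"),
    on=KEY,  # how="inner" is the pandas default
)
print(merged.columns.tolist())  # ['patientunitstayid', 'apacheApsVar_urine', 'patient_admissionweight']
print(merged[KEY].tolist())     # [1, 2]
```

Comparing `len(merged)` with the number of input IDs is a simple way to see how many stays were lost to the inner joins.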

# 5. Subgrouping

In [7]:
df_status = status(df)
k=1

print("# of INPUT : ", len(df_status.ids_all), " patientunitstayids", "\n\n")

while 1:
    print("##################################")
    print("                          Subgroup ",k) 
    print("##################################")
    print("\n") 
    print("Checking the inputs...")
    print("\n") 

    parms_tobe_excluded = []
    
    #========================================
    #  Get Subgroup
    #========================================
    while not(500 <= len(df_status.ids_thistime) <= 10000):

        parms = ""

        # Up to 3 parameters are excluded at once
        for i in range(1,4):
            parms = df_status.get_next(i)

            # if parameters are found
            if parms != "":
                break

                
        # If no parameters are found, print "Time Out" and stop.
        if parms == "":
            print("Time Out\n")
            df_status.df_next = pd.DataFrame()
            break
            
        # If too many NaNs remain, print "Interpolation needed" and stop.
        if parms == "nan":
            print("Interpolation needed\n")
            df_status.df_next = pd.DataFrame()
            break

            
        # Change format
        parms = parms.split("__")        

        # Update
        df_status.remove(parms)
        parms_tobe_excluded += parms
        print("--> ", [i.split("_")[1] for i in parms], " is/are selected to be excluded.\n")
        
  
    print("--> ", ", ".join([i.split("_")[1] for i in parms_tobe_excluded]), " was/were excluded in the end.\n")
    parms_tobe_excluded = []
    df_A = pd.DataFrame()
    
    if 500 <= len(df_status.ids_next):
        print("--> ", len(df_status.ids_thistime), " patientunitstayids survived.\n")
        df_A = df_status.df_thistime
        df_status = status(df_status.df_next)
        
    else:
        # The remaining pids are merged into this subgroup.
        print("--> ", len(df_status.ids_thistime)+len(df_status.ids_next), " patientunitstayids survived.\n")
        df_A = pd.concat([df_status.df_thistime, df_status.df_next])
        df_status.ids_next = set([])
        df_status.df_next = df_status.df_next.query("patientunitstayid in @ df_status.ids_next")
        
    df_A = df_A.where(df_A>=0).dropna()
    
    k+=1
    
    if len(df_status.df_next) ==  0:
        break

# of INPUT :  23170  patientunitstayids 


##################################
                          Subgroup  1
##################################


Checking the inputs...


-->  ['hospitaladmitoffset']  is/are selected to be excluded.

-->  hospitaladmitoffset  was/were excluded in the end.

-->  3703  patientunitstayids survived.

##################################
                          Subgroup  2
##################################


Checking the inputs...


-->  ['urine']  is/are selected to be excluded.

-->  urine  was/were excluded in the end.

-->  4112  patientunitstayids survived.

##################################
                          Subgroup  3
##################################


Checking the inputs...


-->  ['predventdays']  is/are selected to be excluded.

-->  predventdays  was/were excluded in the end.

-->  1414  patientunitstayids survived.

##################################
                          Subgroup  4
##################################


C