# Purpose of the notebook
1. In this notebook, two tables from the MIMIC-III dataset are merged and cleaned.
    * Tables merged - ADMISSIONS, PATIENTS
2. Qualifying hospital admission data are defined and extracted.
3. A 30-day readmission label (0 or 1) is calculated for all qualifying hospital admissions. Note the classification will be carried ou

Note: see data/README.md for important info about access to data.

In [1]:
import pandas as pd
pd.options.mode.chained_assignment = None
# load the admissions table, count the number of rows and display column names
admissions=pd.read_csv('../../data/raw/ADMISSIONS.csv', index_col=['SUBJECT_ID']).sort_index()
print(len(admissions))
print(admissions.columns)

58976
Index(['ROW_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
       'ADMISSION_TYPE', 'ADMISSION_LOCATION', 'DISCHARGE_LOCATION',
       'INSURANCE', 'LANGUAGE', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY',
       'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG',
       'HAS_CHARTEVENTS_DATA'],
      dtype='object')


In [2]:
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58976 entries, 2 to 99999
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   ROW_ID                58976 non-null  int64 
 1   HADM_ID               58976 non-null  int64 
 2   ADMITTIME             58976 non-null  object
 3   DISCHTIME             58976 non-null  object
 4   DEATHTIME             5854 non-null   object
 5   ADMISSION_TYPE        58976 non-null  object
 6   ADMISSION_LOCATION    58976 non-null  object
 7   DISCHARGE_LOCATION    58976 non-null  object
 8   INSURANCE             58976 non-null  object
 9   LANGUAGE              33644 non-null  object
 10  RELIGION              58518 non-null  object
 11  MARITAL_STATUS        48848 non-null  object
 12  ETHNICITY             58976 non-null  object
 13  EDREGTIME             30877 non-null  object
 14  EDOUTTIME             30877 non-null  object
 15  DIAGNOSIS             58951 non-null

### Description of the columns
Detailed description is available at https://mimic.physionet.org/mimictables/admissions/

* ROW_ID, HADM_ID - unique identifiers
* ADMITTIME, DISCHTIME, DEATHTIME, EDREGTIME, EDOUTTIME - time of various events associated with admission
    * ADMITTIME and DISCHTIME for the same patient enable calculation of the 30-day readmission label (output label)
    * ADMITTIME and DISCHTIME for the same patient enable determination of prior admissions (new engineered feature)
    * The other time features will not be used
* ADMISSION_TYPE - elective (planned), urgent/emergency (unplanned) or newborn admission
    * ADMISSION_TYPE affects the 30-day readmission calculation: only unplanned admissions within 30 days count
    * newborn admissions do not qualify and will be filtered out
* ADMISSION_LOCATION, DISCHARGE_LOCATION, INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY - mostly demographic data, explanatory variables
* DIAGNOSIS - non-systematic diagnosis at admission, will not be used for this analysis
* HOSPITAL_EXPIRE_FLAG - Since this feature cannot be generalized to new data, we will not use it in the analysis
* HAS_CHARTEVENTS_DATA - will not be used in this analysis

Next steps:
1. filter out newborn admissions
2. calculate 30-day readmission label, prior admissions, length of stay in days

In [3]:
# Count the number of admissions based on categories
patient_category_count = admissions.groupby('ADMISSION_TYPE')['HADM_ID'].count()
patient_category_count = patient_category_count.sort_values(ascending = False)
print(patient_category_count)

ADMISSION_TYPE
EMERGENCY    42071
NEWBORN       7863
ELECTIVE      7706
URGENT        1336
Name: HADM_ID, dtype: int64
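
The same tally can also be produced with `value_counts`, which counts and sorts in one call. A minimal sketch on a made-up frame (the rows below are illustrative assumptions, not MIMIC data):

```python
import pandas as pd

# Toy stand-in for the admissions table (illustrative values only)
toy = pd.DataFrame({
    "HADM_ID": [1, 2, 3, 4, 5],
    "ADMISSION_TYPE": ["EMERGENCY", "NEWBORN", "EMERGENCY", "ELECTIVE", "EMERGENCY"],
})

# groupby + count + sort, as in the cell above
by_group = toy.groupby("ADMISSION_TYPE")["HADM_ID"].count().sort_values(ascending=False)

# value_counts sorts by descending frequency by default
by_value_counts = toy["ADMISSION_TYPE"].value_counts()

print(by_value_counts)
```

Both approaches skip missing values, so they should agree as long as `HADM_ID` itself has no gaps.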


In [4]:
orig_adm_num = len(admissions)
admissions = admissions[admissions.ADMISSION_TYPE!='NEWBORN']
filt_adm_num = len(admissions)
print("After filtering the number of admissions went down to ", filt_adm_num, "from ", orig_adm_num)

After filtering the number of admissions went down to  51113 from  58976


In [5]:
# 2. Calculate 30-day readmission label, prior admissions, length of stay in days
# convert ADMITTIME and DISCHTIME to datetime objects
admissions['ADMITTIME'] = pd.to_datetime(admissions['ADMITTIME'])
admissions['DISCHTIME'] = pd.to_datetime(admissions['DISCHTIME'])

# sort by subject_ID and admission date
admissions = admissions.sort_values(['SUBJECT_ID','ADMITTIME'])

# Create a column "PREV_DISCHTIME" and get the date of previous discharge
admissions['PREV_DISCHTIME'] = admissions.groupby('SUBJECT_ID').DISCHTIME.shift(1)
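
To make the per-patient shift concrete, here is a small sketch on invented data. It assumes `SUBJECT_ID` is an ordinary column rather than the index, and the dates are made up for illustration:

```python
import pandas as pd

# Two admissions for patient 1, one for patient 2 (invented dates)
toy = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2],
    "ADMITTIME": pd.to_datetime(["2100-01-01", "2100-03-01", "2100-05-01"]),
    "DISCHTIME": pd.to_datetime(["2100-01-05", "2100-03-04", "2100-05-02"]),
}).sort_values(["SUBJECT_ID", "ADMITTIME"])

# shift(1) inside each patient group: every row receives the discharge time
# of that patient's previous admission, or NaT for a first admission
toy["PREV_DISCHTIME"] = toy.groupby("SUBJECT_ID")["DISCHTIME"].shift(1)

print(toy)
```

Because the shift happens within each group, the first admission of patient 2 does not inherit the last discharge of patient 1.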

In [6]:
# combine urgent and emergency admissions
admissions.loc[admissions['ADMISSION_TYPE']=='URGENT', 'ADMISSION_TYPE'] = 'EMERGENCY'

# keep only EMERGENCY (unplanned) admissions as current admissions; ELECTIVE ones are filtered out
admissions = admissions[admissions['ADMISSION_TYPE']=='EMERGENCY']

# Create a column "NEXT_ADMITTIME" and get the date of next admission
admissions['NEXT_ADMITTIME'] = admissions.groupby('SUBJECT_ID').ADMITTIME.shift(-1)

# create a new column with the number of days from the previous discharge
admissions['TIME_FROM_PREV_DICH'] = (admissions['ADMITTIME'].dt.date-admissions['PREV_DISCHTIME'].dt.date).dt.days

# create a new column with the number of days to the next admission
admissions['TIME_TO_NEXT_ADMIT'] = (admissions['NEXT_ADMITTIME'].dt.date-admissions['DISCHTIME'].dt.date).dt.days

# create 2 new columns: 1. OUTPUT_LABEL (0 - not readmitted within 30 days, 1 - readmitted within 30 days), 2. 1Y_PRIOR_ADM (0 - no, 1 - yes)
admissions.loc[admissions['TIME_TO_NEXT_ADMIT'] <= 30, 'OUTPUT_LABEL'] = 1
admissions.loc[admissions['TIME_FROM_PREV_DICH'] < 365, '1Y_PRIOR_ADM'] = 1
admissions['OUTPUT_LABEL'] = admissions['OUTPUT_LABEL'].fillna(0)
admissions['1Y_PRIOR_ADM'] = admissions['1Y_PRIOR_ADM'].fillna(0)

# calculate length of stay in days
admissions['LENGTH_OF_STAY_DAYS'] = (admissions['DISCHTIME'].dt.date-admissions['ADMITTIME'].dt.date).dt.days
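
The calendar-day arithmetic above can also be written with `dt.normalize()`, which keeps the columns as datetimes instead of Python `date` objects. This is a sketch of an alternative, not the notebook's method; the toy rows are invented, `SUBJECT_ID` is assumed to be a regular column, and the label comes out as an integer rather than a float:

```python
import pandas as pd

toy = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1, 2],
    "ADMITTIME": pd.to_datetime([
        "2100-01-01 08:00", "2100-01-20 23:30", "2100-06-01 10:00", "2100-02-01 09:00",
    ]),
    "DISCHTIME": pd.to_datetime([
        "2100-01-05 14:00", "2100-01-25 11:00", "2100-06-03 16:00", "2100-02-02 12:00",
    ]),
}).sort_values(["SUBJECT_ID", "ADMITTIME"])

# Next admission of the same patient (NaT for the last one)
next_admit = toy.groupby("SUBJECT_ID")["ADMITTIME"].shift(-1)

# normalize() sets the time to midnight, so the difference counts calendar days
days_to_next = (next_admit.dt.normalize() - toy["DISCHTIME"].dt.normalize()).dt.days

# NaN <= 30 evaluates to False, so admissions without a follow-up get label 0
toy["OUTPUT_LABEL"] = (days_to_next <= 30).astype(int)
toy["LENGTH_OF_STAY_DAYS"] = (
    toy["DISCHTIME"].dt.normalize() - toy["ADMITTIME"].dt.normalize()
).dt.days

print(toy[["SUBJECT_ID", "OUTPUT_LABEL", "LENGTH_OF_STAY_DAYS"]])
```

Only the first admission of patient 1 is followed by another admission within 30 days, so it is the only row labelled 1.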

In [7]:
# To get more demographic information about patients we need the PATIENTS.csv table
patients_df = pd.read_csv('../../data/raw/PATIENTS.csv', index_col=['SUBJECT_ID'], parse_dates= ['DOB'], encoding='utf-8-sig').sort_index()
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46520 entries, 2 to 99999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   ROW_ID       46520 non-null  int64         
 1   GENDER       46520 non-null  object        
 2   DOB          46520 non-null  datetime64[ns]
 3   DOD          15759 non-null  object        
 4   DOD_HOSP     9974 non-null   object        
 5   DOD_SSN      13378 non-null  object        
 6   EXPIRE_FLAG  46520 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 2.8+ MB


## Columns to use
* Gender - demographic info
* DOB - necessary to calculate the age of patients
* Remaining columns - mortality data, not useful for this analysis
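
Joining these columns onto the admissions can be sketched as an index-on-index merge. The toy frames below are invented, and the `validate` argument is an optional safeguard added here for illustration, not something the notebook uses:

```python
import pandas as pd

# Several admissions may share one SUBJECT_ID ...
toy_adm = pd.DataFrame(
    {"HADM_ID": [100, 101, 102]},
    index=pd.Index([1, 1, 2], name="SUBJECT_ID"),
)
# ... while each patient should appear exactly once
toy_pat = pd.DataFrame(
    {"GENDER": ["F", "M"], "DOB": pd.to_datetime(["2050-01-01", "2060-06-15"])},
    index=pd.Index([1, 2], name="SUBJECT_ID"),
)

merged = toy_adm.merge(
    toy_pat[["GENDER", "DOB"]],
    left_index=True,
    right_index=True,
    how="inner",             # the pandas default
    validate="many_to_one",  # raises MergeError if a patient row is duplicated
)

print(merged)
```

With `validate="many_to_one"`, an accidental duplicate in the patient table fails loudly instead of silently multiplying admission rows.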

In [8]:
admissions = admissions.merge(patients_df[['GENDER', 'DOB']],left_index=True, right_index=True)
# approximate the age as admission year minus birth year (see the MIMIC documentation)
admissions['AGE'] = admissions['ADMITTIME'].dt.year - admissions['DOB'].dt.year
# Note: as part of patient de-identification, dates of birth of patients over 89 were shifted so that their age appears as about 300.
# Reset the age of patients over 89 (coded as 300) to 91 (median age before transformation)
admissions.loc[admissions['AGE'] > 89, 'AGE'] = 91
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43407 entries, 3 to 99992
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   ROW_ID                43407 non-null  int64         
 1   HADM_ID               43407 non-null  int64         
 2   ADMITTIME             43407 non-null  datetime64[ns]
 3   DISCHTIME             43407 non-null  datetime64[ns]
 4   DEATHTIME             5595 non-null   object        
 5   ADMISSION_TYPE        43407 non-null  object        
 6   ADMISSION_LOCATION    43407 non-null  object        
 7   DISCHARGE_LOCATION    43407 non-null  object        
 8   INSURANCE             43407 non-null  object        
 9   LANGUAGE              28349 non-null  object        
 10  RELIGION              42976 non-null  object        
 11  MARITAL_STATUS        40962 non-null  object        
 12  ETHNICITY             43407 non-null  object        
 13  EDREGTIME       
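
The age derivation can be illustrated on invented dates. The sketch below computes the year difference used above and, for comparison, a birthday-aware age. The `AGE_EXACT` column and the toy rows are assumptions added for illustration only:

```python
import pandas as pd

toy = pd.DataFrame({
    "ADMITTIME": pd.to_datetime(["2150-03-01", "2150-03-01", "2150-03-01"]),
    "DOB": pd.to_datetime(["2100-07-01", "1850-03-01", "2062-01-01"]),
})

# Year difference: simple, but can overstate the age by one year
toy["AGE"] = toy["ADMITTIME"].dt.year - toy["DOB"].dt.year

# Birthday-aware variant: subtract one year if the birthday has not yet
# occurred in the admission year
had_birthday = (toy["ADMITTIME"].dt.month > toy["DOB"].dt.month) | (
    (toy["ADMITTIME"].dt.month == toy["DOB"].dt.month)
    & (toy["ADMITTIME"].dt.day >= toy["DOB"].dt.day)
)
toy["AGE_EXACT"] = toy["AGE"] - (~had_birthday).astype(int)

# De-identified patients show up with an age of about 300; cap them at 91
toy.loc[toy["AGE"] > 89, "AGE"] = 91
toy.loc[toy["AGE_EXACT"] > 89, "AGE_EXACT"] = 91

print(toy[["AGE", "AGE_EXACT"]])
```

The first toy patient was born in July and admitted in March, so the year difference gives 50 while the birthday-aware age is 49.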

### Remaining steps
3. eliminate unused columns
4. fill in missing values with the mode of the column
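
For any column that is kept and still has gaps, mode imputation could look like the following sketch. The toy values are invented, and choosing the first mode when there are ties is an assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "MARITAL_STATUS": ["MARRIED", None, "SINGLE", "MARRIED"],
    "LANGUAGE": ["ENGL", "ENGL", None, "SPAN"],
})

for col in ["MARITAL_STATUS", "LANGUAGE"]:
    # mode() ignores missing values; iloc[0] picks the first mode if tied
    most_common = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```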

In [9]:
admissions.columns

Index(['ROW_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
       'ADMISSION_TYPE', 'ADMISSION_LOCATION', 'DISCHARGE_LOCATION',
       'INSURANCE', 'LANGUAGE', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY',
       'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG',
       'HAS_CHARTEVENTS_DATA', 'PREV_DISCHTIME', 'NEXT_ADMITTIME',
       'TIME_FROM_PREV_DICH', 'TIME_TO_NEXT_ADMIT', 'OUTPUT_LABEL',
       '1Y_PRIOR_ADM', 'LENGTH_OF_STAY_DAYS', 'GENDER', 'DOB', 'AGE'],
      dtype='object')

In [10]:
# 3. eliminate unused columns
cols_to_delete = ['ROW_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
       'ADMISSION_TYPE', 'INSURANCE', 'LANGUAGE', 'RELIGION', 'MARITAL_STATUS',
       'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG',
       'HAS_CHARTEVENTS_DATA', 'PREV_DISCHTIME', 'NEXT_ADMITTIME',
       'TIME_FROM_PREV_DICH', 'TIME_TO_NEXT_ADMIT', 'DOB']
admissions.drop(columns=cols_to_delete, inplace=True)

In [11]:
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43407 entries, 3 to 99992
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   HADM_ID              43407 non-null  int64  
 1   ADMISSION_LOCATION   43407 non-null  object 
 2   DISCHARGE_LOCATION   43407 non-null  object 
 3   ETHNICITY            43407 non-null  object 
 4   OUTPUT_LABEL         43407 non-null  float64
 5   1Y_PRIOR_ADM         43407 non-null  float64
 6   LENGTH_OF_STAY_DAYS  43407 non-null  int64  
 7   GENDER               43407 non-null  object 
 8   AGE                  43407 non-null  int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 3.3+ MB


In [12]:
# The resulting dataset has no missing values. Check whether some columns can be processed further,
# e.g. to reduce cardinality
print("Unique categories and counts for ADMISSION_LOCATION")
print(admissions.ADMISSION_LOCATION.value_counts())

Unique categories and counts for ADMISSION_LOCATION
EMERGENCY ROOM ADMIT         22754
CLINIC REFERRAL/PREMATURE    10020
TRANSFER FROM HOSP/EXTRAM     8414
PHYS REFERRAL/NORMAL DELI     1880
TRANSFER FROM SKILLED NUR      260
TRANSFER FROM OTHER HEALT       68
** INFO NOT AVAILABLE **         5
TRSF WITHIN THIS FACILITY        5
HMO REFERRAL/SICK                1
Name: ADMISSION_LOCATION, dtype: int64


In [13]:
# Merge categories with fewer than 0.5% of samples into the largest category
rare_adm_locations = ['** INFO NOT AVAILABLE **', 'HMO REFERRAL/SICK',
                      'TRANSFER FROM OTHER HEALT', 'TRSF WITHIN THIS FACILITY']
admissions.loc[admissions['ADMISSION_LOCATION'].isin(rare_adm_locations), 'ADMISSION_LOCATION'] = 'EMERGENCY ROOM ADMIT'
print("Unique categories and counts for ADMISSION_LOCATION, after transformation")
print(admissions.ADMISSION_LOCATION.value_counts())

Unique categories and counts for ADMISSION_LOCATION, after transformation
EMERGENCY ROOM ADMIT         22833
CLINIC REFERRAL/PREMATURE    10020
TRANSFER FROM HOSP/EXTRAM     8414
PHYS REFERRAL/NORMAL DELI     1880
TRANSFER FROM SKILLED NUR      260
Name: ADMISSION_LOCATION, dtype: int64
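
The 0.5% rule can also be applied without listing category names by hand. The helper below is a hypothetical sketch (`merge_rare_categories` is not part of the notebook), shown on invented data:

```python
import pandas as pd


def merge_rare_categories(series: pd.Series, threshold: float = 0.005) -> pd.Series:
    """Relabel categories whose share is below `threshold` as the most frequent one."""
    shares = series.value_counts(normalize=True)
    largest = shares.index[0]  # value_counts sorts by descending share
    rare = shares[shares < threshold].index
    # keep values that are not rare, replace the rest with the largest category
    return series.where(~series.isin(rare), largest)


# Category "C" covers 0.3% of the rows and is folded into "A"
toy = pd.Series(["A"] * 700 + ["B"] * 297 + ["C"] * 3)
merged = merge_rare_categories(toy)

print(merged.value_counts())
```

Whether the largest category is the right target for rare values is a modelling choice; a separate "OTHER" bucket would be an equally simple alternative.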


In [14]:
print("Unique categories and counts for DISCHARGE_LOCATION")
print(admissions.DISCHARGE_LOCATION.value_counts())

Unique categories and counts for DISCHARGE_LOCATION
HOME                         11462
HOME HEALTH CARE              9232
SNF                           6631
REHAB/DISTINCT PART HOSP      5712
DEAD/EXPIRED                  5595
LONG TERM CARE HOSPITAL       2148
DISC-TRAN CANCER/CHLDRN H      601
SHORT TERM HOSPITAL            508
DISCH-TRAN TO PSYCH HOSP       446
HOSPICE-HOME                   392
LEFT AGAINST MEDICAL ADVI      353
HOSPICE-MEDICAL FACILITY       150
OTHER FACILITY                  60
HOME WITH HOME IV PROVIDR       59
ICF                             47
DISC-TRAN TO FEDERAL HC         10
SNF-MEDICAID ONLY CERTIF         1
Name: DISCHARGE_LOCATION, dtype: int64


In [15]:
# Most discharge categories differ from the expected fraction, so only categories with < 0.5% of samples
# are merged (into HOME, the largest category)
rare_disch_locations = ['DISC-TRAN TO FEDERAL HC', 'HOME WITH HOME IV PROVIDR', 'HOSPICE-MEDICAL FACILITY',
                        'ICF', 'OTHER FACILITY', 'SNF-MEDICAID ONLY CERTIF']
admissions.loc[admissions['DISCHARGE_LOCATION'].isin(rare_disch_locations), 'DISCHARGE_LOCATION'] = 'HOME'
print("Unique categories and counts for DISCHARGE_LOCATION after transformation")
print(admissions.DISCHARGE_LOCATION.value_counts())

Unique categories and counts for DISCHARGE_LOCATION after transformation
HOME                         11789
HOME HEALTH CARE              9232
SNF                           6631
REHAB/DISTINCT PART HOSP      5712
DEAD/EXPIRED                  5595
LONG TERM CARE HOSPITAL       2148
DISC-TRAN CANCER/CHLDRN H      601
SHORT TERM HOSPITAL            508
DISCH-TRAN TO PSYCH HOSP       446
HOSPICE-HOME                   392
LEFT AGAINST MEDICAL ADVI      353
Name: DISCHARGE_LOCATION, dtype: int64


In [16]:
print("Unique categories and counts for ETHNICITY")
print(admissions.ETHNICITY.value_counts())

Unique categories and counts for ETHNICITY
WHITE                                                       30300
BLACK/AFRICAN AMERICAN                                       4230
UNKNOWN/NOT SPECIFIED                                        3554
HISPANIC OR LATINO                                           1171
OTHER                                                         965
UNABLE TO OBTAIN                                              702
ASIAN                                                         696
PATIENT DECLINED TO ANSWER                                    296
HISPANIC/LATINO - PUERTO RICAN                                204
ASIAN - CHINESE                                               189
BLACK/CAPE VERDEAN                                            148
WHITE - RUSSIAN                                               145
BLACK/HAITIAN                                                  93
MULTI RACE ETHNICITY                                           89
HISPANIC/LATINO - DOMINICAN      

In [17]:
# Most of the categories belong to 5 major groups (Asian, Black, Hispanic, White and other); merge them accordingly
# The raw string avoids an invalid-escape warning for \s
admissions['ETHNICITY'] = admissions['ETHNICITY'].str.split(r"/|\s").str.get(0)
ethnicity_list = ['ASIAN','BLACK','HISPANIC','WHITE']
admissions.loc[~admissions['ETHNICITY'].isin(ethnicity_list),'ETHNICITY'] = 'OTHER'
print("Unique categories and counts for ETHNICITY after transformation")
print(admissions.ETHNICITY.value_counts())

Unique categories and counts for ETHNICITY after transformation
WHITE       30555
OTHER        5743
BLACK        4508
HISPANIC     1543
ASIAN        1058
Name: ETHNICITY, dtype: int64
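
To see how individual labels map to the five groups, the same split-and-match logic can be run on a handful of raw values. This is a sketch; the sample labels are taken from the counts above, and the variable names are illustrative:

```python
import pandas as pd

raw = pd.Series([
    "WHITE - RUSSIAN",
    "BLACK/AFRICAN AMERICAN",
    "HISPANIC/LATINO - PUERTO RICAN",
    "ASIAN - CHINESE",
    "UNKNOWN/NOT SPECIFIED",
    "MULTI RACE ETHNICITY",
])

# Keep the text before the first "/" or whitespace character
first_token = raw.str.split(r"/|\s").str.get(0)

# Anything outside the four named groups becomes OTHER
major_groups = ["ASIAN", "BLACK", "HISPANIC", "WHITE"]
grouped = first_token.where(first_token.isin(major_groups), "OTHER")

print(pd.DataFrame({"raw": raw, "grouped": grouped}))
```

Note that labels such as UNKNOWN/NOT SPECIFIED and MULTI RACE ETHNICITY all end up in OTHER, which is why that group is the second largest after the transformation.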


In [18]:
# Lastly, encode all categorical variables with dummy columns
dummy_cols = ['ADMISSION_LOCATION', 'DISCHARGE_LOCATION', 'ETHNICITY', 'GENDER']
admissions = pd.get_dummies(admissions, prefix=dummy_cols, prefix_sep='_', columns=dummy_cols, sparse=False, drop_first=True)

In [19]:
admissions.columns

Index(['HADM_ID', 'OUTPUT_LABEL', '1Y_PRIOR_ADM', 'LENGTH_OF_STAY_DAYS', 'AGE',
       'ADMISSION_LOCATION_EMERGENCY ROOM ADMIT',
       'ADMISSION_LOCATION_PHYS REFERRAL/NORMAL DELI',
       'ADMISSION_LOCATION_TRANSFER FROM HOSP/EXTRAM',
       'ADMISSION_LOCATION_TRANSFER FROM SKILLED NUR',
       'DISCHARGE_LOCATION_DISC-TRAN CANCER/CHLDRN H',
       'DISCHARGE_LOCATION_DISCH-TRAN TO PSYCH HOSP',
       'DISCHARGE_LOCATION_HOME', 'DISCHARGE_LOCATION_HOME HEALTH CARE',
       'DISCHARGE_LOCATION_HOSPICE-HOME',
       'DISCHARGE_LOCATION_LEFT AGAINST MEDICAL ADVI',
       'DISCHARGE_LOCATION_LONG TERM CARE HOSPITAL',
       'DISCHARGE_LOCATION_REHAB/DISTINCT PART HOSP',
       'DISCHARGE_LOCATION_SHORT TERM HOSPITAL', 'DISCHARGE_LOCATION_SNF',
       'ETHNICITY_BLACK', 'ETHNICITY_HISPANIC', 'ETHNICITY_OTHER',
       'ETHNICITY_WHITE', 'GENDER_M'],
      dtype='object')
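
With `drop_first=True`, the alphabetically first category of each variable becomes the implicit reference level and gets no column, which is why there is no `GENDER_F` or `ETHNICITY_ASIAN` above. A small sketch on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "GENDER": ["F", "M", "M"],
    "ETHNICITY": ["WHITE", "ASIAN", "OTHER"],
})

# One column per category, minus the first (sorted) category of each variable
encoded = pd.get_dummies(toy, columns=["GENDER", "ETHNICITY"], drop_first=True)

print(list(encoded.columns))
```

A row with all dummy columns of a variable equal to zero therefore belongs to the dropped reference category.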

In [20]:
# SUBJECT_ID (the index) is not needed past this point; all later merges happen on HADM_ID.
# Note: to_csv as called below still writes the SUBJECT_ID index as the first column of the file.
# Save the intermediate table with patient demographics
admissions.to_csv('../../data/intermediate/inter022920a.csv')
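
Whether the index ends up in the saved file depends on the `index` argument of `to_csv`. The sketch below writes to in-memory buffers instead of disk; the toy frame is invented:

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"HADM_ID": [100, 101], "AGE": [50, 91]},
    index=pd.Index([1, 2], name="SUBJECT_ID"),
)

# Default behaviour: the index is written as the first column
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False leaves SUBJECT_ID out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```

Any downstream notebook that reads the file needs to agree with whichever choice is made here.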