# Purpose of the notebook
1. Two tables from the MIMIC-III dataset are merged and cleaned.
    * Tables merged - ADMISSIONS, PATIENTS
2. Qualifying hospital admission data are defined and extracted.
3. A 30-day readmission label (0 or 1) is calculated for all qualifying hospital admissions. Note the classification will be carried ou

Note: see data/README.md for important info about access to data.

In [53]:
import pandas as pd
pd.options.mode.chained_assignment = None
# load the admissions table, count the number of rows and display column names
admissions=pd.read_csv('../../data/raw/ADMISSIONS.csv', index_col=['SUBJECT_ID']).sort_index()
print(len(admissions))
print(admissions.columns)

58976
Index(['ROW_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
       'ADMISSION_TYPE', 'ADMISSION_LOCATION', 'DISCHARGE_LOCATION',
       'INSURANCE', 'LANGUAGE', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY',
       'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG',
       'HAS_CHARTEVENTS_DATA'],
      dtype='object')


In [54]:
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58976 entries, 2 to 99999
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   ROW_ID                58976 non-null  int64 
 1   HADM_ID               58976 non-null  int64 
 2   ADMITTIME             58976 non-null  object
 3   DISCHTIME             58976 non-null  object
 4   DEATHTIME             5854 non-null   object
 5   ADMISSION_TYPE        58976 non-null  object
 6   ADMISSION_LOCATION    58976 non-null  object
 7   DISCHARGE_LOCATION    58976 non-null  object
 8   INSURANCE             58976 non-null  object
 9   LANGUAGE              33644 non-null  object
 10  RELIGION              58518 non-null  object
 11  MARITAL_STATUS        48848 non-null  object
 12  ETHNICITY             58976 non-null  object
 13  EDREGTIME             30877 non-null  object
 14  EDOUTTIME             30877 non-null  object
 15  DIAGNOSIS             58951 non-null

### Description of the columns
A detailed description is available at https://mimic.physionet.org/mimictables/admissions/

* ROW_ID, HADM_ID - unique identifiers
* ADMITTIME, DISCHTIME, DEATHTIME, EDREGTIME, EDOUTTIME - time of various events associated with admission
    * ADMITTIME and DISCHTIME for the same patient enable calculation of the 30-day readmission label (output label)
    * ADMITTIME and DISCHTIME for the same patient enable determination of prior admissions (new engineered feature)
    * The other time features will not be used
* ADMISSION_TYPE - elective (planned) or urgent/emergency unplanned admission type
    * ADMISSION_TYPE affects 30-day readmission calculation, only unplanned admissions within 30 days count
    * newborn admissions do not qualify and will be filtered out
* ADMISSION_LOCATION, DISCHARGE_LOCATION, INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS - mostly demographic data, explanatory variables
* DIAGNOSIS - non-systematic diagnosis at admission, will not be used for this analysis
* HOSPITAL_EXPIRE_FLAG - admissions resulting in patient death will be filtered out as non-qualifying
* HAS_CHARTEVENTS_DATA - will not be used in this analysis
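
The way ADMITTIME and DISCHTIME yield a readmission label can be sketched on a toy table. This is only an illustration: the rows are made up (not taken from MIMIC-III), and `DAYS_TO_NEXT` is a hypothetical helper column name. The per-patient shift and the 30-day threshold follow the description above.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from MIMIC-III)
toy = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1, 2],
    "ADMITTIME": pd.to_datetime(["2100-01-01", "2100-01-20", "2100-06-01", "2100-03-01"]),
    "DISCHTIME": pd.to_datetime(["2100-01-05", "2100-01-25", "2100-06-03", "2100-03-04"]),
}).sort_values(["SUBJECT_ID", "ADMITTIME"])

# Admission time of the same patient's next stay (NaT for the last stay)
toy["NEXT_ADMITTIME"] = toy.groupby("SUBJECT_ID")["ADMITTIME"].shift(-1)

# Days between this discharge and the next admission (NaN if there is none)
toy["DAYS_TO_NEXT"] = (toy["NEXT_ADMITTIME"] - toy["DISCHTIME"]).dt.days

# Comparisons with NaN are False, so stays without a next admission get label 0
toy["OUTPUT_LABEL"] = (toy["DAYS_TO_NEXT"] <= 30).astype(int)
print(toy[["SUBJECT_ID", "DAYS_TO_NEXT", "OUTPUT_LABEL"]])
```

Only the first stay of patient 1 is followed by another admission within 30 days, so it is the only row labelled 1.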

Next steps:
1. filter out admissions resulting in patient death and newborn admissions
2. calculate 30-day readmission label, prior admissions, length of stay in days

In [55]:
# 1. filter out admissions resulting in patient death and newborn admissions
# Count the number of admissions based on hospital expire flag
patient_flag_count = admissions.groupby('HOSPITAL_EXPIRE_FLAG')['HADM_ID'].count()
patient_flag_count = patient_flag_count.sort_values(ascending = False)
print(patient_flag_count)
# Count the number of admissions based on categories
patient_category_count = admissions.groupby('ADMISSION_TYPE')['HADM_ID'].count()
patient_category_count = patient_category_count.sort_values(ascending = False)
print(patient_category_count)

HOSPITAL_EXPIRE_FLAG
0    53122
1     5854
Name: HADM_ID, dtype: int64
ADMISSION_TYPE
EMERGENCY    42071
NEWBORN       7863
ELECTIVE      7706
URGENT        1336
Name: HADM_ID, dtype: int64


In [56]:
orig_adm_num = len(admissions)
admissions = admissions[admissions.HOSPITAL_EXPIRE_FLAG==0]
admissions = admissions[admissions.ADMISSION_TYPE!='NEWBORN']
filt_adm_num = len(admissions)
print("After filtering the number of admissions went down to ", filt_adm_num, "from ", orig_adm_num)

After filtering the number of admissions went down to  45321 from  58976
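
The two filters can also be written as named boolean masks, which makes it easy to see how many rows each criterion removes. A minimal sketch on made-up rows; the mask names `survived` and `not_newborn` are hypothetical and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from MIMIC-III)
toy = pd.DataFrame({
    "HADM_ID": [100, 101, 102, 103],
    "HOSPITAL_EXPIRE_FLAG": [0, 1, 0, 0],
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY", "NEWBORN", "ELECTIVE"],
})

survived = toy["HOSPITAL_EXPIRE_FLAG"] == 0      # patient did not die in hospital
not_newborn = toy["ADMISSION_TYPE"] != "NEWBORN"  # exclude newborn admissions

# Keep only rows that satisfy both criteria
qualifying = toy[survived & not_newborn]
print(f"removed by death filter: {(~survived).sum()}")
print(f"removed by newborn filter: {(~not_newborn).sum()}")
print(f"{len(toy)} -> {len(qualifying)} admissions")
```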


In [57]:
# 2. calculate 30-day readmission label, prior admissions, length of stay in days
# convert ADMITTIME and DISCHTIME to datetime objects
admissions['ADMITTIME'] = pd.to_datetime(admissions['ADMITTIME'])
admissions['DISCHTIME'] = pd.to_datetime(admissions['DISCHTIME'])

# sort by subject_ID and admission date
admissions = admissions.sort_values(['SUBJECT_ID','ADMITTIME'])

# Create a column "PREV_DISCHTIME" and get the date of previous discharge
admissions['PREV_DISCHTIME'] = admissions.groupby('SUBJECT_ID').DISCHTIME.shift(1)

In [58]:
# combine urgent and emergency admissions
admissions.loc[admissions['ADMISSION_TYPE']=='URGENT', 'ADMISSION_TYPE'] = 'EMERGENCY'

# filter out ELECTIVE admissions from current admission
admissions = admissions[admissions['ADMISSION_TYPE']=='EMERGENCY']

# Create a column "NEXT_ADMITTIME" and get the date of next admission
admissions['NEXT_ADMITTIME'] = admissions.groupby('SUBJECT_ID').ADMITTIME.shift(-1)

# create a new column with the number of days since the previous discharge
admissions['TIME_FROM_PREV_DICH'] = (admissions['ADMITTIME'].dt.date-admissions['PREV_DISCHTIME'].dt.date).dt.days

# create a new column with the number of days to the next admission
admissions['TIME_TO_NEXT_ADMIT'] = (admissions['NEXT_ADMITTIME'].dt.date-admissions['DISCHTIME'].dt.date).dt.days

# create 2 new columns: 1. - OUTPUT_LABEL (0-not readmitted within 30 days, 1- readmitted within 30 days), 2. - 1Y_PRIOR_ADM (0-No, 1-Yes)
admissions.loc[admissions['TIME_TO_NEXT_ADMIT'] <= 30, 'OUTPUT_LABEL'] = 1
admissions.loc[admissions['TIME_FROM_PREV_DICH'] < 365, '1Y_PRIOR_ADM'] = 1
admissions['OUTPUT_LABEL'] = admissions['OUTPUT_LABEL'].fillna(0)
admissions['1Y_PRIOR_ADM'] = admissions['1Y_PRIOR_ADM'].fillna(0)

# calculate length of stay in days
admissions['LENGTH_OF_STAY_DAYS'] = (admissions['DISCHTIME'].dt.date-admissions['ADMITTIME'].dt.date).dt.days
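
The day counts above are taken between calendar dates rather than full timestamps, which gives different numbers from counting elapsed 24-hour periods. The sketch below illustrates the difference on two made-up stays. It uses `dt.normalize()` (which sets the time of day to midnight) as one possible way to drop the time component; the variable names are hypothetical.

```python
import pandas as pd

# Toy timestamps with hypothetical values (not taken from MIMIC-III)
admit = pd.Series(pd.to_datetime(["2100-01-01 23:00", "2100-01-01 08:00"]))
disch = pd.Series(pd.to_datetime(["2100-01-02 01:00", "2100-01-03 07:00"]))

# Whole 24-hour periods elapsed between the two timestamps
elapsed_days = (disch - admit).dt.days

# Difference between calendar dates, ignoring the time of day
calendar_days = (disch.dt.normalize() - admit.dt.normalize()).dt.days

print(pd.DataFrame({"elapsed": elapsed_days, "calendar": calendar_days}))
```

A two-hour stay that crosses midnight counts as 0 elapsed days but 1 calendar day, so the choice affects both the length of stay and the 30-day window.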

In [59]:
# To get more demographic information about patients we need the PATIENTS.csv table
patients_df = pd.read_csv('../../data/raw/PATIENTS.csv', index_col=['SUBJECT_ID'], parse_dates= ['DOB'], encoding='utf-8-sig').sort_index()
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46520 entries, 2 to 99999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   ROW_ID       46520 non-null  int64         
 1   GENDER       46520 non-null  object        
 2   DOB          46520 non-null  datetime64[ns]
 3   DOD          15759 non-null  object        
 4   DOD_HOSP     9974 non-null   object        
 5   DOD_SSN      13378 non-null  object        
 6   EXPIRE_FLAG  46520 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 2.8+ MB


## Columns to use
* Gender - demographic info
* DOB - necessary to calculate the age of patients
* Remaining columns - mortality data, not useful for this analysis
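
A small sketch of an age calculation from DOB, assuming age is approximated as the difference in calendar years and that implausibly large ages (from shifted birth dates of the oldest patients) are replaced by a fixed value. The rows are made up, and the cap of 91 for ages above 89 is an assumption of this example.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from MIMIC-III)
toy = pd.DataFrame({
    "ADMITTIME": pd.to_datetime(["2150-06-01", "2150-06-01"]),
    "DOB": pd.to_datetime(["2100-09-01", "1850-06-01"]),
})

# Year difference: a coarse age that ignores whether the birthday has passed
toy["AGE"] = toy["ADMITTIME"].dt.year - toy["DOB"].dt.year

# Replace implausible ages (here 300) with a single fixed value
toy.loc[toy["AGE"] > 89, "AGE"] = 91
print(toy)
```

Because only the years are subtracted, the first patient is assigned an age of 50 although the 50th birthday falls three months after admission. The year difference can overstate the exact age by one year.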

In [60]:
admissions = admissions.merge(patients_df[['GENDER', 'DOB']],left_index=True, right_index=True)
# calculate the age by subtracting date of birth from admission date (MIMIC documentation)
admissions['AGE'] = admissions['ADMITTIME'].dt.year - admissions['DOB'].dt.year
# Note, as part of patient deidentification, the age of patients over 89 has been set to 300.
# Reset the age of patients over 89 (coded as 300) to 91 (median age before transformation)
admissions.loc[admissions['AGE'] > 89, 'AGE'] = 91
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37812 entries, 3 to 99992
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   ROW_ID                37812 non-null  int64         
 1   HADM_ID               37812 non-null  int64         
 2   ADMITTIME             37812 non-null  datetime64[ns]
 3   DISCHTIME             37812 non-null  datetime64[ns]
 4   DEATHTIME             0 non-null      object        
 5   ADMISSION_TYPE        37812 non-null  object        
 6   ADMISSION_LOCATION    37812 non-null  object        
 7   DISCHARGE_LOCATION    37812 non-null  object        
 8   INSURANCE             37812 non-null  object        
 9   LANGUAGE              25398 non-null  object        
 10  RELIGION              37476 non-null  object        
 11  MARITAL_STATUS        35964 non-null  object        
 12  ETHNICITY             37812 non-null  object        
 13  EDREGTIME       

### Remaining steps
3. eliminate unused columns
4. fill in missing values with the mode of the column
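
Mode imputation replaces each missing entry with the most frequent value of its column. A minimal sketch on made-up rows (the values are hypothetical); it assigns the filled column back to the DataFrame rather than filling in place.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from MIMIC-III)
toy = pd.DataFrame({
    "LANGUAGE": ["ENGL", None, "ENGL", "SPAN"],
    "MARITAL_STATUS": ["MARRIED", "SINGLE", None, "MARRIED"],
})

for column in toy.columns:
    # mode() returns a Series of the most frequent values; take the first one
    most_frequent = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```

When several values are tied for most frequent, `mode()` returns them in sorted order, so `[0]` picks the first of them.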

In [61]:
admissions.columns

Index(['ROW_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
       'ADMISSION_TYPE', 'ADMISSION_LOCATION', 'DISCHARGE_LOCATION',
       'INSURANCE', 'LANGUAGE', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY',
       'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG',
       'HAS_CHARTEVENTS_DATA', 'PREV_DISCHTIME', 'NEXT_ADMITTIME',
       'TIME_FROM_PREV_DICH', 'TIME_TO_NEXT_ADMIT', 'OUTPUT_LABEL',
       '1Y_PRIOR_ADM', 'LENGTH_OF_STAY_DAYS', 'GENDER', 'DOB', 'AGE'],
      dtype='object')

In [62]:
# 3. eliminate unused columns
cols_to_delete = ['ROW_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',
       'ADMISSION_TYPE',
       'EDREGTIME', 'EDOUTTIME', 'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG',
       'HAS_CHARTEVENTS_DATA', 'PREV_DISCHTIME', 'NEXT_ADMITTIME',
       'TIME_FROM_PREV_DICH', 'TIME_TO_NEXT_ADMIT', 'DOB']
admissions.drop(columns=cols_to_delete, inplace=True)

In [63]:
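# 4. fill in missing values with the mode of the column
for column in admissions.columns:
    # assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which may act on a temporary copy in recent pandas versions
    admissions[column] = admissions[column].fillna(admissions[column].mode()[0])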
admissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37812 entries, 3 to 99992
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   HADM_ID              37812 non-null  int64  
 1   ADMISSION_LOCATION   37812 non-null  object 
 2   DISCHARGE_LOCATION   37812 non-null  object 
 3   INSURANCE            37812 non-null  object 
 4   LANGUAGE             37812 non-null  object 
 5   RELIGION             37812 non-null  object 
 6   MARITAL_STATUS       37812 non-null  object 
 7   ETHNICITY            37812 non-null  object 
 8   OUTPUT_LABEL         37812 non-null  float64
 9   1Y_PRIOR_ADM         37812 non-null  float64
 10  LENGTH_OF_STAY_DAYS  37812 non-null  int64  
 11  GENDER               37812 non-null  object 
 12  AGE                  37812 non-null  int64  
dtypes: float64(2), int64(3), object(8)
memory usage: 4.0+ MB


In [64]:
# the table index SUBJECT_ID will not be used anymore; from this point forward all merges will happen
# on HADM_ID, so delete the index
admissions.reset_index(inplace=True)
admissions.drop('SUBJECT_ID', axis=1, inplace=True)
# Save the intermediate table with patient demographics
admissions.to_csv('../../data/intermediate/inter022020.csv')
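
With its default settings, `to_csv` also writes the DataFrame index, which shows up as an extra unnamed column when the file is read back. The sketch below demonstrates this on a made-up table, using an in-memory buffer instead of a file. Passing `index=False` is one option if the extra column is not wanted downstream.

```python
import io
import pandas as pd

# Toy data with hypothetical values (not taken from MIMIC-III)
toy = pd.DataFrame({"HADM_ID": [100, 101], "AGE": [50, 91]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```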