## To Predict Chronic Pain Among Patients Admitted to ICU 

Chronic Pain - Any etiology of chronic pain (including fibromyalgia) requiring long-term opioid/narcotic medication to control.

**Notes** :

* PCA - look into each component's loadings (to see which variables contribute) and assess feature importance using the top eigenvalues. Each component has its own eigenvalue.
* Logistic regression - variables included
* Before any feature selection methods - we can use our knowledge/common sense (admission time/location probably shouldn't be included in the dataset). Therefore, do some manual filtering.
* Perform univariate analysis to throw out features before running algorithms.
* Draw a diagram for the pipeline
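
The PCA note above can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn's `PCA`; the synthetic matrix and the feature names `f0` to `f3` are placeholders, not the ICU dataset. Note that `components_` holds unit-length component weights, which are loadings only up to scaling by the eigenvalues.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 4 hypothetical features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
feature_names = ["f0", "f1", "f2", "f3"]

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=4).fit(X_scaled)

# One eigenvalue per component (variance captured along that component)
eigenvalues = pca.explained_variance_

# Rows = components, columns = original features
weights = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=[f"PC{i + 1}" for i in range(4)],
)

# Features ranked by absolute weight on the first component
top_features_pc1 = weights.loc["PC1"].abs().sort_values(ascending=False)
print(eigenvalues)
print(top_features_pc1)
```

Ranking by absolute weight is one simple way to read off which variables drive a component; other conventions exist.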

**Pipeline goes:** 

Data cleaning > Normalization > Prior knowledge/univariate analysis to drop variables and reduce dimension > PCA (selects components to pass to other algorithms) and/or Logistic Regression (selects features on its own) > Classifier algorithms
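
One possible way to wire these stages together is sketched below. It assumes scikit-learn's `Pipeline`, `SelectKBest`, and `PCA`. The synthetic data, `k=20`, and `n_components=10` are arbitrary placeholders rather than choices made in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the cleaned feature matrix
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # normalization
    ("univariate", SelectKBest(f_classif, k=20)),   # univariate filter
    ("pca", PCA(n_components=10)),                  # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)),     # classifier
])

# Every step is refit inside each fold, so the filter and PCA never see the held-out data
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Keeping the selection steps inside the pipeline avoids leaking information from the validation folds into feature selection.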

### Part I: Data Preprocessing

### Import packages

In [1]:
import numpy as np
from numpy import mean
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn import metrics
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, roc_auc_score, roc_curve, auc
import xgboost as xgb
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from imblearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Adjust notebook settings to widen the notebook
from IPython.display import display, HTML
display(HTML("<style>.container {width:85% !important;}</style>"))

### Import modules/datasets

In [None]:
# every unique hospitalization for each patient in the database (defines HADM_ID)
admissions = pd.read_csv('data/ADMISSIONS.csv')
# every unique patient in the database (defines subject_id)
patients = pd.read_csv("data/PATIENTS.csv")
# the clinical service under which a patient is registered
services = pd.read_csv("data/SERVICES.csv")
# Diagnosis Related Groups (DRG), which are used by the hospital for billing purposes.
drgcodes = pd.read_csv("data/DRGCODES.csv")
# Deidentified notes, including nursing and physician notes, ECG reports, imaging reports, and discharge summaries.
noteevents = pd.read_csv("data/NOTEEVENTS.csv")
# Medications ordered, and not necessarily administered, for a given patient
prescriptions = pd.read_csv("data/PRESCRIPTIONS.csv")
# Ground truth dataset
phenotypes = pd.read_csv("data/GROUND_TRUTH.csv")

In [None]:
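# Lowercase every value in each dataframe (all columns are cast to strings first)
# Note: astype(str) also turns missing values into the string 'nan', hence the *_nan dummy columns later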
admissions = admissions.apply(lambda x: x.astype(str).str.lower())
drgcodes = drgcodes.apply(lambda x: x.astype(str).str.lower())
noteevents = noteevents.apply(lambda x: x.astype(str).str.lower())
patients = patients.apply(lambda x: x.astype(str).str.lower())
services = services.apply(lambda x: x.astype(str).str.lower())
prescriptions = prescriptions.apply(lambda x: x.astype(str).str.lower())
phenotypes = phenotypes.apply(lambda x: x.astype(str).str.lower())

# lowercase columns in all dataframes
admissions.columns = admissions.columns.str.lower()
drgcodes.columns = drgcodes.columns.str.lower()
noteevents.columns = noteevents.columns.str.lower()
patients.columns = patients.columns.str.lower()
services.columns = services.columns.str.lower()
prescriptions.columns = prescriptions.columns.str.lower()
phenotypes.columns = phenotypes.columns.str.lower()
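
If the `'nan'` strings produced by the string cast are unwanted (the `religion_nan` and `language_nan` columns in the final dataset suggest they were created here), one alternative is to lowercase only actual string cells. The helper name `lowercase_text` and the two-row frame are hypothetical and for illustration only.

```python
import numpy as np
import pandas as pd

def lowercase_text(df):
    """Lowercase column names and string cells; leave NaN and numbers untouched."""
    out = df.apply(
        lambda col: col.map(lambda v: v.lower() if isinstance(v, str) else v)
    )
    out.columns = out.columns.str.lower()
    return out

# Hypothetical frame standing in for one of the MIMIC tables
demo = pd.DataFrame({"RELIGION": ["CATHOLIC", np.nan], "HADM_ID": [100001, 100002]})
cleaned = lowercase_text(demo)
print(cleaned)
```

With this variant, missing values stay missing, so they can be handled explicitly before one-hot encoding.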

In [None]:
# Shrink each dataset to the patients whose subject_id appears in the phenotypes (ground truth) dataset
admissions_reduced = admissions[admissions['subject_id'].isin(phenotypes['subject_id'])]
drgcodes_reduced = drgcodes[drgcodes['subject_id'].isin(phenotypes['subject_id'])]
noteevents_reduced = noteevents[noteevents['subject_id'].isin(phenotypes['subject_id'])]
patients_reduced = patients[patients['subject_id'].isin(phenotypes['subject_id'])]
services_reduced = services[services['subject_id'].isin(phenotypes['subject_id'])]
prescriptions_reduced = prescriptions[prescriptions['subject_id'].isin(phenotypes['subject_id'])]

admissions_reduced = admissions_reduced.reset_index(drop=True)
drgcodes_reduced = drgcodes_reduced.reset_index(drop=True)
noteevents_reduced = noteevents_reduced.reset_index(drop=True)
patients_reduced = patients_reduced.reset_index(drop=True)
services_reduced = services_reduced.reset_index(drop=True)
prescriptions_reduced = prescriptions_reduced.reset_index(drop=True)

### Attributes Included

Link: https://mimic.mit.edu/docs/iii/tables/ 

In [None]:
admissions.columns

**SUBJECT_ID, HADM_ID**
Each row of this table contains a unique HADM_ID, which represents a single patient’s admission to the hospital. HADM_ID ranges from 1000000 - 1999999. It is possible for this table to have duplicate SUBJECT_ID, indicating that a single patient had multiple admissions to the hospital. The ADMISSIONS table can be linked to the PATIENTS table using SUBJECT_ID.

**ADMITTIME, DISCHTIME, DEATHTIME**
ADMITTIME provides the date and time the patient was admitted to the hospital, while DISCHTIME provides the date and time the patient was discharged from the hospital. If applicable, DEATHTIME provides the time of in-hospital death for the patient. Note that DEATHTIME is only present if the patient died in-hospital, and is almost always the same as the patient’s DISCHTIME. However, there can be some discrepancies due to typographical errors.

**ADMISSION_TYPE**
ADMISSION_TYPE describes the type of the admission: ‘ELECTIVE’, ‘URGENT’, ‘NEWBORN’ or ‘EMERGENCY’. Emergency/urgent indicate unplanned medical care, and are often collapsed into a single category in studies. Elective indicates a previously planned hospital admission. Newborn indicates that the HADM_ID pertains to the patient’s birth.
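
Collapsing emergency and urgent admissions into one category, as mentioned above, could look like the sketch below. The label `unplanned` is a made-up name for the merged category, and the values are already lowercased as in this notebook.

```python
import pandas as pd

# Toy admission types, one per category
admission_type = pd.Series(["emergency", "urgent", "elective", "newborn"])

# Map both unplanned categories to a single hypothetical label
collapsed = admission_type.replace({"emergency": "unplanned", "urgent": "unplanned"})
print(collapsed.value_counts())
```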

**ADMISSION_LOCATION**
ADMISSION_LOCATION provides information about the previous location of the patient prior to arriving at the hospital. There are 9 possible values:

* EMERGENCY ROOM ADMIT
* TRANSFER FROM HOSP/EXTRAM
* TRANSFER FROM OTHER HEALT
* CLINIC REFERRAL/PREMATURE
* ** INFO NOT AVAILABLE **
* TRANSFER FROM SKILLED NUR
* TRSF WITHIN THIS FACILITY
* HMO REFERRAL/SICK
* PHYS REFERRAL/NORMAL DELI

The truncated text occurs in the raw data.

**INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY**
The INSURANCE, LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY columns describe patient demographics. These columns occur in the ADMISSIONS table as they are originally sourced from the admission, discharge, and transfers (ADT) data from the hospital database. The values occasionally change between hospital admissions (HADM_ID) for a single patient (SUBJECT_ID). This is reasonable for some fields (e.g. MARITAL_STATUS, RELIGION), but less reasonable for others (e.g. ETHNICITY).
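
To see how often a demographic field changes between admissions of the same patient, a quick check along these lines may help. The toy frame below is invented; only the column names follow the ADMISSIONS table.

```python
import pandas as pd

# Invented example: patient 1 has two admissions with different ethnicity values
adm = pd.DataFrame({
    "subject_id": [1, 1, 2, 2],
    "hadm_id": [10, 11, 20, 21],
    "ethnicity": ["white", "unknown/not specified", "asian", "asian"],
})

# Number of distinct ethnicity values recorded per patient
n_values = adm.groupby("subject_id")["ethnicity"].nunique()
inconsistent_ids = n_values[n_values > 1].index.tolist()
print(inconsistent_ids)
```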

**EDREGTIME, EDOUTTIME**
Time that the patient was registered and discharged from the emergency department.

**DIAGNOSIS**
The DIAGNOSIS column provides a preliminary, free text diagnosis for the patient on hospital admission. The diagnosis is usually assigned by the admitting clinician and does not use a systematic ontology. As of MIMIC-III v1.0 there were 15,693 distinct diagnoses for 58,976 admissions. The diagnoses can be very informative (e.g. chronic kidney failure) or quite vague (e.g. weakness). Final diagnoses for a patient’s hospital stay are coded on discharge and can be found in the DIAGNOSES_ICD table. While this field can provide information about the status of a patient on hospital admission, it is not recommended to use it to stratify patients.

**HOSPITAL_EXPIRE_FLAG**
This indicates whether the patient died within the given hospitalization. 1 indicates death in the hospital, and 0 indicates survival to hospital discharge.

In [None]:
drgcodes.columns

**SUBJECT_ID, HADM_ID**
Identifiers which specify the patient: SUBJECT_ID is unique to a patient and HADM_ID is unique to a patient hospital stay.

**DRG_TYPE**
DRG_TYPE provides the type of DRG code in the entry. There are three types of DRG codes in the MIMIC-III database, which have overlapping ranges but distinct definitions for the codes: ‘HCFA’ (Health Care Financing Administration), ‘MS’ (Medicare), and ‘APR’ (All Payers Registry).

**DRG_CODE**
DRG_CODE contains a code which represents the diagnosis billed for by the hospital.

**DESCRIPTION**
DESCRIPTION provides a human understandable summary of the meaning of the given DRG code. The description field frequently has acronyms which represent comorbidity levels (comorbid conditions or “CC”). The following table provides a definition for some of these acronyms:

| Acronym | Description |
|---|---|
| w CC/MCC | with CC or Major CC |
| w MCC | with Major CC |
| w CC | with CC and without Major CC |
| w NonCC | with NonCC and without CC or Major CC |
| w/o MCC | with CC or Non CC and without Major CC |
| w/o CC/MCC | with nonCC and without CC or Major CC |

Note that there are three levels of comorbidities: none, with comorbid conditions, and with major comorbid conditions. These acronyms are primarily used in HCFA/MS DRG codes.

**DRG_SEVERITY, DRG_MORTALITY**
DRG_SEVERITY and DRG_MORTALITY provide additional granularity to DRG codes in the ‘APR’ DRG type. Severity and mortality allow for higher billing costs when a diagnosis is more severe, and vice versa.

In [None]:
noteevents.columns

**SUBJECT_ID, HADM_ID**
Identifiers which specify the patient: SUBJECT_ID is unique to a patient and HADM_ID is unique to a patient hospital stay.

**CHARTDATE, CHARTTIME, STORETIME**
CHARTDATE records the date at which the note was charted. CHARTDATE will always have a time value of 00:00:00.

CHARTTIME records the date and time at which the note was charted. If both CHARTDATE and CHARTTIME exist, then the date portions will be identical. All records have a CHARTDATE. A subset are missing CHARTTIME. More specifically, notes with a CATEGORY value of ‘Discharge Summary’, ‘ECG’, and ‘Echo’ never have a CHARTTIME, only CHARTDATE. Other categories almost always have both CHARTTIME and CHARTDATE, but there is a small amount of missing data for CHARTTIME (usually less than 0.5% of the total number of notes for that category).

STORETIME records the date and time at which a note was saved into the system. Notes with a CATEGORY value of ‘Discharge Summary’, ‘ECG’, ‘Radiology’, and ‘Echo’ never have a STORETIME. All other notes have a STORETIME.
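
Because some note categories never carry a CHARTTIME, a single timestamp per note can be built by falling back to CHARTDATE. This is a small sketch on invented rows; the column `note_time` is a hypothetical name.

```python
import pandas as pd

# Invented notes: the discharge summary has no charttime
notes = pd.DataFrame({
    "category": ["discharge summary", "nursing"],
    "chartdate": ["2150-01-02", "2150-01-03"],
    "charttime": [None, "2150-01-03 14:30:00"],
})
notes["chartdate"] = pd.to_datetime(notes["chartdate"])
notes["charttime"] = pd.to_datetime(notes["charttime"])

# Use charttime when present, otherwise the date at midnight
notes["note_time"] = notes["charttime"].fillna(notes["chartdate"])
print(notes[["category", "note_time"]])
```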

**CATEGORY, DESCRIPTION**
CATEGORY and DESCRIPTION define the type of note recorded. For example, a CATEGORY of ‘Discharge summary’ indicates that the note is a discharge summary, and the DESCRIPTION of ‘Report’ indicates a full report while a DESCRIPTION of ‘Addendum’ indicates an addendum (additional text to be added to the previous report).

**CGID**
CGID is the identifier for the caregiver who input the note.

**ISERROR**
A ‘1’ in the ISERROR column indicates that a physician has identified this note as an error.

**TEXT**
TEXT contains the note text.

In [None]:
patients.columns

**SUBJECT_ID**
SUBJECT_ID is a unique identifier which specifies an individual patient. SUBJECT_ID is a candidate key for the table, so is unique for each row. Information that is consistent for the lifetime of a patient is stored in this table.

**GENDER**
GENDER is the genotypical sex of the patient.

**DOB**
DOB is the date of birth of the given patient. Patients who are older than 89 years old at any time in the database have had their date of birth shifted to obscure their age and comply with HIPAA. The shift process was as follows: the patient’s age at their first admission was determined. The date of birth was then set to exactly 300 years before their first admission.
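
Should age ever be derived from DOB, the 300-year shift needs handling. The sketch below uses year arithmetic, which is approximate because it ignores whether the birthday has passed. The cap of 90 and the column names `age` and `age_over_89` are illustrative assumptions.

```python
import pandas as pd

# Invented rows: the second patient has a DOB shifted 300 years before admission
df = pd.DataFrame({
    "dob": pd.to_datetime(["2080-05-01", "1850-03-10"]),
    "admittime": pd.to_datetime(["2150-06-15", "2150-03-10"]),
})

# Year difference avoids subtracting timestamps that lie about 300 years apart
df["age"] = df["admittime"].dt.year - df["dob"].dt.year

# Flag the shifted group and replace the implausible age with a hypothetical cap
df["age_over_89"] = df["age"] > 89
df.loc[df["age_over_89"], "age"] = 90
print(df[["age", "age_over_89"]])
```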

**DOD, DOD_HOSP, DOD_SSN**
DOD is the date of death for the given patient. DOD_HOSP is the date of death as recorded in the hospital database. DOD_SSN is the date of death from the social security database. Note that DOD merged together DOD_HOSP and DOD_SSN, giving priority to DOD_HOSP if both were recorded.

**EXPIRE_FLAG**
EXPIRE_FLAG is a binary flag which indicates whether the patient died, i.e. whether DOD is null or not. These deaths include both deaths within the hospital (DOD_HOSP) and deaths identified by matching the patient to the social security master death index (DOD_SSN).

In [None]:
services.columns

**SUBJECT_ID, HADM_ID**
Identifiers which specify the patient: SUBJECT_ID is unique to a patient and HADM_ID is unique to a patient hospital stay.

**TRANSFERTIME**
TRANSFERTIME is the time at which the patient moved from the PREV_SERVICE (if present) to the CURR_SERVICE.

**PREV_SERVICE, CURR_SERVICE**
PREV_SERVICE and CURR_SERVICE are the previous and current service that the patient resides under.


| Service | Description |
|---|---|
| CMED | Cardiac Medical - for non-surgical cardiac related admissions |
| CSURG | Cardiac Surgery - for surgical cardiac admissions |
| DENT | Dental - for dental/jaw related admissions |
| ENT | Ear, nose, and throat - conditions primarily affecting these areas |
| GU | Genitourinary - reproductive organs/urinary system |
| GYN | Gynecological - female reproductive systems and breasts |
| MED | Medical - general service for internal medicine |
| NB | Newborn - infants born at the hospital |
| NBB | Newborn baby - infants born at the hospital |
| NMED | Neurologic Medical - non-surgical, relating to the brain |
| NSURG | Neurologic Surgical - surgical, relating to the brain |
| OBS | Obstetrics - concerned with childbirth and the care of women giving birth |
| ORTHO | Orthopaedic - surgical, relating to the musculoskeletal system |
| OMED | Orthopaedic medicine - non-surgical, relating to musculoskeletal system |
| PSURG | Plastic - restoration/reconstruction of the human body (including cosmetic or aesthetic) |
| PSYCH | Psychiatric - mental disorders relating to mood, behaviour, cognition, or perceptions |
| SURG | Surgical - general surgical service not classified elsewhere |
| TRAUM | Trauma - injury or damage caused by physical harm from an external source |
| TSURG | Thoracic Surgical - surgery on the thorax, located between the neck and the abdomen |
| VSURG | Vascular Surgical - surgery relating to the circulatory system |

In [None]:
prescriptions.columns

**SUBJECT_ID, HADM_ID, ICUSTAY_ID**
Identifiers which specify the patient: SUBJECT_ID is unique to a patient, HADM_ID is unique to a patient hospital stay and ICUSTAY_ID is unique to a patient ICU stay.

**STARTDATE, ENDDATE**
STARTDATE and ENDDATE specify the date period for which the prescription was valid.

**DRUG_TYPE**
DRUG_TYPE provides the type of drug prescribed.

**DRUG, DRUG_NAME_POE, DRUG_NAME_GENERIC**
These columns are various representations of the drug prescribed to the patient.

**FORMULARY_DRUG_CD, GSN, NDC**
These columns provide a representation of the drug in various coding systems. GSN is the Generic Sequence Number. NDC is the National Drug Code.

**PROD_STRENGTH, DOSE_VAL_RX, DOSE_UNIT_RX, FORM_VAL_DISP, FORM_UNIT_DISP**

**ROUTE**
The route prescribed for the drug.

### Functions

In [None]:
# Print the sorted unique values of a list-like object
def unique(list1):
    """Print the unique values found in list1 (returns nothing)."""
    x = np.array(list1)
    print(np.unique(x))

### Clean dataset: phenotypes

In [None]:
# Keep only the identifiers and the outcome feature of interest
phenotypes_reduced = phenotypes[['hadm_id','subject_id','chronic.pain.fibromyalgia']]
# Drop duplicated records by subject_id and hadm_id
phenotypes_reduced = phenotypes_reduced.drop_duplicates(subset=['subject_id','hadm_id'], ignore_index = True)

In [None]:
# Size of phenotypes_reduced
phenotypes_reduced.shape

In [None]:
phenotypes_reduced.head()

In [None]:
# Class balance of the outcome after removing duplicate admissions
phenotypes_reduced['chronic.pain.fibromyalgia'].value_counts()

### Clean dataset: admissions

#### Regarding the diagnosis feature from admissions:

As of MIMIC-III v1.0 there were 15,693 distinct diagnoses for 58,976 admissions. The diagnoses can be very informative (e.g. chronic kidney failure) or quite vague (e.g. weakness). Final diagnoses for a patient’s hospital stay are coded on discharge and can be found in the DIAGNOSES_ICD table. While this field can provide information about the status of a patient on hospital admission, it is not recommended to use it to stratify patients. For this reason the diagnosis column is dropped during cleaning.


In [None]:
# Size of admissions_reduced
admissions_reduced.shape

In [None]:
# The dates and times in the database are deidentified (shifted), so only differences between them are meaningful
# Note: .dt.days truncates to whole days, so an ED stay shorter than 24 hours yields 0
admissions_reduced['edouttime'] = pd.to_datetime(admissions_reduced['edouttime'])
admissions_reduced['edregtime'] = pd.to_datetime(admissions_reduced['edregtime'])
admissions_reduced['length_ed'] = (admissions_reduced['edouttime'] - admissions_reduced['edregtime']).dt.days
admissions_reduced['dischtime'] = pd.to_datetime(admissions_reduced['dischtime'])
admissions_reduced['admittime'] = pd.to_datetime(admissions_reduced['admittime'])
admissions_reduced['length_admit'] = (admissions_reduced['dischtime'] - admissions_reduced['admittime']).dt.days
# Drop time-related features used to create new features
admissions_reduced = admissions_reduced.drop(['edregtime', 'edouttime', 'dischtime', 'admittime'], axis = 1)
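
Since emergency department stays often last hours rather than days, an hour-based duration may carry more signal than the day count. A small comparison on invented timestamps is sketched below; `length_ed_hours` is a hypothetical column name.

```python
import pandas as pd

# Invented ED times: a 6.5 hour stay and an admission without ED times
ed = pd.DataFrame({
    "edregtime": pd.to_datetime(["2150-01-01 08:00:00", None]),
    "edouttime": pd.to_datetime(["2150-01-01 14:30:00", None]),
})

delta = ed["edouttime"] - ed["edregtime"]
ed["length_ed_days"] = delta.dt.days                      # whole days only
ed["length_ed_hours"] = delta.dt.total_seconds() / 3600   # fractional hours
print(ed[["length_ed_days", "length_ed_hours"]])
```

Rows without ED timestamps stay missing in both columns, so the imputation decision can still be made later.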

In [None]:
# One-hot encode the categorical admission features (the first level of each is dropped as the reference category)
categorical_prefixes = {
    'admission_type': 'admission_type',
    'admission_location': 'admission_loc',
    'discharge_location': 'discharge_loc',
    'insurance': 'insurance',
    'religion': 'religion',
    'language': 'language',
    'marital_status': 'marital_status',
    'ethnicity': 'ethnicity',
}
for column, prefix in categorical_prefixes.items():
    just_dummies = pd.get_dummies(admissions_reduced[column], prefix=prefix, drop_first=True)
    admissions_reduced = pd.concat([admissions_reduced, just_dummies], axis=1)

In [None]:
# Drop the original categorical columns that were one-hot encoded, along with row_id, deathtime and diagnosis
admissions_reduced = admissions_reduced.drop(['row_id', 'deathtime', 'diagnosis', 'religion', 'language','marital_status', 'ethnicity', 'insurance', 'admission_location', 'discharge_location', 'admission_type'], axis = 1)

In [None]:
admissions_reduced["hospital_expire_flag"] = admissions_reduced.hospital_expire_flag.astype(float)
admissions_reduced["has_chartevents_data"] = admissions_reduced.has_chartevents_data.astype(float)

In [None]:
# Size of cleaned admissions_reduced dataset
admissions_reduced.shape

### Clean dataset: patients

In [None]:
# Size of patients_reduced dataset
patients_reduced.shape

In [None]:
# Encode gender numerically (m -> 1, f -> 0); map() avoids chained assignment on a filtered frame
patients_reduced['gender'] = patients_reduced['gender'].map({'m': 1, 'f': 0})

In [None]:
# Remove PHI (protected health information) features that had been deidentified
patients_reduced = patients_reduced.drop(['row_id', 'dob', 'dod', 'dod_hosp', 'dod_ssn'], axis = 1)

In [None]:
# Final features left for patients_reduced dataset
patients_reduced.head(2)

### Clean dataset: drgcodes

In [None]:
drgcodes_reduced.shape

In [None]:
drgcodes_reduced.head(3)

In [None]:
# Create dummy variables for drg_code
just_dummies = pd.get_dummies(drgcodes_reduced['drg_code'], prefix='drg_code', drop_first=True)
drgcodes_reduced = pd.concat([drgcodes_reduced, just_dummies], axis=1)

# Create dummy variables for drg_type
just_dummies = pd.get_dummies(drgcodes_reduced['drg_type'], prefix='drg_type', drop_first=True)
drgcodes_reduced = pd.concat([drgcodes_reduced, just_dummies], axis=1)

In [None]:
# Convert severity and mortality from strings to integers (missing or unparseable values become 0)
drgcodes_reduced['drg_mortality'] = pd.to_numeric(drgcodes_reduced['drg_mortality'], errors='coerce').fillna(0).astype('Int32')
drgcodes_reduced['drg_severity'] = pd.to_numeric(drgcodes_reduced['drg_severity'], errors='coerce').fillna(0).astype('Int32')

In [None]:
# To end up with one record per unique (subject_id, hadm_id) pair, compute the mean of
# drg_mortality and drg_severity within each admission
drgcodes_reduced['avg_drg_mortality'] = drgcodes_reduced.groupby(['subject_id', 'hadm_id']).drg_mortality.transform('mean')
drgcodes_reduced['avg_drg_severity'] = drgcodes_reduced.groupby(['subject_id', 'hadm_id']).drg_severity.transform('mean')

In [None]:
drgcodes_reduced['avg_drg_mortality'] = drgcodes_reduced.avg_drg_mortality.astype(float)
drgcodes_reduced['avg_drg_severity'] = drgcodes_reduced.avg_drg_severity.astype(float)

In [None]:
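# Drop duplicates by comparing subject_id and hadm_id
# Note: only the first row per admission is kept, so drg_code/drg_type dummies from its other rows are discarded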
drgcodes_reduced = drgcodes_reduced.drop_duplicates(subset=['subject_id','hadm_id'], ignore_index = True)
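
If every DRG code billed for an admission should survive the collapse to one row, the dummies can be aggregated instead of de-duplicated. This is a sketch on toy rows and is not what the cell above does; here no reference level is dropped, so both codes remain visible.

```python
import pandas as pd

# Toy rows: admission 10 has two DRG entries, admission 20 has one
drg = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [10, 10, 20],
    "drg_code": ["104", "1201", "104"],
    "drg_severity": [0, 3, 2],
})

dummies = pd.get_dummies(drg["drg_code"], prefix="drg_code").astype(int)
keyed = pd.concat([drg[["subject_id", "hadm_id"]], dummies], axis=1)

# max() keeps a 1 if any row of the admission carried that code
flags = keyed.groupby(["subject_id", "hadm_id"]).max()
severity = (
    drg.groupby(["subject_id", "hadm_id"])["drg_severity"]
    .mean()
    .rename("avg_drg_severity")
)
per_admission = flags.join(severity).reset_index()
print(per_admission)
```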

In [None]:
drgcodes_reduced = drgcodes_reduced.drop(['row_id', 'description', 'drg_code', 'drg_type', 'drg_severity', 'drg_mortality' ], axis = 1)

In [None]:
# Final size of the drgcodes_reduced
drgcodes_reduced.shape

In [None]:
drgcodes_reduced.columns

### Merge Datasets

In [None]:
phenotypes_reduced.shape

In [None]:
main = pd.merge(admissions_reduced, phenotypes_reduced,
                how ='right',
                on = ['subject_id', 'hadm_id'])

main = pd.merge(main, patients_reduced,
                how ='left',
                on = ['subject_id'])

main = pd.merge(main, drgcodes_reduced,
                how ='left',
                on = ['subject_id', 'hadm_id'])
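
When merging on keys like these, the `validate` and `indicator` arguments of `pd.merge` can confirm that no rows are duplicated and show which labelled admissions found no match. The frames below are toy data.

```python
import pandas as pd

# Toy feature table and label table; admission 30 has labels but no features
features = pd.DataFrame({"subject_id": [1, 2], "hadm_id": [10, 20], "length_admit": [3, 5]})
labels = pd.DataFrame({"subject_id": [1, 2, 3], "hadm_id": [10, 20, 30], "label": [0, 1, 0]})

merged = pd.merge(
    features, labels,
    how="right",
    on=["subject_id", "hadm_id"],
    validate="one_to_one",   # raises MergeError if either side has duplicate keys
    indicator=True,          # adds a _merge column describing the match
)
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched)
```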

In [None]:
main["expire_flag"] = main.expire_flag.astype(float)
main['gender'] = main.gender.astype(int)

In [None]:
main.shape # final dataset (813, 742)

### Missingness of the final merged datasets

In [None]:
main.isnull().mean() # length_ed had ~30% missingness

In [None]:
# Assumption (not verified): a missing length_ed means the patient did not pass through the ED, so impute 0
main['length_ed'] = main['length_ed'].fillna(0)
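
Given the uncertainty about imputing 0, one hedge is to keep a flag recording which values were originally missing, so a model can tell a true 0 from an imputed one. The column name `length_ed_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with one missing ED length
df = pd.DataFrame({"length_ed": [0.0, np.nan, 1.0]})

# Record missingness before filling, then impute
df["length_ed_missing"] = df["length_ed"].isna().astype(int)
df["length_ed"] = df["length_ed"].fillna(0)
print(df)
```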

In [None]:
# Move the outcome column to the last position
cols = list(main.columns.values)
cols.pop(cols.index('chronic.pain.fibromyalgia'))
main = main[cols+['chronic.pain.fibromyalgia']]

In [None]:
main["hospital_expire_flag"] = main.hospital_expire_flag.astype(float)
main["has_chartevents_data"] = main.has_chartevents_data.astype(float)
main["expire_flag"] = main.expire_flag.astype(float)
main['gender'] = main.gender.astype(int)
main['avg_drg_mortality'] = main.avg_drg_mortality.astype(float)
main['avg_drg_severity'] = main.avg_drg_severity.astype(float)
main['chronic.pain.fibromyalgia'] = main['chronic.pain.fibromyalgia'].astype(int)

In [9]:
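# The 'Unnamed: *' columns only exist after earlier CSV round trips; errors='ignore' skips any that are absent
main.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'subject_id', 'hadm_id'], axis = 1, inplace=True, errors='ignore')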

In [10]:
main.to_csv('data/main.csv')
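
The `Unnamed: 0` columns come from writing the index to CSV and reading it back as data. The sketch below shows the effect on a toy frame, using in-memory buffers in place of files.

```python
import io
import pandas as pd

df = pd.DataFrame({"gender": [1, 0], "length_admit": [10, 4]})

# Default to_csv writes the index; read_csv then loads it as 'Unnamed: 0'
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False keeps the round trip free of the extra column
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(list(reloaded_with_index.columns))
print(list(reloaded.columns))
```

Passing `index=False` when saving would make the later drop of `Unnamed: 0` unnecessary; the cells here keep the default behaviour.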

In [2]:
# Load cleaned and merged dataset
main = pd.read_csv("data/main.csv")

In [3]:
main.shape # (813, 731)

(813, 731)

In [4]:
main.drop(['Unnamed: 0'], axis = 1, inplace=True)

In [5]:
main.head(2)

Output (abridged): the preview lists `hospital_expire_flag`, `has_chartevents_data`, `length_ed`, `length_admit`, followed by the one-hot groups `admission_type_*`, `admission_loc_*`, `discharge_loc_*`, `insurance_*`, `religion_*`, `language_*`, `marital_status_*`, `ethnicity_*`, then `gender`, `expire_flag`, the `drg_code_*` and `drg_type_*` dummies, `avg_drg_mortality`, `avg_drg_severity`, and finally the target `chronic.pain.fibromyalgia`. In the first row, `hospital_expire_flag` = 0.0, `has_chartevents_data` = 1.0, `length_ed` = 0.0, `length_admit` = 10, `drg_type_hcfa` = 1, `drg_type_ms` = 0, `avg_drg_mortality` = 0.0, `avg_drg_severity` = 0.0, and `chronic.pain.fibromyalgia` = 0.
1,0.0,1.0,0.0,7,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,2.0,0


### Restructuring -- Move the outcome variable to be the last column in the dataset

#### Set X as features and y as the outcome

In [15]:
main.head(2)


In [None]:
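# Drop identifier columns; errors='ignore' skips any that are not present in `main`
df = main.drop(columns=['Unnamed: 0', 'subject_id', 'hadm_id'], errors='ignore')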

## Feature Scaling - Standardization vs. Normalization

Feature scaling is essential for machine learning algorithms that compute distances between observations. The features should be brought to a comparable range so that each one contributes roughly proportionately to the final distance.

* Normalization (min-max scaling to the range 0 to 1) is a reasonable choice when the features are not normally distributed or their distribution is unknown. It is sensitive to outliers.
* Standardization (zero mean, unit variance) is a safe default and is the generally recommended option.
* Feature scaling is needed for gradient-descent-based algorithms (linear and logistic regression, neural networks) and distance-based algorithms (KNN, K-means, SVM), as these are very sensitive to the range of the data points.

* It is good practice to fit the scaler on the training data only and then use it to transform the test data. This avoids data leakage during model testing. Scaling the target values is generally not required.
* Apply standardization only to the numerical columns, not to the one-hot encoded features: standardizing one-hot encoded features would mean assigning a distribution to categorical features, which is not meaningful. Normalization, by contrast, can be applied to all columns, because one-hot encoded features already lie in the range 0 to 1 and are left unchanged.
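
The last two points above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual preprocessing: the toy frame `X_demo`, the outcome `y_demo` and the choice of `num_cols` are assumptions for the example. It standardizes only the numerical columns, passes the binary column through unchanged, and learns the scaling statistics from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real feature matrix (hypothetical values)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "length_admit": rng.integers(1, 30, size=100).astype(float),
    "avg_drg_severity": rng.uniform(0, 4, size=100),
    "gender": rng.integers(0, 2, size=100),  # binary / one-hot style column
})
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

num_cols = ["length_admit", "avg_drg_severity"]

# Standardize only the numerical columns; the remaining columns pass through
# unchanged and are appended after the scaled ones in the output array
scaler = ColumnTransformer(
    [("num", StandardScaler(), num_cols)],
    remainder="passthrough",
)

X_tr_stand = scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_te_stand = scaler.transform(X_te)      # the same statistics reused on the test rows
```

Because the test rows are transformed with the training mean and standard deviation, their scaled columns will generally not have exactly zero mean, which is expected.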

### Normalization

In [30]:
df = main.copy()

In [8]:
y = df['chronic.pain.fibromyalgia']
X = df.drop(['chronic.pain.fibromyalgia'], axis=1)
print('y shape:', y.shape)
print('X shape:', X.shape)

y shape: (813,)
X shape: (813, 729)


In [9]:
y.value_counts() # imbalanced classes

0    702
1    111
Name: chronic.pain.fibromyalgia, dtype: int64

In [10]:
# Normalization 
from sklearn.preprocessing import MinMaxScaler

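# fit scaler (note: fitted on the full X here because the train/test split happens later;
# to avoid leakage, fit on X_train only once the split has been made)
norm = MinMaxScaler().fit(X)

# transform training data
# X_train_norm = norm.transform(X_train)

# # transform testing data
# X_test_norm = norm.transform(X_test)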

In [None]:
# # Focusing on normalized dataset for now. 
# # Standardization with sklearn
# from sklearn.preprocessing import StandardScaler

# # copy of datasets
# X_train_stand = X_train.copy()
# X_test_stand = X_test.copy()

# # numerical features
# num_cols = ['hospital_expire_flag', 'avg_drg_mortality']

# # apply standardization on numerical features
# for i in num_cols:
#     scale = StandardScaler().fit(X_train_stand[[i]])
#     X_train_stand[i] = scale.transform(X_train_stand[[i]])
#     X_test_stand[i] = scale.transform(X_test_stand[[i]])

## Feature selection

Three benefits of performing feature selection before modeling your data are:

* Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
* Improves accuracy: less misleading data means modeling accuracy improves.
* Reduces training time: less data means that algorithms train faster.

### Forward Feature Selection

In [22]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# Sequential Forward Selection(sfs)
sfs = SFS(LogisticRegression(),
          k_features=20,
          forward=True,
          floating=False,
          scoring = 'f1',
          cv = 5)

sfs.fit(X, y)
sfs.k_feature_names_  

('hospital_expire_flag',
 'has_chartevents_data',
 'length_ed',
 'length_admit',
 'admission_type_emergency',
 'admission_type_urgent',
 'admission_loc_emergency room admit',
 'admission_loc_phys referral/normal deli',
 'admission_loc_transfer from hosp/extram',
 'admission_loc_transfer from other healt',
 'admission_loc_transfer from skilled nur',
 'discharge_loc_disc-tran cancer/chldrn h',
 'discharge_loc_disch-tran to psych hosp',
 'discharge_loc_home',
 'discharge_loc_home health care',
 'insurance_medicaid',
 'religion_protestant quaker',
 'language_engl',
 'gender',
 'drg_code_4203')

### Recursive feature elimination

In [17]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

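logreg = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe_100 = RFE(logreg, n_features_to_select=100)
rfe_100 = rfe_100.fit(X, y)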

In [18]:
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter = sidx)]

# Boolean mask of the features kept by RFE
support = rfe_100.get_support()

# Positional indexes (kept as strings, as in the printed output) and names of the selected features
feature_index = [str(i) for i in np.flatnonzero(support)]
features = list(X.columns[support])

print("Features Selected: {}\n".format(len(feature_index)))
print("Features Indexes: \n{}\n".format(feature_index))
print("Feature Names: \n{}".format(features))

Features Selected: 100

Features Indexes: 
['0', '5', '9', '11', '13', '14', '17', '20', '25', '32', '35', '36', '41', '43', '44', '46', '49', '50', '51', '52', '56', '60', '61', '64', '68', '72', '73', '80', '83', '84', '88', '100', '113', '121', '128', '129', '133', '137', '139', '143', '146', '147', '149', '157', '244', '257', '265', '285', '306', '313', '324', '334', '336', '346', '349', '365', '376', '388', '393', '396', '400', '403', '429', '444', '456', '463', '470', '477', '503', '507', '512', '514', '520', '522', '525', '526', '532', '559', '560', '565', '570', '581', '584', '590', '596', '627', '631', '632', '643', '646', '650', '665', '675', '689', '693', '696', '710', '713', '725', '728']

Feature Names: 
['hospital_expire_flag', 'admission_type_urgent', 'admission_loc_transfer from other healt', 'discharge_loc_disc-tran cancer/chldrn h', 'discharge_loc_home', 'discharge_loc_home health care', 'discharge_loc_hospice-medical facility', 'discharge_loc_long term care hospital'

### PCA

In [23]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 100)
X_pca = pca.fit_transform(X)
# X_train_norm_pca = pca.fit_transform(X_train_norm)
# X_test_norm_pca = pca.transform(X_test_norm)
# X_train_stand_pca = pca.fit_transform(X_train_stand)
# X_test_stand_pca = pca.transform(X_test_stand)

In [24]:
pca.fit_transform(X)

array([[-1.52496029e+00, -1.79189610e+00, -6.20935091e-02, ...,
         1.56179776e-02, -1.51054523e-02, -4.29649709e-03],
       [-4.52699111e+00,  1.51627006e+00, -5.22852453e-01, ...,
        -4.98787855e-02,  2.99090779e-02, -9.31862109e-02],
       [ 4.47536740e+00,  2.31073642e+00, -6.46388318e-01, ...,
        -6.24456347e-03,  9.43790362e-03,  4.00708050e-02],
       ...,
       [ 2.34795771e+01, -1.68349710e+00, -7.07003331e-01, ...,
        -4.12299429e-03, -1.12200773e-02,  4.04565786e-02],
       [-2.53458152e+00,  3.22854737e-01,  3.03720285e-01, ...,
        -4.37233593e-02,  1.49733976e-02,  4.65453354e-02],
       [-4.52554978e+00, -1.81676917e+00, -5.11113811e-01, ...,
        -2.61115299e-02,  6.00988362e-05, -2.63227263e-02]])

In [26]:
list(X.columns)

['hospital_expire_flag',
 'has_chartevents_data',
 'length_ed',
 'length_admit',
 'admission_type_emergency',
 'admission_type_urgent',
 'admission_loc_emergency room admit',
 'admission_loc_phys referral/normal deli',
 'admission_loc_transfer from hosp/extram',
 'admission_loc_transfer from other healt',
 'admission_loc_transfer from skilled nur',
 'discharge_loc_disc-tran cancer/chldrn h',
 'discharge_loc_disch-tran to psych hosp',
 'discharge_loc_home',
 'discharge_loc_home health care',
 'discharge_loc_home with home iv providr',
 'discharge_loc_hospice-home',
 'discharge_loc_hospice-medical facility',
 'discharge_loc_icf',
 'discharge_loc_left against medical advi',
 'discharge_loc_long term care hospital',
 'discharge_loc_other facility',
 'discharge_loc_rehab/distinct part hosp',
 'discharge_loc_short term hospital',
 'discharge_loc_snf',
 'insurance_medicaid',
 'insurance_medicare',
 'insurance_private',
 'insurance_self pay',
 'religion_buddhist',
 'religion_catholic',
 'relig

In [31]:
# number of components
n_pcs = pca.components_.shape[0]

# index of the feature with the largest absolute weight on each component
most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]

# map those indexes to feature names
most_important_names = [X.columns[most_important[i]] for i in range(n_pcs)]

# component label -> name of its highest-weighted feature
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df_pca = pd.DataFrame(dic.items())

In [32]:
df_pca

Unnamed: 0,0,1
0,PC0,length_admit
1,PC1,avg_drg_severity
2,PC2,insurance_medicare
3,PC3,marital_status_married
4,PC4,ethnicity_white
5,PC5,religion_catholic
6,PC6,ethnicity_white
7,PC7,gender
8,PC8,discharge_loc_home health care
9,PC9,gender


### Chi-square test

In [35]:
from sklearn import feature_selection
chi2_res = feature_selection.chi2(X, y)

df_chi2 = pd.DataFrame({
    'attr1': 'Chronic Pain',
    'attr2': X.columns,
    'chi2': chi2_res[0],
    'p': chi2_res[1],
    'alpha': 0.01
})

df_chi2['H0'] = np.where(df_chi2['p'] < 0.01, 'reject','fail to reject')

df_chi2[df_chi2['H0'] == 'reject'].sort_values('chi2')

Unnamed: 0,attr1,attr2,chi2,p,alpha,H0
51,Chronic Pain,language_engl,7.42139,0.006445,0.01,reject
3,Chronic Pain,length_admit,10.498862,0.001194,0.01,reject
73,Chronic Pain,ethnicity_black/african american,10.705825,0.001068,0.01,reject
25,Chronic Pain,insurance_medicaid,10.991124,0.000915,0.01,reject
43,Chronic Pain,religion_unitarian-universalist,12.76926,0.000352,0.01,reject
149,Chronic Pain,drg_code_1403,12.76926,0.000352,0.01,reject
41,Chronic Pain,religion_protestant quaker,13.174568,0.000284,0.01,reject
470,Chronic Pain,drg_code_4203,15.602606,7.8e-05,0.01,reject
727,Chronic Pain,avg_drg_mortality,18.847238,1.4e-05,0.01,reject
728,Chronic Pain,avg_drg_severity,21.414729,4e-06,0.01,reject


In [37]:
from sklearn import feature_selection
chi2_res = feature_selection.chi2(X, y)

df_chi2 = pd.DataFrame({
    'attr1': 'Chronic Pain',
    'attr2': X.columns,
    'chi2': chi2_res[0],
    'p': chi2_res[1],
    'alpha': 0.05
})

df_chi2['H0'] = np.where(df_chi2['p'] < 0.05, 'reject','fail to reject')

df_chi2[df_chi2['H0'] == 'reject'].sort_values('chi2')

Unnamed: 0,attr1,attr2,chi2,p,alpha,H0
129,Chronic Pain,drg_code_1304,3.858989,0.04948,0.05,reject
14,Chronic Pain,discharge_loc_home health care,3.950826,0.046848,0.05,reject
20,Chronic Pain,discharge_loc_long term care hospital,4.318833,0.037693,0.05,reject
44,Chronic Pain,religion_unobtainable,4.373556,0.036501,0.05,reject
514,Chronic Pain,drg_code_468,4.482444,0.034245,0.05,reject
133,Chronic Pain,drg_code_1334,4.482444,0.034245,0.05,reject
64,Chronic Pain,marital_status_married,4.862347,0.027449,0.05,reject
83,Chronic Pain,ethnicity_unknown/not specified,4.987644,0.025529,0.05,reject
36,Chronic Pain,religion_jewish,5.711306,0.016856,0.05,reject
13,Chronic Pain,discharge_loc_home,5.76951,0.016307,0.05,reject


In [43]:
df.select_dtypes("float").columns

Index(['hospital_expire_flag', 'has_chartevents_data', 'length_ed',
       'expire_flag', 'avg_drg_mortality', 'avg_drg_severity'],
      dtype='object')

## Select the variables of interest and split the dataset into training and test sets

In [None]:
# from forward

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# from recursive

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# from PCA

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# from chi-square test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
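
The four cells above split the full `X` in the same way. A sketch of how a selected subset could be applied before splitting is shown below, on made-up data: the frame `X_demo`, the outcome `y_demo` and the placeholder list `selected` are assumptions for illustration. In the notebook, the list would come from one of the selectors above, for example `list(sfs.k_feature_names_)`. The sketch also passes `stratify`, which keeps the class ratio similar in both splits and may be worth considering given the class imbalance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced binary outcome (about 14% positives)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((200, 6)), columns=["a", "b", "c", "d", "e", "f"])
y_demo = pd.Series((rng.random(200) < 0.14).astype(int), name="outcome")

# Placeholder for the names returned by a feature selector
selected = ["a", "c", "f"]

# Keep only the selected columns, then split with stratification on the outcome
X_sel = X_demo[selected]
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```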

## Classifier Algorithms

In [None]:
# Logistic Regression
model = LogisticRegression(random_state=0)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))
print('\n')

In [None]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

grid={"n_neighbors":range(2,10)}
model_cv=GridSearchCV(model,grid,cv=5)
model_cv.fit(X_train,y_train)
y_pred = model_cv.predict(X_test)
print("Best parameters: ", model_cv.best_params_)
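
`GridSearchCV` scores candidates by accuracy unless told otherwise, which can be misleading when one class dominates. The sketch below is one possible variant, not the notebook's actual setup: it uses synthetic data from `make_classification`, scikit-learn's own `Pipeline` rather than the `imblearn` one imported earlier, and F1 as the selection metric. Putting the scaler inside the pipeline means it is refit on the training part of every cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical imbalanced toy data (roughly 85% / 15%)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaler and classifier in one pipeline, so scaling is learned per CV fold
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(2, 10))},
    scoring="f1",  # select on F1 instead of the default accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, search.predict(X_te))
print("Best parameters:", search.best_params_)
print("Test F1:", test_f1)
```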

In [None]:
model = KNeighborsClassifier(n_neighbors=6)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))
print('\n')
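
Each evaluation cell repeats the same fit, predict, and print steps for five feature sets. One way to cut the repetition, sketched here under the assumption that every model exposes `fit` and `predict`, is a small loop over named train/test pairs. The helper name `evaluate_feature_sets`, the `MajorityClassifier` stand-in, and the toy data are hypothetical; in the notebook the dictionary would hold pairs such as `(X_train_norm, X_test_norm)` and the metric functions would be the scikit-learn ones already imported.

```python
import copy


def evaluate_feature_sets(model, feature_sets, y_train, y_test, metrics):
    """Fit a fresh copy of `model` on each (X_train, X_test) pair and score it."""
    results = {}
    for set_name, (X_tr, X_te) in feature_sets.items():
        fitted = copy.deepcopy(model)  # avoid carrying state between feature sets
        fitted.fit(X_tr, y_train)
        y_pred = fitted.predict(X_te)
        results[set_name] = {name: fn(y_test, y_pred) for name, fn in metrics.items()}
    return results


class MajorityClassifier:
    """Stand-in model (hypothetical): always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Plain-Python accuracy so the sketch runs without extra packages."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data standing in for the raw and normalized feature matrices.
toy_sets = {
    'Raw': ([[1.0], [2.0], [3.0], [4.0]], [[1.5], [3.5]]),
    'Normalized': ([[0.0], [1 / 3], [2 / 3], [1.0]], [[1 / 6], [5 / 6]]),
}
toy_y_train = [0, 0, 0, 1]
toy_y_test = [0, 1]

results = evaluate_feature_sets(
    MajorityClassifier(), toy_sets, toy_y_train, toy_y_test, {'Accuracy': accuracy}
)
for set_name, scores in results.items():
    print(set_name, scores)
```

Copying the model before each fit keeps the runs independent, which matters for estimators that retain state between calls to `fit`.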

In [None]:
# Support Vector Machine
# Assumption (linear kernel only): the classes are roughly linearly separable; the rbf, poly, and sigmoid kernels relax this.
from sklearn.svm import SVC
model = SVC()

grid={'C':[1,10,100,1000],'gamma':[1,0.1,0.001,0.0001], 'kernel':['linear','rbf', 'sigmoid','poly'],'degree': [2,3,5,7]}
model_cv=GridSearchCV(model,grid,cv=5)
model_cv.fit(X_train,y_train)
y_pred = model_cv.predict(X_test)
print("Best parameters: ", model_cv.best_params_)

In [None]:
model = SVC(C=1, degree=2, gamma=0.1, kernel='rbf')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))
print('\n')

In [None]:
# Naive Bayes
# Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. 
# Naive Bayes classifiers are extremely fast compared to more sophisticated methods.

# Disadvantages: Naive Bayes is known to be a poor probability estimator, so its predict_proba outputs should not be taken too seriously.

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

grid={'var_smoothing':[1e-11, 1e-10, 1e-9, 1e-8]}
model_cv=GridSearchCV(model,grid,cv=5)
model_cv.fit(X_train,y_train)
y_pred = model_cv.predict(X_test)
print("Best parameters: ", model_cv.best_params_)
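
The comment that Naive Bayes needs little training data follows from how few parameters it estimates: one mean and one variance per feature per class, plus a class prior. A minimal pure-Python sketch of that idea is below. The function names and toy values are hypothetical, and the smoothing term mimics the documented meaning of `var_smoothing` (a fraction of the largest feature variance added to every variance) rather than reproducing scikit-learn's implementation.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate a prior, per-feature means, and per-feature variances for each class."""
    n_features = len(X[0])
    # Smoothing term: a fraction of the largest overall feature variance.
    eps = var_smoothing * max(variance([row[j] for row in X]) for j in range(n_features))
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [variance([r[j] for r in rows]) + eps for j in range(n_features)]
        params[label] = (len(rows) / len(X), means, variances)
    return params


def predict_gaussian_nb(params, x):
    """Return the class with the highest log posterior for one sample `x`."""

    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for value, mu, var in zip(x, means, variances):
            # Log of the Gaussian density, summed over features (the "naive" independence step).
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return total

    return max(params, key=lambda label: log_posterior(*params[label]))


# Toy data: two features, two well-separated classes.
X_toy = [[1.0, 20.0], [1.2, 22.0], [0.8, 19.0], [3.0, 30.0], [3.2, 33.0], [2.9, 31.0]]
y_toy = [0, 0, 0, 1, 1, 1]

nb_params = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict_gaussian_nb(nb_params, [1.1, 21.0])
pred_high = predict_gaussian_nb(nb_params, [3.1, 32.0])
print(pred_low, pred_high)
```

Larger `var_smoothing` values widen every class's Gaussians, which is what the grid above is tuning.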

In [None]:
model = GaussianNB(var_smoothing=1e-11)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))
print('\n')

In [None]:
# Decision Tree
# Advantages: Decision Tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.
# Disadvantages: Decision tree can create complex trees that do not generalise well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

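# 'auto' was an alias for 'sqrt' in classifiers and was removed in scikit-learn 1.3, so it is dropped from the grid.
grid={"splitter":['best', 'random'], "max_depth":[5,10,15,20], "min_samples_split": [5,10,15,20], "max_features":['sqrt', 'log2']}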
model_cv=GridSearchCV(model,grid,cv=5)
model_cv.fit(X_train,y_train)
y_pred = model_cv.predict(X_test)
print("Best parameters: ", model_cv.best_params_)
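
Part of why a decision tree is easy to interpret is that each split is chosen by a simple impurity calculation. The sketch below uses Gini impurity, the default `criterion` of `DecisionTreeClassifier`, to pick a threshold on a single feature. The function names and the toy `ages` and `labels` values are illustrative only and are not drawn from the ICU data.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; keep the lowest weighted impurity."""
    best_thr, best_impurity = None, float('inf')
    unique_vals = sorted(set(values))
    for lo, hi in zip(unique_vals, unique_vals[1:]):
        thr = (lo + hi) / 2
        left = [lab for val, lab in zip(values, labels) if val <= thr]
        right = [lab for val, lab in zip(values, labels) if val > thr]
        # Impurity of the split, weighted by how many samples land on each side.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_thr, best_impurity = thr, weighted
    return best_thr, best_impurity


# Toy feature and labels (hypothetical).
ages = [25, 32, 47, 51, 62, 70]
labels = [0, 0, 0, 1, 1, 1]

root_impurity = gini(labels)
split_thr, split_impurity = best_threshold(ages, labels)
print('Impurity before split:', root_impurity)
print('Best threshold:', split_thr, 'impurity after split:', split_impurity)
```

A real tree repeats this search over every feature at every node. That repetition is also why small changes in the data can produce a different tree.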

In [None]:
model = DecisionTreeClassifier(max_depth=5, max_features='log2', min_samples_split=20, splitter='best')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))
print('\n')

In [None]:
# Random Forest 
# Ensemble learning: combines the predictions of many decision trees, each trained on a random subset of the rows and features

# The algorithm does not work well for datasets having a lot of outliers, something which needs addressing prior to the model building.

# Advantages: Reduction in over-fitting and random forest classifier is more accurate than decision trees in most cases.

# Disadvantages: Slow real time prediction, difficult to implement, and complex algorithm.

# Built on top of decision trees

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()

grid={'bootstrap': [True, False],
     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
     'max_features': ['sqrt', 'log2'],  # 'auto' (an alias for 'sqrt') was removed in scikit-learn 1.3
     'min_samples_split': [5,10,15,20],
     'n_estimators': [10, 20, 40, 60, 80, 100]}
model_cv=GridSearchCV(model,grid,cv=5)
model_cv.fit(X_train,y_train)
y_pred = model_cv.predict(X_test)
print("Best parameters: ", model_cv.best_params_)

In [None]:
model = RandomForestClassifier(bootstrap=False, max_depth=80, max_features='log2', min_samples_split=5, n_estimators=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))


In [None]:
# XGBoost
model = xgb.XGBClassifier()

grid={'max_depth': [3, 5, 6, 10, 15, 20],
      'learning_rate': [0.01, 0.1, 0.2, 0.3],
      'subsample': np.arange(0.5, 1.0, 0.1),
      'colsample_bytree': np.arange(0.4, 1.0, 0.1),
      'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
      'n_estimators': [10, 50, 100, 150, 200, 500, 1000]}
model_cv=GridSearchCV(model,grid,cv=5)
model_cv.fit(X_train,y_train)
y_pred = model_cv.predict(X_test)
print("Best parameters: ", model_cv.best_params_)
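
An exhaustive `GridSearchCV` fits one model per parameter combination per fold, so the cost of the grid above grows multiplicatively. The arithmetic below counts the fits, assuming `np.arange(0.5, 1.0, 0.1)` yields 5 values and `np.arange(0.4, 1.0, 0.1)` yields 6. The dictionary simply records how many candidates each parameter has.

```python
import math

# Number of candidate values per parameter, read off the XGBoost grid above.
n_candidates = {
    'max_depth': 6,
    'learning_rate': 4,
    'subsample': 5,          # assumed: 0.5, 0.6, 0.7, 0.8, 0.9
    'colsample_bytree': 6,   # assumed: 0.4, 0.5, ..., 0.9
    'colsample_bylevel': 6,  # assumed: 0.4, 0.5, ..., 0.9
    'n_estimators': 7,
}
cv_folds = 5

n_combinations = math.prod(n_candidates.values())
n_fits = n_combinations * cv_folds  # excludes the final refit on the full training set

print('Parameter combinations:', n_combinations)
print('Model fits with 5-fold CV:', n_fits)
```

If that many fits is impractical, `RandomizedSearchCV` from `sklearn.model_selection` samples a fixed number of combinations (`n_iter`) from the same grid. Note that this would be a change from the exhaustive search used in this notebook.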

In [None]:
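def get_xgb_imp(model, feat_names):
    """Return each feature's share of the total split count (XGBoost 'weight' importance)."""
    # get_fscore() maps feature name -> number of splits; features never used are omitted.
    imp_vals = model.get_booster().get_fscore()
    # Keys are 'f0', 'f1', ... when the model was fit on a NumPy array and
    # column names when it was fit on a DataFrame, so look up both.
    imp_dict = {name: float(imp_vals.get(name, imp_vals.get('f' + str(i), 0.)))
                for i, name in enumerate(feat_names)}
    total = sum(imp_dict.values())
    # Guard against division by zero when no listed feature was used in a split.
    return {k: v / total for k, v in imp_dict.items()} if total else imp_dict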

In [None]:
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('F1 score Raw:', f1_score(y_test, y_pred))
print('Recall score Raw:', recall_score(y_test, y_pred))
print('Accuracy score Raw:', accuracy_score(y_test, y_pred))
print(get_xgb_imp(model, feat_names))  # a bare expression mid-cell is not displayed, so print it

model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
print('F1 for Normalized:', f1_score(y_test, y_pred))
print('Recall for Normalized:', recall_score(y_test, y_pred))
print('Accuracy score Normalized:', accuracy_score(y_test, y_pred))
print(get_xgb_imp(model, feat_names))  # a bare expression mid-cell is not displayed, so print it

model.fit(X_train_stand, y_train)
y_pred = model.predict(X_test_stand)
print('F1 for Standardized:', f1_score(y_test, y_pred))
print('Recall for Standardized:', recall_score(y_test, y_pred))
print('Accuracy score Standardized:', accuracy_score(y_test, y_pred))
print(get_xgb_imp(model, feat_names))  # a bare expression mid-cell is not displayed, so print it

model.fit(X_train_norm_pca, y_train)
y_pred = model.predict(X_test_norm_pca)
print('F1 for PCA Normalized:', f1_score(y_test, y_pred))
print('Recall for PCA Normalized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Normalized:', accuracy_score(y_test, y_pred))
print('\n')

model.fit(X_train_stand_pca, y_train)
y_pred = model.predict(X_test_stand_pca)
print('F1 for PCA Standardized:', f1_score(y_test, y_pred))
print('Recall for PCA Standardized:', recall_score(y_test, y_pred))
print('Accuracy score for PCA Standardized:', accuracy_score(y_test, y_pred))