# Sepsis prediction from clinical data

Sepsis is a life-threatening condition that occurs when the body's response to infection causes tissue damage, organ failure, or death ([source](https://doi.org/10.13026/v64v-d857)). Internationally, an estimated 30 million people develop sepsis and 6 million people die from sepsis each year; an estimated 4.2 million newborns and children are affected ([WHO](https://www.who.int/news-room/fact-sheets/detail/sepsis)). Early detection and antibiotic treatment are critical for improving sepsis outcomes: each hour of delayed treatment has been associated with roughly a 4-8% increase in mortality ([source](https://doi.org/10.13026/v64v-d857)).

In this notebook I will do an exploratory analysis of clinical data and try to develop the best model for early prediction of sepsis among patients.

# Contents
1. [Exploratory Data Analysis](#EDA)
2. [Feature Engineering](#FE)
3. [Model Selection](#MS)
4. [Tuning the model's hyperparameters](#FTSM)

---

**Import libraries:**

In [None]:
# data manipulation libraries
import numpy as np # linear algebra
import pandas as pd # data processing

# data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning methods
from sklearn.model_selection import train_test_split # data splitting into train and test
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder


from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score

**Load data:**

In [None]:
dataSepsis = pd.read_csv("/kaggle/input/datasepsis/dataSepsis.csv", sep=";")

----

<a id="EDA"></a>
# 1. Exploratory Data Analysis

First, let's get a cursory overview of the data with Pandas methods `head()`, `info()`, and `describe()`:

In [None]:
dataSepsis.head(15).T

The dataset contains data on about 36 thousand patients. Each patient is represented by 41 columns.

In [None]:
dataSepsis.info()

In [None]:
# share of missing values per feature (%), largest first
dataSepsis.isna().sum(axis = 0).sort_values(ascending=False) / len(dataSepsis) * 100

In [None]:
dataSepsis.nunique()

---

### Attributes:
All attributes in the dataset are listed below, grouped by type.

**Vital signs (columns 1-8)** <br>
HR - Heart rate (beats per minute); <br>
O2Sat - Pulse oximetry (%); <br>
Temp - Temperature (Deg C)<br>
SBP - Systolic BP (mm Hg)<br>
MAP - Mean arterial pressure (mm Hg)<br>
DBP - Diastolic BP (mm Hg)<br>
Resp - Respiration rate (breaths per minute)<br>
EtCO2 - End tidal carbon dioxide (mm Hg)<br><br>
**Laboratory values (columns 9-34)**<br>
BaseExcess - Measure of excess bicarbonate (mmol/L)<br>
HCO3 - Bicarbonate (mmol/L)<br>
FiO2 - Fraction of inspired oxygen (%)<br>
pH - N/A<br>
PaCO2 - Partial pressure of carbon dioxide from arterial blood (mm Hg)<br>
SaO2 - Oxygen saturation from arterial blood (%)<br>
AST - Aspartate transaminase (IU/L)<br>
BUN - Blood urea nitrogen (mg/dL)<br>
Alkalinephos - Alkaline phosphatase (IU/L)<br>
Calcium - (mg/dL)<br>
Chloride - (mmol/L)<br>
Creatinine - (mg/dL)<br>
Bilirubin_direct - Bilirubin direct (mg/dL)<br>
Glucose - Serum glucose (mg/dL)<br>
Lactate - Lactic acid (mg/dL)<br>
Magnesium - (mmol/dL)<br>
Phosphate - (mg/dL)<br>
Potassium - (mmol/L)<br>
Bilirubin_total - Total bilirubin (mg/dL)<br>
TroponinI - Troponin I (ng/mL)<br>
Hct - Hematocrit (%)<br>
Hgb - Hemoglobin (g/dL)<br>
PTT - partial thromboplastin time (seconds)<br>
WBC - Leukocyte count (count*10^3/µL)<br>
Fibrinogen - (mg/dL)<br>
Platelets - (count*10^3/µL)<br><br>
**Demographics (columns 35-40)**<br>
Age - Years (100 for patients 90 or above)<br>
Gender - Female (0) or Male (1)<br>
Unit1 - Administrative identifier for ICU unit (MICU)<br>
Unit2 - Administrative identifier for ICU unit (SICU)<br>
HospAdmTime - Hours between hospital admit and ICU admit<br>
ICULOS - ICU length-of-stay (hours since ICU admit)<br><br>
**Outcome (column 41)** <br>
SepsisLabel - For sepsis patients, `SepsisLabel` is $1$ if $t \geq t_{sepsis} - 6$ and $0$ if $t < t_{sepsis} - 6$. <br>
For non-sepsis patients, `SepsisLabel` is $0$.
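
The labelling rule can be sketched for a single patient record. This is only an illustration: it assumes that time is counted in whole hours since ICU admission and that `t_sepsis` is the onset hour. `sepsis_labels` is a hypothetical helper, not something shipped with the dataset.

```python
import numpy as np

def sepsis_labels(n_hours, t_sepsis=None, lead=6):
    """Return one label per hour: 1 from `lead` hours before onset onwards, else 0.

    t_sepsis=None stands for a patient who never develops sepsis.
    """
    t = np.arange(n_hours)
    if t_sepsis is None:
        return np.zeros(n_hours, dtype=int)
    return (t >= t_sepsis - lead).astype(int)

print(sepsis_labels(12, t_sepsis=10))  # hours 4..11 are labelled 1
print(sepsis_labels(12))               # all zeros for a non-septic patient
```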

---

In [None]:
dataSepsis.describe(include="all").T

Count the patients in each target class:

In [None]:
dataSepsis["isSepsis"].value_counts()

### Early summary:
+ Most of the features are continuous; only **Gender**, **Unit1**, **Unit2**, and the target **isSepsis** are discrete. Going by the attribute list above, that is 37 continuous columns and 4 discrete ones.
+ A lot of features are missing more than half of their values, with **Bilirubin_direct** missing as much as 97%. We may expect that these rare values were measured because some kind of abnormality was suspected, hence the non-missing values may not be representative of the total population.
+ There are a lot of negative values in **HospAdmTime**, which probably means that the patient was first delivered to the ICU and some time later released from the ICU to a hospital ward. Positive values mean that the patient got to the ICU after spending some time in the hospital. This is just an assumption, however, and should be checked.
+ **Unit1** and **Unit2** stand for ICU units. Based on them we can find out whether the patient was put in the MICU (medical intensive care unit) or the SICU (surgical intensive care unit).
+ Septic patients constitute only 7% of the total dataset; we have to take this into account when selecting a model.
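
One place where the 7% class share matters is the train/test split: a purely random split can leave the two parts with slightly different shares of septic patients. Below is a minimal sketch of a stratified split on synthetic data. `X_demo` and `y_demo` are made-up stand-ins for the real table, and `stratify` is an optional argument of `train_test_split` that this notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for the real data: 1000 rows, 7% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1

# stratify=y_demo keeps the positive share (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)

print("train share:", y_tr.mean(), "test share:", y_te.mean())
```

Whether stratification changes the results noticeably on 36 thousand rows is an open question; it mainly makes the split less dependent on the random seed.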

---

Now we should split the data into train and test data and put test data aside until we have a trained model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(dataSepsis.drop("isSepsis", axis=1), dataSepsis["isSepsis"], test_size=0.1, random_state=42)

---

## Visualise the data

In [None]:
# set plots style
sns.set_theme(context="notebook", style="whitegrid", palette="tab10")

### Visualise **vital signs**:

In [None]:
#X_train.columns

vital_signs = ["HR", "O2Sat", "Temp", "SBP", "MAP", "DBP", "Resp", "EtCO2"]

plt.figure(figsize=(18,12))
plt.subplots_adjust(hspace = .3)
for i, column in enumerate(vital_signs, 1):
    plt.subplot(4,2,i)
    sns.histplot(data=X_train, x=column, hue=y_train, stat="density", common_norm=False, bins=60, kde=True)
    
#plt.savefig("vital_signs.png", dpi=400)

### Visualise **laboratory values**:

In [None]:
#X_train.columns

lab_values = ['BaseExcess', 'HCO3', 'FiO2', 'pH', 'PaCO2', 'SaO2', 'AST', 'BUN',
       'Alkalinephos', 'Calcium', 'Chloride', 'Creatinine', 'Bilirubin_direct',
       'Glucose', 'Lactate', 'Magnesium', 'Phosphate', 'Potassium',
       'Bilirubin_total', 'TroponinI', 'Hct', 'Hgb', 'PTT', 'WBC',
       'Fibrinogen', 'Platelets']

plt.figure(figsize=(18,42))
plt.subplots_adjust(hspace = .3)
for i, column in enumerate(lab_values, 1):
    plt.subplot(13,2,i)
    sns.histplot(data=X_train, x=column, hue=y_train, stat="density", bins=60, common_norm=False, kde=True)
    
#plt.savefig("lab_values.png", dpi=400)

### Visualise **demographics:**

In [None]:
#X_train.columns

demographics = ["Age", "HospAdmTime", "ICULOS"]

plt.figure(figsize=(18,8))
plt.subplots_adjust(hspace = .3)
for i, column in enumerate(demographics, 1):
    plt.subplot(2,2,i)
    sns.histplot(data=X_train, x=column, hue=y_train, stat="density", bins=60, common_norm=False, kde=True)
    
#plt.savefig("demographics.png", dpi=400)

In [None]:
def plotGender(data):
    # map the codes to labels on a new Series so the original column is not modified
    gender = data.map({0: "female", 1: "male"})
    
    sns.countplot(x=gender, hue=y_train, dodge=False)    

    
def plotUnit(data):
    Unit1 = data["Unit1"][data["Unit1"]==1].count() # patients in Unit1
    Unit2 = data["Unit2"][data["Unit2"]==1].count() # patients in Unit2
    totalNa = len(data["Unit1"][(data["Unit1"].isna()) & (data["Unit2"].isna())])
    
    sns.barplot(x=["Medical ICU","Surgical ICU","Not Given"] ,y=[Unit1, Unit2, totalNa])

In [None]:
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
plt.title("Gender Distribution")
plotGender(X_train["Gender"])   
plt.subplot(1,2,2)
plt.title("ICU distribution")
plotUnit(X_train)

#plt.savefig("additional.png", dpi=400)

### Observations
**Vital signs:**
+ **HR**, **Temp**, and **Resp** seem to differ between septic and non-septic patients
+ The rest of the attributes don't differ and may be irrelevant in terms of sepsis prediction <br>

**Laboratory values:**
+ Despite a similar mean, **BaseExcess** deviates further from the mean for septic patients; an abnormal concentration of excess bicarbonate may be characteristic of septic patients
+ **FiO2** is represented by somewhat discrete values with a bimodal distribution. This, and the fact that only about 20% of the patients have a record of this value, makes this feature likely to be disregarded as non-representative
+ **pH** of septic patients appears to be higher (more basic pH)
+ **BUN** also appears to be higher in concentration for sepsis-positive patients
+ **Calcium** concentration, although similar for septic and non-septic patients, has outliers for septic patients at very low concentrations. We may want to investigate this further
+ **Bilirubin_direct** looks to be higher for septic patients, but we have to keep in mind that more than 96% of the patients lack this attribute. Nonetheless, it is possible that bilirubin concentration was measured only if doctors suspected it to be abnormal, and there are indeed some very high concentrations for septic patients
+ **Bilirubin_total** is higher for septic patients as well. It is worth noting that total bilirubin concentration is defined as the sum of **Bilirubin_direct** and indirect bilirubin. Therefore we may expect this feature to be strongly correlated with **Bilirubin_direct**
+ **Hct** and **Hgb** values seem to be slightly lower for septic patients
+ Septic patients sometimes appear to have slightly higher **PTT**
+ **Fibrinogen** of septic patients appears to be bimodal and slightly higher in concentration than that of non-septic patients. About 95% of patients are missing this feature, thus we might expect that fibrinogen was measured for some specific reason
+ Septic patients may have slightly lower concentrations of **Platelets**

**Demographics:**
+ No difference in **Age** between septic and non-septic patients. It is unlikely that a difference of one or a few years in age leads to a higher chance of developing sepsis. However, many biochemical signs tend to change with age, therefore it may be beneficial to include age in our model but to divide it into some discrete, more representative groups. A lot of age values are 100, which stands for patients older than 90.
+ **HospAdmTime** is quite similar for all patients
+ Patients who stayed in the ICU longer had a higher chance of eventually developing sepsis
+ Patients that didn't have a record of an ICU unit were likely assigned to an ICU other than the SICU and MICU (e.g. Cardiac ICU, Trauma ICU etc.), as all the patients have a record of time spent in the ICU (**ICULOS** attribute).

---

Let's see if the type of **ICU** that a patient is treated in is related to chances of developing a sepsis:

In [None]:
def CombineUnits(units_cols):
    # collapse the Unit1/Unit2 flags into a single categorical "Unit" column
    data = units_cols.copy()
    data["Unit"] = "Other ICU"  # default: no unit recorded
    data.loc[data["Unit1"] == 1, "Unit"] = "MICU"
    data.loc[data["Unit2"] == 1, "Unit"] = "SICU"
    return data[["Unit"]]


def ShareSepticByUnit(UnitCol, y):
    shares = {}
    
    IsSepsis_micu = y.loc[UnitCol["Unit"] == "MICU"]
    IsSepsis_sicu = y.loc[UnitCol["Unit"] == "SICU"]
    IsSepsis_other = y.loc[UnitCol["Unit"] == "Other ICU"]
    
    shares["MICU"] = IsSepsis_micu[IsSepsis_micu == 1].count() / len(IsSepsis_micu) * 100
    shares["SICU"] = IsSepsis_sicu[IsSepsis_sicu == 1].count() / len(IsSepsis_sicu) * 100
    shares["Other"] = IsSepsis_other[IsSepsis_other == 1].count() / len(IsSepsis_other) * 100
        
    return shares


IsSeptic_shares = ShareSepticByUnit(CombineUnits(X_train.copy()), y_train)

In [None]:
plt.figure(figsize=(18,5))
plt.subplot(1,2,1)
sns.countplot(data=CombineUnits(X_train.copy()), x="Unit", hue=y_train)
plt.subplot(1,2,2)
plt.ylim([0,20])
plt.ylabel("Developed Sepsis (%)")
plt.yticks([i for i in range(0,21,2)])
sns.barplot(x=list(IsSeptic_shares.keys()), y=list(IsSeptic_shares.values()))

#plt.savefig("add2.png", dpi=400)

Not by much, but patients treated in the surgical ICU had a lower probability of developing sepsis. This feature may be useful for our model.

---

### Preliminary feature exclusion
Before we move any further with our analysis let's discuss if we want to disregard any features as irrelevant or non-representative (or both). 
+ **FiO2** (fraction of inspired oxygen) is missing for 82 % of the patients. The distribution of this feature also looks very unusual, perhaps indicating that this sample is quite non-representative. 
+ **EtCO2** (end tidal carbon dioxide) is missing for more than 95 % of the patients. This feature does not differ between septic and non-septic patients, and although there might be a specific reason to measure this parameter, it appears to be unrelated to sepsis occurrence. 
+ **SaO2** is a similar case to **EtCO2**: lots of missing values and no apparent difference between septic and non-septic patients, so we will drop this feature. 
+ **HospAdmTime** (hours between hospital admit and ICU admit) does not differ between positive and negative patients. Overall, most of the patients are delivered to the ICU as soon as they develop a critical condition, and it's probably irrelevant how long they have been present in a hospital in stable condition.
+ **TroponinI** is missing in most cases (>95%) and is reported to indicate heart and kidney failure rather than sepsis, thus we may exclude this feature



Now that we have excluded some features, let's see whether any numerical features correlate with each other:

In [None]:
#X_train.columns
correlation_features = ["HR", "O2Sat", "Temp", "SBP", "MAP", "DBP", "Resp", "BaseExcess", "HCO3", "pH", "PaCO2", "AST",
                       "BUN", "Alkalinephos", "Calcium", "Chloride", "Creatinine", "Bilirubin_direct", "Glucose", "Lactate",
                       "Magnesium", "Phosphate", "Potassium", "Bilirubin_total", "Hct", "Hgb", "PTT", "WBC",
                       "Fibrinogen", "Platelets", "Age"]

mat_corr = X_train[correlation_features].corr()

mask = np.zeros_like(mat_corr)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(26,22))
sns.heatmap(mat_corr, mask=mask, square=True, annot=True, fmt=".2f", center=0, linewidths=.5, cmap="RdBu")

#plt.savefig("cormat.png", dpi=400)

In [None]:
mat_corr["Bilirubin_total"].sort_values(ascending=False)[:3]

In [None]:
mat_corr["DBP"].sort_values(ascending=False)[:3]

In [None]:
mat_corr["SBP"].sort_values(ascending=False)[:3]

In [None]:
mat_corr["HCO3"].sort_values(ascending=False)[:4]

In [None]:
mat_corr["pH"].sort_values(ascending=False)[:4]

In [None]:
mat_corr["PaCO2"].sort_values(ascending=False)[:4]

In [None]:
mat_corr["Hgb"].sort_values(ascending=False)[:3]

+ Unsurprisingly, there is a high correlation between **Bilirubin_direct** and **Bilirubin_total**. **Bilirubin_direct** is missing for about 96 % of patients; having another feature that correlates highly with it enables us to exclude **Bilirubin_direct** from the classification model. 
+ There is a high correlation between **DBP** (diastolic BP), **MAP** (mean arterial pressure), and **SBP** (systolic BP). Mean arterial pressure is known ([source](https://doi.org/10.1007/s00134-009-1427-2)) to be the primary indicator of patient state in near-septic conditions, thus both **DBP** and **SBP** may be disregarded, as **MAP** is calculated from these two features. 
+ We can notice a correlation between three features: **pH**, **HCO<sub>3</sub>**, and **PaCO<sub>2</sub>**. Despite being correlated, the relation between these three features may tell us whether the patient is undergoing *acidosis* or *alkalosis*, and whether it is *metabolic* or *respiratory*. Thus, it's better to keep these features and maybe combine them into a categorical feature that states what type of acidosis/alkalosis has developed. 
+ Lastly, hemoglobin (**Hgb**) and hematocrit (**Hct**) values happen to be highly correlated. This is no surprise, because whilst hemoglobin expresses the concentration of hemoglobin in blood, hematocrit is the volume share of erythrocytes in blood. Research articles about sepsis diagnosis mostly concentrate on hemoglobin out of these two, while hematocrit is often used to assess anemia and other diseases. Let's stick with the experts and keep just **Hgb**.
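
The idea of combining **pH**, **HCO<sub>3</sub>**, and **PaCO<sub>2</sub>** into one categorical feature could look roughly like the sketch below. The cut-offs (pH 7.35-7.45, PaCO2 35-45 mm Hg, HCO3 22-26 mmol/L) are common textbook reference ranges that we assume here; they are not taken from the dataset. The rule is heavily simplified and only names the primary disturbance. `classify_acid_base` is a hypothetical helper and is not used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

def classify_acid_base(ph, paco2, hco3):
    """Very simplified primary-disorder rule; missing inputs give 'unknown'."""
    if np.isnan(ph) or np.isnan(paco2) or np.isnan(hco3):
        return "unknown"
    if ph < 7.35:
        if paco2 > 45:
            return "respiratory acidosis"
        if hco3 < 22:
            return "metabolic acidosis"
        return "acidosis (unclassified)"
    if ph > 7.45:
        if paco2 < 35:
            return "respiratory alkalosis"
        if hco3 > 26:
            return "metabolic alkalosis"
        return "alkalosis (unclassified)"
    return "normal"

# four made-up patients
demo = pd.DataFrame({
    "pH": [7.30, 7.50, 7.40, np.nan],
    "PaCO2": [38.0, 30.0, 40.0, 40.0],
    "HCO3": [18.0, 24.0, 24.0, 24.0],
})
demo["AcidBase"] = [
    classify_acid_base(p, c, h)
    for p, c, h in zip(demo["pH"], demo["PaCO2"], demo["HCO3"])
]
print(demo)
```

A real version would have to deal with mixed disturbances and with the large share of missing blood gas values, so treat this only as a starting point.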

---

Now, let's perform a chi-square test. This test will show us which features appear to be independent of the target label.

In [None]:
from sklearn.feature_selection import chi2

# chi2 requires non-negative inputs; BaseExcess and HospAdmTime, which can be negative, are not in the list
chi_cols = ['HR', 'O2Sat', 'Temp', 'SBP', 'MAP', 'DBP', 'Resp', 'EtCO2', 
       'HCO3', 'FiO2', 'pH', 'PaCO2', 'SaO2', 'AST', 'BUN',
       'Alkalinephos', 'Calcium', 'Chloride', 'Creatinine', 'Bilirubin_direct',
       'Glucose', 'Lactate', 'Magnesium', 'Phosphate', 'Potassium',
       'Bilirubin_total', 'TroponinI', 'Hct', 'Hgb', 'PTT', 'WBC',
       'Fibrinogen', 'Platelets', 'Age', 'ICULOS']

X_chi = X_train[chi_cols].copy()

# fill the gaps with the column median so that chi2 can run
imputer = SimpleImputer(strategy="median")
X_chi[chi_cols] = imputer.fit_transform(X_chi)

# chi2 returns (statistics, p-values), one entry per column
chi_stats, p_values = chi2(X_chi, y_train)

chi_dict = dict(zip(chi_cols, chi_stats))
p_dict = dict(zip(chi_cols, p_values))

In [None]:
p_dict

A p-value is the probability of observing a test statistic at least this extreme if the null hypothesis is true. In this case the null hypothesis is that a feature and the target label are independent. For feature selection we will use the most common significance level, 0.05 (95 % confidence). Thus, for all features with a p-value higher than 0.05 we cannot reject independence from the target, and these features may be disregarded. These features are: <br>
**EtCO2**<br>
**FiO2**<br>
**SaO2**<br>
**Chloride**<br>
**Bilirubin_direct**<br>
**Lactate**<br>
**Magnesium**<br>
**Phosphate**<br>
**Potassium**<br>
Other than that, we can see that **pH** and **PaCO2** also have p-values above 0.05. We mentioned that these features may predict acid-base disturbances in patients. Since we have the **BaseExcess** feature, **pH** becomes redundant; however, we can still use **PaCO2** to find out whether a patient has a metabolic or respiratory disturbance. Specifically, septic patients often develop a condition known as *Respiratory alkalosis with metabolic acidosis*.
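
To make the selection rule concrete, here is a small sketch on synthetic data: one count feature that depends on the label and one that does not. All names (`X_demo`, `y_demo`, `related`, `unrelated`) are made up for the illustration. Note that scikit-learn's `chi2` is designed for non-negative features such as counts, which is why Poisson counts are used here.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)

# chi2 expects non-negative features
related = rng.poisson(lam=5 + 10 * y_demo)   # mean depends on the label
unrelated = rng.poisson(lam=5, size=n)       # drawn independently of the label
X_demo = np.column_stack([related, unrelated])

stats, p_values = chi2(X_demo, y_demo)

alpha = 0.05
for name, p in zip(["related", "unrelated"], p_values):
    verdict = "keep" if p < alpha else "candidate to drop"
    print(f"{name}: p={p:.3g} -> {verdict}")
```

With a 0.05 threshold, a truly unrelated feature will still pass about once in twenty tries, so with 35 tested columns a few borderline results are expected by chance alone.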

---

Let's examine some outliers. We know that patients above 90 y.o. are marked as 100 y.o. This may be misleading; we may expect a difference in metabolic and immune processes between a 90 y.o. and, say, a 110 y.o. Therefore we will exclude patients older than 90 later in [feature engineering](#FE). Other outliers were observed in the **Calcium** and **PTT** features; let's check them:

In [None]:
X_train[X_train["Calcium"] < 2].head().T

In [None]:
X_train["Calcium"][X_train["Calcium"] < 2].count()

In [None]:
X_train[X_train["PTT"] > 150].head().T

In [None]:
X_train["PTT"][X_train["PTT"] > 150].count()

A cursory examination doesn't lead to any substantial finding. The only observation worth noting is the rather low temperatures recorded for patients with high **PTT**. We will not remove these observations for now.

---

<a id="FE"></a>
# 2. Feature Engineering

Now it's time to engineer our features. First, let's make a list of the features that are left after **EDA**. The list is below; there are still quite a lot of them.

**Vital signs (columns 1-8)** <br>
HR - Heart rate (beats per minute); <br>
O2Sat - Pulse oximetry (%); <br>
Temp - Temperature (Deg C)<br>
MAP - Mean arterial pressure (mm Hg)<br>
Resp - Respiration rate (breaths per minute)<br>
**Laboratory values (columns 9-34)**<br>
BaseExcess - Measure of excess bicarbonate (mmol/L)<br>
HCO3 - Bicarbonate (mmol/L)<br>
pH - N/A<br>
PaCO2 - Partial pressure of carbon dioxide from arterial blood (mm Hg)<br>
SaO2 - Oxygen saturation from arterial blood (%)<br>
AST - Aspartate transaminase (IU/L)<br>
BUN - Blood urea nitrogen (mg/dL)<br>
Alkalinephos - Alkaline phosphatase (IU/L)<br>
Calcium - (mg/dL)<br>
Chloride - (mmol/L)<br>
Creatinine - (mg/dL)<br>
Glucose - Serum glucose (mg/dL)<br>
Lactate - Lactic acid (mg/dL)<br>
Magnesium - (mmol/dL)<br>
Phosphate - (mg/dL)<br>
Potassium - (mmol/L)<br>
Bilirubin_total - Total bilirubin (mg/dL)<br>
Hgb - Hemoglobin (g/dL)<br>
PTT - partial thromboplastin time (seconds)<br>
WBC - Leukocyte count (count*10^3/µL)<br>
Fibrinogen - (mg/dL)<br>
Platelets - (count*10^3/µL)<br><br>
**Demographics (columns 35-40)**<br>
Age - Years (100 for patients 90 or above)<br>
Gender - Female (0) or Male (1)<br>
Unit - Administrative identifier for ICU unit<br>
ICULOS - ICU length-of-stay (hours since ICU admit)<br><br>

---

### Age

Let's begin with the most obvious - **Age**. It's unlikely that a difference in one or few years will result in significantly different metabolic processes, immune responses and overall fitness. Nonetheless, we may expect a teen and a senior to have different metabolism. Thus, let's divide age into categories:

In [None]:
# remove outliers
y_train = y_train.loc[X_train["Age"] <= 90]
X_train = X_train.loc[X_train["Age"] <= 90]

y_test = y_test.loc[X_test["Age"] <= 90]
X_test = X_test.loc[X_test["Age"] <= 90]

In [None]:
def discretizateAge(data):
    # teen, youth, adult, senior
    bins = [13, 18, 30, 60, np.inf]
    data = np.digitize(data, bins=bins)
    data = data.reshape(len(data), 1)
    return data

DiscretizateAge = FunctionTransformer(discretizateAge)
DiscretizateAge.fit_transform(X_train["Age"]).shape

In [None]:
age_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("discretizator", DiscretizateAge)
])

age_pipeline.fit_transform(X_train[["Age"]]).shape

In [None]:
CombineAllUnits = FunctionTransformer(CombineUnits)

units = ["Unit1", "Unit2"]

units_pipeline = Pipeline([
    ("combine", CombineAllUnits),
    # `sparse_output` replaced `sparse` in scikit-learn 1.2; use sparse=False on older versions
    ("encoder", OneHotEncoder(sparse_output=False))
])

units_pipeline.fit_transform(X_train[units]).shape

In [None]:
acidbase_features = ["BaseExcess", "PaCO2"]

def isAcidBaseDisturb(cols):
    cols = np.c_[cols, np.zeros(len(cols))]
    cols[:,2][(cols[:,0] < -2) & (cols[:,1] < 40)] = 1
    col = cols[:,2].reshape(len(cols), 1)
    return col

FindAcidosis = FunctionTransformer(isAcidBaseDisturb)
FindAcidosis.fit_transform(X_train[acidbase_features])

In [None]:
acidbase_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("acidosis", FindAcidosis)
])

acidbase_pipeline.fit_transform(X_train[acidbase_features]).shape

In [None]:
num_features = ["HR", "O2Sat", "Temp", "MAP", "Resp", "AST", "BUN",
                "Alkalinephos", "Calcium", "Creatinine", "Glucose", "Bilirubin_total", 
                "Hgb", "PTT", "WBC", "Fibrinogen", "Platelets", "ICULOS"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

num_pipeline.fit_transform(X_train[num_features]).shape

In [None]:
gender_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder())
])

gender_pipeline.fit_transform(X_train[["Gender"]])

In [None]:
preprocessing_pipeline = ColumnTransformer([
    ("numbers", num_pipeline, num_features),
    ("acidbase", acidbase_pipeline, acidbase_features),
    ("age", age_pipeline, ["Age"]),
    ("units", units_pipeline, units),
    ("gender", gender_pipeline, ["Gender"])
], verbose=True)

preprocessing_pipeline.fit_transform(X_train).shape

---

<a id="MS"></a>
# 3. Model Selection

Before we train our model we have to select an appropriate metric. Recall that just 7% of all patients in our dataset had sepsis, so accuracy would reward a model that predicts all 0s. We need a metric that penalizes false negatives, thus a good choice would be the *f1 score* :), which is defined as:<br>

$$ f_{1} = 2 \cdot \frac{recall \cdot precision}{recall + precision} $$
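
A quick worked example shows why accuracy is misleading here while f1 is not. The numbers are made up: 100 patients, 7 of them septic, which mirrors the 7% share in our data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 100 synthetic patients, 7 of them septic
y_true = np.array([1] * 7 + [0] * 93)

# a "model" that always predicts 0 looks great on accuracy and useless on f1
y_all_zero = np.zeros(100, dtype=int)
print(accuracy_score(y_true, y_all_zero))             # 0.93
print(f1_score(y_true, y_all_zero, zero_division=0))  # 0.0

# a model that finds 5 of the 7 septic patients and raises 3 false alarms
y_model = np.array([1] * 5 + [0] * 2 + [1] * 3 + [0] * 90)
precision = precision_score(y_true, y_model)  # 5 / 8
recall = recall_score(y_true, y_model)        # 5 / 7
f1_manual = 2 * recall * precision / (recall + precision)

# the manual formula and scikit-learn agree
print(f1_manual, f1_score(y_true, y_model))
```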

In [None]:
X_train = preprocessing_pipeline.fit_transform(X_train)

# reuse the statistics fitted on the training data; refitting on the test set would leak information
X_test = preprocessing_pipeline.transform(X_test)

In [None]:
logreg = LogisticRegression(verbose=1)
logreg.fit(X_train, y_train)
cv_logreg = cross_validate(logreg, X_train, y_train, cv=3, scoring="f1", return_train_score=True)
cv_logreg

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
cv_knn = cross_validate(knn, X_train, y_train, cv=3, scoring="f1", return_train_score=True)
cv_knn

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, verbose=1)
cv_rf = cross_validate(rf, X_train, y_train, cv=3, scoring="f1", return_train_score=True)
cv_rf

In [None]:
from xgboost import XGBClassifier

xgboost = XGBClassifier(n_estimators=150, use_label_encoder=False, scale_pos_weight=12, eval_metric="aucpr", verbosity=1, disable_default_eval_metric=1)
cv_xgboost = cross_validate(xgboost, X_train, y_train, cv=3, scoring="f1", return_train_score=True, verbose=1)
cv_xgboost

In [None]:
nn = MLPClassifier(max_iter=5000, hidden_layer_sizes=(50,50,50,50), verbose=0, learning_rate="adaptive")
cv_nn = cross_validate(nn, X_train, y_train, cv=3, scoring="f1", return_train_score=True, verbose=1)
cv_nn

It's evident that a simple logistic regression and the k-nearest neighbors classifier underfit the data, while both ensemble models and the neural network overfit.
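
One way to read under- and overfitting off the `cross_validate` results is to compare the mean train score with the mean validation score. The helper below is a hypothetical convenience function, and the two dictionaries contain made-up numbers that only mimic the two patterns; they are not the scores obtained above.

```python
import numpy as np

def summarize_cv(name, cv_result):
    """Print mean train/validation score and their gap for one cross_validate result."""
    train = np.mean(cv_result["train_score"])
    valid = np.mean(cv_result["test_score"])
    print(f"{name}: train={train:.2f} valid={valid:.2f} gap={train - valid:.2f}")
    return train, valid

# made-up numbers, only to show the two patterns
underfit = {"train_score": [0.20, 0.22, 0.21], "test_score": [0.19, 0.21, 0.20]}
overfit = {"train_score": [0.99, 0.98, 0.99], "test_score": [0.55, 0.57, 0.56]}

summarize_cv("underfitting example", underfit)  # both scores low, small gap
summarize_cv("overfitting example", overfit)    # high train score, large gap
```

In the notebook the same function can be called on `cv_logreg`, `cv_knn`, `cv_rf`, `cv_xgboost`, and `cv_nn`, since all of them were computed with `return_train_score=True`.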

<a id="FTSM"></a>
# 4. Tuning the model's hyperparameters

In [None]:
params = {"n_estimators": [150, 200],"max_delta_step": [0.1], "subsample": [None, 0.5, 1], "reg_lambda": [1, 1.1], "alpha": [0, 0.1]}

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=xgboost, param_grid=params, verbose=2, scoring="f1", cv=2)
grid_search = grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_params_

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

xgboost = XGBClassifier(**grid_search.best_params_, use_label_encoder=False, scale_pos_weight=12, eval_metric="aucpr", verbosity=1, disable_default_eval_metric=1)
xgboost.fit(X_train, y_train)

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(xgboost, X_test, y_test, cmap="Blues", ax=ax)
plt.savefig("conf.png", dpi=400)

In [None]:
y_pred = xgboost.predict(X_test)

In [None]:
from sklearn.metrics import recall_score


print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

In the end we got a recall of 77% and an f1 of 69%. A recall of 77% means that we can successfully identify almost 8 out of 10 patients who will develop sepsis within the next 6 days after the day on which the laboratory analyses were performed. Another detail worth mentioning is that fine-tuning xgboost didn't help much in reducing overfitting. A possible next step to improve prediction is to check whether fine-tuning the neural network reduces its overfitting.