 # **Feature Selection, SVM with parameter tuning and proper use of the time variable**

**Abstract:** In this exercise we will predict the heart failure of a patient. We will demonstrate the importance of understanding the features and use a Support Vector Machine (SVM) with parameter tuning to build a model. Most importantly we will construct a new feature from the *time* variable. This feature is strongly correlated with the response variable and in contrast to the original *time* variable it can be used properly as a feature.

# **1.) Introduction**

I understand this exercise as "Design a tool for the medical market". The tool should help a doctor, who has collected the required data of a patient diagnosed with CVD, to make proper decisions about further special treatment (e.g. surgery, medication).
Our aim is to use the data to train a machine learning model for classification. The response/target variable should be the *DEATH_EVENT*. 

Depending on the treatment a doctor might prefer a classifier with high recall or high precision. Let's assume for a second that the treatment has a low risk for the patient but might indeed prevent death. In such a case false positives are more acceptable than false negatives and we might prefer a classifier with high recall over one with high precision. If, however, the treatment carries a high risk of killing a patient who would have survived without it, false negatives would be more acceptable and we might desire a model with higher precision. Since the preferred trade-off is not known in advance, the AUC score, which summarizes performance over all decision thresholds, seems to be a good metric for measuring the performance of the classifier.
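To make the trade-off concrete, here is a minimal sketch that computes precision and recall from confusion-matrix counts. The counts and the helper name `precision_recall` are made up for illustration and do not come from the heart failure data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for two classifiers on the same 100 patients (30 deaths).
cautious = precision_recall(tp=27, fp=20, fn=3)   # flags many patients as at risk
strict = precision_recall(tp=15, fp=2, fn=15)     # flags only the clearest cases

print("cautious: precision=%.2f recall=%.2f" % cautious)
print("strict:   precision=%.2f recall=%.2f" % strict)
```

The "cautious" classifier would suit a low-risk treatment (few missed deaths), the "strict" one a high-risk treatment (few unnecessary interventions).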

Note that features have a medical and/or biophysical interpretation. Knowledge about the meaning and significance of the feature would be highly valuable.

I should mention beforehand that I view this notebook as an exercise and that a full-fledged solution requires significantly more effort. Thoughts on this can be found in the conclusion section.

Let's go.

In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix, classification_report, plot_roc_curve, accuracy_score, f1_score,roc_auc_score

The file is located at

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **2.) Preprocessing**

In [None]:
df = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")

df.info()

The data is already cleaned. There are only some binary features that we might prefer to be booleans.

In [None]:
print(df[['sex','smoking','diabetes','high_blood_pressure','anaemia','DEATH_EVENT']].head(5))

Let's convert them into booleans.

In [None]:
for feature in ['sex','smoking','diabetes','high_blood_pressure','anaemia','DEATH_EVENT']:
        df[feature]=df[feature].astype(bool)
df.info()

# **3.) Exploratory Data Analysis and Feature Selection**

In [None]:
print(df.describe())

Let us first count the number of death events.

In [None]:
print(df["DEATH_EVENT"].describe())

There are 203 (68%) patients that survived and 96 (32%) patients that died. Hence, the data set is skewed.
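As a small sanity check on why plain accuracy is a weak metric here, the sketch below uses only the two counts quoted above and computes the accuracy of a trivial classifier that always predicts "survived".

```python
# Counts quoted in the text above.
n_survived, n_died = 203, 96
n_total = n_survived + n_died

# A classifier that always predicts the majority class ("survived").
baseline_accuracy = max(n_survived, n_died) / n_total
print(round(baseline_accuracy, 3))  # 0.679
```

Any model has to beat this baseline, and the trivial classifier has zero recall for the death class, which is one reason to look at F1 or AUC instead.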

Next I would like to remark that some features come with units. Even if we do not have a medical interpretation at hand, this provides us at least with some physical information. We might for example try to construct new physically reasonable features via dimensional analysis. We don't care about the precise units at the moment since we will rescale the features anyway. A short explanation of the features (from a quick Google search) and their dimensions:

**The features:**

[age] = $T$

**Creatine Phosphokinase** (a.k.a., creatine kinase, CPK, or CK) is an enzyme (a protein that helps to elicit chemical changes in your body) found in your heart, brain, and skeletal muscles. When muscle tissue is damaged, CPK leaks into your blood.

[creatinine_phosphokinase] = $M L^{-3}$

**Ejection fraction** refers to how well the left heart ventricle (or right ventricle) pumps blood with each heart beat.

[ejection_fraction] = dimensionless


**Platelets** form a platelet plug to stop bleeding from an injured blood vessel. In cardiovascular disease, abnormal clotting occurs that can result in heart attacks or stroke. Blood vessels injured by smoking, cholesterol, or high blood pressure develop cholesterol-rich build-ups (plaques) that line the blood vessel; these plaques can rupture and cause the platelets to form a clot.

[platelets] = $L^{-3}$

The **serum creatinine** variable is a measure of kidney health.
From this [paper](https://www.ahajournals.org/doi/full/10.1161/01.str.28.3.557):
"Stroke risk was significantly increased at levels above 116 μmol/L (90th percentile) even after adjustment for a wide range of cardiovascular risk factors."

[serum_creatinine] = $ML^{-3}$


The **serum sodium** variable is a measure of the sodium content in the blood. From this [paper](https://pubmed.ncbi.nlm.nih.gov/29935992/):
"In the multivariate model, low-level serum sodium was associated with an increased risk of cardiovascular mortality (hazard ratio [HR], 1.10; 95% confidence interval [CI], 1.02-1.18 per standard deviation [SD]; P = 0.009), whereas a lower level of serum chloride was not (HR, 1.04; 95% CI, 0.97-1.12 per standard deviation; P = 0.278). Analyses with restrictive cubic splines yielded similar results."

[serum_sodium] = $L^{-3}mol$



More on **time** later

[time] = $T$ (surprise!)



where $T$, $L$, $M$ and $mol$ are time, length, mass and mole, respectively.

From Wikipedia: **Anemia** (also spelled anaemia) is a decrease in the total amount of red blood cells (RBCs) or hemoglobin in the blood, or a lowered ability of the blood to carry oxygen.
Here is a [medical paper](https://pubmed.ncbi.nlm.nih.gov/14531771/).

The remaining features should be clear.



The first 11 variables in the `df.info()` output above, or combinations thereof, can be used as feature variables. The variable *DEATH_EVENT* will be the response/target variable. The *time* variable requires deeper discussion and analysis.




**The time variable**:

This variable appears to be the time for which the patient has already been seeing the doctor for treatment.

In the discussion section it was argued that time is a response/target variable and that it should not be used as a feature variable. I would say that this is indeed up to interpretation. One might for example construct a new binary feature from time, say, by asking whether the treatment of the patient has already taken longer than a time $x$. The constructed variable is then a feature and might help a doctor decide if additional treatment is required. Whether such a new feature is helpful for our training will be analysed now.

In [None]:
# summary statistics of 'time' for survivors and for patients who died
for label, died in [('survived', False), ('died', True)]:
    t = df.loc[df['DEATH_EVENT'] == died, 'time']
    print(label, 'mean =', t.mean(), 'median =', t.median(), 'max =', t.max())

The surviving patients spent an average of 158 days in treatment, while those who died left this world after around 71 days on average (we also computed the median because it's less sensitive to outliers). We can conclude that patients who have already managed to stay alive for a certain time during treatment have a higher chance of survival. It is therefore reasonable to construct a new feature. For that purpose we use the intuitive idea of searching for a threshold which maximizes the absolute value of the linear correlation between the new variable and the variable *DEATH_EVENT*.
The new feature will be called *time_new*.

In [None]:
best_correlation = 0
for threshold in np.arange(20,240,1):
    df['time_new'] = (df['time'] >= threshold)
    correlation = df[['time_new','DEATH_EVENT']].corr().to_numpy()[0,1]
    # keep the threshold with the largest absolute correlation seen so far
    if np.abs(correlation) > np.abs(best_correlation):
        best_correlation = correlation
        best_threshold = threshold
                
print("Best threshold = " + str(best_threshold) + ' days' )
print("linear correlation = " + str(best_correlation) )

df['time_new'] = (df['time']>= best_threshold)
df = df.drop(['time'], axis=1)

The magic number is 74 days. The new feature is useful for a doctor and in addition it is strongly correlated with the response variable.
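One caveat: the threshold above is chosen using all rows, including those that later end up in the test set. The following sketch shows how the same search could be restricted to a training subset and the resulting threshold reused unchanged on held-out rows. The data is synthetic and the helper `best_threshold` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def best_threshold(time, event, candidates):
    """Return (t, r) where t maximises |corr(time >= t, event)|."""
    best_t, best_r = None, 0.0
    for t in candidates:
        flag = (time >= t).astype(float)
        if flag.std() == 0:          # constant flag: correlation is undefined
            continue
        r = np.corrcoef(flag, event)[0, 1]
        if abs(r) > abs(best_r):
            best_t, best_r = t, r
    return best_t, best_r

# Synthetic stand-in data (NOT the heart failure records): events occur early.
rng = np.random.default_rng(0)
event = rng.random(300) < 0.3
time = np.where(event, rng.integers(5, 120, 300), rng.integers(60, 280, 300))

# Search the threshold on the first 240 rows only, then apply it to the rest.
train = np.arange(300) < 240
t_star, r_star = best_threshold(time[train], event[train].astype(float), range(20, 240))
time_new_test = time[~train] >= t_star
print(t_star, round(r_star, 3))
```

With this arrangement the held-out rows play no part in choosing the cut-off, so the test score is not flattered by the feature construction.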

**Correlations:**

Let us now turn to selecting the other features. For that purpose we plot the linear correlations of all features with the response DEATH_EVENT.

In [None]:
print(df.corr()['DEATH_EVENT'])
plt.barh(np.arange(len(df.corr()['DEATH_EVENT'])), df.corr()['DEATH_EVENT'],align = 'center',tick_label = df.columns)
plt.xlim((-1,1))
plt.grid(axis='x')
plt.title('Pearson Correlation of features with response')
plt.show()

plt.clf()

The features that seem to offer the best potential for training a model are those with the highest absolute linear correlation with the response variable. We choose the following:

In [None]:
selected_features = ['time_new','age','ejection_fraction','serum_creatinine','serum_sodium']
print(selected_features)

Note that throwing away features is potentially dangerous. Even if two features $x_i$ and $x_j$ are not strongly linearly correlated with the response $y$, there could exist some combination $f(x_i,x_j)$ that has a strong correlation with $y$. I lack the medical knowledge needed to construct such features.

Note also that for a continuous feature and a binary response coded as 0/1, the Pearson correlation used above coincides with the point-biserial correlation. Like any linear measure, it can miss non-linear or threshold-like dependence.

This is why I decide here to put some more effort into the feature selection.
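The relation between the Pearson and the point-biserial correlation can be checked numerically. The sketch below uses synthetic data and the textbook point-biserial formula $r_{pb} = \frac{M_1 - M_0}{s_x}\sqrt{p(1-p)}$, where $M_1, M_0$ are the group means, $s_x$ is the population standard deviation and $p$ is the fraction of ones.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.random(200) < 0.3                 # binary variable (synthetic)
x = rng.normal(size=200) + 0.8 * y        # continuous variable (synthetic)

# Pearson correlation with y coded as 0/1
r_pearson = np.corrcoef(x, y.astype(float))[0, 1]

# Point-biserial correlation from group means and the population std of x
p = y.mean()
r_pb = (x[y].mean() - x[~y].mean()) / x.std() * np.sqrt(p * (1 - p))

print(r_pearson, r_pb)
```

Both numbers agree up to floating-point error, so switching to the point-biserial coefficient would not change the ranking of the features.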

**Random Forest for evaluation of feature importance:**
We can also use a Random Forest Classifier to evaluate the relevance of the features. Quick and dirty:

In [None]:
X = df.drop(['DEATH_EVENT'],axis=1)
y = df['DEATH_EVENT']



sc = StandardScaler()
X = sc.fit_transform(X)


rfc = RandomForestClassifier(n_estimators=1000, random_state=0)

rfc.fit(X, y)


importances = rfc.feature_importances_

print(df.drop('DEATH_EVENT', axis=1).columns)

plt.barh(np.arange(len(df.drop('DEATH_EVENT', axis=1).columns)), importances,align = 'center',tick_label = df.drop('DEATH_EVENT', axis=1).columns)
plt.xlim((0,0.5))
plt.grid(axis='x')
plt.title('Random Forest importances')
plt.show()

plt.clf()

The above analysis provides a slightly different picture than our analysis of the correlation. The Random Forest assigns more importance to the features *creatinine_phosphokinase* and *platelets*. I have two possible explanations:
1. The linear correlation misses a non-linear dependence between these continuous features and the binary response.
2. A combination of several features, including the above-mentioned ones, might be more strongly correlated with the response. This sounds reasonable since the decision trees in the random forest can explicitly take such combinations into account.

I have trust in both the spirit of the forest and the medical papers we cited above and add the features to the feature list.

In [None]:
selected_features.append('creatinine_phosphokinase')
selected_features.append('platelets')
print(selected_features)
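Impurity-based importances such as `feature_importances_` are computed on the training data and are known to favour continuous features with many distinct values. As a cross-check one could use permutation importance on held-out data. The sketch below uses synthetic data in which only the first column carries signal; it is an illustration, not a result for the heart failure data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only column 0 is related to the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Importance = drop in held-out accuracy when one column is shuffled.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(idx, round(result.importances_mean[idx], 3), round(result.importances_std[idx], 3))
```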

***smoking*, *high_blood_pressure*, *anaemia* and *sex***:

The remaining features *smoking*, *high_blood_pressure* and *anaemia* are known to increase the overall risk to be diagnosed with CVD. This does, however, not mean that these features necessarily increase the chance of a patient that has already been diagnosed to die.

There is also the *sex* feature that is not highly correlated with the *DEATH_EVENT*.

There might, however, be combinations of these features that have a higher correlation with the *DEATH_EVENT*. For example: What if a patient is a smoker that has high blood pressure at the same time? One should indeed check all such combinations. Note that two binary features $x_i$ and $x_j$ can be combined in several ways, we might, for example, choose the combination $x_i \land x_j$ or $\neg(x_i) \land x_j$, where $\neg$ is the negation operator.

I am too busy at the moment to write an algorithm that checks all such combinations and I will only check a few that appear reasonable to me.

In [None]:
df['new_feature']=df['smoking']&df['high_blood_pressure']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('smoking and high_blood_pressure correlation='+str(correlation))

df['new_feature']=df['anaemia']&df['sex']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('anaemia and male correlation='+str(correlation))

df['new_feature']=df['smoking']&df['sex']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('smoking and male='+str(correlation))

df['new_feature']=df['smoking']&(~df['sex'])
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('smoking and female='+str(correlation))

df['new_feature']=df['high_blood_pressure']&df['anaemia']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('high_blood_pressure and anaemia correlation='+str(correlation))

df['new_feature']=df['smoking']&df['high_blood_pressure']&df['sex']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('smoking, high_blood_pressure and male correlation='+str(correlation))

df['new_feature']=df['smoking']&df['high_blood_pressure']&df['anaemia']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('smoking, high_blood_pressure and anaemia correlation='+str(correlation))

df['new_feature']=df['smoking']&df['high_blood_pressure']&df['anaemia']&df['sex']
correlation = df[['new_feature','DEATH_EVENT']].corr().to_numpy()[0,1]
print('smoking, high_blood_pressure, anaemia and male correlation='+str(correlation))


The quick search by hand did not reveal anything too interesting. We therefore decide not to add any new features constructed from the features *smoking*, *high_blood_pressure*, *sex* and *anaemia*.
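For completeness, here is a sketch of how all pairwise AND-combinations of binary features, with and without negation, could be checked automatically. The helper `pairwise_flag_correlations` is a hypothetical name and the flags are synthetic. With the real data one would pass something like `{c: df[c].to_numpy() for c in ['smoking', 'high_blood_pressure', 'anaemia', 'sex']}`.

```python
from itertools import combinations, product
import numpy as np

def pairwise_flag_correlations(flags, target):
    """Correlate every (possibly negated) AND-pair of boolean flags with target."""
    rows = []
    for a, b in combinations(flags, 2):
        for neg_a, neg_b in product([False, True], repeat=2):
            # XOR with True negates the flag, XOR with False leaves it as is
            combo = (flags[a] ^ neg_a) & (flags[b] ^ neg_b)
            if combo.all() or not combo.any():   # constant: correlation undefined
                continue
            name = ("~" if neg_a else "") + a + " & " + ("~" if neg_b else "") + b
            r = np.corrcoef(combo.astype(float), target.astype(float))[0, 1]
            rows.append((name, r))
    # strongest absolute correlation first
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

# Synthetic flags (NOT the real data); the target mostly follows 'a & ~b'.
rng = np.random.default_rng(0)
flags = {k: rng.random(500) < 0.5 for k in ["a", "b", "c"]}
target = (flags["a"] & ~flags["b"]) ^ (rng.random(500) < 0.1)

ranked = pairwise_flag_correlations(flags, target)
for name, r in ranked[:3]:
    print(name, round(r, 3))
```

Note that scanning many combinations on the full data set invites spurious findings, so any hit should be confirmed on held-out data.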

We might also check combinations of the features *smoking, high_blood_pressure,anaemia* and *sex* with those we already have in the selected feature list. We leave this task for future investigations. 

**Final plot of the correlation matrix:**

Let us plot the correlation matrix for the features selected so far. The linear correlations among the selected features are not too strong. Thus there seems to be no reason to drop any of them. 

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df[selected_features].corr(), vmin=-1, vmax=1, cmap='seismic', annot=True)
plt.title('Correlation matrix')
plt.yticks(rotation=0)
plt.show()
plt.clf()

**Some scatter plots**:

We provide some scatter plots to visualize the problem.

In [None]:
# 2d scatter plot


plt.scatter(df.loc[df['DEATH_EVENT'] == 0]['age'],df.loc[df['DEATH_EVENT'] == 0]['ejection_fraction'], color='blue')
plt.scatter(df.loc[df['DEATH_EVENT'] == 1]['age'],df.loc[df['DEATH_EVENT'] == 1]['ejection_fraction'], color='red')
# legend entries follow the plotting order: survivors (blue) first, deaths (red) second
plt.legend(['DEATH_EVENT=0','DEATH_EVENT=1'])
plt.xlabel('age')
plt.ylabel('ejection_fraction')
plt.show()
plt.clf()

# 3d scatter plot

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')

for color,b in [('blue',0),('red',1)]:
    xs = df.loc[df['DEATH_EVENT'] == b]['serum_creatinine']
    ys = df.loc[df['DEATH_EVENT'] == b]['ejection_fraction']
    zs = df.loc[df['DEATH_EVENT'] == b]['age']
    ax.scatter(xs, ys, zs, color = color)

ax.set_xlabel('serum_creatinine')
ax.set_ylabel('ejection_fraction')
ax.set_zlabel('age')

plt.show()
plt.clf()

# **4.) Parameter tuned SVM with Gaussian kernel**
The number of features and the size of the data set suggest using a Support Vector Machine (SVM) with a Gaussian kernel. The performance of the model can be controlled by the inverse regularization coefficient $C$ and the width of the Gaussian kernel, which is regulated through the parameter $\gamma$.

Since the data is skewed and optimization of AUC seems to be numerically too expensive, we will try to optimize the model for the F1 score.
Our strategy: 
1. Get insights of the performance on the parameter space spanned by $C$ and $\gamma$ by plotting the F1 score over a large but rough grid. For that we will use *sklearn.model_selection.GridSearchCV*.
2. Use these insights to select a smaller portion of the parameter space and perform a finer search by using *sklearn.model_selection.RandomizedSearchCV*

**4. a) Train-test split**

We choose the features selected in the third chapter and perform a random train-test split. The method is not scale invariant, therefore we also perform feature scaling with a *StandardScaler*.

In [None]:
X = df[selected_features]
y = df['DEATH_EVENT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 6)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# reuse the training-set mean and variance; do not refit the scaler on the test set
X_test = sc.transform(X_test)
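A related point concerns cross-validation: if the scaler is fitted once on the whole training set, each validation fold has already influenced the scaling. A common remedy is to put the scaler and the SVM into a `Pipeline`, so that the scaler is refitted inside every fold. The sketch below uses synthetic data and a deliberately small grid; the step names `scale` and `svc` are arbitrary.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with very different feature scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 100.0 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is refitted on the training part of every CV fold only.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

test_f1 = search.score(X_te, y_te)   # the fitted pipeline scales X_te itself
print(search.best_params_, round(test_f1, 3))
```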

**4.b) Plot the grid** 

I found some of the code [here](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html#sphx-glr-auto-examples-svm-plot-rbf-parameters-py).

In [None]:
C_range = np.logspace(-2,3,11)
gamma_range = np.logspace(-4,1,11)


param_grid = dict(C=C_range,gamma=gamma_range)

# initialize the grid
svc = SVC(kernel='rbf')
grid = GridSearchCV(svc, param_grid, cv=10, scoring = 'f1')



#fit grid to training data
grid.fit(X_train,y_train)

cv_results_df = pd.DataFrame.from_dict(grid.cv_results_)

# ###########################################################################
# Plot heatmap of F1 score
# ###########################################################################
# Utility function to move the midpoint of a colormap to be around
# the values of interest.
# ###########################################################################
class MidpointNormalize(Normalize):

    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))
# ############################################################################
scores = grid.cv_results_['mean_test_score'].reshape(len(C_range),
                                                     len(gamma_range))
# ############################################################################
# The parameters vmin and midpoint control the colorbar of the heatmap
plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.6))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()

C_range_aux = np.logspace(-2,3,6)
gamma_range_aux = np.logspace(-4,1,6)
plt.xticks(np.arange(0,len(gamma_range),2), gamma_range_aux, rotation=45)
plt.yticks(np.arange(0,len(C_range),2), C_range_aux)
plt.title('Cross-validation F1 score')
plt.show()
plt.clf()

This is a typical figure for the cross-validation F1 score.

In [None]:
print('Best F1 score = ' + str(grid.best_score_) + ' at')
print(grid.best_params_)

Let's see if we can do better than that.

**4.c) Random search on a smaller but finer grid**

In [None]:
C_range = np.logspace(-1,4,200)
gamma_range = np.logspace(-4, 0,200)

param_dist = dict(C=C_range,gamma=gamma_range)

rand = RandomizedSearchCV(svc, param_dist, cv=10, scoring = 'f1',n_iter = 150 , random_state=42)
rand.fit(X_train,y_train)

print('Best F1 score = ' + str(rand.best_score_) + ' at')
print(rand.best_params_)
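Instead of drawing from a fixed list of 200 log-spaced values, `RandomizedSearchCV` also accepts continuous distributions. A possible variant, sketched here on synthetic data with the same parameter ranges as above and a reduced number of iterations, uses `scipy.stats.loguniform`:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the heart failure records).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Sample C and gamma uniformly on a log scale over the same ranges as above.
param_dist = {"C": loguniform(1e-1, 1e4), "gamma": loguniform(1e-4, 1e0)}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_dist, n_iter=20, cv=5,
                            scoring="f1", random_state=42)
search.fit(X_demo, y_demo)
print(search.best_params_)
```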

# **5.) Model evaluation**

We evaluate the confusion matrix on the test set. 

In [None]:
pred_svc = rand.predict(X_test)
print(confusion_matrix(y_test, pred_svc))

The classification report:

In [None]:
print(classification_report(y_test, pred_svc))

And the ROC curve

In [None]:
plot_roc_curve(rand, X_test, y_test)
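A remark on computing the AUC as a number: it is a ranking metric and should be fed continuous scores (for an SVM, the output of `decision_function`) rather than hard 0/1 predictions. The toy example below, with made-up labels and scores, shows that the two choices give different values.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and decision values for six patients.
y_true = [0, 0, 0, 1, 1, 1]
scores = [-2.0, -0.5, 0.3, -0.2, 0.8, 1.5]

# Hard predictions obtained by thresholding the scores at zero.
hard = [int(s > 0) for s in scores]

auc_scores = roc_auc_score(y_true, scores)   # uses the full ranking
auc_hard = roc_auc_score(y_true, hard)       # collapses the curve to one point
print(round(auc_scores, 3), round(auc_hard, 3))  # 0.889 0.667
```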

**Sanity Check**:

Something like this should be standard on Kaggle in my opinion.
It shows that I did not fine-tune the random state variables to optimize the scores.

In [None]:
svc_best = SVC(kernel='rbf', C=0.4500557675700499 , gamma=0.017027691722258997)

f1_scores = []
accuracy_scores = []
roc_auc_scores = []

for i in range(1, 200):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = i)
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    # scale the test set with the statistics of the training set
    X_test = sc.transform(X_test)
    svc_best.fit(X_train,y_train)
    pred_svc = svc_best.predict(X_test)
    f1_scores.append(f1_score(y_test,pred_svc))
    accuracy_scores.append(accuracy_score(y_test,pred_svc))
    # AUC is a ranking metric: use the continuous decision values, not the hard labels
    roc_auc_scores.append(roc_auc_score(y_test,svc_best.decision_function(X_test)))
    
    
plt.hist(f1_scores, bins=15)
plt.title('Distribution of F1 scores')
plt.show()
plt.clf()


plt.hist(roc_auc_scores, bins=15)
plt.title('Distribution of AUC scores')
plt.show()
plt.clf()

plt.hist(accuracy_scores, bins=15)
plt.title('Distribution of accuracy_scores')
plt.show()
plt.clf()
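A more compact way to obtain a similar distribution of scores is repeated stratified cross-validation, which also keeps the class ratio constant in every fold. This is a sketch on synthetic data with roughly the same class balance as our data set, not a replacement for the histograms above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with about 68% / 32% class balance.
X_demo, y_demo = make_classification(n_samples=300, n_features=7,
                                     weights=[0.68, 0.32], random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

# One F1 score per fold: 5 folds x 10 repeats = 50 values.
f1 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
print("F1: mean=%.3f std=%.3f (n=%d)" % (f1.mean(), f1.std(), len(f1)))
```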



# **6.) Diagnosis of SVM model**

We plot the error on the training and test set for an increasing number of data points. We might be able to see a general trend in the resulting diagram.

In [None]:
# randomly shuffle rows (sample returns a copy, so assign it back; the seed is an arbitrary choice)
df = df.sample(frac=1, random_state=0)
X = df[selected_features].astype('float64').to_numpy()
y = df['DEATH_EVENT'].astype('float64').to_numpy()

error_train = []
error_test = []

I = range(50, 298)

for i in I:
    X_train, X_test, y_train, y_test = train_test_split(X[0:i,:], y[0:i], test_size = 0.2, random_state = 42)
    X_train = sc.fit_transform(X_train)
    # scale the test set with the statistics of the training set
    X_test = sc.transform(X_test)
    svc_best.fit(X_train,y_train)
    pred_svc_train = svc_best.predict(X_train)
    pred_svc_test  = svc_best.predict(X_test)
    # misclassification rate: fraction of wrongly predicted labels
    error_train.append(np.mean(pred_svc_train != y_train))
    error_test.append(np.mean(pred_svc_test != y_test))

    
plt.plot(I,error_train, color='blue')
plt.plot(I,error_test,color='red')
plt.ylabel('Errors')
plt.xlabel('Number of data points')
plt.legend(['Error on training set','Error on test set'])
plt.show()
plt.clf()
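scikit-learn also ships a helper for this kind of diagnosis, `sklearn.model_selection.learning_curve`, which averages over several cross-validation folds and therefore gives smoother curves. A minimal sketch on synthetic data (the model and the sizes are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the heart failure records).
X_demo, y_demo = make_classification(n_samples=300, n_features=7,
                                     weights=[0.68, 0.32], random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Train on 20%, 40%, ..., 100% of each training fold and score on the held-out fold.
sizes, train_scores, test_scores = learning_curve(
    model, X_demo, y_demo, train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring="accuracy", shuffle=True, random_state=0)

# Convert accuracy to error rate, averaged over the folds.
train_error = 1 - train_scores.mean(axis=1)
test_error = 1 - test_scores.mean(axis=1)
for n, e_tr, e_te in zip(sizes, train_error, test_error):
    print(n, round(e_tr, 3), round(e_te, 3))
```

A test error that keeps falling as the sample grows would point towards collecting more data, while two curves that have already converged would point towards better features.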



# **7.) Conclusion**

The AUC score actually looks quite good. I am happy with the result.

What would help to improve the performance? Should we work more on the features or collect more data? The diagnosis suggests that we should work on the features instead of collecting more data.
Adding new features such as height and weight might definitely be worth a try.
I believe, however, that a true understanding, and hence a better solution to the problem, can be achieved by acquiring deeper insights into the meaning and medical significance of the features.
A quick Google search and some skimming on Wikipedia, however, reveals a rabbit hole which I don't want to enter at the moment. A proper solution, which might only be obtained after digging into the medical and biophysical processes, is beyond the scope of this simple exercise.
In other words: If I planned to put more effort into this exercise I would probably work on improving the features. If this were a real situation I would maybe look for consultation by a medical expert on CVD who knows the statistics of the problem.

I also liked [Furkan Gulsen's idea](https://www.kaggle.com/codeblogger/step-by-step-support-vector-machine-svm). He uses SMOTE to generate new observations to "un-skew" the data.

I would also like to mention that a medical doctor might prefer a model that outputs probabilities. The SVM is not the perfect choice if this were expected. After all, we might prefer ensemble learning to build the ultimate classifier tailored to the problem at hand.
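If one wanted to stay with the SVM, one option (a swapped-in technique, not something used in this notebook) is to calibrate its decision values into probabilities with Platt scaling via `CalibratedClassifierCV`. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the heart failure records).
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)

# Platt scaling ("sigmoid") fitted with 5-fold cross-validation on top of the SVM.
clf = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5)
clf.fit(X_demo, y_demo)

# Each row holds the estimated probabilities for class 0 and class 1.
proba = clf.predict_proba(X_demo[:5])
print(np.round(proba, 3))
```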


Thanks for taking the time to read through my first Kaggle notebook. I learned a valuable lesson from this exercise: The importance of understanding the problem and the features. Machine Learning is just a tool. Understanding the problem, on the other hand, might be the key to achieve a proper solution.


Constructive feedback of any kind is highly appreciated. The problem in this notebook was certainly quite interesting and I might come back to it in the future.