# Preamble
This notebook illustrates some feature selection methods using the [Heart Failure Prediction dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). This dataset has 12 numerical features for each example, and the target for each example is whether or not death occurred (binary classification). **Although ML models will be trained and evaluated in this notebook, the primary focus is feature selection methods.** The topics covered and coded are as follows:

- Correlation plots
- L1 regularization for feature selection
- SelectKBest for feature selection
- Random forests for feature selection

*Note: some of this notebook and a [notebook](https://github.com/vvbauman/Feature-generation-selection/blob/master/Ad%20Click%20-%20Feature%20Generation%20and%20Selection.ipynb) I have published to GitHub use the same code and cover some of the same topics. I've also created a follow-up notebook that illustrates methods explaining feature importance using the same dataset and model developed in this notebook.*

# Data Overview
First we'll load the data, see the first few lines, and determine if there are any NaN values.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

data= pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv', delimiter= ',')
print(np.shape(data))
data.head()

In [None]:
#check for NaNs...
data.isnull().sum().sum()

This dataset has 299 examples, each characterized by 12 features. There are no NaN values anywhere in the dataset.

# Correlation Plot
We can visualize the correlation between each feature value and the target to get an idea of how each feature is related to the target.

In [None]:
#correlation of every column with the target (last row of the correlation matrix)
corr= data.corr().iloc[-1,:].to_numpy().reshape(-1,1)
sns.heatmap(corr, yticklabels=data.columns, xticklabels= 0)

We can see there is a moderate positive correlation between the target and the features age and serum_creatinine, and a moderate negative correlation between the target and the features ejection_fraction, serum_sodium, and time. When we try out our feature selection methods, let's see if these five features (age, serum_creatinine, ejection_fraction, serum_sodium, and time) are the ones selected. Before we get into the feature selection methods, we'll split the data into training/validation sets (90/10 split). We'll use only the training set for feature selection, so the validation set plays no part in choosing the features.
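
A heatmap is easy to read, but a sorted list makes the ranking explicit. The sketch below is a minimal illustration on synthetic data (the DataFrame `demo` and its column names are made up for this example, not taken from the heart failure dataset). It ranks features by the absolute value of their correlation with the target, so strong negative correlations are not overlooked.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's DataFrame; all column names are hypothetical
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "feat_pos": target + rng.normal(scale=1.0, size=n),    # positively related to the target
    "feat_neg": -target + rng.normal(scale=1.0, size=n),   # negatively related to the target
    "feat_noise": rng.normal(size=n),                      # unrelated to the target
    "target": target,                                      # label stored in the last column
})

# Pearson correlation of every feature with the target (drop the target's self-correlation)
target_corr = demo.corr()["target"].drop("target")

# Rank by strength regardless of sign, keeping the signed values for display
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Since the target is the last column of `data` in this notebook, the same ranking could be obtained by starting from `data.corr().iloc[-1, :-1]`.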

# Feature Selection - L1 Regularization
This method involves training a linear model that uses an L1 penalty. All features are used to train this model, and the L1 penalty drives the weights of uninformative features to exactly zero. We then extract the features with non-zero weights and use them in our machine learning model. This is a multivariate method: it considers all features together and how they collectively contribute to each prediction.

In [None]:
#declare the feature values and labels then split data into training/validation sets
from sklearn.model_selection import train_test_split

feats= data.iloc[:,:-1]
labels= data.iloc[:,-1]

x_train, x_devel, y_train, y_devel= train_test_split(feats, labels, test_size= 0.1, random_state= 20)

#train linear model with L1 penalty
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

lsvc= LinearSVC(C= 1.0, penalty= 'l1', dual= False).fit(x_train, y_train)
svc_mod= SelectFromModel(lsvc, prefit= True)

#get non-zeroed features
x_train_svc= svc_mod.transform(x_train) #training set w/non-zeroed features
selected_cols_svc= x_train.columns[svc_mod.get_support()] #names of the retained features

#get development set that has only the non-zeroed features
x_devel_svc= x_devel[selected_cols_svc]

#see which features were retained
print('Features retained: ', selected_cols_svc)

Using the L1 regularization method for feature selection results in 11 of the 12 features being retained! Once we've tried all feature selection methods, we'll try each of them in a machine learning model and compare their performance to a model that uses all features.
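
How many features survive depends on the strength of the penalty, which `LinearSVC` controls through `C` (smaller `C` means a stronger penalty). The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption made for illustration and not the heart failure data. The features are standardized first so that the penalty treats them on a common scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (illustration only): 12 features, of which 4 are informative
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)  # common scale before applying the penalty

# Smaller C means a stronger L1 penalty, so more weights are driven to exactly zero
n_kept = {}
for C in (0.001, 0.01, 0.1, 1.0):
    model = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000,
                      random_state=0).fit(X, y)
    n_kept[C] = int(np.sum(model.coef_ != 0))  # count of non-zero weights
    print(f"C={C}: {n_kept[C]} of {X.shape[1]} features kept")
```

If retaining 11 of 12 features is too permissive for a given use case, lowering `C` is one way to obtain a sparser feature set.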

# Feature Selection - SelectKBest
This next feature selection method scores each feature by the strength of its individual relationship with the target; here the score is the ANOVA F-statistic computed by `f_classif`. The k features with the highest scores are retained and used in the machine learning model. We will retain the top 5 features since this was the number of features with a moderate correlation with the target (as identified in the Correlation Plot section of this notebook). Unlike the L1 regularization method, this is a univariate method, meaning that each feature is considered individually/one-by-one.


In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
kbest_feats= SelectKBest(f_classif, k=5)

#get top 5 best features
x_train_kbest= kbest_feats.fit_transform(x_train, y_train)
selected_cols_kbest= x_train.columns[kbest_feats.get_support()] #names of the retained features

#get development set that has the top 5 features
x_devel_kbest= x_devel[selected_cols_kbest]

#see which features were retained
print('Features retained: ', selected_cols_kbest)

We can see that, for a binary target, this feature selection method is equivalent to choosing the k features with the highest absolute correlation with the target, because the ANOVA F-statistic increases with the squared correlation. We get the same features that we identified earlier as having the strongest correlations with the target.
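
The link between the F-statistic and the correlation can be checked numerically. The sketch below uses synthetic data from `make_classification` (an assumption for illustration). It relies on the standard two-group identity F = r^2 (n - 2) / (1 - r^2), where r is the Pearson correlation between a feature and the 0/1 target and n is the number of examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# ANOVA F-statistic of each feature against the binary target
f_scores, _ = f_classif(X, y)

# Pearson correlation of each feature with the 0/1 target
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# For two classes, F = r^2 * (n - 2) / (1 - r^2), which increases with |r|
n = X.shape[0]
f_from_r = r**2 * (n - 2) / (1 - r**2)

# Indices of the 5 best features under each criterion
top5_by_f = set(np.argsort(f_scores)[-5:])
top5_by_r = set(np.argsort(np.abs(r))[-5:])

print("F-statistics match the formula:", np.allclose(f_scores, f_from_r))
print("Same top 5 features:", top5_by_f == top5_by_r)
```

Note that the equivalence holds only when both quantities are computed on the same examples; in this notebook the correlation plot used the full dataset while SelectKBest used only the training set.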

# Feature Selection - Random Forest
The final feature selection method considered in this notebook is the random forest (an ensemble of decision trees). Decision trees naturally rank features by how well they distinguish classes. Features that best distinguish classes tend to be evaluated at nodes near the root of a tree. Based on this, if we prune a tree at a certain node, we can get a subset of the most informative features.

Implementing this feature selection method involves training a random forest and identifying the features that have an importance weight at or above some threshold. These features are the ones used in the machine learning model. We will use the median importance weight as our threshold, meaning that any feature with an importance weight greater than or equal to the median importance will be retained.
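
Before applying this to the heart failure data, the thresholding rule can be sketched on synthetic data (generated with `make_classification`, an assumption for illustration; `forest_demo` is a hypothetical name). The sketch compares the selector's mask against a manual comparison of the importances with their median.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (illustration only): 12 features, of which 4 are informative
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Train a small forest and read its impurity-based importances
forest_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest_demo.feature_importances_

# Wrap the already-fitted forest; features at or above the median importance are kept
selector = SelectFromModel(forest_demo, threshold="median", prefit=True)
mask = selector.get_support()

# The same mask computed by hand
manual_mask = importances >= np.median(importances)

print("Selector matches manual rule:", np.array_equal(mask, manual_mask))
print("Features kept:", int(mask.sum()), "of", X.shape[1])
```

With an even number of features, the median falls between the two middle importances, so this rule keeps roughly the top half of the features.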

In [None]:
from sklearn.ensemble import RandomForestClassifier

#create a random forest (it is trained inside the selector below)
forest= RandomForestClassifier(n_estimators= 1000, random_state= 20)

#fit the forest and keep the features with importance >= the median importance
forest_feats= SelectFromModel(forest, threshold= 'median')
forest_feats.fit(x_train, y_train)

#get training and development sets that have only the most important features
x_train_forest= forest_feats.transform(x_train)
x_devel_forest= forest_feats.transform(x_devel)

#see which features were retained
for i in forest_feats.get_support(indices= True):
    print(x_train.columns[i])

When we use a random forest as our feature selection method, 6 features are retained. Now that we've tried 3 different feature selection approaches, let's see which feature set gives us the best results for a machine learning model. The model we'll use is the Support Vector Machine with its default hyperparameter settings.

# SVM to Decide Feature Set

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score, precision_score

def eval_svm(train_feats, test_feats, train_labs, test_labs):
    """
    INPUT: train_feats and test_feats are either 2D numpy arrays or pd dataframes with the feature values for the train/test sets 
    train_labs, test_labs are either 1D numpy arrays or pd series with the corresponding labels to the train/test features
    
    OUTPUT: precision and recall are 1D numpy arrays holding the per-class precision and recall on the test set (index 0 is class 0, index 1 is class 1)
    """
    #scale features before using in SVM
    scaler= StandardScaler()
    train_feats_scale= scaler.fit_transform(train_feats)
    test_feats_scale= scaler.transform(test_feats)
    
    svm= SVC()
    svm.fit(train_feats_scale, train_labs)
    
    predicts= svm.predict(test_feats_scale)
    
    precision= precision_score(test_labs, predicts, average= None, zero_division= 0)
    recall= recall_score(test_labs, predicts, average= None, zero_division= 0)
    
    return precision, recall

#get performance of model that uses all features
prec_allfeats, rec_allfeats= eval_svm(x_train, x_devel, y_train, y_devel)

#get performance of model that uses features from L1 regularization
prec_svc, rec_svc= eval_svm(x_train_svc, x_devel_svc, y_train, y_devel)

#get performance of model that uses features from SelectKBest
prec_kbest, rec_kbest= eval_svm(x_train_kbest, x_devel_kbest, y_train, y_devel)

#get performance of model that uses features from random forest
prec_forest, rec_forest= eval_svm(x_train_forest, x_devel_forest, y_train, y_devel)

print('SVM precision and recall, all features: ', prec_allfeats, rec_allfeats, '\n'+'SVM precision and recall, L1 regularization features: ', prec_svc, rec_svc)
print('SVM precision and recall, top 5 features: ', prec_kbest, rec_kbest, '\n'+'SVM precision and recall, random forest features: ', prec_forest, rec_forest, '\n')   
print('Average recalls: ', np.mean(rec_allfeats), np.mean(rec_svc), np.mean(rec_kbest), np.mean(rec_forest))

To understand these results, we'll consider the first SVM that was trained using all of the features. The first two numbers are the precisions for the two classes: 0.9 is the precision for class 0 and 0.5 is the precision for class 1. The second two numbers are the recalls for the two classes: 0.78 is the recall for class 0 and 0.71 is the recall for class 1. 
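
To make the layout of these arrays concrete, here is a small hand-checkable sketch. The labels and predictions are made up for illustration and are not results from this notebook. With `average=None`, scikit-learn returns one value per class, ordered by class label.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score

# Made-up labels and predictions: four examples of class 0, two of class 1
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])

# average=None returns one score per class: index 0 is class 0, index 1 is class 1
precision = precision_score(y_true, y_pred, average=None, zero_division=0)
recall = recall_score(y_true, y_pred, average=None, zero_division=0)

# Class 0: all 3 predicted zeros are correct (precision 1.0), 3 of 4 true zeros found (recall 0.75)
# Class 1: 2 of 3 predicted ones are correct (precision 2/3), 2 of 2 true ones found (recall 1.0)
print("precision per class:", precision)
print("recall per class:", recall)

# The average of the per-class recalls is the balanced accuracy
print("average recall:", np.mean(recall))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```

The average recall printed at the end of the evaluation cell is therefore the same quantity that scikit-learn calls balanced accuracy.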

Based on the average recall, the model that used all features performed best and should be retained as the final model. If a hospital wanted to reduce the number of features measured for each patient while still predicting the occurrence of a death event, it could measure only the features in the top-5 feature set. The model that used the top 5 features achieved performance comparable to that of the model that used all features (0.73 vs 0.75 average recall).
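
A 10% validation split of 299 examples leaves only about 30 examples for evaluation, so a difference such as 0.73 vs 0.75 may not be stable. One possible way to check it is cross-validation with the feature selection step placed inside a pipeline, so that features are chosen using only the training folds. The sketch below shows the idea on synthetic data from `make_classification` (an assumption for illustration); it is not a result for the heart failure dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (illustration only): 12 features, of which 4 are informative
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Scaling and selection live inside the pipeline, so they are refit on each training fold
all_feats_model = make_pipeline(StandardScaler(), SVC())
top5_model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=5), SVC())

# "balanced_accuracy" is the average of the per-class recalls
cv_scores = {}
for name, model in (("all features", all_feats_model), ("top 5 features", top5_model)):
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the gap between the means with the spread across folds gives a rough sense of whether one feature set is meaningfully better than the other.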

# Conclusions and Next Steps
This notebook illustrated and provided sample code for 3 different approaches to feature selection: L1 regularization, SelectKBest, and random forest. When we tested an SVM model on the feature sets produced by these approaches, using all features gave the best performance. The second-best performance came from the model that used the feature set from the SelectKBest method.

**I've created a follow-up notebook that illustrates different ways to describe and quantify feature importance. In this follow-up notebook, the SVM model that used the features from the SelectKBest method developed in this notebook will be used.** Questions and feedback are always welcome in the comments!