<a href="https://colab.research.google.com/github/slowvak/FeatureSelectionTutorial/blob/main/FeatureSelection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a notebook to explore feature selection methods.
It uses the Pima Indians Diabetes Database.
The first thing we do is import two libraries that provide important data-handling capabilities. Pandas reads the CSV (Comma Separated Values) file that holds our data. The first row of that file names the data in each column, and Pandas is 'smart' enough to recognize this header row. NumPy is a library for numerical computations on arrays. Interesting side note: it was written by Mayo graduate student Travis Oliphant. https://en.wikipedia.org/wiki/Travis_Oliphant
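
As a minimal sketch of that header handling, the snippet below reads a tiny in-memory CSV. The text in `csv_text` and the name `demo` are made up for illustration (two rows and three columns borrowed from the preview further down), not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, used only to illustrate header detection.
csv_text = "Glucose,BMI,Class\n148,33.6,1\n85,26.6,0\n"

# header=0 is the default: the first row supplies the column names.
demo = pd.read_csv(io.StringIO(csv_text))

print(list(demo.columns))  # column names taken from the first row
print(demo.shape)          # (rows, columns); the header row is not counted
```

If a file had no header row, passing `header=None` would stop Pandas from treating the first data row as names.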

In [1]:
import pandas as pd
import numpy as np

In [2]:
!wget -O diabetes.csv "https://www.dropbox.com/s/hr9b5rkjblrtu44/pima-indians-diabetes.csv?dl=1"


--2021-02-03 21:02:17--  https://www.dropbox.com/s/hr9b5rkjblrtu44/pima-indians-diabetes.csv?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:601a:18::a27d:712
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/hr9b5rkjblrtu44/pima-indians-diabetes.csv [following]
--2021-02-03 21:02:17--  https://www.dropbox.com/s/dl/hr9b5rkjblrtu44/pima-indians-diabetes.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc063254f8d6277f6d43916ad61c.dl.dropboxusercontent.com/cd/0/get/BIT1dnpvxC82-sUhCUNt0bKuQKvJlEu2zBe0zyYX3f02fuENRupo01ZsWOnovEbTqDa7BatKSmx7LKcsqpTZ9WQ80vyLNAPAeseEggtEd5rbO4rsHKc6nMVh5BENqrs29Vw/file?dl=1# [following]
--2021-02-03 21:02:18--  https://uc063254f8d6277f6d43916ad61c.dl.dropboxusercontent.com/cd/0/get/BIT1dnpvxC82-sUhCUNt0bKuQKvJlEu2zBe0zyYX3f02fuENRupo01ZsWOnovEbTqDa7Ba

In [3]:
data = pd.read_csv("diabetes.csv")
# print out the first few lines to make sure we got the data and that it looks reasonable
data.head()


   Pregnances  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Class
0           6      148             72             35        0  33.6                     0.627   50      1
1           1       85             66             29        0  26.6                     0.351   31      0
2           8      183             64              0        0  23.3                     0.672   32      1
3           1       89             66             23       94  28.1                     0.167   21      0
4           0      137             40             35      168  43.1                     2.288   33      1


Convert the dataframe into a feature matrix (X) and a vector of class labels (y), where 1 = diabetic and 0 = not diabetic.


In [31]:
array = data.values
X = array[:,0:8]
y = array[:,8]

# print the shape of the X feature array: (number of rows/examples, number of features)
X.shape

(768, 8)
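
Slicing by position works only while the column order stays fixed. A hedged alternative is to select by column name. The sketch below uses a small stand-in DataFrame (`demo`, three rows and three columns copied from the preview above) rather than the full file, and the names `X_demo` and `y_demo` are illustrative.

```python
import pandas as pd

# Stand-in for `data`: a few rows and columns from the preview above.
demo = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Class": [1, 0, 1],
})

# Everything except the label column becomes the feature matrix.
X_demo = demo.drop(columns=["Class"]).to_numpy()
# The label column on its own becomes the target vector.
y_demo = demo["Class"].to_numpy()

print(X_demo.shape, y_demo.shape)
```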

Before we can really evaluate feature selection methods, we need a classifier to show the impact of feature selection. We will use a support vector machine (SVM). We also need to split data into a set used to train the SVM, and a hold-out set to see how well it worked (the 'test' set). 
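
To make the split concrete, here is a small sketch on placeholder data with the same shape as `X` (768 rows, 8 columns). The random values and the names `X_demo`, `y_demo`, `X_tr` and `X_te` are assumptions for illustration; only the resulting sizes matter. It uses the same `test_size=0.3` and `random_state=109` as the cells that follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((768, 8))          # placeholder features
y_demo = rng.integers(0, 2, size=768)  # placeholder 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=109
)

# The test set gets ceil(0.3 * 768) rows; the rest go to training.
print(X_tr.shape, X_te.shape)
```

With 768 examples this leaves 537 rows for training and 231 for testing, so each test example is worth about 0.4 percentage points of accuracy.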


In [37]:
# Import train_test_split function
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import svm



In [44]:
# Define a function that trains an SVM classifier on a training split
# and then measures its accuracy on the held-out test split

def train_and_measure_performance(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=109) # 70% training and 30% test
    #Create a svm Classifier
    clf = svm.SVC(kernel='linear') # Linear Kernel

    #Train the model using the training sets
    clf.fit(X_train, y_train)

    #Predict the response for test dataset
    y_pred = clf.predict(X_test)
    # Model Accuracy: how often is the classifier correct?
    return (metrics.accuracy_score(y_test, y_pred))

baseline_performance = train_and_measure_performance(X, y)
print (f'Baseline accuracy with all features is {baseline_performance}')

Baseline accuracy with all features is 0.7402597402597403


There are several 'scikit' libraries for scientific Python. One of these is scikit-learn, for machine learning. (Another that I use is scikit-image, for image processing.)
The link to the library is: https://scikit-learn.org/stable/

We will start with a filter method: the chi-squared test, which scores each feature by how strongly it depends on the class label.

In [45]:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


# Feature selection: score every feature and keep the 4 best
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)

# Summarize scores
np.set_printoptions(precision=3)

print ("Preg       Glucose	BP   SkinThk  Insulin	BMI	  DiabPed     Age")
print(fit.scores_)



Preg       Glucose	BP   SkinThk  Insulin	BMI	  DiabPed     Age
[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]


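So the features, ranked from highest to lowest chi-squared score, are: Insulin, Glucose, Age, BMI, Pregnancies, Skin thickness, BP, Pedigree. With k=4, the selected features are therefore Insulin, Glucose, Age and BMI.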
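
The ranking can also be derived in code rather than by eye. This sketch copies the scores from the output above into an array and sorts them; the short labels in `names` follow the printed header, and the variable names are illustrative.

```python
import numpy as np

names = ["Preg", "Glucose", "BP", "SkinThk", "Insulin", "BMI", "DiabPed", "Age"]
# Scores copied from the output above.
scores = np.array([111.52, 1411.887, 17.605, 53.108, 2175.565, 127.669, 5.393, 181.304])

# argsort is ascending, so reverse it to list the highest score first.
order = np.argsort(scores)[::-1]
ranked = [names[i] for i in order]
print(ranked)
```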

In [51]:
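# Now measure the performance with a reduced feature set.
# Drop the columns we don't want, plus the Class label, which is not a feature.
# This keeps Pregnances, Glucose, Insulin and BMI. Note that Age scored higher
# than Pregnancies above, so the strict top four would include Age instead.

chi_df = data.drop(columns=["BloodPressure","SkinThickness","DiabetesPedigreeFunction","Age", "Class"])
X_1 = chi_df.values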

perf = train_and_measure_performance(X_1, y)
print (f'Accuracy with best features using Chi-square is {perf}')

# TODO: try adding or removing features and see the impact on performance


Accuracy with 5 best features using Chi-square is 0.7619047619047619
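
Instead of dropping columns by hand, a fitted `SelectKBest` can report which columns it kept and return the reduced matrix directly. The sketch below runs on placeholder random data (`X_demo`, `y_demo`), so the columns it picks are meaningless; it only illustrates `get_support()` and `transform()`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
# Placeholder non-negative data (chi2 requires non-negative features).
X_demo = rng.integers(0, 50, size=(200, 8)).astype(float)
y_demo = rng.integers(0, 2, size=200)

selector_demo = SelectKBest(score_func=chi2, k=4).fit(X_demo, y_demo)

# Boolean mask over the original columns: True where a column is kept.
mask = selector_demo.get_support()
# transform() returns only the kept columns, in their original order.
X_kept = selector_demo.transform(X_demo)
print(mask, X_kept.shape)
```

Applied to the real `X`, the reduced matrix could be passed straight to `train_and_measure_performance`.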


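Next we will apply a wrapper method: Recursive Feature Elimination (RFE). RFE repeatedly fits an estimator, drops the least important feature, and refits until only the requested number of features remains.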
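
To show the idea behind the elimination loop, here is a hand-rolled sketch on placeholder data. It is a simplified illustration, not scikit-learn's implementation: it assumes a `LogisticRegression` estimator (chosen here only because it is quick to fit), uses the absolute coefficient as the importance, and builds labels that depend only on columns 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
# Only columns 0 and 1 determine the placeholder label.
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

remaining = list(range(X_demo.shape[1]))  # indices of surviving features
while len(remaining) > 2:
    model = LogisticRegression().fit(X_demo[:, remaining], y_demo)
    # Use the magnitude of each coefficient as the feature's importance.
    importance = np.abs(model.coef_[0])
    # Drop the single least important feature, then refit.
    remaining.pop(int(np.argmin(importance)))

print(remaining)
```

The four noise columns are removed one at a time, leaving the two columns that actually drive the label.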

In [52]:
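# Import the necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression

# RFE needs an estimator that exposes feature weights (coef_);
# a linear-kernel SVR provides them.
estimator = SVR(kernel="linear")
# Eliminate one feature per step until 5 remain
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
# support_ is a boolean mask of the selected features; ranking_ gives 1 to
# selected features and larger numbers to features eliminated earlier.
# Only the last expression in a cell is displayed, so the output is ranking_.
selector.support_
selector.ranking_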


array([1, 1, 1, 3, 4, 1, 1, 2])
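
The ranking array is easier to read once it is paired with feature names. This sketch copies the array from the output above and reuses the short labels from the chi-squared header; the variable names are illustrative.

```python
import numpy as np

names = ["Preg", "Glucose", "BP", "SkinThk", "Insulin", "BMI", "DiabPed", "Age"]
# Ranking copied from the output above.
ranking = np.array([1, 1, 1, 3, 4, 1, 1, 2])

# Rank 1 marks the features RFE kept.
selected = [n for n, r in zip(names, ranking.tolist()) if r == 1]
# Higher ranks were eliminated earlier; list the rest from last dropped to first dropped.
eliminated = [n for r, n in sorted(zip(ranking.tolist(), names)) if r > 1]
print(selected)
print(eliminated)
```

So RFE keeps Pregnancies, Glucose, BP, BMI and Pedigree, and drops Insulin first, then Skin thickness, then Age. That differs noticeably from the chi-squared ranking, where Insulin scored highest.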