# Gerardo de Miguel González

## Feature Selection Proof of Concept

### References

**::GMG::** I've used the following main references:

  - [DataCamp](https://www.datacamp.com/community/tutorials/feature-selection-python) Beginner's Guide to Feature Selection in Python. *Learn about the basics of feature selection and how to implement and investigate various feature selection techniques in Python*. Sayak Paul. September 25th, 2018.
  - [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/) Introduction to Feature Selection methods with an example (or how to select the right variables?). Saurav Kaushik. December 1st, 2016.
  
**::GMG::** I've already tried a filter method, *without success* so far, I have to say. Now I'm moving on to *wrapper methods*, following the DataCamp tutorial.
  

### Libraries

In [1]:
import pandas as pd
import numpy as np

### Dataset

**::GMG::** You can [download the data from Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/pima-indians-diabetes-database.zip/1) as a zipped CSV file (which *includes a header* with the column names, by the way) via [the reference provided](https://www.kaggle.com/uciml/pima-indians-diabetes-database) in the DataCamp article, *if you have a Kaggle account*. I haven't checked it, but you should also be able [to use the Kaggle API](https://medium.com/@yvettewu.dw/tutorial-kaggle-api-google-colaboratory-1a054a382de0) to automate the download from code *with an API key* created from your account.

**::GMG::** I've already downloaded the CSV dataset *manually from Kaggle* (with my account) and placed it in a `data` folder.

In [2]:
#::GMG::Location of the downloaded dataset csv file
!ls data

pima-indians-diabetes.csv


In [3]:
#::GMG::Dataframe
data = pd.read_csv("data/pima-indians-diabetes.csv")

In [28]:
#::GMG::Show some samples
data.head()

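| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |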


In [4]:
#::GMG::Show some samples
data.tail()

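| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |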


In [5]:
#::GMG::Get some statistics of the features and classification variable
data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### Wrapper: Recursive Feature Elimination (RFE)

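**::GMG::** Recursive Feature Elimination (RFE) works by *recursively removing attributes* and *building a model* on the attributes that remain. At each step it ranks the remaining attributes by the weights the fitted model assigns to them (`coef_` or `feature_importances_`) and drops the weakest, until the requested number of attributes is left. The DataCamp tutorial describes this as using the *model accuracy* to identify which attributes *contribute the most* to predicting the target attribute, but in scikit-learn accuracy only comes into play in the cross-validated variant, `RFECV`. See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). There's also a [user guide](https://scikit-learn.org/stable/modules/feature_selection.html#rfe) and two examples: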

 - [RFE on pixels](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_digits.html)
 - [Recursive feature elimination with cross-validation](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html)
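
To make the elimination loop concrete, here is a minimal hand-rolled sketch of the idea. It assumes that features are ranked by the absolute value of the fitted logistic regression coefficients and that one feature is dropped per round. The `manual_rfe` helper and the synthetic data from `make_classification` are illustrative assumptions, not part of the tutorial or of the diabetes dataset.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature until n remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the model on the surviving columns only
        model = clone(estimator).fit(X[:, remaining], y)
        # Binary logistic regression: coef_ has shape (1, n_remaining)
        weights = np.abs(model.coef_).ravel()
        # Remove the column with the smallest absolute coefficient
        del remaining[int(np.argmin(weights))]
    return remaining


# Synthetic stand-in data (assumption): 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

selected = manual_rfe(LogisticRegression(solver="liblinear"),
                      X_demo, y_demo, n_features_to_select=3)
print("Column indices kept by the hand-rolled loop:", selected)
```

The loop never looks at accuracy: only the size of the coefficients decides which column goes next.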

In [6]:
# Import your necessary dependencies
from sklearn.feature_selection import RFE 
from sklearn.linear_model import LogisticRegression

In [7]:
# Feature extraction
# Logistic Regression classifier to select the top 3 features.
model = LogisticRegression(solver = 'liblinear')
rfe = RFE(estimator = model, n_features_to_select = 3)

In [8]:
array = data.values
X = array[:,0:8]
Y = array[:,8]
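
The positional slices above rely on `Outcome` being the ninth and last column. As a small sketch of a by-name alternative, the snippet below rebuilds a tiny frame from the first three rows shown earlier (only two of the features, for brevity) and splits it using the column name. The names `demo`, `target` and `feature_names` are hypothetical.

```python
import pandas as pd

# First three rows of the dataset, restricted to two features for brevity
demo = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

target = "Outcome"
# Every column except the target is treated as a feature
feature_names = [c for c in demo.columns if c != target]

X_demo = demo[feature_names].values  # 2-D array: (n_samples, n_features)
y_demo = demo[target].values         # 1-D array: (n_samples,)

print(feature_names, X_demo.shape, y_demo.shape)
```

Keeping `feature_names` around also makes it easier to label the RFE results later on.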

In [9]:
fit = rfe.fit(X, Y)

In [10]:
print("Num Features: %s" % (fit.n_features_))

Num Features: 3
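
The number of features is fixed at 3 here because `n_features_to_select = 3` was passed in. If that number should be chosen from the data instead, scikit-learn's `RFECV` (the subject of the cross-validation example linked above) picks it by cross-validated score. The following is only a sketch on synthetic data; the 5-fold setting and the accuracy scoring are assumptions, not choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Eliminate one feature per step and score each subset size with 5-fold CV
rfecv = RFECV(estimator=LogisticRegression(solver="liblinear"),
              step=1, cv=5, scoring="accuracy")
rfecv.fit(X_demo, y_demo)

print("Number of features chosen by cross-validation:", rfecv.n_features_)
print("Support mask:", rfecv.support_)
```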


In [30]:
print("Features: %s" % 
      [values for values in zip(data.columns, 
                                fit.support_, 
                                fit.ranking_)
      ]
     )

Features: [('Pregnancies', True, 1), ('Glucose', False, 2), ('BloodPressure', False, 3), ('SkinThickness', False, 5), ('Insulin', False, 6), ('BMI', True, 1), ('DiabetesPedigreeFunction', True, 1), ('Age', False, 4)]


In [31]:
print("Selected Features: %s" % 
      [value for value, supported in zip(data.columns, fit.support_) if supported]
     )

Selected Features: ['Pregnancies', 'BMI', 'DiabetesPedigreeFunction']


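**::GMG::** RFE chose `Pregnancies`, `BMI`, and `DiabetesPedigreeFunction` as the top 3 features. These are *marked True* in the support array and *ranked 1* in the ranking array. The remaining features get ranks 2 and up: the higher the rank, the earlier the feature was eliminated, so the ranking gives a rough ordering of *the strength* of the features under this model. Keep in mind that coefficient magnitudes depend on the scale of each feature, and the features here are not standardized, which may influence the ranking.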
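
The relationship between `support_` and `ranking_` can be checked on a small example. The sketch below uses synthetic data (an assumption, not the diabetes dataset) and the default `step=1`, and lists the columns from the last survivor to the first one eliminated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

rfe_demo = RFE(estimator=LogisticRegression(solver="liblinear"),
               n_features_to_select=3).fit(X_demo, y_demo)

# The selected features are exactly the ones ranked 1
same_mask = np.array_equal(rfe_demo.support_, rfe_demo.ranking_ == 1)
print("support_ matches ranking_ == 1:", same_mask)

# Walk the columns in rank order: rank 1 = kept, highest rank = dropped first
for idx in np.argsort(rfe_demo.ranking_, kind="stable"):
    print(f"column {idx}: rank {rfe_demo.ranking_[idx]}, "
          f"selected={rfe_demo.support_[idx]}")
```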

In [14]:
#::GMG::This is the estimator/model used for RFE
#       May I use hyperparameters like C, penalty, ...?
fit.estimator_

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
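
On the question in the comment above: RFE works on a clone of the estimator it is given, so hyperparameters set on that estimator should carry over to every refit. The sketch below checks this for `C` on synthetic data. The value `C=0.1` is arbitrary and only for illustration, and other settings such as `penalty` would be passed the same way.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# C=0.1 is an arbitrary illustrative value (default is 1.0)
tuned = LogisticRegression(solver="liblinear", C=0.1)
rfe_tuned = RFE(estimator=tuned, n_features_to_select=3).fit(X_demo, y_demo)

# estimator_ is the clone fitted on the selected features only
print("C used by the final estimator:", rfe_tuned.estimator_.get_params()["C"])
print("Same object as the one passed in:", rfe_tuned.estimator_ is tuned)
```

Different hyperparameters can lead to a different set of selected features, since the coefficients that drive the elimination change with them.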

In [20]:
fs = fit.transform(X = X)

In [21]:
fs.shape

(768, 3)

In [22]:
type(fs)

numpy.ndarray

In [24]:
#::GMG::Get the 'new' dataset with the selected features
df_fs = pd.DataFrame(data = fs, 
                     columns = ('Pregnancies', 'BMI', 'DiabetesPedigreeFunction'))

In [25]:
df_fs['Outcome'] = data['Outcome']

In [26]:
df_fs.head()

| | Pregnancies | BMI | DiabetesPedigreeFunction | Outcome |
|---|---|---|---|---|
| 0 | 6.0 | 33.6 | 0.627 | 1 |
| 1 | 1.0 | 26.6 | 0.351 | 0 |
| 2 | 8.0 | 23.3 | 0.672 | 1 |
| 3 | 1.0 | 28.1 | 0.167 | 0 |
| 4 | 0.0 | 43.1 | 2.288 | 1 |


In [27]:
df_fs.shape

(768, 4)

In [29]:
#::GMG::Getting to know the ropes ...
fit.get_support()

array([ True, False, False, False, False,  True,  True, False])
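
The boolean mask returned by `get_support()` can also index the column names directly, which avoids typing the selected names by hand when building the reduced DataFrame. The sketch below shows the idea on synthetic data; the column names `f0` to `f7` and the `demo` frame are hypothetical stand-ins for the diabetes data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 8 features, binary target
X_arr, y_arr = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
demo["Outcome"] = y_arr

features = demo.drop(columns="Outcome")
rfe_demo = RFE(estimator=LogisticRegression(solver="liblinear"),
               n_features_to_select=3)
rfe_demo.fit(features.values, demo["Outcome"].values)

# Use the boolean mask to pick the matching column names
kept = features.columns[rfe_demo.get_support()]

# Build the reduced frame with names derived from the mask
df_demo = pd.DataFrame(rfe_demo.transform(features.values), columns=kept)
df_demo["Outcome"] = demo["Outcome"]
print(df_demo.head())
```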