# **Feature Selection Exercise Solution**

Feature selection strategies fall into three main categories, based on the type of strategy and
techniques employed:

* **Filter methods**: select features purely on the basis of metrics such as correlation or mutual information. Popular examples include threshold-based methods and statistical tests.
* **Wrapper methods**: capture interactions between features by repeatedly building models on different feature subsets and keeping the subset that yields the best-performing model. Backward elimination and forward selection are popular wrapper methods.
* **Embedded methods**: combine the benefits of the other two methods by using Machine Learning models themselves to rank and score features by importance. Decision trees and tree ensembles such as random forests are popular examples of embedded methods.

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

In [None]:
#Import necessary dependencies and settings
import numpy as np
import pandas as pd

# Print floats in fixed-point notation instead of scientific notation;
# values that round to zero at the current precision print as 0.
np.set_printoptions(suppress=True)

# Save the default print threshold so it can be restored later.
pt = np.get_printoptions()['threshold']

# Threshold based methods
Thresholding is a filter-based feature selection strategy: you apply a cut-off to some per-feature metric
to limit the total number of features kept during feature selection.

## Variance based thresholding

One way of using thresholds is variance-based thresholding, where features with low
variance (below a user-specified threshold) are removed.



### Ecoli Dataset

The Ecoli dataset is used for predicting protein localization sites in E. coli.
```
Number of Instances:  336 
Number of Attributes: 8 ( 7 predictive, 1 name )
Attribute Information.
  1. Sequence Name: Accession number for the SWISS-PROT database
  2. mcg: McGeoch's method for signal sequence recognition.
  3. gvh: von Heijne's method for signal sequence recognition.
  4. lip: von Heijne's Signal Peptidase II consensus sequence score (Binary attribute).
  5. chg: Presence of charge on N-terminus of predicted lipoproteins (Binary attribute).
  6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7. alm1: score of the ALOM membrane spanning region prediction program.
  8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
Missing Attribute Values: None.
Class Distribution. The class is the localization site.
  cp  (cytoplasm)                                    143
  im  (inner membrane without signal sequence)        77               
  pp  (periplasm)                                     52
  imU (inner membrane, uncleavable signal sequence)   35
  om  (outer membrane)                                20
  omL (outer membrane lipoprotein)                     5
  imL (inner membrane lipoprotein)                     2
  imS (inner membrane, cleavable signal sequence)      2
```

You can learn more about the dataset here:
* Ecoli Dataset ([ecoli.data](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.data))
* Ecoli Dataset Description ([ecoli.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.names))
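
Variance thresholding can be previewed from the class distribution alone. If the localization site is one-hot encoded into one binary column per class (as the cells below do), each column is a 0/1 indicator whose population variance is p(1 - p), where p is the fraction of rows in that class. The sketch below is only a sanity check: it assumes the downloaded CSV has the same class counts as the description above, and the 0.15 cut-off matches the one used later in this section.

```python
# Class counts copied from the dataset description above.
class_counts = {"cp": 143, "im": 77, "pp": 52, "imU": 35,
                "om": 20, "omL": 5, "imL": 2, "imS": 2}
n_rows = sum(class_counts.values())

threshold = 0.15  # same cut-off as the VarianceThreshold cell below

# A 0/1 indicator column with a fraction p of ones has variance p * (1 - p).
variances = {site: (count / n_rows) * (1 - count / n_rows)
             for site, count in class_counts.items()}

# Keep only the columns whose variance is strictly above the threshold.
selected = [site for site, var in variances.items() if var > threshold]

for site, var in variances.items():
    print(f"{site:4s} variance = {var:.4f}")
print("Columns expected to pass the threshold:", selected)
```

Rare classes give indicator columns that are almost always zero, so their variance is close to zero and they are the first to be dropped.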


In [None]:
# Download Ecoli dataset
!pip install wget
!python -m wget -o ecoli.csv "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/ecoli.csv"

df = pd.read_csv('ecoli.csv')

In [None]:
df.shape

In [None]:
# Convert categorical variable into dummy/indicator variables.
ecoli_site = pd.get_dummies(df['site'])
ecoli_site.head()

In [None]:
from sklearn.feature_selection import VarianceThreshold
# Create a VarianceThreshold object to remove the one-hot encoded
# features whose variance is less than 0.15

vt = VarianceThreshold(threshold=.15)
vt.fit(ecoli_site)

In [None]:
# Show each feature's variance and whether it was selected (True means its variance is above 0.15).
pd.DataFrame({'variance': vt.variances_,
              'select_feature': vt.get_support()},
            index=ecoli_site.columns).T

In [None]:
# Get the final subset of selected features and preview its first rows
ecoli_site_subset = ecoli_site.iloc[:, vt.get_support()]
ecoli_site_subset.head()

# Statistical Methods

This section uses the Wisconsin Diagnostic Breast Cancer dataset, which ships with scikit-learn and is also
available in its raw format from the UCI Machine Learning repository at
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

In [None]:
from sklearn.datasets import load_breast_cancer

bc_data = load_breast_cancer()
bc_features = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)
# In scikit-learn's encoding, target 1 means benign and 0 means malignant.
bc_classes = pd.DataFrame(bc_data.target, columns=['IsBenign'])

# build featureset and response class labels 
bc_X = np.array(bc_features)
bc_y = np.array(bc_classes).T[0]
print('Feature set shape:', bc_X.shape)
print('Response class shape:', bc_y.shape)

In [None]:
np.set_printoptions(threshold=30)
print('Feature set data [shape: '+str(bc_X.shape)+']')
print(np.round(bc_X, 2), '\n')
print('Feature names:')
print(np.array(bc_features.columns), '\n')
print('Response class label data [shape: '+str(bc_y.shape)+']')
print(bc_y, '\n')
print('Response name:', np.array(bc_classes.columns))
np.set_printoptions(threshold=pt)

The response class variable is binary: 1 indicates that the detected tumor was benign and 0 indicates that it
was malignant. The 30 features are real-valued numbers that describe characteristics of the cell nuclei present
in digitized images of a breast mass.
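
The label encoding is easy to get backwards, so it is worth checking directly. The short sketch below reloads the dataset on its own (so it does not depend on the variables above) and reads the mapping from `target_names`; the variable names `label_by_code` and `class_counts` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the label that integer code i stands for.
label_by_code = {code: str(name) for code, name in enumerate(data.target_names)}
print(label_by_code)

# Number of samples per integer code (index 0, then index 1).
class_counts = np.bincount(data.target)
print(class_counts)
```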

In [None]:
from sklearn.feature_selection import chi2, SelectKBest

# Use the chi-square test to score the features and keep the 15 best out of the 30.
# Note: chi2 requires non-negative feature values, which holds for this dataset.
skb = SelectKBest(score_func=chi2, k=15)
skb.fit(bc_X, bc_y)

In [None]:
# sort the scores to see the most relevant features
feature_scores = [(item, score) for item, score in zip(bc_data.feature_names, skb.scores_)]
sorted(feature_scores, key=lambda x: -x[1])[:10]

In [None]:
# Build the subset of the original feature set that the chi-square test selected
select_features_kbest = skb.get_support()
feature_names_kbest = bc_data.feature_names[select_features_kbest]
feature_subset_df = bc_features[feature_names_kbest]
bc_SX = np.array(feature_subset_df)
print(bc_SX.shape)
print(feature_names_kbest)

In [None]:
# Selected feature subset of the Wisconsin Diagnostic Breast Cancer dataset using chi-square tests
np.round(feature_subset_df.iloc[20:25], 2)

Let’s now build a simple classification model using logistic regression on the original feature set of 30
features and compare its accuracy with that of another model built using our 15 selected features. For model
evaluation, we will use the accuracy metric (percentage of correct predictions) with a five-fold
cross-validation scheme. The main idea here is to compare prediction performance between models trained on
different feature sets.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import warnings
# Silence all warnings (e.g. solver convergence warnings) to keep the output readable
warnings.filterwarnings('ignore')

# build logistic regression model with max_iter of 1000
lr = LogisticRegression(max_iter=1000)

# evaluating accuracy for model built on full featureset
full_feat_acc = np.average(cross_val_score(lr, bc_X, bc_y, scoring='accuracy', cv=5))
# evaluating accuracy for model built on selected featureset
sel_feat_acc = np.average(cross_val_score(lr, bc_SX, bc_y, scoring='accuracy', cv=5))

print('Model accuracy statistics with 5-fold cross validation')
print('Model accuracy with complete feature set', bc_X.shape, ':', full_feat_acc)
print('Model accuracy with selected feature set', bc_SX.shape, ':', sel_feat_acc)

In this run, the accuracy metrics show that the model trained on the 15 selected features performs better
than the model built with the original 30 features, despite using half as many inputs.
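
One caveat about this comparison: the chi-square selector was fit on all samples before cross-validation, so every validation fold had already influenced which features were chosen. A minimal sketch of one way to avoid this, assuming a scikit-learn pipeline fits the workflow, refits the selector inside each training fold. The `max_iter=5000` value is an arbitrary choice made here to give the solver room to converge on unscaled data; it is not taken from the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Selection and classification are chained, so cross_val_score refits
# SelectKBest on the training part of every fold.
pipe = make_pipeline(
    SelectKBest(score_func=chi2, k=15),
    LogisticRegression(max_iter=5000),  # assumption: generous iteration budget
)

scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=5)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

The resulting score can differ slightly from the one above because the selected features may change from fold to fold.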

# Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a popular wrapper-based feature selection technique that repeatedly
eliminates the lowest-scoring features until a specified number of features remains. The basic idea is to start
with a specific Machine Learning estimator, such as the Logistic Regression algorithm we used for our
classification needs. We then take the entire feature set of 30 features and the corresponding response class
variable. RFE assigns weights to these features based on the model fit. The features with the smallest weights
are pruned, and a model is fit again on the remaining features to obtain new weights or scores. This process is
repeated, each time eliminating the features with the lowest scores, until the pruned feature subset contains
the number of features the user asked for (an input parameter supplied at the start). This strategy is also
popularly known as backward elimination.
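
The elimination loop described above can be sketched by hand. This is an illustration of the idea, not scikit-learn's `RFE` implementation: the helper `backward_eliminate` is hypothetical, it drops one feature per round based on the absolute value of the logistic regression coefficient, and it standardizes the features first (an assumption made here so that coefficient magnitudes are comparable across features).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Assumption: put all features on the same scale before comparing weights.
X_scaled = StandardScaler().fit_transform(X)


def backward_eliminate(X, y, n_keep):
    """Hypothetical helper: drop the weakest feature until n_keep remain."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        weights = np.abs(model.coef_).ravel()  # one weight per remaining feature
        weakest = int(np.argmin(weights))      # position of the smallest weight
        remaining.pop(weakest)                 # prune it and refit next round
    return remaining


kept = backward_eliminate(X_scaled, y, n_keep=15)
print("Indices of the retained features:", kept)
```

Because of the scaling step, the retained features need not match those returned by the `RFE` cell below, which runs on the unscaled data.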

In [None]:
from sklearn.feature_selection import RFE

lr = LogisticRegression()
# select the top 15 features on our breast cancer dataset now using RFE.
rfe = RFE(estimator=lr, n_features_to_select=15, step=1)
rfe.fit(bc_X, bc_y)

In [None]:
# obtain the final selected features
select_features_rfe = rfe.get_support()
feature_names_rfe = bc_data.feature_names[select_features_rfe]
print(feature_names_rfe)

In [None]:
# compare this feature subset with the one we obtained using statistical tests 
# in the previous section and see which features are common among both these subsets
set(feature_names_kbest) & set(feature_names_rfe)

# Model based selection

Tree-based models such as decision trees, and ensemble models such as random forests (ensembles of trees), can
be used not only for modeling but also for feature selection. These models compute feature importances
while they are being built, and those importances can in turn be used to select the best features and
discard irrelevant features with lower scores.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Use a random forest to score and rank the features by importance.
# random_state is fixed (arbitrary seed) so that the ranking is reproducible.
rfc = RandomForestClassifier(random_state=42)
rfc.fit(bc_X, bc_y)

In [None]:
# Read the importance score the random forest assigned to each feature
# and display the 10 most important features
importance_scores = rfc.feature_importances_
feature_importances = [(feature, score) for feature, score in zip(bc_data.feature_names, importance_scores)]
sorted(feature_importances, key=lambda x: -x[1])[:10]

You can now apply a threshold to these scores to keep the top n features as needed, or you can
use the `SelectFromModel` meta-transformer provided by scikit-learn as a wrapper on
top of this model.
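
A minimal sketch of the `SelectFromModel` route follows. It reloads the data so that it runs on its own, and the choices of `threshold="median"` (keep features whose importance is at least the median importance) and `random_state=42` are assumptions made for illustration, not settings taken from the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
X, y = data.data, data.target

# Wrap the forest: SelectFromModel fits it and keeps the features whose
# importance is at or above the chosen threshold.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
selector.fit(X, y)

selected_names = data.feature_names[selector.get_support()]
X_selected = selector.transform(X)

print("Selected features:", list(selected_names))
print("Reduced feature matrix shape:", X_selected.shape)
```

A numeric threshold, or a string such as `"mean"`, can be passed instead when a different cut-off is wanted.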