# Feature Selection Exercise

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

This module covers feature selection strategies in machine learning, focusing on the main approaches used to identify and select the most relevant features for model building.

## Learning Objectives

- Understand the three main categories of feature selection methods
  - Filter methods
  - Wrapper methods
  - Embedded methods
- Learn how different feature selection strategies evaluate and select features
- Recognize when to apply different feature selection approaches

### Tasks to be completed

- Review different feature selection methodologies
- Compare filter, wrapper and embedded approaches
- Understand the tradeoffs between selection methods
- Practice implementing feature selection techniques

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries

## Get Started

### Set up conda environment

Ensure that you have created the conda environment using the `conda_env.yml` file included in this repository, e.g.:

```
# Create conda environment
conda env create -f conda_env.yml

# Register the kernel
python -m ipykernel install --user \
    --name=nigms_sandbox_ud \
    --display-name "Python (NIGMS Sandbox UD)"
```

Then, when starting the notebook, select the `"Python (NIGMS Sandbox UD)"` kernel from the list.

Note that you may need to restart Jupyter Lab for these changes to take effect.

### Import necessary libraries


In [None]:
# Import necessary dependencies and settings
import warnings  # Import the warnings module to manage and control warning messages during script execution.

import numpy as np  # Import the NumPy library and alias it as 'np' for numerical operations, especially for handling arrays and matrices.
import pandas as pd  # Import the pandas library and alias it as 'pd' for data manipulation and analysis, particularly using DataFrames.
from sklearn.datasets import load_breast_cancer  # Import the load_breast_cancer function from scikit-learn to load the breast cancer dataset for demonstration or testing.
from sklearn.ensemble import RandomForestClassifier  # Import the RandomForestClassifier class from scikit-learn for using Random Forest, an ensemble learning method for classification.
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2  # Import feature selection techniques from scikit-learn:
#   - RFE (Recursive Feature Elimination) for feature ranking and selection by recursively fitting a model and removing the weakest features.
#   - SelectKBest for feature selection based on univariate statistical tests, selecting the top k features.
#   - VarianceThreshold for feature selection by removing features with low variance.
#   - chi2 for chi-squared statistic, often used for feature selection in classification problems with non-negative features.
from sklearn.linear_model import LogisticRegression  # Import the LogisticRegression class from scikit-learn for using logistic regression, a linear model for binary classification.
from sklearn.model_selection import cross_val_score  # Import the cross_val_score function from scikit-learn for evaluating model performance using cross-validation.

### Import _E. coli_ Dataset


In [None]:
# Define the file path for the Ecoli dataset
ecoli_data = "../../Data/ecoli.csv"

# Load the dataset into a pandas DataFrame
df = pd.read_csv(ecoli_data)

The _E. coli_ dataset is used to predict protein localization sites in _E. coli_.

```
Number of Instances:  336
Number of Attributes: 8 ( 7 predictive, 1 name )
Attribute Information.
  1. Sequence Name: Accession number for the SWISS-PROT database
  2. mcg: McGeoch's method for signal sequence recognition.
  3. gvh: von Heijne's method for signal sequence recognition.
  4. lip: von Heijne's Signal Peptidase II consensus sequence score (Binary attribute).
  5. chg: Presence of charge on N-terminus of predicted lipoproteins (Binary attribute).
  6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
  7. alm1: score of the ALOM membrane spanning region prediction program.
  8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
Missing Attribute Values: None.
Class Distribution. The class is the localization site.
  cp  (cytoplasm)                                    143
  im  (inner membrane without signal sequence)        77
  pp  (periplasm)                                     52
  imU (inner membrane, uncleavable signal sequence)   35
  om  (outer membrane)                                20
  omL (outer membrane lipoprotein)                     5
  imL (inner membrane lipoprotein)                     2
  imS (inner membrane, cleavable signal sequence)      2
```

You can learn more about the dataset here:

- Ecoli Dataset ([ecoli.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.data))
- Ecoli Dataset Description ([ecoli.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/ecoli.names))


## Introduction

Feature selection strategies can be divided into three main areas based on the type of strategy and
techniques employed:

- **Filter methods**: select features purely based on metrics like
  correlation, mutual information and so on. Popular methods include threshold based
  methods and statistical tests.
- **Wrapper methods**: capture interactions between multiple
  features by recursively building models on feature
  subsets and selecting the subset that gives the best-performing model.
  Backward elimination and forward selection are popular wrapper-based
  methods.
- **Embedded methods**: combine the benefits of the other
  two methods by leveraging Machine Learning models themselves to rank and score
  feature variables based on their importance. Tree based methods like decision trees
  and ensemble methods like random forests are popular examples of embedded
  methods.
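
As a rough illustration of the filter idea, the sketch below scores each feature by its absolute Pearson correlation with the label, without training any model. The synthetic data, the random seed, and all variable names are assumptions made for this example only.

```python
import numpy as np

# Synthetic data (an assumption for illustration): feature 0 carries the signal,
# features 1 and 2 are pure noise.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = np.column_stack(
    [
        signal + 0.1 * rng.normal(size=n),  # informative feature
        rng.normal(size=n),                 # noise feature
        rng.normal(size=n),                 # noise feature
    ]
)
y = (signal > 0).astype(int)  # binary label derived from the signal

# Filter step: score every feature independently of any model
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep the single highest-scoring feature
best = int(np.argmax(corr))
print("absolute correlations:", np.round(corr, 3))
print("best feature index:", best)
```

Because each feature is scored on its own, a filter like this is cheap to compute but cannot detect features that are only useful in combination.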


In [None]:
# Configure NumPy to suppress scientific notation when printing floating point numbers.
# This ensures that numbers close to zero are displayed as 0 instead of in scientific notation.
np.set_printoptions(suppress=True)

# Retrieve the current print options and store the value of the "threshold" setting.
# The threshold determines the number of array elements printed before summarization with "...".
pt = np.get_printoptions()["threshold"]

## Threshold based methods

This is a filter based feature selection strategy, where you can use some form of cut-off or thresholding for
limiting the total number of features during feature selection.


### Variance based thresholding

One way of using thresholds is variance-based thresholding, where features with low
variance (below a user-specified threshold) are removed.
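
For a 0/1 indicator column, the variance is determined entirely by how often the value 1 occurs: if a fraction `p` of the rows are 1, the (population) variance is `p * (1 - p)`. The sketch below applies that formula to the class counts listed in the dataset description above. It assumes the `site` column in `ecoli.csv` matches those counts, and the 0.15 cut-off is the one mentioned in the exercise cell further down.

```python
# Class counts copied from the dataset description (assumed to match ecoli.csv)
class_counts = {
    "cp": 143, "im": 77, "pp": 52, "imU": 35,
    "om": 20, "omL": 5, "imL": 2, "imS": 2,
}
n_samples = sum(class_counts.values())

threshold = 0.15
variances = {}
for site, count in class_counts.items():
    p = count / n_samples          # fraction of rows where the indicator is 1
    variances[site] = p * (1 - p)  # population variance of a 0/1 column

# Indicator columns whose variance exceeds the threshold would be kept
kept = [site for site, var in variances.items() if var > threshold]

for site, var in variances.items():
    print(f"{site:>4}: variance = {var:.4f}")
print("kept:", kept)
```

Rare classes produce indicator columns that are almost always 0, so their variance is close to zero and they fall below the threshold.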


In [None]:
# Get the shape (number of rows and columns) of the DataFrame
df.shape  # Returns a tuple (num_rows, num_columns)

# Convert the categorical variable "site" into dummy/indicator variables,
# creating a separate binary column for each unique category in "site".
ecoli_site = pd.get_dummies(df["site"])

# Display the first few rows of the newly created dummy variables
ecoli_site.head()

In [None]:
# Create a VarianceThreshold object to filter out low-variance features  
# Features with a variance lower than 0.15 will be removed  

vt = # Your code goes here

# Fit the VarianceThreshold object to the dataset (ecoli_site)  
# This calculates the variance of each feature  
vt.fit(ecoli_site)  


In [None]:
# Create a DataFrame to display the variance of each feature and whether it was selected.  
pd.DataFrame(
    {   
        "variance": vt.variances_,  # Store the variance values of the features.
        "select_feature": vt.get_support()  # Boolean mask indicating selected features.
    },  
    index=ecoli_site.columns,  # Use feature names as index for better readability.
).T  # Transpose the DataFrame to display features as columns.

In [None]:
# Get the final subset of selected features  
# `vt.get_support()` returns a boolean mask indicating which features were selected  
# `iloc[:, vt.get_support()]` filters the dataset to keep only the selected features  
ecoli_site_subset = ecoli_site.iloc[:, vt.get_support()]

# `head()` retrieves the first few rows for preview  
ecoli_site_subset.head()

## Statistical Methods


This section uses the Wisconsin Diagnostic Breast Cancer dataset. It is bundled with scikit-learn
and is also available in its raw format from the UCI Machine Learning Repository at
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).


In [None]:
# Load the breast cancer dataset from sklearn
bc_data = load_breast_cancer()

# Convert feature data into a Pandas DataFrame with appropriate column names
bc_features = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)

# Convert target labels into a DataFrame with a descriptive column name
# (scikit-learn encodes this target as 0 = malignant, 1 = benign)
bc_classes = pd.DataFrame(bc_data.target, columns=["IsBenign"])

# Build the feature set (X) and response class labels (y)
bc_X = np.array(bc_features)  # Convert features DataFrame to a NumPy array
bc_y = np.array(bc_classes).T[0]  # Convert target labels to a NumPy array and flatten it

# Print the shape of the feature set and response class labels for verification
print("Feature set shape:", bc_X.shape)
print("Response class shape:", bc_y.shape)

In [None]:
# Set the print threshold to limit the number of array elements displayed  
np.set_printoptions(threshold=30)

# Print the shape of the feature set  
print("Feature set data [shape: " + str(bc_X.shape) + "]")

# Print the feature set with values rounded to 2 decimal places  
print(np.round(bc_X, 2), "\n")

# Print the feature names  
print("Feature names:")
print(np.array(bc_features.columns), "\n")

# Print the shape of the response class label data  
print("Response class label data [shape: " + str(bc_y.shape) + "]")

# Print the response class label data  
print(bc_y, "\n")

# Print the response class name  
print("Response name:", np.array(bc_classes.columns))

# Reset the print threshold to its previous value (stored in variable 'pt')  
np.set_printoptions(threshold=pt)


The response class variable is binary: 1 indicates that the detected tumor was benign
and 0 that it was malignant. The 30 features are real-valued numbers that describe
characteristics of the cell nuclei present in digitized images of a breast mass.
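
The next cell ranks features with the chi-square test. As a rough sketch of what that score measures, the example below computes the statistic by hand on a tiny made-up array and compares it with scikit-learn's `chi2` function. The toy data and variable names are assumptions for illustration. The statistic treats each feature as a non-negative count or frequency, which is why it needs non-negative inputs.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data (an assumption for illustration): feature 0 differs strongly between
# the two classes, feature 1 is identical in every row.
X = np.array([[1.0, 5.0], [2.0, 5.0], [8.0, 5.0], [9.0, 5.0]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)

# Observed: total of each feature within each class
observed = np.array([X[y == c].sum(axis=0) for c in classes])

# Expected: the feature total split according to class frequency
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))

# Chi-square statistic per feature
manual = ((observed - expected) ** 2 / expected).sum(axis=0)

# Same statistic as computed by scikit-learn
scores, p_values = chi2(X, y)

print("manual scores: ", manual)
print("sklearn scores:", scores)
```

A feature whose per-class totals match what class frequency alone would predict scores zero, while a feature concentrated in one class scores high.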


In [None]:
# Use the chi-square test to evaluate the importance of features  
# and select the top 15 best features out of the 30 available features.

skb = # Your code goes here

# Fit the SelectKBest model to the feature matrix (bc_X) and target variable (bc_y)  
# to determine the most relevant features based on the chi-square test.
skb.fit(bc_X, bc_y)

In [None]:
# Create a list of tuples where each tuple contains a feature name and its corresponding score
feature_scores = [
    (item, score) for item, score in zip(bc_data.feature_names, skb.scores_)
]

# Sort the list of feature scores in descending order based on the score value (x[1])
# and return the top 10 most relevant features
sorted(feature_scores, key=lambda x: -x[1])[:10]

In [None]:
# Create a subset of the selected features obtained from our original feature set 
# using the chi-square test (via SelectKBest)
select_features_kbest = skb.get_support()  # Get the mask of selected features (True for selected)
feature_names_kbest = bc_data.feature_names[select_features_kbest]  # Get feature names corresponding to selected features

# Create a DataFrame with the selected features using the column names from the original feature set
feature_subset_df = bc_features[feature_names_kbest]

# Convert the DataFrame of selected features into a numpy array for further processing
bc_SX = np.array(feature_subset_df)

# Print the shape of the resulting feature subset array (rows, columns)
print(bc_SX.shape)

# Print the names of the selected features
print(feature_names_kbest)

In [None]:
# Select a feature subset of the Wisconsin Diagnostic Breast Cancer dataset using chi-square tests
# The 'feature_subset_df' contains the result of feature selection based on chi-square tests
# We are selecting a specific subset of rows (from row 20 to 25) and rounding the values to 2 decimal places
np.round(feature_subset_df.iloc[20:25], 2)

Let's now build a simple classification model using logistic regression on the original set of
30 features and compare its accuracy with that of a second model built on our 15 selected features.
For model evaluation, we use accuracy (the percentage of correct predictions) with five-fold
cross-validation. The main idea is to compare prediction performance between models trained on
different feature sets.


In [None]:
# Suppress warnings to keep the output clean (e.g., avoid displaying convergence warnings)
warnings.filterwarnings("ignore")

# Build a Logistic Regression model with a maximum of 1000 iterations for convergence

lr = # Your code goes here

# Evaluate accuracy of the model built on the complete feature set using 5-fold cross-validation
# 'cross_val_score' performs cross-validation and calculates accuracy for each fold
full_feat_acc = np.average(cross_val_score(lr, bc_X, bc_y, scoring="accuracy", cv=5))

# Evaluate accuracy of the model built on the selected feature set using 5-fold cross-validation
sel_feat_acc = np.average(cross_val_score(lr, bc_SX, bc_y, scoring="accuracy", cv=5))

# Print out the results
print("Model accuracy statistics with 5-fold cross validation")
# Output the accuracy of the model using the full feature set, along with the shape of the input features
print("Model accuracy with complete feature set", bc_X.shape, ":", full_feat_acc)
# Output the accuracy of the model using the selected feature set, along with the shape of the input features
print("Model accuracy with selected feature set", bc_SX.shape, ":", sel_feat_acc)

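Compare the two accuracy values. If the model trained on the 15 selected features matches or
outperforms the model trained on all 30, feature selection has produced a simpler model without
sacrificing accuracy. Exact values may vary with your scikit-learn version and solver settings.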
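
One caveat when reading these scores: the feature subset above was chosen using every row, including rows that later serve as validation folds, which can make the comparison slightly optimistic. A hedged sketch of one alternative is to wrap the selector and the classifier in a scikit-learn `Pipeline`, so that selection is refit inside each training fold. The values `k=10` and `max_iter=5000`, and the step names, are arbitrary choices for this illustration, not settings from the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Selection and classification are chained, so each cross-validation fold
# selects features using only its own training rows.
pipe = Pipeline(
    [
        ("select", SelectKBest(score_func=chi2, k=10)),  # k is illustrative
        ("clf", LogisticRegression(max_iter=5000)),      # max_iter is illustrative
    ]
)

fold_scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=5)
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```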


## Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a popular wrapper-based feature selection technique
that repeatedly eliminates the lowest-scoring features until a specified number of features remains. The basic idea is to start with a specific machine learning estimator,
such as the logistic regression algorithm we used for our classification needs. Next we take the entire feature set
of 30 features and the corresponding response class variable. RFE assigns a weight to each feature
based on the model fit. The features with the smallest weights are pruned, and the model is fit again on the remaining features to obtain new weights or scores. This process is repeated,
each time eliminating the features with the lowest scores or weights, until the pruned feature subset
contains the number of features the user asked for (an input parameter supplied at
the start). This strategy is also popularly known as backward elimination.
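
The loop described above can be sketched by hand. The example below is a simplified illustration and not scikit-learn's implementation: it uses synthetic data, standardizes the features so coefficient magnitudes are comparable, and drops the feature with the smallest absolute logistic regression coefficient in each round. The function name `backward_eliminate` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)


def backward_eliminate(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep features remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit the model on the features that are still in play
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        weights = np.abs(model.coef_[0])
        # Remove the feature with the smallest absolute weight
        weakest = int(np.argmin(weights))
        remaining.pop(weakest)
    return remaining


kept = backward_eliminate(X, y, n_keep=3)
print("Indices of the surviving features:", kept)
```

Refitting after every removal is what makes this a wrapper method: the ranking of the remaining features can change once a correlated feature has been dropped.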


In [None]:
# Initialize the Logistic Regression model to be used for feature selection
lr = LogisticRegression()

# Initialize RFE (Recursive Feature Elimination) with:
# - the Logistic Regression model (lr) as the estimator
# - n_features_to_select=15, meaning we want to select the top 15 features
# - step=1, which means we remove one feature at a time in each iteration

rfe = # Your code goes here

# Fit the RFE model on the breast cancer dataset (X: features, y: target)
rfe.fit(bc_X, bc_y)

In [None]:
# Obtain the final selected features from RFE (Recursive Feature Elimination)
select_features_rfe = rfe.get_support()  # Get a boolean array of selected features (True for selected, False for non-selected)

# Extract the names of the features that were selected by RFE
feature_names_rfe = bc_data.feature_names[select_features_rfe]  # Filter the feature names based on the selected ones

# Print the names of the selected features
print(feature_names_rfe)

In [None]:
# Compare the feature subset selected using SelectKBest (feature_names_kbest) 
# with the feature subset selected using Recursive Feature Elimination (RFE) 
# (feature_names_rfe). The intersection will show the common features between 
# these two subsets.
set(feature_names_kbest) & set(feature_names_rfe)

## Model based selection

Tree based models like decision trees and ensemble models like random forests (ensemble of trees) can
be utilized not just for modeling alone but for feature selection. These models can be used to compute
feature importances when building the model that can in turn be used for selecting the best features and
discarding irrelevant features with lower scores.


In [None]:
# Initialize a RandomForestClassifier object
# (the class was already imported at the top of the notebook)

rfc = # Your code goes here  

# Fitting the model on the training data (bc_X for features, bc_y for labels)
rfc.fit(bc_X, bc_y)  # Train the random forest model on the given feature set (bc_X) and target labels (bc_y)

In [None]:
# Use random forest estimator to score the features based on their importance
# The `rfc.feature_importances_` attribute provides the importance scores for each feature
importance_scores = rfc.feature_importances_

# Create a list of tuples where each tuple contains a feature name and its corresponding importance score
feature_importances = [
    (feature, score) for feature, score in zip(bc_data.feature_names, importance_scores)
]

# Sort the list of feature importances by the score in descending order and select the top 10 features
# The `lambda x: -x[1]` ensures the list is sorted in descending order based on the score
sorted(feature_importances, key=lambda x: -x[1])[:10]

You can now use a threshold based parameter to filter out the top n features as needed or you can even
make use of the SelectFromModel meta-transformer provided by scikit-learn by using it as a wrapper on
top of this model.
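
A minimal sketch of the `SelectFromModel` route is shown below. It loads the breast cancer data again so that it runs on its own. The `threshold="median"` setting, `n_estimators=100`, and `random_state=0` are illustrative assumptions, not values prescribed by this exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
X, y = data.data, data.target

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)

# Map the boolean support mask back to feature names
selected_names = data.feature_names[selector.get_support()]

print("Shape after selection:", X_selected.shape)
print("Selected features:", selected_names)
```

Other thresholds, such as `"mean"` or a numeric cut-off, change how many features survive, so it is worth checking the selected count before training a downstream model.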


## Conclusion

Feature selection is a critical step in machine learning that helps identify the most relevant features for model building. We learned about three main approaches:

- Filter methods that use metrics and statistical tests
- Wrapper methods that evaluate feature subsets recursively
- Embedded methods that combine benefits of other approaches using ML models

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
