# Feature Selection

Author: Sahngyoon Rhee

Feature selection is, as the name suggests, the process of selecting a subset of the available features so that only the more relevant ones are used to build a machine learning model. It is useful for more reasons than just making training more efficient. When we know that a feature is uncorrelated with the target variable, it contributes no useful signal to the model, so it makes sense to remove it.

For example, suppose we are building a machine learning model for a company that predicts next quarter's sales of various products from the record of all previous customer transactions. The features of a transaction could include customer age, name of the product sold, number of units sold, date of transaction, sales amount, payment type (monthly billing or lump sum), and the weather on the date of the transaction. While weather can certainly be a factor in the sale of a physical commodity (e.g. an umbrella or instant noodles), it is unlikely to be helpful, and is probably irrelevant, in predicting the sale of a software product from the company.

Feature selection can also be used to remove redundant information from the set of feature variables.
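
As a rough illustration of removing redundant information, the sketch below drops one feature from every pair of features that are very strongly correlated with each other. The `breast_cancer` dataset and the cutoff of 0.95 are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Absolute pairwise correlations between features (the target is not used)
abs_corr = features.corr().abs()

# Keep only the upper triangle so that each pair is inspected once
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# Hypothetical cutoff: treat |r| > 0.95 as "redundant"
redundancy_threshold = 0.95
redundant = [col for col in upper.columns
             if (upper[col] > redundancy_threshold).any()]

# Drop every feature that is redundant with an earlier column
reduced = features.drop(columns=redundant)

print("Dropped as redundant:", redundant)
print("Remaining number of features:", reduced.shape[1])
```

Because a column is dropped only when it is strongly correlated with an earlier column, the result depends on column order; which member of a redundant pair to keep is a judgment call.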

## Various methods for Feature Selection

There are various methods for feature selection. They can be supervised or unsupervised. 

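Unsupervised feature selection methods do not use the target variable, which is why they are called unsupervised. A commonly cited example is Principal Component Analysis (PCA), which projects a set of given points onto a lower-dimensional subspace, effectively reducing the number of features. Strictly speaking, PCA builds new features as linear combinations of the original ones rather than keeping a subset of them, so it is often classified as dimensionality reduction (feature extraction) rather than feature selection proper.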
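
A minimal sketch of PCA with `sklearn` follows, assuming the `breast_cancer` dataset and an arbitrarily chosen target of 95% explained variance. The target variable is loaded but never passed to PCA, and the resulting columns are combinations of the original features, not a subset of them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# PCA is sensitive to scale, so standardize each feature first
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance
# (the 0.95 target is an assumption for illustration)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Variance explained:", pca.explained_variance_ratio_.sum())
```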

Supervised feature selection methods can be divided into the following three categories:

- Filter methods: These methods evaluate the relevance of each feature independently of the model, by looking at measures such as correlation coefficients.
- Wrapper methods: These methods evaluate feature subsets based on model performance. Two examples are forward and backward feature selection.
- Embedded methods: These methods perform feature selection during the model training process. Random Forest is an example of this.

Next, we shall look at an implementation of one method from each of the three classes of supervised feature selection methods.

### Filter methods

We shall look at how we can use the correlation coefficient to filter out less relevant features. We will use the `breast_cancer` dataset from `sklearn`.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# load the dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# create a pandas dataframe
df = pd.DataFrame(X, columns=cancer.feature_names)
df['target'] = y

# calculate the correlation matrix
corr_matrix = df.corr()

# Get the correlation of each feature with the target
target_corr = corr_matrix['target'].drop('target')

# Set a threshold for correlation (e.g., 0.2)
threshold = 0.2

# Select features with correlation above the threshold
selected_features = target_corr[abs(target_corr) > threshold].index.tolist()

# Print the selected features
print("Selected features:", selected_features)

# Create a new DataFrame with the selected features
X_selected = df[selected_features]

print("Original number of features:", X.shape[1])
print("Selected number of features:", X_selected.shape[1])

Selected features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'radius error', 'perimeter error', 'area error', 'compactness error', 'concavity error', 'concave points error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
Original number of features: 30
Selected number of features: 25


Notice that this method of feature selection is independent of the choice of model; indeed, we didn't even create a machine learning model in the above code.
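
The same filtering idea is also available through `sklearn`'s `SelectKBest`. The sketch below is an alternative to the correlation threshold used above, not a reproduction of it: it scores each feature with an ANOVA F-test (`f_classif`) and keeps a fixed number of features, where `k = 10` is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Score every feature against the target and keep the k highest-scoring ones
k = 10  # hypothetical choice
selector = SelectKBest(score_func=f_classif, k=k)
X_kbest = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns
kbest_features = cancer.feature_names[selector.get_support()]

print("Top", k, "features by F-score:", list(kbest_features))
print("Shape after selection:", X_kbest.shape)
```

As before, no model is trained here; each feature is scored on its own against the target.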

### Wrapper methods

In contrast to filter methods, which are independent of the choice of model, wrapper methods evaluate feature subsets based on model performance. We shall take a look at forward feature selection, which starts out with an empty set of features and adds one feature at a time based on a specified criterion (such as the model performance).

Backward feature selection works in a similar way, except that we start with the complete set of features and *remove* one feature at a time based on a specified criterion.
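
For backward selection, one option is `sklearn`'s `SequentialFeatureSelector` (available in scikit-learn 0.24 and later) with `direction="backward"`. The sketch below makes several assumptions to stay small and fast: it uses only the first 10 features of the `breast_cancer` dataset, a shallow decision tree, 3-fold cross-validation, and a fixed target of 5 features. Unlike a loop that stops when the score no longer improves, this version stops once the requested number of features remains.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()

# Use only the first 10 features to keep the sketch fast
X_small = cancer.data[:, :10]
y = cancer.target
names_small = cancer.feature_names[:10]

# A small, seeded model so that repeated runs give the same result
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# Start from all 10 features and remove one at a time until 5 remain
backward = SequentialFeatureSelector(
    tree,
    n_features_to_select=5,
    direction="backward",
    cv=3,
)
backward.fit(X_small, y)

kept = names_small[backward.get_support()]
print("Features kept by backward selection:", list(kept))
```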

Here, we take a look at an example using the `breast_cancer` dataset from `sklearn`.

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(X, columns=feature_names)

# Initialize the classifier
# (no random_state is set, so the selected features can vary between runs)
clf = RandomForestClassifier(n_estimators=50)

# Forward feature selection
selected_features = []
remaining_features = list(feature_names)
best_score = 0

while remaining_features:
    scores = []
    for feature in remaining_features:
        current_features = selected_features + [feature]
        X_subset = df[current_features]
        score = cross_val_score(clf, X_subset, y, cv=5).mean()
        scores.append((score, feature))
    
    # Find the best new feature
    best_new_score, best_new_feature = max(scores)
    
    if best_new_score > best_score:
        best_score = best_new_score
        selected_features.append(best_new_feature)
        remaining_features.remove(best_new_feature)
    else:
        break

print("Selected features:", selected_features)
print("Best cross-validation score:", best_score)

Selected features: ['worst concave points', 'worst radius', 'worst texture']
Best cross-validation score: 0.9613569321533924


### Embedded methods

Embedded methods perform feature selection while a chosen machine learning model is training. Random Forest is an example of an embedded method since it can provide feature importance scores as part of the model training.

Here is an example using the same dataset as above.

In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Initialize the Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
clf.fit(X, y)

# Get feature importances
feature_importances = clf.feature_importances_

# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the sorted feature importances
print(feature_importance_df)

# Select the top N features (e.g., top 10)
top_n_features = feature_importance_df.head(10)['Feature'].tolist()

print("Top 10 features:", top_n_features)

                    Feature  Importance
23               worst area    0.139357
27     worst concave points    0.132225
7       mean concave points    0.107046
20             worst radius    0.082848
22          worst perimeter    0.080850
2            mean perimeter    0.067990
6            mean concavity    0.066917
3                 mean area    0.060462
26          worst concavity    0.037339
0               mean radius    0.034843
13               area error    0.029553
25        worst compactness    0.019864
21            worst texture    0.017485
1              mean texture    0.015225
10             radius error    0.014264
24         worst smoothness    0.012232
5          mean compactness    0.011597
12          perimeter error    0.010085
28           worst symmetry    0.008179
4           mean smoothness    0.007958
19  fractal dimension error    0.005942
16          concavity error    0.005820
15        compactness error    0.005612
14         smoothness error    0.004722
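
The ranking above can also be turned into a selection step with `sklearn`'s `SelectFromModel`, which fits the model and keeps the features whose importance meets a threshold. This is a sketch under assumptions: the `"median"` threshold is chosen only for illustration, and it is a different rule from taking the top 10 features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

forest = RandomForestClassifier(n_estimators=100, random_state=42)

# threshold="median" keeps features whose importance is at least the median
sfm = SelectFromModel(forest, threshold="median")
X_embedded = sfm.fit_transform(X, y)

embedded_features = cancer.feature_names[sfm.get_support()]

print("Features kept:", list(embedded_features))
print("Shape after selection:", X_embedded.shape)
```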
