<img src="./images/banner.png" width="800">

# Introduction to Feature Selection and Extraction

In machine learning, the quality and relevance of the features used to train models play a crucial role in determining the performance and generalization ability of the models. Feature selection and extraction are two important techniques that help in identifying and creating the most informative and discriminative features from raw data.


**Feature Selection** refers to the process of selecting a subset of relevant features from the original set of features. The goal is to reduce the dimensionality of the data by discarding irrelevant or redundant features, while retaining those that are most informative for the task at hand. Feature selection can be performed using various statistical, model-based, or domain-specific criteria.


**Feature Extraction**, on the other hand, involves transforming the original features into a new, typically lower-dimensional set of features, for example by projecting the data onto a small number of directions. The goal is to create new features that capture the most important information in the data while reducing its complexity. Feature extraction can be unsupervised or supervised, depending on whether the target variable is considered during the transformation process.


<img src="./images/feature-selection-extraction.png" width="800">

Both feature selection and extraction are essential steps in the machine learning pipeline, as they can lead to several benefits, including:

1. **Improved Model Performance**: By selecting the most relevant features and discarding irrelevant or redundant ones, we can improve the performance of machine learning models. This is because the models can focus on the most informative aspects of the data, leading to better generalization and accuracy.

2. **Reduced Overfitting**: When working with high-dimensional data, there is a risk of overfitting, where the model learns noise or irrelevant patterns in the training data. Feature selection helps in reducing the dimensionality of the data, mitigating the risk of overfitting and improving the model's ability to generalize to unseen data.

3. **Faster Training and Inference**: With fewer features, the training and inference times of machine learning models can be significantly reduced. This is especially important when dealing with large datasets or complex models, as it can lead to faster experimentation and deployment.

4. **Better Interpretability**: By selecting a subset of relevant features, we can gain insights into which factors are most important for the problem at hand. This enhances the interpretability of the model and helps in understanding the underlying patterns and relationships in the data.


The feature selection and extraction process typically involves the following steps:

1. **Data Preprocessing**: Before applying feature selection or extraction techniques, it is essential to preprocess the data. This may include handling missing values, scaling or normalizing the features, and encoding categorical variables.

2. **Feature Selection**:
   - **Filter Methods**: These methods assess the relevance of features based on statistical measures such as correlation, chi-square test, or mutual information. Features are ranked or selected based on their individual relevance to the target variable.
   - **Wrapper Methods**: These methods evaluate subsets of features by training and testing a model on each subset. The performance of the model is used as a criterion to select the best subset of features. Examples include recursive feature elimination and forward/backward feature selection.
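   - **Embedded Methods**: These methods combine feature selection with the model training process. The model itself is used to assess the importance of features during training. Examples include Lasso regression, whose L1 penalty can drive the coefficients of less useful features to exactly zero, and tree-based models, which expose feature importance scores. Ridge regression (L2 penalty) shrinks coefficients but does not set them to zero, so it ranks features rather than selecting them.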

3. **Feature Extraction**:
   - **Unsupervised Methods**: These methods aim to transform the original features into a lower-dimensional space while preserving the most important information. Examples include Principal Component Analysis (PCA) and t-SNE.
   - **Supervised Methods**: These methods consider the target variable when creating new features. Examples include Linear Discriminant Analysis (LDA) and supervised autoencoders.

4. **Model Training and Evaluation**: After selecting or extracting the most relevant features, the machine learning model is trained using the transformed dataset. The model's performance is evaluated using appropriate metrics and validation techniques to assess the effectiveness of the feature selection/extraction process.
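
The steps above can be chained into a single scikit-learn `Pipeline`. The following is a minimal sketch rather than a recommended configuration: the synthetic dataset from `make_classification`, the choice of `f_classif` as the score function, and `k=5` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 20 features, 5 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),                   # preprocessing
    ("select", SelectKBest(f_classif, k=5)),       # filter-style feature selection
    ("model", LogisticRegression(max_iter=1000)),  # model training
])

# The whole pipeline is refit inside each fold, so the held-out fold
# never influences which features are selected.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores.mean())
```

Putting the selector inside the pipeline, instead of selecting features once on the full dataset, keeps the evaluation honest: selection performed before the train/test split would leak information from the test data.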


Feature selection and extraction are iterative processes, and it may be necessary to experiment with different techniques and parameter settings to find the optimal set of features for a given problem.


In the following sections, we will dive deeper into specific feature selection and extraction techniques and explore practical examples using popular machine learning libraries.

**Table of contents**<a id='toc0_'></a>    
- [Feature Selection Techniques](#toc1_)    
  - [Filter Methods](#toc1_1_)    
    - [Pearson's Correlation](#toc1_1_1_)    
    - [Chi-Square Test](#toc1_1_2_)    
    - [Mutual Information](#toc1_1_3_)    
  - [Wrapper Methods](#toc1_2_)    
    - [Recursive Feature Elimination (RFE)](#toc1_2_1_)    
    - [Forward Feature Selection](#toc1_2_2_)    
    - [Backward Feature Elimination](#toc1_2_3_)    
  - [Embedded Methods](#toc1_3_)    
    - [Lasso Regression (L1 Regularization)](#toc1_3_1_)    
    - [Ridge Regression (L2 Regularization)](#toc1_3_2_)    
    - [Decision Tree-based Feature Importance](#toc1_3_3_)    
- [Feature Extraction Techniques](#toc2_)    
  - [Principal Component Analysis (PCA)](#toc2_1_)    
  - [Linear Discriminant Analysis (LDA)](#toc2_2_)    
  - [t-SNE (t-Distributed Stochastic Neighbor Embedding)](#toc2_3_)    
  - [Autoencoders](#toc2_4_)    
- [Summary and Takeaways](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Feature Selection Techniques](#toc0_)

Feature selection techniques help in identifying the most relevant features from a dataset, discarding irrelevant or redundant ones. There are three main categories of feature selection techniques: filter methods, wrapper methods, and embedded methods.


<img src="./images/feature-selection.png" width="800">

### <a id='toc1_1_'></a>[Filter Methods](#toc0_)


Filter methods assess the relevance of features based on statistical measures, independently of any machine learning algorithm. These methods rank or select features based on their individual relevance to the target variable.


#### <a id='toc1_1_1_'></a>[Pearson's Correlation](#toc0_)


Pearson's correlation coefficient measures the strength of the linear relationship between two variables. It ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship (a non-linear dependence may still exist). Features with high absolute correlation values with the target variable are considered more relevant.


Example using scikit-learn:


In [13]:
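from sklearn.feature_selection import SelectKBest, f_regression
import seaborn as sns

# Load the Titanic data. dropna() removes every row with a missing value
# in ANY column, which leaves 182 of the original 891 rows.
df = sns.load_dataset('titanic')
df.dropna(inplace=True)
X = df[['pclass', 'age', 'sibsp', 'parch', 'fare']]
y = df['survived']

# f_regression turns each feature's Pearson correlation with the target
# into an F-statistic; keep the 3 features with the highest scores.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)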

In [17]:
print(selector.scores_)

[ 0.25616762 12.10732158  1.86906751  0.06322858  3.12497225]
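
To make the link between the correlation coefficient and a filter score concrete, here is a small NumPy-only sketch. The data is synthetic and the helper `pearson_r` is a hypothetical name introduced for illustration. The last line uses the relationship F = r² / (1 - r²) · (n - 2), which is how an F-statistic can be derived from r for a single feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data (an assumption): one feature drives the target, one is pure noise.
x_linear = rng.normal(size=n)
x_noise = rng.normal(size=n)
target = 2.0 * x_linear + rng.normal(scale=0.5, size=n)

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered @ b_centered) / np.sqrt(
        (a_centered @ a_centered) * (b_centered @ b_centered)
    )

r_linear = pearson_r(x_linear, target)
r_noise = pearson_r(x_noise, target)
print(r_linear, r_noise)

# F-statistic derived from r for a single feature with n samples.
f_linear = r_linear**2 / (1 - r_linear**2) * (n - 2)
print(f_linear)
```

Ranking features by |r| or by this F-statistic gives the same order, because F increases monotonically with r².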


#### <a id='toc1_1_2_'></a>[Chi-Square Test](#toc0_)


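The chi-square test is used for categorical features. It measures the dependence between a categorical feature and a categorical target variable. Features with high chi-square values are considered more relevant. Note that scikit-learn's `chi2` requires non-negative feature values and treats them as counts or frequencies, so its scores grow with a feature's numeric magnitude. The example below reuses the same features for continuity; the large scores for `age` and `fare` partly reflect their larger numeric ranges, and in practice such continuous features would be binned first or scored with a different function.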


Example using scikit-learn:


In [22]:
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=3)
X_selected = selector.fit_transform(X, y)

In [23]:
print(selector.scores_)

[5.75328826e-02 7.86461550e+01 1.65701432e+00 7.59647047e-02
 2.28986385e+02]


#### <a id='toc1_1_3_'></a>[Mutual Information](#toc0_)


Mutual information measures the dependence between two variables, capturing both linear and non-linear relationships. Features with high mutual information scores with the target variable are considered more relevant.


Example using scikit-learn:


In [24]:
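from sklearn.feature_selection import SelectKBest, mutual_info_classif

# k must be smaller than the number of features (5) for any selection to happen.
# The estimator is randomized, so scores vary slightly between runs.
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)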

In [25]:
print(selector.scores_)

[0.03945778 0.0519945  0.03532459 0.01871036 0.05609225]
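
The difference between mutual information and Pearson's correlation shows up most clearly on a purely non-linear relationship. The sketch below uses synthetic data (an assumption) in which the target is a noisy square of one feature; `mutual_info_regression` is used because this illustrative target is continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000

# The target depends on x_quadratic only through its square; x_noise is unrelated.
x_quadratic = rng.uniform(-1, 1, size=n)
x_noise = rng.uniform(-1, 1, size=n)
target = x_quadratic**2 + rng.normal(scale=0.05, size=n)

features = np.column_stack([x_quadratic, x_noise])

# Pearson's r is close to zero for both features, because neither relationship is linear.
pearson = [np.corrcoef(features[:, j], target)[0, 1] for j in range(2)]

# Mutual information still detects the dependence on x_quadratic.
mi = mutual_info_regression(features, target, random_state=0)

print("Pearson r:", np.round(pearson, 3))
print("Mutual information:", np.round(mi, 3))
```

A correlation-based filter would discard `x_quadratic` here, while a mutual-information filter would keep it.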


### <a id='toc1_2_'></a>[Wrapper Methods](#toc0_)


Wrapper methods evaluate subsets of features by training and testing a machine learning model on each subset. The performance of the model is used as a criterion to select the best subset of features.


#### <a id='toc1_2_1_'></a>[Recursive Feature Elimination (RFE)](#toc0_)


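RFE recursively removes the least important features, where importance is taken from a fitted estimator (its coefficients or feature importance scores). It starts with all features, fits the estimator, eliminates the least important feature(s), and repeats until the desired number of features is reached. In the resulting `ranking_`, selected features have rank 1 and higher ranks were eliminated earlier. When importance comes from coefficient magnitudes, as with logistic regression, the features should be on comparable scales; otherwise features with large numeric ranges, such as `age` and `fare` below, tend to receive small coefficients and are eliminated first.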


Example using scikit-learn:


In [31]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

selector = RFE(estimator=LogisticRegression(), n_features_to_select=3)
X_selected = selector.fit_transform(X, y)

In [33]:
selector.ranking_

array([1, 2, 1, 1, 3])
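
As a hedged variation on the cell above, the sketch below standardizes the features before running RFE so that coefficient magnitudes are comparable. The synthetic dataset and the choice of `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 8 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Standardize so that coefficient magnitudes can be compared across features.
X_scaled = StandardScaler().fit_transform(X_demo)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier
```

With the default `step=1`, one feature is dropped per iteration, so the ranks of the eliminated features run from 2 up to 6 in this setup.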

#### <a id='toc1_2_2_'></a>[Forward Feature Selection](#toc0_)


Forward feature selection starts with an empty set of features and, at each step, adds the single feature whose inclusion most improves the model's performance. It stops when adding more features no longer improves performance, or when a preset number of features is reached.


#### <a id='toc1_2_3_'></a>[Backward Feature Elimination](#toc0_)


Backward feature elimination starts with all features and, at each step, removes the single feature whose removal hurts the model's performance the least. It stops when removing any further feature would noticeably degrade performance, or when a preset number of features is reached.
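
Both procedures can be sketched with scikit-learn's `SequentialFeatureSelector`. This is an illustration under assumptions: the data is synthetic, the estimator and `cv=3` are arbitrary choices, and the selector is configured to stop at a fixed number of features rather than at the point where performance stops improving.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption): 8 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
estimator = LogisticRegression(max_iter=1000)

# Forward: start from no features and greedily add the best one each round.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction="forward", cv=3
)
forward.fit(X_demo, y_demo)

# Backward: start from all features and greedily drop the least useful one each round.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction="backward", cv=3
)
backward.fit(X_demo, y_demo)

print("forward: ", forward.get_support())
print("backward:", backward.get_support())
```

The two directions are greedy searches and need not agree on the final subset. Both are considerably more expensive than filter methods, because every candidate feature requires a cross-validated model fit.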


### <a id='toc1_3_'></a>[Embedded Methods](#toc0_)


Embedded methods combine feature selection with the model training process. The model itself is used to assess the importance of features during training.


#### <a id='toc1_3_1_'></a>[Lasso Regression (L1 Regularization)](#toc0_)


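Lasso regression adds an L1 regularization term to the linear regression objective function. The L1 penalty encourages sparsity, driving the coefficients of irrelevant features to exactly zero. Features with non-zero coefficients are considered important. Because the penalty acts on raw coefficient magnitudes, features are normally standardized first; on unscaled data, features with large numeric ranges need only small coefficients and are therefore penalized less. Also note that Lasso is a regression model; the example below applies it to the binary `survived` target purely for illustration.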


Example using scikit-learn:


In [34]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]

In [35]:
selected_features

Index(['age', 'fare'], dtype='object')
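
The effect of standardization can be sketched on synthetic regression data where the true relevant features are known. Everything below is an assumption for illustration: the data-generating process, the artificial rescaling of the first feature, and `alpha=0.2`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Six features; only the first two influence the target.
X_raw = rng.normal(size=(n, 6))
y_reg = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(scale=0.5, size=n)

# Put the first feature on a much larger numeric scale, as often happens in real data.
X_raw[:, 0] *= 100.0

# Standardize so the L1 penalty treats every feature equally.
X_std = StandardScaler().fit_transform(X_raw)

lasso_demo = Lasso(alpha=0.2).fit(X_std, y_reg)
selected = np.flatnonzero(lasso_demo.coef_)

print(np.round(lasso_demo.coef_, 3))
print("selected feature indices:", selected)
```

Only the two features that generate the target should keep non-zero coefficients. A larger `alpha` selects fewer features, so in practice it is usually tuned by cross-validation.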

#### <a id='toc1_3_2_'></a>[Ridge Regression (L2 Regularization)](#toc0_)


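Ridge regression adds an L2 regularization term to the linear regression objective function. The L2 penalty shrinks the coefficients of less important features towards zero, but does not eliminate them completely, so Ridge ranks features rather than selecting them. Features with larger absolute coefficients are considered more important, but this comparison is only meaningful when the features are on a common scale; on unscaled data, as in the example below, coefficient size also reflects each feature's units.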


Example using scikit-learn:


In [36]:
from sklearn.linear_model import Ridge
import numpy as np

ridge = Ridge(alpha=0.1)
ridge.fit(X, y)
feature_importances = np.abs(ridge.coef_)

In [37]:
feature_importances

array([0.07436907, 0.00864391, 0.0327418 , 0.06740302, 0.00066198])
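
The shrink-but-not-eliminate behavior can be checked with a short sketch on synthetic data (an assumption), fitting Ridge at several penalty strengths and tracking the size of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic features drawn on a common scale; only the first two drive the target.
X_demo = rng.normal(size=(200, 5))
y_reg = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

coef_norms = {}
for alpha in (0.1, 10.0, 1000.0):
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_reg)
    coef_norms[alpha] = np.linalg.norm(ridge_demo.coef_)
    print(alpha, np.round(ridge_demo.coef_, 3))
```

As `alpha` grows, all coefficients move towards zero, yet none of them becomes exactly zero. This is the practical difference from Lasso when the goal is to obtain a reduced feature set.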

#### <a id='toc1_3_3_'></a>[Decision Tree-based Feature Importance](#toc0_)


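Decision tree-based algorithms, such as Random Forests and Gradient Boosting, can provide feature importance scores based on how much each feature contributes to the reduction of impurity or error in the model. Features with higher importance scores are considered more relevant. Impurity-based importances can be biased towards features with many distinct values, such as continuous variables, so they are best read as a rough guide. The scores also vary between runs unless `random_state` is fixed.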


Example using scikit-learn:


In [39]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X, y)
rf.feature_importances_

array([0.02328554, 0.4391113 , 0.0492425 , 0.06215571, 0.42620495])
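
To turn importance scores into an actual selection step, one option is scikit-learn's `SelectFromModel`. The sketch below is illustrative: the synthetic dataset, `n_estimators=200`, and the `"median"` threshold are assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption): 8 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Fixing random_state makes the importances reproducible.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Keep the features whose importance is at least the median importance.
selector_demo = SelectFromModel(forest, threshold="median")
X_reduced = selector_demo.fit_transform(X_demo, y_demo)

importances = selector_demo.estimator_.feature_importances_
print(importances.round(3))
print(selector_demo.get_support())
print(X_reduced.shape)
```

The fitted forest is available as `selector_demo.estimator_`, so the same object provides both the scores and the reduced feature matrix.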

These are just a few examples of feature selection techniques. The choice of technique depends on the nature of the problem, the type of data, and the machine learning algorithm being used. It is often beneficial to experiment with multiple techniques and compare their results to select the most effective set of features for a given task.

## <a id='toc2_'></a>[Feature Extraction Techniques](#toc0_)

Feature extraction techniques aim to transform the original features into a new, lower-dimensional space while preserving the most important information. These techniques can be useful when dealing with high-dimensional data or when the original features are not directly suitable for machine learning algorithms. Here, we will discuss four popular feature extraction techniques: PCA, LDA, t-SNE, and Autoencoders.


### <a id='toc2_1_'></a>[Principal Component Analysis (PCA)](#toc0_)


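PCA is an unsupervised linear transformation technique that seeks to find a new set of orthogonal features (principal components) that capture the maximum variance in the data. The principal components are ordered by the amount of variance they explain, allowing for dimensionality reduction by selecting the top k components. Because PCA is driven by variance, features are usually standardized first; on unscaled data, as in the example below, the components are dominated by the features with the largest numeric ranges.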


Example using scikit-learn:


In [40]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_transformed = pca.fit_transform(X)

In [42]:
X_transformed.shape

(182, 3)

PCA is useful for visualizing high-dimensional data in a lower-dimensional space, identifying patterns, and reducing the dimensionality of the data while preserving the most important information.
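
What PCA computes can be sketched in a few lines of NumPy via the singular value decomposition. The data here is synthetic (an assumption): five observed features generated from two hidden factors plus a little noise, so most of the variance should sit in the first two components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five observed features driven by two latent factors, plus small noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.1 * rng.normal(size=(300, 5))

# Standardize so no feature dominates because of its scale.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(standardized, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

k = 2
components = Vt[:k]                           # top-k orthonormal directions
projected = standardized @ components.T       # data in the reduced space

print(np.round(explained_variance_ratio, 3))
print(projected.shape)
```

The explained variance ratios are the usual basis for choosing k: components are kept until their cumulative share of the variance is judged sufficient.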


### <a id='toc2_2_'></a>[Linear Discriminant Analysis (LDA)](#toc0_)


LDA is a supervised linear transformation technique that finds a new set of features that maximizes the separation between different classes. Unlike PCA, which focuses on capturing the maximum variance, LDA tries to find the directions that maximize the class separability. LDA can produce at most (number of classes - 1) components, so with a binary target such as `survived` only one component is available.


Example using scikit-learn:


In [46]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
X_transformed = lda.fit_transform(X, y)

LDA is particularly useful when dealing with classification problems where the goal is to find features that best discriminate between different classes.
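
With more than two classes, LDA can return more than one component. The sketch below uses the Iris dataset bundled with scikit-learn (three classes, four features), which is an assumption chosen for illustration and not part of the running Titanic example.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# Three classes allow at most 3 - 1 = 2 discriminant components.
lda_demo = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda_demo.fit_transform(X_iris, y_iris)

print(X_lda.shape)
print(lda_demo.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute reports how much of the between-class variance each discriminant direction accounts for, which helps decide whether a single component would be enough.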


### <a id='toc2_3_'></a>[t-SNE (t-Distributed Stochastic Neighbor Embedding)](#toc0_)


t-SNE is a non-linear dimensionality reduction technique that aims to preserve the local structure of the data in the low-dimensional space. It maps the high-dimensional data points to a lower-dimensional space such that points that are close neighbors in the original space stay close together. Distances between well-separated groups in the embedding are much less reliable and should not be over-interpreted.


Example using scikit-learn:


In [48]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30)
X_transformed = tsne.fit_transform(X)

In [49]:
X_transformed.shape

(182, 2)

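t-SNE is widely used for visualizing high-dimensional data in a 2D or 3D space, as it can reveal interesting patterns and clusters in the data. However, it is computationally expensive, it may not preserve the global structure of the data, and its results depend on the perplexity setting and the random initialization. scikit-learn's `TSNE` also has no `transform` method for new samples, which makes it better suited to exploration than to producing features for a downstream model.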
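
A reproducible variant can be sketched as follows. It uses the first 300 samples of the digits dataset bundled with scikit-learn, which is an assumption chosen to keep the run short; `init="pca"` and `random_state=0` are likewise illustrative settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X_digits = digits.data[:300]   # 300 samples, 64 pixel features each

# Fixing random_state makes the embedding repeatable across runs.
tsne_demo = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne_demo.fit_transform(X_digits)

print(embedding.shape)

# There is no transform() for unseen data, only fit_transform().
print(hasattr(tsne_demo, "transform"))
```

The perplexity must be smaller than the number of samples. It is worth trying a few values, since the apparent cluster structure can change with it.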


### <a id='toc2_4_'></a>[Autoencoders](#toc0_)


Autoencoders are neural network-based models that learn to compress and reconstruct the input data. They consist of an encoder network that maps the input to a lower-dimensional representation (latent space) and a decoder network that reconstructs the original input from the latent representation.


In [50]:
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [51]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Scale features to [0, 1] so they match the sigmoid output range
# that the binary cross-entropy loss expects.
X_scaled = MinMaxScaler().fit_transform(X)

input_dim = X_scaled.shape[1]
encoding_dim = 2  # bottleneck smaller than input_dim (5), so the encoder compresses

input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = Model(input_layer, decoded)  # full model, trained to reconstruct its input
encoder = Model(input_layer, encoded)      # shares the encoder weights, outputs the latent features

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32, verbose=0)

X_transformed = encoder.predict(X_scaled, verbose=0)

In [52]:
X_transformed.shape

(182, 2)

Autoencoders can learn non-linear transformations of the data and capture complex patterns. They are useful for dimensionality reduction, feature learning, and anomaly detection.


These feature extraction techniques can be applied depending on the nature of the data and the specific requirements of the problem. It is common to experiment with different techniques and evaluate their effectiveness in the context of the machine learning task at hand.

## <a id='toc3_'></a>[Summary and Takeaways](#toc0_)

In this lecture, we explored the concepts of feature selection and extraction, which are crucial steps in the machine learning pipeline. Let's recap the key points and emphasize the importance of these techniques in ML projects.


Feature selection and extraction play a vital role in the success of machine learning projects:

1. They help in improving model performance by focusing on the most informative and relevant features, reducing noise and irrelevant information.

2. They reduce the risk of overfitting by eliminating redundant or irrelevant features, enhancing the model's ability to generalize to unseen data.

3. They can speed up the training and inference processes by reducing the dimensionality of the data, making the models more computationally efficient.

4. They enhance the interpretability of the models by identifying the most important features, providing insights into the underlying patterns and relationships in the data.

5. They enable the effective handling of high-dimensional data, which is common in many real-world applications.


By incorporating feature selection and extraction techniques into your machine learning workflow, you can improve the quality of your models, reduce computational complexity, and gain valuable insights from your data.


To dive deeper into feature selection and extraction techniques, here are some recommended resources:

1. Scikit-learn documentation on feature selection: [https://scikit-learn.org/stable/modules/feature_selection.html](https://scikit-learn.org/stable/modules/feature_selection.html)

2. "Feature Engineering and Selection: A Practical Approach for Predictive Models" by Max Kuhn and Kjell Johnson: [https://bookdown.org/max/FES/](https://bookdown.org/max/FES/)

3. "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani: [https://www.statlearning.com/](https://www.statlearning.com/)

4. "Pattern Recognition and Machine Learning" by Christopher M. Bishop: [https://www.springer.com/gp/book/9780387310732](https://www.springer.com/gp/book/9780387310732)


These resources provide in-depth explanations, practical examples, and theoretical foundations for feature selection and extraction techniques.


By understanding and applying these techniques effectively, you can enhance the quality and performance of your machine learning models and tackle real-world problems with greater success.