# Module 1: Data Analysis and Data Preprocessing

## Section 4: Feature selection

### Part 3: Recursive Feature Elimination (RFE)

In this part, we explore Recursive Feature Elimination (RFE), a technique that selects features by repeatedly fitting a model and removing the least important ones. RFE is particularly useful for datasets with a large number of features.

### 2.1 Using Recursive Feature Elimination (RFE)

Recursive Feature Elimination is a feature selection technique that works by recursively fitting the model with the remaining features and ranking them based on their importance. At each iteration, the least significant features are removed until the desired number of features is reached.

The key idea behind Recursive Feature Elimination is to keep only the features that contribute most to the model's predictions. Removing the rest can make the model cheaper to train, reduce overfitting, and make it easier to interpret.

Advantages:
- RFE removes less important features step by step, leading to a more parsimonious and interpretable model.
- It works with any model that exposes coefficients or feature importances, so it can be used with many different algorithms.

Disadvantages:
- The computational complexity of RFE can be high for large datasets or complex models, as it involves fitting the model multiple times.
- The selected features depend on the choice of ranking model, and RFE cannot be used directly with models that expose no importance measure (no coef_ or feature_importances_ attribute).
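To get a feel for the computational cost mentioned above, the number of model fits in a single RFE run can be estimated with simple arithmetic. The helper below is a hypothetical illustration (its name is not part of scikit-learn), and it assumes one fit per elimination round plus one final fit on the surviving features.

```python
import math


def rfe_fit_count(n_features: int, n_features_to_select: int, step: int = 1) -> int:
    """Rough number of model fits for one RFE run (hypothetical helper)."""
    # One fit per elimination round, plus one final fit on the kept features.
    elimination_rounds = math.ceil((n_features - n_features_to_select) / step)
    return elimination_rounds + 1


# 30 features reduced to 5, removing one feature per round
print(rfe_fit_count(30, 5, step=1))  # 26
# Same reduction, removing five features per round
print(rfe_fit_count(30, 5, step=5))  # 6
```

A larger step cuts the number of fits considerably, at the price of a coarser ranking.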

The RFE process involves the following steps:

1. Fit a model (e.g., linear regression or SVM) to the current feature set, which initially contains all features.
2. Rank the features based on the importance scores obtained from the model.
3. Remove the least important feature(s) from the feature set.
4. Repeat steps 1-3 until the desired number of features is reached.
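The steps above can be sketched by hand in a few lines. This is a simplified illustration rather than scikit-learn's actual implementation: it assumes a synthetic dataset from make_classification, uses the absolute value of the logistic regression coefficients as the importance score, and removes exactly one feature per round.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 3 informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

n_features_to_keep = 3
remaining = list(range(X.shape[1]))  # indices of features still in play

while len(remaining) > n_features_to_keep:
    # Step 1: fit the model on the current feature set
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # Step 2: rank features by absolute coefficient size
    importance = np.abs(model.coef_).ravel()
    # Step 3: drop the single least important feature
    weakest = int(np.argmin(importance))
    remaining.pop(weakest)
    # Step 4: the loop repeats until only n_features_to_keep are left

print("Kept feature indices:", remaining)
```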


The primary parameters of the RFE class in scikit-learn are as follows:

- estimator: This is the base estimator or model used to determine the feature importance. It should be a supervised learning estimator with a coef_ or feature_importances_ attribute. Common choices are linear models, decision trees, and random forests.
- n_features_to_select: The number of features to select. If None (the default), half of the features are selected. An integer gives the absolute number of features; a float between 0 and 1 gives the fraction of features to select.
- step: The number of features to remove at each iteration. The default value is 1, which means one feature is removed at each iteration. An integer of 1 or more removes that many features per iteration; a float between 0 and 1 removes that fraction of the features (rounded down) per iteration.
- verbose: An integer that controls the verbosity of the output during the feature selection process. The default, 0, prints nothing; higher values print progress updates.
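As a small illustration of these parameters, the sketch below leaves n_features_to_select at its default and sets step=2. The synthetic dataset is an assumption made for the example. After fitting, support_ is a boolean mask of the kept features and ranking_ assigns rank 1 to every selected feature.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 4 informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

# n_features_to_select is left at None, so half of the 10 features are kept;
# step=2 removes up to two features per iteration.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), step=2)
rfe.fit(X, y)

print("Number of features kept:", rfe.n_features_)
print("Selection mask:", rfe.support_)
# ranking_ is 1 for selected features; larger values were eliminated earlier
print("Ranking:", rfe.ranking_)
```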

Here's how you can use it:

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create the model for feature ranking (e.g., Logistic Regression)
model = LogisticRegression(max_iter=5000)
# Create the RFE object with the model and number of features to select
rfe = RFE(model, n_features_to_select=5)
# Fit the RFE object to the training data
rfe.fit(X_train, y_train)
# Get the selected features
selected_features = rfe.support_

# Print all feature names, then the names of the selected features
print("All features:")
print(data.feature_names)
print("\nSelected features:")
for x in np.where(selected_features)[0]:
    print("\t", data.feature_names[x])

# Transform the training and test sets to include only the selected features
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

# Train a classifier using the selected features
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train_selected, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test_selected)
# Calculate accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("\nLogisticRegression model accuracy:", accuracy)


In this example, we used the breast cancer dataset and applied Recursive Feature Elimination (RFE) with a Logistic Regression model to select the top 5 of its 30 features. We then trained a new Logistic Regression model using only the selected features and evaluated its accuracy on the test set. RFE helps us identify the most relevant features for a classification task, which can improve interpretability and, in some cases, model performance.
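Whether the reduced feature set helps or hurts accuracy depends on the data, so it is worth comparing against a baseline trained on all features. The sketch below reuses the same dataset, split, and settings as the example above; the variable names are chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Baseline: logistic regression on all 30 features
baseline = LogisticRegression(max_iter=5000).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Reduced model: the 5 features chosen by RFE, fitted on the training data only
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X_train, y_train)
reduced = LogisticRegression(max_iter=5000).fit(rfe.transform(X_train), y_train)
reduced_acc = accuracy_score(y_test, reduced.predict(rfe.transform(X_test)))

print(f"All 30 features: {baseline_acc:.3f}")
print(f"5 RFE features:  {reduced_acc:.3f}")
```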

### 2.2 RFE with Cross-Validation

To avoid overfitting and choose the number of features more reliably, Recursive Feature Elimination is often combined with cross-validation. The elimination is run on the training portion of each fold, every candidate subset size is scored on the held-out portion, and the subset size with the best average score is selected.
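The idea can be sketched by hand by cross-validating a pipeline for each candidate number of features. This is only an illustration of the principle, on an assumed synthetic dataset, and not a description of how any library implements it internally.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 8 features, 3 informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

mean_scores = {}
for k in range(1, X.shape[1] + 1):
    # Putting RFE inside the pipeline reruns feature selection within each fold,
    # so the held-out fold never influences which features are chosen.
    pipe = Pipeline([
        ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    mean_scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

# Pick the subset size with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("Best number of features by 5-fold CV:", best_k)
```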

scikit-learn provides the RFECV class for performing Recursive Feature Elimination with Cross-Validation.

RFECV Parameters:
- estimator: The machine learning estimator used to evaluate feature importance and perform feature selection. This estimator must have a coef_ or feature_importances_ attribute after fitting. For example, LogisticRegression, RandomForestClassifier, GradientBoostingClassifier, etc.
- step: The number of features to remove at each iteration. The default value is 1, which means one feature is removed at each iteration.
- cv: The cross-validation strategy used to evaluate the performance of different feature subsets. It can be an integer (to specify the number of folds for k-fold cross-validation) or a cross-validation object from sklearn.model_selection (e.g., KFold, StratifiedKFold, etc.).
- scoring: The metric used to evaluate performance during cross-validation. The default value is None, which uses the estimator's default scoring method (accuracy for classifiers and R-squared for regressors). You can also pass the name of a built-in scorer as a string (e.g., "accuracy" or "roc_auc") or a scorer callable, such as one created with sklearn.metrics.make_scorer.
- min_features_to_select: The minimum number of features to keep (default 1). Elimination stops once this many features remain, so smaller subsets are never evaluated.
- verbose: Controls the verbosity of the output during the fitting process. Set to 0 for no output, and higher values for more detailed output.

Advantages of RFECV:
- Automatic Feature Selection: RFECV chooses how many features to keep for you, so you do not have to fix n_features_to_select in advance.
- Cross-Validation: It evaluates feature subsets with cross-validation, making it more robust and less prone to overfitting than simple feature ranking methods.
- Optimal Feature Subset: RFECV selects the number of features that maximizes the cross-validated performance metric.

Disadvantages of RFECV:
- Computationally Expensive: RFECV can be computationally expensive, especially for large datasets or when using complex models with high-dimensional feature spaces.
- Estimator Choice: The choice of estimator can influence the results, so it's essential to choose an appropriate estimator for the specific problem.
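To see how the estimator can influence the outcome, the following sketch runs RFECV with two different estimators on the same assumed synthetic dataset and prints which feature indices each one keeps. The two selections may or may not agree.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 8 features, 3 informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

estimators = {
    # Ranks features through coef_
    "logistic regression": LogisticRegression(max_iter=1000),
    # Ranks features through feature_importances_
    "decision tree": DecisionTreeClassifier(random_state=0),
}

selected = {}
for name, estimator in estimators.items():
    rfecv = RFECV(estimator=estimator, cv=5).fit(X, y)
    # get_support(indices=True) returns the indices of the kept features
    selected[name] = rfecv.get_support(indices=True)
    print(f"{name}: {rfecv.n_features_} features -> {selected[name]}")
```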

Here's an example of how to use it:

In [None]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create the logistic regression model (or any other estimator)
model = LogisticRegression(max_iter=5000)
# Create the RFECV object with the logistic regression model and perform 5-fold cross-validation
rfecv = RFECV(estimator=model, cv=5)
# Fit the RFECV object to the training data
rfecv.fit(X_train, y_train)
# Get the selected features
selected_features = rfecv.support_

# Print all feature names, then the names of the selected features
print("All features:")
print(data.feature_names)
# Print the optimal number of features selected by RFECV
print("\nOptimal number of features:", rfecv.n_features_)
print("\nSelected features:")
for x in np.where(selected_features)[0]:
    print("\t", data.feature_names[x])

# Transform the training and test sets to include only the selected features
X_train_selected = rfecv.transform(X_train)
X_test_selected = rfecv.transform(X_test)

# Train a logistic regression classifier on the reduced set of features
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train_selected, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test_selected)
# Calculate accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print("\nLogisticRegression model accuracy:", accuracy)

In this example, we use RFECV instead of RFE, and we specify the number of folds for cross-validation using the cv parameter (in this case, 5-fold cross-validation). RFECV automatically selects the optimal number of features based on cross-validation performance, and you can access the number of selected features using rfecv.n_features_.

### 2.3 Summary

Recursive Feature Elimination selects the most important features by iteratively fitting a model and eliminating the least significant features. It can help improve model efficiency, reduce overfitting, and enhance interpretability. It is especially useful when combined with cross-validation (RFECV): the feature selection then takes into account performance on multiple folds of the data, which can lead to better generalization on unseen data. However, RFECV requires more computation time, and the choice of estimator can affect the final results, so it is worth experimenting with different estimators and parameter settings to find the best combination for your specific task.
