# <center> Feature Selection </center>

Feature selection is an important preprocessing step in most machine learning tasks. In a tabular dataset, a feature is simply a column. Not every column (feature) in a dataset necessarily has an impact on the output variable. Feeding irrelevant features to a model can degrade its performance (i.e., garbage in, garbage out), which is why feature selection is needed.


# Forward Feature Selection:

* Forward feature selection starts with an empty set of features and iteratively adds the most relevant feature at each step.
* At each step, the algorithm evaluates every remaining feature by training a model on the already selected features plus that candidate and scoring it with a chosen evaluation metric (e.g., accuracy, F1-score).
* It selects the feature that provides the best improvement in the evaluation metric and adds it to the selected features set.
* This process continues until a stopping criterion is met (e.g., a predefined number of features are selected or the performance stops improving).

Forward feature selection is typically used when you have a large pool of potential features, and you want to identify the most relevant subset for your model.
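
The steps above can be sketched as a small function. This is a minimal illustration rather than a definitive implementation: the name `forward_select` and its parameters `max_features`, `tol`, and `cv` are hypothetical, and it assumes mean cross-validated accuracy as the evaluation metric and `max_iter=1000` to keep logistic regression from stopping early.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, max_features=None, tol=1e-3, cv=5):
    """Greedy forward selection scored by mean cross-validated accuracy.

    `max_features`, `tol`, and `cv` are illustrative parameters.
    """
    n_features = X.shape[1]
    max_features = n_features if max_features is None else max_features
    selected, best_score = [], -np.inf

    while len(selected) < max_features:
        # Score each remaining candidate together with the current subset
        scores = {}
        for idx in range(n_features):
            if idx in selected:
                continue
            model = LogisticRegression(max_iter=1000)
            cols = selected + [idx]
            scores[idx] = cross_val_score(model, X[:, cols], y, cv=cv).mean()

        candidate = max(scores, key=scores.get)
        # Stopping criterion: the best candidate improves the score by less than `tol`
        if scores[candidate] - best_score < tol:
            break
        selected.append(candidate)
        best_score = scores[candidate]

    return selected, best_score


X, y = make_classification(n_samples=300, n_features=5, random_state=3)
selected, score = forward_select(X, y)
print("Selected features:", selected)
print("Cross-validated accuracy:", round(score, 3))
```

Setting `max_features` covers the "predefined number of features" stopping rule, while `tol` covers the "performance stops improving" rule.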


# Backward Feature Selection:

* Backward feature selection starts with all available features and iteratively removes the least relevant feature at each step.
* The algorithm evaluates the model's performance after removing each feature and selects the feature whose removal causes the least decrease in the evaluation metric.
* This process continues until a stopping criterion is met (e.g., a predefined number of features remain or the performance starts degrading significantly).

Backward feature selection is often used when you have a large number of features and you want to simplify the model or reduce the risk of overfitting.
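
A possible sketch of the elimination loop with an explicit stopping criterion follows. The helper names `cv_accuracy` and `backward_eliminate` and the parameters `min_features`, `tol`, and `cv` are assumptions made for illustration, and cross-validated accuracy stands in for whatever metric is actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def cv_accuracy(X, y, cols, cv=5):
    """Mean cross-validated accuracy of logistic regression on the given columns."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, cols], y, cv=cv).mean()


def backward_eliminate(X, y, min_features=1, tol=0.01, cv=5):
    """Greedy backward elimination; `min_features` and `tol` are illustrative."""
    remaining = list(range(X.shape[1]))
    current_score = cv_accuracy(X, y, remaining, cv)

    while len(remaining) > min_features:
        # Score the subset obtained by dropping each remaining feature in turn
        scores = {
            idx: cv_accuracy(X, y, [f for f in remaining if f != idx], cv)
            for idx in remaining
        }
        # The least relevant feature is the one whose removal keeps the score highest
        candidate = max(scores, key=scores.get)
        # Stopping criterion: even the cheapest removal costs more than `tol`
        if current_score - scores[candidate] > tol:
            break
        remaining.remove(candidate)
        current_score = scores[candidate]

    return remaining, current_score


X, y = make_classification(n_samples=300, n_features=5, random_state=3)
remaining, score = backward_eliminate(X, y)
print("Remaining features:", remaining)
print("Cross-validated accuracy:", round(score, 3))
```

A larger `tol` tolerates a bigger accuracy drop per step and therefore tends to yield a smaller model.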

In [27]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a toy dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=3)
print(X, y)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize an empty set to store selected features
selected_features = []

[[-0.41999526  0.38087892  1.05803637  0.05871946 -1.06514246]
 [-0.70561162  1.01345596  0.66848172  1.23482107 -0.78757691]
 [-1.2300397   0.56989039 -0.84436956  1.30745517  0.72045114]
 ...
 [-0.12468906 -1.41093398  0.99229634 -2.68524513 -0.73675032]
 [ 0.5895268  -0.3869628   1.24062782 -1.22303666 -1.12534416]
 [-1.0680554  -1.48418409  0.69252136 -2.64815779 -0.44010265]] [0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0
 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 0
 1 1 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 1 1 0 1 1
 0 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 1 0 0 1 1
 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 0 0
 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 1 0
 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0
 1 1 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0
 0 1 0 1 0 0 0 0

In [30]:


# Perform Forward Feature Selection
# The loop runs until every feature has been added, so the final list is
# the order in which features were picked rather than a reduced subset.
while len(selected_features) < X_train.shape[1]:
    best_feature = None
    best_accuracy = -1.0  # below any possible accuracy, so a feature is always chosen

    for feature_idx in range(X_train.shape[1]):
        if feature_idx not in selected_features:
            # Combine the currently selected features with the new candidate
            features_to_use = selected_features + [feature_idx]

            # Train a model using only the candidate subset
            model = LogisticRegression(random_state=42)
            model.fit(X_train[:, features_to_use], y_train)

            # Make predictions on the test set (used here for simplicity;
            # a validation split or cross-validation keeps test data out of the selection)
            y_pred = model.predict(X_test[:, features_to_use])

            # Evaluate the model's accuracy
            accuracy = accuracy_score(y_test, y_pred)

            # Update the best feature if the accuracy is higher
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_feature = feature_idx

    # Add the best feature to the selected features
    selected_features.append(best_feature)

print("Forward Selected Features:", selected_features)

Forward Selected Features: [4, 0, 1, 2, 3]
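
The loop above scores candidates on the test set, so the test accuracy can no longer serve as an unbiased estimate of performance. One way to avoid this, sketched here under the assumption that part of the training data can be held out, is to select on a validation split and touch the test set only once at the end. The names `X_fit`, `X_val`, and `val_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Assumed extra split: hold out a quarter of the training data for selection
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)


def val_accuracy(cols):
    """Fit on the fitting split and score on the validation split."""
    model = LogisticRegression(random_state=42)
    model.fit(X_fit[:, cols], y_fit)
    return accuracy_score(y_val, model.predict(X_val[:, cols]))


selected, best_score = [], -1.0
while len(selected) < X.shape[1]:
    candidates = [i for i in range(X.shape[1]) if i not in selected]
    scores = {i: val_accuracy(selected + [i]) for i in candidates}
    best = max(scores, key=scores.get)
    if scores[best] <= best_score:
        break  # no candidate improves validation accuracy
    selected.append(best)
    best_score = scores[best]

# Refit on the full training set and evaluate on the test set exactly once
final_model = LogisticRegression(random_state=42)
final_model.fit(X_train[:, selected], y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test[:, selected]))
print("Selected features:", selected)
print("Test accuracy:", round(test_accuracy, 3))
```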


In [31]:


# Perform Backward Feature Selection
selected_features = list(range(X_train.shape[1]))  # Start with all features

while len(selected_features) > 1:
    least_relevant_feature = None
    best_accuracy = -1.0  # below any possible accuracy

    for feature_idx in selected_features:
        # Create a list of features excluding the current one
        features_to_use = [f for f in selected_features if f != feature_idx]

        # Train a model without the current feature
        model = LogisticRegression(random_state=42)
        model.fit(X_train[:, features_to_use], y_train)

        # Make predictions on the test set
        y_pred = model.predict(X_test[:, features_to_use])

        # Evaluate the model's accuracy
        accuracy = accuracy_score(y_test, y_pred)

        # The least relevant feature is the one whose removal
        # leaves the highest accuracy (the smallest decrease)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            least_relevant_feature = feature_idx

    # Remove the least relevant feature from the selected features
    selected_features.remove(least_relevant_feature)

print("Backward Selected Features:", selected_features)
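
For comparison, scikit-learn provides `SequentialFeatureSelector`, which runs a comparable greedy search in either direction using cross-validated scores. The sketch below is only an illustration: `n_features_to_select=2`, `cv=5`, and the smaller sample size are arbitrary choices, not values taken from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=3)

results = {}
for direction in ("forward", "backward"):
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=2,  # illustrative target size
        direction=direction,
        scoring="accuracy",
        cv=5,
    )
    selector.fit(X, y)
    # get_support(indices=True) returns the column indices that were kept
    results[direction] = selector.get_support(indices=True).tolist()

print(results)
```

The two directions do not have to agree, because each greedy search explores a different sequence of subsets.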
