# Basic example for the usage of reverse feature selection

This is a basic example of how to use the reverse feature selection algorithm. The example generates a synthetic dataset
with 200 irrelevant features and two relevant features. The relevant features have an increased effect size: their values
differ clearly between the two classes. The algorithm selects the relevant features, and the example prints the names of
the selected features.

### Generate synthetic dataset

In [1]:
# Import required libraries to generate an example dataset
import numpy as np
import pandas as pd

# Number of relevant features to insert
n_relevant_features = 2

# Number of irrelevant features
n_irrelevant_features = 200

# Total number of samples (must be even so that both classes have the same size)
n_samples = 30

# Set up a random number generator (pass a seed, e.g. default_rng(42), to make the generated data reproducible)
rng = np.random.default_rng()

# Create DataFrame with irrelevant features
data_df = pd.DataFrame({f"random_feature{i+1}": rng.random(n_samples) for i in range(n_irrelevant_features)})

# Insert relevant features with increased effect size
for i in range(n_relevant_features):
    regulated_class = rng.random(n_samples // 2) + (i + 1) * 2
    unregulated_class = rng.random(n_samples // 2) + (i + 1)
    # Concatenate the two classes to form a single relevant feature
    data_df.insert(i, f"relevant_feature{i+1}", np.concatenate((regulated_class, unregulated_class)))
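
As a quick sanity check on the construction above, the separation between the two halves of each feature can be compared. The sketch below rebuilds a small dataset in the same way. The fixed seed `0`, the reduced number of random features, and the helper `class_mean_difference` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

n_samples = 30
half = n_samples // 2
rng = np.random.default_rng(0)  # fixed seed (assumption) so the check is repeatable

# Three irrelevant features drawn uniformly from [0, 1)
check_df = pd.DataFrame({f"random_feature{i+1}": rng.random(n_samples) for i in range(3)})

# Two relevant features built as in the cell above
for i in range(2):
    regulated_class = rng.random(half) + (i + 1) * 2
    unregulated_class = rng.random(half) + (i + 1)
    check_df.insert(i, f"relevant_feature{i+1}", np.concatenate((regulated_class, unregulated_class)))


def class_mean_difference(column: pd.Series) -> float:
    """Hypothetical helper: mean of the first half minus mean of the second half."""
    return float(column.iloc[:half].mean() - column.iloc[half:].mean())


# One difference per column; relevant features should stand out from the random ones
differences = check_df.apply(class_mean_difference)
print(differences.round(2))
```

By construction the two halves of a relevant feature never overlap, so its difference is always positive, while the difference for a random feature stays between -1 and 1.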

### Insert labels

In [2]:
# Construct binary class labels (15 samples of class 0 and 15 of class 1)
label = np.concatenate((np.zeros(n_samples // 2), np.ones(n_samples // 2)))

# Insert label column at the beginning of the DataFrame
data_df.insert(0, "label", label)

data_df.head()

| | label | relevant_feature1 | relevant_feature2 | random_feature1 | random_feature2 | random_feature3 | random_feature4 | random_feature5 | random_feature6 | random_feature7 | ... | random_feature191 | random_feature192 | random_feature193 | random_feature194 | random_feature195 | random_feature196 | random_feature197 | random_feature198 | random_feature199 | random_feature200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2.707493 | 4.545599 | 0.587659 | 0.71221 | 0.122933 | 0.424262 | 0.076522 | 0.504659 | 0.484304 | ... | 0.340841 | 0.72324 | 0.9966 | 0.660321 | 0.10586 | 0.979735 | 0.148285 | 0.607985 | 0.82363 | 0.733797 |
| 1 | 0.0 | 2.799007 | 4.884355 | 0.782353 | 0.971619 | 0.299618 | 0.67841 | 0.47752 | 0.035606 | 0.272679 | ... | 0.788341 | 0.043351 | 0.618925 | 0.064209 | 0.425322 | 0.23561 | 0.431035 | 0.425155 | 0.783302 | 0.390538 |
| 2 | 0.0 | 2.413544 | 4.768177 | 0.163102 | 0.712957 | 0.573272 | 0.73344 | 0.687636 | 0.630334 | 0.994705 | ... | 0.956702 | 0.061218 | 0.891728 | 0.265086 | 0.010259 | 0.08021 | 0.80034 | 0.695813 | 0.300579 | 0.119317 |
| 3 | 0.0 | 2.389899 | 4.36877 | 0.466263 | 0.959561 | 0.434126 | 0.282996 | 0.861042 | 0.208301 | 0.854664 | ... | 0.350258 | 0.27592 | 0.230971 | 0.499132 | 0.861143 | 0.392174 | 0.203347 | 0.199382 | 0.294618 | 0.220623 |
| 4 | 0.0 | 2.668488 | 4.018868 | 0.553765 | 0.207994 | 0.77972 | 0.588837 | 0.374214 | 0.093316 | 0.517818 | ... | 0.875388 | 0.005005 | 0.015254 | 0.33181 | 0.757989 | 0.08016 | 0.712073 | 0.317964 | 0.356069 | 0.082959 |


### Set training indices (simulate cross-validation)

In [3]:
# Simulate one fold of leave-one-out cross-validation: use all samples but one (29 of 30) for training
train_indices = rng.choice(data_df.index, size=n_samples - 1, replace=False)
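
The cell above draws a single random training split. A full leave-one-out procedure would hold out each sample exactly once. A minimal sketch using only NumPy follows. The names `all_indices` and `folds` are illustrative and not part of the library, and passing each `train_idx` as `train_indices` to the selection function is assumed to work the same way as for the single split.

```python
import numpy as np

n_samples = 30
all_indices = np.arange(n_samples)

# One fold per sample: the held-out sample is the test set, the rest is the training set
folds = [(np.delete(all_indices, i), all_indices[i]) for i in range(n_samples)]

# Show the first three folds
for train_idx, test_idx in folds[:3]:
    print(f"test sample: {test_idx}, number of training samples: {len(train_idx)}")
```

Running the feature selection once per fold multiplies the runtime by the number of samples, which is presumably why this example simulates only one fold.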

### Define meta information

The meta data can be left at its default values. However, "random_seeds" must be defined for reproducibility, and "train_correlation_threshold" should be tuned if the results are not satisfactory. This parameter adjusts the size of the selected feature subset.
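
To give an intuition for what a correlation threshold does, the sketch below filters the columns of a toy DataFrame by their absolute Pearson correlation with a column standing in for the target. This only illustrates the general idea. The correlation measure, the column names, and the filtering rule are assumptions, and the library's internal implementation may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) for a repeatable illustration
base = rng.random(30)

toy_df = pd.DataFrame(
    {
        "target": base,
        "near_duplicate": base + 0.01 * rng.random(30),  # almost identical to the target
        "independent": rng.random(30),  # drawn independently of the target
    }
)

threshold = 0.7  # same value as "train_correlation_threshold" in this example

# Absolute correlation of every other column with the target column
correlations = toy_df.drop(columns="target").corrwith(toy_df["target"]).abs()

kept = correlations[correlations <= threshold].index.tolist()
removed = correlations[correlations > threshold].index.tolist()
print("kept:", kept)
print("removed:", removed)
```

In this illustration, a lower threshold removes more columns and a higher threshold removes fewer.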

In [4]:
# Define a list of distinct integer random seeds to initialize the random forests reproducibly
seeds = [29, 10, 17, 42, 213, 34, 1, 5, 19, 3, 23, 9, 7, 123, 234, 345, 456, 567, 678, 789, 890, 15, 333, 37, 45, 56]

# Meta configuration for the feature selection
meta_data = {
    "n_cpus": 4,
    "random_seeds": seeds,
    # Features whose correlation with the target exceeds this threshold are removed from the training data
    "train_correlation_threshold": 0.7,
}
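
Instead of typing the seeds by hand, a list of distinct seeds could also be drawn from a seeded generator. This is a sketch under assumptions: the master seed `42` and the upper bound `10_000` are arbitrary choices, and the number of seeds simply matches the 26 entries of the hard-coded list.

```python
import numpy as np

n_seeds = 26  # same number of seeds as in the hard-coded list above
seed_rng = np.random.default_rng(42)  # fixed master seed (assumption)

# Draw distinct integers from [0, 10_000) and convert them to plain Python ints
seeds = seed_rng.choice(10_000, size=n_seeds, replace=False).tolist()
print(seeds)
```

Sampling without replacement guarantees that no seed is repeated, and the fixed master seed yields the same list on every run.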

### Run reverse feature selection

In [5]:
# Import the reverse feature selection function
from reverse_feature_selection.reverse_random_forests import select_feature_subset

# Run the reverse feature selection algorithm (may take a minute or two)
result_df = select_feature_subset(data_df, train_indices, meta_data=meta_data)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed:   24.2s
[Parallel(n_jobs=4)]: Done 202 out of 202 | elapsed:  1.1min finished


### Display selected features

In [6]:
# Selected features are those with a score greater than 0, stored in the 'feature_subset_selection' column
print("Selected features:")
result_df[result_df["feature_subset_selection"] > 0]["feature_subset_selection"]

Selected features:


relevant_feature1    0.802864
relevant_feature2    0.933373
Name: feature_subset_selection, dtype: float64
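
To obtain the selected feature names as a plain list, for example to subset the original data, the index of the filtered result can be used. The sketch below works on a mock result that reuses the two scores shown above. The extra row with a score of 0 and the name `mock_result_df` are assumptions for illustration.

```python
import pandas as pd

# Mock result (assumption): two selected features and one feature with a score of 0
mock_result_df = pd.DataFrame(
    {"feature_subset_selection": [0.802864, 0.933373, 0.0]},
    index=["relevant_feature1", "relevant_feature2", "random_feature1"],
)

# Keep only features with a score greater than 0
selected = mock_result_df.loc[mock_result_df["feature_subset_selection"] > 0, "feature_subset_selection"]

# Feature names ordered from highest to lowest score
selected_names = selected.sort_values(ascending=False).index.tolist()
print(selected_names)  # ['relevant_feature2', 'relevant_feature1']
```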