https://towardsdatascience.com/evolutionary-feature-selection-for-machine-learning-7f61af2a8c12

### Evolutionary Feature Selection for Machine Learning
In general, it’s not a good idea to rely on brute-force approaches to optimize a model. In feature selection, this includes methods like forward selection or backward elimination, which can only vary one feature at a time and tend to have trouble seeing how different subsets of features (of the same size) work together.

#### Model Representation:
We can model the features as follows:
- Each individual in the population represents one complete candidate subset of features.
- Each gene of the individual represents one particular feature.
- Each gene value can be 0 or 1; zero means the algorithm did not select the feature, and one means the feature is included.
- Mutation flips the bit value at a randomly selected position, and it is applied with a given mutation probability.
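The representation above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the code the library runs internally; the helper names `random_individual` and `mutate`, the seed, and the mutation probability of 0.2 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the sketch is reproducible


def random_individual(n_features, rng):
    """One individual: a 0/1 gene per feature (1 = feature selected)."""
    return rng.integers(0, 2, size=n_features)


def mutate(individual, mutation_prob, rng):
    """With probability `mutation_prob`, flip the bit at one random position."""
    mutant = individual.copy()
    if rng.random() < mutation_prob:
        position = rng.integers(0, mutant.shape[0])
        mutant[position] = 1 - mutant[position]
    return mutant


individual = random_individual(9, rng)  # nine genes, one per feature
mutant = mutate(individual, mutation_prob=0.2, rng=rng)
selected = np.flatnonzero(mutant)  # indices of the features the mutant keeps
```

Some implementations instead flip every gene independently with a small per-gene probability; the single-position version here follows the description in the list.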

#### Python Code:
For this experiment, I’m going to use a classification dataset (Iris). I’m also going to add random noise columns as new “garbage features” that are not useful for the model and only add complexity. I expect the model to remove them, and possibly some of the originals. Hence, the first step is to import the data and create these new features:

In [1]:
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
X, y = data["data"], data["target"]

# Add random non-important features
noise = np.random.uniform(0, 10, size=(X.shape[0], 5))
X = np.hstack((X, noise))
X.shape

(150, 9)

From the previous code, you can see there are nine features: four originals and five dummies. We can plot them to check how they relate to the “y” variable, which we want to predict. Each color represents one of the categories.

We can see that the original features help to discriminate between the observations of each class, with a boundary that separates them. The new features (dummies), however, don’t add value since they cannot “split” the data by category, just as expected.

Now, we will split the data into train and test and import the base model we want to use to select the features, in this case, a decision tree.


In [2]:

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = DecisionTreeClassifier()
cv = StratifiedKFold(n_splits=3, shuffle=True)

As a next step, let’s import and fit the feature selection model. As mentioned, it uses evolutionary algorithms to select the features, with a multi-objective function that maximizes the cross-validation score while also minimizing the number of features used.
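To make the two objectives concrete, here is a rough sketch of what evaluating one candidate could look like, using only scikit-learn. It is not the internal implementation of sklearn-genetic-opt; the function name `fitness`, the `mask` argument, and the fixed `random_state` values are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def fitness(mask, estimator, X, y, cv):
    """Return (mean CV accuracy, number of selected features) for a 0/1 mask."""
    mask = np.asarray(mask, dtype=bool)
    n_selected = int(mask.sum())
    if n_selected == 0:
        # No features selected: nothing to train on, so return the worst score.
        return 0.0, 0
    scores = cross_val_score(estimator, X[:, mask], y, cv=cv, scoring="accuracy")
    return float(scores.mean()), n_selected


X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Evaluate the candidate that keeps all four original Iris features.
score, n_selected = fitness([1, 1, 1, 1], clf, X, y, cv)
# A multi-objective search would try to raise `score` while lowering `n_selected`.
```

Returning the two values as a tuple keeps the objectives separate, so the search can trade accuracy against subset size instead of collapsing them into one number.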

In [3]:
!pip install sklearn-genetic-opt

Collecting sklearn-genetic-opt
  Downloading sklearn_genetic_opt-0.7.0-py3-none-any.whl (29 kB)
Collecting deap>=1.3.1
  Downloading deap-1.3.1-cp38-cp38-macosx_10_14_x86_64.whl (109 kB)
Installing collected packages: deap, sklearn-genetic-opt
Successfully installed deap-1.3.1 sklearn-genetic-opt-0.7.0


In [4]:
from sklearn_genetic import GAFeatureSelectionCV

evolved_estimator = GAFeatureSelectionCV(
    estimator=clf,
    cv=cv,
    scoring="accuracy",
    n_jobs=-1,
    verbose=True,
    keep_top_k=2,
    elitism=True,
)

# Note: this fits on the full dataset (X, y); the train/test split above is not used here.
evolved_estimator.fit(X, y)

gen	nevals	fitness 	fitness_std	fitness_max	fitness_min
0  	10    	0.874667	0.101008   	0.946667   	0.673333   
1  	14    	0.934   	0.0141264  	0.966667   	0.92       
2  	17    	0.952   	0.00884433 	0.966667   	0.94       
3  	18    	0.954667	0.00581187 	0.96       	0.946667   
4  	15    	0.962667	0.00442217 	0.966667   	0.953333   
5  	18    	0.962   	0.00426875 	0.966667   	0.953333   
6  	18    	0.96    	0.00788811 	0.966667   	0.94       
7  	18    	0.962   	0.00426875 	0.966667   	0.953333   
8  	14    	0.965333	0.00266667 	0.966667   	0.96       
9  	18    	0.965333	0.004      	0.966667   	0.953333   
10 	16    	0.966   	0.002      	0.966667   	0.96       
11 	19    	0.965333	0.004      	0.966667   	0.953333   
12 	19    	0.966   	0.002      	0.966667   	0.96       
13 	20    	0.964667	0.00305505 	0.966667   	0.96       
14 	18    	0.964   	0.00442217 	0.966667   	0.953333   
15 	18    	0.964667	0.006      	0.966667   	0.946667   
16 	19    	0.966667	1.11022e-16	0.966667   	0.96

GAFeatureSelectionCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=True),
                     estimator=DecisionTreeClassifier(), keep_top_k=2,
                     n_jobs=-1, return_train_score=True, scoring='accuracy')

Once the model is fitted, we can check which variables it chose by using the best_features_ property. It returns an array of bools, where True means the feature at that index was selected.

In [5]:
evolved_estimator.best_features_

array([ True,  True,  True,  True, False, False, False,  True, False])
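A boolean array like this can be used directly to slice the columns of the feature matrix. The sketch below hard-codes the mask printed above and uses a stand-in matrix (`X_demo`, a hypothetical name) with the same 150 x 9 shape, so it runs without refitting the model. Since the noise columns were appended after the four originals, any selected index of 4 or higher is a noise feature.

```python
import numpy as np

# Mask copied from the output above (True = feature kept).
mask = np.array([True, True, True, True, False, False, False, True, False])

# Stand-in for the 150 x 9 matrix built earlier (4 originals + 5 noise columns).
X_demo = np.arange(150 * 9).reshape(150, 9)

X_selected = X_demo[:, mask]  # keep only the selected columns
selected_idx = np.flatnonzero(mask)  # positions of the selected features

n_original = 4  # the first four columns are the original Iris features
noise_kept = selected_idx[selected_idx >= n_original]

print(X_selected.shape)  # (150, 5)
print(selected_idx)      # [0 1 2 3 7]
print(noise_kept)        # [7]
```

In this particular run, the selection keeps all four original features plus one of the noise columns (index 7).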