<a href="https://colab.research.google.com/github/wolfzxcv/ml-examples/blob/master/PSO_SBS_feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature selection using Particle Swarm Optimization
In this tutorial, Particle Swarm Optimization is used to find an optimal subset of features for an SVM classifier. We will be testing our implementation on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.

# Dependencies
Before we get started, make sure you have the following packages installed:
* niapy: `pip install niapy`
* scikit-learn: `pip install scikit-learn`

# Defining the problem

We want to select a subset of relevant features for use in model construction, in order to make prediction faster and more accurate. We will be using Particle Swarm Optimization to search for the optimal subset of features.
Our solution vector will represent a subset of features:
$$
x = [x_1, x_2, \dots, x_d]; \quad x_i \in [0, 1]
$$
Where d is the total number of features in the dataset. We will then use a threshold of 0.5 to determine whether each feature is selected:
$$
x_i = \begin{cases} 1, & \text{if } x_i > 0.5 \\ 0, & \text{otherwise} \end{cases}
$$
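As a minimal sketch of the thresholding step, the snippet below turns a made-up solution vector into a boolean mask and uses it to pick columns from a toy data matrix. The vector `x` and the matrix `X` are illustrative values, not taken from the dataset.

```python
import numpy as np

# Hypothetical solution vector for a dataset with d = 6 features.
x = np.array([0.12, 0.73, 0.50, 0.91, 0.05, 0.64])

# Strictly greater than 0.5 means "selected"; exactly 0.5 is not selected.
mask = x > 0.5
selected_indices = np.flatnonzero(mask)

# Toy data matrix with 2 samples and 6 feature columns.
X = np.arange(12).reshape(2, 6)
X_selected = X[:, mask]

print(mask.astype(int))   # binary form of the solution vector
print(selected_indices)   # column indices that are kept
print(X_selected.shape)   # (2, 3)
```

Indexing the columns with a boolean mask is the same mechanism the implementation below relies on when it evaluates a candidate subset.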

The function we’ll be optimizing combines the classification error with a penalty on the number of selected features. That means we’ll be minimizing the following function:
$$
f(x) = \alpha \times (1 - P) + (1 - \alpha) \times \frac{N_{selected}}{N_{features}}
$$
Where α is the parameter that decides the tradeoff between classifier performance P (classification accuracy in our case) and the number of selected features with respect to the number of all features.
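To get a feel for the tradeoff, here is a small sketch that evaluates the function for a few hypothetical candidates. The helper name `fitness` and the accuracy values are assumptions made for illustration; only the formula and the default α = 0.99 come from this tutorial.

```python
def fitness(accuracy, num_selected, num_features, alpha=0.99):
    """Penalized error: alpha * (1 - P) + (1 - alpha) * (N_selected / N_features)."""
    return alpha * (1.0 - accuracy) + (1.0 - alpha) * (num_selected / num_features)


# Hypothetical candidates on a 30-feature dataset.
full = fitness(accuracy=0.95, num_selected=30, num_features=30)
small = fitness(accuracy=0.95, num_selected=4, num_features=30)
worse = fitness(accuracy=0.90, num_selected=4, num_features=30)

# Lower is better: at equal accuracy the smaller subset wins,
# but a clear drop in accuracy outweighs the feature savings.
print(full, small, worse)
```

With α = 0.99, the penalty for using all features is at most 0.01, roughly the cost of one percentage point of accuracy, so the search favors smaller subsets mainly when accuracy is nearly unchanged.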



# Implementation

In [9]:
! pip install niapy==2.0.2
! pip install pandas==2.0.3



In [10]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

from niapy.problems import Problem
from niapy.task import Task
from niapy.algorithms.basic import ParticleSwarmOptimization

First, we implement a `Problem` subclass that computes the optimization function defined above. It takes the training dataset and the α parameter, which defaults to 0.99.
In the objective function, the solution vector is first converted to binary using the threshold value of 0.5, which gives us the indices of the selected features. If no features are selected, 1.0 (the worst possible fitness) is returned. Otherwise, we compute the mean accuracy of 2-fold cross-validation on the training set and use it to calculate the value of the optimization function defined above.

In [11]:
class SVMFeatureSelection(Problem):
    def __init__(self, X_train, y_train, alpha=0.99):
        super().__init__(dimension=X_train.shape[1], lower=0, upper=1)
        self.X_train = X_train
        self.y_train = y_train
        self.alpha = alpha

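    def _evaluate(self, x):
        # Threshold the continuous solution vector into a boolean feature mask
        selected = x > 0.5
        num_selected = selected.sum()
        # An empty subset cannot be evaluated, so return the worst possible fitness
        if num_selected == 0:
            return 1.0
        # Mean accuracy of 2-fold cross-validation on the selected columns
        accuracy = cross_val_score(SVC(), self.X_train[:, selected], self.y_train, cv=2, n_jobs=-1).mean()
        score = 1 - accuracy
        num_features = self.X_train.shape[1]
        # Weighted sum of the classification error and the fraction of features kept
        return self.alpha * score + (1 - self.alpha) * (num_selected / num_features)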

Load the dataset, run the algorithm and compare the results.

In [12]:
dataset = load_breast_cancer()
X = dataset.data
y = dataset.target
feature_names = dataset.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1234)

problem = SVMFeatureSelection(X_train, y_train)
task = Task(problem, max_iters=100)
algorithm = ParticleSwarmOptimization(population_size=10, seed=1234)
best_features, best_fitness = algorithm.run(task)

selected_features = best_features > 0.5
print('Number of selected features:', selected_features.sum())
print('Selected features:', ', '.join(feature_names[selected_features].tolist()))

model_selected = SVC()
model_all = SVC()

model_selected.fit(X_train[:, selected_features], y_train)
print('Subset accuracy:', model_selected.score(X_test[:, selected_features], y_test))

model_all.fit(X_train, y_train)
print('All Features Accuracy:', model_all.score(X_test, y_test))

Number of selected features: 4
Selected features: mean smoothness, mean concavity, mean symmetry, worst area
Subset accuracy: 0.9210526315789473
All Features Accuracy: 0.9122807017543859


# Same data, but using Sequential backward selection (SBS)
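Sequential backward selection starts from the full feature set and greedily removes one feature at a time, each time dropping the feature whose removal leaves the best score. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation: `backward_select`, `toy_score`, and the `usefulness` table are hypothetical names and values, and the additive toy score stands in for the cross-validated accuracy a real selector would use.

```python
def backward_select(features, score, n_keep):
    """Greedily drop features until only n_keep remain.

    `score` is a hypothetical callable that maps a list of features to a
    number where higher is better (e.g. cross-validated accuracy).
    """
    current = list(features)
    while len(current) > n_keep:
        # Try removing each feature in turn; keep the removal that hurts least.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score)
    return current


# Toy score: each feature has a fixed, additive usefulness (an assumption
# made only so the example runs without a classifier).
usefulness = {"a": 0.1, "b": 0.5, "c": 0.05, "d": 0.3}


def toy_score(subset):
    return sum(usefulness[f] for f in subset)


print(backward_select(list(usefulness), toy_score, n_keep=2))
```

Unlike PSO, which explores many subsets at once through a population of candidates, this greedy procedure follows a single path and never re-adds a feature it has removed.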

In [13]:
# Optional: export the data to a .csv file for inspection. The rest of the notebook does not depend on these lines.
import pandas as pd

df = pd.DataFrame(data=X, columns = feature_names)
df.to_csv('breast_cancer.csv', sep = ',', index = False)

In [14]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import accuracy_score

In [15]:
dataset = load_breast_cancer()
print(dataset)

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]]), 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
 

In [16]:
X = dataset.data
y = dataset.target
feature_names = dataset.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1234)

# Define the classifier
clf = SVC()

# Initialize the sequential backward feature selector
# direction can be 'forward' (the default; features are added one at a time)
# or 'backward' (features are removed one at a time).
# n_features_to_select can be a specific number, e.g. n_features_to_select=4,
# or 'auto', which keeps half of the features when tol is left at None.
sbs = SequentialFeatureSelector(clf, direction='backward', n_features_to_select='auto')

# Fit the sequential backward feature selector to the training data
sbs.fit(X_train, y_train)

# Get selected feature names
selected_feature_names = feature_names[sbs.get_support()]

# Select features for training and testing data
X_train_selected = sbs.transform(X_train)
X_test_selected = sbs.transform(X_test)

# Train the classifier on the selected features
clf.fit(X_train_selected, y_train)

# Make predictions on the testing data using selected features
y_pred_selected = clf.predict(X_test_selected)

print('Number of selected features:', len(selected_feature_names))

# Print selected feature names
print('Selected features:', ', '.join(selected_feature_names))

# Calculate accuracy using selected features
accuracy_selected = accuracy_score(y_test, y_pred_selected)
print("Subset accuracy:", accuracy_selected)

# Train the classifier on all features
clf.fit(X_train, y_train)

# Make predictions on the testing data using all features
y_pred_all_features = clf.predict(X_test)

# Calculate accuracy using all features
accuracy_all_features = accuracy_score(y_test, y_pred_all_features)
print("All Features Accuracy:", accuracy_all_features)

Number of selected features: 15
Selected features: smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error, worst radius, worst texture, worst perimeter, worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, worst fractal dimension
Subset accuracy: 0.9473684210526315
All Features Accuracy: 0.9122807017543859
