# Quantum Feature Selection

Feature selection posed as an Ising-type problem can be tackled with several different solvers. As mentioned in the documentation, the idea is to encode each feature's importance as the linear weight of its variable and the redundancy between each pair of features as the quadratic weight coupling them.

![featureselection.png](../../assets/featureselection.png)

This is the problem Falcondale builds; it then offers several options to solve it.
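
To make the formulation concrete, here is a minimal sketch of how such a problem could be assembled as a QUBO (the binary form of an Ising problem). It is not Falcondale's implementation: using absolute Pearson correlation for both importance and redundancy, the `alpha` trade-off parameter, and the helper names `pearson` and `build_qubo` are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def build_qubo(columns, target, alpha=0.5):
    """Return feature names and a QUBO stored as {(i, j): weight}.

    Assumptions (illustrative only):
      importance = |corr(feature, target)|
      redundancy = |corr(feature_i, feature_j)|
      alpha (hypothetical) balances the two terms.
    """
    names = list(columns)
    qubo = {}
    for i, name in enumerate(names):
        # Linear (diagonal) term: reward features correlated with the target.
        qubo[(i, i)] = -alpha * abs(pearson(columns[name], target))
    for i, j in combinations(range(len(names)), 2):
        # Quadratic term: penalize selecting two redundant features together.
        redundancy = abs(pearson(columns[names[i]], columns[names[j]]))
        qubo[(i, j)] = (1 - alpha) * redundancy
    return names, qubo


# Toy data: "a" and "b" are near copies, "c" carries different information.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.1, 2.9, 4.2, 5.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
target = [0.0, 0.0, 1.0, 1.0, 1.0]
names, qubo = build_qubo(columns, target)
for key, weight in sorted(qubo.items()):
    print(key, round(weight, 3))
```

Minimizing the resulting energy favors features with large negative diagonal weights while discouraging pairs with large positive couplings, which is the trade-off the figure above depicts.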

In [1]:
from sklearn import datasets

# import some data to play with
X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)
X

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 25.380 | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 24.990 | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 23.570 | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 14.910 | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 22.540 | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |


In [3]:
from falcondale import Project

dataset = X.copy()
dataset["target"] = y

myproject = Project(dataset, target="target")
myproject.preprocess()

By default, **Simulated Annealing** is the technique chosen to select the feature subset.

In [4]:
features = myproject.feature_selection(max_cols = 10)
features

['concavity error',
 'fractal dimension error',
 'mean area',
 'mean concavity',
 'mean smoothness',
 'mean texture',
 'symmetry error',
 'worst compactness',
 'worst fractal dimension',
 'worst perimeter']
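
For readers unfamiliar with the default technique, the following is a minimal, generic sketch of simulated annealing over a QUBO stored as `{(i, j): weight}`. It is not Falcondale's solver: the function names, the single-bit-flip move, the geometric cooling schedule, and the toy weights are all assumptions chosen for illustration.

```python
import math
import random


def qubo_energy(qubo, bits):
    """Energy of a binary assignment under a QUBO given as {(i, j): weight}."""
    return sum(w * bits[i] * bits[j] for (i, j), w in qubo.items())


def simulated_annealing(qubo, n_vars, n_steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize a QUBO with single-bit flips and a geometric cooling schedule."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n_vars)]
    energy = qubo_energy(qubo, bits)
    best_bits, best_energy = bits[:], energy
    for step in range(n_steps):
        # Temperature decays geometrically from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (n_steps - 1))
        k = rng.randrange(n_vars)
        bits[k] ^= 1  # propose flipping one bit
        new_energy = qubo_energy(qubo, bits)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_bits, best_energy = bits[:], energy
        else:
            bits[k] ^= 1  # rejected: undo the flip
    return best_bits, best_energy


# Toy QUBO: variables 0 and 1 are both useful but redundant with each other;
# variable 2 is mildly useful and independent.
toy_qubo = {(0, 0): -1.0, (1, 1): -0.9, (2, 2): -0.5, (0, 1): 2.0}
solution, energy = simulated_annealing(toy_qubo, n_vars=3)
print(solution, energy)
```

On this toy problem the lowest-energy assignment keeps variables 0 and 2 and drops the redundant variable 1, mirroring how a redundant feature is excluded from the selected subset.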

In [5]:
myproject.show_features()

| | concavity error | fractal dimension error | mean area | mean concavity | mean smoothness | mean texture | symmetry error | worst compactness | worst fractal dimension | worst perimeter |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.135682 | 0.183042 | 0.363733 | 0.703140 | 0.593753 | 0.022658 | 0.311645 | 0.619292 | 0.418864 | 0.668310 |
| 1 | 0.046970 | 0.091110 | 0.501591 | 0.203608 | 0.289880 | 0.272574 | 0.084539 | 0.154563 | 0.222878 | 0.539818 |
| 2 | 0.096768 | 0.127006 | 0.449417 | 0.462512 | 0.514309 | 0.390260 | 0.205690 | 0.385375 | 0.213433 | 0.508442 |
| 3 | 0.142955 | 0.287205 | 0.102906 | 0.565604 | 0.811321 | 0.360839 | 0.728148 | 0.814012 | 0.773711 | 0.241347 |
| 4 | 0.143636 | 0.145800 | 0.489290 | 0.463918 | 0.430351 | 0.156578 | 0.136179 | 0.172415 | 0.142595 | 0.506948 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 0.131263 | 0.115536 | 0.566490 | 0.571462 | 0.526948 | 0.428813 | 0.045843 | 0.178527 | 0.105667 | 0.576174 |
| 565 | 0.099747 | 0.055387 | 0.474019 | 0.337395 | 0.407782 | 0.626987 | 0.156160 | 0.159997 | 0.074315 | 0.520892 |
| 566 | 0.119444 | 0.103547 | 0.303118 | 0.216753 | 0.288165 | 0.621238 | 0.074548 | 0.273705 | 0.151909 | 0.379949 |
| 567 | 0.179722 | 0.182766 | 0.475716 | 0.823336 | 0.588336 | 0.663510 | 0.216103 | 0.815758 | 0.452315 | 0.668310 |


We can now train a model to check how well it performs with this selection.

In [6]:
model = myproject.evaluate("qnn")
model.print_report()

Iter:     1 | Cost: 0.0460781 | Acc train: 0.4615385
Iter:     2 | Cost: 0.0322793 | Acc train: 0.3846154
Iter:     3 | Cost: 0.0288865 | Acc train: 0.5384615
Iter:     4 | Cost: 0.0263864 | Acc train: 0.5384615
Iter:     5 | Cost: 0.0163124 | Acc train: 0.6153846
Iter:     6 | Cost: 0.0137854 | Acc train: 0.6666667
Iter:     7 | Cost: 0.0099398 | Acc train: 0.6923077
Iter:     8 | Cost: 0.0106560 | Acc train: 0.6153846
Iter:     9 | Cost: 0.0142279 | Acc train: 0.6666667
Iter:    10 | Cost: 0.0074605 | Acc train: 0.8205128
Iter:    11 | Cost: 0.0094176 | Acc train: 0.7435897
Iter:    12 | Cost: 0.0091387 | Acc train: 0.7948718
Iter:    13 | Cost: 0.0080738 | Acc train: 0.7179487
Iter:    14 | Cost: 0.0091742 | Acc train: 0.6923077
Iter:    15 | Cost: 0.0114912 | Acc train: 0.6666667
Iter:    16 | Cost: 0.0120174 | Acc train: 0.6153846
Iter:    17 | Cost: 0.0073769 | Acc train: 0.7435897
Iter:    18 | Cost: 0.0117163 | Acc train: 0.7179487
Iter:    19 | Cost: 0.0129730 | Acc train: 0.7

In [7]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 10, method="sb")
features

['concavity error',
 'fractal dimension error',
 'mean area',
 'mean concavity',
 'mean smoothness',
 'mean texture',
 'symmetry error',
 'worst compactness',
 'worst fractal dimension',
 'worst perimeter']

Both methods agree: Simulated Bifurcation (`method="sb"`) returns the same ten features as Simulated Annealing. The problem may not be challenging enough to tell them apart, so let's reduce the number of features further.

In [8]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 3, method="sb")
features

['mean area',
 'mean concavity',
 'worst compactness',
 'worst fractal dimension',
 'worst perimeter']

In [9]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 3, method="sa")
features

['worst compactness', 'worst fractal dimension', 'worst perimeter']
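
The two runs above returned five and three features for the same `max_cols = 3`. One plausible reading is that the limit enters the objective as a soft penalty rather than a hard constraint. The sketch below assumes a quadratic penalty of the form `strength * (sum(x) - k)**2`, which is a common way to encode a cardinality target in a QUBO; whether Falcondale uses exactly this term is not confirmed here, and the function names and weights are hypothetical.

```python
from itertools import combinations, product


def add_cardinality_penalty(qubo, n_vars, k, strength):
    """Return a new QUBO with strength * (sum(x) - k)**2 added (constant dropped)."""
    penalized = dict(qubo)
    for i in range(n_vars):
        # x_i**2 == x_i for binary variables, so the linear part is (1 - 2k) * x_i.
        penalized[(i, i)] = penalized.get((i, i), 0.0) + strength * (1 - 2 * k)
    for i, j in combinations(range(n_vars), 2):
        # Cross terms of the expanded square contribute 2 * x_i * x_j.
        penalized[(i, j)] = penalized.get((i, j), 0.0) + 2 * strength
    return penalized


def best_assignment(qubo, n_vars):
    """Exhaustively find a minimum-energy bit string (only viable for small n_vars)."""
    def energy(bits):
        return sum(w * bits[i] * bits[j] for (i, j), w in qubo.items())
    return min(product((0, 1), repeat=n_vars), key=energy)


# Five features that are all individually attractive and not redundant.
base = {(i, i): -1.0 for i in range(5)}

weak = best_assignment(add_cardinality_penalty(base, 5, k=3, strength=0.1), 5)
strong = best_assignment(add_cardinality_penalty(base, 5, k=3, strength=2.0), 5)
print(sum(weak), sum(strong))
```

With the weak penalty the optimum keeps all five features, because the gain from each extra feature outweighs the cost of overshooting; with the strong penalty the optimum has exactly three. A heuristic solver can additionally stop at a non-optimal state, so an overshoot does not by itself reveal which effect is at play.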

As we can see, the regularization term did not have the same effect in both cases: **Simulated Bifurcation** returned five features, exceeding the maximum of three, while **Simulated Annealing** returned exactly three. Let's check their performance when training a model.

In [10]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 3, method="sb")
model = myproject.evaluate("qsvc")
model.print_report()

              precision    recall  f1-score   support

           0       0.88      0.87      0.87        60
           1       0.93      0.94      0.93       111

    accuracy                           0.91       171
   macro avg       0.90      0.90      0.90       171
weighted avg       0.91      0.91      0.91       171



In [11]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 3, method="sa")
model = myproject.evaluate("qsvc")
model.print_report()

              precision    recall  f1-score   support

           0       0.88      0.87      0.87        60
           1       0.93      0.94      0.93       111

    accuracy                           0.91       171
   macro avg       0.90      0.90      0.90       171
weighted avg       0.91      0.91      0.91       171



We can also check the response from D-Wave's actual hardware by providing an API token.

In [12]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 3, token="DEV-...")
model = myproject.evaluate("qsvc")
model.print_report()

              precision    recall  f1-score   support

           0       0.87      0.78      0.82        60
           1       0.89      0.94      0.91       111

    accuracy                           0.88       171
   macro avg       0.88      0.86      0.87       171
weighted avg       0.88      0.88      0.88       171



Finally, the **Quantum Approximate Optimization Algorithm** (QAOA) can also be used. Because it is simulated locally, we may be restricted by our machine's capacity, so a pre-selection may be needed to reduce the number of features before running it.
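
To get a feel for that restriction, the rough estimate below assumes one qubit per candidate feature and a dense statevector simulator storing one `complex128` amplitude (16 bytes) per basis state. Both are assumptions for illustration; the simulator Falcondale uses may store its state differently, and the helper names are hypothetical.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense statevector: 2**n amplitudes of 16 bytes each."""
    return (2 ** n_qubits) * bytes_per_amplitude


def human_readable(n_bytes):
    """Format a byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1024


# 30 is the number of feature columns in the raw breast cancer dataset;
# 12 is the pre-selection size used in this notebook.
for n in (12, 20, 30):
    print(n, human_readable(statevector_bytes(n)))
```

Under these assumptions, memory doubles with every additional feature: 12 features need about 64 KiB, while all 30 would need about 16 GiB, which is why trimming the candidate set first makes local simulation practical.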

A warning will appear when a previous subset selection is already in force.

In [13]:
# Let's set the same starting point
myproject.preprocess()
features = myproject.feature_selection(max_cols = 12)
features = myproject.feature_selection(max_cols = 3, method="qaoa")
model = myproject.evaluate("qsvc")
model.print_report()

Your previous selection is in force, preprocess the dataset to start over.
              precision    recall  f1-score   support

           0       0.87      0.77      0.81        60
           1       0.88      0.94      0.91       111

    accuracy                           0.88       171
   macro avg       0.87      0.85      0.86       171
weighted avg       0.88      0.88      0.88       171

