### Setup

In [27]:
# imports
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from gsbfs.gsbfs import gso_rank, gso_boruta_select, get_expected_hits
# make this notebook's output stable across runs
np.random.seed(42)

### Feature Ranking
Let us create a data set with 5000 observations and 50 features, of which only 10 are informative. Because shuffle=False is passed, the informative features occupy the first columns (indexes 0 to 9). The features are then ranked with the gso_rank() function.

In [28]:
# create instances
n_features = 50
n_informative = 10
X, y = make_classification(
    n_samples=5000,
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=0,
    n_repeated=0,
    n_classes=3,
    shuffle=False,  # preserve ordering. first columns = informative features
)
# shuffle instances
p = np.random.permutation(y.size)
X, y = X[p, :], y[p]
# rank features
ranked_indexes, cos_sq_max = gso_rank(X, y)
print(f"Ranked Features (total={n_features}, informative=[0,{n_informative-1}]):")
print(ranked_indexes)

Ranked Features (total=50, informative=[0,9]):
[ 8  3  9  0  4  5  2  6 39 40  7 19  1 26 46 34 42 13 24 44 27 12 15 32
 31 36 20 17 49 28 38 35 48 37 21 41 22 47 11 25 10 14 23 30 16 18 33 43
 45 29]
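
One way to read this ranking is to check how many of the informative features (indexes 0 to 9) land in the first 10 positions, and where the others end up. The sketch below copies the printed ranking into a plain list so it runs on its own; the variable names `rank_of` and `late` are illustrative and not part of gsbfs.

```python
# Ranking copied verbatim from the gso_rank() output printed above.
ranked_indexes = [
     8,  3,  9,  0,  4,  5,  2,  6, 39, 40,  7, 19,  1, 26, 46, 34, 42, 13, 24, 44, 27, 12, 15, 32,
    31, 36, 20, 17, 49, 28, 38, 35, 48, 37, 21, 41, 22, 47, 11, 25, 10, 14, 23, 30, 16, 18, 33, 43,
    45, 29,
]
n_informative = 10

# informative features found among the first n_informative ranks
top_k = ranked_indexes[:n_informative]
informative_in_top_k = sorted(i for i in top_k if i < n_informative)

# 1-based rank of every informative feature
rank_of = {f: ranked_indexes.index(f) + 1 for f in range(n_informative)}
# informative features ranked below position n_informative
late = {f: r for f, r in rank_of.items() if r > n_informative}

print(f"{len(informative_in_top_k)} of {n_informative} informative features in the top {n_informative}")
print(f"informative features ranked later (feature: rank): {late}")
```

For this run, 8 of the 10 informative features are ranked in the top 10; features 7 and 1 follow at ranks 11 and 13, behind the noise features 39 and 40.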


### Expected Hits
The Boruta algorithm counts the number of "hits" for each feature over a series of trials. For instance, with 20 trials, a per-trial hit probability of 0.5, and a maximum PMF value of 0.005 (0.5%), the get_expected_hits() function returns the hit counts for which a feature is selected.

In [29]:
n_trials = 20
proba = 0.5
pmf_max = 0.005
rejected_hits, selected_hits = get_expected_hits(n_trials, proba, pmf_max)
print(f"Hits to be selected (n_trials={n_trials}, proba={proba}, pmf_max={pmf_max}):")
print(selected_hits)

Hits to be selected (n_trials=20, proba=0.5, pmf_max=0.005):
[16 17 18 19 20]
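
The thresholds above are consistent with a binomial model of the hit count. The sketch below is an assumption about how such thresholds can be derived, not the gsbfs implementation: it treats the number of hits as Binomial(n_trials, proba) and keeps the hit counts whose probability mass falls below pmf_max, splitting them into a lower tail (rejected) and an upper tail (selected). The function name `expected_hits_sketch` is hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n independent trials with hit probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_hits_sketch(n_trials, proba, pmf_max):
    """Hypothetical stand-in for get_expected_hits(): threshold the binomial PMF.

    Hit counts with a PMF below pmf_max are considered unlikely under chance;
    those below the mean go to `rejected`, those above it go to `selected`.
    """
    mean = n_trials * proba
    rejected, selected = [], []
    for k in range(n_trials + 1):
        if binomial_pmf(k, n_trials, proba) < pmf_max:
            if k < mean:
                rejected.append(k)
            elif k > mean:
                selected.append(k)
    return rejected, selected


rejected_hits, selected_hits = expected_hits_sketch(20, 0.5, 0.005)
print(rejected_hits)  # [0, 1, 2, 3, 4]
print(selected_hits)  # [16, 17, 18, 19, 20]
```

With 20 trials and a hit probability of 0.5, the PMF at 16 hits is 4845 / 2^20 ≈ 0.0046, just under the 0.005 limit, while the PMF at 15 hits is about 0.0148, which is why the selected range starts at 16.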


### Feature Selection
Using the same data set (50 features, of which only 10 are informative), let us predict which features are informative with the gso_boruta_select() function, which calls get_expected_hits() internally.

In [30]:
# select features
rejected_indexes, selected_indexes, indecisive_indexes = gso_boruta_select(X, y)
print(f"Selected Features (total={n_features}, informative=[0,{n_informative-1}]):")
print(selected_indexes)

Selected Features (total=50, informative=[0,9]):
[0 2 3 4 5 6 8 9]


The selection contains no noise feature but misses the informative features 1 and 7, which gives the informative/noise classification report below.

In [31]:
# ground truth: the first n_informative columns are the informative features
informative_true = np.zeros(n_features, dtype=bool)
informative_true[:n_informative] = True
# prediction: the features returned as selected
informative_pred = np.zeros(n_features, dtype=bool)
informative_pred[selected_indexes] = True
print(classification_report(
    informative_true, informative_pred,
    target_names=['NOISE', 'INFORMATIVE'], digits=4,
))

              precision    recall  f1-score   support

       NOISE     0.9524    1.0000    0.9756        40
 INFORMATIVE     1.0000    0.8000    0.8889        10

    accuracy                         0.9600        50
   macro avg     0.9762    0.9000    0.9322        50
weighted avg     0.9619    0.9600    0.9583        50
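
The figures in this report can be checked by hand from the confusion counts. The sketch below recomputes them in plain Python from the selected indexes printed earlier; the helper `prf` is illustrative and not part of scikit-learn or gsbfs.

```python
n_features, n_informative = 50, 10
selected_indexes = [0, 2, 3, 4, 5, 6, 8, 9]  # copied from the gso_boruta_select() output

informative = set(range(n_informative))
selected = set(selected_indexes)

tp = len(selected & informative)   # informative and selected
fp = len(selected - informative)   # noise but selected
fn = len(informative - selected)   # informative but not selected
tn = n_features - tp - fp - fn     # noise and not selected


def prf(tp, fp, fn):
    """Precision, recall and F1 for one class, given its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


informative_scores = prf(tp, fp, fn)
# With NOISE as the positive class, the roles of the counts swap.
noise_scores = prf(tn, fn, fp)
accuracy = (tp + tn) / n_features

print("INFORMATIVE", [round(v, 4) for v in informative_scores])  # [1.0, 0.8, 0.8889]
print("NOISE      ", [round(v, 4) for v in noise_scores])        # [0.9524, 1.0, 0.9756]
print("accuracy   ", round(accuracy, 4))                         # 0.96
```

The counts are tp=8, fp=0, fn=2, tn=40: the recall of 0.8 for INFORMATIVE comes from the two missed features, and the precision of 0.9524 for NOISE is 40 / 42, since those two features are counted as noise.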

