# 05. Dimension reduction algorithms

In this step, the number of features obtained previously is reduced so that the model converges faster. Several methods exist: some select the most important features, others create new ones.

This notebook and its processing steps closely follow the scikit-learn feature selection guide, available at: https://scikit-learn.org/stable/modules/feature_selection.html.

## 05.a. Imports, logging configuration and dataset preparation

The first step is to perform the necessary imports and configure the program.

In [None]:
# Enable these lines if live changes are made to the codebase
# %load_ext autoreload
# %autoreload 2

In [None]:
# Disable tensorflow logging
import os
import logging
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}
logging.getLogger('tensorflow').setLevel(logging.FATAL)

In [None]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [None]:
from bugfinder.settings import LOGGER
from bugfinder.dataset import CWEClassificationDataset as Dataset
from bugfinder.features.reduction.variance_threshold import FeatureSelector as VarianceThreshold
from bugfinder.features.reduction.univariate_select import FeatureSelector as UnivariateSelect
from bugfinder.features.reduction.select_from_model import FeatureSelector as SelectFromModel
from bugfinder.features.reduction.auto_encoder import FeatureSelector as AutoEncoder
from bugfinder.features.reduction.sequential_feature_selector \
    import FeatureSelector as SequentialFeatureSelector
from bugfinder.features.reduction.pca import FeatureSelector as PCA
from bugfinder.features.reduction.recursive_feature_elimination \
    import FeatureSelector as RecursiveFeatureElimination

In [None]:
# Setup logging to only output INFO level messages
LOGGER.setLevel(logging.INFO)

In [None]:
# Dataset directories (DO NOT EDIT)
cwe121_v__2_dataset_path = [
    "../data/cwe121_v112", "../data/cwe121_v122", "../data/cwe121_v212", "../data/cwe121_v222", 
#     "../data/cwe121_v312", "../data/cwe121_v322"
]
cwe121_v__3_dataset_path = [
    "../data/cwe121_v113", "../data/cwe121_v123", "../data/cwe121_v213", "../data/cwe121_v223", 
#     "../data/cwe121_v313", "../data/cwe121_v323"
]

## 05.b. Variance Threshold

This feature selector removes features with low variance, as defined by the `threshold` parameter.
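
To illustrate the idea outside of `bugfinder`, here is a minimal sketch using scikit-learn's `VarianceThreshold` directly. The toy matrix and the `0.2` threshold are made up for the example, and how `bugfinder` maps its own `threshold` parameter onto scikit-learn is not shown in this notebook.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): 6 samples, 3 binary features.
# Column 0 is constant, column 1 is nearly constant, column 2 is balanced.
X = np.array([
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])

# Keep only the features whose variance is strictly above the threshold.
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

# Indices of the columns that survived the filter.
kept = selector.get_support(indices=True)
print("Kept columns:", kept, "new shape:", X_reduced.shape)
```

Only the balanced column has a variance (0.25) above the threshold, so the two other columns are dropped.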

In [None]:
reducer_params = {
    "threshold": 0.995,  # Should be between 0 and 1
    "dry_run": True
}

In [None]:
for dataset_path in cwe121_v__2_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(VarianceThreshold, reducer_params)
    dataset.process()

## 05.c. Univariate feature selection

This selector keeps the best features according to a univariate statistical test (`function`), using the selection strategy given by `mode`.
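
As a rough sketch of what such a selection looks like in plain scikit-learn, the example below uses `GenericUnivariateSelect`, whose `mode` and `param` arguments accept the same values as the ones listed in this notebook. Whether `bugfinder` wraps this exact class is an assumption, and the iris dataset is used only as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

# Stand-in dataset: 150 samples, 4 non-negative features (required by chi2).
X, y = load_iris(return_X_y=True)

# "k_best" keeps the `param` highest-scoring features.
selector = GenericUnivariateSelect(score_func=chi2, mode="k_best", param=2)
X_reduced = selector.fit_transform(X, y)

print("Scores per feature:", selector.scores_)
print("Kept columns:", selector.get_support(indices=True))
print("New shape:", X_reduced.shape)
```

With `mode="percentile"`, `param` would instead be read as a percentage of features to keep, which is why its type depends on the selected mode.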

In [None]:
scoring_functions = ["chi2", "f_classif", "mutual_info_classif"]
scoring_modes = ["k_best", "percentile", "fpr", "fdr", "fwe"]

reducer_params = {
    "function": scoring_functions[0],
    "mode": scoring_modes[0],
    "param": 200,  # `float` or `int`, depends on the selected mode
    "dry_run": True
}

In [None]:
for dataset_path in cwe121_v__2_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(UnivariateSelect, reducer_params)
    dataset.process()

## 05.d. Select from model

This selector chooses the best features according to the training results of one of the available estimators.
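
The sketch below shows the same idea with scikit-learn's `SelectFromModel` and a `LogisticRegression` estimator. The dataset and the use of the default threshold are assumptions made for the example, not the `bugfinder` configuration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Stand-in dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# The estimator is trained first; features whose importance (derived from
# the fitted coefficients) falls below the threshold are then discarded.
estimator = LogisticRegression(max_iter=1000)
selector = SelectFromModel(estimator)
X_reduced = selector.fit_transform(X, y)

print("Kept columns:", selector.get_support(indices=True))
print("New shape:", X_reduced.shape)
```

Tree-based estimators expose `feature_importances_` instead of coefficients, and `SelectFromModel` handles both cases.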

In [None]:
estimators = [
    "LogisticRegression",
    "LogisticRegressionCV",
    "PassiveAggressive",
    "Perceptron",
    "Ridge",
    "RidgeCV",
    "SGD",
    "DecisionTree",
    "ExtraTree",
    "AdaBoost",    
    "ExtraTrees",
    "GradientBoosting",
    "RandomForest",
    "SVC",
    "SVR",
    "NuSVC",
    "NuSVR",
    "OneClassSVM"
]

reducer_params = {
    "model": estimators[0],
    "dry_run": True
}

In [None]:
for dataset_path in cwe121_v__2_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(SelectFromModel, reducer_params)
    dataset.process()

## 05.e. Recursive feature elimination

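This selector repeatedly trains the chosen model and removes the features with the least impact on the sum of squares error, until the number of features given by `features` remains.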

**Note:** Depending on the number of features selected, the execution of this selector can be long.
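
For reference, a minimal sketch with scikit-learn's `RFE` is given below. Note that scikit-learn ranks features by the fitted estimator's coefficients or feature importances; the synthetic dataset and the target of 5 features are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): 200 samples, 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Refit the model repeatedly, dropping one feature (step=1) per iteration
# until only 5 remain.
selector = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=5, step=1
)
X_reduced = selector.fit_transform(X, y)

# Selected features have rank 1; higher ranks were eliminated earlier.
print("Ranking:", selector.ranking_)
print("New shape:", X_reduced.shape)
```

Because the model is refitted after each elimination, the cost grows with the number of features to remove, which explains the long execution times mentioned above.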

In [None]:
reducer_params = {
    "model": estimators[0], 
    "cross_validation": False, 
    "features": 1000, 
    "dry_run": True
}

In [None]:
for dataset_path in cwe121_v__2_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(RecursiveFeatureElimination, reducer_params)
    dataset.process()

## 05.f. Sequential feature selection

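This selector greedily adds (`forward`) or removes (`backward`) features one at a time, depending on the `direction` parameter, choosing at each step the feature with the most impact on the sum of squares error.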

**Note:** Depending on the parameters selected, the execution of this selector can be long.
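
A small sketch of forward selection with scikit-learn's `SequentialFeatureSelector` follows. In scikit-learn, each candidate feature is evaluated with a cross-validated score; the dataset, the 2 features to select, and the 3 folds are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Stand-in dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Start from an empty set and add, at each step, the feature that gives
# the best cross-validated score, until 2 features are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=3,
)
X_reduced = selector.fit_transform(X, y)

print("Kept columns:", selector.get_support(indices=True))
print("New shape:", X_reduced.shape)
```

Each step trains one model per remaining candidate feature and per fold, so the run time rises quickly with the number of features.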

In [None]:
directions = ["forward", "backward"]

reducer_params = {
    "model": estimators[0], 
    "direction": directions[0], 
    "features": 10, 
    "dry_run": True
}

In [None]:
for dataset_path in cwe121_v__2_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(SequentialFeatureSelector, reducer_params)
    dataset.process()

## 05.g. Auto encoders

This step defines a neural network with as many input and output neurons as there are features. The hidden layers have fewer neurons, which is what performs the dimension reduction.
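
The principle can be sketched without TensorFlow by training scikit-learn's `MLPRegressor` to reproduce its own input. This is a stand-in for illustration only, not the `bugfinder` implementation: the random data, the layer sizes, and the `encode` helper are all hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Random stand-in data: 200 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.random((200, 10))

# Same number of inputs and outputs (10); the middle hidden layer is the
# bottleneck (3 neurons). May emit a convergence warning on such small runs.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(6, 3, 6), activation="relu",
    max_iter=500, random_state=0,
)
autoencoder.fit(X, X)  # the target is the input itself


def encode(model, data, n_encoder_layers=2):
    """Run only the first layers of the network, up to the bottleneck."""
    hidden = data
    layers = zip(model.coefs_[:n_encoder_layers],
                 model.intercepts_[:n_encoder_layers])
    for weights, bias in layers:
        hidden = np.maximum(hidden @ weights + bias, 0.0)  # ReLU
    return hidden


X_reduced = encode(autoencoder, X)
print("Reduced shape:", X_reduced.shape)
```

The reduced features are the activations of the bottleneck layer, so unlike the selectors above they are new features rather than a subset of the original ones.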

In [None]:
reducer_params = {
    "dimension": 250, 
    "layers": "500,100,500", 
    "encoder_path": "/tmp/encoder.mdl", 
    "dry_run": True
}

In [None]:
for dataset_path in cwe121_v__2_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AutoEncoder, reducer_params)
    dataset.process()

## Conclusion

In this notebook, the number of features has been reduced to ease the training step described in the [next notebook](./06_models_training.ipynb).