# 04. Feature extraction 

In this notebook, features are extracted from the datasets [previously annotated](./03_neo4j_processing.ipynb). These features are needed to train the neural network in the final step.

Please note that the current extractors retrieve counts of links between two nodes, in the form **source**-**flow**-**sink**.

Source and sink nodes are formatted according to the AST markup computed in the [previous notebook](./03_neo4j_processing.ipynb). For each extractor, a feature list is created with `dataset.queue_operation(ExtractorClass, {"need_map_features": True})`. This feature list has to be computed only once and determines the list of labels needed by the feature extractor.
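The two-pass idea described above (build the label list once, then count against it) can be illustrated with a minimal, self-contained sketch. It does not use the `bugfinder` API: the helper names `build_feature_map` and `vectorize`, the AST markup values and the test case names are all hypothetical and only serve the illustration.

```python
from collections import Counter


def build_feature_map(test_cases):
    """First pass: collect every distinct source-flow-sink label.

    Sorting gives a stable column order shared by all test cases.
    """
    labels = set()
    for triples in test_cases.values():
        labels.update("-".join(triple) for triple in triples)
    return sorted(labels)


def vectorize(triples, feature_map):
    """Second pass: count the labels of one test case, in feature-map order."""
    counts = Counter("-".join(triple) for triple in triples)
    return [counts[label] for label in feature_map]


# Hypothetical data: test case name -> list of (source, flow, sink) triples
test_cases = {
    "case_a": [
        ("Decl", "FLOWS_TO", "Call"),
        ("Decl", "FLOWS_TO", "Call"),
        ("Call", "REACHES", "Return"),
    ],
    "case_b": [("Call", "CONTROLS", "Return")],
}

feature_map = build_feature_map(test_cases)
vectors = {name: vectorize(triples, feature_map) for name, triples in test_cases.items()}
```

Because the label list fixes the column order, it only needs to be built once per dataset; every later extraction can reuse it.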

## 04.a. Imports, logging configuration and dataset preparation

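The first step is to perform the necessary imports and configure the program. Additionally, the previously used datasets are copied into separate clones so that their features can be extracted (the `v__2` and `v__3` paths below; the `v__1` clone used by the deprecated extractor is commented out).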

In [None]:
# Enable these lines if live changes are made to the codebase
# %load_ext autoreload
# %autoreload 2

In [None]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [None]:
import logging
from bugfinder.settings import LOGGER
from bugfinder.dataset import CWEClassificationDataset as Dataset
from bugfinder.dataset.processing.dataset_ops import CopyDataset, RightFixer
from bugfinder.features.extraction.any_hop.all_flows import FeatureExtractor as AnyHopAllFlowsExtractor
from bugfinder.features.extraction.any_hop.single_flow import FeatureExtractor as AnyHopSingleFlowExtractor
from bugfinder.features.extraction.single_hop.raw import FeatureExtractor as SingleHopRawExtractor
from bugfinder.features.extraction.hops_n_flows import FeatureExtractor as HopsNFlowsExtractor

In [None]:
# Setup logging to only output INFO level messages
LOGGER.setLevel(logging.INFO)

In [None]:
# Dataset directories (DO NOT EDIT)
v__0_dataset_path = [
    "../data/ai-dataset_v110", "../data/ai-dataset_v120", "../data/ai-dataset_v210", "../data/ai-dataset_v220", 
]
# v__1_dataset_path = [
#     "../data/ai-dataset_v111", "../data/ai-dataset_v121", "../data/ai-dataset_v211", "../data/ai-dataset_v221",
# ]
v__2_dataset_path = [
    "../data/ai-dataset_v112", "../data/ai-dataset_v122", "../data/ai-dataset_v212", "../data/ai-dataset_v222",
]
v__3_dataset_path = [
    "../data/ai-dataset_v113", "../data/ai-dataset_v123", "../data/ai-dataset_v213", "../data/ai-dataset_v223",
]

dataset_to_copy = [
#     v__1_dataset_path, v__2_dataset_path, v__3_dataset_path,
    v__2_dataset_path, v__3_dataset_path
]

In [None]:
# Create the necessary dataset clones
from os.path import basename

LOGGER.info("Starting datasets copy...")
LOGGER.setLevel(logging.WARNING)

for index, source_path in enumerate(v__0_dataset_path):
    dataset = Dataset(source_path)

    # Queue one copy per target dataset list, then run all copies at once
    for dataset_paths in dataset_to_copy:
        dataset.queue_operation(CopyDataset, {"to_path": dataset_paths[index], "force": True})

    dataset.process()
    print("Dataset %s copied." % basename(source_path))
    
LOGGER.setLevel(logging.INFO)
LOGGER.info("Datasets copied.")

## 04.b. AnyHop AllFlow (DEPRECATED)

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) up to 5 hops away from the function entrypoint. Then, any node labelled `DownstreamNode` (named **root2**) located up to 3 hops away from **root1** is extracted. Node **root1** is designated as the source, **root2** is the sink, and the flow is the chain of relationships between the two nodes. This chain has the format **R1:R2:...:Rn**, where each **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then added to the feature map as a single item.
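To make the **R1:R2:...:Rn** format concrete, here is a small sketch that enumerates relationship chains on an in-memory edge list. It is an illustration only, not the extractor's implementation (which queries the graph database): the function `enumerate_flows`, the node identifiers and the cycle-skipping rule are assumptions.

```python
def enumerate_flows(edges, start, max_hops):
    """Return (chain, reached_node) pairs for every path of 1..max_hops edges.

    `edges` is a list of (source, relationship, destination) tuples and the
    chain is the relationship names joined with ':'.
    """
    adjacency = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))

    results = []
    stack = [(start, [], {start})]
    while stack:
        node, chain, visited = stack.pop()
        if len(chain) == max_hops:
            continue  # hop budget exhausted on this path
        for rel, nxt in adjacency.get(node, []):
            if nxt in visited:
                continue  # assumption: a path never revisits a node
            new_chain = chain + [rel]
            results.append((":".join(new_chain), nxt))
            stack.append((nxt, new_chain, visited | {nxt}))
    return sorted(results)


# Hypothetical graph fragment
edges = [
    ("n1", "FLOWS_TO", "n2"),
    ("n2", "REACHES", "n3"),
    ("n3", "CONTROLS", "n4"),
    ("n1", "CONTROLS", "n3"),
]
flows = enumerate_flows(edges, "n1", max_hops=2)
```

With `max_hops=2`, the chain `FLOWS_TO:REACHES` links `n1` to `n3`, while the three-hop path to `n4` through `n2` is left out.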

**Warning:** The AnyHopAllFlowsExtractor is being deprecated and will be removed in a future version. This part of the notebook will be removed as well.

In [None]:
# for dataset_path in v__1_dataset_path:
#     LOGGER.info("Processing %s..." % dataset_path)
#     dataset = Dataset(dataset_path)
#     dataset.queue_operation(AnyHopAllFlowsExtractor, {"need_map_features": True})
#     dataset.process()

In [None]:
# for dataset_path in v__1_dataset_path:
#     LOGGER.info("Processing %s..." % dataset_path)
#     dataset = Dataset(dataset_path)
#     dataset.queue_operation(AnyHopAllFlowsExtractor)
#     dataset.queue_operation(PCA, {"final_dimension": 50})
#     dataset.process()

## 04.c. SingleHop

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) 1 hop away from any node labelled `DownstreamNode` (named **root2**) within the function graph. Node **root1** is designated as the source, **root2** is the sink, and the flow is the relationship between the two nodes. The flow has the format **Ri**, where **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then counted and added to the feature map.
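As a rough illustration of the single-hop selection, the sketch below filters an edge list by node label and counts the resulting features. The node table, the AST markup values, the function `single_hop_counts` and the assumption that edges point from the upstream node to the downstream node are hypothetical; the actual extractor works on the graph database.

```python
from collections import Counter

# Hypothetical node table: node id -> (graph labels, AST markup)
nodes = {
    1: ({"UpstreamNode"}, "Decl"),
    2: ({"DownstreamNode"}, "Call"),
    3: ({"UpstreamNode", "DownstreamNode"}, "Assign"),
}

# Hypothetical edges: (source id, relationship, destination id)
edges = [
    (1, "FLOWS_TO", 2),
    (1, "FLOWS_TO", 2),
    (3, "REACHES", 2),
    (2, "CONTROLS", 1),
    (1, "CONTROLS", 3),
]


def single_hop_counts(nodes, edges):
    """Count source-flow-sink features over edges going upstream -> downstream."""
    counts = Counter()
    for src, rel, dst in edges:
        src_labels, src_ast = nodes[src]
        dst_labels, dst_ast = nodes[dst]
        if "UpstreamNode" in src_labels and "DownstreamNode" in dst_labels:
            counts[f"{src_ast}-{rel}-{dst_ast}"] += 1
    return counts


counts = single_hop_counts(nodes, edges)
```

The edge from node 2 to node 1 is ignored because node 2 does not carry the `UpstreamNode` label.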

In [None]:
for dataset_path in v__2_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(SingleHopRawExtractor, {"need_map_features": True})
    dataset.process()

In [None]:
for dataset_path in v__2_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(SingleHopRawExtractor)
#     dataset.queue_operation(PCA, {"final_dimension": 50})
    dataset.process()

## 04.d. AnyHop SingleFlow

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) `n` hops away from any node labelled `DownstreamNode` (named **root2**) within the function graph. Node **root1** is designated as the source, **root2** is the sink, and the flow is the relationship between the two nodes. The flow has the format **Ri**, where **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then counted and added to the feature map. This extractor normalizes the columns (`feat_col = feat_count / test_case_count`) for every entrypoint.
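The normalization formula can be read in more than one way. The sketch below shows one possible reading, in which every count is divided by the number of test cases (rows) in the feature matrix. The function `normalize_columns` and the sample matrix are hypothetical, and the extractor may group or scale the counts differently.

```python
def normalize_columns(matrix):
    """Divide every feature count by the number of test cases (rows)."""
    test_case_count = len(matrix)
    if test_case_count == 0:
        return []
    return [[count / test_case_count for count in row] for row in matrix]


# Hypothetical feature matrix: one row per test case, one column per feature
features = [
    [2, 0, 4],
    [0, 1, 4],
    [2, 1, 0],
    [0, 0, 8],
]
normalized = normalize_columns(features)
```

Under this reading, the scaling factor depends only on the dataset size, so relative differences between test cases are preserved.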

In [None]:
for dataset_path in v__3_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AnyHopSingleFlowExtractor, {"need_map_features": True})
    dataset.process()

In [None]:
for dataset_path in v__3_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AnyHopSingleFlowExtractor)
#     dataset.queue_operation(PCA, {"final_dimension": 50})
    dataset.process()

## 04.e. Hops and flows

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) `n` hops away from any node labelled `DownstreamNode` (named **root2**) within the function graph. The number of hops `n` is an integer in the range [`min_hops`, `max_hops`], where `min_hops > 0` and `max_hops > min_hops`. If `max_hops` is -1, the extractor retrieves all possible relationships.

Node **root1** is designated as the source, **root2** is the sink, and the flow is the relationship between the two nodes. The flow has the format **Ri**, where **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then counted and added to the feature map. This extractor normalizes the columns (`feat_col = feat_count / test_case_count`) for every entrypoint.
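The constraints on `min_hops` and `max_hops` can be sketched as a small validation helper that produces a Cypher variable-length range (`*min..max`, or `*min..` when unbounded). The helper `hop_range` is hypothetical, and reading `max_hops == -1` as "no upper bound" is an assumption; the extractor may build its queries differently.

```python
def hop_range(min_hops, max_hops):
    """Validate the hop bounds and return a Cypher variable-length range."""
    if min_hops <= 0:
        raise ValueError("min_hops must be greater than 0")
    if max_hops == -1:
        # Assumption: -1 means no upper bound on the number of hops
        return f"*{min_hops}.."
    if max_hops <= min_hops:
        raise ValueError("max_hops must be greater than min_hops (or -1)")
    return f"*{min_hops}..{max_hops}"


# Example: embed the range in a relationship pattern
pattern = "-[:FLOWS_TO|REACHES|CONTROLS" + hop_range(1, -1) + "]->"
```

Rejecting invalid bounds early avoids sending a malformed or unexpectedly expensive query to the database.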

In [None]:
for dataset_path in v__3_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    # Build the feature map only; the extraction itself runs in the next cell
    dataset.queue_operation(HopsNFlowsExtractor, {"min_hops": 1, "max_hops": -1, "need_map_features": True})
    dataset.process()

In [None]:
for dataset_path in v__3_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(HopsNFlowsExtractor, {"min_hops": 1, "max_hops": -1})
    dataset.process()

## Conclusion

In this notebook, several features were extracted from the previously annotated datasets. The [next notebook](./05_dimension_reduction.ipynb) trains the models and retrieves the results obtained.