# 04. Feature extraction 

In this notebook, features are extracted from the [previously annotated](./03_neo4j_processing.ipynb) datasets. These features are the input used to train the neural network in the final step.

Please note that the current extractors count links between two nodes, expressed as **source**-**flow**-**sink**.

Source and sink nodes are formatted according to the AST markup computed in the [previous notebook](./03_neo4j_processing.ipynb). For each extractor, a feature list is created with `dataset.queue_operation(ExtractorClass, {"need_map_features": True})`. This feature list has to be computed only once and determines the list of labels needed by the feature extractor.
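The two-pass idea described above (build the list of labels once, then count against it) can be sketched in plain Python. This is only an illustration under stated assumptions: the triples, the `make_label`, `build_feature_map` and `extract_counts` helpers and the AST type names are hypothetical and are not part of `bugfinder`.

```python
from collections import Counter

# Hypothetical (source, flow, sink) triples per test case; not real bugfinder output.
test_cases = {
    "tc_1": [
        ("Decl", "FLOWS_TO", "Call"),
        ("Decl", "FLOWS_TO", "Call"),
        ("Cond", "CONTROLS", "Assign"),
    ],
    "tc_2": [
        ("Cond", "CONTROLS", "Assign"),
        ("Decl", "REACHES", "Return"),
    ],
}


def make_label(source, flow, sink):
    """Format a triple as a source-flow-sink label."""
    return "%s-%s-%s" % (source, flow, sink)


def build_feature_map(cases):
    """First pass: collect every label seen in the dataset, in a stable order."""
    labels = set()
    for triples in cases.values():
        labels.update(make_label(*triple) for triple in triples)
    return sorted(labels)


def extract_counts(cases, feature_map):
    """Second pass: count each label per test case, following the feature map order."""
    rows = {}
    for name, triples in cases.items():
        counts = Counter(make_label(*triple) for triple in triples)
        rows[name] = [counts[label] for label in feature_map]
    return rows


feature_map = build_feature_map(test_cases)
features = extract_counts(test_cases, feature_map)
```

Because the feature map fixes the column order, every test case yields a vector of the same length, which is why the map only needs to be computed once.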

## 04.a. Imports, logging configuration and dataset preparation

The first step is to perform the necessary imports and configure the program. Additionally, the previously used datasets are copied into 3 different datasets to have their features extracted.

In [78]:
# Enable these lines if live changes are made to the codebase
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [79]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [80]:
import logging
from bugfinder.settings import LOGGER
from bugfinder.dataset import CWEClassificationDataset as Dataset
from bugfinder.dataset.processing.dataset_ops import CopyDataset, RightFixer
from bugfinder.features.any_hop.all_flows import FeatureExtractor as AnyHopAllFlowsExtractor
from bugfinder.features.any_hop.single_flow import FeatureExtractor as AnyHopSingleFlowExtractor
from bugfinder.features.single_hop.raw import FeatureExtractor as SingleHopRawExtractor
from bugfinder.features.pca import FeatureExtractor as PCA

In [81]:
# Setup logging to only output INFO level messages
LOGGER.setLevel(logging.INFO)

In [82]:
# Dataset directories (DO NOT EDIT)
cwe121_v__0_dataset_path = [
    "../data/cwe121_v110", "../data/cwe121_v120", "../data/cwe121_v210", "../data/cwe121_v220", 
#     "../data/cwe121_v310", "../data/cwe121_v320"
]
cwe121_v__1_dataset_path = [
    "../data/cwe121_v111", "../data/cwe121_v121", "../data/cwe121_v211", "../data/cwe121_v221", 
#     "../data/cwe121_v311", "../data/cwe121_v321"
]
cwe121_v__2_dataset_path = [
    "../data/cwe121_v112", "../data/cwe121_v122", "../data/cwe121_v212", "../data/cwe121_v222", 
#     "../data/cwe121_v312", "../data/cwe121_v322"
]
cwe121_v__3_dataset_path = [
    "../data/cwe121_v113", "../data/cwe121_v123", "../data/cwe121_v213", "../data/cwe121_v223", 
#     "../data/cwe121_v313", "../data/cwe121_v323"
]
# cwe121_v__4_dataset_path = [
#     "../data/cwe121_v114", "../data/cwe121_v124", "../data/cwe121_v214", "../data/cwe121_v224", 
#     "../data/cwe121_v314", "../data/cwe121_v324"
# ]

dataset_to_copy = [
#     cwe121_v__1_dataset_path, cwe121_v__2_dataset_path, cwe121_v__3_dataset_path, cwe121_v__4_dataset_path
    cwe121_v__1_dataset_path, cwe121_v__2_dataset_path, cwe121_v__3_dataset_path
]

In [97]:
# Create the necessary dataset clones
from os.path import basename

LOGGER.info("Starting datasets copy...")
LOGGER.setLevel(logging.WARNING)

for index in range(len(cwe121_v__0_dataset_path)):
    cwe121_dataset = Dataset(cwe121_v__0_dataset_path[index])
    
    for dataset_paths in dataset_to_copy:
        cwe121_dataset.queue_operation(CopyDataset, {"to_path": dataset_paths[index], "force": True})
        
    cwe121_dataset.process()
    print("Dataset %s copied." % basename(cwe121_v__0_dataset_path[index]))
    
LOGGER.setLevel(logging.INFO)
LOGGER.info("Datasets copied.")

[2020-06-04 13:40:25][INFO] Starting datasets copy...
Dataset cwe121_v110 copied.
Dataset cwe121_v120 copied.
Dataset cwe121_v210 copied.
Dataset cwe121_v220 copied.
[2020-06-04 13:40:46][INFO] Datasets copied.


## 04.b. AnyHop AllFlow

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) up to 5 hops away from the function entrypoint. Then, any node labelled `DownstreamNode` (named **root2**) located up to 3 hops away from **root1** is extracted. Node **root1** is designated as the source, **root2** is the sink and the flow is the chain of relationships between the two nodes. This chain has the following format: **R1:R2:...:Rn**, where each **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then added to the feature map as a single item.
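To make the **R1:R2:...:Rn** format concrete, here is a minimal sketch that enumerates bounded-length paths in a toy graph and joins the relationship names with `:`. The graph, the node names and the `flows_from` helper are hypothetical; the actual extractor queries the Neo4j database, which is not reproduced here.

```python
# Hypothetical toy graph: node -> list of (relationship, neighbour).
graph = {
    "a": [("FLOWS_TO", "b"), ("CONTROLS", "c")],
    "b": [("REACHES", "c")],
    "c": [],
}


def flows_from(graph, start, max_hops):
    """Return sorted (chain, end_node) pairs for every path of 1..max_hops edges."""
    results = []
    stack = [(start, [])]
    while stack:
        node, rels = stack.pop()
        if rels:
            # Join the relationships walked so far into the R1:R2:...:Rn format.
            results.append((":".join(rels), node))
        if len(rels) < max_hops:
            for rel, neighbour in graph.get(node, []):
                stack.append((neighbour, rels + [rel]))
    return sorted(results)


flows = flows_from(graph, "a", 3)
```

The hop limit bounds the traversal, so the search terminates even if the graph contains cycles.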

In [98]:
for dataset_path in cwe121_v__1_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AnyHopAllFlowsExtractor, {"need_map_features": True})
    dataset.process()

[2020-06-04 13:40:46][INFO] Processing ../data/cwe121_v111...
[2020-06-04 13:40:46][INFO] Dataset index build in 44ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-04 13:40:46][INFO] Running operation 1/1 (FeatureExtractor)...
[2020-06-04 13:40:46][INFO] Running operation 1/1 (RightFixer)...
[2020-06-04 13:40:48][INFO] 1 operations run in 1488ms.
[2020-06-04 13:41:13][INFO] Retrieved 416 entrypoints. Querying for flowgraphs...
[2020-06-04 13:41:22][INFO] Processed 10% of the dataset.
[2020-06-04 13:41:29][INFO] Processed 20% of the dataset.
[2020-06-04 13:41:35][INFO] Processed 30% of the dataset.
[2020-06-04 13:41:41][INFO] Processed 40% of the dataset.
[2020-06-04 13:41:46][INFO] Processed 50% of the dataset.
[2020-06-04 13:41:52][INFO] Processed 60% of the dataset.
[2020-06-04 13:41:58][INFO] Processed 70% of the dataset.
[2020-06-04 13:42:03][INFO] Processed 80% of the dataset.
[2020-06-04 13:42:09][INFO] Processed 90% of the dataset.
[2020-06-04 13:42:15][INFO] Processed 1

In [99]:
for dataset_path in cwe121_v__1_dataset_path[:1]:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AnyHopAllFlowsExtractor)
    dataset.queue_operation(PCA, {"final_dimension": 50})
    dataset.process()

[2020-06-04 13:42:16][INFO] Processing ../data/cwe121_v111...
[2020-06-04 13:42:16][INFO] Dataset index build in 16ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-04 13:42:16][INFO] Running operation 1/2 (FeatureExtractor)...
[2020-06-04 13:42:16][INFO] Running operation 1/1 (RightFixer)...
[2020-06-04 13:42:17][INFO] 1 operations run in 1103ms.
[2020-06-04 13:42:41][INFO] Retrieved 416 entrypoints and 7983 labels. Querying for flowgraphs...
[2020-06-04 13:42:56][INFO] Processed 10% of the dataset.
[2020-06-04 13:43:05][INFO] Processed 20% of the dataset.
[2020-06-04 13:43:13][INFO] Processed 30% of the dataset.
[2020-06-04 13:43:24][INFO] Processed 40% of the dataset.
[2020-06-04 13:43:33][INFO] Processed 50% of the dataset.
[2020-06-04 13:43:42][INFO] Processed 60% of the dataset.
[2020-06-04 13:43:51][INFO] Processed 70% of the dataset.
[2020-06-04 13:43:58][INFO] Processed 80% of the dataset.
[2020-06-04 13:44:05][INFO] Processed 90% of the dataset.
[2020-06-04 13:44:13][I

## 04.c. SingleHop

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) that is 1 hop away from a node labelled `DownstreamNode` (named **root2**) and is part of the function graph. Node **root1** is designated as the source, **root2** is the sink and the flow is the relationship between the two nodes. The flow has the format **Ri**, where **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then counted and added to the feature map.
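As a rough illustration of the single-hop case, the sketch below keeps only the edges that go directly from an `UpstreamNode` to a `DownstreamNode` and counts them by AST type. The node ids, AST types and edge list are hypothetical stand-ins for what the graph database would return.

```python
from collections import Counter

# Hypothetical nodes: id -> (graph label, AST type).
nodes = {
    1: ("UpstreamNode", "Decl"),
    2: ("DownstreamNode", "Call"),
    3: ("DownstreamNode", "Assign"),
    4: ("UpstreamNode", "Cond"),
    5: ("UpstreamNode", "Decl"),
}

# Hypothetical edges: (start id, relationship, end id).
edges = [
    (1, "FLOWS_TO", 2),
    (5, "FLOWS_TO", 2),
    (4, "CONTROLS", 3),
    (2, "REACHES", 3),  # ignored below: the start node is not an UpstreamNode
]

single_hop_counts = Counter()
for start, rel, end in edges:
    start_label, start_type = nodes[start]
    end_label, end_type = nodes[end]
    # Keep only direct UpstreamNode -> DownstreamNode links.
    if start_label == "UpstreamNode" and end_label == "DownstreamNode":
        single_hop_counts["%s-%s-%s" % (start_type, rel, end_type)] += 1
```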

In [100]:
for dataset_path in cwe121_v__2_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(SingleHopRawExtractor, {"need_map_features": True})
    dataset.process()

[2020-06-04 13:44:18][INFO] Processing ../data/cwe121_v112...
[2020-06-04 13:44:18][INFO] Dataset index build in 22ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-04 13:44:19][INFO] Running operation 1/1 (FeatureExtractor)...
[2020-06-04 13:44:19][INFO] Running operation 1/1 (RightFixer)...
[2020-06-04 13:44:20][INFO] 1 operations run in 1097ms.
[2020-06-04 13:44:44][INFO] Retrieved 416 entrypoints. Querying for flowgraphs...
[2020-06-04 13:44:46][INFO] Processed 10% of the dataset.
[2020-06-04 13:44:47][INFO] Processed 20% of the dataset.
[2020-06-04 13:44:48][INFO] Processed 30% of the dataset.
[2020-06-04 13:44:49][INFO] Processed 40% of the dataset.
[2020-06-04 13:44:50][INFO] Processed 50% of the dataset.
[2020-06-04 13:44:51][INFO] Processed 60% of the dataset.
[2020-06-04 13:44:52][INFO] Processed 70% of the dataset.
[2020-06-04 13:44:53][INFO] Processed 80% of the dataset.
[2020-06-04 13:44:55][INFO] Processed 90% of the dataset.
[2020-06-04 13:44:56][INFO] Processed 1

In [101]:
for dataset_path in cwe121_v__2_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(SingleHopRawExtractor)
    dataset.queue_operation(PCA, {"final_dimension": 50})
    dataset.process()

[2020-06-04 13:46:55][INFO] Processing ../data/cwe121_v112...
[2020-06-04 13:46:55][INFO] Dataset index build in 20ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-04 13:46:55][INFO] Running operation 1/2 (FeatureExtractor)...
[2020-06-04 13:46:55][INFO] Running operation 1/1 (RightFixer)...
[2020-06-04 13:46:56][INFO] 1 operations run in 1012ms.
[2020-06-04 13:47:24][INFO] Retrieved 416 entrypoints and 842 labels. Querying for flowgraphs...
[2020-06-04 13:47:25][INFO] Processed 10% of the dataset.
[2020-06-04 13:47:27][INFO] Processed 20% of the dataset.
[2020-06-04 13:47:28][INFO] Processed 30% of the dataset.
[2020-06-04 13:47:29][INFO] Processed 40% of the dataset.
[2020-06-04 13:47:30][INFO] Processed 50% of the dataset.
[2020-06-04 13:47:31][INFO] Processed 60% of the dataset.
[2020-06-04 13:47:32][INFO] Processed 70% of the dataset.
[2020-06-04 13:47:32][INFO] Processed 80% of the dataset.
[2020-06-04 13:47:33][INFO] Processed 90% of the dataset.
[2020-06-04 13:47:34][IN

## 04.d. AnyHop SingleFlow

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) that is *n* hops away from a node labelled `DownstreamNode` (named **root2**) and is part of the function graph. Node **root1** is designated as the source, **root2** is the sink and the flow is the relationship between the two nodes. The flow has the format **Ri**, where **Ri** is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then counted and added to the feature map. This extractor normalizes the columns (`feat_col = feat_count / test_case_count`) for every entrypoint.
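The normalization formula can be illustrated with a small count matrix. This sketch assumes that `test_case_count` is the number of rows (one row per test case) and that every column is divided by it; the counts are made up and the extractor's actual implementation may differ.

```python
# Hypothetical raw counts: one row per test case, one column per feature.
counts = [
    [2, 0, 4],
    [0, 1, 4],
    [2, 1, 0],
    [0, 2, 0],
]

# Assumption: the divisor is the number of test cases in the dataset.
test_case_count = len(counts)

# Apply feat_col = feat_count / test_case_count to every cell of every column.
normalized = [[value / test_case_count for value in row] for row in counts]
```

With four test cases, a raw count of 2 becomes 0.5 and a raw count of 4 becomes 1.0.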

In [102]:
for dataset_path in cwe121_v__3_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AnyHopSingleFlowExtractor, {"need_map_features": True})
    dataset.process()

[2020-06-04 13:49:25][INFO] Processing ../data/cwe121_v113...
[2020-06-04 13:49:25][INFO] Dataset index build in 9ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-04 13:49:25][INFO] Running operation 1/1 (FeatureExtractor)...
[2020-06-04 13:49:25][INFO] Running operation 1/1 (RightFixer)...
[2020-06-04 13:49:26][INFO] 1 operations run in 1049ms.
[2020-06-04 13:49:50][INFO] Retrieved 416 entrypoints. Querying for flowgraphs...
[2020-06-04 13:49:55][INFO] Processed 10% of the dataset.
[2020-06-04 13:49:58][INFO] Processed 20% of the dataset.
[2020-06-04 13:50:00][INFO] Processed 30% of the dataset.
[2020-06-04 13:50:03][INFO] Processed 40% of the dataset.
[2020-06-04 13:50:05][INFO] Processed 50% of the dataset.
[2020-06-04 13:50:07][INFO] Processed 60% of the dataset.
[2020-06-04 13:50:10][INFO] Processed 70% of the dataset.
[2020-06-04 13:50:11][INFO] Processed 80% of the dataset.
[2020-06-04 13:50:13][INFO] Processed 90% of the dataset.
[2020-06-04 13:50:15][INFO] Processed 10

In [103]:
for dataset_path in cwe121_v__3_dataset_path:
    LOGGER.info("Processing %s..." % dataset_path)
    dataset = Dataset(dataset_path)
    dataset.queue_operation(AnyHopSingleFlowExtractor)
    dataset.queue_operation(PCA, {"final_dimension": 50})
    dataset.process()

[2020-06-04 13:52:49][INFO] Processing ../data/cwe121_v113...
[2020-06-04 13:52:49][INFO] Dataset index build in 27ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-04 13:52:49][INFO] Running operation 1/2 (FeatureExtractor)...
[2020-06-04 13:52:49][INFO] Running operation 1/1 (RightFixer)...
[2020-06-04 13:52:50][INFO] 1 operations run in 968ms.
[2020-06-04 13:53:14][INFO] Retrieved 416 entrypoints and 2610 labels. Querying for flowgraphs...
[2020-06-04 13:53:19][INFO] Processed 10% of the dataset.
[2020-06-04 13:53:22][INFO] Processed 20% of the dataset.
[2020-06-04 13:53:24][INFO] Processed 30% of the dataset.
[2020-06-04 13:53:27][INFO] Processed 40% of the dataset.
[2020-06-04 13:53:30][INFO] Processed 50% of the dataset.
[2020-06-04 13:53:33][INFO] Processed 60% of the dataset.
[2020-06-04 13:53:35][INFO] Processed 70% of the dataset.
[2020-06-04 13:53:38][INFO] Processed 80% of the dataset.
[2020-06-04 13:53:40][INFO] Processed 90% of the dataset.
[2020-06-04 13:53:42][IN

## 04.e. DoubleHop v01

In [None]:
# Coming soon™...
# for dataset_path in cwe121_v__4_dataset_path:
#     ...

## Conclusion

In this notebook, several features were extracted from the previously annotated datasets. The [next notebook](./05_models_training.ipynb) trains the models and retrieves the results.