# IEEE STC 2022 - A Modular and Expandable Testbed for Evaluating ML-Based Bug Finders - Software demo

This notebook will pre-process a C/C++ dataset specifically designed for bugfinding classification to ensure correct formatting before the Joern parsing.

Download the dataset using the script at `../scripts/setup_ai_dataset.sh`. A new folder **data/ai-dataset_orig** should appear, containing the classified dataset with *bad* (buggy) and *good* (fixed) classes.

## 01. Pre-processing

### 01.a. Imports and logging configuration

The first step is to perform the necessary imports and configure the program. If the dataset needs to be downloaded, this can be done in the last cell of this section.

In [None]:
# Enable these lines if live changes are made to the codebase
# %load_ext autoreload
# %autoreload 2

In [None]:
# Disable tensorflow logging
import os
import logging
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}
logging.getLogger('tensorflow').setLevel(logging.FATAL)

In [None]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [None]:
import logging
from os.path import join
from bugfinder.settings import LOGGER
from bugfinder.base.dataset import CodeWeaknessClassificationDataset as Dataset
from bugfinder.processing.dataset.copy import CopyDataset
from bugfinder.processing.dataset.fix_rights import RightFixer
from bugfinder.processing.dataset.extract import ExtractSampleDataset
from bugfinder.processing.cleaning.remove_main_function import RemoveMainFunction
from bugfinder.processing.cleaning.replace_litterals import ReplaceLitterals
from bugfinder.processing.cleaning.remove_cpp_files import RemoveCppFiles
from bugfinder.processing.cleaning.remove_interproc_files import RemoveInterprocFiles
from bugfinder.processing.joern.v040 import JoernProcessing as Joern040DatasetProcessing
from bugfinder.processing.neo4j.importer import Neo4J3Importer
from bugfinder.processing.neo4j.annot import Neo4JAnnotations
from bugfinder.processing.ast.v02 import Neo4JASTMarkup as Neo4JASTMarkupV02
from bugfinder.features.extraction.bag_of_words.hops_n_flows import FeatureExtractor as HopsNFlowsExtractor
from bugfinder.features.reduction.pca import FeatureSelector as PCA
from bugfinder.models.dnn_classifier import DNNClassifierTraining
from bugfinder.models.linear_classifier import LinearClassifierTraining

In [None]:
# Setup logging to only output INFO level messages
LOGGER.setLevel(logging.INFO)
LOGGER.propagate = False

In [None]:
# Dataset directories (DO NOT EDIT)
classified_dataset_path = "../data/ai-dataset_orig"
cleaned_dataset_path = "../data/dataset_01_clean"
sample_dataset_path = "../data/dataset_01_extract"

#### Optional Step: Download the dataset

Use the following cell to download the dataset. The cell needs to be run only if the dataset is not present or has been tampered with.

In [None]:
# Download the dataset and classify the samples between good and bad classes.
import subprocess
from os import listdir
from os.path import isdir

force_download = False  # Change to True if the dataset has been tampered with
download_dir = join(classified_dataset_path, "bad")
need_download = (not isdir(download_dir) or len(listdir(download_dir)) != 6507)

if need_download or force_download:
    LOGGER.info("Downloading dataset...")
    # check=True raises CalledProcessError if the setup script fails
    subprocess.run("../scripts/setup_ai_dataset.sh", check=True)

LOGGER.info("Dataset is available.")

### 01.b. Cleanup

Clean up the downloaded data to ensure correct parsing in later steps. The cleaned dataset will be stored in **data/dataset_01_clean**.

In [None]:
# Create a copy of the annotated dataset to avoid overwriting
classified_dataset = Dataset(classified_dataset_path)
classified_dataset.queue_operation(CopyDataset, {"to_path": cleaned_dataset_path, "force": True})
classified_dataset.process()
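
The `queue_operation` / `process` pair used throughout this notebook follows a queue-then-run pattern: operations are registered first and only executed when `process()` is called. The sketch below illustrates that pattern in isolation. `ToyDataset` and `AddSuffix` are hypothetical names introduced for illustration; this is not the `bugfinder` implementation.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset with a deferred operation queue."""

    def __init__(self, items):
        self.items = list(items)
        self._queue = []  # pending (operation class, arguments) pairs

    def queue_operation(self, operation_class, arguments=None):
        # Only record the request; nothing is executed yet.
        self._queue.append((operation_class, arguments or {}))

    def process(self):
        # Run the queued operations in insertion order, emptying the queue.
        while self._queue:
            operation_class, arguments = self._queue.pop(0)
            operation_class(self).execute(**arguments)


class AddSuffix:
    """Hypothetical operation: append a suffix to every item."""

    def __init__(self, dataset):
        self.dataset = dataset

    def execute(self, suffix=".c"):
        self.dataset.items = [item + suffix for item in self.dataset.items]


toy = ToyDataset(["sample_a", "sample_b"])
toy.queue_operation(AddSuffix, {"suffix": ".c"})
queued_state = list(toy.items)     # unchanged: the operation is still pending
toy.process()
processed_state = list(toy.items)  # the suffix has now been applied
```

Deferring execution this way lets several operations be chained on one dataset and run in a single pass, which is how the cells in this notebook are organized.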

In [None]:
# Cleanup new dataset
cleaned_dataset = Dataset(cleaned_dataset_path)

cleaned_dataset.queue_operation(RemoveCppFiles)
cleaned_dataset.queue_operation(RemoveInterprocFiles)
cleaned_dataset.queue_operation(RemoveMainFunction)
cleaned_dataset.queue_operation(ReplaceLitterals)

cleaned_dataset.process()

### 01.c. Subset extraction

Extract a subset of the data for testing purposes into **data/dataset_01_extract**.

In [None]:
# Number of samples to extract (edit this number; performance will be impacted)
sample_nb = 50

cleaned_dataset = Dataset(cleaned_dataset_path)
cleaned_dataset.queue_operation(
    ExtractSampleDataset, {"to_path": sample_dataset_path, "sample_nb": sample_nb, "force": True}
)
cleaned_dataset.process()
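
One plausible way to build such a subset is to draw a fixed number of samples from each class. The sketch below shows that idea only. The `sample_per_class` helper and the file names are hypothetical, and the actual `ExtractSampleDataset` operation may select samples differently.

```python
import random


def sample_per_class(files_by_class, sample_nb, seed=0):
    """Draw up to `sample_nb` file names from each class (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = {}
    for label, files in files_by_class.items():
        count = min(sample_nb, len(files))  # never request more than available
        subset[label] = sorted(rng.sample(files, count))
    return subset


# Hypothetical file listing for the two classes.
files_by_class = {
    "bad": [f"bad_{i}.c" for i in range(10)],
    "good": [f"good_{i}.c" for i in range(3)],
}
subset = sample_per_class(files_by_class, sample_nb=5)
```

Seeding the random generator makes repeated runs return the same subset, which helps when comparing results across experiments.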

## 02. Processing

In this part, the previously created dataset will be parsed using Joern (v0.4.0 in this demo). The parsed data will then be imported into a Neo4J v3 database for further processing. Once the data is in the Neo4J database, an AST representation is extracted for use by the feature extraction algorithms.

### 02.a. Dataset preparation

A copy of the dataset is created before performing any changes.

In [None]:
joern_dataset_path = "../data/dataset_02_joern"

sample_dataset = Dataset(sample_dataset_path)
sample_dataset.queue_operation(CopyDataset, {"to_path": joern_dataset_path, "force": True})

sample_dataset.process()

### 02.b. Joern v0.4.0

In [None]:
# Build the dataset that is going to be used
joern_dataset = Dataset(joern_dataset_path)

# Apply Joern v0.4.0 conversion and import into Neo4J v3
joern_dataset.queue_operation(Joern040DatasetProcessing)
joern_dataset.queue_operation(Neo4J3Importer)
joern_dataset.queue_operation(Neo4JAnnotations)
joern_dataset.queue_operation(Neo4JASTMarkupV02)

joern_dataset.process()

## 03. Feature extraction and reduction

In this part, features will be extracted from the previously annotated dataset. These features are used to train the models in the final step.

Note that the current extractors count the links between two nodes, in the form **source**-**flow**-**sink**.

Source and sink nodes are formatted according to the AST markup computed in the previous step. For each extractor, a feature list is created with `dataset.queue_operation(ExtractorClass, {"need_map_features": True})`. This feature list has to be computed only once and determines the list of labels needed by the feature extractor.
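
The two-pass idea (first build the list of labels, then count occurrences against it) can be sketched with plain Python. The triples and node labels below are hypothetical and stand in for what would be read from one sample's graph; the real extractors may label features differently.

```python
from collections import Counter

# Hypothetical (source, flow, sink) triples for a single sample.
triples = [
    ("Identifier", "FLOWS_TO", "CallExpression"),
    ("Identifier", "FLOWS_TO", "CallExpression"),
    ("Condition", "CONTROLS", "ReturnStatement"),
]

# Pass 1: the feature map lists every distinct triple as one column label.
feature_map = sorted({"-".join(triple) for triple in triples})

# Pass 2: the feature vector holds the count of each label, in map order.
counts = Counter("-".join(triple) for triple in triples)
feature_vector = [counts[label] for label in feature_map]
```

Fixing the label order once means every sample produces a vector with the same columns, which is why the map only needs to be computed a single time.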

### 03.a. Dataset preparation

In [None]:
fe_dataset_path = "../data/dataset_03_feat_extraction"

joern_dataset = Dataset(joern_dataset_path)
joern_dataset.queue_operation(CopyDataset, {"to_path": fe_dataset_path, "force": True})

joern_dataset.process()

### 03.b. Feature extraction

This extractor retrieves any node labelled `UpstreamNode` (named **root1**) `n` hops away from any node labelled `DownstreamNode` (named **root2**), and part of the function graph. The number of hops `n` is an integer within the range [`min_hops`, `max_hops`], where `min_hops > 0` and `max_hops > min_hops`. If `max_hops` is -1, the extractor retrieves all possible relationships.
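
The hop bounds can be illustrated on a toy graph. The sketch below enumerates relationship chains between a start node and a set of target nodes, keeping only chains whose length falls within [`min_hops`, `max_hops`] and treating -1 as "no upper bound". The `paths_within_hops` helper and the graph are hypothetical; the actual extractor queries the Neo4J database instead.

```python
def paths_within_hops(edges, start, targets, min_hops=1, max_hops=-1):
    """List relationship chains from `start` to any node in `targets`.

    `edges` maps a node to a list of (relationship, neighbour) pairs.
    A `max_hops` of -1 means no upper bound (simple paths only).
    """
    found = []

    def walk(node, chain, visited):
        hops = len(chain)
        if node in targets and hops >= min_hops:
            found.append(tuple(chain))
        if max_hops != -1 and hops >= max_hops:
            return  # hop budget exhausted
        for relationship, neighbour in edges.get(node, []):
            if neighbour not in visited:  # never revisit a node, so the walk terminates
                walk(neighbour, chain + [relationship], visited | {neighbour})

    walk(start, [], {start})
    return found


# Hypothetical toy graph: a -> b -> c, plus a direct a -> c edge.
edges = {
    "a": [("FLOWS_TO", "b"), ("REACHES", "c")],
    "b": [("CONTROLS", "c")],
}
all_chains = paths_within_hops(edges, "a", {"c"}, min_hops=1, max_hops=-1)
one_hop = paths_within_hops(edges, "a", {"c"}, min_hops=1, max_hops=1)
```

With no upper bound both chains are returned, while `max_hops=1` keeps only the direct edge.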

Node **root1** is designated as the source, **root2** as the sink, and the flow is the chain of relationships between them. Each relationship **Ri** in the chain is a `FLOWS_TO`, `REACHES` or `CONTROLS` relationship. Each extracted feature is then counted and added to the feature map. This extractor normalizes the columns (`feat_col = feat_count / entrypoint_count`) for every entrypoint.
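
The normalization formula can be read as dividing each raw count by the number of entrypoints in the sample. The sketch below follows that reading, which is an assumption. The `normalize_counts` helper and the feature names are hypothetical.

```python
def normalize_counts(feature_counts, entrypoint_count):
    """Divide every feature count by the number of entrypoints in the sample."""
    if entrypoint_count <= 0:
        raise ValueError("entrypoint_count must be positive")
    return {
        name: count / entrypoint_count
        for name, count in feature_counts.items()
    }


# Hypothetical raw counts for one sample with three entrypoints.
feature_counts = {
    "Identifier-FLOWS_TO-CallExpression": 6,
    "Condition-CONTROLS-ReturnStatement": 3,
}
normalized = normalize_counts(feature_counts, entrypoint_count=3)
```

Scaling by the entrypoint count keeps samples with many functions from dominating samples with few.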

In [None]:
fe_dataset = Dataset(fe_dataset_path)
fe_dataset.queue_operation(HopsNFlowsExtractor, {"min_hops": 1, "max_hops": -1, "need_map_features": True})
fe_dataset.queue_operation(HopsNFlowsExtractor, {"min_hops": 1, "max_hops": -1})
fe_dataset.process()

### 03.c. Dimension reduction (using PCA)

In this step, the number of features previously obtained will be reduced to help the model converge quickly. Several methods exist, either selecting the most important features or creating new ones. Here, Principal Component Analysis (PCA) will be used to create a limited number of independent features.

In [None]:
fe_dataset.queue_operation(PCA, {"dimension": 90, "dry_run": True})
fe_dataset.process()
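
To give an intuition of what PCA computes, the sketch below estimates only the first principal component of a tiny matrix using power iteration on the covariance matrix, in pure Python. It is an illustration under simplifying assumptions (one component, hypothetical data, a starting vector that is not orthogonal to the leading direction) and not the routine used by the `PCA` operation above; in practice a library implementation would be preferred.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the leading principal direction with power iteration."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]

    # Sample covariance matrix of the centred data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]

    vector = [1.0] * dim  # arbitrary non-zero starting direction
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]

    # Project each centred row onto the component: one new feature per sample.
    scores = [sum(r[j] * vector[j] for j in range(dim)) for r in centered]
    return vector, scores


# Hypothetical feature matrix whose second column is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = first_principal_component(rows)
```

Because the two columns are perfectly correlated, a single component captures all of the variance, so the two original features collapse into one.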

## 04. Model training

In the previous steps, the dataset was curated and several features were extracted to train machine learning models. The models will now be initialized and trained with the curated dataset. The dataset is split 80/20 for training and testing, respectively.
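
An 80/20 split can be sketched as a seeded shuffle followed by a cut. The `split_train_test` helper is hypothetical and only illustrates the ratio; the training operations below may split the data differently (for example with stratification).

```python
import random


def split_train_test(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of `samples` and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(50))  # hypothetical sample identifiers
train_part, test_part = split_train_test(samples)
```

With 50 samples, this yields 40 training items and 10 test items, and no sample appears in both parts.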

### 04.a. Dataset preparation

In [None]:
# Dataset directories
lin_dataset_path = "../data/dataset_04a_lin_cls"
dnn_dataset_path = "../data/dataset_04b_dnn_cls"

fe_dataset = Dataset(fe_dataset_path)
fe_dataset.queue_operation(CopyDataset, {"to_path": lin_dataset_path, "force": True})
fe_dataset.queue_operation(CopyDataset, {"to_path": dnn_dataset_path, "force": True})

fe_dataset.process()

### 04.b. Linear classifier

In [None]:
lin_dataset = Dataset(lin_dataset_path)
lin_dataset.queue_operation(
    LinearClassifierTraining, {"name": "lin-cls", "max_items": 1000, "epochs": 5, "reset": True}
)
lin_dataset.process()

### 04.c. Multilayer Perceptron

In [None]:
dnn_dataset = Dataset(dnn_dataset_path)
dnn_dataset.queue_operation(DNNClassifierTraining, {"name": "dnn-default", "epochs": 10, "reset": True})
dnn_dataset.process()

## 05. Conclusion

For more information, please refer to the documentation, available at https://pages.nist.gov/ai-bugfinder-testbed/readme.html