# 03. Neo4J Processing

In this notebook, the [previously created](./02_joern_processing.ipynb) datasets are marked up in Neo4J. This markup prepares the data for the feature extraction performed in the next step.

## 03.a. Imports, logging configuration and dataset preparation

The first step is to perform the necessary imports and configure the program. Additionally, each previously created dataset is copied into two new datasets, one for each version of the AST markup.

In [1]:
# Enable these lines if live changes are made to the codebase
# %load_ext autoreload
# %autoreload 2

In [2]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [3]:
import logging
from bugfinder.settings import LOGGER
from bugfinder.dataset import CWEClassificationDataset as Dataset
from bugfinder.dataset.processing.dataset_ops import CopyDataset, RightFixer
from bugfinder.ast.v01 import Neo4JASTMarkup as Neo4JASTMarkupV01
from bugfinder.ast.v02 import Neo4JASTMarkup as Neo4JASTMarkupV02

In [4]:
# Set up logging to output messages at INFO level and above
LOGGER.setLevel(logging.INFO)
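
The effect of the level setting can be illustrated with a minimal sketch built on the standard `logging` module. The logger name `demo_notebook_logger` and the format string are assumptions chosen to resemble the output shown in this notebook; they are not taken from the `bugfinder` configuration.

```python
import io
import logging

# Hypothetical stand-in logger; the notebook itself uses bugfinder.settings.LOGGER.
demo_logger = logging.getLogger("demo_notebook_logger")
demo_logger.propagate = False  # keep the demo isolated from the root logger

# Capture the output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Assumed format, picked to look like "[2020-06-03 15:52:17][INFO] message".
handler.setFormatter(
    logging.Formatter(
        "[%(asctime)s][%(levelname)s] %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
)
demo_logger.addHandler(handler)

# With the level set to INFO, DEBUG messages are dropped,
# while INFO, WARNING and ERROR messages are emitted.
demo_logger.setLevel(logging.INFO)
demo_logger.debug("hidden")
demo_logger.info("shown")

output = stream.getvalue()
print(output, end="")
```

Lowering the level to `logging.DEBUG` would make the first message appear as well, which can help when a processing step fails.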

In [5]:
# Dataset directories (DO NOT EDIT)
cwe121_v100_dataset_path = "../data/cwe121_v100"
cwe121_v110_dataset_path = "../data/cwe121_v110"
cwe121_v120_dataset_path = "../data/cwe121_v120"
cwe121_v200_dataset_path = "../data/cwe121_v200"
cwe121_v210_dataset_path = "../data/cwe121_v210"
cwe121_v220_dataset_path = "../data/cwe121_v220"
cwe121_v300_dataset_path = "../data/cwe121_v300"
cwe121_v310_dataset_path = "../data/cwe121_v310"
cwe121_v320_dataset_path = "../data/cwe121_v320"

# Number of samples to test (edit this number; processing time will be impacted; max. 6288)
sample_nb = 200

In [6]:
# Copy each existing dataset into 2 sub-datasets for future use.
cwe121_v100_dataset = Dataset(cwe121_v100_dataset_path)
cwe121_v100_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v110_dataset_path, "force": True})
cwe121_v100_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v120_dataset_path, "force": True})

cwe121_v200_dataset = Dataset(cwe121_v200_dataset_path)
cwe121_v200_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v210_dataset_path, "force": True})
cwe121_v200_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v220_dataset_path, "force": True})

cwe121_v300_dataset = Dataset(cwe121_v300_dataset_path)
cwe121_v300_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v310_dataset_path, "force": True})
cwe121_v300_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v320_dataset_path, "force": True})

cwe121_v100_dataset.process()
cwe121_v200_dataset.process()
cwe121_v300_dataset.process()

[2020-06-03 15:52:17][INFO] Dataset index build in 10ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:17][INFO] Dataset index build in 10ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:17][INFO] Dataset index build in 21ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:17][INFO] Running operation 1/2 (CopyDataset)...
[2020-06-03 15:52:18][INFO] Dataset index build in 9ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:18][INFO] Running operation 2/2 (CopyDataset)...
[2020-06-03 15:52:18][INFO] Dataset index build in 19ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:18][INFO] 2 operations run in 380ms.
[2020-06-03 15:52:18][INFO] Running operation 1/2 (CopyDataset)...
[2020-06-03 15:52:18][INFO] Dataset index build in 6ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:18][INFO] Running operation 2/2 (CopyDataset)...
[2020-06-03 15:52:18][INFO] Dataset index build in 11ms. 200 test_cases, 

<DatasetQueueRetCode.OK: 0>
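
The copy step above can be pictured with a small standard-library sketch. The helper `copy_dataset` and the behaviour given to its `force` flag (remove an existing destination before copying) are assumptions made for illustration; the actual `CopyDataset` operation may behave differently.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dataset(src: Path, dst: Path, force: bool = False) -> None:
    """Copy the directory tree at src to dst (hypothetical helper).

    Assumption: force=True deletes an existing destination first.
    """
    if dst.exists():
        if not force:
            raise FileExistsError(f"{dst} already exists")
        shutil.rmtree(dst)
    shutil.copytree(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Build a tiny fake dataset: one class folder containing one sample.
    base = root / "cwe121_v100"
    (base / "class_a").mkdir(parents=True)
    (base / "class_a" / "sample.c").write_text("int main(void) { return 0; }\n")

    # One copy per AST markup version, as in the cell above.
    for suffix in ("v110", "v120"):
        copy_dataset(base, root / f"cwe121_{suffix}", force=True)

    copied = sorted(p.name for p in root.iterdir())
    print(copied)
```

Keeping the base dataset untouched and working on copies means each markup version can be rerun independently.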

## 03.b. AST Markup v01


### Joern v0.3.1

In [7]:
cwe121_v110_dataset = Dataset(cwe121_v110_dataset_path)
cwe121_v110_dataset.queue_operation(Neo4JASTMarkupV01)
cwe121_v110_dataset.process()

[2020-06-03 15:52:49][INFO] Dataset index build in 10ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:52:49][INFO] Running operation 1/1 (Neo4JASTMarkup)...
[2020-06-03 15:52:50][INFO] Running operation 1/1 (RightFixer)...
[2020-06-03 15:52:51][INFO] 1 operations run in 1362ms.
[2020-06-03 15:53:07][INFO] Retrieving AST...
[2020-06-03 15:53:23][INFO] 16620 item found. Creating command bundles...
[2020-06-03 15:53:23][INFO] 9 bundles prepared.
[2020-06-03 15:53:41][INFO] AST command successfully run.
[2020-06-03 15:53:46][INFO] AST command successfully run.
[2020-06-03 15:53:51][INFO] AST command successfully run.
[2020-06-03 15:53:55][INFO] AST command successfully run.
[2020-06-03 15:53:58][INFO] AST command successfully run.
[2020-06-03 15:54:02][INFO] AST command successfully run.
[2020-06-03 15:54:06][INFO] AST command successfully run.
[2020-06-03 15:54:10][INFO] AST command successfully run.
[2020-06-03 15:54:13][INFO] AST command successfully run.
[2020-06-03 15:54

<DatasetQueueRetCode.OK: 0>
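
The log above reports 16620 items grouped into 9 bundles. The following sketch shows one way such batching can work. The function `make_bundles` and the bundle size of 2000 are assumptions: 2000 is merely one value consistent with the logged counts, and the real value used by the markup operation is not shown in this notebook.

```python
import math
from typing import Iterator, List, Sequence


def make_bundles(items: Sequence[str], bundle_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each at most bundle_size long."""
    if bundle_size <= 0:
        raise ValueError("bundle_size must be positive")
    for start in range(0, len(items), bundle_size):
        yield list(items[start:start + bundle_size])


# Assumed value; other sizes would also produce 9 bundles for 16620 items.
ASSUMED_BUNDLE_SIZE = 2000

# Placeholder commands standing in for the per-item update statements.
commands = [f"command_{i}" for i in range(16620)]
bundles = list(make_bundles(commands, ASSUMED_BUNDLE_SIZE))

print(len(bundles))       # number of bundles
print(len(bundles[-1]))   # size of the last, partially filled bundle

# The number of bundles is the item count divided by the size, rounded up.
assert len(bundles) == math.ceil(len(commands) / ASSUMED_BUNDLE_SIZE)
```

Sending commands in bundles rather than one at a time typically reduces the number of round trips to the database, at the cost of larger individual requests.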

### Joern v0.4.0

In [8]:
cwe121_v210_dataset = Dataset(cwe121_v210_dataset_path)
cwe121_v210_dataset.queue_operation(Neo4JASTMarkupV01)
cwe121_v210_dataset.process()

[2020-06-03 15:54:16][INFO] Dataset index build in 175ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:54:16][INFO] Running operation 1/1 (Neo4JASTMarkup)...
[2020-06-03 15:54:18][INFO] Running operation 1/1 (RightFixer)...
[2020-06-03 15:54:20][INFO] 1 operations run in 1900ms.
[2020-06-03 15:54:35][INFO] Retrieving AST...
[2020-06-03 15:54:47][INFO] 16630 item found. Creating command bundles...
[2020-06-03 15:54:47][INFO] 9 bundles prepared.
[2020-06-03 15:54:58][INFO] AST command successfully run.
[2020-06-03 15:55:03][INFO] AST command successfully run.
[2020-06-03 15:55:06][INFO] AST command successfully run.
[2020-06-03 15:55:11][INFO] AST command successfully run.
[2020-06-03 15:55:14][INFO] AST command successfully run.
[2020-06-03 15:55:18][INFO] AST command successfully run.
[2020-06-03 15:55:21][INFO] AST command successfully run.
[2020-06-03 15:55:24][INFO] AST command successfully run.
[2020-06-03 15:55:26][INFO] AST command successfully run.
[2020-06-03 15:5

<DatasetQueueRetCode.OK: 0>
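
Each call to `process()` in this notebook displays `<DatasetQueueRetCode.OK: 0>`, which has the shape of a Python `enum` member representation. When the cells are run from a script rather than interactively, the return value can be checked explicitly. The sketch below uses a hypothetical stand-in enum, `QueueRetCode`, and a hypothetical `run_queue` function; only the `OK = 0` member is taken from the output above.

```python
from enum import Enum


class QueueRetCode(Enum):
    """Hypothetical stand-in for the return code shown in the notebook."""

    OK = 0
    FAILED = 1  # assumed member, for illustration only


def run_queue() -> QueueRetCode:
    """Hypothetical stand-in for dataset.process()."""
    return QueueRetCode.OK


ret = run_queue()

# Fail loudly instead of continuing with a partially processed dataset.
if ret is not QueueRetCode.OK:
    raise RuntimeError(f"Processing failed with {ret!r}")

print(repr(ret))
```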

### Joern v1.0.62

In [9]:
# Coming soon™...
# cwe121_v310_dataset = Dataset(cwe121_v310_dataset_path)
# ...

## 03.c. AST Markup v02

### Joern v0.3.1

In [10]:
cwe121_v120_dataset = Dataset(cwe121_v120_dataset_path)
cwe121_v120_dataset.queue_operation(Neo4JASTMarkupV02)
cwe121_v120_dataset.process()

[2020-06-03 15:55:28][INFO] Dataset index build in 156ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:55:28][INFO] Running operation 1/1 (Neo4JASTMarkup)...
[2020-06-03 15:55:30][INFO] Running operation 1/1 (RightFixer)...
[2020-06-03 15:55:31][INFO] 1 operations run in 1810ms.
[2020-06-03 15:55:47][INFO] Retrieving AST...
[2020-06-03 15:56:55][INFO] 4314 item found. Creating command bundles...
[2020-06-03 15:56:55][INFO] 3 bundles prepared.
[2020-06-03 15:57:03][INFO] AST command successfully run.
[2020-06-03 15:57:06][INFO] AST command successfully run.
[2020-06-03 15:57:07][INFO] AST command successfully run.
[2020-06-03 15:57:07][INFO] AST updated
[2020-06-03 15:57:08][INFO] 1 operations run in 99639ms.


<DatasetQueueRetCode.OK: 0>

### Joern v0.4.0

In [11]:
cwe121_v220_dataset = Dataset(cwe121_v220_dataset_path)
cwe121_v220_dataset.queue_operation(Neo4JASTMarkupV02)
cwe121_v220_dataset.process()

[2020-06-03 15:57:08][INFO] Dataset index build in 114ms. 200 test_cases, 2 classes, 0 features (v0).
[2020-06-03 15:57:08][INFO] Running operation 1/1 (Neo4JASTMarkup)...
[2020-06-03 15:57:09][INFO] Running operation 1/1 (RightFixer)...
[2020-06-03 15:57:11][INFO] 1 operations run in 2366ms.
[2020-06-03 15:57:26][INFO] Retrieving AST...
[2020-06-03 15:58:31][INFO] 4264 item found. Creating command bundles...
[2020-06-03 15:58:31][INFO] 3 bundles prepared.
[2020-06-03 15:58:40][INFO] AST command successfully run.
[2020-06-03 15:58:44][INFO] AST command successfully run.
[2020-06-03 15:58:44][INFO] AST command successfully run.
[2020-06-03 15:58:44][INFO] AST updated
[2020-06-03 15:58:46][INFO] 1 operations run in 98454ms.


<DatasetQueueRetCode.OK: 0>
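
To compare runs across markup versions, the item count, bundle count and duration can be pulled out of the log text. The sketch below parses a few lines copied from the output above; the helper `summarize` and its regular expressions are illustrative and rely only on the message wording visible in this notebook.

```python
import re
from typing import Dict, Iterable

# Lines copied from the log output of the cell above.
LOG_LINES = [
    "[2020-06-03 15:57:26][INFO] Retrieving AST...",
    "[2020-06-03 15:58:31][INFO] 4264 item found. Creating command bundles...",
    "[2020-06-03 15:58:31][INFO] 3 bundles prepared.",
    "[2020-06-03 15:58:46][INFO] 1 operations run in 98454ms.",
]

# One pattern per value of interest; each captures a single integer.
PATTERNS = {
    "items": re.compile(r"(\d+) item found"),
    "bundles": re.compile(r"(\d+) bundles prepared"),
    "duration_ms": re.compile(r"run in (\d+)ms"),
}


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Collect the last value seen for each pattern (hypothetical helper)."""
    summary: Dict[str, int] = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[key] = int(match.group(1))
    return summary


summary = summarize(LOG_LINES)
print(summary)
```

Applied to the outputs of each cell, this gives a compact view of how the v01 and v02 markups differ in item count and running time.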

### Joern v1.0.62

In [None]:
# Coming soon™...
# cwe121_v320_dataset = Dataset(cwe121_v320_dataset_path)
# ...

## Conclusion

In this notebook, the previously parsed datasets were annotated to ease feature extraction. The [next notebook](./04_feature_extraction.ipynb) performs the feature extraction needed to train the various models.