# 03. Neo4j Processing

In this notebook, the [previously created](./02_joern_processing.ipynb) datasets are annotated in Neo4j. This markup makes feature extraction possible in the next step.

## 03.a. Imports, logging configuration and dataset preparation

The first step is to perform the necessary imports and configure logging. Additionally, each previously created dataset is copied into two new datasets, one for each AST markup version.

In [1]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [2]:
import logging
from tools.settings import LOGGER
from tools.dataset import CWEClassificationDataset as Dataset
from tools.dataset.processing.dataset_ops import CopyDataset, RightFixer
from tools.ast.v01 import Neo4JASTMarkup as Neo4JASTMarkupV01
from tools.ast.v02 import Neo4JASTMarkup as Neo4JASTMarkupV02

In [3]:
# Set up logging to output messages at INFO level and above
LOGGER.setLevel(logging.INFO)

In [4]:
# Dataset directories (DO NOT EDIT)
cwe121_v100_dataset_path = "../data/cwe121_v100"
cwe121_v110_dataset_path = "../data/cwe121_v110"
cwe121_v120_dataset_path = "../data/cwe121_v120"
cwe121_v200_dataset_path = "../data/cwe121_v200"
cwe121_v210_dataset_path = "../data/cwe121_v210"
cwe121_v220_dataset_path = "../data/cwe121_v220"
cwe121_v300_dataset_path = "../data/cwe121_v300"
cwe121_v310_dataset_path = "../data/cwe121_v310"
cwe121_v320_dataset_path = "../data/cwe121_v320"

# Number of samples to test (edit this number; performance will be impacted, max. 6288)
sample_nb = 200
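
The dataset paths above appear to follow a pattern: the first digit of the version suffix tracks the Joern release (1 for v0.3.1, 2 for v0.4.0, 3 for v1.0.62) and the second digit tracks the AST markup version (0 for the unmarked source dataset). The sketch below spells out that convention as inferred from how the paths are used in this notebook. `dataset_path` is a hypothetical helper and is not part of the `tools` package.

```python
# Hypothetical helper: illustrates the naming convention inferred from this notebook.
def dataset_path(joern_idx: int, markup_idx: int, base: str = "../data") -> str:
    """Build the path of a CWE-121 dataset.

    joern_idx:  1 -> Joern v0.3.1, 2 -> v0.4.0, 3 -> v1.0.62 (as used in this notebook)
    markup_idx: 0 -> source dataset, 1 -> AST markup v01, 2 -> AST markup v02
    """
    return f"{base}/cwe121_v{joern_idx}{markup_idx}0"


# All nine paths, keyed by (joern_idx, markup_idx).
paths = {(j, m): dataset_path(j, m) for j in (1, 2, 3) for m in (0, 1, 2)}
```

Generating the paths this way would avoid nine hand-written assignments, at the cost of making each path less greppable.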

In [5]:
# Copy each existing dataset into two sub-datasets, one per AST markup version.
cwe121_v100_dataset = Dataset(cwe121_v100_dataset_path)
cwe121_v100_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v110_dataset_path, "force": True})
cwe121_v100_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v120_dataset_path, "force": True})

cwe121_v200_dataset = Dataset(cwe121_v200_dataset_path)
cwe121_v200_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v210_dataset_path, "force": True})
cwe121_v200_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v220_dataset_path, "force": True})

cwe121_v300_dataset = Dataset(cwe121_v300_dataset_path)
cwe121_v300_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v310_dataset_path, "force": True})
cwe121_v300_dataset.queue_operation(CopyDataset, {"to_path": cwe121_v320_dataset_path, "force": True})

cwe121_v100_dataset.process()
cwe121_v200_dataset.process()
cwe121_v300_dataset.process()

[2019-12-06 14:43:36][INFO] Dataset index build in 10ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:43:36][INFO] Dataset index build in 10ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:43:36][INFO] Dataset index build in 18ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:43:36][INFO] Running operation 1/2 (CopyDataset)...
[2019-12-06 14:43:36][INFO] Dataset index build in 6ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:43:36][INFO] Running operation 1/1 (RightFixer)...
[2019-12-06 14:43:37][INFO] 1 operations run in 1026ms.
[2019-12-06 14:43:38][INFO] Running operation 2/2 (CopyDataset)...
[2019-12-06 14:43:38][INFO] Dataset index build in 7ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:43:38][INFO] Running operation 1/1 (RightFixer)...
[2019-12-06 14:43:39][INFO] 1 operations run in 897ms.
[2019-12-06 14:43:39][INFO] 2 operations run in 2298ms.
[2019-12-06 14:43:39][INFO] Running operation 1/2 (CopyDataset).
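
The cell above queues operations first and only runs them when `process()` is called, which is why the log counts them as "operation 1/2", "operation 2/2". The following is a minimal sketch of that deferred-execution pattern, assuming operations are classes instantiated with the dataset and their parameters. `MiniDataset` and `RecordOperation` are hypothetical stand-ins; the actual `tools.dataset` implementation is not shown in this notebook and may differ.

```python
import time


class RecordOperation:
    """Hypothetical operation: records its label in the dataset's history."""

    def __init__(self, dataset, label="noop"):
        self.dataset = dataset
        self.label = label

    def run(self):
        self.dataset.history.append(self.label)


class MiniDataset:
    """Hypothetical stand-in for a dataset with a deferred operation queue."""

    def __init__(self, path):
        self.path = path
        self.operations = []  # queued (operation class, params) pairs
        self.history = []     # labels of operations that actually ran

    def queue_operation(self, operation_cls, params=None):
        # Nothing runs here; the operation is only remembered.
        self.operations.append((operation_cls, params or {}))

    def process(self):
        # Run every queued operation in order, then empty the queue.
        start = time.monotonic()
        total = len(self.operations)
        for idx, (operation_cls, params) in enumerate(self.operations, start=1):
            print(f"Running operation {idx}/{total} ({operation_cls.__name__})...")
            operation_cls(self, **params).run()
        elapsed_ms = int((time.monotonic() - start) * 1000)
        print(f"{total} operations run in {elapsed_ms}ms.")
        self.operations.clear()


demo = MiniDataset("../data/example")
demo.queue_operation(RecordOperation, {"label": "copy to A"})
demo.queue_operation(RecordOperation, {"label": "copy to B"})
demo.process()
```

Deferring execution this way lets a notebook cell declare the whole pipeline up front and report progress against a known total.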

## 03.b. AST Markup v01


### Joern v0.3.1

In [6]:
cwe121_v110_dataset = Dataset(cwe121_v110_dataset_path)
cwe121_v110_dataset.queue_operation(RightFixer, {"command_args": "neo4j_v3.db 101 101"})
cwe121_v110_dataset.queue_operation(Neo4JASTMarkupV01)
cwe121_v110_dataset.process()

[2019-12-06 14:43:47][INFO] Dataset index build in 20ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:43:47][INFO] Running operation 1/2 (RightFixer)...
[2019-12-06 14:43:48][INFO] Running operation 2/2 (Neo4JASTMarkup)...
[2019-12-06 14:43:59][INFO] Retrieving AST...
[2019-12-06 14:44:03][INFO] 15148 item found. Slicing AST...
[2019-12-06 14:44:03][INFO] 8 commands prepared
[2019-12-06 14:44:03][INFO] Prepping command...
[2019-12-06 14:44:03][INFO] Updating AST...
[2019-12-06 14:44:08][INFO] Prepping command...
[2019-12-06 14:44:08][INFO] Updating AST...
[2019-12-06 14:44:11][INFO] Prepping command...
[2019-12-06 14:44:11][INFO] Updating AST...
[2019-12-06 14:44:13][INFO] Prepping command...
[2019-12-06 14:44:13][INFO] Updating AST...
[2019-12-06 14:44:15][INFO] Prepping command...
[2019-12-06 14:44:15][INFO] Updating AST...
[2019-12-06 14:44:18][INFO] Prepping command...
[2019-12-06 14:44:18][INFO] Updating AST...
[2019-12-06 14:44:20][INFO] Prepping command...
[2019-12
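
The log above reports the retrieved items being sliced into 8 commands, which suggests the updates are sent to Neo4j in batches rather than one by one. Below is a minimal sketch of such slicing. The batch size of 2000 is an assumption: it is consistent with the counts logged in this notebook (15148 items giving 8 commands), but the real value and the slicing logic in `tools.ast` are not shown here. `slice_into_batches` is a hypothetical name.

```python
from typing import Iterator, Sequence


def slice_into_batches(items: Sequence, batch_size: int = 2000) -> Iterator[Sequence]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Item count taken from the log above; the items themselves are placeholders.
items = list(range(15148))
batches = list(slice_into_batches(items))
print(f"{len(batches)} commands prepared")
```

Batching keeps each update command to a bounded size, which typically matters more for the database than the total number of round trips.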

### Joern v0.4.0

In [8]:
cwe121_v210_dataset = Dataset(cwe121_v210_dataset_path)
cwe121_v210_dataset.queue_operation(RightFixer, {"command_args": "neo4j_v3.db 101 101"})
cwe121_v210_dataset.queue_operation(Neo4JASTMarkupV01)
cwe121_v210_dataset.process()

[2019-12-06 14:44:59][INFO] Dataset index build in 23ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:44:59][INFO] Running operation 1/2 (RightFixer)...
[2019-12-06 14:45:00][INFO] Running operation 2/2 (Neo4JASTMarkup)...
[2019-12-06 14:45:11][INFO] Retrieving AST...
[2019-12-06 14:45:16][INFO] 15164 item found. Slicing AST...
[2019-12-06 14:45:16][INFO] 8 commands prepared
[2019-12-06 14:45:16][INFO] Prepping command...
[2019-12-06 14:45:16][INFO] Updating AST...
[2019-12-06 14:45:23][INFO] Prepping command...
[2019-12-06 14:45:23][INFO] Updating AST...
[2019-12-06 14:45:25][INFO] Prepping command...
[2019-12-06 14:45:25][INFO] Updating AST...
[2019-12-06 14:45:28][INFO] Prepping command...
[2019-12-06 14:45:28][INFO] Updating AST...
[2019-12-06 14:45:30][INFO] Prepping command...
[2019-12-06 14:45:31][INFO] Updating AST...
[2019-12-06 14:45:33][INFO] Prepping command...
[2019-12-06 14:45:33][INFO] Updating AST...
[2019-12-06 14:45:35][INFO] Prepping command...
[2019-12

### Joern v1.0.62

In [None]:
# Coming soon™...
# cwe121_v310_dataset = Dataset(cwe121_v310_dataset_path)
# ...

## 03.c. AST Markup v02

### Joern v0.3.1

In [7]:
cwe121_v120_dataset = Dataset(cwe121_v120_dataset_path)
cwe121_v120_dataset.queue_operation(RightFixer, {"command_args": "neo4j_v3.db 101 101"})
cwe121_v120_dataset.queue_operation(Neo4JASTMarkupV02)
cwe121_v120_dataset.process()

[2019-12-06 14:44:31][INFO] Dataset index build in 12ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:44:31][INFO] Running operation 1/2 (RightFixer)...
[2019-12-06 14:44:32][INFO] Running operation 2/2 (Neo4JASTMarkup)...
[2019-12-06 14:44:42][INFO] Connected to Neo4j. Retrieving nodes...
[2019-12-06 14:44:44][INFO] 3918 nodes found. Processing...
[2019-12-06 14:44:44][INFO] Querying nodes...
[2019-12-06 14:44:45][INFO] Node info retrieved. Querying links...
[2019-12-06 14:44:49][INFO] Links info retrieved. Building tree...
[2019-12-06 14:44:51][INFO] Tree built. Uploading ASTs...
[2019-12-06 14:44:51][INFO] Update dict generated (3918 entries). Uploading...
[2019-12-06 14:44:51][INFO] 2 commands prepared
[2019-12-06 14:44:51][INFO] Prepping command...
[2019-12-06 14:44:51][INFO] Updating AST...
[2019-12-06 14:44:55][INFO] Prepping command...
[2019-12-06 14:44:55][INFO] Updating AST...
[2019-12-06 14:44:58][INFO] Processing completed.
[2019-12-06 14:44:59][INFO] 2 operat
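
The v02 log shows a different flow from v01: nodes and links are fetched, a tree is built in memory, and an update dict with one entry per node is uploaded. The sketch below illustrates that general shape only. `build_update_dict` is a hypothetical function, links are assumed to be `(parent_id, child_id)` pairs, and node depth is used purely as a placeholder annotation; the properties that `Neo4JASTMarkupV02` actually writes are not shown in this notebook.

```python
from collections import defaultdict


def build_update_dict(node_ids, links):
    """Map each reachable node id to a placeholder annotation (its depth in the tree).

    node_ids: iterable of node identifiers
    links:    iterable of (parent_id, child_id) pairs (assumed format)
    """
    children = defaultdict(list)
    has_parent = set()
    for parent, child in links:
        children[parent].append(child)
        has_parent.add(child)

    # Roots are the nodes that never appear as a child.
    roots = [node for node in node_ids if node not in has_parent]

    updates = {}
    stack = [(root, 0) for root in roots]
    while stack:
        node, depth = stack.pop()
        if node in updates:  # guard against malformed input (shared or cyclic links)
            continue
        updates[node] = {"depth": depth}
        stack.extend((child, depth + 1) for child in children.get(node, ()))
    return updates


updates = build_update_dict([1, 2, 3, 4], [(1, 2), (1, 3), (3, 4)])
print(f"Update dict generated ({len(updates)} entries).")
```

Computing all annotations in memory first means the database only sees a small number of bulk updates, which matches the 2 commands reported in the log.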

### Joern v0.4.0

In [9]:
cwe121_v220_dataset = Dataset(cwe121_v220_dataset_path)
cwe121_v220_dataset.queue_operation(RightFixer, {"command_args": "neo4j_v3.db 101 101"})
cwe121_v220_dataset.queue_operation(Neo4JASTMarkupV02)
cwe121_v220_dataset.process()

[2019-12-06 14:46:27][INFO] Dataset index build in 7ms. 200 test_cases, 2 classes, 0 features (v0).
[2019-12-06 14:46:27][INFO] Running operation 1/2 (RightFixer)...
[2019-12-06 14:46:28][INFO] Running operation 2/2 (Neo4JASTMarkup)...
[2019-12-06 14:46:36][INFO] Connected to Neo4j. Retrieving nodes...
[2019-12-06 14:46:38][INFO] 3838 nodes found. Processing...
[2019-12-06 14:46:38][INFO] Querying nodes...
[2019-12-06 14:46:39][INFO] Node info retrieved. Querying links...
[2019-12-06 14:46:42][INFO] Links info retrieved. Building tree...
[2019-12-06 14:46:44][INFO] Tree built. Uploading ASTs...
[2019-12-06 14:46:44][INFO] Update dict generated (3838 entries). Uploading...
[2019-12-06 14:46:44][INFO] 2 commands prepared
[2019-12-06 14:46:44][INFO] Prepping command...
[2019-12-06 14:46:44][INFO] Updating AST...
[2019-12-06 14:46:47][INFO] Prepping command...
[2019-12-06 14:46:47][INFO] Updating AST...
[2019-12-06 14:46:50][INFO] Processing completed.
[2019-12-06 14:46:51][INFO] 2 operati

### Joern v1.0.62

In [None]:
# Coming soon™...
# cwe121_v320_dataset = Dataset(cwe121_v320_dataset_path)
# ...

## Conclusion

In this notebook, the previously parsed datasets were annotated to ease feature extraction. The [next notebook](./04_feature_extraction.ipynb) performs the feature extraction needed to train the various models.