# 01. Pre-processing

This notebook pre-processes a classified C/C++ dataset extracted from [Juliet 1.3](https://samate.nist.gov/SRD/testsuite.php) so that it is correctly formatted before Joern parses it.

Download the dataset using the script at `../scripts/download_cwe121.sh`. Afterwards, the **data** directory should contain the following entries:
* **cwe121_annot**: classified dataset with bad (buggy) and good (fixed) classes.
* **cwe121_orig**: original dataset, unzipped from the archive listed below.
* **Juliet_Test_Suite_v1.3_for_C_Cpp.zip**: dataset archive downloaded from the SARD website.
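
The layout listed above can be verified before running anything else. The following is a minimal sketch, not part of the project's tooling: `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names introduced here, and only the three entry names come from the list above.

```python
from pathlib import Path

# Entries that the list above says should exist in the data directory.
EXPECTED_ENTRIES = (
    "cwe121_annot",
    "cwe121_orig",
    "Juliet_Test_Suite_v1.3_for_C_Cpp.zip",
)


def missing_entries(data_dir):
    """Return the expected entries that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("../data")` from the notebook's folder should return an empty list when the download completed.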

Use the following cell to download the dataset. Beware that <u>it will use all available CPUs</u>. The cell needs to be run only if the dataset is not present or has been tampered with.

In [18]:
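# Download CWE-121 from Juliet 1.3 and classify the samples into good and bad classes.
# /!\ RUN ONLY ONCE /!\
import subprocess
from os import listdir
from os.path import isdir

force_download = False
download_dir = "../data/cwe121_annot/bad"
# The "bad" class is expected to contain 4944 entries once the download is complete.
expected_bad_count = 4944
need_download = not isdir(download_dir) or len(listdir(download_dir)) != expected_bad_count

if need_download or force_download:
    print("Downloading CWE-121...")
    # check=True raises CalledProcessError if the script exits with a non-zero status,
    # so the success message below is not printed after a failed download.
    subprocess.run("../scripts/download_cwe121.sh", check=True)

print("CWE-121 downloaded")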

CWE-121 downloaded


## 01.a. Imports and logging configuration

Once the dataset is downloaded, the next step is to perform the necessary imports and configure the program.

In [19]:
# Specific instruction to run the notebooks from a sub-folder.
import sys
sys.path.append("..")

In [20]:
import logging
from tools.settings import LOGGER
from tools.dataset import CWEClassificationDataset as Dataset
from tools.dataset.processing.content_ops import RemoveMainFunction, ReplaceLitterals
from tools.dataset.processing.dataset_ops import CopyDataset, ExtractSampleDataset
from tools.dataset.processing.file_ops import RemoveCppFiles, RemoveInterproceduralTestCases

In [21]:
# Set up logging to output messages at INFO level and above
LOGGER.setLevel(logging.INFO)
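
To illustrate what the level setting does, here is a small sketch using only the standard `logging` module. It assumes `LOGGER` behaves like a regular `logging.Logger`, which the notebook does not state explicitly; `demo_logger` is a stand-in created for this example.

```python
import logging

# Stand-in logger; the notebook's LOGGER comes from tools.settings instead.
demo_logger = logging.getLogger("preprocessing_demo")
demo_logger.setLevel(logging.INFO)

# DEBUG is below the threshold and is dropped; INFO and above pass through.
debug_enabled = demo_logger.isEnabledFor(logging.DEBUG)
info_enabled = demo_logger.isEnabledFor(logging.INFO)
warning_enabled = demo_logger.isEnabledFor(logging.WARNING)
print(debug_enabled, info_enabled, warning_enabled)
```

Switching the level to `logging.DEBUG` would enable all three checks.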

In [22]:
# Dataset directories (DO NOT EDIT)
classified_dataset_path = "../data/cwe121_annot"
cleaned_dataset_path = "../data/cwe121_dataset"
cwe121_dataset_path = "../data/cwe121_training_orig"

# Number of samples to extract (edit this number; it affects performance, max. 6288)
sample_nb = 200

## 01.b. Cleanup

Clean up the downloaded data to ensure correct parsing in later steps. The cleaned dataset is stored in **../data/cwe121_dataset**.

In [23]:
# Create a copy of the annotated dataset to avoid overwriting the original
classified_dataset = Dataset(classified_dataset_path)
classified_dataset.queue_operation(CopyDataset, {"to_path": cleaned_dataset_path, "force": True})
classified_dataset.process()

[2019-12-02 18:07:44][INFO] Dataset index build in 429ms. 9888 test_cases, 2 classes, 0 features (v0).
[2019-12-02 18:07:44][INFO] Running operation 1/1 (CopyDataset)...
[2019-12-02 18:07:44][INFO] Dataset index build in 54ms. 1666 test_cases, 2 classes, 0 features (v0).
[2019-12-02 18:07:44][INFO] Running operation 1/1 (RightFixer)...
[2019-12-02 18:07:45][INFO] 1 operations run in 1064ms.
[2019-12-02 18:07:48][INFO] 1 operations run in 4345ms.
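
The cell above follows a queue-then-process pattern: operations are registered first and only executed when `process()` is called. The sketch below is a toy model of that pattern under stated assumptions. `OperationQueue` and `drop_suffix` are hypothetical names, and the code is not the `tools.dataset` implementation.

```python
class OperationQueue:
    """Toy stand-in for the queue-then-process pattern (not the tools.dataset code)."""

    def __init__(self, items):
        self.items = list(items)
        self.pending = []

    def queue_operation(self, operation, params=None):
        # Store the callable and its parameters; nothing runs yet.
        self.pending.append((operation, params or {}))

    def process(self):
        # Run the queued operations in insertion order, then empty the queue.
        total = len(self.pending)
        for index, (operation, params) in enumerate(self.pending, start=1):
            print(f"Running operation {index}/{total} ({operation.__name__})...")
            self.items = operation(self.items, **params)
        self.pending.clear()
        return self.items


def drop_suffix(items, suffix):
    """Hypothetical operation: keep only the names that do not end with suffix."""
    return [name for name in items if not name.endswith(suffix)]


queue = OperationQueue(["a.c", "b.cpp", "c.c"])
queue.queue_operation(drop_suffix, {"suffix": ".cpp"})
remaining = queue.process()
```

Deferring execution this way lets several operations be declared up front and run as one batch.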


In [24]:
# Clean up the new dataset
cleaned_dataset = Dataset(cleaned_dataset_path)

cleaned_dataset.queue_operation(RemoveCppFiles)
cleaned_dataset.queue_operation(RemoveInterproceduralTestCases)
cleaned_dataset.queue_operation(RemoveMainFunction)
cleaned_dataset.queue_operation(ReplaceLitterals)

cleaned_dataset.process()

[2019-12-02 18:07:53][INFO] Dataset index build in 413ms. 9888 test_cases, 2 classes, 0 features (v0).
[2019-12-02 18:07:53][INFO] Running operation 1/4 (RemoveCppFiles)...
[2019-12-02 18:07:54][INFO] Dataset index build in 316ms. 8684 test_cases, 2 classes, 0 features (v0).
[2019-12-02 18:07:54][INFO] Running operation 2/4 (RemoveInterproceduralTestCases)...
[2019-12-02 18:07:55][INFO] Dataset index build in 231ms. 6288 test_cases, 2 classes, 0 features (v0).
[2019-12-02 18:07:55][INFO] Running operation 3/4 (RemoveMainFunction)...
[2019-12-02 18:07:57][INFO] Dataset index build in 208ms. 6288 test_cases, 2 classes, 0 features (v0).
[2019-12-02 18:07:57][INFO] Running operation 4/4 (ReplaceLitterals)...
[2019-12-02 18:08:04][INFO] 4 operations run in 11119ms.
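
As an illustration of the kind of content rewrite performed in this step, the sketch below strips a guarded `main` function from C source text. It assumes that `main` sits inside an `#ifdef INCLUDEMAIN` ... `#endif` region, as Juliet test cases are generally laid out. Whether `RemoveMainFunction` works this way is not stated in this notebook, and `strip_includemain_block` is a hypothetical helper.

```python
def strip_includemain_block(source):
    """Drop the '#ifdef INCLUDEMAIN' ... '#endif' region from C source text."""
    kept = []
    depth = 0  # nesting depth of conditionals inside the INCLUDEMAIN region
    for line in source.splitlines():
        stripped = line.strip()
        if depth == 0:
            if stripped.startswith("#ifdef INCLUDEMAIN"):
                depth = 1  # entering the region; the directive itself is dropped
            else:
                kept.append(line)
        elif stripped.startswith("#if"):
            # Covers #if, #ifdef and #ifndef nested inside the region.
            depth += 1
        elif stripped.startswith("#endif"):
            # Leaving one level; at depth 0 the region is closed.
            depth -= 1
    return "\n".join(kept)
```

The depth counter is what keeps a nested `#ifndef` ... `#endif` pair inside `main` from ending the region early.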


## 01.c. Subset extraction

Extract a subset of the cleaned data into **../data/cwe121_training_orig** for testing purposes.

In [25]:
# Extract a subset of `sample_nb` samples (200 here) for training, test and validation purposes.
cleaned_dataset.queue_operation(
    ExtractSampleDataset, {"to_path": cwe121_dataset_path, "sample_nb": sample_nb, "force": True}
)
cleaned_dataset.process()

# Open the dataset to see its statistics.
cwe121_dataset = Dataset(cwe121_dataset_path)

[2019-12-02 18:08:10][INFO] Running operation 1/1 (ExtractSampleDataset)...
[2019-12-02 18:08:10][INFO] 1 operations run in 119ms.
[2019-12-02 18:08:10][INFO] Dataset index build in 8ms. 200 test_cases, 2 classes, 0 features (v0).
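
For intuition, subset extraction can be pictured as drawing a fixed number of test cases from each class. The sketch below is one possible approach, assuming the sample is split as evenly as possible across classes. The notebook does not say how `ExtractSampleDataset` distributes samples, and `sample_test_cases` is a hypothetical helper.

```python
import random


def sample_test_cases(cases_by_class, sample_nb, seed=0):
    """Draw sample_nb test cases, split as evenly as possible across classes."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    classes = sorted(cases_by_class)
    base, extra = divmod(sample_nb, len(classes))
    picked = {}
    for index, name in enumerate(classes):
        # The first `extra` classes receive one additional sample each.
        quota = base + (1 if index < extra else 0)
        picked[name] = rng.sample(cases_by_class[name], quota)
    return picked


# Toy data standing in for the real test case names.
toy_cases = {
    "bad": [f"bad_{i}" for i in range(10)],
    "good": [f"good_{i}" for i in range(10)],
}
subset = sample_test_cases(toy_cases, sample_nb=6)
```

`random.Random.sample` raises `ValueError` if a class holds fewer items than its quota, which makes an oversized `sample_nb` fail loudly rather than silently shrink the subset.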


## Conclusion

In this part, the initial dataset was cleaned and is now ready to be processed by Joern. The [next notebook](./02_joern_processing.ipynb) details the steps to run Joern and import the dataset into a Neo4j database.