# 01. Pre-processing

This notebook pre-processes a classified C/C++ dataset extracted from [Juliet 1.3](https://samate.nist.gov/SRD/testsuite.php) so that it is correctly formatted before being parsed by Joern.

Download the dataset using the script at `../scripts/download_cwe121.sh`. In the **data** directory, the following directories should be present:
* **cwe121_annot**: classified dataset with bad (buggy) and good (fixed) classes.
* **cwe121_orig**: original dataset, unzipped version of the archive listed below.
* **Juliet_Test_Suite_v1.3_for_C_Cpp.zip**: dataset downloaded from the SARD website.
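The layout listed above can be checked before going further. The following is a minimal sketch, not part of the `bugfinder` package: the helper name `missing_entries` is hypothetical, and the `../data` location is assumed from the paths used later in this notebook.

```python
from pathlib import Path

# Entries expected in the data directory (names taken from the list above).
EXPECTED_ENTRIES = (
    "cwe121_annot",
    "cwe121_orig",
    "Juliet_Test_Suite_v1.3_for_C_Cpp.zip",
)


def missing_entries(data_dir):
    """Return the expected entries that are absent from data_dir (hypothetical helper)."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Assumption: the notebook runs from a sub-folder, so the data lives in ../data.
print(missing_entries("../data"))
```

An empty list means all three entries are present; any name returned points at something the download script still has to produce.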

## 01.a. Imports and logging configuration

The first step is to perform the necessary imports and configure the program. If the dataset needs to be downloaded, this can be done in the last cell of this section.

In [None]:
# Enable these lines if live changes are made to the codebase
# %load_ext autoreload
# %autoreload 2

In [None]:
# Add the parent directory to the import path so the notebooks can run from a sub-folder.
import sys
sys.path.append("..")

In [None]:
import logging
from bugfinder.settings import LOGGER
from bugfinder.dataset import CWEClassificationDataset as Dataset
from bugfinder.dataset.processing.content_ops import RemoveMainFunction, ReplaceLitterals
from bugfinder.dataset.processing.dataset_ops import CopyDataset, ExtractSampleDataset
from bugfinder.dataset.processing.file_ops import RemoveCppFiles, RemoveInterproceduralTestCases

In [None]:
# Set up logging to output messages at INFO level and above
LOGGER.setLevel(logging.INFO)

In [None]:
# Dataset directories (DO NOT EDIT)
classified_dataset_path = "../data/cwe121_annot"
cleaned_dataset_path = "../data/cwe121_dataset"
cwe121_dataset_path = "../data/cwe121_v000_orig"

# Number of samples to extract (edit this number; it affects performance, max. 6288)
sample_nb = 200

### Optional Step: Download the dataset

Use the following cell to download the dataset. Beware that <u>it will use all available CPUs</u> and can take a long time. The cell needs to be run only if the dataset is not present or has been tampered with.

In [None]:
# Download the CWE-121 test cases from Juliet 1.3 and classify the samples into good and bad classes.
import subprocess
from os import listdir
from os.path import isdir

force_download = False  # Change to True if the dataset has been tampered with
download_dir = "../data/cwe121_annot/bad"
need_download = (not isdir(download_dir) or len(listdir(download_dir)) != 4944)

if need_download or force_download:
    LOGGER.info("Downloading CWE-121 dataset...")
    # check=True raises CalledProcessError if the script exits with an error
    subprocess.run("../scripts/download_cwe121.sh", check=True)

LOGGER.info("CWE-121 dataset is available.")

## 01.b. Cleanup

Clean up the downloaded data to ensure correct parsing in later steps. The cleaned dataset will be stored in **../data/cwe121_dataset**.

In [None]:
# Create a copy of the annotated dataset to avoid overwriting the original
classified_dataset = Dataset(classified_dataset_path)
classified_dataset.queue_operation(CopyDataset, {"to_path": cleaned_dataset_path, "force": True})
classified_dataset.process()

In [None]:
# Clean up the new dataset
cleaned_dataset = Dataset(cleaned_dataset_path)

cleaned_dataset.queue_operation(RemoveCppFiles)
cleaned_dataset.queue_operation(RemoveInterproceduralTestCases)
cleaned_dataset.queue_operation(RemoveMainFunction)
cleaned_dataset.queue_operation(ReplaceLitterals)

cleaned_dataset.process()

## 01.c. Subset extraction

Extract a subset of the data for testing purposes at **../data/cwe121_v000_orig**.

In [None]:
# Extract a subset of sample_nb samples for training, testing and validation purposes.
cleaned_dataset = Dataset(cleaned_dataset_path)
cleaned_dataset.queue_operation(
    ExtractSampleDataset, {"to_path": cwe121_dataset_path, "sample_nb": sample_nb, "force": True}
)
cleaned_dataset.process()
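Once the extraction has run, it can be useful to see how many entries ended up in each class. The sketch below is an assumption-laden illustration rather than a `bugfinder` feature: `count_entries_per_class` is a hypothetical helper, and it assumes the extracted dataset keeps one sub-directory per class (such as `bad` and `good`), like the annotated dataset does.

```python
from pathlib import Path


def count_entries_per_class(dataset_dir):
    """Map each class directory name to its number of entries (hypothetical helper)."""
    root = Path(dataset_dir)
    if not root.is_dir():
        # Nothing has been extracted yet
        return {}
    return {
        class_dir.name: sum(1 for _ in class_dir.iterdir())
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }


# Path assumed from the cwe121_dataset_path variable defined earlier.
print(count_entries_per_class("../data/cwe121_v000_orig"))
```

Comparing these counts with `sample_nb` gives a quick sanity check; how the samples are split between classes depends on `ExtractSampleDataset` and is not assumed here.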

## Conclusion

In this part, the initial dataset was cleaned and is now ready to be processed by Joern. The [next notebook](./02_joern_processing.ipynb) details the steps to run Joern and import the dataset into a Neo4j database.