# 01. Pre-processing

This notebook pre-processes a classified C/C++ dataset, designed for bug-finding classification, so that it is correctly formatted before Joern parsing.

Download the dataset using the script at `../scripts/setup_ai_dataset.sh`. A new folder **data/ai-dataset_orig** should appear, containing the classified dataset with *bad* (buggy) and *good* (fixed) classes.

## 01.a. Imports and logging configuration

The first step is to perform the necessary imports and configure the program. If the dataset needs to be downloaded, this can be done in the last cell of this section.

In [None]:
# Enable these lines if live changes are made to the codebase
# %load_ext autoreload
# %autoreload 2

In [None]:
# Add the parent directory to the import path so the notebook can run from a sub-folder.
import sys
sys.path.append("..")

In [None]:
import logging
from os.path import join
from bugfinder.settings import LOGGER
from bugfinder.base.dataset import CodeWeaknessClassificationDataset as Dataset
from bugfinder.processing.dataset.copy import CopyDataset
from bugfinder.processing.dataset.extract import ExtractSampleDataset
from bugfinder.processing.cleaning.remove_main_function import RemoveMainFunction
from bugfinder.processing.cleaning.replace_litterals import ReplaceLitterals
from bugfinder.processing.cleaning.remove_cpp_files import RemoveCppFiles
from bugfinder.processing.cleaning.remove_interproc_files import RemoveInterprocFiles

In [None]:
# Set up logging to output messages at INFO level and above
LOGGER.setLevel(logging.INFO)

In [None]:
# Dataset directories (DO NOT EDIT)
classified_dataset_path = "../data/ai-dataset_orig"
cleaned_dataset_path = "../data/ai-dataset_cleaned"
subset_dataset_path = "../data/ai-dataset_v000"

# Number of samples to extract (edit this number; performance will be affected)
sample_nb = 200

### Optional Step: Download the dataset

Use the following cell to download the dataset. The cell needs to be run only if the dataset is not present or has been tampered with.

In [None]:
# Download the dataset and classify the samples between good and bad classes.
import subprocess
from os import listdir
from os.path import isdir

force_download = False  # Change to True if the dataset has been tampered with
download_dir = join(classified_dataset_path, "bad")
need_download = (not isdir(download_dir) or len(listdir(download_dir)) != 6507)

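if need_download or force_download:
    LOGGER.info("Downloading dataset...")
    # check=True raises CalledProcessError if the setup script fails
    subprocess.run("../scripts/setup_ai_dataset.sh", check=True)
    LOGGER.info("Dataset has been downloaded.")
else:
    LOGGER.info("Dataset already present, skipping download.")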
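Before moving on, it can help to confirm that both class folders exist and are populated. The following is a minimal sketch using only the standard library; the helper `count_class_entries` is a hypothetical name introduced here and is not part of `bugfinder`. It assumes the dataset root contains one folder per class (*good* and *bad*), as described above.

```python
from pathlib import Path


def count_class_entries(dataset_path, classes=("good", "bad")):
    """Return the number of directory entries found under each class folder.

    Hypothetical helper for illustration; not part of bugfinder.
    """
    root = Path(dataset_path)
    counts = {}
    for class_name in classes:
        class_dir = root / class_name
        # A missing class folder is reported as zero entries, not as an error.
        if class_dir.is_dir():
            counts[class_name] = sum(1 for _ in class_dir.iterdir())
        else:
            counts[class_name] = 0
    return counts


# Path taken from the configuration cell above.
print(count_class_entries("../data/ai-dataset_orig"))
```

A count of zero for either class suggests that the download did not complete and that the cell above should be re-run with `force_download = True`.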

## 01.b. Cleanup

Clean up the downloaded data to ensure correct parsing in later steps. The cleaned dataset will be stored in **./data/ai-dataset_cleaned**.

In [None]:
# Copy the classified dataset so the original is not overwritten
classified_dataset = Dataset(classified_dataset_path)
classified_dataset.queue_operation(CopyDataset, {"to_path": cleaned_dataset_path, "force": True})
classified_dataset.process()
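The cells in this notebook follow a queue-then-process pattern: operations are registered with `queue_operation` and only run when `process` is called. The toy class below illustrates that general pattern. It is not `bugfinder`'s implementation; `ToyDataset` and `drop_suffix` are hypothetical names, and the real library may behave differently in its details.

```python
class ToyDataset:
    """Toy stand-in that defers operations until process() is called.

    Illustration only; this is not how bugfinder is implemented.
    """

    def __init__(self, items):
        self.items = list(items)
        self._queue = []

    def queue_operation(self, operation, options=None):
        # Nothing runs here: the operation and its options are only recorded.
        self._queue.append((operation, options or {}))

    def process(self):
        # Run the queued operations in insertion order, emptying the queue.
        while self._queue:
            operation, options = self._queue.pop(0)
            self.items = operation(self.items, **options)


def drop_suffix(items, suffix):
    """Hypothetical operation: keep only items that do not end with suffix."""
    return [item for item in items if not item.endswith(suffix)]


toy = ToyDataset(["a.c", "b.cpp", "c.c"])
toy.queue_operation(drop_suffix, {"suffix": ".cpp"})
print(toy.items)  # ['a.c', 'b.cpp', 'c.c'] (nothing has run yet)
toy.process()
print(toy.items)  # ['a.c', 'c.c']
```

Deferring execution in this way lets several operations be declared up front and applied in a single pass, in the order they were queued.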

In [None]:
# Clean up the new dataset
cleaned_dataset = Dataset(cleaned_dataset_path)

cleaned_dataset.queue_operation(RemoveCppFiles)
cleaned_dataset.queue_operation(RemoveInterprocFiles)
cleaned_dataset.queue_operation(RemoveMainFunction)
cleaned_dataset.queue_operation(ReplaceLitterals)

cleaned_dataset.process()
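One way to sanity-check the cleanup is to count the remaining files by extension. The sketch below uses only the standard library; `count_files_by_extension` is a hypothetical helper, not a `bugfinder` function. It assumes, based on the operation's name only, that `RemoveCppFiles` deletes files with C++ extensions such as `.cpp`.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(dataset_path):
    """Count regular files under dataset_path, grouped by lowercase extension.

    Hypothetical helper for illustration; not part of bugfinder.
    """
    root = Path(dataset_path)
    if not root.is_dir():
        # A missing directory yields an empty count rather than an error.
        return Counter()
    return Counter(
        path.suffix.lower() for path in root.rglob("*") if path.is_file()
    )


# Path taken from the configuration cell above.
print(count_files_by_extension("../data/ai-dataset_cleaned"))
```

Running the same helper on `../data/ai-dataset_orig` gives a baseline to compare against. Under the assumption above, the cleaned dataset should report no `.cpp` entries.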

## 01.c. Subset extraction

Extract a subset of the cleaned data into **./data/ai-dataset_v000** for testing purposes.

In [None]:
# Extract a subset of `sample_nb` samples for training, test and validation purposes.
cleaned_dataset = Dataset(cleaned_dataset_path)
cleaned_dataset.queue_operation(
    ExtractSampleDataset, {"to_path": subset_dataset_path, "sample_nb": sample_nb, "force": True}
)
cleaned_dataset.process()

## Conclusion

In this part, the initial dataset was cleaned and is now ready to be processed by Joern. The [next notebook](./02_joern_processing.ipynb) details the steps to run Joern and import the dataset into a Neo4j database.