[part of the FAVD replication package]
Our input dataset builds on two existing datasets:
- TransferRepresentationLearning Dataset
  These vulnerabilities come from six open-source projects: FFmpeg, LibTIFF, LibPNG, Pidgin, Asterisk and VLC Media Player. Vulnerability information was collected from NVD until the end of July 2017.
- VDISC Dataset
  This set contains the function names extracted from the source code of 1.27 million functions mined from open-source software, labeled by static analysis for potential vulnerabilities. The data is distributed in three files corresponding to an 80:10:10 train/validate/test split, but for our purposes we concatenate all three.
The collect-data.py script downloads the original data files as zip archives and uncompresses them into 6-projects-raw and VDISC-raw, respectively.
It then parses all code fragments using Lizard and stores the results in seven pairs of files in the directory processed:
- Asterisk_benign.txt and Asterisk_vulnerable.txt
- FFmpeg_benign.txt and FFmpeg_vulnerable.txt
- LibPNG_benign.txt and LibPNG_vulnerable.txt
- LipTIFF_benign.txt and LipTIFF_vulnerable.txt
- Pidgin_benign.txt and Pidgin_vulnerable.txt
- VLC_benign.txt and VLC_vulnerable.txt
- VDISC_benign.txt and VDISC_vulnerable.txt
The script is 'smart' in that it will not overwrite files that were already collected (thus, you'll need to remove them to download/generate again).
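The skip-if-present behaviour can be sketched roughly as follows. This is only an illustration and not the actual logic of collect-data.py: the helpers `expected_outputs` and `missing_outputs` are hypothetical, and the dataset prefixes are copied from the file names listed above (including the `LipTIFF` spelling).

```python
from pathlib import Path

# Dataset prefixes as they appear in the file names listed above.
DATASETS = ["Asterisk", "FFmpeg", "LibPNG", "LipTIFF", "Pidgin", "VLC", "VDISC"]
LABELS = ["benign", "vulnerable"]


def expected_outputs(directory):
    """Return the 14 output paths (7 datasets x 2 labels) under `directory`."""
    base = Path(directory)
    return [base / f"{name}_{label}.txt" for name in DATASETS for label in LABELS]


def missing_outputs(directory):
    """Return only the output paths that do not exist yet.

    Hypothetical helper: a collector following a 'do not overwrite'
    policy would regenerate just these files and leave the rest untouched.
    """
    return [path for path in expected_outputs(directory) if not path.exists()]


if __name__ == "__main__":
    for path in missing_outputs("processed"):
        print(f"would (re)generate {path}")
```

Under this sketch, deleting a single file such as processed/VDISC_benign.txt would cause only that file to be reported as missing.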
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt
The input data used for the paper was collected using srcML instead of Lizard. Since srcML is an external tool, it required spawning a sub-process for each code fragment, which made extraction take several days. Switching to Lizard addresses this: extraction now runs in less than an hour. However, because the two tools parse code differently, the collected datasets are slightly different (the total number of functions successfully parsed using Lizard is somewhat larger, but there may also be functions missing that srcML could parse and Lizard could not).
Overall, the differences are so small that we do not expect them to have any significant impact on the findings. For replication purposes, we provide copies of the two datasets in processed-srcML and processed-lizard, respectively; processed initially contains the same content as processed-srcML.
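To get a feel for how much the two copies differ, one could compare them file by file. The sketch below is not part of the replication package; it assumes that each processed file holds one entry per line (the real layout may differ), and the function names `read_entries`, `compare_files` and `compare_directories` are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def read_entries(path):
    """Read a file as a multiset of non-empty lines.

    Assumption: each line holds one entry; the real file layout may differ.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return Counter(line.strip() for line in handle if line.strip())


def compare_files(path_a, path_b):
    """Return (total_a, total_b, only_in_a, only_in_b) for two files."""
    a, b = read_entries(path_a), read_entries(path_b)
    only_a = a - b  # Counter subtraction keeps positive counts only
    only_b = b - a
    return (sum(a.values()), sum(b.values()),
            sum(only_a.values()), sum(only_b.values()))


def compare_directories(dir_a, dir_b):
    """Compare every *.txt file in `dir_a` with its namesake in `dir_b`."""
    report = {}
    for file_a in sorted(Path(dir_a).glob("*.txt")):
        file_b = Path(dir_b) / file_a.name
        if file_b.exists():
            report[file_a.name] = compare_files(file_a, file_b)
    return report
```

Calling `compare_directories("processed-srcML", "processed-lizard")` would return, per file name, the entry totals in each copy and the number of entries found in only one of them. Counting with a multiset rather than a set matters here because the same function name can occur more than once.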