These source code samples reproduce the experimental results reported in our paper "An Information-Theoretic and Contrastive Learning-based Approach for Identifying Code Statements Causing Software Vulnerability". Refer to http://arxiv.org/abs/2209.10414 for details.
We used three real-world datasets: the CWE-399 dataset, with 1,010 vulnerable and 1,313 non-vulnerable functions, for resource management error vulnerabilities; the CWE-119 dataset, with 5,582 vulnerable and 5,099 non-vulnerable functions, for buffer error vulnerabilities; and a large C/C++ dataset provided by Fan et al. (2020), which contains many types of vulnerabilities such as Out-of-bounds Write, Improper Input Validation, and Path Traversal. For the CWE-399 and CWE-119 datasets collected by Li et al. (2018), we used the versions processed by Nguyen et al. (2021). The Fan et al. dataset is considered one of the largest vulnerability datasets that include ground truth at the statement level. It was collected from 348 open-source GitHub projects from 2002 to 2019 and consists of 188,636 C/C++ source code functions, of which 5.7% (i.e., 10,900 functions) are vulnerable.
We implemented our LiVu-ITCL method using TensorFlow (Abadi et al. 2016) (version 2.5) and Python (version 3.8). Other required packages are scikit-learn, numpy, scipy, and pickle (pickle is part of the Python standard library and needs no separate installation).
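Before running the scripts, it can help to confirm which versions of these packages are installed. The following is a minimal sketch, not part of the released code: the helper name installed_versions is hypothetical, and the package names are assumed to be the ones published on PyPI.

```python
import sys
from importlib import metadata

# Package names as assumed to be published on PyPI. pickle is omitted
# because it is part of the Python standard library.
REQUIRED = ["tensorflow", "scikit-learn", "numpy", "scipy"]


def installed_versions(packages=REQUIRED):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print the interpreter version first, then one line per package.
    print("python", "%d.%d" % sys.version_info[:2])
    for name, version in installed_versions().items():
        print(name, version or "NOT INSTALLED")
```

The versions stated above (TensorFlow 2.5, Python 3.8) are the ones to compare against; the sketch only reports what is installed and does not enforce anything.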
Here, we provide instructions for using the source code samples of our LiVu-ITCL method on the Fan et al. dataset.
The folder named “Fan_dataset” contains all of the necessary files for the Fan et al. dataset. The file named “Fan_data_train_evaluate.py” is the source code for both training and evaluating our proposed LiVu-ITCL method. The file named “Fan_data_VCP_VCA_TopK_IFA.py” is the source code for computing the main measures of fine-grained vulnerability detection: VCP, VCA, Top-10 accuracy (Top-10 ACC), and IFA.
The file named “Utils.py” is a collection of helper Python functions used when training and evaluating the model.
To train our model, use a command such as the following: "python Fan_data_train_evaluate.py --lr=1e-4 --sigma=1e-1 --tau=0.5 --temp=0.5 --dim_dnn=300 --clusters=7 --train_epochs=5 --home_dir=./Fan_data_results/ --do_train".
To evaluate the performance of our proposed LiVu-ITCL method, use a command such as the following: "python Fan_data_train_evaluate.py --lr=1e-4 --sigma=1e-1 --tau=0.5 --temp=0.5 --dim_dnn=300 --clusters=7 --home_dir=./Fan_data_results/ --do_eval".
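The two example commands share the same hyperparameter values and output folder, so it may be convenient to build both from a single place. The sketch below is one way to do that under stated assumptions: the helper names build_command and run are hypothetical, and only the flags and values shown in the example commands are used.

```python
import subprocess
import sys

# Hyperparameter values copied from the example commands; adjust as needed.
HPARAMS = {
    "lr": "1e-4",
    "sigma": "1e-1",
    "tau": "0.5",
    "temp": "0.5",
    "dim_dnn": "300",
    "clusters": "7",
}
HOME_DIR = "./Fan_data_results/"


def build_command(mode, train_epochs=5):
    """Build the argument list for mode 'train' or 'eval'."""
    if mode not in ("train", "eval"):
        raise ValueError("mode must be 'train' or 'eval'")
    cmd = [sys.executable, "Fan_data_train_evaluate.py"]
    cmd += ["--%s=%s" % (key, value) for key, value in HPARAMS.items()]
    if mode == "train":
        # Only the training example passes --train_epochs.
        cmd.append("--train_epochs=%d" % train_epochs)
    cmd.append("--home_dir=%s" % HOME_DIR)
    cmd.append("--do_%s" % mode)
    return cmd


def run(mode):
    """Run one step; raises CalledProcessError if the script fails."""
    subprocess.run(build_command(mode), check=True)
```

Calling run("train") followed by run("eval") from the folder that contains Fan_data_train_evaluate.py would then issue the same two commands as above.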
To obtain the results for the main measures of fine-grained vulnerability detection (VCP, VCA, Top-10 ACC, and IFA), use a command such as the following: "python Fan_data_VCP_VCA_TopK_IFA.py './Fan_data_results/'".
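To make these four measures concrete, here is a small illustrative sketch. It is not the code in Fan_data_VCP_VCA_TopK_IFA.py, and the definitions it uses are assumptions based on how these measures are commonly described: VCP as the share of ground-truth vulnerable statements captured by the top-K selected statements, VCA as the share of functions whose ground-truth statements are all captured, Top-K ACC as the share of functions with at least one ground-truth statement in the top K, and IFA as the number of non-vulnerable statements ranked before the first vulnerable one. The input format (a ranked list of statement indices plus a set of ground-truth indices per function) is also hypothetical; please consult the paper and the script for the exact definitions.

```python
def ifa(ranked, truth):
    """Count statements ranked before the first ground-truth statement."""
    for position, stmt in enumerate(ranked):
        if stmt in truth:
            return position
    return len(ranked)  # no ground-truth statement was ranked at all


def summarize(samples, k=10):
    """samples: list of (ranked statement indices, set of ground-truth indices).

    Returns a dict with VCP, VCA, Top-K ACC, and mean IFA under the
    assumed definitions described in the text.
    """
    captured = total_truth = full_hits = topk_hits = ifa_sum = 0
    for ranked, truth in samples:
        selected = set(ranked[:k])       # top-K statements by model score
        hit = selected & truth
        captured += len(hit)
        total_truth += len(truth)
        full_hits += int(truth <= selected)  # every ground-truth stmt captured
        topk_hits += int(bool(hit))          # at least one captured
        ifa_sum += ifa(ranked, truth)
    n = len(samples)
    return {
        "VCP": captured / total_truth,
        "VCA": full_hits / n,
        "TopK_ACC": topk_hits / n,
        "IFA": ifa_sum / n,
    }


# Toy input: two functions, statements ranked from most to least suspicious.
toy = [([3, 1, 4, 0, 2], {1, 2}), ([0, 1, 2], {2})]
print(summarize(toy, k=2))
```

The toy input uses K=2 only to keep the example small; the reported measure in this work is Top-10 ACC.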
For the LiVu-ITCL model configuration, please read the Model configuration section in the appendix of our paper for details.
If you reference our paper and/or use our source code samples in your work, please cite our paper.
@article{vannguyen-livuitcl-2022,
doi = {10.48550/ARXIV.2209.10414},
url = {https://arxiv.org/abs/2209.10414},
author = {Nguyen, Van and Le, Trung and Tantithamthavorn, Chakkrit and Grundy, John and Nguyen, Hung and Camtepe, Seyit and Quirk, Paul and Phung, Dinh},
title = {An Information-Theoretic and Contrastive Learning-based Approach for Identifying Code Statements Causing Software Vulnerability},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}