Contrastive and counterfactual explanations for test case prioritization: Ideas and challenges

This repository contains additional material for the paper "Contrastive and counterfactual explanations for test case prioritization: Ideas and challenges", accepted for presentation in the AI4SE track of JISBD (SISTEDES 2023).

Experiment description

In the paper, we describe different practical scenarios in which the use of contrastive and counterfactual explanations could be of interest to understand learning-to-rank (LTR) models for test case prioritisation (TCP). This short paper includes an example, whose code is available as a Python notebook: we train and test an LTR model (see methodology below) and then generate counterfactual explanations for a particular test case using DiCE.

Organisation

This repository is organised as follows:

  • The code folder includes the Python notebook and additional classes to replicate the experiment.
  • The results folder includes the CSV file with the counterfactuals generated by DiCE.

Dependencies

Our experimental study uses resources from another paper [1]. Please visit their repository for details about the original data and preprocessing code. The dataset provided by the authors is hosted on Zenodo. Follow the instructions to download and unzip the artifact TCP-CI-main-dataset.tar.gz. The extracted folder (datasets) and our code folder should be at the same level.

Our code was developed with Python 3.10. To install the dependencies required to run the notebook, use the following command:

pip install -r requirements.txt

For training the learning-to-rank model, we use the LambdaMART algorithm from LightGBM. Explanations are generated by the DiCE method.

Methodology

Here, we give additional details on the choice of system (and build), the algorithm setup, and the training process:

Dataset

We use the TCP dataset collected by Yaraghi et al. [1], which comprises 25 open-source software systems written in Java. The systems were carefully selected from public repositories, ensuring a relevant number of failed test cases and average execution times of at least 5 minutes. From the 25 systems, we select the one with the highest build failure rate after excluding those that contain frequently failing test cases due to configuration issues. The selected system, angel, contains 308 builds, with 124 failed builds (40% failure rate), 33 test cases on average per build, and an average test regression time of 20 minutes. The authors collected 150 features, including test case execution records, source code metrics, and coverage information (see the appendix of their paper for the feature descriptions). To obtain test case prioritisations to train with, the test cases of each failed build were sorted by their verdict and, in case of a tie, by ascending execution time, as sketched below.
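As a loose illustration of this ordering, the following sketch (not the original preprocessing code; the column names verdict and execution_time are placeholders) derives the ground-truth ranking of a single build with pandas:

```python
import pandas as pd

# Minimal sketch of the ground-truth ranking of one build.
# verdict: 1 = failed, 0 = passed (illustrative encoding).
build = pd.DataFrame({
    "test_case": ["t1", "t2", "t3", "t4"],
    "verdict": [0, 1, 0, 1],
    "execution_time": [12.0, 30.0, 5.0, 8.0],
})

# Failed test cases first; ties broken by ascending execution time.
ranked = build.sort_values(by=["verdict", "execution_time"],
                           ascending=[False, True]).reset_index(drop=True)
print(ranked)
```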

Algorithm

LambdaMART is the LTR algorithm used for prediction based on the 150 features. LambdaMART combines LambdaRank, a pairwise LTR method frequently used in information retrieval, with the idea of boosted classification trees. We use the implementation available in LightGBM, an efficient Python library focused on gradient boosting decision tree approaches [2]. We keep the default algorithm settings, which means that 100 estimators are fitted and the maximum tree depth is not limited.
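For reference, a LambdaMART ranker with these defaults can be instantiated through LightGBM's scikit-learn interface as follows; this is a minimal sketch, and the exact arguments used in the notebook may differ:

```python
import lightgbm as lgb

# LambdaMART = the LambdaRank objective on top of gradient-boosted trees.
# n_estimators=100 and max_depth=-1 (no depth limit) are LightGBM's defaults,
# written out explicitly here for clarity.
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100, max_depth=-1)
```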

Training and evaluation

We follow a hold-out strategy to train and test the LTR model, using one build for testing and all previous builds as the training partition. Since LightGBM requires a validation partition, we reserve the last 20% of builds in the training partition for this purpose. Model performance is evaluated on the validation partition every 10 iterations in terms of the NDCG (Normalised Discounted Cumulative Gain) metric. This metric not only evaluates whether the most relevant items (test cases that fail, in our experiment) appear at the top of the ranking, but also takes the relative relevance of the items in the whole ranking into account. We sort test cases based on the relevance prediction returned by LambdaMART, using ascending execution time to break ties.
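The following sketch illustrates this training and evaluation setup with LightGBM's scikit-learn API; the data is synthetic and the group sizes are placeholders for the number of test cases per build, so it only mirrors the shape of the real experiment:

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in for the real data (the actual dataset has 150 features).
rng = np.random.default_rng(0)
n_features = 10
groups_train = [33, 35, 30]   # test cases per training build (illustrative)
groups_val = [34]             # last 20% of training builds, kept for validation

X_train = rng.normal(size=(sum(groups_train), n_features))
y_train = rng.integers(0, 2, size=sum(groups_train))  # 1 = failed test case
X_val = rng.normal(size=(sum(groups_val), n_features))
y_val = rng.integers(0, 2, size=sum(groups_val))

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100, max_depth=-1)
ranker.fit(
    X_train, y_train, group=groups_train,
    eval_set=[(X_val, y_val)], eval_group=[groups_val],
    eval_metric="ndcg",
    callbacks=[lgb.log_evaluation(period=10)],  # report NDCG every 10 iterations
)

# Predicted relevance for a build; test cases are then sorted by this score,
# breaking ties by ascending execution time.
scores = ranker.predict(X_val)
```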

We run LambdaMART for every failed build of the angel system to choose one for our analysis based on its performance. Notice that the system has 124 failed builds, but we discard the first two builds so that there are at least two builds for training and one build for validation. After sorting all LTR models by NDCG value (on validation) and inspecting the predicted rankings, we select build 572642875 (35 test cases, 3 of them failed) for testing and the previous 33 builds for training. The corresponding model has the 5th best performance (NDCG=0.9896) and is the first one able to rank all failed test cases at the top of the ranking. Moreover, the median difference between the true position and the predicted position over all test cases is 1.

The performance of the algorithm for all builds is available as a CSV file: lambdaMART_TCP_build_performance.csv.

Explainability method

After obtaining the ranking on the test partition, we use DiCE [3] to create counterfactual explanations. For a given sample (a test case in our paper), DiCE generates a configurable number of variants of that sample in which feature values are perturbed so that the prediction changes.

In our experiment, we split the ranking into percentiles to locate test cases that could be of interest for inspection. We generate counterfactuals for the test case appearing right below the top-25% with the aim of studying how it could enter the top-25%. Notice that the LTR model generates a continuous target value (ranking relevance), so we configure DiCE for regression tasks and indicate the desired range of values instead of a desired class as in classification. A sketch of this setup is shown below.
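As an illustration of this configuration (a sketch only: the data is synthetic, the "genetic" method and the range bounds are assumptions, and the notebook may configure DiCE differently), DiCE can be set up for a regression target as follows:

```python
import numpy as np
import pandas as pd
import dice_ml
from lightgbm import LGBMRanker

# Synthetic stand-in for the real features (the actual dataset has 150 of them).
rng = np.random.default_rng(0)
feature_names = [f"f{i}" for i in range(5)]
groups = [20, 20, 20]                                   # test cases per build
X = pd.DataFrame(rng.normal(size=(sum(groups), 5)), columns=feature_names)
y = rng.integers(0, 2, size=sum(groups))

ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=groups)

# DiCE expects the outcome column in the dataframe; here we use the predicted relevance.
df = X.copy()
df["relevance"] = ranker.predict(X)

data = dice_ml.Data(dataframe=df, continuous_features=feature_names,
                    outcome_name="relevance")
model = dice_ml.Model(model=ranker, backend="sklearn", model_type="regressor")
explainer = dice_ml.Dice(data, model, method="genetic")

# Query: the test case just below the top-25% of the ranking; desired_range asks
# for counterfactuals whose predicted relevance would fall inside the top-25%
# (bounds are illustrative).
query = X.iloc[[0]]
cfs = explainer.generate_counterfactuals(
    query, total_CFs=5,
    desired_range=[float(df["relevance"].quantile(0.75)),
                   float(df["relevance"].max())],
)
cfs.visualize_as_dataframe()
```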

The counterfactuals for the example described in the paper are available as a CSV file: example_counterfactuals.csv.

References

[1] A. S. Yaraghi, M. Bagherzadeh, N. Kahani and L. Briand, "Scalable and Accurate Test Case Prioritization in Continuous Integration Contexts," in IEEE Transactions on Software Engineering, 2022. DOI:10.1109/TSE.2022.3184842.

[2] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: a highly efficient gradient boosting decision tree," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pp. 3149–3157, 2017.

[3] "Explaining machine learning classifiers through diverse counterfactual explanations". Proc. 2020 Conference on Fairness, Accountability, and Transparency, pp. 607-617. DOI:10.1145/3351095.3372850.

Contributors

Work developed by:
