Towards Explainable Test Case Prioritisation with Learning-to-Rank Models

This repository contains additional material for the paper "Towards Explainable Test Case Prioritisation with Learning-to-Rank Models", presented at the 3rd International Workshop on Artificial Intelligence in Software Testing (AIST@ICST 2023).

Experiment description

The paper describes different practical scenarios in which eXplainable AI (XAI) can help understand learning-to-rank (LTR) models for test case prioritisation (TCP). Our experiment covers the following scenarios:

  • Notebook aist_scenario1a1b.ipynb: We train and test an LTR model (see methodology below), then generate global and local explanations to compare feature importance.
  • Notebook aist_scenario1c.ipynb: We compare the local explanations of all pairs of test cases to analyse their similarities.

Organisation

This repository is organised as follows:

  • The data folder includes the CSV files resulting from the different analyses carried out in our experiment.
  • The code folder includes the Python notebooks and additional classes to replicate the experiment.
  • The slides folder includes the slides presented at the workshop.

Dependencies

Our experimental study uses resources from another paper [1]. Please visit their repository for details about the original data and preprocessing code. The dataset provided by the authors of [1] is hosted on Zenodo. Follow the instructions to download and unzip the artifact TCP-CI-main-dataset.tar.gz. The extracted folder (datasets) and our code folder should be at the same level.

Our code was developed with Python 3.10. To install the dependencies required to run the notebooks, use the following command:

pip install -r requirements.txt

For training the learning-to-rank model, we use the LambdaMART algorithm from LightGBM. Explanations are generated with the Break Down method available in Dalex.
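
As a quick sanity check, the following minimal sketch verifies that the two main libraries are importable; the printed versions depend on whatever requirements.txt resolves to:

```python
# Minimal sanity check that the main dependencies are available.
# The printed versions depend on what requirements.txt resolves to.
import lightgbm
import dalex

print("lightgbm", lightgbm.__version__)
print("dalex", dalex.__version__)
```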

Methodology

Here, we give additional details on the choice of system (and build), the algorithm setup, and the training process:

Dataset

We use the TCP dataset collected by Yaraghi et al. [1], which comprises 25 open-source software systems written in Java. The systems were carefully selected from public repositories, ensuring a relevant number of failed test cases and average execution times of at least 5 minutes. From the 25 systems, we select the one with the highest build failure rate after excluding those that contain frequently failing test cases due to configuration issues. The selected system, angel, contains 308 builds, with 124 failed builds (40% failure rate), 33 test cases on average per build, and an average test regression time of 20 minutes. The authors collected 150 features, including test case execution records, source code metrics, and coverage information (see the appendix of their paper for the feature descriptions). To obtain the test case prioritisations used for training, the test cases of each failed build were sorted by verdict and, in case of a tie, by ascending execution time.
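
For illustration, the following sketch shows how such a target ordering can be derived for a single build. The column names (verdict, duration) are hypothetical; the actual names in the TCP-CI dataset may differ.

```python
# Illustrative sketch: deriving the target prioritisation of one failed build.
# Column names ("verdict", "duration") are hypothetical; the TCP-CI dataset
# may use different names.
import pandas as pd

build = pd.DataFrame({
    "test_case": ["t1", "t2", "t3", "t4"],
    "verdict":   [0, 1, 0, 1],            # 1 = failed, 0 = passed
    "duration":  [12.0, 30.0, 5.0, 8.0],  # execution time in seconds
})

# Failed test cases first; ties broken by ascending execution time.
target_order = build.sort_values(
    by=["verdict", "duration"], ascending=[False, True]
).reset_index(drop=True)
print(target_order)
```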

Algorithm

LambdaMART is the LTR algorithm used for prediction based on the 150 features. LambdaMART combines LambdaRank, a pairwise LTR method frequently used in information retrieval, with gradient-boosted decision trees. We use the implementation available in LightGBM, an efficient Python library focused on gradient boosting decision tree approaches [2]. We keep the default algorithm settings, which means that 100 estimators are fitted and the maximum tree depth is not limited. Like Yaraghi et al. [1], we configure the trained model to compute global feature importance (scenario 1A) based on the frequency with which each feature appears in the split nodes.
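
A minimal sketch of this configuration using LightGBM's scikit-learn interface is shown below; the parameter values correspond to the defaults described above, made explicit for readability.

```python
# Sketch of the LambdaMART configuration via LightGBM's scikit-learn API.
# All values shown are LightGBM defaults, written out explicitly.
from lightgbm import LGBMRanker

ranker = LGBMRanker(
    objective="lambdarank",   # LambdaMART: LambdaRank + gradient boosting
    n_estimators=100,         # number of boosted trees (default)
    max_depth=-1,             # tree depth not limited (default)
    importance_type="split",  # global importance = frequency in split nodes
)
```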

Training and evaluation

We follow a hold-out strategy to train and test the LTR model, using one build for testing and all previous builds as the training partition. Since LightGBM requires a validation partition, we reserve the last 20% of builds in the training partition for this purpose. Model performance is evaluated on the validation partition every 10 iterations in terms of NDCG (Normalised Discounted Cumulative Gain). This metric not only rewards rankings in which the most relevant items (failing test cases in our experiment) appear at the top, but also takes the relative relevance of items across the whole ranking into account. We sort test cases based on the relevance prediction returned by LambdaMART, using ascending execution time to break ties.
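
The sketch below illustrates this setup, reusing the ranker defined above. The feature matrices, relevance labels, per-build group sizes, and the test_build DataFrame with its duration column are assumed to have been prepared from the dataset; their names are illustrative, not taken from the notebooks.

```python
# Hold-out training with a validation partition, continuing the sketch above.
# X_*/y_*, the *_group_sizes lists and test_build are assumed to be prepared
# beforehand; their names are illustrative.
import lightgbm as lgb

ranker.fit(
    X_train, y_train,
    group=train_group_sizes,            # number of test cases per training build
    eval_set=[(X_val, y_val)],
    eval_group=[val_group_sizes],
    eval_metric="ndcg",
    callbacks=[lgb.log_evaluation(period=10)],  # report NDCG every 10 iterations
)

# Predicted relevance for the test build; higher scores come first,
# with ascending execution time used to break ties.
scores = ranker.predict(X_test)
ranking = test_build.assign(score=scores).sort_values(
    by=["score", "duration"], ascending=[False, True]
)
```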

We run LambdaMART for every failed build of the angel system and choose one for our analysis based on its performance. Notice that the system has 124 failed builds, but we discard the first two to have at least two builds for training and one build for validation. After sorting all LTR models by NDCG value (on the validation partition) and inspecting the predicted rankings, we select build 572642875 (35 test cases, 3 of them failed) for testing and the previous 33 builds for training. The corresponding model has the 5th best performance (NDCG=0.9896) and is the first one able to rank all failed test cases at the top of the ranking. Moreover, the median difference between the true position and the predicted position of all test cases is 1.

The performance of the algorithm for all builds is available as a CSV file: lambdaMART_TCP_build_performance.csv. The global feature importance of the selected model is available as a CSV file: lambdaMART_TCP_global_explanation.csv.

Explainable method

After obtaining the ranking on the test partition, we use Break Down [3] to create local explanations for particular test cases (scenarios 1B and 1C). Break Down assigns a positive or negative contribution to each feature, so that adding all contributions yields the prediction value. The goal of this method is to make explanations easier to understand by focusing on the contribution of just a few features. For scenario 1C, we calculate the cosine similarity of the feature contributions for all pairs of test cases in the ranking. We scale the feature contribution values by the sum of all contributions for each test case prediction, keeping the sign of each contribution after scaling.
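
A sketch of this step is shown below, using the dalex Python package and reusing the trained ranker and test data from the previous sketches. The exact filtering and scaling of the Break Down output may differ slightly from the notebooks.

```python
# Sketch: Break Down explanations and pairwise cosine similarity.
# Reuses `ranker`, `X_test`, `y_test` from the sketches above; the exact
# handling of the Break Down output may differ from the notebooks.
import numpy as np
import dalex as dx

explainer = dx.Explainer(ranker, X_test, y_test)

scaled_contributions = []
for i in range(len(X_test)):
    bd = explainer.predict_parts(X_test.iloc[[i]], type="break_down")
    parts = bd.result
    # Keep only the per-feature rows (drop the intercept and prediction rows).
    parts = parts[~parts["variable_name"].isin(["intercept", ""])]
    vec = parts.set_index("variable_name")["contribution"]
    # Scale by the absolute sum of contributions so that signs are preserved;
    # this is one possible reading of the scaling described above.
    scaled_contributions.append(vec / abs(vec.sum()))

def cosine_similarity(a, b):
    a, b = a.align(b, fill_value=0.0)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: similarity between the explanations of the first two test cases.
sim = cosine_similarity(scaled_contributions[0], scaled_contributions[1])
```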

The explanation similarities for all pairs of test cases in the build are available as a CSV file: lambdaMART_TCP_explanation_similarities.csv.

References

[1] A. S. Yaraghi, M. Bagherzadeh, N. Kahani and L. Briand, "Scalable and Accurate Test Case Prioritization in Continuous Integration Contexts," in IEEE Transactions on Software Engineering, 2022. DOI:10.1109/TSE.2022.3184842.

[2] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pp. 3149–3157, 2017.

[3] M. Staniak and P. Biecek, "Explanations of Model Predictions with live and breakDown Packages," The R Journal, vol. 10, no. 2, pp. 395–409, 2018.

Contributors

Work developed by:
