# Contrastive and counterfactual explanations for test case prioritization: Ideas and challenges

*Accepted for presentation in the Artificial Intelligence for Software Engineering track at the Spanish Software Engineering and Database Conference (AISE@JISBD)*

**Contributors**:
- Aurora Ramírez (Univ. Córdoba)
- Mario Berrios (Univ. Córdoba)
- Jose Raúl Romero (Univ. Córdoba)
- Robert Feldt (Chalmers Univ.)

## Dependencies
- Dataset source: https://zenodo.org/record/6415365
- Original scripts: https://github.com/Ahmadreza-SY/TCP-CI
- Paper describing the dataset (Yaraghi et al. 2022): https://doi.org/10.1109/TSE.2022.3184842
- LightGBM for training the Learning-to-Rank model: https://lightgbm.readthedocs.io/
- DiCE for counterfactual generation: http://interpret.ml/DiCE/

In [1]:
import pandas as pd
import numpy as np
import lightgbm
from pathlib import Path 
import Dataset
import dice_ml
import os

### Data preparation

1. Open the repository dataset (it already contains only failed builds)
2. Convert the dataset to create the target variable (test cases sorted by verdict and duration)
3. Split the dataset into train/test partitions for each build (so that we can test on build n, train with n-1 builds)
4. Configure data partitions for lightgbm
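
Step 2 can be illustrated with a small sketch. The actual conversion happens inside `Dataset.convert_to_lightGBM_dataset`, which is not shown here, so the function below is only an assumption about its behaviour, inferred from the prediction tables later in this notebook: within each build, failed test cases come first, ties are broken by shorter duration, and the highest target value goes to the test case that should run first. The helper name `add_ranking_target` and the toy data are hypothetical.

```python
import pandas as pd

def add_ranking_target(df, build_col="Build", verdict_col="Verdict", duration_col="Duration"):
    """Hypothetical helper: add a 'target' column where a higher value means 'run earlier'."""
    # Within each build: failed tests first (verdict descending), then shorter duration first
    ordered = df.sort_values(
        [build_col, verdict_col, duration_col],
        ascending=[True, False, True],
    ).copy()
    # Position of each test case inside its build (0 for the most relevant one)
    position = ordered.groupby(build_col).cumcount()
    # Number of test cases in the build each row belongs to
    size = ordered.groupby(build_col)[build_col].transform("size")
    ordered["target"] = size - position
    return ordered

# Toy data (assumed encoding: Verdict 1 = failed, 0 = passed)
toy = pd.DataFrame({
    "Build": [1, 1, 1, 1],
    "Test": [10, 11, 12, 13],
    "Verdict": [0, 1, 0, 1],
    "Duration": [6, 60, 24, 28],
})
ranked = add_ranking_target(toy)
print(ranked[["Test", "Verdict", "Duration", "target"]])
```

With this encoding, the target of the first test case in a build equals the number of test cases in that build, which is consistent with the `true_relevance` column shown in the prediction tables below.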

In [2]:
# Settings
root_path = '../datasets'
repo_name = 'Angel-ML@angel'
path = root_path + '/' + repo_name
output_path = Path('../output/' + repo_name)  
output_path.parent.mkdir(parents=True, exist_ok=True)

In [3]:
# Open dataset
dataset = Dataset.Dataset(path)
df_repo = pd.read_csv(path + '/dataset.csv')
feature_names = df_repo.columns
feature_names = feature_names[~np.isin(feature_names, ['Build', 'Test', 'Verdict', 'Duration'])]

In [4]:
# Convert data for lightgbm
dataset.save_feature_id_map()
df_converted = dataset.convert_to_lightGBM_dataset(df_repo)
feature_ids = df_converted.columns

In [5]:
# This function creates the train/test splits by build (the folder name is the build id)
dataset.create_ranklib_training_sets(df_converted, output_path)

Creating training sets: 100%|██████████| 124/124 [00:01<00:00, 94.63it/s]


In [6]:
# Open one train/test partition; the build id identifies the build whose ranking will be predicted
build_id = 572642875
partitions = output_path.joinpath(Path(str(build_id)))
df_train = pd.read_table(partitions.joinpath(Path('train.txt')), sep=' ', header=None)
df_test = pd.read_table(partitions.joinpath(Path('test.txt')), sep=' ', header=None)

# Assign the column names to allow indexing by name
df_train.columns = feature_ids
df_test.columns = feature_ids

In [7]:
# In lightgbm, we need an array with the number of samples in each group of samples that represent one ranking
# In our case, each build in the training partition has its own ranking, so the array should contain the number of test cases in each build
# For the test partition, we only have one ranking with all test cases in the last build
groups_builds = df_train.groupby('i_build')['i_build'].count().to_numpy()
groups_test = [len(df_test)]

# Use 20% of training groups as validation
num_builds = len(groups_builds)
val_index = int(0.8*len(groups_builds))
val_build = pd.unique(df_train['i_build'])[val_index]
groups_train = groups_builds[:val_index]
groups_validation = groups_builds[val_index:]
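
The group arrays are easy to get wrong, so a small self-contained check may help. The sketch below uses hypothetical toy data (only the column name `i_build` comes from this notebook) and shows the two properties the ranker relies on: the group sizes must add up to the number of rows, and the rows of each build must sit in one contiguous block, in the same order as the sizes.

```python
import numpy as np
import pandas as pd

# Toy training partition: three builds with 3, 2 and 4 test cases
toy_train = pd.DataFrame({
    "i_build": [101, 101, 101, 102, 102, 103, 103, 103, 103],
    "target":  [3, 2, 1, 2, 1, 4, 3, 2, 1],
})

# One entry per build, in order of first appearance, holding its number of test cases
group_sizes = toy_train.groupby("i_build", sort=False).size().to_numpy()

# The sizes must cover every row exactly once
assert group_sizes.sum() == len(toy_train)

# Each build must occupy a single contiguous block of rows:
# the number of places where the build id changes must equal the number of groups
build_changes = (toy_train["i_build"] != toy_train["i_build"].shift()).sum()
assert build_changes == len(group_sizes)

print(group_sizes)
```

Note that `groupby` sorts by key unless `sort=False` is passed, so the sizes only line up with the rows when the data frame is already ordered by build id, which is assumed to be the case for the partitions used here.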

In [8]:
# Split the first dataset into train and validation
df_val = df_train[df_train['i_build'] >= val_build]
df_train = df_train[df_train['i_build'] < val_build]

# Save target (ranking) as the variable to be predicted in all partitions
y_train = df_train['target'].values
y_val = df_val['target'].values
y_test = df_test['target'].values

# Save test case duration, ids and true verdict in the test partition to evaluate model performance
y_test_durations = df_test['i_duration']
y_test_ids = df_test['i_test']
y_test_verdicts = df_test['i_verdict']

# Remove identifier columns and duration (not used for learning) in all partitions
columns_to_remove = ['target', 'hashtag', 'qid', 'i_target', 'i_verdict', 'i_duration', 'i_test', 'i_build']
X_train = df_train.drop(columns_to_remove, axis=1)
X_val = df_val.drop(columns_to_remove, axis=1)
X_test = df_test.drop(columns_to_remove, axis=1)

# Remove from feature ids too
feature_ids = feature_ids[~np.isin(feature_ids, columns_to_remove)]

### Learning-to-rank model training

We use the LGBMRanker algorithm from the LightGBM Python library.
See: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html

1. Algorithm configuration:
- Objective: type of task, for ranking = 'lambdarank'
- Boosting type: gbdt (gradient boosting decision tree)
- Importance type: 'split' or 'gain' as criterion to weight feature importance in the trained model
- Random state: to reproduce the results
- Label gain: to specify the relative importance of the ordering of the different items (here, samples are sorted by relevance in ascending order)
2. Training the model with one validation partition.
3. Obtain feature importance based on the configured criterion.
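
To make the role of the label gain more concrete, the sketch below computes NDCG@k by hand with a configurable gain table. It is a simplified illustration, not LightGBM's implementation (details such as tie handling may differ); the function names and the toy labels are hypothetical. LightGBM's default gain grows exponentially with the label (0, 1, 3, 7, ...), whereas this notebook passes a linear gain, `label_gain[i] = i`.

```python
import numpy as np

def dcg_at_k(labels_in_ranked_order, label_gain, k):
    """Discounted cumulative gain of the first k items of a ranking."""
    labels = np.asarray(labels_in_ranked_order, dtype=int)[:k]
    gains = np.asarray(label_gain, dtype=float)[labels]
    # Position i (0-based) is discounted by 1 / log2(i + 2)
    discounts = 1.0 / np.log2(np.arange(2, len(labels) + 2))
    return float(np.sum(gains * discounts))

def ndcg_at_k(labels, scores, label_gain, k):
    """NDCG@k of the ranking induced by sorting items by descending score."""
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    predicted_order = labels[np.argsort(-scores, kind="stable")]
    ideal_order = np.sort(labels)[::-1]
    ideal = dcg_at_k(ideal_order, label_gain, k)
    return dcg_at_k(predicted_order, label_gain, k) / ideal if ideal > 0 else 1.0

# Toy ranking with four test cases and a linear gain table, as configured above
labels = [4, 3, 2, 1]
linear_gain = list(range(0, max(labels) + 1))

perfect = ndcg_at_k(labels, [6.3, 5.5, 2.6, 1.2], linear_gain, k=4)
swapped = ndcg_at_k(labels, [5.5, 5.6, 2.6, 1.2], linear_gain, k=4)  # top two items swapped
print(perfect, swapped)
```

A perfect ordering yields an NDCG of 1, and swapping the two most relevant test cases lowers it; how strongly depends on the gain table, which is what `label_gain` controls.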

In [9]:
# Configuration
ranker = lightgbm.LGBMRanker(
                    objective="lambdarank",
                    boosting_type = "gbdt",
                    importance_type = "split",
                    metric = "ndcg",
                    random_state = 0,
                    max_depth = -1,
                    n_estimators = 100,
                    label_gain = [i for i in range(0, max(y_train.max(), y_val.max())+1)]
                    )


In [10]:
# Train the model
ranker.fit(
        X=X_train, 
        y=y_train,
        group=groups_train,
        eval_set=[(X_val, y_val)],
        eval_group=[groups_validation],
        eval_at = 10,
        callbacks=[lightgbm.log_evaluation(period=10)]
        )

[10]	valid_0's ndcg@10: 0.988447
[20]	valid_0's ndcg@10: 0.988462
[30]	valid_0's ndcg@10: 0.988695
[40]	valid_0's ndcg@10: 0.993337
[50]	valid_0's ndcg@10: 0.993446
[60]	valid_0's ndcg@10: 0.993597
[70]	valid_0's ndcg@10: 0.99211
[80]	valid_0's ndcg@10: 0.989952
[90]	valid_0's ndcg@10: 0.98993
[100]	valid_0's ndcg@10: 0.989575


### Generate predictions for the last build
1. The predictions are obtained for all test cases in the last build (test partition)
2. The test cases are sorted based on the predictions and duration

In [11]:
# Predict using the test partition
test_predictions = ranker.predict(X_test)

In [12]:
# Create a data frame with the information of the test cases in the test partition
# Note: relevance is inverse to ranking, so higher relevance is actually better for the learning-to-rank approach
df_predictions = pd.DataFrame({'test_id': y_test_ids, 'true_relevance': y_test, 'true_verdict': y_test_verdicts, 
                                'duration': y_test_durations, 'prediction': test_predictions})
display(df_predictions)

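| | test_id | true_relevance | true_verdict | duration | prediction |
|---|---|---|---|---|---|
| 0 | 5141 | 35 | 1 | 28 | 6.345247 |
| 1 | 5140 | 34 | 1 | 60 | 5.548885 |
| 2 | 2953 | 33 | 1 | 70 | 5.593372 |
| 3 | 2732 | 32 | 0 | 6 | 2.653386 |
| 4 | 2161 | 31 | 0 | 24 | 2.264818 |
| 5 | 2735 | 30 | 0 | 232 | 1.803198 |
| 6 | 2749 | 29 | 0 | 1352 | 1.195901 |
| 7 | 2720 | 28 | 0 | 1630 | 0.746339 |
| 8 | 1204 | 27 | 0 | 19598 | 0.813204 |
| 9 | 4611 | 26 | 0 | 20171 | 0.152361 |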


In [13]:
# Add the predicted relevance after ordering based on the prediction and test case duration
df_predictions.sort_values(
                ['prediction', 'duration'],
                ascending=[False, True],
                inplace=True,
                ignore_index=True,
            )
pred_relevance = [i for i in range(df_predictions.shape[0], 0, -1)]
df_predictions.insert(df_predictions.shape[1], 'pred_relevance', pred_relevance)
display(df_predictions)

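| | test_id | true_relevance | true_verdict | duration | prediction | pred_relevance |
|---|---|---|---|---|---|---|
| 0 | 5141 | 35 | 1 | 28 | 6.345247 | 35 |
| 1 | 2953 | 33 | 1 | 70 | 5.593372 | 34 |
| 2 | 5140 | 34 | 1 | 60 | 5.548885 | 33 |
| 3 | 2732 | 32 | 0 | 6 | 2.653386 | 32 |
| 4 | 2161 | 31 | 0 | 24 | 2.264818 | 31 |
| 5 | 2735 | 30 | 0 | 232 | 1.803198 | 30 |
| 6 | 2749 | 29 | 0 | 1352 | 1.195901 | 29 |
| 7 | 1204 | 27 | 0 | 19598 | 0.813204 | 28 |
| 8 | 2720 | 28 | 0 | 1630 | 0.746339 | 27 |
| 9 | 510 | 18 | 0 | 22270 | 0.590176 | 26 |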
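
The true verdicts were kept to evaluate the predicted order, but this notebook does not compute a metric for it. As one possible option, the sketch below computes APFD (Average Percentage of Faults Detected), a metric commonly used for test case prioritization. It rests on an assumption not stated in the notebook: every failed test case (verdict 1) is treated as revealing its own distinct fault. The function name and the toy verdict lists are hypothetical.

```python
import numpy as np

def apfd(verdicts_in_execution_order):
    """APFD for a prioritized list of verdicts (assumed encoding: 1 = failed, 0 = passed).

    Assumption: each failed test case reveals one distinct fault.
    """
    verdicts = np.asarray(verdicts_in_execution_order)
    n = len(verdicts)
    # 1-based positions of the failing test cases in the execution order
    failing_positions = np.flatnonzero(verdicts == 1) + 1
    m = len(failing_positions)
    if n == 0 or m == 0:
        return float("nan")  # undefined without tests or without failures
    return 1.0 - failing_positions.sum() / (n * m) + 1.0 / (2 * n)

best = apfd([1, 1, 1, 0, 0])   # all failures executed first
worst = apfd([0, 0, 1, 1, 1])  # all failures executed last
print(best, worst)
```

Under these assumptions, the metric could be applied to the sorted data frame above with `apfd(df_predictions['true_verdict'].values)`.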


### Counterfactual explanations for the learning-to-rank model
1. Inspect the test results to determine relevance percentiles
2. Configure the explainer in DiCE
3. Generate counterfactuals for some test cases (change the percentile position)

In [14]:
# Build a data frame with the correspondence between feature names and id for easier management
df_feature_names = pd.DataFrame({'feature':feature_names, 'feature_id': feature_ids})
display(df_feature_names)

| | feature | feature_id |
|---|---|---|
| 0 | TES_COM_CountDeclFunction | f1 |
| 1 | TES_COM_CountLine | f2 |
| 2 | TES_COM_CountLineBlank | f3 |
| 3 | TES_COM_CountLineCode | f4 |
| 4 | TES_COM_CountLineCodeDecl | f5 |
| ... | ... | ... |
| 145 | COD_COV_PRO_IMP_MinorContributorCount | f146 |
| 146 | COD_COV_PRO_IMP_OwnersExperience | f147 |
| 147 | COD_COV_PRO_IMP_AllCommitersExperience | f148 |
| 148 | DET_COV_C_Faults | f149 |


In [15]:
# Get the predictions and compute the thresholds that split them into the desired number of percentile groups
# Example: n_perc = 4 divides the prediction values into [0-25%], [25%-50%], [50%-75%], [75%-100%] (100% is the top of the ranking)
n_perc = 4
step = 100/n_perc
pred_values = df_predictions['prediction'].values
ranking_percentiles = np.zeros(n_perc)
for i in range(0, n_perc):
    ranking_percentiles[i] = np.percentile(pred_values, (i+1)*step, method="closest_observation")
print(ranking_percentiles)

[-4.78575217 -3.35700171  0.59017627  6.34524726]
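
The choice of `method="closest_observation"` matters for the next cell, which looks up each threshold by exact equality against the predictions. The sketch below, on hypothetical toy scores, contrasts it with NumPy's default linear interpolation, which can return values that do not occur in the data. It assumes NumPy 1.22 or later, where the `method` keyword is available.

```python
import numpy as np

# Toy prediction scores (hypothetical values)
scores = np.array([-4.0, -2.5, -1.0, 0.5, 1.5, 3.0, 4.5, 6.0])
n_perc = 4
cut_points = np.linspace(100 / n_perc, 100, n_perc)  # 25, 50, 75, 100

# Default method: interpolates between neighbouring observations
linear = np.percentile(scores, cut_points)

# closest_observation: always returns one of the observed values
closest = np.percentile(scores, cut_points, method="closest_observation")

print("linear thresholds present in data: ", np.isin(linear, scores))
print("closest thresholds present in data:", np.isin(closest, scores))
```

Because every threshold returned by `closest_observation` is an actual prediction value, it can be matched back to a concrete test case.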


In [16]:
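# Get the ids of the test cases ranked right below each percentile threshold
test_ids = np.zeros(n_perc)
sorted_tests = df_predictions['test_id'].values
for i in range(0, n_perc):
    # Test case whose prediction equals the threshold (first match if several share the value)
    test_id = df_predictions[df_predictions['prediction']==ranking_percentiles[i]]['test_id'].values[0]
    # Use scalar positions so that a single id (not a one-element array) is stored
    test_index = np.where(sorted_tests==test_id)[0][0]
    test_ids[i] = sorted_tests[test_index+1]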

print(test_ids)

[4821.  506. 4611. 2953.]


In [17]:
# Configure the data model in DiCE
df_train_dice = X_train.copy()
df_train_dice['target'] = y_train
d = dice_ml.Data(dataframe=df_train_dice,
                 continuous_features=feature_ids.to_list(),
                 outcome_name='target')

In [18]:
# Configure the decision model in DiCE
m = dice_ml.Model(model=ranker, backend="sklearn", model_type='regressor')

In [19]:
# Configure the explainer object using the genetic algorithm as method to generate the counterfactuals
exp = dice_ml.Dice(d, m, method="genetic")

In [20]:
# Generate counterfactuals for different test cases
# Example 1: changes needed to move a test case from the second percentile group (counting from the top of the ranking) to the first one; generate 5 counterfactuals without constraints
num_cfs = 5
origin_perc = 2
target_perc = 1
desired_range = [ranking_percentiles[n_perc-origin_perc], ranking_percentiles[n_perc-target_perc]]

# Get the test case features from the test partition
test_id = test_ids[n_perc-origin_perc]
test_features = df_test[df_test['i_test']==test_id].drop(columns_to_remove, axis=1)

# Generate and visualize the counterfactuals
counterfactuals = exp.generate_counterfactuals(test_features,
                                  total_CFs=num_cfs,
                                  desired_range=desired_range)
counterfactuals.visualize_as_dataframe(show_only_changes=True)

# Save as dataframe
df_counterfactuals = counterfactuals.cf_examples_list[0].final_cfs_df
output_dir = '../results'
os.makedirs(output_dir, exist_ok=True)
df_counterfactuals.to_csv(os.path.join(output_dir, 'example_counterfactuals.csv'), index=False)


100%|██████████| 1/1 [00:01<00:00,  1.50s/it]

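Query instance (original outcome : 0)

| | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f142 | f143 | f144 | f145 | f146 | f147 | f148 | f149 | f150 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.071429 | 0.064088 | 0.076923 | 0.069164 | 0.181579 | 0.060521 | 0.0 | 0.057792 | 0.110333 | 0.030832 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.152361 |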



Diverse Counterfactual set (new outcome: [0.5901762683231326, 6.345247261144724])


| | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f142 | f143 | f144 | f145 | f146 | f147 | f148 | f149 | f150 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.07142857 | 0.0640878 | 0.07692308 | 0.06916427 | 0.18157895 | 0.06052142 | - | 0.05779221 | 0.11033275 | 0.030832477 | ... | - | - | - | - | - | - | - | - | - | 0.6882060930237757 |
| 0 | 0.07142857 | 0.0640878 | 0.07692308 | 0.06916427 | 0.18157895 | 0.06052142 | - | 0.05779221 | 0.11033275 | 0.030832477 | ... | - | - | - | - | - | - | - | - | - | 0.7309189405503256 |
| 0 | 0.0 | 0.0 | 0.0 | 0.00072046 | 0.01315789 | 0.0009311 | 0.06779661 | 0.00064935 | 0.00525394 | 0.002055499 | ... | - | - | - | - | - | - | - | - | - | 2.333559823348604 |
| 0 | 0.0 | 0.0 | 0.0 | 0.00072046 | 0.01315789 | 0.0009311 | 0.06779661 | 0.00064935 | 0.00525394 | 0.002055499 | ... | - | - | - | - | - | - | - | - | - | 3.0014401544884834 |
| 0 | 0.0 | 0.0 | 0.0 | 0.00072046 | 0.01315789 | 0.0009311 | 0.06779661 | 0.00064935 | 0.00525394 | 0.002055499 | ... | - | - | - | - | - | - | - | - | - | 2.838915709722678 |


In [21]:
# Some statistics of the generated counterfactuals
num_features = test_features.shape[1]  # test_features has a single row, so count its columns
cfs_index = list()
num_feature_changes = list()
min_feature_changes = list()
min_feature_names = list()
max_feature_changes = list()
max_feature_names = list()

for i in range(0, num_cfs):
    cfs_index.append(i+1)
    counterfactual_features = df_counterfactuals.iloc[i].drop('target')
    dif_features = np.abs([test_features - counterfactual_features])
    num_feature_changes.append(np.count_nonzero(dif_features))
    nonzero_difs = dif_features[dif_features>0]
    
    min_feature_changes.append(np.min(nonzero_difs))
    min_feature_pos = np.where(dif_features == min_feature_changes[i])[2][0]
    min_feature_names.append(df_feature_names.iloc[min_feature_pos]['feature'])
    
    max_feature_changes.append(np.max(nonzero_difs))
    max_feature_pos = np.where(dif_features == max_feature_changes[i])[2][0]
    max_feature_names.append(df_feature_names.iloc[max_feature_pos]['feature'])
    

cfs_stats = pd.DataFrame({'counterfactual': cfs_index, 'num_changes': num_feature_changes, 'min_feat_change': min_feature_changes, 
                                'min_feat_name': min_feature_names, 'max_feat_change': max_feature_changes, 'max_feat_name': max_feature_names})
display(cfs_stats)

| | counterfactual | num_changes | min_feat_change | min_feat_name | max_feat_change | max_feat_name |
|---|---|---|---|---|---|---|
| 0 | 1 | 35 | 1.110223e-16 | REC_RecentTransitionRate | 0.032763 | REC_RecentMaxExeTime |
| 1 | 2 | 35 | 1.110223e-16 | REC_RecentTransitionRate | 0.039088 | REC_Age |
| 2 | 3 | 37 | 1.110223e-16 | REC_RecentTransitionRate | 1.0 | TES_COM_CountDeclClassVariable |
| 3 | 4 | 37 | 1.110223e-16 | REC_RecentTransitionRate | 1.0 | TES_COM_CountDeclClassVariable |
| 4 | 5 | 37 | 1.110223e-16 | REC_RecentTransitionRate | 1.0 | TES_COM_CountDeclClassVariable |
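
The minimum change reported for every counterfactual (1.110223e-16) is on the order of double-precision rounding error, so `num_changes` probably counts some features whose values did not really change. One possible refinement, sketched below on hypothetical toy values, is to compare features with an absolute tolerance instead of exact equality. The helper name `changed_features` and the tolerance of 1e-9 are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def changed_features(original, counterfactual, atol=1e-9):
    """Return the absolute change of features that differ by more than atol, largest first."""
    diff = (counterfactual - original).abs()
    unchanged = np.isclose(counterfactual.to_numpy(), original.to_numpy(), rtol=0.0, atol=atol)
    return diff[~unchanged].sort_values(ascending=False)

# Toy query instance and counterfactual (hypothetical values)
original = pd.Series({"f1": 0.071429, "f2": 0.3, "f3": 0.0, "f4": 0.5})
counterfactual = pd.Series({
    "f1": 0.0,             # real change
    "f2": 0.1 + 0.2,       # differs from 0.3 only by floating-point rounding
    "f3": 0.0,             # unchanged
    "f4": 0.5 + 0.032763,  # real change
})

naive_count = int(np.count_nonzero((counterfactual - original).to_numpy()))
changes = changed_features(original, counterfactual)
print("naive count:", naive_count)
print("tolerant count:", len(changes))
print(changes)
```

In the notebook, the same idea could be applied to `test_features.iloc[0]` and each row of `df_counterfactuals` (after dropping `target`), assuming both share the same feature ids as index.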
