# Towards Explainable Test Case Prioritisation with Learning-to-Rank Models

*Submitted to the 3rd International Workshop on Artificial Intelligence in Software Testing (AIST@ICST)*

**Contributors**:
- Aurora Ramírez (Univ. Córdoba)
- Mario Berrios (Univ. Córdoba)
- Robert Feldt (Chalmers Univ.)
- Jose Raúl Romero (Univ. Córdoba)

## Scenario 1C: Comparing the similarities of local explanations of a learning-to-rank model for TCP
- Dataset source: https://zenodo.org/record/6415365
- Original scripts: https://github.com/Ahmadreza-SY/TCP-CI
- Paper describing the dataset (Yaraghi et al. 2022): https://doi.org/10.1109/TSE.2022.3184842

In [1]:
import pandas as pd
import numpy as np
import lightgbm
from sklearn.metrics.pairwise import cosine_similarity
from pathlib import Path 
import dalex as dx
import Dataset

### Data preparation

1. Open the repository dataset (it already contains only failed builds)
2. Convert the dataset to create the target variable (test cases sorted by verdict and duration)
3. Split the dataset into train/test partitions for each build (so that we can train on the preceding n-1 builds and test on build n)
4. Configure the data partitions for LightGBM
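
As a rough illustration of step 2, the sketch below derives a relevance target by placing failed tests first and, within each verdict, shorter tests first. The actual conversion is done by `Dataset.convert_to_lightGBM_dataset`, which is not shown in this notebook, so the sort key, the reading of verdict 1 as "failed", and the helper name `relevance_targets` are assumptions made for illustration only.

```python
# Toy rows: (test_id, verdict, duration); verdict 1 is assumed to mean "failed"
tests = [(2732, 0, 6), (5140, 1, 60), (2161, 0, 24), (5141, 1, 28), (2953, 1, 70)]

def relevance_targets(rows):
    """Hypothetical helper: map each test id to a relevance value (higher = run earlier)."""
    # Failed tests first, then shorter duration first
    ordered = sorted(rows, key=lambda r: (-r[1], r[2]))
    n = len(ordered)
    # The best-placed test gets relevance n, the last one gets 1
    return {test_id: n - pos for pos, (test_id, _, _) in enumerate(ordered)}

targets = relevance_targets(tests)
print(targets)
```

With these toy rows, the three failing tests receive the highest relevance values, ordered by duration, which mirrors the pattern visible in the `true_relevance` column later in the notebook.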

In [2]:
# Settings
root_path = '../datasets'
repo_name = 'Angel-ML@angel'
path = root_path + '/' + repo_name
output_path = Path('../output/' + repo_name)  
output_path.parent.mkdir(parents=True, exist_ok=True)

In [3]:
# Open dataset
dataset = Dataset.Dataset(path)
df_repo = pd.read_csv(path + '/dataset.csv')
feature_names = df_repo.columns
feature_names = feature_names[~np.isin(feature_names, ['Build', 'Test', 'Verdict', 'Duration'])]

In [4]:
# Convert data for lightgbm
dataset.save_feature_id_map()
df_converted = dataset.convert_to_lightGBM_dataset(df_repo)
feature_ids = df_converted.columns

In [5]:
# This function creates the train/test splits by build (the folder name is the build id)
# You can skip this cell if you have run the aist_scenario1a1b notebook with the same system (repo_name)
dataset.create_ranklib_training_sets(df_converted, output_path)

Creating training sets: 100%|████████████████████████████████████████████████████████| 124/124 [00:01<00:00, 88.97it/s]


In [6]:
# Open one train/test partition; build_id identifies the build whose ranking will be predicted
build_id = 572642875
partitions = output_path.joinpath(Path(str(build_id)))
df_train = pd.read_table(partitions.joinpath(Path('train.txt')), sep=' ', header=None)
df_test = pd.read_table(partitions.joinpath(Path('test.txt')), sep=' ', header=None)

# Assign the column names to allow indexing by name
df_train.columns = feature_ids
df_test.columns = feature_ids

In [7]:
# In lightgbm, we need an array with the number of samples in each group of samples that represent one ranking
# In our case, each build in the training partition has its own ranking, so the array should contain the number of test cases in each build
# For the test partition, we only have one ranking with all test cases in the last build
groups_builds = df_train.groupby('i_build')['i_build'].count().to_numpy()
groups_test = [len(df_test)]

# Use 20% of training groups as validation
num_builds = len(groups_builds)
val_index = int(0.8 * num_builds)
val_build = pd.unique(df_train['i_build'])[val_index]
groups_train = groups_builds[:val_index]
groups_validation = groups_builds[val_index:]
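
LightGBM's ranking interface takes `group` as a list of group sizes, given in row order, whose sum must equal the number of rows. The dependency-free sketch below shows the idea on a hypothetical build column. Note that `groupby` in pandas returns sizes sorted by key, so the cell above matches the row order only if the training rows are already sorted by build, which is assumed here.

```python
from itertools import groupby

# Hypothetical build ids, one entry per test case row, contiguous per build
build_column = [101, 101, 101, 102, 102, 103, 103, 103, 103, 104]

def group_sizes(build_ids):
    # itertools.groupby merges consecutive equal ids, so rows must be contiguous per build
    return [len(list(rows)) for _, rows in groupby(build_ids)]

sizes = group_sizes(build_column)
val_index = int(0.8 * len(sizes))  # the first 80% of builds are kept for training
train_sizes, val_sizes = sizes[:val_index], sizes[val_index:]

# The sizes must add up to the number of rows in the partition
assert sum(sizes) == len(build_column)
print(train_sizes, val_sizes)
```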

In [8]:
# Split the first dataset into train and validation
df_val = df_train[df_train['i_build'] >= val_build]
df_train = df_train[df_train['i_build'] < val_build]

# Save target (ranking) as the variable to be predicted in all partitions
y_train = df_train['target'].values
y_val = df_val['target'].values
y_test = df_test['target'].values

# Save test case duration, ids and true verdict in the test partition to evaluate model performance
y_test_durations = df_test['i_duration']
y_test_ids = df_test['i_test']
y_test_verdicts = df_test['i_verdict']

# Remove identifier columns and duration (not used for learning) in all partitions
columns_to_remove = ['target', 'hashtag', 'qid', 'i_target', 'i_verdict', 'i_duration', 'i_test', 'i_build']
X_train = df_train.drop(columns_to_remove, axis=1)
X_val = df_val.drop(columns_to_remove, axis=1)
X_test = df_test.drop(columns_to_remove, axis=1)

# Remove from feature ids too
feature_ids = feature_ids[~np.isin(feature_ids, columns_to_remove)]

### Learning-to-rank model training

We use the LGBMRanker algorithm from the LightGBM Python library.
See: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRanker.html

1. Algorithm configuration:
- Objective: type of task, for ranking = 'lambdarank'
- Boosting type: gbdt (gradient boosting decision tree)
- Importance type: 'split' or 'gain' as the criterion to weight feature importance in the trained model
- Random state: to reproduce the results
- Label gain: to specify the relative importance of the ordering of the different items (here, samples are sorted by relevance in ascending order)
2. Training the model with one validation partition.
3. Obtain feature importance based on the configured criterion.
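
To make the role of the label gain and the NDCG metric more concrete, here is a minimal sketch of NDCG@k with a linear gain (the gain of label i is i, as in the `label_gain` list used in this notebook). LightGBM's documented default gain is 2^i - 1, so the linear choice changes how strongly high-relevance items are rewarded. The functions `dcg_at_k` and `ndcg_at_k` are illustrative and are not LightGBM's implementation.

```python
import math

def dcg_at_k(ranked_labels, k, gain):
    # Discounted cumulative gain over the top-k positions (position 0 is the top)
    return sum(gain(label) / math.log2(pos + 2)
               for pos, label in enumerate(ranked_labels[:k]))

def ndcg_at_k(labels, scores, k, gain=lambda label: label):
    # Rank labels by descending model score, then compare against the ideal order
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k, gain)
    return dcg_at_k(ranked, k, gain) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [5, 4, 3, 2, 1]
perfect = ndcg_at_k(labels, [5.0, 4.0, 3.0, 2.0, 1.0], k=3)
swapped = ndcg_at_k(labels, [6.3, 5.5, 5.6, 2.6, 2.2], k=3)  # 2nd and 3rd items swapped
print(perfect, swapped)
```

A perfect ordering yields an NDCG of 1, while swapping two of the top items lowers the score slightly.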

In [9]:
# Configuration
ranker = lightgbm.LGBMRanker(
                    objective="lambdarank",
                    boosting_type = "gbdt",
                    importance_type = "split",
                    metric = "ndcg",
                    random_state = 0,
                    max_depth = -1,
                    n_estimators = 100,
                    label_gain = list(range(max(y_train.max(), y_val.max()) + 1))
                    )


In [10]:
# Train the model
ranker.fit(
        X=X_train, 
        y=y_train,
        group=groups_train,
        eval_set=[(X_val, y_val)],
        eval_group=[groups_validation],
        eval_at=[10],
        callbacks=[lightgbm.log_evaluation(period=10)]
        )

[10]	valid_0's ndcg@10: 0.988447
[20]	valid_0's ndcg@10: 0.988462
[30]	valid_0's ndcg@10: 0.988695
[40]	valid_0's ndcg@10: 0.993337
[50]	valid_0's ndcg@10: 0.993446
[60]	valid_0's ndcg@10: 0.993597
[70]	valid_0's ndcg@10: 0.99211
[80]	valid_0's ndcg@10: 0.989952
[90]	valid_0's ndcg@10: 0.98993
[100]	valid_0's ndcg@10: 0.989575


LGBMRanker(label_gain=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, ...],
           metric='ndcg', objective='lambdarank', random_state=0)

### Generate predictions for the last build
1. The predictions are obtained for all test cases in the last build (test partition)
2. The test cases are sorted by prediction (descending) and, in case of ties, by duration (ascending)
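
The ordering and the relevance difference computed in the cells below can be sketched without pandas as follows. The predictions and durations are taken from the first five displayed rows, but the relevance values are rescaled to 5..1 so that the toy example is self-contained. The helper name `rank_and_compare` is hypothetical.

```python
# Toy rows: (test_id, true_relevance, duration, prediction)
rows = [
    (5141, 5, 28, 6.345247),
    (5140, 4, 60, 5.548885),
    (2953, 3, 70, 5.593372),
    (2732, 2, 6, 2.653386),
    (2161, 1, 24, 2.264818),
]

def rank_and_compare(rows):
    # Higher prediction first; ties are broken by shorter duration
    ordered = sorted(rows, key=lambda r: (-r[3], r[2]))
    n = len(ordered)
    result = []
    for pos, (test_id, true_rel, _, _) in enumerate(ordered):
        pred_rel = n - pos  # the top-ranked test gets the highest relevance
        result.append((test_id, pred_rel, abs(true_rel - pred_rel)))
    return result

ranking = rank_and_compare(rows)
for test_id, pred_rel, dif in ranking:
    print(test_id, pred_rel, dif)
```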

In [11]:
# Predict using the test partition
test_predictions = ranker.predict(X_test)

In [12]:
# Create a data frame with the information of the test cases in the test partition
# Note: relevance is inverse to ranking, so higher relevance is actually better for the learning-to-rank approach
df_predictions = pd.DataFrame({'test_id': y_test_ids, 'true_relevance': y_test, 'true_verdict': y_test_verdicts, 
                                'duration': y_test_durations, 'prediction': test_predictions})
display(df_predictions)

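| | test_id | true_relevance | true_verdict | duration | prediction |
|---|---|---|---|---|---|
| 0 | 5141 | 35 | 1 | 28 | 6.345247 |
| 1 | 5140 | 34 | 1 | 60 | 5.548885 |
| 2 | 2953 | 33 | 1 | 70 | 5.593372 |
| 3 | 2732 | 32 | 0 | 6 | 2.653386 |
| 4 | 2161 | 31 | 0 | 24 | 2.264818 |
| 5 | 2735 | 30 | 0 | 232 | 1.803198 |
| 6 | 2749 | 29 | 0 | 1352 | 1.195901 |
| 7 | 2720 | 28 | 0 | 1630 | 0.746339 |
| 8 | 1204 | 27 | 0 | 19598 | 0.813204 |
| 9 | 4611 | 26 | 0 | 20171 | 0.152361 |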


In [13]:
# Add two columns:
# - The predicted relevance after ordering based on the prediction and test case duration
# - The difference between the true and the predicted relevance
df_predictions.sort_values(
                ['prediction', 'duration'],
                ascending=[False, True],
                inplace=True,
                ignore_index=True,
            )
pred_relevance = list(range(df_predictions.shape[0], 0, -1))
df_predictions.insert(df_predictions.shape[1], 'pred_relevance', pred_relevance)
relevance_dif = np.abs(df_predictions['true_relevance'] - df_predictions['pred_relevance'])
df_predictions.insert(df_predictions.shape[1], 'relevance_dif', relevance_dif)
display(df_predictions)

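| | test_id | true_relevance | true_verdict | duration | prediction | pred_relevance | relevance_dif |
|---|---|---|---|---|---|---|---|
| 0 | 5141 | 35 | 1 | 28 | 6.345247 | 35 | 0 |
| 1 | 2953 | 33 | 1 | 70 | 5.593372 | 34 | 1 |
| 2 | 5140 | 34 | 1 | 60 | 5.548885 | 33 | 1 |
| 3 | 2732 | 32 | 0 | 6 | 2.653386 | 32 | 0 |
| 4 | 2161 | 31 | 0 | 24 | 2.264818 | 31 | 0 |
| 5 | 2735 | 30 | 0 | 232 | 1.803198 | 30 | 0 |
| 6 | 2749 | 29 | 0 | 1352 | 1.195901 | 29 | 0 |
| 7 | 1204 | 27 | 0 | 19598 | 0.813204 | 28 | 1 |
| 8 | 2720 | 28 | 0 | 1630 | 0.746339 | 27 | 1 |
| 9 | 510 | 18 | 0 | 22270 | 0.590176 | 26 | 8 |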


### Explain predictions of the learning-to-rank model
1. Configure an Explainer in Dalex: https://github.com/ModelOriented/DALEX
2. Explain each test case prediction in the build using Break Down

In [14]:
# Configure DALEX explainer
explainer = dx.Explainer(model=ranker, data=X_train, y=y_train, label="LGBM", model_type=None)

Preparation of a new explainer is initiated

  -> data              : 814 rows 150 cols
  -> target variable   : 814 values
  -> model_class       : lightgbm.sklearn.LGBMRanker (default)
  -> label             : LGBM
  -> predict function  : <function yhat_default at 0x000002A393E02B00> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = -9.3, mean = -1.93, max = 7.79
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = 5.33, mean = 19.1, max = 32.9
  -> model_info        : package lightgbm

A new explainer has been created!


In [15]:
# Explanation for each test case in the target build
test_ids_prediction = []
explanations = []
df_explanations = []
prediction_values = []
intercept_values = []

for x in range(0, len(df_predictions)):
    test_ids_prediction.append(df_predictions['test_id'][x])
    test_features = df_test[df_test['i_test']==test_ids_prediction[-1]].drop(columns=columns_to_remove).values[0]
    explanations.append(explainer.predict_parts(test_features, type='break_down'))
    df_x_expl = explanations[-1].result

    # Drop the non-feature rows (prediction and intercept) from the explanation
    index_to_drop = []
    prediction_values.append(df_x_expl.loc[df_x_expl['variable'] == 'prediction', 'contribution'].iloc[0])
    intercept_values.append(df_x_expl.loc[df_x_expl['variable'] == 'intercept', 'contribution'].iloc[0])
    index_to_drop.append(df_x_expl[df_x_expl['variable'] == 'prediction'].index.tolist()[0])
    index_to_drop.append(df_x_expl[df_x_expl['variable'] == 'intercept'].index.tolist()[0])
    df_x_expl = df_x_expl.drop(index_to_drop, axis=0)

    # Add a column with the name of each feature
    feature_id_list = list(feature_ids.values)
    column_to_add = [feature_names[feature_id_list.index(variable)]
                     for variable in df_x_expl['variable_name']]

    df_x_expl.insert(0, "feature", column_to_add)
    df_explanations.append(df_x_expl)

### Compare the similarity of the explanations
1. Normalise each feature contribution of the local explanations
2. Compute the per-feature difference in normalised contribution for a given pair of test cases
3. Compute the cosine similarity of the normalised values for all pairs of test cases in the build
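
The following dependency-free sketch shows the normalisation and the two comparison measures on hypothetical contribution values. Dividing each contribution by the sum of absolute contributions keeps its sign, so it is equivalent to the `abs(value) / total * sign` form used in the function below. Since cosine similarity is invariant to positive rescaling of either vector, the normalisation affects the per-feature differences but not the cosine similarity itself.

```python
import math

def l1_normalise(contribs):
    # Divide by the sum of absolute values; the sign of each contribution is kept
    total = sum(abs(c) for c in contribs)
    if total == 0:
        return [0.0 for _ in contribs]  # guard added in this sketch only
    return [c / total for c in contribs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical contributions of three features for two test cases
expl_a = [2.4, 0.6, -0.3]
expl_b = [1.2, 0.5, -0.1]

norm_a, norm_b = l1_normalise(expl_a), l1_normalise(expl_b)
abs_diff = [abs(a - b) for a, b in zip(norm_a, norm_b)]
sim_raw = cosine(expl_a, expl_b)
sim_norm = cosine(norm_a, norm_b)
print(abs_diff, sim_raw, sim_norm)
```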

In [16]:
# Function that normalises the feature contributions and computes their differences
# Each contribution is divided by the sum of the absolute values of all contributions
def compare_df_normalise(df1, prediction1, intercept1, df2, prediction2, intercept2):

    real_name = []
    feature_name = []
    abs_diff = []
    same_sign = []

    values_df1 = []
    values_df2 = []
    norm_values_df1 = []
    norm_values_df2 = []

    abs_sum_contrib_df1 = df1['contribution'].abs().sum()
    abs_sum_contrib_df2 = df2['contribution'].abs().sum()

    for index, row in df1.iterrows():
        real_name.append(row['feature'])
        feature_name.append(row['variable_name'])

        values_df1.append(row['contribution'])
        values_df2.append(df2.loc[df2['variable_name'] == feature_name[-1], 'contribution'].iloc[0])

        sign_df1 = np.sign(values_df1[-1])
        sign_df2 = np.sign(values_df2[-1])
        same_sign.append(sign_df1 == sign_df2)

        norm_values_df1.append((abs(values_df1[-1])/abs_sum_contrib_df1) * sign_df1)
        norm_values_df2.append((abs(values_df2[-1])/abs_sum_contrib_df2) * sign_df2)
        abs_diff.append(abs(norm_values_df1[-1] - norm_values_df2[-1]))

    return pd.DataFrame({'feature':real_name, 'variable_name':feature_name, 
                         'contrib_df1':values_df1, 'contrib_df2':values_df2,
                         'norm_contrib_df1':norm_values_df1, 'norm_contrib_df2':norm_values_df2,
                         'same_sign':same_sign, 'abs_diff':abs_diff})

In [17]:
# Generate a data frame to collect all explanation similarities using the cosine similarity
df_comparisons = np.zeros([len(df_explanations), len(df_explanations)], dtype=object)
ids_test1 = []
ids_test2 = []
pos_test1 = []
pos_test2 = []
diff_pos = []
cos_sim_list = []

for index_df1 in range(0, len(df_explanations)):
    for index_df2 in range(index_df1+1, len(df_explanations)):
        df_comparison = compare_df_normalise(df_explanations[index_df1], prediction_values[index_df1], intercept_values[index_df1],
                                             df_explanations[index_df2], prediction_values[index_df2], intercept_values[index_df2])
        df_comparisons[index_df1, index_df2] = df_comparison
        df_comparisons[index_df2, index_df1] = df_comparison

        norm_df1 = df_comparison['norm_contrib_df1'].to_numpy().reshape(1,-1)
        norm_df2 = df_comparison['norm_contrib_df2'].to_numpy().reshape(1,-1)
        cos_sim = cosine_similarity(norm_df1, norm_df2)

        cos_sim_list.append(cos_sim[0][0])
        pos_test1.append("t" + str(index_df1 + 1))
        pos_test2.append("t" + str(index_df2 + 1))        
        diff_pos.append(abs(index_df1 - index_df2))
        ids_test1.append(test_ids_prediction[index_df1])
        ids_test2.append(test_ids_prediction[index_df2])

comparisons_summary = pd.DataFrame({'id_test1': ids_test1, 'id_test2': ids_test2,
                                    'pos_test1': pos_test1, 'pos_test2': pos_test2,
                                    'diff_pos': diff_pos, 'cos_sim': cos_sim_list})
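
The nested loop above visits every unordered pair of test cases exactly once, in the same order as `itertools.combinations`. As a small sanity check, the sketch below enumerates the pairs for the 35 test cases of this build and maps a row index of the summary back to the compared positions. The helper `pair_label` is hypothetical.

```python
from itertools import combinations

n_tests = 35  # number of test cases in the build analysed in this notebook
pairs = list(combinations(range(n_tests), 2))

def pair_label(pair):
    # Positions are reported 1-based, as in the 'pos_test1' and 'pos_test2' columns
    i, j = pair
    return f"t{i + 1}", f"t{j + 1}"

print(len(pairs))            # n * (n - 1) / 2 pairs
print(pair_label(pairs[1]))  # the pair stored at row index 1 of the summary
```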

In [18]:
# Show the explanation similarity of a pair of test cases

# The values of 'test_1' and 'test_2' variables are the positions of the test cases to be compared as sorted in the prediction data frame. 
# In order to compare the best test with 3rd best test, the values should be 1 and 3 (or 3 and 1).
test_1 = 1
test_2 = 3

print('i_test = ' + str(test_ids_prediction[test_1 -1]) + " pred = " + str(prediction_values[test_1-1]))
print('i_test = ' + str(test_ids_prediction[test_2 -1]) + " pred = " + str(prediction_values[test_2-1]))
df_comparisons[test_1-1][test_2-1]

i_test = 5141 pred = 6.345247261144724
i_test = 5140 pred = 5.5488846119499


| | feature | variable_name | contrib_df1 | contrib_df2 | norm_contrib_df1 | norm_contrib_df2 | same_sign | abs_diff |
|---|---|---|---|---|---|---|---|---|
| 0 | REC_LastExeTime | f61 | 2.367454 | 2.367454 | 0.269951 | 0.277287 | True | 0.007336 |
| 1 | REC_RecentAvgExeTime | f48 | 0.604632 | 0.551588 | 0.068944 | 0.064604 | True | 0.004339 |
| 2 | TES_COM_CountLine | f2 | 0.319162 | 0.319162 | 0.036393 | 0.037382 | True | 0.000989 |
| 3 | REC_LastVerdict | f60 | 0.489167 | 0.489167 | 0.055778 | 0.057294 | True | 0.001516 |
| 4 | TES_COM_CountLineBlank | f3 | 0.574708 | 0.574708 | 0.065532 | 0.067312 | True | 0.001781 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | COV_ImpScoreSum | f65 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | True | 0.000000 |
| 146 | TES_COM_CountLineCode | f4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | True | 0.000000 |
| 147 | TES_PRO_OwnersContribution | f34 | 0.065681 | 0.037907 | 0.007489 | 0.004440 | True | 0.003050 |
| 148 | REC_TotalExcRate | f58 | -0.025987 | -0.025987 | -0.002963 | -0.003044 | True | 0.000081 |


In [19]:
# Display the comparison summary for all pairs of test cases in the build, sorted by cosine similarity
display(comparisons_summary.sort_values('cos_sim', ascending=False))

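| | id_test1 | id_test2 | pos_test1 | pos_test2 | diff_pos | cos_sim |
|---|---|---|---|---|---|---|
| 1 | 5141 | 5140 | t1 | t3 | 2 | 0.994871 |
| 517 | 2960 | 2955 | t23 | t24 | 1 | 0.992033 |
| 507 | 509 | 511 | t22 | t26 | 4 | 0.989440 |
| 100 | 2732 | 2735 | t4 | t6 | 2 | 0.980399 |
| 580 | 2961 | 2962 | t30 | t31 | 1 | 0.979349 |
| ... | ... | ... | ... | ... | ... | ... |
| 207 | 2749 | 511 | t7 | t26 | 19 | -0.921281 |
| 179 | 2735 | 511 | t6 | t26 | 20 | -0.925832 |
| 178 | 2735 | 2958 | t6 | t25 | 19 | -0.926564 |
| 175 | 2735 | 509 | t6 | t22 | 16 | -0.939104 |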
