Ranking for Tabular Counterfactual Explanation Generators

This repository presents a benchmark of counterfactual (CF) generation algorithms in terms of the following metrics (a sketch of how the pairwise metrics can be computed follows the list):

- Coverage: how many factuals are converted into counterfactuals?
- Sparsity: how many features are left unchanged?
- L2 distance: how far are the counterfactuals from the factual data?
- Mean Absolute Deviation (MAD): how different are the counterfactuals from the factual data, considering feature variations?
- Mahalanobis distance (MD): how different are the counterfactuals from the factual data, considering the data distribution?
- Time: how long does it take to generate a counterfactual?
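As a rough guide to how the pairwise metrics can be computed, here is a minimal NumPy sketch over numerical features. The function names are illustrative, not the repository's API, and the exact definitions used in the benchmark may differ; coverage and time are properties of a whole run rather than of a single factual/counterfactual pair.

```python
import numpy as np

def sparsity(factual, cf):
    """Fraction of features the counterfactual leaves unchanged."""
    return float(np.mean(np.isclose(factual, cf)))

def l2_distance(factual, cf):
    """Euclidean distance between factual and counterfactual."""
    return float(np.linalg.norm(factual - cf))

def mad_distance(factual, cf, X):
    """L1 distance with each feature scaled by its mean absolute
    deviation over the dataset X, so naturally variable features
    weigh less."""
    mad = np.mean(np.abs(X - np.mean(X, axis=0)), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard against constant features
    return float(np.sum(np.abs(factual - cf) / mad))

def mahalanobis_distance(factual, cf, X):
    """Distance between factual and counterfactual that accounts for
    the covariance structure of the dataset X."""
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    d = factual - cf
    return float(np.sqrt(d @ cov_inv @ d))

# Toy usage: X stands in for the training data.
X = np.random.default_rng(0).normal(size=(500, 4))
factual = X[0]
cf = factual + np.array([0.0, 0.8, 0.0, -0.3])  # two features changed
print(sparsity(factual, cf))  # -> 0.5
```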

How to include your CF generation algorithm

Follow the instructions in the CounterfactualBenchmark repository.

RESULTS

All experiments consider a confidence level of 95%.

Ranking Table

Why ranking instead of the raw metrics?

Most metrics cannot be compared directly because each algorithm has a different coverage. For example, if one algorithm creates only a single counterfactual with a sparsity of 90%, we cannot say it is better than another algorithm that creates 1,000 counterfactuals with a sparsity of 88%. The ranking accounts for these cases and gives a better picture of each algorithm's performance.

The rankings below were created with Friedman's test, which evaluates the null hypothesis that all algorithms perform equally, followed by Nemenyi's post-hoc test, which evaluates the significance of the pairwise differences between algorithms. In the tables, each cell is an algorithm's mean rank for that metric (lower is better), and 🥇 marks the best mean rank in each row; the highlighted results are statistically significant.
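For illustration only (this is not the repository's evaluation code), the procedure can be reproduced with SciPy and the scikit-posthocs package on a matrix of per-instance scores; the scores below are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # pip install scikit-posthocs

# Toy score matrix: one row per factual instance, one column per generator.
rng = np.random.default_rng(42)
scores = pd.DataFrame(
    rng.random((100, 3)),
    columns=["cfnow_greedy", "dice", "growingspheres"],
)

# Friedman test of the null hypothesis that all generators perform equally.
stat, p = friedmanchisquare(*(scores[c] for c in scores))
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Mean rank per generator (for distance-like metrics, lower score is
# better, so ascending ranks put the best generator at rank 1).
print(scores.rank(axis=1).mean())

# Nemenyi post-hoc test: a matrix of pairwise p-values between generators.
print(sp.posthoc_nemenyi_friedman(scores))
```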

Ranking for all datasets

| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 7.55 | 7.56 | 6.10 | 🥇4.45 | 🥇4.45 | 6.24 | 8.42 | 7.43 | 8.39 | 8.54 | 🥇4.45 | 🥇4.45 | 3925 |
| sparsity | 7.65 | 7.82 | 8.55 | 4.20 | 🥇3.78 | 5.90 | 9.17 | 6.10 | 7.99 | 7.76 | 5.51 | 🥇3.58 | 3925 |
| L2 | 6.68 | 6.89 | 6.81 | 🥇3.34 | 3.81 | 8.22 | 7.07 | 7.75 | 8.70 | 9.24 | 4.92 | 4.56 | 3925 |
| MAD | 7.39 | 7.63 | 7.63 | 3.42 | 🥇3.05 | 7.52 | 8.58 | 7.86 | 8.65 | 8.49 | 4.30 | 3.47 | 3925 |
| MD | 6.91 | 7.00 | 6.79 | 🥇3.56 | 🥇3.54 | 8.16 | 7.37 | 7.76 | 8.70 | 9.40 | 4.62 | 4.18 | 3925 |

Ranking for categorical datasets

| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 8.39 | 8.79 | 5.93 | 🥇4.19 | 🥇4.19 | 🥇4.19 | 10.19 | 7.26 | 6.27 | 10.19 | 🥇4.19 | 🥇4.19 | 1327 |
| sparsity | 8.20 | 8.74 | 8.19 | 🥇3.31 | 🥇3.37 | 5.97 | 10.19 | 6.84 | 5.91 | 10.19 | 🥇3.69 | 🥇3.38 | 1327 |
| L2 | 8.20 | 8.74 | 8.19 | 🥇3.31 | 🥇3.37 | 5.97 | 10.19 | 6.84 | 5.91 | 10.19 | 🥇3.69 | 🥇3.38 | 1327 |
| MAD | 8.49 | 9.29 | 7.85 | 🥇3.10 | 🥇3.16 | 5.56 | 9.96 | 7.81 | 6.18 | 9.96 | 🥇3.46 | 🥇3.16 | 1327 |
| MD | 8.10 | 8.71 | 8.20 | 🥇3.43 | 🥇3.34 | 5.88 | 10.19 | 6.92 | 5.93 | 10.19 | 🥇3.77 | 🥇3.34 | 1327 |

Ranking for numerical datasets

| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 6.64 | 6.64 | 6.93 | 🥇5.09 | 🥇5.09 | 6.78 | 6.12 | 8.14 | 9.42 | 6.97 | 🥇5.09 | 🥇5.09 | 1598 |
| sparsity | 7.10 | 7.14 | 9.61 | 5.20 | 4.49 | 4.57 | 7.96 | 6.07 | 8.77 | 5.25 | 7.87 | 🥇3.97 | 1598 |
| L2 | 4.83 | 4.84 | 5.54 | 4.28 | 4.72 | 9.75 | 🥇2.80 | 9.19 | 10.49 | 8.64 | 6.59 | 6.34 | 1598 |
| MAD | 6.02 | 6.06 | 8.41 | 🥇3.38 | 🥇3.44 | 8.17 | 6.83 | 8.02 | 10.15 | 7.03 | 6.02 | 4.47 | 1598 |
| MD | 5.32 | 5.34 | 5.70 | 4.02 | 4.21 | 9.68 | 🥇3.54 | 8.89 | 10.45 | 8.86 | 6.33 | 5.66 | 1598 |

Ranking for mixed datasets

| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 7.89 | 7.38 | 5.00 | 🥇3.75 | 🥇3.75 | 8.07 | 9.75 | 6.51 | 9.56 | 8.84 | 🥇3.75 | 🥇3.75 | 1000 |
| sparsity | 7.81 | 7.67 | 7.31 | 3.76 | 🥇3.19 | 7.93 | 9.75 | 5.16 | 9.50 | 8.55 | 4.15 | 🥇3.22 | 1000 |
| L2 | 7.64 | 7.70 | 7.01 | 🥇1.89 | 2.95 | 8.75 | 9.75 | 6.67 | 9.56 | 8.95 | 3.88 | 3.26 | 1000 |
| MAD | 8.11 | 7.96 | 6.11 | 3.92 | 🥇2.27 | 9.06 | 9.54 | 7.67 | 9.53 | 8.88 | 🥇2.67 | 🥇2.27 | 1000 |
| MD | 7.86 | 7.39 | 6.68 | 🥇3.01 | 🥇2.75 | 8.74 | 9.75 | 7.07 | 9.58 | 9.23 | 🥇3.02 | 🥇2.93 | 1000 |

Coverage analysis

The results below consider only valid counterfactuals, i.e., counterfactuals that (1) have a prediction class different from the factual's and (2) respect binary and one-hot encoding rules. A sketch of such a validity check follows.
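For clarity, here is a minimal sketch of the two validity conditions, assuming a scikit-learn-style `model.predict` and a hypothetical `onehot_groups` description of the encoding; the benchmark's actual implementation lives in the CounterfactualBenchmark repository and may differ.

```python
import numpy as np

def is_valid_cf(model, factual, cf, onehot_groups):
    """Illustrative check of the two validity conditions above.

    `onehot_groups` is assumed to be a list of column-index lists,
    one per one-hot-encoded categorical feature (not the repo's API).
    """
    # (1) The counterfactual must flip the model's predicted class.
    flipped = (model.predict(cf.reshape(1, -1))[0]
               != model.predict(factual.reshape(1, -1))[0])

    # (2) Encoded columns must stay binary, and every one-hot group
    #     must activate exactly one category.
    encoded_cols = [i for group in onehot_groups for i in group]
    binary_ok = np.isin(cf[encoded_cols], (0, 1)).all()
    onehot_ok = all(np.sum(cf[group]) == 1 for group in onehot_groups)

    return bool(flipped and binary_ok and onehot_ok)
```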

Coverage (%) for all datasets

Coverage (%) for categorical datasets

Coverage (%) for numerical continuous datasets

Coverage (%) for mixed datasets

Time analysis

Time spent (in seconds) to generate a counterfactual explanation

Generation time (seconds) for all datasets

Generation time (seconds) for categorical datasets

Generation time (seconds) for numerical continuous datasets

Generation time (seconds) for mixed datasets

Reference

If you use this package in your experiments, please cite the reference paper:

```bibtex
@Article{app11167274,
  AUTHOR = {de Oliveira, Raphael Mazzine Barbosa and Martens, David},
  TITLE = {A Framework and Benchmarking Study for Counterfactual Generating Methods on Tabular Data},
  JOURNAL = {Applied Sciences},
  VOLUME = {11},
  YEAR = {2021},
  NUMBER = {16},
  ARTICLE-NUMBER = {7274},
  URL = {https://www.mdpi.com/2076-3417/11/16/7274},
  ISSN = {2076-3417},
  DOI = {10.3390/app11167274}
}
```
