
[Explainability Evaluation] Add accuracy metrics for evaluation with groundtruth #6137

Merged
merged 40 commits into pyg-team:master from the add-eval-metric-with-groundtruth branch on Dec 16, 2022

Conversation

@shhs29 (Contributor) commented Dec 5, 2022

This PR addresses issue #5962.
Accuracy metrics such as accuracy, recall, precision, auc, and f1_score are available.

TODOs

  • Add docstring as needed.
  • Update CHANGELOG.md
  • Fix failing import of torchmetrics.

@shhs29 changed the title from "Add accuracy metrics for explainability" to "[Explainability] Add accuracy metrics for explainability" on Dec 5, 2022
@shhs29 changed the title from "[Explainability] Add accuracy metrics for explainability" to "[Explainability] Add accuracy metrics for evaluation with groundtruth" on Dec 5, 2022
codecov bot commented Dec 5, 2022

Codecov Report

Merging #6137 (4912084) into master (d2f2503) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head 4912084 differs from the pull request's most recent head f872a48. Consider uploading reports for commit f872a48 to get more accurate results.

@@            Coverage Diff             @@
##           master    #6137      +/-   ##
==========================================
+ Coverage   84.52%   84.54%   +0.02%     
==========================================
  Files         376      377       +1     
  Lines       20906    20940      +34     
==========================================
+ Hits        17670    17703      +33     
- Misses       3236     3237       +1     
Impacted Files Coverage Δ
torch_geometric/explain/explanation.py 98.87% <100.00%> (+0.05%) ⬆️
torch_geometric/explain/metrics.py 100.00% <100.00%> (ø)
torch_geometric/utils/subgraph.py 98.78% <0.00%> (-1.22%) ⬇️


@shhs29 changed the title from "[Explainability] Add accuracy metrics for evaluation with groundtruth" to "[Explainability Evaluation] Add accuracy metrics for evaluation with groundtruth" on Dec 5, 2022
@shhs29 force-pushed the add-eval-metric-with-groundtruth branch from 3f8104d to a3a08bf on December 6, 2022 07:47
@shhs29 marked this pull request as ready for review on December 6, 2022 07:47
@BlazStojanovic (Contributor) left a comment:

Thank you @shhs29 for this PR! Here are a few high level comments:

  • Please move the contents of explain/evaluate/accuracy_metrics.py into explain/metrics.py
  • Let's try to refrain from using networkx when computing these metrics. Since you can access the masks directly from the Explanation, using torch for most of your computations will be much more efficient.
  • Moreover, try to use torchmetrics to evaluate ROC
  • Don't forget about thresholding masks!

Other things to help you out:
You can add this masks property to the Explanation class:

    @property
    def masks(self) -> Dict[str, Tensor]:  # Dict from typing, Tensor from torch
        r"""Returns a dictionary of all masks available in the explanation."""
        mask_dict = {
            key: self[key]
            for key in self.keys
            if key.endswith('_mask') and self[key] is not None
        }
        return dict(sorted(mask_dict.items()))

This will allow you to access all the available masks in the explanation as

explanation.masks

then you can combine all the masks in the returned dictionary into a single tensor with torch.view(-1) and torch.cat. This will make it easy to work with all masks at once; afterwards you can calculate the number of true positives as

torch.sum(gt_mask_tensor == ex_mask_tensor)

and similarly for the other metrics. With these improvements you should be left with considerably less code, so there is no need to split the function into two parts; let's just have groundtruth_metrics(explanation: Explanation, groundtruth: Explanation).
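For readers of this thread, a minimal, self-contained sketch of the approach described above, with toy tensors standing in for explanation.masks and groundtruth.masks; the variable names and the >0 hard thresholding are assumptions based on the later discussion, not the merged code:

import torch


def concat_masks(masks: dict) -> torch.Tensor:
    """Flatten every mask and concatenate them into a single 1D tensor."""
    return torch.cat([mask.view(-1) for mask in masks.values()])


# Toy masks standing in for `explanation.masks` and `groundtruth.masks`:
ex_masks = {'edge_mask': torch.tensor([0.9, 0.0, 0.3]),
            'node_mask': torch.tensor([0.0, 0.7])}
gt_masks = {'edge_mask': torch.tensor([1.0, 0.0, 1.0]),
            'node_mask': torch.tensor([0.0, 1.0])}

# Hard-threshold at 0 (all values greater than 0 become 1):
ex_mask_tensor = (concat_masks(ex_masks) > 0).long()
gt_mask_tensor = (concat_masks(gt_masks) > 0).long()

# Note: torch.sum(gt_mask_tensor == ex_mask_tensor) counts every position where
# the two masks agree (TP + TN); the four confusion-matrix counts are split out
# explicitly here to make the individual metrics clear:
tp = int(((ex_mask_tensor == 1) & (gt_mask_tensor == 1)).sum())
fp = int(((ex_mask_tensor == 1) & (gt_mask_tensor == 0)).sum())
fn = int(((ex_mask_tensor == 0) & (gt_mask_tensor == 1)).sum())
tn = int(((ex_mask_tensor == 0) & (gt_mask_tensor == 0)).sum())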

@shhs29 (Contributor, Author) commented Dec 8, 2022

Hi @BlazStojanovic,
Thanks a lot for the detailed comments. @venomouscyanide and I will work on these and update the PR soon.

@BlazStojanovic (Contributor) left a comment:

Thank you @shhs29 and @venomouscyanide, this is looking much better. Left some more comments for you to address, otherwise we are close to closing this issue :)

Comment on lines 76 to 93
def test_masks(data, node_mask, edge_mask, node_feat_mask, edge_feat_mask):
    expected = []
    if node_mask:
        expected.append('node_mask')
    if edge_mask:
        expected.append('edge_mask')
    if node_feat_mask:
        expected.append('node_feat_mask')
    if edge_feat_mask:
        expected.append('edge_feat_mask')

    explanation = create_random_explanation(
        data,
        node_mask=node_mask,
        edge_mask=edge_mask,
        node_feat_mask=node_feat_mask,
        edge_feat_mask=edge_feat_mask,
    )
Contributor:

Let's also test that the right mask values are included in the masks property.

Suggested change, replacing:

def test_masks(data, node_mask, edge_mask, node_feat_mask, edge_feat_mask):
    expected = []
    if node_mask:
        expected.append('node_mask')
    if edge_mask:
        expected.append('edge_mask')
    if node_feat_mask:
        expected.append('node_feat_mask')
    if edge_feat_mask:
        expected.append('edge_feat_mask')
    explanation = create_random_explanation(
        data,
        node_mask=node_mask,
        edge_mask=edge_mask,
        node_feat_mask=node_feat_mask,
        edge_feat_mask=edge_feat_mask,
    )

with:

@pytest.mark.parametrize('node_mask', [True, False])
@pytest.mark.parametrize('edge_mask', [True, False])
@pytest.mark.parametrize('node_feat_mask', [True, False])
@pytest.mark.parametrize('edge_feat_mask', [True, False])
def test_masks(data, node_mask, edge_mask, node_feat_mask, edge_feat_mask):
    explanation = create_random_explanation(
        data,
        node_mask=node_mask,
        edge_mask=edge_mask,
        node_feat_mask=node_feat_mask,
        edge_feat_mask=edge_feat_mask,
    )
    expected_keys = []
    expected_values = []
    if node_mask:
        expected_keys.append('node_mask')
        expected_values.append(explanation.node_mask)
    if edge_mask:
        expected_keys.append('edge_mask')
        expected_values.append(explanation.edge_mask)
    if node_feat_mask:
        expected_keys.append('node_feat_mask')
        expected_values.append(explanation.node_feat_mask)
    if edge_feat_mask:
        expected_keys.append('edge_feat_mask')
        expected_values.append(explanation.edge_feat_mask)
    assert set(explanation.masks.keys()) == set(expected_keys)
    assert set(explanation.masks.values()) == set(expected_values)

from torch_geometric.explain import Explanation


def groundtruth_metrics(
Contributor:

Suggested change
def groundtruth_metrics(
def get_groundtruth_metrics(

Comment on lines 73 to 77
assert accuracy_metrics[0] == 1.0
assert accuracy_metrics[1] == 1.0
assert accuracy_metrics[2] == 1.0
assert accuracy_metrics[3] == 1.0
assert accuracy_metrics[4] == 0.5
Contributor:

Have you run the tests to see if these asserts are met? To me it is not immediately obvious that two random explanations will result in these values.

@shhs29 (Contributor, Author) commented Dec 13, 2022:

Yes, you are right. Random explanations are not a good way to set up tests. I have updated the test to use hardcoded mask values. A few questions I had in mind:

  1. What should the return type of these metrics be? Should they be tensors or floats?
  2. Should we round off the different accuracy values?
  3. Should we check for division by 0 (i.e., when TP and TN are 0)?

Contributor:

  1. Great, using hardcoded masks here is a better way to test. Just a note for when you write such tests in the future: always verify the outcomes of hardcoded tests independently (i.e., by hand or with another library).
  2. Returning Tensors is probably better; torch functions return them anyway, so you only need to change the type hint.
  3. I think we shouldn't round anything; let the user do this afterwards if they so choose.
  4. Yes, this is a very good point! We need to handle these edge cases; let's handle them in a way consistent with sklearn (see the sketch after this list):
  • when true positive + false positive == 0, precision returns 0
  • when true positive + false negative == 0, recall returns 0
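A small sketch of the sklearn-consistent edge-case handling described above; the helper name and the integer confusion counts it takes as input are illustrative assumptions, not code from the PR:

def safe_precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) with zero division handled as in sklearn."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1_score = (2 * precision * recall / (precision + recall)
                if (precision + recall) > 0 else 0.0)
    return precision, recall, f1_score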

@shhs29 (Contributor, Author) commented Dec 14, 2022:

@BlazStojanovic Thanks for the detailed comment. The status of each of these items is as follows:

  1. I have verified all the values, except for auroc, against sklearn metrics. I need some clarification on AUROC. Based on my analysis of the sklearn auc score, its equivalent in torchmetrics is
     auroc = AUROC(task="binary")
     auc = auroc(ex_mask_tensor, gt_mask_tensor)
     Moreover, looking at their example of AUROC calculation, we do not need ROC thresholding first. However, I would appreciate your thoughts on this (see the sketch after this list).
  2. I have updated the type hint.
  3. I am currently not rounding the values.
  4. I have added a couple of new conditions and a new test for this flow. In addition to the conditions you mentioned, I added one more for f1_score: it is set to 0 when precision == 0.0 or recall == 0.0.
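For reference, a runnable sketch of the torchmetrics AUROC usage mentioned in item 1, assuming torchmetrics (>= 0.11) is installed; the toy tensors are illustrative, with ex_mask_tensor holding unthresholded explanation scores and gt_mask_tensor the binary ground truth:

import torch
from torchmetrics import AUROC

# Unthresholded explanation scores vs. a binary ground-truth mask:
ex_mask_tensor = torch.tensor([0.9, 0.1, 0.8, 0.3])
gt_mask_tensor = torch.tensor([1, 0, 1, 0])

auroc = AUROC(task="binary")
auc = auroc(ex_mask_tensor, gt_mask_tensor)  # no manual ROC thresholding needed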

def groundtruth_metrics(
        explanation: Explanation,
        groundtruth: Explanation) -> Tuple[float, float, float, float, float]:
    """accuracy_scores: Compute accuracy scores when
Contributor:

Describe the thresholding behaviour in the docstring, i.e., >0 thresholding for TP and FP.

Contributor:

Also explain the order of returned metrics

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
f1_score = 2 * (precision * recall) / (precision + recall)
roc = ROC(task="binary")
Contributor:

Don't you need to do this before you threshold out the ex_mask_tensor? Because here roc receives a binary tensor, which it cannot threshold?

This comment refers to line 41.

Contributor Author:

Yes, that's right. I have updated the implementation to use the original ex_mask_tensor in ROC.

@BlazStojanovic (Contributor) left a comment:

Just left two more minor comments. This is looking very good, and once you incorporate these last two, we can approve and merge this! 👍🏻

Comment on lines 16 to 20
Currently we perform hard thresholding (where the threshold value
is set to 0) on explanation and groundtruth masks to get true
positives, true negatives, false positives and false negatives.
I.e., all values in explanation masks and ground truth masks which
are greater than 0 is set to 1.
Contributor:

I think it might actually be better to have an additional argument, threshold=0.0, which defaults to 0.0 but can be set by the user to control the thresholding of both masks. This docstring would then describe the default behavior.
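A sketch of the suggested thresholding behaviour; the helper name threshold_masks is hypothetical and simply isolates the hard-thresholding step that the threshold argument would control:

from typing import Tuple

from torch import Tensor


def threshold_masks(ex_mask_tensor: Tensor, gt_mask_tensor: Tensor,
                    threshold: float = 0.0) -> Tuple[Tensor, Tensor]:
    """Hard-threshold both masks: values above `threshold` become 1, the rest 0."""
    return ((ex_mask_tensor > threshold).long(),
            (gt_mask_tensor > threshold).long())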

Comment on lines 40 to 41
roc = ROC(task="binary")
fpr, tpr, thresholds = roc(ex_mask_tensor, gt_mask_tensor)
Contributor:

You're right, we don't need the roc calculation for auroc. I think we have two options here: we can return the full ROC curve as well, or we can remove these two lines and just return the AUROC. I leave the choice up to you.

Contributor Author:

The ROC curve gives three values (fpr, tpr, thresholds), whereas all other metrics return a single value. To maintain consistency, I have decided to calculate the AUROC and return that. Moreover, with the inclusion of the ROC curve, our return value would have nested values, in which case I believe returning a dict makes more sense.

@venomouscyanide (Contributor) commented:
@BlazStojanovic I'm struggling a bit with reStructuredText. Can you please share some best practices for .rst files and how to make sure the formatting is correct?

Comment on lines 15 to 16
r"""Returns different accuracy metrics on explanation when
groundtruth is available.
Contributor:

Suggested change
r"""Returns different accuracy metrics on explanation when
groundtruth is available.
r"""Compares an explanation with the ground truth explanation. Returns basic evaluation metrics - accuracy, recall, precision, f1_score, and auroc.

Comment on lines 18 to 23
.. note::
    Currently we perform hard thresholding (where the threshold value
    defaults to 0) on explanation and groundtruth masks to get true
    positives, true negatives, false positives and false negatives.
    I.e., all values in explanation masks and ground truth masks which
    are greater than the threshold value is set to 1.
Contributor:

You can remove this note

Comment on lines 29 to 30
threshold (float): threshold value to perform hard thresholding.
(default: :obj:`0.0`)
Contributor:

Suggested change
threshold (float): threshold value to perform hard thresholding.
(default: :obj:`0.0`)
threshold (float): threshold value to perform hard thresholding of the `explanation` and `groundtruth` masks.
(default: :obj:`0.0`)

@BlazStojanovic (Contributor) left a comment:

Thanks @shhs29 for addressing all the comments. This is looking good, so I think we can move towards merging this with the other explainability code @rusty1s!

@shhs29 (Contributor, Author) commented Dec 15, 2022

@BlazStojanovic @rusty1s For the fidelity metric now in main, the metrics are under a new folder. We have a different structure. Is that fine?

@rusty1s (Member) commented Dec 15, 2022

Yes, don't worry about it :)

@rusty1s (Member) commented Dec 16, 2022

Thank you!

Please note that I changed the code slightly. I think it is a bit dangerous to combine masks of different levels with each other, so I changed the interface to expect masks rather than Explanation objects. I also added a metrics argument to be able to select a subset of metrics to compute, and used torchmetrics consistently for computation. Hope the changes are okay with you.
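For readers of this thread, a rough sketch of what such a mask-based interface with a metrics argument could look like; the function name, signature, metric registry, and binarization step below are illustrative assumptions based on this comment, not the merged implementation:

from typing import List

import torch
from torch import Tensor
from torchmetrics.classification import (BinaryAccuracy, BinaryAUROC,
                                          BinaryF1Score, BinaryPrecision,
                                          BinaryRecall)

# Hypothetical registry of supported metrics (names chosen for illustration):
METRICS = {
    'accuracy': BinaryAccuracy,
    'recall': BinaryRecall,
    'precision': BinaryPrecision,
    'f1_score': BinaryF1Score,
    'auroc': BinaryAUROC,
}


def groundtruth_metrics_sketch(pred_mask: Tensor, target_mask: Tensor,
                               metrics: List[str]) -> List[Tensor]:
    """Compare a predicted mask with a ground-truth mask on the selected metrics."""
    pred = pred_mask.view(-1)                    # soft scores, e.g. in [0, 1]
    target = (target_mask.view(-1) > 0).long()   # binarized ground truth
    return [METRICS[name]()(pred, target) for name in metrics]


# Example usage:
pred = torch.tensor([0.9, 0.2, 0.8, 0.1])
target = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(groundtruth_metrics_sketch(pred, target, ['accuracy', 'f1_score', 'auroc']))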

@rusty1s enabled auto-merge (squash) on December 16, 2022 09:29
@rusty1s merged commit e43aa42 into pyg-team:master on Dec 16, 2022