
Inconsistency with BOND paper #83

Closed
CaptainJuice opened this issue Aug 4, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@CaptainJuice

I ran main.py multiple times with DOMINANT (from https://github.com/pygod-team/pygod/tree/main/benchmark).
Although the hyperparameter settings are consistent with the BOND paper (https://arxiv.org/pdf/2206.10071.pdf), the results on inj_cora (AUC: 0.7566±0.0332 (0.7751)) and inj_amazon (AUC: 0.7147±0.0006 (0.7152)) differ significantly from those in Table 3 of the paper, which are 82.7±5.6 (84.3) on inj_cora and 81.3±1.0 (82.2) on inj_amazon.
Is there any advice that you can provide about how to reproduce the results of the BOND paper?

@kayzliu
Member

kayzliu commented Aug 6, 2023

Thank you for your report. The short answer: you can reproduce the results by downgrading to the version used by the original benchmark, v0.3.1, via pip install pygod==0.3.1.

After our investigation, we found that the difference in performance is mainly caused by changes to the weight parameter (alpha in v0.3.1) of DOMINANT. In the original benchmark, we applied a heuristic method to select the weight for DOMINANT (and some other methods) on injected datasets. The experimental results, as you may have observed, are indeed better than with random selection. But we later found that this heuristic does not generalize to other datasets, so to avoid misleading users we removed it from PyGOD in a later release.

Also, the current benchmark script is out of date and ignores the selection of weight. Could you please help us update the model initialization in utils?

  • Remove heuristic selection of weight.
  • Update the parameter name for detectors (DOMINANT, GAAN, and CONAD) from alpha to weight.
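A minimal sketch of the second item, assuming a hypothetical helper `migrate_kwargs` (not part of PyGOD itself) that maps the deprecated `alpha` keyword to `weight` before a detector such as DOMINANT, GAAN, or CONAD is constructed:

```python
def migrate_kwargs(kwargs):
    """Rename the deprecated 'alpha' kwarg to 'weight' (hypothetical helper).

    Newer PyGOD releases use 'weight' where v0.3.1 used 'alpha'; this
    keeps old benchmark configs usable with the renamed parameter.
    """
    kwargs = dict(kwargs)  # copy, so the caller's dict is untouched
    if 'alpha' in kwargs:
        kwargs['weight'] = kwargs.pop('alpha')
    return kwargs

print(migrate_kwargs({'alpha': 0.5, 'lr': 0.01}))
```

The actual change in benchmark/utils.py is simpler still: just pass `weight=` instead of `alpha=` at each detector's construction site.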

If you have any further questions, please feel free to let us know.

@kayzliu kayzliu added the bug Something isn't working label Aug 6, 2023
@ParthaPratimBanik
Contributor

Hi @kayzliu,
I am trying to update benchmark/utils.py to regenerate the Table 3 results.

I have some questions:

  • Remove heuristic selection of weight.

Does that mean removing alpha=choice(alpha)? I think by "heuristic selection" you mean the choice() function.

Please correct me if I am wrong.

@kayzliu
Member

kayzliu commented Aug 16, 2023

Hi Partha! For removing the heuristic selection, I think you only need to change the following lines: remove the if-else and set alpha to [0.8, 0.5, 0.2] unconditionally. But note that this may not reproduce the results in Table 3. For an exact reproduction, please downgrade to 0.3.1.

pygod/benchmark/utils.py

Lines 37 to 41 in 987776c

if args.dataset[:3] == 'inj' or args.dataset[:3] == 'gen':
    # auto balancing on injected dataset
    alpha = [None]
else:
    alpha = [0.8, 0.5, 0.2]
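With the if-else removed, the snippet collapses to a single unconditional grid. A sketch of the resulting code, assuming (as the thread suggests) that the benchmark's random search then samples one value per trial with `choice`:

```python
from random import choice

# Unconditional grid: no heuristic branch for 'inj'/'gen' datasets
alpha = [0.8, 0.5, 0.2]

# each benchmark trial draws one candidate from the grid
sampled = choice(alpha)
```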

@ParthaPratimBanik
Contributor

Hi @kayzliu ,
Thank you for your reply.

Hi Partha! For removing the heuristic selection, I think you only need to change the following lines. Remove the if-else and set the alpha to [0.8, 0.5, 0.2] without condition.

Yes, I got your point. I will come back after removing if-else block.

But note that it may not reproduce the results in Table 3. For exact reproduction of the results, please downgrade to 0.3.1.

Yes, you may be right. I need to downgrade to 0.3.1.

On 1.0.0, I have already started testing all of the following hyperparameter combinations for the DOMINANT model on the inj_cora dataset.

dropout = [0.0, 0.1, 0.3, 0.5, 0.7]
lr = [0.004, 0.01, 0.05, 0.10] # lr brings change on AUC
weight_decay = [0.0, 0.01]
hid_dim = [8, 12, 16, 32, 48, 64, 128, 256, 512, 1024]
weight = [None, 0.5]

But I did not find any AUC >= 0.80 on the inj_cora dataset; the maximum is ~0.7691.
I have not checked other models yet.
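For reference, the sweep described above can be sketched as a plain grid enumeration; fitting DOMINANT on inj_cora for each configuration is left as a placeholder, since that part depends on the installed PyGOD version:

```python
from itertools import product

# Hyperparameter grid from the comment above
dropout = [0.0, 0.1, 0.3, 0.5, 0.7]
lr = [0.004, 0.01, 0.05, 0.10]
weight_decay = [0.0, 0.01]
hid_dim = [8, 12, 16, 32, 48, 64, 128, 256, 512, 1024]
weight = [None, 0.5]

# Every combination tried in the sweep: 5 * 4 * 2 * 10 * 2 = 800 configs
combos = list(product(dropout, lr, weight_decay, hid_dim, weight))
print(len(combos))  # 800
```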

You can check on benchmark/results folder on ParthaPratimBanik/update_bm_utils repo.
testing code: benchmark/test_dominant.py

My goal is to find a generalized hyperparameter range for every model on every dataset, one that would be version-independent. I am not sure whether that is feasible; it is still quite uncertain to me.

What do you think? Will it be effective? Should I continue or stop this testing?

Sorry, for this long comment.

@ParthaPratimBanik
Contributor

Due to my typo wight (instead of weight) in PR #86, commit 63ae71a, the DOMINANT result did not improve; it stayed within 0.75-0.77.

After fixing the typo, I now get a maximum of ~0.8367 and a maximum average of ~0.78 on v1.0.0.

@kayzliu
Member

kayzliu commented Aug 17, 2023

Fixed in #86.

@kayzliu kayzliu closed this as completed Aug 17, 2023

3 participants