
Reproducibility #3

Closed · martinsvat opened this issue Sep 10, 2021 · 10 comments

martinsvat commented Sep 10, 2021

Hi guys, good work, but I'm struggling a bit to reproduce your results. It's nothing serious, but it would be better to have a clone-and-use approach. So far I have run into these obstacles:

  • hardcoded paths, e.g. USER_HOME+"/workspace/StAR/data/" (a sketch of a configurable alternative follows this list)
  • It's handy to have the exact commands to reproduce WN18RR in the readme, but some of the paths there seem incorrect, e.g. run.py in the RotatE/ folder is invoked with ./result/WN18RR_roberta-large/ (that should be ../StAR/results/WN18RR_roberta-large, right?); I'm also guessing there is a typo in --output_dir ./result/FB15k-237_roberta-largel
  • When I followed your readme's commands exactly, I ended up with FileNotFoundError: [Errno 2] No such file or directory: './result/UMLS_model_roberta-large/similarity_score_mtx.npy'. The only place where */similarity_score_mtx.npy is saved is get_ensembled_data.py, which is not invoked anywhere in the project (the relevant call is commented out inside get_ensembled_data.py).
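
For illustration, here is a minimal sketch of how the data directory could be made configurable instead of hardcoded; the --data_dir flag and its default are just my suggestion, not the repo's actual interface:

import argparse
import os

# Sketch only: flag name and default are suggestions, not the repo's interface.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--data_dir",
    default=os.path.join(os.path.expanduser("~"), "workspace", "StAR", "data"),
    help="root directory of the StAR datasets (previously hardcoded)",
)
args = parser.parse_args()

# Hypothetical downstream use, replacing USER_HOME + "/workspace/StAR/data/":
wn18rr_dir = os.path.join(args.data_dir, "WN18RR")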

Please, can you provide a fix for the missing similarity_score_mtx.npy file? I could simply uncomment that line, but there is no mention of how get_ensembled_data.py is meant to be used.

best
Martin

wangbo9719 (Owner) commented

Thanks for reporting these bugs. Sorry for the late response.

You are right on all points; sorry for the poor running experience. I have fixed these issues in the current version.
Now similarity_score_mtx.npy can be obtained by uncommenting the get_similarity() call in get_ensembled_data.py.
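
For context, here is a rough sketch of the kind of computation that produces such a file, i.e. a dense score matrix saved with NumPy; the cosine-similarity scoring and all names are my assumptions, not the actual code:

import numpy as np

def get_similarity(query_emb, candidate_emb, out_path):
    # Sketch under assumptions: cosine similarity between L2-normalized
    # embedding matrices; the real get_similarity() in get_ensembled_data.py
    # may score candidates differently.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    np.save(out_path, q @ c.T)  # e.g. .../similarity_score_mtx.npy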

Thanks again for your report.

martinsvat (Author) commented Sep 17, 2021

Thanks for the fix! However, I'm still not getting the same results as in your paper. Namely, to reproduce WN18RR I run (as stated in the readme):

  • get_new_dev_dict.py (twice with different parameters)
  • run_link_prediction.py (4.1)
  • learning RotatE using their best hyperparameter setup: bash run.sh train RotatE wn18rr 0 0 512 1024 500 6.0 0.5 0.00005 80000 8 -de
  • run_get_ensemble_data.py
  • ./codes/run.py
  • ./ensemble/run.py once with tail and once with head mode

In the end, I get:

head Hits @1: 0.20357370772176134
head Hits @3: 0.4572431397574984
head Hits @10: 0.6726228462029356
head Mean rank: 57.47383535417996
head Mean reciprocal rank: 0.3644130875627938

---------tail, test, lr=0.001, ep=3.0, nt=5, margin=0.6, bs=32 feature=mix metric ----------
tail Hits @1: 0.2906828334396937
tail Hits @3: 0.5370134014039566
tail Hits @10: 0.7565411614550096
tail Mean rank: 54.801850670070195
tail Mean reciprocal rank: 0.44718844813012876

Now, averaging these does not match the paper: e.g. for Hits@1, (tail-hits1 + head-hits1) / 2 = 0.24712827058072752, while the paper reports 0.459 (Table 4).
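
Concretely, the averaging I mean:

head_hits1 = 0.20357370772176134
tail_hits1 = 0.2906828334396937
print((head_hits1 + tail_hits1) / 2)  # 0.24712827058072752, vs. 0.459 in Table 4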

Please, can you provide more information (e.g. the hyperparameter setup of the RotatE model) so I can get the exact results from the paper? Or did I use the commands incorrectly somewhere? (For example, should the last one, ensemble/run.py, be executed not twice with different modes, but once with train and then once with --init?) I'd like to use the StAR model, but I need correct results as a starting point.

best
Martin

wangbo9719 (Owner) commented

Your commands seem correct. And I simply used RotatE's official hyperparameters to train the model on WN18RR.

The trained model behind the results reported in the paper was lost. I will reproduce the results soon and then let you know.

By the way, what results did you obtain for StAR and RotatE on WN18RR?

martinsvat (Author) commented

> By the way, what results did you obtain for StAR and RotatE on WN18RR?

For RotatE, these are the final lines from train.log:
Valid MRR at step 79999: 0.478470
Valid MR at step 79999: 3284.908372
Valid HITS@1 at step 79999: 0.432597
Valid HITS@3 at step 79999: 0.493243
Valid HITS@10 at step 79999: 0.571523
Evaluating on Test Dataset...
...
Test MRR at step 79999: 0.476083
Test MR at step 79999: 3369.924059
Test HITS@1 at step 79999: 0.428207
Test HITS@3 at step 79999: 0.494416
Test HITS@10 at step 79999: 0.571315

wangbo9719 (Owner) commented

And the results of StAR?

martinsvat (Author) commented Sep 18, 2021

Sorry, and thanks for the help. Here is the content of WN18RR_roberta-large/link_prediction_metrics.txt:
Hits left @1: 0.20261646458200383
Hits right @1: 0.2782386726228462
###Hits @1: 0.240427568602425
Hits left @3: 0.45213784301212506
Hits right @3: 0.5188257817485641
###Hits @3: 0.4854818123803446
Hits left @10: 0.6668793873643906
Hits right @10: 0.7479259731971921
###Hits @10: 0.7074026802807913
Mean rank left: 57.20835992342055
Mean rank right: 53.99298021697511
###Mean rank: 55.60067007019783
Mean reciprocal rank left: 0.3616734820860267
Mean reciprocal rank right: 0.4341342479524534
###Mean reciprocal rank: 0.39790386501924

This seems quite close to Table 4, and RotatE's results are also quite close to the ones reported there.

wangbo9719 (Owner) commented

Got it. I will try to find out the reason and tell you later.

martinsvat (Author) commented

Hi, any success reproducing the results?

Meanwhile, I have another question regarding the ensemble model. It is trained twice, once for the tail prediction task and once for the head prediction task, right? So, given the slightly different task of scoring a whole triple (e1, r2, e3), one has to query both the head-trained and the tail-trained model with (e1, r2, e3) and average their outputs, right?
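
In pseudocode, this is what I have in mind (the names and the .score() call are mine, not the repo's API):

def triple_score(e1, r2, e3, head_model, tail_model):
    # head_model / tail_model stand for the ensemble model trained in head
    # mode and in tail mode; .score() is a placeholder for whatever scoring
    # call the models actually expose.
    return 0.5 * (head_model.score(e1, r2, e3) + tail_model.score(e1, r2, e3))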

thx

wangbo9719 (Owner) commented

Sorry for the very late response.

There were some bugs in the code and the commands before; thanks for reporting them.
I have updated this repo. To reproduce the ensemble results, please follow the new version and rerun the last command in 5.1:

CUDA_VISIBLE_DEVICES=3 python ./codes/run.py \
    --cuda --init ./models/RotatE_wn18rr_0 \
    --test_batch_size 16 \
    --star_info_path /home/wangbo/workspace/StAR_KGC-master/StAR/result/WN18RR_roberta-large \
    --get_scores --get_model_dataset

By the way, the performance of the ensemble model may not be entirely stable. For the command in 5.2, you can just use 'add' for --feature_method and run do_prediction only, which gives a suboptimal result corresponding to StAR (Ensemble) in Table 4 of the paper.

As for your second question: yes, what you describe is one way to handle the triple classification task. Alternatively, you can modify the code to adapt it to that task; you can refer to the code of KG-BERT, which implements triple classification.
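
For instance, a minimal sketch of KG-BERT-style triple classification, with model.score() and the threshold as placeholders rather than this repo's actual API:

def classify_triple(e1, r2, e3, model, threshold=0.5):
    # Thresholding a plausibility score in [0, 1] gives a binary label;
    # KG-BERT obtains the score from a binary classifier over the encoded
    # triple, and the threshold is normally tuned on the dev set.
    return model.score(e1, r2, e3) >= threshold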

wangbo9719 (Owner) commented

Sorry, I fixed a small bug just now. If you followed the previous version, the generated files were saved under wrong paths and names; you can move the files to the correct directories.
