
This code chooses best model on test set instead of validation set. #2

Closed
zzxn opened this issue Jan 9, 2021 · 2 comments

zzxn commented Jan 9, 2021

In utils/train.py, the training loop returns the best MRR and the best HIT on the TEST SET, each tracked separately across epochs:

def train(self, epochs, log_interval=100):
    # ...
    for epoch in range(epochs):
        self.model.train()
        for batch in self.train_loader:
            ...  # forward pass, loss, backward pass, optimizer step
        mrr, hit = evaluate(self.model, self.test_loader, self.device)  # mrr and hit on the test set
        # ...
        max_mrr = max(max_mrr, mrr)  # best mrr on the test set so far
        max_hit = max(max_hit, hit)  # best hit on the test set so far
        self.epoch += 1
    return max_mrr, max_hit  # <- NOT OK: best mrr and hit on the TEST SET, each possibly from a different epoch

I believe this code is not the final version; please update it to the code used in your paper.
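For concreteness, here is a rough sketch of what I would expect instead: select the epoch by validation MRR and report the corresponding test metrics. The valid_loader attribute below is hypothetical; it does not exist in the current code.

def train(self, epochs, log_interval=100):
    # ...
    best_valid_mrr = float('-inf')
    best_test_mrr, best_test_hit = 0.0, 0.0
    for epoch in range(epochs):
        self.model.train()
        for batch in self.train_loader:
            ...  # forward pass, loss, backward pass, optimizer step
        # model selection on the (hypothetical) validation loader
        valid_mrr, valid_hit = evaluate(self.model, self.valid_loader, self.device)
        test_mrr, test_hit = evaluate(self.model, self.test_loader, self.device)
        if valid_mrr > best_valid_mrr:
            # keep the test metrics of the epoch with the best validation MRR
            best_valid_mrr = valid_mrr
            best_test_mrr, best_test_hit = test_mrr, test_hit
        self.epoch += 1
    return best_test_mrr, best_test_hit  # test metrics of the model selected on the validation set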

@zzxn changed the title from "This code choose best model on test set instead of validation set." to "This code chooses best model on test set instead of validation set." on Jan 9, 2021
twchen (Owner) commented Jan 10, 2021

This is actually a common practice in session-based recommendation. The same setting was adopted in previous works such as STAMP, SR-GNN, and FGNN, and in recently published works such as DHCN and TAGNN. The same setting can even be found in other areas of recommender systems, for example in LightGCN and NGCF.

Despite the popularity of our setting, I agree with you that, in principle, we should choose the best model on the validation set and evaluate it on the test set. The common approach is to hold out the most recent fraction of the data as the test set, and then hold out the most recent fraction of the remaining data as the validation set. But this approach has its own problem: the validation set and the test set are collected at different times, so their distributions differ, and performance on the validation set may not reflect the true performance on the test set. I guess this is part of the reason some existing methods did not adopt the correct-in-principle approach. I think a better approach is to hold out the most recent fraction of the data and then randomly split it into the validation and test sets (see the sketch below), but I did not think of this approach when we published this work.
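As a rough illustration of that alternative split (the sessions/timestamps arguments, the function name, and the 50/50 split ratio are placeholders, not our preprocessing code):

import numpy as np

def temporal_then_random_split(sessions, timestamps, holdout_frac=0.2, seed=42):
    # hold out the most recent fraction of sessions,
    # then randomly split it into validation and test sets
    order = np.argsort(timestamps)                  # oldest -> newest
    n_holdout = int(len(sessions) * holdout_frac)
    train_idx = order[:-n_holdout]
    holdout_idx = order[-n_holdout:]                # most recent sessions
    rng = np.random.default_rng(seed)
    rng.shuffle(holdout_idx)                        # random split => similar distributions
    valid_idx = holdout_idx[: n_holdout // 2]
    test_idx = holdout_idx[n_holdout // 2:]
    train = [sessions[i] for i in train_idx]
    valid = [sessions[i] for i in valid_idx]
    test = [sessions[i] for i in test_idx]
    return train, valid, test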

In conclusion, I think our setting is OK because it is very common. Besides, it is used for all models in our experiments, so the comparison of the models' performance is fair.
You could update our code to use the validation set to select the best model; it should be quite easy. However, we will not update the code ourselves, in order to keep it consistent with our paper.

zzxn (Author) commented Jan 10, 2021

Still confused by this common practice, but thanks for your reply.
