Skip to content

Conversation

Nic-Ma
Copy link
Contributor

@Nic-Ma Nic-Ma commented Feb 1, 2021

Signed-off-by: Nic Ma nma@nvidia.com

Fixes #1586

Description:
This PR added an option greater_or_equal to the Checkpoint handler, whether to save checkpoint if new priority equals to _saved[0].

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

Signed-off-by: Nic Ma <nma@nvidia.com>
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

Hi @Nic-Ma , thanks a lot for the PR !
I see your point about score_function and this option : #1586 (comment)

Seems like greater_or_equal will be only used with score_function, right ? As if we use Checkpoint that defines its priority by the iteration, than greater_or_equal=False wont be helpful...

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Feb 1, 2021

Hi @vfdev-5 ,

Yes, I didn't change the default behavior, just added an option for the case that score equals to the previous _save[0] and you want to save the latest checkpoint even with the same score.
If using iteration as the priority, it will not have the case that priority equals to the previous one, right?

Thanks.

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

@Nic-Ma I think I misunderstood the PR.

Yes, I didn't change the default behavior, just added an option for the case that score equals to the previous _save[0] and you want to save the latest checkpoint even with the same score.
If using iteration as the priority, it will not have the case that priority equals to the previous one, right?

I think it was asked previously and we added that such that latest equally scored model is stored... Let me check that

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

OK, catched up, what we did was to be able to save the latest model with the same filename.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Feb 1, 2021

Thanks for your confirmation, and this is a really useful feature, especially in FL, for example:
During FL training, we keep the training run for a long time and add more and more data to train the model day by day, the later model should be better even has the same metrics.

Thanks.

Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some comments to fix the implementation

Signed-off-by: Nic Ma <nma@nvidia.com>
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

Thanks for the explanation! Yes, true that this makes a lot of sense. I wonder if we should not set that as a default behaviour now ? And for users who'd like to have BC, they could use greater_or_equal=False...

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Feb 1, 2021

Thanks for the explanation! Yes, true that this makes a lot of sense. I wonder if we should not set that as a default behaviour now ? And for users who'd like to have BC, they could use greater_or_equal=False...

But I think for regular training on a fixed dataset, maybe the earlier model with the same metrics is better, because the later model may be overfitting? That's why we usually use early-stopping?
So I prefer to default to False, and it will not break any ignite previous behavior.

Thanks.

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

yes, it could also prevent saving overfitted model. Let's keep it False by default, I agree.

Signed-off-by: Nic Ma <nma@nvidia.com>
Copy link
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot @Nic-Ma !

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Feb 1, 2021

I can't see the CI errors, could you please help me figure it out?

Thanks.

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

I can't see the CI errors, could you please help me figure it out?

Thanks.

@Nic-Ma it is not an error, I just updated the PR to the latest master and thus cancelled Circle CI tests, but Github interprets this as a failure with a red cross.

Signed-off-by: Nic Ma <nma@nvidia.com>
@vfdev-5 vfdev-5 merged commit 0981e37 into pytorch:master Feb 1, 2021
@Nic-Ma Nic-Ma deleted the add-compare-fn branch February 1, 2021 12:05
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

Oh, I forgot about adding versionadd in the docs.
In the end of the docs, we can add

.. versionadded:: 0.4.3
    Added ``greater_or_equal`` parameter.

@Nic-Ma could you please send a follow-up PR with that ?

@ydcjeff
Copy link
Contributor

ydcjeff commented Feb 1, 2021

Most project uses versionadd for new class, funtion, etc and uses versionchanged for adding new args / behaviours and bug fixes.
Example: https://docs.python.org/3/library/os.html#os.environb

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 1, 2021

@ydcjeff thanks for the details ! In the provided link, actually I couldn't find any new args added, but I think it may seem reasonable to use versionchanged for that.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Feb 1, 2021

Hi @vfdev-5 @ydcjeff ,

Sure, I submitted PR #1600 and changed to versionchanged.
Could you please help review it?

Thanks.

@ydcjeff
Copy link
Contributor

ydcjeff commented Feb 1, 2021

@vfdev-5 Yea, the example may be belongs to changing of internal behaviour.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Feb 25, 2021

Hi @vfdev-5 ,

This is a missing feature for our project, could you please help add this MR to your 0.4.4 bug fix release?

Thanks.

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 25, 2021

Hi @Nic-Ma, sure !

@vfdev-5
Copy link
Collaborator

vfdev-5 commented Mar 1, 2021

Hi @Nic-Ma , actually, just checked but this PR is already present in v0.4.3: https://pytorch.org/ignite/handlers.html#ignite.handlers.Checkpoint
Please, let me know if I'm missing something...

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Mar 1, 2021

Cool, thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support options to compare in CheckpointHander
3 participants