[tune] Different in performance and best model between analysis.best_config and the best model in analysis.dataframe #17923
Comments
Hey @marcovaresi thanks for raising this! Would you be able to provide more info here? What is analysis.best_config giving you vs. what does the full analysis.dataframe look like? What are the default metric and mode you pass into tune.run? If you could provide a small reproducible example showing this bug, that would be great.
I made this small example based on a tutorial for a PyTorch Lightning model:
###run###
Here the best model has its checkpoint at epoch 70 (my run is probably not reproducible).
Here the best model is again the same, but at epoch 71, so the validation loss is different; in this case the loss is larger. However, when I run my more complex model, the best model is not the same, and the minimum validation loss found here is smaller than that of the best model reported by analysis.
I think this is due to the confusing concept of the
If you call Here is an example that illustrates the problem you're running into:
There are two problems here. First, the In the meantime, you can fix this behavior like this (for the example above):
We'll revise the experiment analysis experience in the near future to hopefully prevent these kinds of problems.
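A minimal sketch of how the mismatch described in this thread can arise. It assumes (this is not confirmed by the text above) that the "best" trial is picked from each trial's last reported value, while the manual search takes the minimum over every value reported during training. The trial names and loss values are made up, and `best_trial` is a hypothetical helper, not a Ray Tune function:

```python
# Hypothetical validation-loss histories, one list per trial (made-up numbers).
histories = {
    "trial_a": [0.90, 0.40, 0.55],  # reaches its best value mid-run, then gets worse
    "trial_b": [0.95, 0.60, 0.50],  # improves steadily until the end
}


def best_trial(histories, scope):
    """Return the id of the trial with the lowest loss under the given scope.

    scope="last": compare only the final value reported by each trial.
    scope="all":  compare the minimum value each trial ever reported.
    """
    if scope == "last":
        scores = {trial: losses[-1] for trial, losses in histories.items()}
    elif scope == "all":
        scores = {trial: min(losses) for trial, losses in histories.items()}
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return min(scores, key=scores.get)


best_last = best_trial(histories, "last")  # compares 0.55 vs 0.50
best_all = best_trial(histories, "all")    # compares 0.40 vs 0.50

print(best_last, best_all)
```

With these numbers the two rules pick different trials, which matches the symptom reported here. If the assumption holds for your run, it may be worth checking the `scope` argument of `analysis.get_best_trial` and the `metric`/`mode` arguments of `analysis.dataframe`, so that both sides compare the same quantity.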
What is the problem?
pytorch_lightning: '1.4.2'
ray: '1.5.2'
tensorboardX: '2.4'
callbacks = TuneReportCheckpointCallback
scheduler = ASHAScheduler
search_alg = HyperOptSearch
reporter = CLIReporter
At the end of tune.run (optimizing for the lowest validation loss), I obtain the best model (best config) and the best checkpoint.
When I move the results into a DataFrame using analysis.dataframe and search for the minimum validation loss, I frequently obtain a different configuration from that of the best model.
The PL model runs with early stopping on the validation loss.