Basically add `resume="run"` to the `Task`, but then also make sure the search output file is deleted at the start of the run, because otherwise RETURNN fails with an exception.
As I understand it, the Sisyphus resume logic would automatically increase the requirements when the job crashes due to a timeout or an out-of-memory error. Could this be helpful, or not?
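The "delete the output file first" step could look roughly like the sketch below. It is dependency-free on purpose: `remove_stale_output` and `run_search` are hypothetical names, and `search_fn` stands in for whatever actually launches the RETURNN search. This is not the existing job code, only an illustration of the idea.

```python
import os
import tempfile


def remove_stale_output(path: str) -> bool:
    """Delete a leftover output file from a previous, interrupted run.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def run_search(output_path: str, search_fn) -> None:
    """Hypothetical body of the job's run step.

    Assumption: the job's tasks() would yield Task("run", resume="run"),
    so this function is called again from scratch after a crash.
    """
    # Clear the partial result first, so the restarted search
    # does not trip over a file written by the crashed attempt.
    remove_stale_output(output_path)
    search_fn(output_path)
```

Catching `FileNotFoundError` instead of checking `os.path.exists` first keeps the first run and a resumed run on the same code path.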
I never considered this for the search task. Whenever I ran into a timeout or out-of-memory issue, it was a more substantial problem that required changing the job parameters. I am not a big fan of using the resume logic for non-resumable jobs (e.g. the search will always start from the beginning again), as relying on it leads to wasted computation in the long term.
But I don't quite understand the argument. If you configured too little memory for some job, why is it ok if it is resumable but not ok if it needs to start again from scratch?
> If you configured too little memory for some job, why is it ok if it is resumable
For me this is not "ok" in any way. If I notice memory or time issues, I fix the setup right away rather than rely on the resume logic to keep jobs running. Training jobs should be resumable because they can actually continue from a checkpoint (independent of any resource issues, there might be other reasons for a process to be killed).
For me this automatic adjustment of resources is a relic of the GMM pipeline, which distributed 100 parallel jobs, and where you want 2-3 of them to restart with new requirements if they happen to need a lot more resources (which can happen depending on the assigned segments). But even there I would argue it is better to fix your search settings (e.g. max pruning) than to rely on that. I do not see the benefit for RETURNN recognition jobs...