Make ReturnnSearchJobV2 resumable? #329

Open · albertz opened this issue Oct 22, 2022 · 3 comments
albertz (Member) commented Oct 22, 2022

Basically, add resume="run" to the Task, but then also make sure the search output file is deleted at the beginning of the run, because otherwise RETURNN fails with an exception.
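
A rough sketch of what that could look like in the job (the attribute names, output file name, and the elided RETURNN call are placeholders for illustration, not necessarily the real ReturnnSearchJobV2 internals):

```python
import os

from sisyphus import Job, Task


class ReturnnSearchJobV2(Job):
    """Sketch; only the parts relevant to this issue are shown."""

    def __init__(self, returnn_config, **kwargs):
        # placeholder requirements and output; the real job defines more
        self.rqmt = {"cpu": 1, "mem": 8, "time": 4, "gpu": 1}
        self.out_search_file = self.output_path("search_out.py")

    def tasks(self):
        # resume="run" tells Sisyphus to restart the same step after a crash
        yield Task("run", resume="run", rqmt=self.rqmt)

    def run(self):
        # Remove a partially written search output from a previous attempt;
        # otherwise RETURNN fails with an exception because the file already exists.
        out = self.out_search_file.get_path()
        if os.path.exists(out):
            os.remove(out)
        # ... then launch the actual RETURNN search as before ...
```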

As I understand it, the Sisyphus resume logic would automatically increase the resource requirements when the job crashed due to a timeout or out-of-memory error. This could be helpful, or maybe not?
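
To illustrate what is meant by increasing the requirements, roughly something like this (a standalone sketch with assumed failure reasons and scaling factors, not the actual Sisyphus code):

```python
# Illustrative only: scale up the requested resources when a retry follows
# an out-of-memory or timeout failure.
def escalate_rqmt(rqmt: dict, failure_reason: str) -> dict:
    new_rqmt = dict(rqmt)
    if failure_reason == "out_of_memory":
        new_rqmt["mem"] = rqmt.get("mem", 4) * 2    # e.g. 4 GB -> 8 GB
    elif failure_reason == "timeout":
        new_rqmt["time"] = rqmt.get("time", 4) * 2  # e.g. 4 h -> 8 h
    return new_rqmt
```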

JackTemaki (Contributor) commented

I never considered this for the search task: whenever there was a timeout or out-of-memory issue for me, it was always a more substantial problem that required changing the job parameters anyway. I am also not a big fan of applying the resume logic to jobs that are not actually resumable (the search always starts again from the beginning), as relying on it leads to wasted computation in the long term.

albertz (Member, Author) commented Oct 25, 2022

But I don't quite understand the argument. If you configured too little memory for some job, why is it okay if the job is resumable, but not okay if it has to start again from scratch?

JackTemaki (Contributor) commented

> If you configured too little memory for some job, why is it okay if the job is resumable

For me this is not "okay" in either case. If I notice that there are memory or time issues, I fix the setup right away and do not rely on this mechanism to keep things running. Training jobs should be resumable because they can actually continue from a checkpoint (and independent of any resource issues, there can be other reasons for a process to be killed).

For me, this automatic adjustment of resources is a relic of the GMM pipeline, which distributed the work over e.g. 100 parallel jobs, where you want the 2-3 of them that accidentally need a lot more resources (which can happen depending on the assigned segments) to restart with new requirements. But even there I would argue it is better to fix your search settings (e.g. max pruning) than to rely on that. I do not see the benefit for RETURNN recognition jobs...
