Diagnosing NODE_FAILURE when using NextFlow on Cheaha #672

Open
wwarriner opened this issue Feb 5, 2024 · 0 comments
Labels
feat: faq (ask.ci) https://ask.cyberinfrastructure.org/c/locales-data-centers-and-campus-rc/uab/52

What would you like to see added?

If the following conditions hold, one or more NextFlow tasks likely have too little memory allocated. Here, $jobid is the Slurm job ID of the relevant NextFlow task.

  • sacct -j $jobid -X -o jobid,state shows NODE_FAILURE (Slurm prints this state as NODE_FAIL)
  • The file .exitcode does not exist in the NextFlow task's working directory. "Working directory" here means the NextFlow per-task work directory, not the directory the workflow was launched from.
  • No other message gives a more detailed description of the cause, a more specific error name, or an exit code.
  • Tasks fail intermittently. This is not required, but it makes an out-of-memory cause more likely.

In every researcher workflow we have encountered where the above hold, the cause of the error has been an "Out of Memory" (OOM) event.
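
The first two conditions above can be checked mechanically. The following is a minimal sketch, not an official Cheaha or NextFlow tool. The function names (job_state, looks_like_oom, query_sacct) are hypothetical. It assumes sacct is called with the extra flags -n (no header) and -P (pipe-delimited output) so the result is easy to parse. Slurm's documentation names the state NODE_FAIL, so the check accepts any state starting with that prefix. A True result only means the pattern matches; it does not confirm an OOM event.

```python
import subprocess
from pathlib import Path


def job_state(sacct_output: str) -> str:
    """Return the job state from pipe-delimited `jobid|state` sacct output.

    Returns an empty string if no state can be found.
    """
    for line in sacct_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[1].strip():
            # Keep only the first word, e.g. "CANCELLED by 1001" -> "CANCELLED".
            return fields[1].split()[0]
    return ""


def looks_like_oom(sacct_output: str, task_workdir: str) -> bool:
    """Heuristic: node-failure state AND no .exitcode in the task work dir."""
    # Accept "NODE_FAIL" as printed by Slurm, and "NODE_FAILURE" as written above.
    node_failure = job_state(sacct_output).startswith("NODE_FAIL")
    no_exitcode = not (Path(task_workdir) / ".exitcode").exists()
    return node_failure and no_exitcode


def query_sacct(jobid: str) -> str:
    """Run sacct for one job (requires Slurm; assumed flags -n and -P)."""
    result = subprocess.run(
        ["sacct", "-j", jobid, "-X", "-n", "-P", "-o", "jobid,state"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a system with Slurm, looks_like_oom(query_sacct(jobid), workdir) combines the two checks, where workdir is the task's NextFlow working directory. The remaining two conditions (no other helpful messages, intermittent failures) still need a manual look at the logs.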
