Diagnosing `NODE_FAILURE` when using NextFlow on Cheaha #672

wwarriner · 2024-02-05T20:11:48Z

What would you like to see added?

If the following conditions hold, consider that one or more NextFlow tasks may have insufficient memory allocated. Below, `$jobid` is the Slurm Job ID of the relevant NextFlow task.

- `sacct -j $jobid -X -o jobid,state` shows `NODE_FAILURE`.
- The file `.exitcode` does not exist in the NextFlow task's working directory ("working directory" in the NextFlow sense).
- No other messages give a more detailed description of the cause, a more specific error name, or an exit code.
- Tasks fail intermittently (not required, but it increases the likelihood).

When we have encountered researcher workflows where the above are true, the cause has invariably been an "Out of Memory" (OOM) event.
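The first two checks can be scripted. The following is a minimal bash sketch, not an official Cheaha tool: the function name `likely_oom` and the variables `jobid` and `workdir` are hypothetical, and it assumes `sacct` is on the `PATH`. It matches any state beginning with `NODE_FAIL`, since Slurm's documented state name is `NODE_FAIL`, so both spellings are covered. A match only suggests an OOM event; it does not prove one.

```shell
#!/usr/bin/env bash
# Sketch: flag a NextFlow task as a likely OOM victim using the two
# scriptable checks from this issue. `likely_oom` is a hypothetical helper.
#
#   $1: job state string, e.g. from `sacct -j "$jobid" -X -n -o state`
#   $2: path to the NextFlow task working directory
#
# Returns 0 when both checks match, 1 otherwise.
likely_oom() {
    local state="$1" workdir="$2"

    # sacct pads its columns, so strip all whitespace first.
    state="${state//[[:space:]]/}"

    # Check 1: the state starts with NODE_FAIL (also matches NODE_FAILURE).
    case "$state" in
        NODE_FAIL*) ;;
        *) return 1 ;;
    esac

    # Check 2: .exitcode is missing from the task working directory.
    [[ ! -e "$workdir/.exitcode" ]]
}

# Example usage (assumes $jobid and $workdir are already set):
#   state="$(sacct -j "$jobid" -X -n -o state)"
#   if likely_oom "$state" "$workdir"; then
#       echo "Job $jobid: possible OOM; consider requesting more memory."
#   fi
```

The remaining two conditions (no other helpful messages, intermittent failures) still need a manual look at the task logs.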

wwarriner added the feat: faq (ask.ci) https://ask.cyberinfrastructure.org/c/locales-data-centers-and-campus-rc/uab/52 label Feb 5, 2024
