
Error running singularity fatal error involving no loop devices available #1824

Closed
pelahi opened this issue Jun 29, 2023 · 4 comments


pelahi commented Jun 29, 2023

Version of Singularity
3.10.3

Describe the bug
The fatal error reported when running with `singularity --verbose` occurs between the default mount step and the check for the template passwd file. The message is:

VERBOSE: Default mount: /etc/resolv.conf:/etc/resolv.conf
FATAL:   container creation failed: mount /proc/self/fd/3->/lus/joey/software/projects/pawsey0001/pelahi/setonix/2023.07-v2/software/linux-sles15-zen3/gcc-12.2.0profile/singularityce-3.10.3-nwq6ro3vobvnpjvxletrzz4ylh2evbu2/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: could not attach image file to loop device: no loop devices available
VERBOSE: Checking for template passwd file: 
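For context, the failing step is attaching the image file to a free loop device. Outside Singularity, the same "find a free loop device" operation can be probed with losetup; the sketch below is illustrative only, guarded so it merely reports what it finds (no root assumed, nothing is attached):

```shell
#!/bin/sh
# Probe the step that fails in the report ("no loop devices available"):
# losetup -f prints the first unused loop device, which requires access
# to /dev/loop-control (or pre-created /dev/loopN nodes).
if command -v losetup >/dev/null 2>&1; then
    losetup -f 2>/dev/null || echo "no free loop device accessible"
else
    echo "losetup not available"
fi
```

Under heavy parallel use (e.g. many MPI ranks starting containers on one node at once), each rank needs its own free loop device, which is why the failure appears only with more than one task per node.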

This occurs on an HPE Cray EX system, specifically an AMD MI250x node (64-core Trento CPU, 4 AMD MI250x GPU cards), and only when running a job with more than one MPI process per node. That is, a job submitted with

#!/bin/bash
#SBATCH --nodes=1
# run with 2 or more tasks per node
#SBATCH --ntasks-per-node=2 

will fail, but a request for

#!/bin/bash
# two or more nodes
#SBATCH --nodes=2
# run with 1
#SBATCH --ntasks-per-node=1

will succeed.
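Putting the failing fragment together, a complete minimal job script might look like the sketch below. The image path and application name are placeholders; the actual container and launch line used on the Cray EX system are not given in this report.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # two or more tasks per node triggers the FATAL error

# Placeholder image and command -- the real ones are site-specific.
srun -N 1 -n 2 singularity exec ./myimage.sif ./my_mpi_app
```

With `--nodes=2` and `--ntasks-per-node=1`, the same launch line succeeds, since each node only attaches one loop device at a time.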

To Reproduce
I know this error is not easily reproducible, since it would require access to an HPE Cray EX system and, specifically, the images running on our GPU nodes, so I will not provide details of how to reproduce it. I am posting here to get some guidance on how to interpret the error message.

Expected behavior
It should just run without generating this fatal error.

OS / Linux Distribution

Linux nid001004 5.14.21-150400.24.46_12.0.63-cray_shasta_c #1 SMP Fri Mar 3 22:39:37 UTC 2023 (6e164f9) x86_64 x86_64 x86_64 GNU/Linux
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"

Installation Method
`spack install singularityce@3.10.3`, with a minor edit to disable conmon.

pelahi added the `bug` label Jun 29, 2023
dtrudg removed the `bug` label Jun 29, 2023

dtrudg commented Jun 29, 2023

Please upgrade to a current supported release (3.11.4). It is likely that the issue is related to a kernel change (backported to LTS kernels / distro kernels this year), for which a fix has already been implemented.

A workaround is mentioned in the original issue thread:

#1499

If upgrading to 3.11.4 does not fix the issue, you're welcome to re-open. Thanks.

dtrudg closed this as completed Jun 29, 2023

pelahi commented Jun 30, 2023

Hi @dtrudg
So upgrading to 3.11.4 did not fix this issue. However, I will try the suggestion in #1499 (comment). I will let you know how things go.
Cheers,
Pascal


pelahi commented Jun 30, 2023

Hi @dtrudg, you can consider this issue closed, as it did not relate to Singularity per se but to a regression in the particular Linux kernel I was using. With an update to the boot parameters, I am no longer encountering the issue. Cheers


dtrudg commented Jun 30, 2023

> Hi @dtrudg So upgrading to 3.11.4 did not fix this issue. However, I will try the suggestion in #1499 (comment). I will let you know how things go. Cheers, Pascal

It's concerning that the update to 3.11.4 did not address the issue... while changing the boot parameter did.

The code path through 3.11.4 was changed and tested (including on SLES) to avoid the need to change the boot parameter. It is, of course, possible that the Cray-specific kernel sets a different default maximum number of loop devices, etc.
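The thread does not record which boot parameter was changed, but a quick way to see what limit a given kernel applies is to read the loop module's `max_loop` parameter and count the pre-created device nodes. A minimal sketch, using standard Linux sysfs/dev paths and guarded so it degrades gracefully where the loop module is absent (the interpretation in the comments is general kernel behaviour, not taken from this thread):

```shell
#!/bin/sh
# Report the kernel's configured loop-device limit, if exposed.
# A max_loop of 0 generally means devices are created on demand via
# /dev/loop-control (plus a small static pool); a positive value
# pre-creates that many /dev/loopN nodes at boot.
if [ -r /sys/module/loop/parameters/max_loop ]; then
    echo "max_loop=$(cat /sys/module/loop/parameters/max_loop)"
else
    echo "loop module parameters not exposed"
fi

# Count the loop device nodes currently present in /dev.
echo "loop nodes: $(ls /dev/loop[0-9]* 2>/dev/null | wc -l)"
```

Comparing this output between a stock SLES node and the Cray Shasta image would show whether the vendor kernel ships a different default.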

I appreciate you have a working solution... but if you are able to detail how you updated to 3.11.4 and tested it, that would be useful to ensure it's not an issue for others. Thanks!
