Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick loop device discovery change, from sylabs 472 #9

Closed
wants to merge 3 commits into from
Closed

Conversation

edytuk
Copy link
Collaborator

@edytuk edytuk commented Dec 14, 2021

This pulls in sylabs PRs

The original PR descriptions were:

Recently an issue was discovered when using the Singularity container runtime in Arvados Crunch where certain Slurm jobs would start failing with an error message:

2021-11-26T18:57:12.918958066Z ERROR [container bcftools_convert_10] (arvc1-xvhdp-ffcjthyczz8gqol) error log:
2021-11-26T18:57:12.918958066Z 
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272198441Z crunch-run Not starting a gateway server (GatewayAuthSecret was not provided by dispatcher)
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272309584Z crunch-run crunch-run 2.3.1 (go1.17.1) started
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272331939Z crunch-run Executing container 'arvc1-dz642-tlt1uw3hyppd06g' using singularity runtime
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.272354161Z crunch-run Executing on host 'slurm-worker-blue-4'
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.433328942Z crunch-run container token "token-obfuscated" 
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:45.433798736Z crunch-run Running [arv-mount --foreground --read-write --storage-classes default --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id --disable-event-listening --mount-by-id by_uuid /tmp/crunch-run.arvc1-dz642-tlt1uw3hyppd06g.1063908694/keep2883686336]
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:49.241177040Z crunch-run Fetching Docker image from collection '3126bcd60bc91b04916bae8dfadede7d+177'
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:49.372828520Z crunch-run Using Docker image id "sha256:c74eb4247d8550fa2df0c5275a9dd3b34cb105347cc0fcefdf5a05749faaf0a1" 
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:49.372958779Z crunch-run Loading Docker image from keep
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:50.079531206Z crunch-run Starting container
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:50.080267548Z crunch-run Waiting for container to finish
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:52.939218932Z stderr FATAL:   container creation failed: mount /proc/self/fd/3->/usr/local/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: could not attach image file to loop device: failed to set loop flags on loop device: resource temporarily unavailable
2021-11-26T18:57:12.918958066Z   2021-11-26T18:56:53.365279723Z crunch-run Complete
2021-11-26T18:57:13.026434934Z ERROR [container bcftools_convert_10] unable to collect output from d41d8cd98f00b204e9800998ecf8427e+0:

with version 3.7.4 and 3.9.1 of Singularity. After reporting this issue on the Arvados bug tracker, its developers suggested that the issue is due to Singularity choosing the first /dev/loopN device that meets nearly all (but not all) of the criteria it needs:

  • exists or can be created
  • not already in use
  • can be configured with the desired parameters (offset, size limit, flags, etc)

Once it has chosen a device that meets the first two criteria, it waits 1.5s for it to meet the third criterion, then gives up.

The Arvados developers (thanks @tomclegg) have analyzed the underlying issue, and based on their observations we have come up with a patch that does the following:

  1. Move the F_SETFD and CmdSetStatus64 syscalls into the "try each /dev/loopN" loop

  2. If CmdSetStatus64 returns EAGAIN or EBUSY, set transientErrorFound to true, and move on to try the next loop device

  3. When all loop devices have been tried, return a TransientAttachError.

  4. AttachFromFile was renamed to attachFromFile, and a new AttachFromFile method was written that loops for up to 5 times over attachFromFile, while it continues to return a TransientAttachError.

We have not managed to come up with a reproducer for this issue outside of Arvados Crunch, as its occurrence appears to be tied to the fact that the Singularity images are accessed through a Fuse mount (using the arv-mount command).

The loop.AttachFromFile code, used to attach an image to a free loop device, or to find a shared loop device has been refactored.

Previously, a complex loop handled both finding shared loop devices and attaching to new free devices. The logic was difficult to follow, especially with regard to tracking which errors were fatal and would bubble up, vs result in loop continuation, or retries.

The shared and 'fresh' attachment have been separated into two functions. Some additional logging / error detail has been added.

The test code has been modified to prevent a leak of a loop device.

edytuk and others added 3 commits December 14, 2021 12:57
When the CmdSetStatus64 syscall returns EAGAIN or EBUSY, move on to try
the next loop device instead of aborting. Rename AttachFromFile to
attachFromFile and run in a loop while no valid loop devices were found
but EAGAIN or EBUSY was encountered.

This fixes an intermittent issue encountered on Arvados Crunch (https://dev.arvados.org/issues/18489)

Signed-off-by: Edita Kizinevic <edita.kizinevic@cern.ch>
Signed-off-by: Edita Kizinevic <edita.kizinevic@cern.ch>
The loop.AttachFromFile code, used to attach an image to a free loop
device, or to find a shared loop device has been refactored.

Previously, a complex loop handled both finding shared loop devices
and attaching to new free devices. The logic was difficult to follow,
especially with regard to tracking which errors were fatal and would
bubble up, vs result in loop continuation, or retries.

The shared and 'fresh' attachment have been separated into two
functions. Some additional logging / error detail has been added.

The test code has been modified to prevent a leak of a loop device.

Fixes sylabs/singularity#463

Signed-off-by: Edita Kizinevic <edita.kizinevic@cern.ch>
@edytuk edytuk requested a review from DrDaveD December 14, 2021 17:17
@edytuk edytuk closed this Dec 15, 2021
@edytuk edytuk deleted the sylabs472 branch December 16, 2021 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants