Repeated failures when fetching the centos:7 image #2063

Closed
psss opened this issue May 9, 2023 · 20 comments · Fixed by #2308

psss commented May 9, 2023

The /tests/finish/ansible test has recently been failing repeatedly when fetching the centos:7 image:

    provision
        how: container
        image: centos:7
    fail: Command 'podman pull -q centos:7' returned 125.
    finish
        summary: 4 tasks completed

Command 'podman pull -q centos:7' returned 125.

stderr (1 lines)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error: parsing image configuration: Get "https://cdn03.quay.io/sha256/86/8652b9f0cb4c0599575e5a003f5906876e10c1ceb2ab9fe1786712dac14a50cf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230509%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230509T091300Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=c2c2f0aa8a7432dac77824eeb08c5e24a5f7f082e6b89d4c4a09d14f53f35ec7&cf_sign=TwyGtFMDOKOvXJlUIptbQqu96NLFINyFXstV7d2mUJw3Z7s89br7VaojuL%2BirD5Drddxi7P%2BrjwFyWPUAvCa4s4xCiDIEBNJ8mnpIgk3Lkm8k%2FTKXZTa%2F%2FOksO27Tw79Mu8AZ%2Ffw8FGoEavukcSGCNs7XqoWDTL%2FZ%2BCEXsX1Ul2T58H5iPW6GVEjwyVr%2BcP92y8a%2BScf5xcxbNkAwGjayymxCohis%2BaVuV13l5EFpY2aJpBW0kxLAc2Rxd5%2BduG0hcxIyfwsjMXd4Jacs8rxpHzthrF7t%2BDX3FNlKw5N83wdRxebuNrHfTiSUtiEerE8wNfqKkvC6%2FWRltNisXo5Xw%3D%3D&cf_expiry=1683624180&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's a recent failure example. Perhaps we should retry the image fetch a few times to cover random network issues like this?
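A minimal sketch of what such a retry could look like (plain shell; the attempt count, delay, and image name are illustrative, not the actual tmt implementation):

```bash
#!/bin/bash
# Retry `podman pull` a few times to paper over transient DNS failures.
# Illustrative values only; the real fix would live in the provision code.
image="centos:7"
attempts=5
delay=5

for i in $(seq 1 "$attempts"); do
    podman pull -q "$image" && exit 0
    echo "Pull attempt $i/$attempts failed, retrying in ${delay}s..." >&2
    sleep "$delay"
done

echo "Failed to pull $image after $attempts attempts" >&2
exit 1
```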

psss added the provision (Provision step) and containers labels May 9, 2023
psss added this to the 1.24 milestone May 9, 2023

psss commented May 11, 2023

And one more again.

psss modified the milestones: 1.24, 1.25 May 30, 2023

psss commented May 30, 2023

Another one, this time in /tests/prepare/adjust on fedora-38.

psss modified the milestones: 1.25, 1.26 Jun 20, 2023

psss commented Jun 20, 2023

@thrix, here's one more from today on fedora-38.


psss commented Jun 21, 2023

And one more, on fedora-37.


psss commented Jun 27, 2023

And again, this time on rawhide.


psss commented Jun 29, 2023

One more on fedora-38.


psss commented Jun 30, 2023

And again on fedora-37 and rawhide.
Seems it would be good to handle this.


psss commented Jul 11, 2023

Failures appear with the centos:stream8 image as well:

Command 'podman pull -q centos:stream8' returned 125.
Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/sha256/d2/d21b0fbc651303b3e2f027de29d69381b5dbabc4d98a8d891487b425c2f9693b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230711%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230711T104338Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=aa70cfd35d80882a4dbc1fcef1dbae209f671894463189308fb401d18db3a6e9&cf_sign=hao8kLyL036xd1KTzJtiVWcPUon0iXPaGd4eOBm0ahx2s%2Bbwk83ooDIe9CKyUMbZBd5xU%2FHbIUwHO7QMrXvBKMz150zoIa1iTBM6sQtxwWXQ5f6SUUKZGvVmSvr9%2FPz1HuaE91ArG%2Bl13JIJjEH70H3bGMPSvGdtDYHSi4C0YfNNP4kGLCXcLsGsVKQsWY0fkUyeQuUgILbHv7qBkPtwgRQR7JAlZ%2Ba%2FUzaK7j9JpAm%2BJO%2FFmv1YtnuMytgfS%2B0DTpUDZQPG3d7332lwuQqoTScB4cmli6AFsm69Fg4YoDwTkGUzTS%2F1OrBr9ds5btLtWrr6zFFWBH1ns9Q8YjolVQ%3D%3D&cf_expiry=1689072818&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host
:: [ 10:43:38 ] :: [   FAIL   ] :: Command 'tmt -c distro=centos-8 run -arvvv provision -h container -i centos:stream8 plan --name without test --name defined 2>&1 | tee /tmp/tmp.KlAdDlE6RZ' (Expected 0, got 2)


psss commented Jul 27, 2023

So it seems that the issue is caused by systemd-resolved. Here are some relevant links:

During the hacking session we agreed to disable systemd-resolved for the pull request testing and to file a new systemd issue to properly investigate the problem. @thrix, could you please provide info about your Testing Farm workaround?
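For the record, disabling systemd-resolved would roughly amount to something like the following (just a sketch assuming a systemd-based runner; the exact Testing Farm steps may differ and the nameserver is only an example):

```bash
# Stop systemd-resolved so it no longer manages name resolution
# (assumption: the runner uses the usual stub-resolver setup).
sudo systemctl disable --now systemd-resolved

# Replace the stub-resolver symlink with a static resolv.conf
# pointing at an explicit nameserver (example value only).
sudo rm -f /etc/resolv.conf
echo "nameserver 1.1.1.1" | sudo tee /etc/resolv.conf
```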

psss modified the milestones: 1.26, 1.27 Jul 28, 2023

thrix commented Aug 8, 2023

thrix self-assigned this Aug 29, 2023

psss commented Sep 4, 2023

Recent jobs seem to be all green. Have you seen the problem during the last two weeks?


psss commented Sep 5, 2023

Ok, ran into another instance today.

psss added a commit that referenced this issue Sep 5, 2023
Prevent random dns failures caused (most probably) by
`systemd-resolved`. Using workaround from Testing Farm.

Fix #2063.
psss assigned psss and unassigned thrix Sep 5, 2023

psss commented Sep 5, 2023

Ok, trying to enable the Testing Farm workaround in #2308. Let's see how it works.

psss removed this from the 1.27 milestone Sep 6, 2023
psss added this to the 1.28 milestone Sep 6, 2023
psss added a commit that referenced this issue Sep 13, 2023
Prevent random dns failures caused (most probably) by
`systemd-resolved`. Using workaround from Testing Farm.

Fix #2063.
psss closed this as completed in 2b567d1 Sep 14, 2023

psss commented Sep 15, 2023

Hmmm, another `dial tcp: lookup cdn.registry.fedoraproject.org: no such host` failure appeared:

Error: copying system image from manifest list: parsing image configuration: Get "https://cdn.registry.fedoraproject.org/v2/fedora/blobs/sha256:72c9e456423548988a55fa920bb35c194d568ca1959ffcc7316c02e2f60ea0ff": dial tcp: lookup cdn.registry.fedoraproject.org: no such host

Let's see if there will be more...


psss commented Sep 22, 2023

And there's more:

Error: parsing image configuration: Get "https://cdn03.quay.io/sha256/86/8652b9f0cb4c0599575e5a003f5906876e10c1ceb2ab9fe1786712dac14a50cf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230922%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230922T150852Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=d2372170c7ecd5fe799a0f3c33c8a3c08774ab5e6fe77b7672e2755b546442d6&cf_sign=bPLeZxSCe1tMMdCo07yPeEF%2FaWFb01htKGmfLUQN1kmy4YiiE7srY24cBoxMSvmsrQOPh8JhvRNBNTPkHST%2BWHvgSepRVs%2BlQOdRyrebjh%2Fin84Rlxx4KDOG50BJG3a5CpIOJqugvTeY6UHltiFGdm4apJYLQRKPXRRrmyWTOMC0tT%2Bltm96xZP3WQjxW89Ez5DMWbOz2a9l%2BuQ1S%2BhAt4QTsKZpnUl0Ij0oWs8%2F3b8MMwRBLxqqWhtcWLVd2ihP6BVCuNIowCuLqxPUQpJfepswuVB0MVFVv8z%2FG3Y8j%2BsMnO0PBQaZBVsPVABJ%2BGK1XaMQVYldzYOXcrO8zdG9aw%3D%3D&cf_expiry=1695395932&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host


psss commented Nov 1, 2023

For reference, a similar error appeared again even with the retry support:

The error message says:

stderr (4 lines)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
time="2023-11-01T15:36:22Z" level=warning msg="Failed, retrying in 1s ... (1/3). Error: initializing source docker://localhost/become-container-test:latest: pinging container registry localhost: Get \"https://localhost/v2/\": dial tcp [::1]:443: connect: connection refused"
time="2023-11-01T15:36:23Z" level=warning msg="Failed, retrying in 1s ... (2/3). Error: initializing source docker://localhost/become-container-test:latest: pinging container registry localhost: Get \"https://localhost/v2/\": dial tcp [::1]:443: connect: connection refused"
time="2023-11-01T15:36:24Z" level=warning msg="Failed, retrying in 1s ... (3/3). Error: initializing source docker://localhost/become-container-test:latest: pinging container registry localhost: Get \"https://localhost/v2/\": dial tcp [::1]:443: connect: connection refused"
Error: initializing source docker://localhost/become-container-test:latest: pinging container registry localhost: Get "https://localhost/v2/": dial tcp [::1]:443: connect: connection refused
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


happz commented Nov 1, 2023

The good news is that the retries did work, at least according to the log. The bad news is, well, 5 attempts with 5 seconds between them were not enough to make it across the rough seas... It would be nice to set the defaults big enough to help, e.g. via envvars for plugin keys, but those apply only when the plugin is mentioned on the command line, which was not the case here :/


psss commented Nov 8, 2023

It failed again today, but the good news is that it did not happen under the podman plugin but during the container image preparation:

:: [ 15:36:20 ] :: [  BEGIN   ] :: Running 'podman build -t become-container-test:latest .'
STEP 1/3: FROM quay.io/fedora/fedora:latest
Trying to pull quay.io/fedora/fedora:latest...
Error: creating build container: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/sha256/a1/a1cd3cbf8adaa422629f2fcdc629fd9297138910a467b11c66e5ddb2c2753dff?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20231101%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231101T153621Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=c599dc85c3c96419caa2168dcee07fd4d0e545d9630f92bf542416cd743bf4db&cf_sign=QNIVhbRLD14trMM7appn7t4A81uihY2ct7SqqFYI89404Hfa%2BJTNmCpGQvQOSidfm%2FLsGySKGEOqxPiVeovvthcKDeOsRFo6b9nineXOFXpzG2b%2FITW7f3K4hxvEGsLLaQXDT89z%2FLQ1gV7ZnPj6mGPSvHY8BkCTuC%2BeLYhI4Xg8UclA07msz7H1X1L7BuDLRaPC%2B1aVCbT05n4nY4IDzfmSr0rBWX0UgcDi3AL87NmFZnIIrZZuuAEf5Q8lteeAEKfXuuuvhKZ0nVwDQ1FmBB6ZxBGpsVXmkip7eVzWnDAdDOauvfmlYn2isYQ3X%2BoKOROjb%2FnBQTZWzj0pj5ZDHg%3D%3D&cf_expiry=1698853581&region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host
:: [ 15:36:21 ] :: [   FAIL   ] :: Command 'podman build -t become-container-test:latest .' (Expected 0, got 125)

So... perhaps the retry actually works fine! :) It would be nice to check/grep recent logs to verify...
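For example, something along these lines could be run against downloaded CI logs (the log directory and the retry message pattern are assumptions):

```bash
# Hypothetical check: look for pulls that needed a retry in locally
# downloaded CI logs (directory name and message text are examples).
grep -rn "retrying" ./ci-logs/ | grep -i "podman pull"
```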

psss added a commit that referenced this issue Nov 8, 2023
Seems this is the only point where the podman tests are still
failing with the random network errors tracked in #2063. Let's
give several attempts to build the container to reduce the
failures.
psss added a commit that referenced this issue Nov 8, 2023
Seems this is the only point where the podman tests are still
failing with the random network errors tracked in #2063. Let's
give several attempts to build the container to reduce the
failures.

psss commented Nov 8, 2023

> good news is that it did not happen under the podman plugin but during the container image preparation:

Let's try to retry the container image creation as well: #2467
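A rough sketch of what retrying the build could look like (attempt count and delay are illustrative; see #2467 for the actual change):

```bash
# Retry the container image build a few times to cover the random
# network errors hit while pulling the base image (illustrative values).
built=false
for attempt in 1 2 3 4 5; do
    if podman build -t become-container-test:latest .; then
        built=true
        break
    fi
    echo "Build attempt $attempt failed, retrying in 5s..." >&2
    sleep 5
done
"$built" || { echo "Build failed after 5 attempts" >&2; exit 1; }
```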

psss added a commit that referenced this issue Nov 13, 2023
Seems this is the only point where the podman tests are still
failing with the random network errors tracked in #2063. Let's
give several attempts to build the container to reduce the
failures.
happz pushed a commit that referenced this issue Nov 13, 2023
Seems this is the only point where the podman tests are still
failing with the random network errors tracked in #2063. Let's
give several attempts to build the container to reduce the
failures.
dustymabe commented:

> On the hacking session we've agreed to disable systemd-resolved for the pull request testing and file a new systemd issue to properly investigate the problem.

@psss - did you ever open an issue against systemd?
