Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Restart buildkit after containerd when provisioning #461

Merged
merged 2 commits into from
Aug 10, 2023

Conversation

pendo324
Copy link
Member

Issue #, if available: Fixes #412

Description of changes:

  • Previously, only containerd was restarted after configuring it to use the data on the persistent disk. This changes the UUID of the server worker. BuildKit also needs to be restarted to use the proper UUID. See issue for why this is important.

Testing done:

  • Local testing

  • I've reviewed the guidance in CONTRIBUTING.md

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

finch.yaml Outdated Show resolved Hide resolved
@pendo324
Copy link
Member Author

Seems like all of the errors on the tests are similar to this:

Finch Container Development E2E Tests save an image when the images exist should save multiple images with --output option
/Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.7.0/tests/save.go:73
  7ad7d303eef13a1d2f7cea1b13e6f5ad271e72836ed8b7104e705a8d60d74290
  No containers to be removed
  9977826e0d1d
  No images to be removed
  889f8d4032e4d9758db5fde95109c1028b52b51739f15b112690c6a150abf9a4
  bridge
  host
  none
  No networks to be removed
  2023-06-27 16:30:21.873538484 +0000 UTC finch /images/create {"name":"localhost:58043/alpine:latest"}
  cdfb4bbd2199
  2023-06-27 16:30:22.605963755 +0000 UTC finch /images/create {"name":"public.ecr.aws/docker/library/alpine:3.13"}
  469b6e04ee18
  time="2023-06-27T16:30:22Z" level=fatal msg="failed to get reader: content digest sha256:25f523f0e93b2b5fa676c15d91b90f08ee4de7a160874e6c52ea452929d5a7cc: not found"
  time="2023-06-27T16:30:22Z" level=fatal msg="exit status 1"
  [FAILED] in [It] - /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.7.0/command/command.go:122 @ 06/27/23 16:30:22.866

Wondering if its because of the way the public.ecr.aws/docker/library/alpine:3.13 repo is cached in our local registry. Taking a look.

@austinvazquez
Copy link
Member

Seems like all of the errors on the tests are similar to this:

Finch Container Development E2E Tests save an image when the images exist should save multiple images with --output option
/Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.7.0/tests/save.go:73
  7ad7d303eef13a1d2f7cea1b13e6f5ad271e72836ed8b7104e705a8d60d74290
  No containers to be removed
  9977826e0d1d
  No images to be removed
  889f8d4032e4d9758db5fde95109c1028b52b51739f15b112690c6a150abf9a4
  bridge
  host
  none
  No networks to be removed
  2023-06-27 16:30:21.873538484 +0000 UTC finch /images/create {"name":"localhost:58043/alpine:latest"}
  cdfb4bbd2199
  2023-06-27 16:30:22.605963755 +0000 UTC finch /images/create {"name":"public.ecr.aws/docker/library/alpine:3.13"}
  469b6e04ee18
  time="2023-06-27T16:30:22Z" level=fatal msg="failed to get reader: content digest sha256:25f523f0e93b2b5fa676c15d91b90f08ee4de7a160874e6c52ea452929d5a7cc: not found"
  time="2023-06-27T16:30:22Z" level=fatal msg="exit status 1"
  [FAILED] in [It] - /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.7.0/command/command.go:122 @ 06/27/23 16:30:22.866

Wondering if its because of the way the public.ecr.aws/docker/library/alpine:3.13 repo is cached in our local registry. Taking a look.

Been seeing some of these failures across a few different PRs. Potentially not related to your changes.

@pendo324
Copy link
Member Author

Been seeing some of these failures across a few different PRs. Potentially not related to your changes.

I reran the CI on #455, and it didn't hit these errors. I think its something related to this one

@pendo324
Copy link
Member Author

pendo324 commented Jun 30, 2023

The tests are failing here on this PR and on the finch-core repo. The similarity? Both are now running with synced containerd/buildkit worker UUIDs. In other words, the "correct" behavior appears to be that these tests fail. Looking into when/why this started happening.

@ginglis13
Copy link
Contributor

ginglis13 commented Jul 11, 2023

I've pulled your changes down locally, verified that the UUIDs are matching, but can't reproduce the e2e test failures locally.

These tests have been failing on finch-core since e2e tests were added. The difference between finch and finch-core e2e tests is that finch-core e2e 1/ runs a slightly different Fedora VM and 2/ runs tests with nerdctl as the test subject rather than finch.

Both are now running with synced containerd/buildkit worker UUIDs

True, but for some reason the Fedora VM used in finch-core e2e tests doesn't require this buildkit systemd service ordering change to enforce that UUID match... see https://github.com/runfinch/finch-core/blob/eaeaa96443a633d49d51143373650a93ab83a467/e2e/fedora.yaml#L148-L150

Here's my results using https://github.com/runfinch/finch-core/blob/main/e2e/fedora.yaml:

[giinglis@lima-finch finch-core]$ uname -a
Linux lima-finch 6.1.18-200.fc37.aarch64 #1 SMP PREEMPT_DYNAMIC Sat Mar 11 16:03:54 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
[giinglis@lima-finch finch-core]$ sudo ctr info
{
    "server": {
        "uuid": "226df813-f172-43e2-ba4f-97d015bdca4d",
        "pid": 1794,
        "pidns": 4026531836
    }
}
[giinglis@lima-finch finch-core]$ sudo buildctl debug workers --format "{{json .}}" | jq -r '. | .[].labels'
{
  "org.mobyproject.buildkit.worker.containerd.namespace": "default",
  "org.mobyproject.buildkit.worker.containerd.uuid": "226df813-f172-43e2-ba4f-97d015bdca4d",
  "org.mobyproject.buildkit.worker.executor": "containerd",
  "org.mobyproject.buildkit.worker.hostname": "lima-finch",
  "org.mobyproject.buildkit.worker.network": "host",
  "org.mobyproject.buildkit.worker.selinux.enabled": "false",
  "org.mobyproject.buildkit.worker.snapshotter": "overlayfs"
}

The changes in this PR do what we expect, which is sync the UUIDs for buildkit and containerd. finch save is where we are seeing CI failures. The high level behavior of finch save -o file.tar image1 image2 should work - by that mean both nerdctl and docker have this feature. Still investigating why we're seeing these test failures in save once UUIDs are synced...

@ginglis13
Copy link
Contributor

tl;dr for CI issues here is garbage collection labels not being set properly. I got to the bottom of it and opened an issue on nerdctl containerd/nerdctl#2372, but in reality this is a bug with buildkitv0.11.x . The bug has been patched in moby/buildkit#3972 which was included in buildkitv0.12.0 which was released just yesterday: https://github.com/moby/buildkit/releases/tag/v0.12.0.

Opened up containerd/nerdctl#2374 to upgrade nerdctl's buildkit to v0.12.0. Let's keep this PR open until we have both a new nerdctl release with the upgrade and a new Lima release pointing to the new nerdctl release.

@ginglis13
Copy link
Contributor

Should be able to rebase and re-run CI now #521 has been merged with dependency upgrades

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Signed-off-by: Justin Alvarez <alvajus@amazon.com>
@pendo324
Copy link
Member Author

Should be able to rebase and re-run CI now #521 has been merged with dependency upgrades

Just rebased 🤞

@pendo324 pendo324 closed this Aug 10, 2023
@pendo324 pendo324 reopened this Aug 10, 2023
@pendo324 pendo324 merged commit fca1828 into runfinch:main Aug 10, 2023
19 of 27 checks passed
@pendo324
Copy link
Member Author

Benchmarking steps:

  1. git clone https://github.com/apache/couchdb-docker.git (just an example image with a lot of / large layers)
  2. cd couchdb-docker/3.3.2
  3. clear all layers / images finch system prune -a -f
  4. regular build with Finch 0.7.0 time finch build -t couchdb .
    1. the time is 49s on my PC
  5. clear all layers / images again finch system prune -a -f
  6. do build with options that simulate the fix time finch build -t couchdb --output 1. type=image,unpack=true .
    1. time is 37s on my PC
  7. 49/37 = 1.33, so 33% increase in performance

KevinLiAWS pushed a commit that referenced this pull request Aug 16, 2023
🤖 I have created a release *beep* *boop*
---


## [0.8.0](v0.7.0...v0.8.0)
(2023-08-16)


### Features

* adding config option for SOCI installation on VM
([#506](#506))
([a2e077b](a2e077b))


### Bug Fixes

* configure aws creds in sync submodules/deps action
([#518](#518))
([b67452e](b67452e))
* give pull request write permissions to sync job
([#520](#520))
([55b5235](55b5235))
* give token write perms to sync-submodules
([#519](#519))
([8b639ea](8b639ea))
* Mount /var/folders to finch vm
([#525](#525))
([c97d2e9](c97d2e9))
* option to use installed lima for SOCI e2e tests
([#533](#533))
([8b66659](8b66659))
* quote recursive calls to make
([#515](#515))
([d603096](d603096))
* Restart buildkit after containerd when provisioning
([#461](#461))
([fca1828](fca1828))


### Build System or External Dependencies

* **deps:** Bump github.com/docker/cli from 24.0.4+incompatible to
24.0.5+incompatible
([#495](#495))
([e9e8617](e9e8617))
* **deps:** Bump github.com/docker/docker from 24.0.4+incompatible to
24.0.5+incompatible
([#497](#497))
([6f1afbb](6f1afbb))
* **deps:** Bump github.com/lima-vm/lima from 0.16.0 to 0.17.2
([#531](#531))
([6e33d15](6e33d15))
* **deps:** Bump github.com/onsi/gomega from 1.27.8 to 1.27.10
([#496](#496))
([d08d102](d08d102))
* **deps:** Bump github.com/pkg/sftp from 1.13.5 to 1.13.6
([#530](#530))
([09b3846](09b3846))
* **deps:** Bump github.com/shirou/gopsutil/v3 from 3.23.6 to 3.23.7
([#513](#513))
([83bd718](83bd718))
* **deps:** Bump golang.org/x/tools from 0.11.0 to 0.11.1
([#509](#509))
([e826bcf](e826bcf))
* **deps:** Bump golang.org/x/tools from 0.11.1 to 0.12.0
([#523](#523))
([09d6514](09d6514))
* **deps:** Bump k8s.io/apimachinery from 0.27.3 to 0.27.4
([#487](#487))
([444bbc0](444bbc0))
* **deps:** Bump k8s.io/apimachinery from 0.27.4 to 0.28.0
([#535](#535))
([8df84cf](8df84cf))
* **deps:** Bump submodules and dependencies
([#521](#521))
([1b3ad94](1b3ad94))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@pendo324 pendo324 mentioned this pull request Aug 24, 2023
1 task
pendo324 added a commit to runfinch/finch-core that referenced this pull request Aug 25, 2023
Issue #, if available:

*Description of changes:*
- Fixes e2e tests by restarting BuildKit when containerd is restarted
(similar issue faced in runfinch/finch#461)
- Uses a more up to date yaml file for testing (will automatically
update the OS version too since that's included in the makefile)

*Testing done:*


- [x] I've reviewed the guidance in CONTRIBUTING.md


#### License Acceptance

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slower image builds because of an inconsistent containerd UUID in buildkit
3 participants