Containers go into RuntimeUnhealthy state even though other containers are running #8737

Closed
@afscrome

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

I've seen several cases where some containers enter a "Runtime Unhealthy" state even though other containers are running. I've seen this in 3 separate scenarios:

  1. On the dashboard, a resource will enter the Runtime Unhealthy state before transitioning to the Running state. This most often happens during application startup / after WaitFor, although I do occasionally see a running container briefly enter Runtime Unhealthy before going back to Running.

  2. In tests, WaitForResourceHealthyAsync will randomly fail because the resource enters the Runtime Unhealthy state (even though other containers are running).

  3. With 9.2, I started experimenting today with the container file API, and I'm seeing these errors a lot:

fail: Aspire.Hosting.Dcp.dcpctrl.ContainerReconciler[0]
      could not copy file to the container      {"Container": {"name":"ONE-tcwcvabg"}, "Reconciliation": 52, "ContainerName": "ONE-tcwcvabg", "ContainerID": "1a8e5b2b76b90a08df176bb6085e9abfec82dfb035997a9f180a7cf1dc1075d6", "Destination": "/", "error": "runtime is not healthy\ndocker runtime is not healthy"}
fail: Aspire.Hosting.Dcp.dcpctrl.ContainerReconciler[0]
      could not copy file to the container      {"Container": {"name":"TWO-hjuvtvfg"}, "Reconciliation": 55, "ContainerName": "TWO-hjuvtvfg", "ContainerID": "9298a061ab3443f7cfecf940c713f0798633a0bb41420e9de5f8142f035c3d6e", "Destination": "/", "error": "runtime is not healthy\ndocker runtime is not healthy"}

In scenarios 1 and 3 the error is merely annoying: containers retry and eventually start. In scenario 2, however, it makes tests flaky because they use WaitBehavior.StopOnResourceUnavailable.
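
One possible stopgap for scenario 2, as a sketch only: wrap the wait in a small retry helper so that a brief Runtime Unhealthy blip does not fail the test straight away. `TransientRetry` and `RunAsync` are hypothetical names and not part of Aspire, and the sketch assumes the wait throws when the resource is reported as unavailable.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Aspire: retries an async operation a few
// times before letting the last failure propagate to the test.
public static class TransientRetry
{
    // Returns the number of attempts it took for the operation to succeed.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task> operation,
        int maxAttempts,
        TimeSpan delayBetweenAttempts,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await operation(cancellationToken);
                return attempt;
            }
            catch (Exception ex) when (ex is not OperationCanceledException && attempt < maxAttempts)
            {
                // Assumption: any non-cancellation failure before the last
                // attempt is transient, so pause and try again.
                await Task.Delay(delayBetweenAttempts, cancellationToken);
            }
        }
    }
}
```

In a test, the operation passed in would be a lambda that calls WaitForResourceHealthyAsync for the resource under test. This only hides the symptom; it does not explain why individual containers see the runtime as unhealthy.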

Expected Behavior

It feels slightly odd to me that individual containers can have different runtime health levels. I'd expect the runtime to be considered healthy for all containers, or for none.

Steps To Reproduce

Unfortunately this is an intermittent error, so no easy repro steps 😒.

With 9.2, I've been experimenting with adapting @asimmon's TLS cert blog post to the container file API. It is whilst experimenting with this lifecycle hook that I've been noticing fairly frequent issues.

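   // Based upon https://anthonysimmon.com/dotnet-aspire-containers-trust-self-signed-certificates/
   // CertTrust is a helper that is not shown in this issue.
   using Aspire.Hosting.ApplicationModel;
   using Aspire.Hosting.Lifecycle;
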
   public class CertificateInjectorLifecycleHook : IDistributedApplicationLifecycleHook
   {
      public Task BeforeStartAsync(DistributedApplicationModel appModel, CancellationToken cancellationToken = default)
      {
         var containers = appModel.GetContainerResources().ToList();

         if (!containers.Any())
         {
            return Task.CompletedTask;
         }
         var trustedCerts = CertTrust.GetTrustedCerts();
         if (!trustedCerts.Any())
         {
            return Task.CompletedTask;
         }

         var bundle = CertTrust.BuildCertBundle(trustedCerts);
         var certBundleFiles = Task.FromResult(BuildContainerFileSystemItemsForBundle());

         foreach (var container in containers)
         {
            container.Annotations.Add(new ContainerFileSystemCallbackAnnotation()
            {
               Callback = (context, ct) => certBundleFiles,
               DestinationPath = "/"
            });
         }

         return Task.CompletedTask;

         IEnumerable<ContainerFileSystemItem> BuildContainerFileSystemItemsForBundle()
         {
            return [
               Directory("etc",
                  Directory("ssl",
                     Directory("certs",
                        CertBundle("ca-certificates.crt") // Debian/Ubuntu/Gentoo etc.
                     ),
                     CertBundle("ca-bundle.pem"), // OpenSUSE
                     CertBundle("cert.pem") // Alpine Linux
                  ),
                  Directory("pki",
                     Directory("ca-trust",
                        Directory("extracted",
                           Directory("pem",
                              CertBundle("tls-ca-bundle.pem") // CentOS/RHEL 7
                           )
                        )
                     ),
                     Directory("tls",
                        Directory("certs",
                           CertBundle("ca-bundle.crt") // Fedora/RHEL 6
                        ),
                        CertBundle("cacert.pem") // OpenELEC
                     )
                  )
               )
            ];

            ContainerDirectory Directory(string directoryName, params ContainerFileSystemItem[] items)
               => new ContainerDirectory
               {
                  Name = directoryName,
                  Entries = items
               };

            ContainerFile CertBundle(string fileName)
               => new ContainerFile
               {
                  Name = fileName,
                  Mode = UnixFileMode.UserRead | UnixFileMode.GroupRead | UnixFileMode.OtherRead,
                  Contents = bundle
               };
         }
      }
   }

Exceptions (if any)

No response

.NET Version info

No response

Anything else?

I have seen this prior to 9.2, but I've mainly seen it since upgrading to 9.2. I'm not sure whether that's because 9.2 has made the error more frequent, or because the container file API in 9.2 makes the error more in your face by logging it to the console, where it stays even after the issue resolves itself.
