Description
Is there an existing issue for this?
- I have searched the existing issues
Describe the bug
I've seen a number of cases where several containers are running, yet some of them enter a "Runtime Unhealthy" state. I've seen this in 3 separate scenarios:
- On the dashboard, a resource will enter the "Runtime Unhealthy" state before transitioning to a "Running" state. This most often happens during application startup / after WaitFor, although I do occasionally see a running container briefly enter "Runtime Unhealthy" before going back to "Running".
- In tests, tests will randomly fail on WaitForResourceHealthyAsync because the resource enters the "Runtime Unhealthy" state (even though other containers are running).
- With 9.2 I've today started experimenting with the container file API and I'm seeing these errors a lot.
fail: Aspire.Hosting.Dcp.dcpctrl.ContainerReconciler[0]
could not copy file to the container {"Container": {"name":"ONE-tcwcvabg"}, "Reconciliation": 52, "ContainerName": "ONE-tcwcvabg", "ContainerID": "1a8e5b2b76b90a08df176bb6085e9abfec82dfb035997a9f180a7cf1dc1075d6", "Destination": "/", "error": "runtime is not healthy\ndocker runtime is not healthy"}
fail: Aspire.Hosting.Dcp.dcpctrl.ContainerReconciler[0]
could not copy file to the container {"Container": {"name":"TWO-hjuvtvfg"}, "Reconciliation": 55, "ContainerName": "TWO-hjuvtvfg", "ContainerID": "9298a061ab3443f7cfecf940c713f0798633a0bb41420e9de5f8142f035c3d6e", "Destination": "/", "error": "runtime is not healthy\ndocker runtime is not healthy"}
In cases 1 and 3, the error is annoying - containers retry and eventually start. In scenario 2, however, it causes tests to become flaky because they use WaitBehavior.StopOnResourceUnavailable.
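As a stopgap for scenario 2, a test could retry the wait a few times rather than fail on the first blip. The following is only a minimal sketch under assumptions: `TransientWait` and `RetryAsync` are hypothetical names (not part of Aspire), and it assumes the unhealthy state is transient and that the wait surfaces it as an exception. It does not address the underlying runtime health problem.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of Aspire): retries an async wait a few times
// so that a brief "Runtime Unhealthy" blip does not fail the whole test.
public static class TransientWait
{
    public static async Task RetryAsync(
        Func<CancellationToken, Task> wait,
        int maxAttempts,
        TimeSpan delayBetweenAttempts,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await wait(cancellationToken);
                return;
            }
            catch (Exception ex) when (ex is not OperationCanceledException && attempt < maxAttempts)
            {
                // Assumption: the failure is transient. Pause, then try again.
                // On the final attempt the filter is false and the exception propagates.
                await Task.Delay(delayBetweenAttempts, cancellationToken);
            }
        }
    }
}

// Possible usage in a test, assuming `notifications` is the app's
// ResourceNotificationService and "ONE" is the resource name:
//
// await TransientWait.RetryAsync(
//     ct => notifications.WaitForResourceHealthyAsync("ONE", ct),
//     maxAttempts: 3,
//     delayBetweenAttempts: TimeSpan.FromSeconds(2));
```

Cancellation is deliberately not retried, so a test timeout still ends the wait promptly. Whether retrying is acceptable depends on what the test is meant to verify.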
Expected Behavior
It feels slightly odd to me that individual containers can have different runtime health levels - I'd expect the runtime to be considered healthy for all containers, or for none.
Steps To Reproduce
Unfortunately this is an intermittent error, so no easy repro steps 😒.
With 9.2, I've been experimenting with adapting @asimmon's TLS cert blog post to the container file API. It is whilst experimenting with this lifecycle hook that I've been noticing fairly frequent issues.
// Based upon https://anthonysimmon.com/dotnet-aspire-containers-trust-self-signed-certificates/
public class CertificateInjectorLifecycleHook : IDistributedApplicationLifecycleHook
{
    public Task BeforeStartAsync(DistributedApplicationModel appModel, CancellationToken cancellationToken = default)
    {
        var containers = appModel.GetContainerResources().ToList();
        if (!containers.Any())
        {
            return Task.CompletedTask;
        }

        var trustedCerts = CertTrust.GetTrustedCerts();
        if (!trustedCerts.Any())
        {
            return Task.CompletedTask;
        }

        var bundle = CertTrust.BuildCertBundle(trustedCerts);
        var certBundleFiles = Task.FromResult(BuildContainerFileSystemItemsForBundle());

        foreach (var container in containers)
        {
            container.Annotations.Add(new ContainerFileSystemCallbackAnnotation()
            {
                Callback = (context, ct) => certBundleFiles,
                DestinationPath = "/"
            });
        }

        return Task.CompletedTask;

        IEnumerable<ContainerFileSystemItem> BuildContainerFileSystemItemsForBundle()
        {
            return [
                Directory("etc",
                    Directory("ssl",
                        Directory("certs",
                            CertBundle("ca-certificates.crt") // Debian/Ubuntu/Gentoo etc.
                        ),
                        CertBundle("ca-bundle.pem"), // OpenSUSE
                        CertBundle("cert.pem") // Alpine Linux
                    ),
                    Directory("pki",
                        Directory("ca-trust",
                            Directory("extracted",
                                Directory("pem",
                                    CertBundle("tls-ca-bundle.pem") // CentOS/RHEL 7
                                )
                            )
                        ),
                        Directory("tls",
                            Directory("certs",
                                CertBundle("ca-bundle.crt") // Fedora/RHEL 6
                            ),
                            CertBundle("cacert.pem") // OpenELEC
                        )
                    )
                )];

            ContainerDirectory Directory(string directoryName, params ContainerFileSystemItem[] items)
                => new ContainerDirectory
                {
                    Name = directoryName,
                    Entries = items
                };

            ContainerFile CertBundle(string fileName)
                => new ContainerFile
                {
                    Name = fileName,
                    Mode = UnixFileMode.UserRead | UnixFileMode.GroupRead | UnixFileMode.OtherRead,
                    Contents = bundle
                };
        }
    }
}
Exceptions (if any)
No response
.NET Version info
No response
Anything else?
I have seen this prior to 9.2, but mainly since upgrading to 9.2. I'm not sure whether 9.2 has made the error more frequent, or whether the container file API in 9.2 just makes it more visible by logging it to the console, where it stays even after the issue resolves itself.