When starting a container fails, don't wait until the full timeout; #115
... and not so much on the "safe" change. I believe this is a race condition. The order of operations I expected was:
This is what I see when I run this locally. However, what appears to be happening in TravisCI is:
I can get around that by adding a minimum runtime: any container that hasn't been running for at least that long will not be considered running. A small default value will eliminate this case, while also adding what might be useful functionality for containers that don't listen on a port.
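A minimal sketch of the minimum-runtime idea, assuming the container's start time is available as an `Instant`. The class name, the method name, and the `minimumRuntime` parameter are hypothetical and are not taken from this PR's code:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, for illustration only.
final class MinimumRuntimeSketch {

    /**
     * Treats a container as running only if it reports running AND has been
     * up for at least minimumRuntime. A container that crashes within
     * milliseconds of starting never passes this check.
     */
    static boolean isConsideredRunning(boolean reportedRunning,
                                       Instant startedAt,
                                       Instant now,
                                       Duration minimumRuntime) {
        if (!reportedRunning) {
            return false;
        }
        Duration uptime = Duration.between(startedAt, now);
        return uptime.compareTo(minimumRuntime) >= 0;
    }
}
```

With `Duration.ZERO` as the minimum, the check reduces to the plain "is it running" test, so a zero value effectively switches the feature off.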
Ah, that sounds unfortunate, but I guess it's good to have detected it now. So basically the process achieves 'running' state but crashes within milliseconds? I think the small minimum threshold you suggest sounds right as long as the value is very small. Implementation-wise, is it not equivalent to a Thread.sleep before starting to check the container status? Perhaps it's worth making it configurable and defaulting to 0ms; I get the feeling that this is going to be something that some use cases need but others don't. What do you think?
    profiler.start("Wait until container state=running");
    Unreliables.retryUntilTrue(30, TimeUnit.SECONDS, () -> {
    profiler.start("Wait until container state=running, or there's evidence it failed to start.");
    final Boolean[] startedOK = {null};
Instead of a boolean, you could define some non-Exception throwable (e.g. ContainerStartError extends Throwable) and catch it.
While that would work, I'd rather not. Recall that the javadoc for Error says:
"An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch."
With that in mind, what you're suggesting doesn't seem (to me at least) to pass a "principle of least surprise" check. The Boolean[] approach is consistent with what we're doing with the int[] in start().
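For readers unfamiliar with the holder-array idiom discussed here, a rough sketch follows. A lambda can only capture effectively final locals, so a one-element `Boolean[]` lets the retry body record the outcome while the retry loop stops as soon as the outcome is known. The `retryUntilTrue` helper, the `State` enum, and `waitForStart` below are hypothetical stand-ins written for this example; they are not the `Unreliables` API or the PR's actual code:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, for illustration only.
final class StartHolderSketch {

    /** Stand-in for the states a container inspection might report. */
    enum State { STARTING, RUNNING, EXITED }

    /** Stand-in retry helper: calls check until it returns true or attempts run out. */
    static void retryUntilTrue(int maxAttempts, BooleanSupplier check) {
        for (int i = 0; i < maxAttempts; i++) {
            if (check.getAsBoolean()) {
                return;
            }
        }
        throw new IllegalStateException("Timed out after " + maxAttempts + " attempts");
    }

    /** Returns true if the container started, false if it demonstrably failed. */
    static boolean waitForStart(List<State> observedStates) {
        Iterator<State> states = observedStates.iterator();
        // One-element array: the reference is final, but the lambda may write its slot.
        final Boolean[] startedOK = {null};
        retryUntilTrue(observedStates.size(), () -> {
            State state = states.next();
            if (state == State.RUNNING) {
                startedOK[0] = true;
                return true;   // stop retrying: success
            }
            if (state == State.EXITED) {
                startedOK[0] = false;
                return true;   // stop retrying: failure is known, bail out early
            }
            return false;      // outcome not known yet, keep polling
        });
        return startedOK[0];
    }
}
```

The lambda returns `true` in both the success and the failure case, which is what lets the caller bail out immediately instead of waiting for the full timeout; the holder then tells the caller which of the two happened.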
…ail out immediately.
… case where a container starts but then quickly stops.
b6de7b0 to c681a6f
Sorry for my absence over the last few days - I'm looking at this now!
When starting a container fails, don't wait until the full timeout; bail out immediately.
I'm using "finishedAt" to tell if the container did start and finish. This was a change mentioned in #107, but this isn't a total fix for that issue.
This seems like a relatively safe change, and makes failures in container start a bit less annoying to troubleshoot.
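The "finishedAt" check described above could look roughly like the sketch below. It assumes, as far as I know from Docker's inspect output, that a container which has never exited reports the zero timestamp `0001-01-01T00:00:00Z` as its finish time. The class and method names are hypothetical and this is not the PR's actual code:

```java
// Hypothetical sketch, for illustration only.
final class FinishedAtSketch {

    // Assumption: Docker reports this value for a container that has not finished.
    static final String DOCKER_ZERO_TIMESTAMP = "0001-01-01T00:00:00Z";

    /** True if finishedAt holds a real timestamp, i.e. the container ran and then stopped. */
    static boolean hasFinished(String finishedAt) {
        return finishedAt != null
                && !finishedAt.isEmpty()
                && !DOCKER_ZERO_TIMESTAMP.equals(finishedAt);
    }

    /** Evidence of a failed start: not running now, but a finish time was recorded. */
    static boolean failedToStart(boolean running, String finishedAt) {
        return !running && hasFinished(finishedAt);
    }
}
```

A container that is neither running nor finished is still starting, so only the combination "not running, finished" is treated as grounds to stop waiting.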