Skip to content

When starting a container fails, don't wait until the full timeout; #115

Merged
rnorth merged 3 commits intotestcontainers:masterfrom
locationlabs:dont-wait-on-failed-start
Apr 9, 2016
Merged

When starting a container fails, don't wait until the full timeout; #115
rnorth merged 3 commits intotestcontainers:masterfrom
locationlabs:dont-wait-on-failed-start

Conversation

@ostacey
Copy link
Copy Markdown

@ostacey ostacey commented Apr 1, 2016

When starting a container fails, don't wait until the full timeout; bail out immediately.

I'm using "finishedAt" to tell if the container did start and finish. This was a change mentioned in #107, but this isn't a total fix for that issue.

This seems like a relatively safe change, and makes failures in container start a bit less annoying to troubleshoot.

@ostacey
Copy link
Copy Markdown
Author

ostacey commented Apr 2, 2016

.... and, not so much on the "safe" change. I believe this is a race condition issue. The order of operations I expected was:

  • attempt to start container
  • container fails to start
  • inspect container status: running=false, finishedWhen=nonzero
  • --> we conclude that container stopped

This is what I see when I run this locally. However, what appears to be happening in TravisCI is:

  • attempt to start container
  • container begins to start
  • inspect container status: running=true
  • --> we assume container started correctly
  • container fails to start

I can get around that by adding a minimum runtime; any container that hasn't been running at least that long will not be considered running. A small default value will eliminate this case, while also adding what might be useful functionality for containers that don't listen on a port.

@rnorth
Copy link
Copy Markdown
Member

rnorth commented Apr 2, 2016

Ah, that sounds unfortunate, but I guess good to have detected now.

So basically the process achieves 'running' state but crashes within milliseconds?

I think the small minimum threshold you suggest sounds right as long as the value is very small. Implementation wise is it not equivalent to a Thread.sleep before starting to check the container status?

Perhaps it's worth making configurable and defaulting to 0ms; I get the feeling that this is going to be something that some use cases need but others don't. What do you think?

On 2 Apr 2016, at 03:35, ostacey notifications@github.com wrote:

.... and, not so much on the "safe" change. I believe this is a race condition issue. The order of operations I expected was:

attempt to start container
container fails to start
inspect container status: running=false, finishedWhen=nonzero
--> we conclude that container stopped
This is what I see when I run this locally. However, what appears to be happening in TravisCI is:

attempt to start container
container begins to start
inspect container status: running=true
--> we assume container started correctly
container fails to start
I can get around that by adding a minimum runtime; any container that hasn't been running at least that long will not be considered running. A small default value will eliminate this case, while also adding what might be useful functionality for containers that don't listen on a port.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub

profiler.start("Wait until container state=running");
Unreliables.retryUntilTrue(30, TimeUnit.SECONDS, () -> {
profiler.start("Wait until container state=running, or there's evidence it failed to start.");
final Boolean[] startedOK = {null};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a boolean you can define some non-Exception throwable (i.e. ContainerStartError extends Throwable) and catch it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that would work, I'd rather not. Recall that the javadoc for Error says:
"An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch."

With that in mind, what you're suggesting doesn't feel (to me at least) to pass a "principle of least surprise" check. The boolean[] approach is consistent with what we're doing with the int[] in start().

@ostacey ostacey force-pushed the dont-wait-on-failed-start branch from b6de7b0 to c681a6f Compare April 7, 2016 21:54
@rnorth
Copy link
Copy Markdown
Member

rnorth commented Apr 9, 2016

Sorry for my absence over the last few days - I'm looking at this now!

@rnorth rnorth merged commit c681a6f into testcontainers:master Apr 9, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants