Taskcluster tasks sometimes hang at the end #10842

Closed
jgraham opened this issue May 4, 2018 · 5 comments
Comments

@jgraham
Contributor

jgraham commented May 4, 2018

It looks like we have tasks that do all the work but don't exit properly. I guess it's some race condition during shutdown that may be exacerbated by printing a lot of logs about unexpected failures at the end. For example:

https://tools.taskcluster.net/groups/HbE1r6PHR-K6dVfjVBCyyg/tasks/EyHx_c0rTKeGvgSTyLeJBw/runs/0/logs/public%2Flogs%2Flive.log

@jugglinmike
Contributor

I reviewed TaskCluster's behavior for all the builds between July 6 and July 12.

As noted, not all of those hang at the end. Many pause in the middle of test execution. After some trial-and-error, I was able to reproduce the problem locally using a generic Ubuntu 16.04 container and zero additional dependencies:

$ docker run -t --rm ubuntu:16.04 bash -c 'seq $((2 ** 30))'

That process will eventually stall. If you run the command with --interactive, you'll find that it's waiting for terminal input: press any key, and it will continue.
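For anyone reproducing this locally, here is the same command with --interactive added so that a keypress can unblock the stall (this is just my own variant of the command above, not anything TaskCluster runs):

$ docker run -it --rm ubuntu:16.04 bash -c 'seq $((2 ** 30))'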

I wrote up a very professional and impressive bug report for Docker, but I just now found the problem being discussed in the context of the Moby project, so you'll have to take my word for it.

It's hanging after an EIO error from the terminal. The hang is a bug but the EIO is real. What is the use case for piping a ton of data through a container with a pty? It's less overhead, fast, and no linux pty complexity to just remove the -t from the container if you are doing these types of things.

And the response is relevant:

For us it’s a ci system :)
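For reference, the workaround the maintainer describes amounts to dropping -t from the repro above so that no pty is allocated at all:

$ docker run --rm ubuntu:16.04 bash -c 'seq $((2 ** 30))'

That sidesteps the bug for ad-hoc commands, but our CI tasks are exactly the "ton of data through a pty" case being questioned, so it's not something we can simply apply ourselves.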

Fortunately, someone has come through with a patch to containerd/console (yet another component of the project; Docker is complex!), so we can hope that this will resolve itself in time.

I don't know what the time frame is, though. The change will need to be released in Docker, and TaskCluster will need to migrate to the new version. That's a lot of moving parts. In the meantime, we could experiment with throttling standard output and standard error. What do you think, @jgraham?
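To give a sense of what I mean by throttling: nothing fancier than a wrapper script that forwards the task's output a line at a time and pauses periodically. A rough sketch follows (the script name, batch size, and delay are all placeholders, not anything that exists today):

#!/bin/bash
# throttle-output.sh (hypothetical): run the given command, forward its
# combined stdout/stderr line by line, and sleep briefly every 1000 lines
# so the pty is never flooded faster than the console can drain it.
count=0
"$@" 2>&1 | while IFS= read -r line; do
  printf '%s\n' "$line"
  count=$((count + 1))
  if (( count % 1000 == 0 )); then
    sleep 0.1
  fi
done

It would wrap whatever command the task already runs (throttle-output.sh <existing command>) rather than requiring changes to the harness itself.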

@jugglinmike
Contributor

Here's some discussion about the issue on the TaskCluster bug tracker: https://bugzilla.mozilla.org/show_bug.cgi?id=1457694

@jugglinmike
Contributor

The bug fix has been published in Docker version 18.06, and the folks at TaskCluster have updated to that release. With that change live in TaskCluster, we just need to wait and see if stability improves on WPT's master branch. If we're lucky, my terrible "throttle standard output" script will never see the light of day.

@jugglinmike
Contributor

It's been a week since the folks at TaskCluster deployed the fix, and none of the 56 TaskCluster builds that have run in that time have failed. This evidence is not as conclusive as, say, a passing unit test, but the intermittent nature of the problem makes it hard to verify the fix any more directly.

@jgraham are you satisfied by these results? Do you think it's fair to call this issue resolved?

@jgraham
Contributor Author

jgraham commented Aug 3, 2018

I am very pleased to do so.
