Host shut down during build #4924

Closed
abesto opened this Issue Oct 12, 2015 · 29 comments


abesto commented Oct 12, 2015

We've seen this a couple of times, for example in https://travis-ci.org/openzipkin/docker-zipkin/builds/84941149.

Broadcast message from root@testing-gce-03e2cd86-76ae-4bc4-abbd-20ea07b052e7
    (unknown) at 15:45 ...
The system is going down for power off NOW!

And then of course things get confused, and the build fails a second or so later.

Member

BanzaiMan commented Oct 19, 2015

Sounds like the GCE instance had to be shut down for one reason or another. Have you seen this recently?

Author

abesto commented Oct 19, 2015

I personally haven't seen it since that build, but I'll ask around.


solarce commented Oct 21, 2015

I am on the infrastructure team at Travis and wanted to give you an explanation as to why you were seeing this.

For our Trusty beta, http://docs.travis-ci.com/user/trusty-ci-environment/, we're using Google Compute Engine to run the build VMs. The VMs we're using right now are provisioned as pre-emptible VMs, https://cloud.google.com/compute/docs/instances/preemptible, which means Google can shut them down at will.

We normally handle this case like any number of other failure scenarios that might require us to restart a build but we found that this particular scenario wasn't being handled properly in all cases, so builds were being marked as failed when they should have been restarted.

You should no longer see this happening and we should automatically restart a build when this scenario occurs. We're also looking into how we can better track this particular scenario in our metrics, as right now it's just counted as a single "requeued job" metric that covers a range of possible sources for requeue.
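
The handling described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Travis worker code: the function name `classify_job`, the `requeue_reasons` counter, and the `host_preempted` label are all hypothetical, and only the banner text comes from the logs quoted in this issue.

```python
import re
from collections import Counter

# Banner text taken from the logs quoted in this issue.
SHUTDOWN_BANNER = re.compile(r"The system is going down for power off NOW!")

# Hypothetical per-reason counter. The comment above describes a single
# "requeued job" metric, so splitting it by reason is an assumption.
requeue_reasons = Counter()


def classify_job(log_text: str, exit_code: int) -> str:
    """Return 'requeue', 'passed', or 'failed' for a finished job."""
    if SHUTDOWN_BANNER.search(log_text):
        # The host went away mid-build, so the exit code says nothing
        # about the user's code: neither pass nor fail is trustworthy.
        requeue_reasons["host_preempted"] += 1
        return "requeue"
    return "passed" if exit_code == 0 else "failed"
```

Checking for the banner before looking at the exit code means a preempted job is never reported as passed or failed, whichever exit code the interrupted script happened to return.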

solarce closed this Oct 21, 2015

Author

abesto commented Oct 21, 2015

Thanks a lot for the in-depth explanation, appreciate it :)

Love the work you're all doing on Travis 💖


edmorley commented Oct 23, 2015

Hi Brandon, small world :-)

> You should no longer see this happening and we should automatically restart a build when this scenario occurs.

I've just seen this on a run that started 20 minutes ago; is it expected that the job didn't retry?
https://travis-ci.org/mozilla/treeherder/builds/86937208#L712


solarce commented Oct 23, 2015

Hello @edmorley, good to see you :)

Thanks for the link, we'll dig into why we didn't restart that build properly


buskamuza commented Dec 21, 2015

Hi,

Is this the same issue? https://travis-ci.org/mazhalai/magento2/jobs/98193334
Looks like the build is not automatically restarted.


cesy commented Jan 27, 2016

I'm still seeing this frequently on our repo - is there anything else going on? Will wrapping the commands in travis_retry help?


dmitriivoitovich commented Mar 17, 2016

It looks like the problem is still with us. We run about 100 builds per day and I see this "The system is going down for power off NOW!" quite often unfortunately.


iangcarroll commented Mar 24, 2016

Yes, this also just happened for us, and it isn't restarted:

Broadcast message from root@testing-gce-74fdc79e-f255-41ce-8c77-513b8f9077b6
    (unknown) at 0:44 ...
The system is going down for power off NOW!

ecnalyr commented Apr 8, 2016

Happened for me too, did not restart.


nsuke commented Apr 17, 2016

I'm affected by this regularly.
4 jobs just failed at approximately the same time because of this.
All of them were in the middle of travis_retry and/or travis_wait, but I'm not sure whether that has anything to do with your shutdown detection.

Here's the full log after the shutdown message for a travis_retry travis_wait ... command.

Broadcast message from root@testing-gce-af053011-bd8b-4715-a3ab-4957fcf9c2f4
    (unknown) at 6:20 ...
The system is going down for power off NOW!
The command docker build -q -t thrift-build build/docker/ubuntu exited with 1.

Log:
An error occurred trying to connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.23/build?buildargs=%7B%7D&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=Dockerfile&labels=%7B%7D&memory=0&memswap=0&q=1&rm=1&shmsize=0&t=thrift-build&ulimits=null: EOF
/home/travis/build.sh: line 112:  3545 Terminated              travis_jigger $! $timeout $cmd
The command "travis_wait docker build -q -t thrift-build build/docker/ubuntu" failed. Retrying, 2 of 3.


Still running (1 of 20): docker build -q -t thrift-build build/docker/ubuntu
The command docker build -q -t thrift-build build/docker/ubuntu exited with 1.

Log:
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
/home/travis/build.sh: line 112: 30866 Terminated              travis_jigger $! $timeout $cmd
The command "travis_wait docker build -q -t thrift-build build/docker/ubuntu" failed. Retrying, 3 of 3.


Still running (1 of 20): docker build -q -t thrift-build build/docker/ubuntu
The command docker build -q -t thrift-build build/docker/ubuntu exited with 1.

Log:
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
/home/travis/build.sh: line 112: 30886 Terminated              travis_jigger $! $timeout $cmd
The command "travis_wait docker build -q -t thrift-build build/docker/ubuntu" failed 3 times.

The command "travis_retry travis_wait docker build -q -t thrift-build build/docker/$DISTRO" failed and exited with 1 during .
Your build has been stopped.
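
The log above suggests why retrying inside the job does not help here: every attempt runs on the same host that is going down. As a rough Python sketch of that point (this is not how travis_retry is implemented; `retry`, `HostGoingDown`, and the `run` callable are hypothetical names), a retry wrapper could stop as soon as the shutdown banner shows up and leave requeueing to the scheduler:

```python
from typing import Callable, Tuple

# Banner text taken from the log above.
SHUTDOWN_BANNER = "The system is going down for power off NOW!"


class HostGoingDown(Exception):
    """Raised when the output shows the host itself is shutting down."""


def retry(run: Callable[[], Tuple[int, str]], attempts: int = 3) -> str:
    """Call run() until it exits 0, up to `attempts` times.

    run() is a hypothetical callable returning (exit_code, output).
    """
    last_output = ""
    for _ in range(attempts):
        exit_code, last_output = run()
        if SHUTDOWN_BANNER in last_output:
            # Later attempts would run on the same dying host, so give up
            # and let whatever scheduled the job decide to requeue it.
            raise HostGoingDown(last_output)
        if exit_code == 0:
            return last_output
    raise RuntimeError(f"command failed {attempts} times")
```

Raising a distinct exception keeps "the command is flaky" separate from "the machine is gone", which are the two cases the thread is trying to tell apart.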

rpaterson commented Apr 18, 2016

Just had this problem on one of our travis.com builds: https://travis-ci.com/signpost/core/jobs/37876888


rpaterson commented Apr 18, 2016

@solarce can you reopen this issue?


solarce commented Apr 19, 2016

Sorry for the ongoing issue. We believe we've identified a fix and are looking to deploy it to production tomorrow.

solarce reopened this Apr 19, 2016

solarce self-assigned this Apr 19, 2016

solarce added the team blue label Apr 19, 2016


ogrisel commented Apr 21, 2016

One more data point: it happened to us too. It wasn't restarted either, and the build ended in a "green" state:

https://travis-ci.org/pypa/manylinux/builds/124631630


ppalaga commented Apr 21, 2016

Same as @ogrisel here: the job got interrupted with "The system is going down for power off NOW!" but the job result was reported as success. It happened twice in a row after I restarted the job manually https://travis-ci.org/hawkular/hawkular-inventory/builds/124672840


ogrisel commented Apr 21, 2016

It also happened to me twice in a row when I restarted the job manually from the web UI.


LuxoftAKutsan commented Apr 21, 2016

Hello, during the last 2 days I had the same issue, and the build that was terminated (via shutdown) was marked as succeeded. It is quite confusing.
https://travis-ci.org/smartdevicelink/sdl_core/builds/124763617 ( I restarted this build due to project needs).
Another one with the same issue : https://travis-ci.org/LuxoftAKutsan/sdl_core/builds/124696000

BanzaiMan added the mac label and removed the team blue label Apr 21, 2016


ribasushi commented Apr 21, 2016

@BanzaiMan This is not mac-specific. I see this on linux GCE all over.

bors added a commit to rust-lang/rust that referenced this issue Aug 10, 2018

Auto merge of #53234 - kennytm:debug-9696, r=alexcrichton
Replace Travis shutdown debug scripts with DNS debug scripts

Since the cause of the host shutdown (travis-ci/travis-ci#4924) is found, we could revert the shutdown debug attempts to shorten the logs.

OTOH, the DNS failure still hasn't been solved, so here we added the script from travis-ci/travis-ci#9696 (comment).


RalfJung commented Aug 13, 2018

Shouldn't this be reopened then until travis-ci/worker#481 actually lands?


kennytm added a commit to kennytm/rust that referenced this issue Aug 16, 2018

Rollup merge of rust-lang#53234 - kennytm:debug-9696, r=alexcrichton
Remove Travis shutdown debug scripts, and remove CI-specific DNS settings

Since the cause of the host shutdown (travis-ci/travis-ci#4924) is found, we could revert the shutdown debug attempts to shorten the logs.

Additionally, we're pretty sure a custom DNS (added in  will not help travis-ci/travis-ci#9696, so reverting that part of rust-lang#51420 to reduce CI-specific settings.