"Couldn't resolve host" errors almost daily around 6–7am UTC since May 24th #9696
Comments
This was referenced Jun 3, 2018
kennytm added a commit to kennytm/rust that referenced this issue Jun 7, 2018
manuels added a commit to manuels/rust that referenced this issue Jun 8, 2018
This was referenced Jun 17, 2018
pietroalbini referenced this issue Jun 26, 2018 (merged): add `dyn ` to display of dynamic (trait) types #51104
pietroalbini commented Jun 26, 2018:

We also tried setting fallback DNS servers in the docker containers (rust-lang/rust#51420), but that had no effect.
This was referenced Jun 27, 2018
This was referenced Jul 6, 2018
kennytm referenced this issue Jul 13, 2018 (merged): Change RangeInclusive to a three-field struct. #51622
pietroalbini commented Jul 20, 2018:

Today's failure is https://travis-ci.org/rust-lang/rust/jobs/406102611 (raw):
This was referenced Jul 22, 2018
meatballhat self-assigned this Jul 23, 2018
soulshake commented Jul 25, 2018:

@kennytm From your (extremely helpful!) failure table, it appears that it happens no more than once per day. Do you ever see a second failure on the same day?
@soulshake right, IIRC we haven't seen more than one #9696 per day. However, we'll automatically cancel a build if any of its jobs has failed, so we don't know whether this is affecting just one job or all jobs in the build. Also, each build takes ~2 hours to complete if successful, which makes it very unlikely for two builds to fall within the same hour.
soulshake commented Jul 25, 2018 (edited):

@kennytm We're trying to reproduce. In the meantime, if you're so inclined, I would love to know:

@meatballhat and I hacked together this (as yet untested) script in case it could be helpful. (Is there a specific rust-lang docker image you would recommend for testing that I could pull?)
Not always. We tried to detect DNS failure previously at rust-lang/rust#51939, which itself was successful, but in between, rust-lang/rust#51762 (comment) did fail due to #9696. This might be due to the 1.1.1.1 switch, or because 51939 did not issue any network request during that period.

Thanks, we'll try to deploy it and report back :)

Our docker images are at https://github.com/rust-lang/rust/tree/master/src/ci/docker; this one should have made more network requests, allowing easier reproduction. But they are not too isolated, so the easiest perhaps would be:
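Since kennytm mentions trying to detect the DNS failure before it surfaces mid-build, a pre-flight check along these lines could make the outage explicit. This is a hypothetical sketch, not the actual rust-lang/rust#51939 script; it assumes a Linux environment where `getent` is available:

```shell
#!/bin/sh
# Hypothetical pre-flight DNS check (illustrative, not the real CI script):
# try to resolve a host a few times before the build's network-heavy steps,
# so a #9696-style outage is reported up front instead of surfacing later
# as "Could not resolve host" from curl or git.
check_dns() {
    host="$1"
    attempts="${2:-3}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        if getent hosts "$host" > /dev/null 2>&1; then
            echo "ok: $host resolved (attempt $i)"
            return 0
        fi
        echo "warn: could not resolve $host (attempt $i)" >&2
        i=$((i + 1))
        sleep 1
    done
    return 1
}

check_dns localhost || exit 1
```

In a real job the hosts checked would be the ones the build actually contacts (e.g. github.com and the S3 endpoint from the failure logs above).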
kennytm referenced this issue Jul 26, 2018 (closed): [DO NOT MERGE] Test whatever Travis is having bug currently #52727
bors added a commit to rust-lang/rust that referenced this issue Jul 26, 2018
bors added a commit to rust-lang/rust that referenced this issue Jul 26, 2018
This was referenced Dec 1, 2018
soulshake commented Dec 4, 2018 (edited):

As far as I know, you're the only ones to report this. However, I wouldn't rule out the possibility that it's happening to others without the pattern being noticed; or perhaps it only happens with jobs starting at a certain time that are still running 1 hour later; or some other unique set of circumstances that your project is rare in meeting, etc.

Yes, they mean

I'll ask. @kennytm thank you very much for the tcpdump snippet. When I shared it on the GCP support ticket, I received this response:

When I pointed out that the

Just to confirm: I know this is happening in DNS lookups from within Docker in your jobs. Have you noticed if it also happens outside Docker in the same timeframe?
soulshake commented Dec 4, 2018:
FYI, I set up a Travis job to make continuous DNS queries here. |
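The probe job itself isn't reproduced in this thread, but a job in that spirit could look roughly like the following. This is an illustrative sketch, not soulshake's actual Travis job; it assumes a Linux host with `getent`:

```shell
#!/bin/sh
# Illustrative continuous DNS probe: resolve a host repeatedly and log a
# timestamped OK/FAIL line per attempt, so a window of resolver failures
# shows up clearly when reading the job log afterwards.
probe_dns() {
    host="$1"
    iterations="$2"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        if getent hosts "$host" > /dev/null 2>&1; then
            echo "$ts OK   $host"
        else
            echo "$ts FAIL $host"
        fi
        i=$((i + 1))
        sleep 1
    done
}

# To compare host vs. container behavior, run the same probe in both
# places (e.g. `docker run --rm -v "$PWD:/p" some-image sh /p/probe.sh`)
# and diff the two logs.
probe_dns localhost 3
```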
This was referenced Dec 5, 2018
soulshake commented Dec 7, 2018 (edited):

I believe I've reproduced the issue: in a job that does continuous DNS lookups, both from the host and from a container, the lookups in a container start failing each morning at the same time. Here's sample output from when the failure occurs:

Lines prepended with

This is the script I'm using. Of note: these jobs failed within 4 minutes of each other around 06:35 UTC, and they both started producing a

I've asked GCE if they have an automated process that would modify IPv4 forwarding on instances, e.g. by restoring the system's defaults for sysctl values.

Edit: See Beware Docker and sysctl defaults on GCE for a suggested workaround in the meantime.
soulshake commented Dec 8, 2018:

I updated my test to check IPv4 forwarding status, and it seems that it's indeed getting changed once per day, right around 07:00 UTC. Before Docker DNS lookup failures start:

and after:

I will look into options for dealing with this on our end.
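The forwarding check being described presumably reads the kernel flag that Docker depends on. A minimal version of such a check (a sketch, assuming a Linux host) is just:

```shell
#!/bin/sh
# Read the kernel's IPv4 forwarding flag. Docker enables it (1) at
# startup; the daily GCE-side process discussed above appeared to reset
# it to 0 around 07:00 UTC, at which point containers on the default
# bridge lose outbound connectivity and their DNS lookups start failing.
ip_forward="$(cat /proc/sys/net/ipv4/ip_forward)"
echo "net.ipv4.ip_forward = $ip_forward"
if [ "$ip_forward" != "1" ]; then
    echo "WARNING: IPv4 forwarding is off; container networking will break" >&2
fi
```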
soulshake added a commit to travis-ci/travis-cookbooks that referenced this issue Dec 10, 2018
soulshake referenced this issue Dec 10, 2018 (merged): Override GCE default to always enable IPv4 forwarding #1018
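The merged fix lives in the travis-cookbooks image build, but conceptually it amounts to persisting the sysctl so that a daily reset of defaults can't turn it off. A minimal sketch of that idea (the file path is illustrative, not the actual cookbook change):

```
# /etc/sysctl.d/99-ip-forward.conf (illustrative path)
# Keep IPv4 forwarding enabled so containers on the docker0 bridge
# retain outbound connectivity even if system sysctl defaults are
# re-applied by an automated process.
net.ipv4.ip_forward = 1
```

Running `sysctl --system` (or rebooting) applies files under /etc/sysctl.d/.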
This was referenced Dec 12, 2018
kennytm referenced this issue Dec 18, 2018 (merged): Search other library paths when looking for link objects #56397
soulshake commented Dec 18, 2018 (edited):

We're in the process of creating updated

If you're not ready to update your Travis config to use

Could you give this a try and let me know if you encounter any further failures?

Thank you all for your patience and help in tracking this down. We really appreciate it.

P.S. Further details from GCP support, for the curious:
kennytm referenced this issue Dec 19, 2018 (merged): trigger unsized coercions keyed on Sized bounds #56219
soulshake commented Dec 19, 2018 (edited):

Correction: the updated

Sorry for the earlier misinfo.
soulshake referenced this issue Dec 19, 2018 (open): Enable IPv4 forwarding on precise, trusty and xenial #1626
@soulshake Thanks! We're going to upgrade to
This was referenced Dec 23, 2018
soulshake commented Jan 15, 2019:

@kennytm We released a new stable Xenial image yesterday, so this issue should no longer occur. I'm going to mark it as resolved for now, but please feel free to reopen if you run into this behavior again. Thanks again for your patience and helpful contributions!
kennytm commented Jun 2, 2018 (edited)

We from rust-lang/rust have experienced a high rate of "Couldn't resolve host" errors since 2018-05-24, and they still persist today. All errors happen inside Docker in `sudo: required` jobs. An interesting observation is that the error always happens around 06:30Z to 07:15Z, so I suspect there's some cron configuration problem.

The following lists the logs we've gathered that were affected by this bug. The timestamp is when we saw the job fail.

- fatal: unable to access 'https://github.com/rust-lang-nursery/rust-toolstate.git/': Could not resolve host: github.com
- curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
- fatal: unable to access 'https://github.com/rust-lang-nursery/rust-toolstate.git/': Could not resolve host: github.com
- curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
- fatal: unable to access 'https://github.com/BurntSushi/xsv/': Could not resolve host: github.com
- curl: (6) Couldn't resolve host 's3-us-west-1.amazonaws.com'
- curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
- curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
- curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
- curl error: Couldn't resolve host 'github.com'
- curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com

.travis.yml: https://github.com/rust-lang/rust/blob/edae1cc38b467518a8eea590c9c3e0c103b4ecb0/.travis.yml