"Couldn't resolve host" errors almost daily around 6–7am UTC since May 24th #9696

Closed
kennytm opened this Issue Jun 2, 2018 · 22 comments

kennytm commented Jun 2, 2018

We at rust-lang/rust have been experiencing a high rate of "Couldn't resolve host" errors since 2018-05-24, and they still persist today. All errors happen inside Docker in sudo: required jobs. An interesting observation is that the errors always happen between 06:30Z and 07:15Z, so I suspect there's some cron configuration problem.

The following table lists the logs we've gathered that were affected by this bug. The timestamp is when we saw the job fail.

Timestamp            | Log                                                | Error
2018-05-24T07:14:12Z | https://api.travis-ci.org/v3/job/383034809/log.txt | fatal: unable to access 'https://github.com/rust-lang-nursery/rust-toolstate.git/': Could not resolve host: github.com
2018-05-26T06:34:00Z | https://api.travis-ci.org/v3/job/383995388/log.txt | curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
2018-05-27T07:02:13Z | https://api.travis-ci.org/v3/job/384282776/log.txt | fatal: unable to access 'https://github.com/rust-lang-nursery/rust-toolstate.git/': Could not resolve host: github.com
2018-05-29T06:50:07Z | https://api.travis-ci.org/v3/job/385031055/log.txt | curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
2018-05-30T06:47:00Z | https://api.travis-ci.org/v3/job/385518123/log.txt | fatal: unable to access 'https://github.com/BurntSushi/xsv/': Could not resolve host: github.com
2018-05-31T06:38:08Z | https://api.travis-ci.org/v3/job/386026639/log.txt | curl: (6) Couldn't resolve host 's3-us-west-1.amazonaws.com'
2018-06-01T06:42:09Z | https://api.travis-ci.org/v3/job/386506502/log.txt | curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
2018-06-02T06:32:08Z | https://api.travis-ci.org/v3/job/386992985/log.txt | curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
2018-06-03T07:10:07Z | https://api.travis-ci.org/v3/job/387315900/log.txt | curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com
2018-06-05T07:14:48Z | https://api.travis-ci.org/v3/job/388108259/log.txt | curl error: Couldn't resolve host 'github.com'
2018-06-07T06:44:09Z | https://api.travis-ci.org/v3/job/389090019/log.txt | curl: (6) Could not resolve host: s3-us-west-1.amazonaws.com

pietroalbini commented Jun 26, 2018

We also tried setting fallback DNS servers in the docker containers (rust-lang/rust#51420), but that had no effect.
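
For context, a minimal sketch of how fallback DNS servers can be passed to Docker (one common approach, not necessarily what rust-lang/rust#51420 did; the resolver addresses are just examples):

# Per container: pass extra resolvers on the command line.
docker run --rm --dns 8.8.8.8 --dns 1.1.1.1 ubuntu:16.04 getent hosts github.com

# Or daemon-wide: configure resolvers in /etc/docker/daemon.json and restart dockerd.
echo '{"dns": ["8.8.8.8", "1.1.1.1"]}' | sudo tee /etc/docker/daemon.json
sudo service docker restart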

pietroalbini commented Jul 20, 2018

Today's failure is https://travis-ci.org/rust-lang/rust/jobs/406102611 (raw):

[01:33:17] testing https://github.com/servo/webrender
[01:33:17] Initialized empty Git repository in /checkout/obj/build/ct/webrender/.git/
[01:33:17] fatal: Could not parse object '57250b2b8fa63934f80e5376a29f7dcb3f759ad6'.
[01:34:13] fatal: unable to access 'https://github.com/servo/webrender/': Could not resolve host: github.com

soulshake commented Jul 25, 2018

@kennytm From your (extremely helpful!) failure table, it appears that it happens no more than once per day. Do you ever see a second failure on the same day?

kennytm commented Jul 25, 2018

@soulshake Right, IIRC we haven't seen more than one #9696 per day. However, we'll automatically cancel a build if any of its jobs has failed, so we don't know if this is affecting just one job or all jobs in the build. Also, each build takes ~2 hours to complete if successful, which makes it very unlikely for two builds to fall within the same hour.

soulshake commented Jul 25, 2018

@kennytm We're trying to reproduce. In the meantime, if you're so inclined, I would love to know:

  • Do jobs run between 06:30Z and 07:15Z always fail?
  • What are the contents of /etc/resolv.conf and /etc/network/interfaces from within a job?
  • Could you share the output of docker info and git config --list?

@meatballhat and I hacked together this (as yet untested) script if it could be helpful. (Is there a specific rust-lang docker image you would recommend for testing that I could pull?)
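
For reference, that information could be collected from inside a job with something along these lines (a sketch; ubuntu:16.04 is only an example image for the in-container view):

# On the build VM itself:
cat /etc/resolv.conf
cat /etc/network/interfaces
docker info
git config --list

# And the resolver configuration a container actually sees, for comparison:
docker run --rm ubuntu:16.04 cat /etc/resolv.conf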

kennytm commented Jul 25, 2018

@soulshake

Do jobs run between 06:30Z and 07:15Z always fail?

Not always. We previously tried to detect DNS failures in rust-lang/rust#51939, which itself was successful, but in between, rust-lang/rust#51762 (comment) did fail due to #9696. This might be due to the 1.1.1.1 switch, or because 51939 did not issue any network requests during that period.

meatballhat and I hacked together this (as yet untested) script if it could be helpful.

Thanks, we'll try to deploy it and report back :)

(Is there a specific rust-lang docker image you would recommend for testing that I could pull?)

Our docker images are at https://github.com/rust-lang/rust/tree/master/src/ci/docker; this one should have made more network requests, allowing easier reproduction. But they are not too isolated, so the easiest approach would probably be:

  1. Clone the rust-lang/rust repo
  2. Edit src/ci/run.sh and replace its content with the DNS diagnostic stuff; this is the script we'll run after entering docker (a rough sketch follows below).
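
A rough sketch of those two steps; the lookup loop below is only a placeholder for the real diagnostic script, and uses getent so it doesn't depend on dig being installed in the image:

git clone https://github.com/rust-lang/rust
cd rust
cat > src/ci/run.sh <<'EOF'
#!/bin/sh
# Placeholder DNS diagnostics; this is what runs inside the Docker container.
for i in $(seq 1 120); do
    date -u
    getent hosts github.com || echo "lookup of github.com failed"
    sleep 60
done
EOF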

bors added a commit to rust-lang/rust that referenced this issue Jul 26, 2018

Auto merge of #52727 - kennytm:test-9696, r=<try>
[DO NOT MERGE] Test travis-ci#9696

Test the script provided in travis-ci/travis-ci#9696 (comment).

r? @ghost

soulshake commented Dec 4, 2018

@alexcrichton

Are we the only project running into this though? I'm not sure how our network needs would be different enough from others' to not have a lot of others chiming in here...

As far as I know, you're the only ones to report this. However, I wouldn't rule out the possibility that it's happening to others without the pattern being noticed; or perhaps it only happens with jobs starting at a certain time that are still running an hour later; or some other unusual set of circumstances that few projects besides yours happen to meet.

Additionally, by "that resolv.conf" in the reply from GCE do they mean the /etc/resolv.conf or some different one?

Yes, they mean /etc/resolv.conf.

Is there a way we can opt out of this to not have a window where configuration doesn't work?

I'll ask.

@kennytm thank you very much for the tcpdump snippet. When I shared it on the GCP support ticket, I received this response:

About the metadata server stop replying, the dumps shows a sequence of "bad udp cksum" and 
this is due to the TCO (tcp offloading). please check this third-party guide on how to disable it 
for a clean tcpdump[1].

[1] https://sokratisg.net/2012/04/01/udp-tcp-checksum-errors-from-tcpdump-nic-hardware-offloading/

When I pointed out that the bad udp cksum entries occur in the successful lookups too, this was the reply:

Since the dump provided is a really small extract from the traffic, and includes all the 
bad cksum, the first 2 answers from the metadata server are probably not linked to 
the same 2 requests done before that.

It is possible that it takes hundreds requests to have one that goes through, that is 
why filtering out the errors, we can have a better idea of what is really happening.

I also believe that a re-run of the tcpdumps will not help, as the ideal is to work it on 
pairs: the source and destination dump at the same time. Since it is not possible as 
you do not control the destination (metadata server) having half of the conversation 
will create more question than answers.

Even if I obtain the dumps from the metadata, if they are not synchronized, they are 
not useful as well.

Unfortunately, the provided instance is not available anymore for inspection.

What you can use as troubleshooting tool is to have another small Instance querying 
the DNS as a control group, if the issue happens to that instance, the issue is related 
to the DNS, if not is related to the Docker request for the DNS.

Please let me know if this line of investigation is possible on your side.

Just to confirm: I know this is happening in DNS lookups from within Docker in your jobs. Have you noticed whether it also happens outside Docker in the same timeframe?

soulshake commented Dec 4, 2018

FYI, I set up a Travis job to make continuous DNS queries here.

soulshake commented Dec 7, 2018

I believe I've reproduced the issue: in a job that does continuous DNS lookups, both from the host and from a container, the lookups in a container start failing each morning at the same time.

Here's sample output from when the failure occurs:

---> Fri Dec 7 06:40:22 UTC 2018: try 3 github.com via 169.254.169.254:
192.30.253.113
192.30.253.112
192.30.253.113
192.30.253.112
and in docker:
** WARNING: IPv4 forwarding is disabled. Networking will not work.
** ;; connection timed out; no servers could be reached
x Failed to lookup github.com via @169.254.169.254 (docker)

Lines prefixed with ** are the output of if ! docker run -ti tutum/dnsutils dig +short "@${nameserver}" "${h}".
In 2 of 3 cases so far, once it starts failing in a job, it continues failing for the rest of the job:

  1. https://travis-ci.org/soulshake/startup-threads-cli/builds/463852133 (started failing at Wed Dec 5 13:58:20 UTC, but recovered; I think this one is a fluke)
  2. https://travis-ci.org/soulshake/startup-threads-cli/builds/464187798 (started failing at Thu Dec 6 06:34:32 UTC)
  3. https://travis-ci.org/soulshake/startup-threads-cli/builds/464807049 (started failing at Fri Dec 7 06:38:58 UTC)

This is the script I'm using.
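
The linked script isn't reproduced here, but judging from the output above it boils down to a loop like this (a sketch only; flags, image, and sleep interval are guesses, and it assumes dig is available on the host):

nameserver=169.254.169.254
h=github.com
try=0
while true; do
    try=$((try + 1))
    echo "---> $(date -u): try ${try} ${h} via ${nameserver}:"
    # Lookup from the host...
    dig +short "@${nameserver}" "${h}" || echo "x Failed to lookup ${h} via @${nameserver} (host)"
    echo "and in docker:"
    # ...and the same lookup from inside a container.
    if ! docker run --rm tutum/dnsutils dig +short "@${nameserver}" "${h}"; then
        echo "x Failed to lookup ${h} via @${nameserver} (docker)"
    fi
    sleep 30
done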

Of note: these jobs failed within 4 minutes of each other around 06:35 UTC, and they both started producing a "WARNING: IPv4 forwarding is disabled. Networking will not work." message.

I've asked GCE whether they have an automated process that would modify IPv4 forwarding on instances, e.g. by restoring the system defaults for sysctl values.

Edit: See Beware Docker and sysctl defaults on GCE for a suggested workaround in the meantime.

kennytm referenced this issue Dec 8, 2018: Bump to 1.33.0 #56536 (Merged)

soulshake commented Dec 8, 2018

I updated my test to check IPv4 forwarding status, and it seems that it's indeed getting changed once per day, right around 07:00 UTC.

Before Docker DNS lookup failures start:

ipv4 forwarding on host: net.ipv4.ip_forward = 1
ipv4 forwarding in docker: net.ipv4.ip_forward = 1

and after:

ipv4 forwarding on host: net.ipv4.ip_forward = 0
WARNING: IPv4 forwarding is disabled. Networking will not work.
ipv4 forwarding in docker: net.ipv4.ip_forward = 0

I will look into options for dealing with this on our end.
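
For anyone who wants to run the same check, the two readings above can be captured with something like this (a sketch; reading /proc in the container avoids depending on the sysctl binary being present in the image, and ubuntu:16.04 is only an example):

echo "ipv4 forwarding on host: $(sysctl net.ipv4.ip_forward)"
echo "ipv4 forwarding in docker: net.ipv4.ip_forward = $(docker run --rm ubuntu:16.04 cat /proc/sys/net/ipv4/ip_forward)"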

soulshake added a commit to travis-ci/travis-cookbooks that referenced this issue Dec 10, 2018

soulshake commented Dec 18, 2018

We're in the process of creating updated xenial images which, once deployed this week, should fix this issue. (We're not making further updates to our trusty images as they will be EOL in April 2019.)

If you're not ready to update your Travis config to use dist: xenial yet, you can add your own file to /etc/sysctl.d/ that sorts after 99-gce.conf alphabetically and ensure it contains net.ipv4.ip_forward = 1, like so:

before_install:
  - echo "net.ipv4.ip_forward = 1" | sudo tee /etc/sysctl.d/99-ipv4-forward-pls.conf

Could you give this a try and let me know if you encounter any further failures?

Thank you all for your patience and help in tracking this down. We really appreciate it.

P.S. Further details from GCP support, for the curious:

  1. Docker sets net.ipv4.ip_forward=1 as part of its startup process.
  2. The kernel's sysctls are set via /etc/sysctl.conf and /etc/sysctl.d/*.
  3. There's an upstart task for "procps" that applies this configuration at startup.
  4. Updating the procps package causes the upstart task to be stopped and started, which resets the sysctls to their values in /etc.
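
A hedged illustration of that sequence on a trusty host (assuming upstart's procps job behaves as described; service procps start is the stock way to re-apply sysctl files on pre-systemd Ubuntu):

cat /proc/sys/net/ipv4/ip_forward   # 1, set by Docker at startup
sudo service procps start           # re-applies /etc/sysctl.conf and /etc/sysctl.d/*
cat /proc/sys/net/ipv4/ip_forward   # back to 0, unless a sysctl.d file sets it to 1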

soulshake commented Dec 19, 2018

Correction: the updated xenial images won't be deployed until the next time we release a new stable tag (which should be in early January). So you can either:

  • specify group: edge with dist: xenial to get this change in xenial now,
  • wait until we issue a runtime patch for trusty as well as xenial (this week), or
  • add the workaround described above to your .travis.yml.

Sorry for the earlier misinfo.

kennytm commented Dec 20, 2018

@soulshake Thanks! We're going to upgrade to xenial for now, and let the patch be applied naturally with the next stable release.

soulshake commented Jan 15, 2019

@kennytm We released a new stable Xenial image yesterday, so this issue should no longer occur. I'm going to mark it as resolved for now, but please feel free to reopen if you run into this behavior again.

Thanks again for your patience and helpful contributions! ❤️
