DNS lookup timeouts due to races in conntrack #3287

Open · dcowden opened this issue Apr 26, 2018 · 137 comments

@dcowden commented Apr 26, 2018

What happened?

We are experiencing random 5 second DNS timeouts in our kubernetes cluster.

How to reproduce it?

It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of 50 or 100 times) we get a 5 second delay. It always happens during the DNS lookup.

Anything else we need to know?

We believe this is a result of a kernel level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem happens with non-weave CNI implementations too, and is (ironically) not really a weave issue at all. However, it becomes a weave issue, because the solution is to set a flag on the masquerading rules that are created, and those rules are in no one's control except weave's.

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on the masquerading rules that weave sets up. In the above post, Flannel was in use, and the fix was applied there instead.

We searched for this issue and didn't see that anyone had asked for this. We're also unaware of any settings that allow setting this flag today; if that's possible, please let us know.
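For illustration, masquerade rules carrying the requested flag look like this in iptables-save format (the CIDR here is a placeholder, not our actual pod network; the real rules are whatever weave generates):

```
# <pod-cidr> is a placeholder for the weave pod network
-A WEAVE -s <pod-cidr> ! -d <pod-cidr> -j MASQUERADE --random-fully
-A WEAVE ! -s <pod-cidr> -d <pod-cidr> -j MASQUERADE --random-fully
```

With --random-fully, the kernel fully randomizes the source port it picks for each SNATed flow instead of trying to preserve the original port, which shrinks the window for the port-collision race.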

@bboreham (Contributor) commented Apr 26, 2018

Whoa! Good job for finding that.

However:

The iptables tool doesn't support setting this flag

this might be an issue.

@dcowden (Author) commented Apr 27, 2018

@bboreham my kernel networking Fu is weak, so I'm not even able to suggest any workarounds. I'm hoping others here have stronger Fu... Challenge proposed!

Naysayers frequently make scary, handwavy stability arguments against container stacks. Usually I laugh in the face of danger, but this appears to be the first case I've seen in which a little-known kernel-level gotcha actually does create issues for containers that would otherwise be unlikely to surface.

@btalbot commented Apr 27, 2018

I just spent several hours troubleshooting this problem, ran into the same XING blog post, and then this issue report, which was opened while I was troubleshooting!

Anyway, I'm seeing the same issues reported in the XING blog: DNS 5 second delays and a lot of insert_failed counts from conntrack, using weave 2.3.0.

cpu=0 found=8089 invalid=353025 ignore=1249480 insert=0 insert_failed=8042 drop=8042 early_drop=0 error=0 search_restart=591166

More details can be provided if needed.

@dcowden (Author) commented Apr 27, 2018

@btalbot one workaround you might try is to set this option in resolv.conf:

options single-request-reopen

It is a workaround that basically makes glibc retry the lookup, which works most of the time.

Another band-aid that helps is to change ndots from 5 (the default) to 3, which generates far fewer requests to your DNS servers and lessens the frequency.

The problem is that it's kind of a pain to force changes into resolv.conf. It's done with the kubelet --resolv-conf option, but then you have to create the whole file yourself, which stinks.

@dcowden (Author) commented Apr 27, 2018

@bboreham it does appear that a patched iptables is available. Can weave use a patched iptables?

@bboreham (Contributor) commented Apr 27, 2018

The easiest thing is to use an iptables from a released Alpine package. From there it gets progressively harder.

(Sorry for closing/reopening - finger slipped)

@bboreham closed this Apr 27, 2018
@bboreham reopened this Apr 27, 2018

@bboreham (Contributor) commented Apr 27, 2018

BTW my top tip to reduce DNS requests is to put a dot at the end when you know the full address, e.g. instead of example.com put example.com. (with the trailing dot). This means it will not go through the search path, reducing lookups by 5x in a typical Kubernetes install.

For an in-cluster address if you know the namespace you can construct the fqdn, e.g. servicename.namespacename.svc.cluster.local.

@dcowden (Author) commented Apr 27, 2018

@bboreham great tip, I didn't know that one! Thanks

@dcowden (Author) commented Apr 27, 2018

I did a little investigation on netfilter.org.
It appears that the iptables patch that adds --random-fully is in iptables v1.6.2, released on 2018-02-22.

alpine:latest packages v1.6.1; however, alpine:edge packages v1.6.2.
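As a quick sanity check before relying on the flag, a tiny version gate can tell you whether a given iptables build should understand --random-fully (the helper name and the idea of parsing the output of iptables --version are my own, not from this thread):

```python
# Hypothetical helper: returns True if an iptables version string is new
# enough to support MASQUERADE --random-fully (added in iptables v1.6.2).
def supports_random_fully(version: str) -> bool:
    # Parse e.g. "v1.6.2" or "1.6.2" into a comparable tuple of ints.
    parts = tuple(int(x) for x in version.lstrip("v").split(".")[:3])
    return parts >= (1, 6, 2)

print(supports_random_fully("1.6.1"))  # alpine:latest at the time -> False
print(supports_random_fully("1.6.2"))  # alpine:edge -> True
```

In practice you would feed this the second field of `iptables --version` output; anything older than 1.6.2 will reject --random-fully as an unknown option.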

@btalbot commented Apr 27, 2018

For an in-cluster address if you know the namespace you can construct the fqdn, e.g. servicename.namespacename.svc.cluster.local.

This only works for some apps or resolvers. The bind tools honor it of course, since that is a decades-old syntax for bind's zone files. But for any apps that try to fix up the address, or that use a different resolver, the trick doesn't work. Curl is a good example of it not working.

From inside an alpine container, curl https://kubernetes/ will hit the api server of course, but so does curl https://kubernetes./

@dcowden (Author) commented Apr 27, 2018

In our testing, we found that only the options single-request-reopen change actually addresses this issue. It's a band-aid, but DNS lookups are fast, so we get aberrations of around 100 ms rather than 5 seconds, which is acceptable for us.

Now we're trying to figure out how to inject that into resolv.conf on all the pods. Anyone know how to do that?

@btalbot commented Apr 27, 2018

I found this hack in some other related GitHub issues and it's working for me:

apiVersion: v1
data:
  resolv.conf: |
    nameserver 1.2.3.4
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:3 single-request-reopen
kind: ConfigMap
metadata:
  name: resolvconf

Then in your affected pods and containers

        volumeMounts:
        - name: resolv-conf
          mountPath: /etc/resolv.conf
          subPath: resolv.conf
...

      volumes:
      - name: resolv-conf
        configMap:
          name: resolvconf
          items:
          - key: resolv.conf
            path: resolv.conf

@dcowden (Author) commented Apr 27, 2018

@btalbot thanks for posting that. That would definitely work in a pinch!

We use kops for our cluster, and this seems promising, but I'm still learning how it works.

@Quentin-M commented May 1, 2018

Experiencing the same issue here: 5s delays on every single DNS lookup, 100% of the time. Similarly, insert_failed increases for each DNS query. The AAAA query, which happens a few cycles after the A query, gets dropped systematically (tcpdump: https://hastebin.com/banulayire.swift).

Mounting a resolv.conf by hand in every single pod of our infrastructure is untenable.
kubernetes/kubernetes#62764 attempts to add the workaround as a default in Kubernetes, but the PR is unlikely to land. And even if it does, it won't be released for a good while.

Here is the flannel patch: https://gist.github.com/maxlaverse/1fb3bfdd2509e317194280f530158c98

@dcowden (Author) commented May 1, 2018

@Quentin-M what k8s version are you using? I'm curious why it's 100% repeatable for some but intermittent for others.

Another method to inject the resolv.conf changes would be a deployment initializer. I've been trying to avoid creating one, but it's beginning to seem inevitable that in an enterprise environment you need a way to enforce various things on every launched workload in a central way.

I'm still investigating the use of kubelet --resolv-conf, but what I'm really worried about is that all of this is just a band-aid.

The only actual fix is the iptables flag.

@brb (Contributor) commented May 1, 2018

Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7?

@dcowden (Author) commented May 1, 2018

@brb I was wondering the same thing. It would be nice to make progress and get a PR ready in anticipation of the availability of 1.6.2. My go Fu is too weak to take a shot at making the fix, but I'm guessing the fix goes somewhere around expose.go?

If it were possible to create a frankenversion that has this fix, we could test it out.

@brb (Contributor) commented May 1, 2018

Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7?

Just installed it with apk add iptables --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/main/. However, I cannot guarantee that we aren't missing anything with iptables from edge on 3.7.

the fix goes somewhere around expose.go

Yes, you are right.

If it were possible to create a frankenversion that has this fix, we could test it out.

I've just created the weave-kube image with the fix for amd64 arch only and kernel >= 3.13 (https://github.com/weaveworks/weave/tree/issues/3287-iptables-random-fully). To use it, please change the image name of weave-kube to "brb0/weave-kube:iptables-random-fully" in DaemonSet of Weave.

@dcowden (Author) commented May 1, 2018

@brb Score! That's awesome! We'll try this out ASAP.
We're currently using image weaveworks/weave-kube:2.2.0, via a kops cluster. Would this image interoperate OK with that?

@brb (Contributor) commented May 1, 2018

I can't think of anything which would prevent it from working.

Please let us know whether it works, thanks!

@Quentin-M commented May 1, 2018

@dcowden Kubernetes 1.10.1, Container Linux 1688.5.3-1758.0.0, AWS VPCs, Weave 2.3.0, kube-proxy IPVS. My guess is that it depends on how fast/stable your network is?

@Quentin-M commented May 1, 2018

@dcowden

I'm still investigating the use of kubelet --resolve-conf, but what I'm really worried about is that all this is just a bandaid..

I tried it the other day; while it changed the resolv.conf of my static pods, all the other pods (with the default dnsPolicy) were still based on what dns.go constructs. Note that the DNS options are written as a constant there. There is no way to get single-request-reopen without running your own compiled version of the kubelet.

@Quentin-M commented May 1, 2018

@brb Thanks! I hadn't realized yesterday that the patched iptables was already in an Alpine release. My issue is definitely still present, and both insert_failed and drop are still increasing. I note, however, that there are two other MASQUERADE rules in place that do not have --random-fully, so that might be why? I am no network expert by any means, unfortunately.

# Setup by WEAVE too.
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

# Setup by both kubelet and kube-proxy, used to SNAT ports when querying services.
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

-A WEAVE ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE --random-fully
-A WEAVE -s 172.16.0.0/16 ! -d 172.16.0.0/16 -j MASQUERADE --random-fully

@dcowden (Author) commented May 1, 2018

@brb, I tried this out. I was able to upgrade successfully, but it didn't help my problem.

I think maybe I don't have it installed correctly, because my iptables rules do not show the --random-fully flag anywhere.

Here's my daemonset (annotations and stuff after the image omitted):

dcowden@ubuntu:~/gitwork/kubernetes$ kc get ds weave-net -n kube-system -o yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  ...omitted annotations...
  creationTimestamp: 2017-12-21T16:37:59Z
  generation: 4
  labels:
    name: weave-net
    role.kubernetes.io/networking: "1"
  name: weave-net
  namespace: kube-system
  resourceVersion: "21973562"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/weave-net
  uid: 4dd96bf2-e66d-11e7-8b61-069a0a6ccd8c
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: weave-net
      role.kubernetes.io/networking: "1"
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: weave-net
        role.kubernetes.io/networking: "1"
    spec:
      containers:
      - command:
        - /home/weave/launch.sh
        env:
        - name: WEAVE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: weave-passwd
              name: weave-passwd
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: IPALLOC_RANGE
          value: 100.96.0.0/11
        - name: WEAVE_MTU
          value: "8912"
        image: brb0/weave-kube:iptables-random-fully
        ...more stuff...

The daemonset was updated OK. Here are the iptables rules I see on a host. I don't see --random-fully anywhere:

[root@ip-172-25-19-92 ~]# iptables --list-rules
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-N KUBE-FORWARD
-N KUBE-SERVICES
-N WEAVE-IPSEC-IN
-N WEAVE-NPC
-N WEAVE-NPC-DEFAULT
-N WEAVE-NPC-INGRESS
-A INPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -j KUBE-FIREWALL
-A INPUT -j WEAVE-IPSEC-IN
-A FORWARD -o weave -m comment --comment "NOTE: this must go before \'-j KUBE-FORWARD\'" -j WEAVE-NPC
-A FORWARD -o weave -m state --state NEW -j NFLOG --nflog-group 86
-A FORWARD -o weave -j DROP
-A FORWARD -i weave ! -o weave -j ACCEPT
-A FORWARD -o weave -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -m comment --comment "kubernetes forward rules" -j KUBE-FORWARD
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A OUTPUT ! -p esp -m policy --dir out --pol none -m mark --mark 0x20000/0x20000 -j DROP
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 100.96.0.0/11 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 100.96.0.0/11 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-SERVICES -d 100.65.65.105/32 -p tcp -m comment --comment "default/schaeffler-logstash:http has no endpoints" -m tcp --dport 9600 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -p tcp -m comment --comment "ops/echoheaders:http has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31436 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 100.69.172.111/32 -p tcp -m comment --comment "ops/echoheaders:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A WEAVE-IPSEC-IN -s 172.25.83.126/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.83.234/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.83.40/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.21/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.170/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.29/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.19.130/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-NPC -m state --state RELATED,ESTABLISHED -j ACCEPT
-A WEAVE-NPC -d 224.0.0.0/4 -j ACCEPT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-DEFAULT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-INGRESS
-A WEAVE-NPC -m set ! --match-set weave-local-pods dst -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-f(09:Q6gzJb~LE_pU4n:@416L dst -m comment --comment "DefaultAllow isolation for namespace: ops" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-jXXXW48#WnolRYPFUalO(fLpK dst -m comment --comment "DefaultAllow isolation for namespace: troubleshooting" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-E.1.0W^NGSp]0_t5WwH/]gX@L dst -m comment --comment "DefaultAllow isolation for namespace: default" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-0EHD/vdN#O4]V?o4Tx7kS;APH dst -m comment --comment "DefaultAllow isolation for namespace: kube-public" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-?b%zl9GIe0AET1(QI^7NWe*fO dst -m comment --comment "DefaultAllow isolation for namespace: kube-system" -j ACCEPT

I don't know what to try next.

@Quentin-M commented May 1, 2018

@dcowden You need to make sure you are calling iptables 1.6.2, otherwise you will not see the flag. One solution is to run iptables from within the weave container. As with you, it did not help my issue; the first AAAA query still appears to be dropped. I am compiling kube-proxy/kubelet to add the fully-random flag there as well, but this is going to take a while.

@bboreham (Contributor) commented Mar 7, 2019

there is still a problem when 2+ containers connect to google.com at the same time?

[EDIT: I was confused so scoring out this part. See later comment too.]
Those (TCP) connections are never a problem, because they will come from unique source ports.

The problem [EDIT: in this specific GitHub issue] comes when certain DNS clients make two simultaneous UDP requests with identical source ports (and the destination port is always 53), so we get a race.

The best mitigation is a DNS service which does not go via NAT. This is being worked on in Kubernetes: basically one DNS cache per node, with NAT disabled for on-node connections.

@krzysztof-bronk commented Mar 7, 2019

But isn't there a race condition in that source-port uniqueness algorithm during SNAT, regardless of protocol, affecting different pods on the same host in the same way as the DNS UDP client issue does within one? Basically as in https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

@bboreham (Contributor) commented Mar 7, 2019

Sorry, yes, there is a different race condition to do with picking unique outgoing ports for SNAT.

If you are actually encountering this please open a new issue giving the details.
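To make that SNAT race concrete, here is a toy model (pure Python, the structure is my own simplification, not kernel code) of two flows each choosing a "free" source port before either has inserted its conntrack entry:

```python
# Toy model of the SNAT port-allocation race: both flows scan the conntrack
# table for a free port BEFORE either one inserts its entry, so they pick
# the same port; the second insert then fails and the packet is dropped
# (this is what the insert_failed conntrack counter records).
conntrack = set()

def pick_free_port(table, start=32768):
    # Linear scan for the first unused port, mimicking a "preserve then
    # probe upward" allocation strategy.
    port = start
    while port in table:
        port += 1
    return port

# The race window: both reads happen before either write.
port_a = pick_free_port(conntrack)
port_b = pick_free_port(conntrack)

insert_failed = 0
for port in (port_a, port_b):
    if port in conntrack:
        insert_failed += 1   # collision: this packet is dropped
    else:
        conntrack.add(port)

print(port_a == port_b, insert_failed)  # True 1
```

--random-fully narrows this window by making each allocation start from a random port instead of the same deterministic starting point, so two racing flows rarely pick the same candidate.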

@krzysztof-bronk commented Mar 7, 2019

Thank you for the response. Indeed I'm seeing insert_failed despite implementing several workarounds, and I'm not sure whether it's TCP, UDP, SNAT or DNAT. We can't bump the kernel yet.

If I understood correctly, the SNAT case should be mitigated by the "random fully" flag, but Weave never went on with it? I think kubelet and kube-proxy would need it as well anyway; I don't know where things stand there.

There is one more head-scratching case for me, which is how all those cases fare when one uses NodePort. Isn't there a similar conntrack problem if NodePort forwards to a cluster IP?

@bboreham (Contributor) commented Mar 7, 2019

the "random fully" flag, but Weave never went on with it?

We investigated the problem reported here, and developed fixes to that problem. If someone reports symptoms that are improved by "random fully" then we might add that. We have finite resources and have to concentrate on what is actually reported (and within that set, on paying customers).

Or, since it's Open Source, anyone else can do the investigation and contribute a PR.

@krzysztof-bronk commented Mar 7, 2019

I understand :) I was merely trying to comprehend where things stand with regard to the different races and available mitigations, since there are several blog posts and several GitHub issues with a massive number of comments to parse.

From my understanding of all of it, even with the 2 kernel fixes, the DNS workarounds, and the iptables flags, there is still an issue at least with multi-pod -> Cluster IP multi-pod connections, and without kernel 5.0 or "random fully" also an issue with a simple multi-pod -> External IP connection.

But yeah, I'll raise a new issue if that proves true and impactful enough for us in production. Thank you.

@Krishna1408 commented Jul 16, 2019

@Quentin-M @brb We are using weave as well for our CNI, and I tried the workaround mentioned by @Quentin-M. But I am getting an error:

No distribution data for pareto (/lib/tc//pareto.dist: No such file or directory)

I am using debian: 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux

and I have mounted on /usr/lib/tc.

Can you please point out where I am going wrong?

    spec:
      containers:
      - name: weave-tc
        image: 'qmachu/weave-tc:0.0.1'
        securityContext:
          privileged: true
        volumeMounts:
          - name: xtables-lock
            mountPath: /run/xtables.lock
          - name: usr-lib-tc
            mountPath: /usr/lib/tc

      volumes:
      - hostPath:
          path: /usr/lib/tc
          type: ""
        name: usr-lib-tc

Edit:
In the container spec, the usr-lib-tc volumeMount needs an update: it should be /lib/tc instead of /usr/lib/tc.

@hairyhenderson (Contributor) commented Jul 16, 2019

@Krishna1408 If you change mountPath: /usr/lib/tc to mountPath: /lib/tc it should work. It needs to be mounted in /lib/tc inside the container, but it's (usually) /usr/lib/tc on the host.
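In other words, the corrected fragment of the manifest above (only the mountPath changes; the hostPath volume stays the same):

```yaml
        volumeMounts:
          - name: xtables-lock
            mountPath: /run/xtables.lock
          - name: usr-lib-tc
            mountPath: /lib/tc   # container path; the hostPath stays /usr/lib/tc
```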

@Krishna1408 commented Jul 17, 2019

Hi @hairyhenderson, thanks a lot, it works for me :)

@phlegx commented Oct 6, 2019

@brb May I ask if the problem (5 sec DNS delay) is solved with the 5.x kernel? Do you have more details and feedback from people already?

@brb (Contributor) commented Oct 7, 2019

@phlegx It depends on which race condition you hit. The first two out of three were fixed in the kernel, and someone reported success (kubernetes/kubernetes#56903 (comment)).

However, not much can be done from the kernel side about the third race condition. See my comments in the linked issue.

@bboreham (Contributor) commented Oct 7, 2019

I will repeat what a few others have said in this thread: the best way forward, if you have this problem, is “node-local dns”. Then there is no NAT on the DNS requests from pods and so no race condition.

Support for this configuration is slowly improving in Kubernetes and installers.

@phlegx commented Oct 7, 2019

We upgraded to Linux 5.x now, and for now the "5 second" problem seems to be "solved". We still need to check on the third race condition. Thanks for your replies!

@insoz commented Oct 15, 2019

We upgraded to Linux 5.x now, and for now the "5 second" problem seems to be "solved". We still need to check on the third race condition. Thanks for your replies!

You mean Linux 5.x as in kernel 5.x?

@thockin commented Apr 10, 2020

I just wanted to pop in and say thanks for this excellent and detailed explanation. Two years since it was filed and one year since it was fixed, some people still hit this issue, and frankly the DNAT part of it had me baffled.

It took a bit of reasoning, but as I understand it: the client sends multiple UDP requests on the same {source IP, source port, dest IP, dest port, protocol} 5-tuple, and one just gets lost. Since clients are INTENTIONALLY sending them in parallel, the race is exacerbated.
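That "same 5-tuple" condition is easy to demonstrate outside of DNS entirely; here is a small sketch (loopback sockets standing in for the stub resolver and the DNS server; the payloads and names are mine):

```python
import socket

# Two datagrams sent from ONE client socket (as a stub resolver does for the
# parallel A and AAAA queries) necessarily share the same 5-tuple
# {src IP, src port, dst IP, dst port, UDP} -- the precondition for the
# conntrack DNAT race described above.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))      # ephemeral port, stands in for port 53
server.settimeout(5)

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for query in (b"A? example.com", b"AAAA? example.com"):
    client.sendto(query, server.getsockname())

sources = set()
for _ in range(2):
    _, addr = server.recvfrom(512)
    sources.add(addr)              # record each packet's source IP:port

print(len(sources))                # 1: both queries from the same src port
client.close()
server.close()
```

A resolver that serialized the queries on separate sockets (or used single-request-reopen after a timeout) would produce two distinct source ports instead, which is why that glibc option sidesteps the race.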

@DerGenaue commented Apr 12, 2020

I was able to solve the issue by using the SessionAffinity feature of Kubernetes:
configuring the kube-dns service in the kube-system namespace from None to
service.spec.sessionAffinity: ClientIP
resolved it basically immediately on our cluster.
I can't tell how long it will last, though; I expect the next Kubernetes upgrade to revert that setting.
I'm pretty sure this shouldn't have any problematic side effects, but I cannot tell for sure.

This solution makes all DNS request packets from one pod be delivered to the same kube-dns pod, thus eliminating the problem that the conntrack DNAT race condition causes
(the race condition still exists, it just no longer has any effect).
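For reference, the change above expressed as a Service manifest fragment (the field values other than sessionAffinity are illustrative; on a live cluster you would edit or patch the existing kube-dns Service rather than apply this as-is):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  sessionAffinity: ClientIP        # was: None
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # the default affinity timeout
```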

@bboreham (Contributor) commented Apr 12, 2020

@DerGenaue as far as I can tell sessionAffinity only works with proxy-mode userspace, which will slow down service traffic to an extent that some people will not tolerate.


@DerGenaue commented Apr 12, 2020

I checked the kube-proxy code, and the iptables version generates sessionAffinity rules just fine.
I don't think any single pod will ever make so many DNS requests as to cause problems in this regard.
Also, the way I understood it, the current plan for the future is to route all DNS requests to the DNS pod running on the same node (i.e. node-local DNS traffic only), which would be very similar to this solution.

@elmiedo commented May 22, 2020

Hi. Why not implement dnsmasq instead of working with the usual DNS clients?
Dnsmasq is able to send a DNS query to every DNS server in its config file simultaneously; you simply receive the fastest reply.

@bboreham (Contributor) commented May 22, 2020

@elmiedo it is uncommon to have the opportunity to change the DNS client - it's baked into each container image, in code from glibc or musl or similar. And the problem we are discussing hits between that client and the Linux kernel, so the server (such as dnsmasq) does not get a chance to affect things.

@chengzhycn commented Oct 26, 2021

We wrote a blog post describing the technical details of the problem and presenting the kernel fixes: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.

@brb Thanks for your excellent explanation. But there is a small point that confuses me. I looked at the glibc source code: it uses send_dg to send the A and AAAA queries in parallel via UDP. But it just calls sendmmsg, which seems to send the two UDP packets from one thread (which doesn't match the "different threads" condition). Am I misunderstanding something? Looking forward to your reply. :)

@axot commented Oct 26, 2021

https://elixir.bootlin.com/linux/v5.14.14/source/net/socket.c#L2548
Same question: could the two packets still be processed on different CPUs because of cond_resched()?
