
DNS lookup timeouts due to races in conntrack #3287

Open
dcowden opened this issue Apr 26, 2018 · 121 comments

@dcowden commented Apr 26, 2018

What happened?

We are experiencing random 5 second DNS timeouts in our kubernetes cluster.

How to reproduce it?

It is reproducible by requesting just about any in-cluster service and observing that periodically (in our case, 1 out of every 50 to 100 requests) we get a 5 second delay. It always happens during the DNS lookup.

Anything else we need to know?

We believe this is a result of a kernel level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem also happens with non-weave CNI implementations, and is (ironically) not really a weave issue at all. However, it becomes a weave issue, because the solution is to set a flag on the masquerading rules that are created, and those rules are not in anyone's control except for weave.

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on the masquerading rules that weave sets up. In the above post, Flannel was in use, and the fix was applied there instead.

We searched for this issue and didn't see that anyone had asked for this. We're also unaware of any settings that allow setting this flag today -- if that's possible, please let us know.
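For illustration, the kind of rule change being asked for would look roughly like this (a sketch; the WEAVE chain and CIDR shown here are taken from rules posted later in this thread, and --random-fully requires iptables >= 1.6.2 and a kernel >= 3.13):

# Weave's existing masquerade rule (the CIDR is cluster-specific):
-A WEAVE -s 172.16.0.0/16 ! -d 172.16.0.0/16 -j MASQUERADE
# The same rule with the requested flag applied:
-A WEAVE -s 172.16.0.0/16 ! -d 172.16.0.0/16 -j MASQUERADE --random-fully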

@bboreham (Member) commented Apr 26, 2018

Whoa! Good job for finding that.

However:

The iptables tool doesn't support setting this flag

this might be an issue.

@dcowden (Author) commented Apr 27, 2018

@bboreham my kernel networking fu is weak, so I'm not even able to suggest any workarounds. I'm hoping others here have stronger fu... Challenge proposed!

Naysayers frequently make scary, hand-wavy stability arguments against container stacks. Usually I laugh in the face of danger, but this appears to be the first case I've seen in which a little-known kernel-level gotcha actually does create issues for containers that would otherwise be unlikely to surface.

@btalbot commented Apr 27, 2018

I just spent several hours troubleshooting this problem, ran into the same XING blog post, and then this issue report, which was opened while I was troubleshooting!

Anyway, I'm seeing the same issues reported in the XING blog: 5 second DNS delays and a lot of insert_failed counts from conntrack, using weave 2.3.0.

cpu=0 found=8089 invalid=353025 ignore=1249480 insert=0 insert_failed=8042 drop=8042 early_drop=0 error=0 search_restart=591166

More details can be provided if needed.
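(For reference, per-CPU counters of this shape are what conntrack -S prints; a quick way to watch the failure counter, assuming conntrack-tools is installed on the node, is a sketch like this:)

# Sum insert_failed across all CPUs; a steadily growing total correlates
# with the dropped DNS packets described in the XING post.
conntrack -S | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^insert_failed=/) { split($i, kv, "="); sum += kv[2] } } END { print sum }'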

@dcowden (Author) commented Apr 27, 2018

@btalbot one workaround you might try is to set this option in resolv.conf:

options single-request-reopen

It is a workaround that will basically make glibc retry the lookup, which will work most of the time.

Another band-aid that helps is to change ndots from 5 (the default) to 3, which will generate far fewer requests to your DNS servers and lessen the frequency of the problem.

The problem is that it's kind of a pain to force changes into resolv.conf. It can be done with the kubelet --resolv-conf option, but then you have to create the whole file yourself, which stinks.
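(On Kubernetes versions where pod-level dnsConfig is available, another way to inject these options without replacing the whole file is per-pod dnsConfig; a minimal sketch, with a hypothetical pod name, not a drop-in for any particular workload:)

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical pod, for illustration only
spec:
  dnsConfig:
    options:
    - name: single-request-reopen
    - name: ndots
      value: "3"
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]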

@dcowden (Author) commented Apr 27, 2018

@bboreham it does appear that the patched iptables is available. Can weave use a patched iptables?

@bboreham (Member) commented Apr 27, 2018

The easiest thing is to use an iptables from a released Alpine package. From there it gets progressively harder.

(Sorry for closing/reopening - finger slipped)

@bboreham bboreham closed this Apr 27, 2018

@bboreham bboreham reopened this Apr 27, 2018

@bboreham (Member) commented Apr 27, 2018

BTW, my top tip to reduce DNS requests is to put a dot at the end when you know the full address, e.g. instead of example.com use example.com. (note the trailing dot). This means it will not go through the search path, reducing lookups by 5x in a typical Kubernetes install.

For an in-cluster address if you know the namespace you can construct the fqdn, e.g. servicename.namespacename.svc.cluster.local.
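(To make the 5x concrete: with a typical Kubernetes search path and ndots:5, an unqualified name fans out into several queries, while a trailing dot goes straight to the root. A sketch, with the cloud-provider suffix as a placeholder:)

# resolv.conf: search default.svc.cluster.local svc.cluster.local cluster.local <provider-suffix>
#              options ndots:5
# Looking up "example.com" typically tries, in order:
#   example.com.default.svc.cluster.local.
#   example.com.svc.cluster.local.
#   example.com.cluster.local.
#   example.com.<provider-suffix>.
#   example.com.
# Looking up "example.com." (trailing dot) issues only that last query.
nslookup example.com.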

@dcowden (Author) commented Apr 27, 2018

@bboreham great tip, I didn't know that one! Thanks

@dcowden (Author) commented Apr 27, 2018

I did a little investigation on netfilter.org. It appears that the iptables patch that adds --random-fully is in iptables v1.6.2, released on 2018-02-22.

alpine:latest packages v1.6.1; however, alpine:edge packages v1.6.2.

@btalbot commented Apr 27, 2018

For an in-cluster address if you know the namespace you can construct the fqdn, e.g. servicename.namespacename.svc.cluster.local.

This only works for some apps or resolvers. The BIND tools honor it, of course, since that is a decades-old syntax from BIND's zone files. But for apps that try to fix up an address themselves or use a different resolver, that trick doesn't work. curl is a good example of it not working.

From inside an alpine container, curl https://kubernetes/ will hit the API server of course, but so does curl https://kubernetes./

@dcowden (Author) commented Apr 27, 2018

In our testing, we have found that only the options single-request-reopen change actually addresses this issue. It's a band-aid -- but DNS lookups are fast, so we get aberrations of around 100ms, not 5 seconds, which is acceptable for us.

Now we're trying to figure out how to inject that into resolv.conf on all the pods. Anyone know how to do that?

@btalbot commented Apr 27, 2018

I found this hack in some other related GitHub issues, and it's working for me:

apiVersion: v1
data:
  resolv.conf: |
    nameserver 1.2.3.4
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:3 single-request-reopen
kind: ConfigMap
metadata:
  name: resolvconf

Then in your affected pods and containers:

        volumeMounts:
        - name: resolv-conf
          mountPath: /etc/resolv.conf
          subPath: resolv.conf
...

      volumes:
      - name: resolv-conf
        configMap:
          name: resolvconf
          items:
          - key: resolv.conf
            path: resolv.conf
@dcowden (Author) commented Apr 27, 2018

@btalbot thanks for posting that. That would definitely work in a pinch!

We use kops for our cluster, and this seems promising, but I'm still learning how it works.

@Quentin-M commented May 1, 2018

Experiencing the same issue here: 5s delays on every single DNS lookup, 100% of the time. Similarly, insert_failed does increase for each DNS query. The AAAA query, which happens a few cycles after the A query, gets dropped systematically (tcpdump: https://hastebin.com/banulayire.swift).

Mounting a resolv.conf by hand in every single pod of our infrastructure is untenable.
kubernetes/kubernetes#62764 attempts to add the workaround as a default in Kubernetes, but the PR is unlikely to land. And even if it does, it won't be released for a good while.

Here is the flannel patch: https://gist.github.com/maxlaverse/1fb3bfdd2509e317194280f530158c98

@dcowden (Author) commented May 1, 2018

@Quentin-M what k8s version are you using? I'm curious why it's 100% repeatable for some but intermittent for others.

Another method to inject the resolv.conf change would be a deployment initializer. I've been trying to avoid creating one, but it's beginning to seem inevitable that in an enterprise environment you need a way to enforce various things on every launched workload in a central way.

I'm still investigating the use of kubelet --resolv-conf, but what I'm really worried about is that all of this is just a band-aid.

The only actual fix is the iptables flag.

@brb (Contributor) commented May 1, 2018

Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7?

@dcowden (Author) commented May 1, 2018

@brb I was wondering the same thing. It would be nice to make progress and get a PR ready in anticipation of the availability of 1.6.2. My Go fu is too weak to take a shot at making the fix, but I'm guessing the fix goes somewhere around expose.go?

If it were possible to create a frankenversion that has this fix, we could test it out.

@brb (Contributor) commented May 1, 2018

Has anyone tried installing and running iptables-1.6.2 from the alpine packages for edge on Alpine 3.7?

Just installed it with apk add iptables --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/main/. However, I cannot guarantee that nothing is missed when using iptables from edge on 3.7.

the fix goes somewhere around expose.go

Yes, you are right.

If it were possible to create a frankenversion that has this fix, we could test it out.

I've just created a weave-kube image with the fix, for the amd64 arch only and kernel >= 3.13 (https://github.com/weaveworks/weave/tree/issues/3287-iptables-random-fully). To use it, please change the image name of weave-kube to "brb0/weave-kube:iptables-random-fully" in the Weave DaemonSet.
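(If you're applying that change by hand, something like the following should do it; this sketch assumes the container inside the weave-net DaemonSet is named weave, as in the stock manifest:)

kubectl -n kube-system set image daemonset/weave-net weave=brb0/weave-kube:iptables-random-fully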

@dcowden (Author) commented May 1, 2018

@brb Score! That's awesome! We'll try this out ASAP!
We're currently using the image weaveworks/weave-kube:2.2.0, via a kops cluster. Would this image interoperate OK with that?

@brb (Contributor) commented May 1, 2018

I can't think of anything which would prevent it from working.

Please let us know whether it works, thanks!

@Quentin-M commented May 1, 2018

@dcowden Kubernetes 1.10.1, Container Linux 1688.5.3-1758.0.0, AWS VPCs, Weave 2.3.0, kube-proxy in IPVS mode. My guess is that it depends on how fast/stable your network is?

@Quentin-M commented May 1, 2018

@dcowden

I'm still investigating the use of kubelet --resolve-conf, but what I'm really worried about is that all this is just a bandaid..

I tried it the other day; while it changed the resolv.conf of my static pods, all the other pods (with the default dnsPolicy) were still based on what dns.go constructs. Note that the DNS options are written as a constant there, so there is no way to get single-request-reopen without running your own compiled version of the kubelet.

@Quentin-M commented May 1, 2018

@brb Thanks! I hadn't realized yesterday that the patched iptables was already in an Alpine release. My issue is definitely still present, and both insert_failed and drop are still increasing. I note, however, that there are two other MASQUERADE rules in place that do not have --random-fully, so that might be why? I am no network expert by any means, unfortunately.

# Setup by WEAVE too.
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

# Setup by both kubelet and kube-proxy, used to SNAT ports when querying services.
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE

-A WEAVE ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE --random-fully
-A WEAVE -s 172.16.0.0/16 ! -d 172.16.0.0/16 -j MASQUERADE --random-fully
@dcowden (Author) commented May 1, 2018

@brb, I tried this out. I was able to upgrade successfully, but it didn't help my problem.

I think maybe I don't have it installed correctly, because my iptables rules do not show the --random-fully flag anywhere.

Here's my daemonset (annotations and stuff after the image omitted):

dcowden@ubuntu:~/gitwork/kubernetes$ kc get ds weave-net -n kube-system -o yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  ...omitted annotations...
  creationTimestamp: 2017-12-21T16:37:59Z
  generation: 4
  labels:
    name: weave-net
    role.kubernetes.io/networking: "1"
  name: weave-net
  namespace: kube-system
  resourceVersion: "21973562"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/weave-net
  uid: 4dd96bf2-e66d-11e7-8b61-069a0a6ccd8c
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: weave-net
      role.kubernetes.io/networking: "1"
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: weave-net
        role.kubernetes.io/networking: "1"
    spec:
      containers:
      - command:
        - /home/weave/launch.sh
        env:
        - name: WEAVE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: weave-passwd
              name: weave-passwd
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: IPALLOC_RANGE
          value: 100.96.0.0/11
        - name: WEAVE_MTU
          value: "8912"
        image: brb0/weave-kube:iptables-random-fully
        ...more stuff...

The daemonset was updated OK. Here are the iptables rules I see on a host. I don't see --random-fully anywhere:

[root@ip-172-25-19-92 ~]# iptables --list-rules
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N KUBE-FIREWALL
-N KUBE-FORWARD
-N KUBE-SERVICES
-N WEAVE-IPSEC-IN
-N WEAVE-NPC
-N WEAVE-NPC-DEFAULT
-N WEAVE-NPC-INGRESS
-A INPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -j KUBE-FIREWALL
-A INPUT -j WEAVE-IPSEC-IN
-A FORWARD -o weave -m comment --comment "NOTE: this must go before \'-j KUBE-FORWARD\'" -j WEAVE-NPC
-A FORWARD -o weave -m state --state NEW -j NFLOG --nflog-group 86
-A FORWARD -o weave -j DROP
-A FORWARD -i weave ! -o weave -j ACCEPT
-A FORWARD -o weave -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -m comment --comment "kubernetes forward rules" -j KUBE-FORWARD
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A OUTPUT ! -p esp -m policy --dir out --pol none -m mark --mark 0x20000/0x20000 -j DROP
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-FORWARD -s 100.96.0.0/11 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-FORWARD -d 100.96.0.0/11 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A KUBE-SERVICES -d 100.65.65.105/32 -p tcp -m comment --comment "default/schaeffler-logstash:http has no endpoints" -m tcp --dport 9600 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -p tcp -m comment --comment "ops/echoheaders:http has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31436 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 100.69.172.111/32 -p tcp -m comment --comment "ops/echoheaders:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A WEAVE-IPSEC-IN -s 172.25.83.126/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.83.234/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.83.40/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.21/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.170/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.51.29/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-IPSEC-IN -s 172.25.19.130/32 -d 172.25.19.92/32 -p udp -m udp --dport 6784 -m mark ! --mark 0x20000/0x20000 -j DROP
-A WEAVE-NPC -m state --state RELATED,ESTABLISHED -j ACCEPT
-A WEAVE-NPC -d 224.0.0.0/4 -j ACCEPT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-DEFAULT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-INGRESS
-A WEAVE-NPC -m set ! --match-set weave-local-pods dst -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-f(09:Q6gzJb~LE_pU4n:@416L dst -m comment --comment "DefaultAllow isolation for namespace: ops" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-jXXXW48#WnolRYPFUalO(fLpK dst -m comment --comment "DefaultAllow isolation for namespace: troubleshooting" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-E.1.0W^NGSp]0_t5WwH/]gX@L dst -m comment --comment "DefaultAllow isolation for namespace: default" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-0EHD/vdN#O4]V?o4Tx7kS;APH dst -m comment --comment "DefaultAllow isolation for namespace: kube-public" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-?b%zl9GIe0AET1(QI^7NWe*fO dst -m comment --comment "DefaultAllow isolation for namespace: kube-system" -j ACCEPT

I don't know what to try next.

@Quentin-M commented May 1, 2018

@dcowden You need to make sure you are calling iptables 1.6.2, otherwise you will not see the flag. One solution is to run iptables from within the weave container. As with you, it did not help my issue; the first AAAA query still appears to be dropped. I am compiling kube-proxy/kubelet to add the --random-fully flag there as well, but this is going to take a while.
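(Also worth noting: iptables --list-rules without -t nat only shows the filter table, which is why the listing above contains no MASQUERADE rules at all. A check that covers both points might look like the following sketch, with the pod name left as a placeholder and assuming the container is named weave:)

# Uses the iptables binary shipped in the weave container (>= 1.6.2 in the patched image),
# and looks in the nat table, where the MASQUERADE rules live.
kubectl -n kube-system exec <weave-net-pod> -c weave -- iptables -t nat -S | grep -- --random-fully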

@Quentin-M commented Jul 18, 2018

@brycesteinhoff commented Jul 18, 2018

@Quentin-M Thanks for the quick reply!

I just noticed after I posted that your shell script was marking traffic destined for port 5353. I've changed that to 53, as we're seeing problems with standard DNS, and will continue to monitor. So far it seems it may be better, but I still see some delay (~2.5s) on some requests.

Our interface is called "weave" also, so I left that the same.

I haven't fully dug in to understand your script; I need to familiarize myself with tc. Are there any other aspects I should consider adjusting?

@brb (Contributor) commented Aug 5, 2018

Just to update: I've submitted two patches to fix the conntrack races in the kernel - http://patchwork.ozlabs.org/patch/937963/ (accepted) and http://patchwork.ozlabs.org/patch/952939/ (waiting for review).

If both are accepted, then the timeout cases due to the races will be eliminated for those who run only one instance of a DNS server; for others, the timeout hit rate should decrease.

Completely eliminating the timeouts when there is more than one DNS server instance is a non-trivial task and is still WIP.

@bboreham (Member) commented Aug 10, 2018

Do we envisage setting NF_NAT_RANGE_PROTO_RANDOM_FULLY inside Weave Net?
If not I would re-title this issue to match the broader problem.

@brb brb changed the title Feature Request: Set NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on masquerading rules DNS lookup timeouts due to races in conntrack Aug 13, 2018

@brb (Contributor) commented Aug 17, 2018

We wrote a blog post describing the technical details of the problem and presenting the kernel fixes: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts.

@jaygorrell commented Aug 20, 2018

This thread was immensely helpful - thanks to all who contributed. Simply adding the trailing . was the easiest fix for most of my cases and works great. The one thing I'm still not fully understanding is how internal (i.e. .default) DNS lookups would sometimes fail. I can try the trailing dot, but this is largely around external lookups that go through kube-dns, right?

I would have expected something like service.default. to fail without specifying the full FQDN, since this would skip the search domains, but it appears to be working fine -- though by working, I don't necessarily mean it avoids the timeout problem. If it can't be resolved externally, does it then fall back to the search path?

@bboreham (Member) commented Aug 20, 2018

Adding a trailing dot reduces the chance of failure since it reduces the number of lookups by (typically) 5x. It doesn't prevent any underlying problem.

DNS resolvers vary, and they are linked into your client program, so I don't know which one you are using. However, I would expect a fully-qualified name like service.default. to never hit the search list.

@jaygorrell commented Aug 20, 2018

That was just trying curl from the container. Good point though... it probably doesn't follow the same rules.

@brb (Contributor) commented Feb 11, 2019

The second kernel patch to mitigate the problem got accepted (context: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts) and it is out in Linux 5.0-rc6.

Please test it and report whether it has reduced the timeout hit rate. Thanks.

fauzigo added a commit to uc-cdis/cloud-automation that referenced this issue Feb 28, 2019

add(TCP traffic): for DNS
based on weaveworks/weave#3287 (comment)
Thanks Zac for the quick find of this issue

fauzigo added a commit to uc-cdis/cloud-automation that referenced this issue Mar 1, 2019

add(TCP traffic): for DNS (#712)
* add(TCP traffic): for DNS

based on weaveworks/weave#3287 (comment)
Thanks Zac for the quick find of this issue

* chore(usersync): also using TCP for DNS
@krzysztof-bronk commented Mar 4, 2019

That's good news; thank you all for your investigative work so far.

I'm still a bit unclear as to which solution applies to which case, or more importantly, which cases do not have solutions yet.
Let me (re)state some of the findings gathered from various blogs and other GitHub issues - please correct me if any of it is wrong as of what we know today.

The issue exists for both SNAT and DNAT.
The issue exists for both UDP and TCP.
conntrack -S counts failed insertions for both UDP and TCP, so the packets counted there can mean a 5 second delay in the case of DNS, and 1, 3, etc. seconds for TCP retransmissions.

To mitigate the issue, one can for example use single-request-reopen in resolv.conf, if the container image uses glibc (which rules out Alpine), or use weave-tc to introduce micro-delays for DNS packets. Disabling IPv6 or using FQDNs are quite niche solutions, so let's leave them aside for now.

But both solutions are for DNS (weave-tc being UDP-only on top of that); external TCP connections will still have a problem. Admittedly, the DNS virtual IP is probably the most used "service" in the cluster, and it is the topic of this issue.

The 2 fixes in the kernel solve the issue, but only if you run a single DNS pod (or one per node with pods only connecting to that local one). I think weave-tc also does not guarantee 100% effectiveness in the multiple-pod case.

By the way, which kernel version contains the 1st fix? I understand the second is 5.0+.
And more importantly, do those fixes work for both SNAT and DNAT, and for both TCP and UDP?

In other words, given that moving to kernel 5.0+ is quite the leap for some, does it mean, in the simplest terms, that even if you introduce all possible mentioned workarounds, without those 2 kernel fixes, there is still a problem when 2+ containers connect to google.com at the same time?

(I'm excluding "workarounds" such as not using overlay networks at all, although as I understand that would actually work)

@chris530 commented Mar 6, 2019

Launched https://github.com/Quentin-M/weave-tc as a DaemonSet in k8s, and it immediately fixed the issue.

@bboreham (Member) commented Mar 7, 2019

there is still a problem when 2+ containers connect to google.com at the same time?

[EDIT: I was confused so scoring out this part. See later comment too.]
Those (TCP) connections are never a problem, because they will come from unique source ports.

The problem [EDIT: in this specific GitHub issue] comes when certain DNS clients make two simultaneous UDP requests with identical source ports (and the destination port is always 53), so we get a race.

The best mitigation is a DNS service which does not go via NAT. This is being worked on in Kubernetes: basically one DNS instance per node, and disabling NAT for on-node connections.

@krzysztof-bronk commented Mar 7, 2019

But isn't there a race condition in that source-port uniqueness algorithm during SNAT, regardless of protocol, affecting different pods on the same host in the same way as the DNS UDP client issue within one pod? Basically as in https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

@bboreham (Member) commented Mar 7, 2019

Sorry, yes, there is a different race condition to do with picking unique outgoing ports for SNAT.

If you are actually encountering this please open a new issue giving the details.

@krzysztof-bronk commented Mar 7, 2019

Thank you for the response. Indeed, I'm seeing insert_failed despite implementing several workarounds, and I'm not sure whether it's TCP, UDP, SNAT or DNAT. We can't bump the kernel yet.

If I understood correctly, the SNAT case should be mitigated by the "random fully" flag, but Weave never went ahead with it? I think kubelet and kube-proxy would need it as well anyway; I don't know where things stand there.

There is one more head-scratching case for me, which is how all of this fares when one uses a NodePort. Isn't there a similar conntrack problem if the NodePort forwards to a cluster IP?

@bboreham (Member) commented Mar 7, 2019

the "random fully" flag, but Weave never went on with it?

We investigated the problem reported here, and developed fixes to that problem. If someone reports symptoms that are improved by "random fully" then we might add that. We have finite resources and have to concentrate on what is actually reported (and within that set, on paying customers).

Or, since it's Open Source, anyone else can do the investigation and contribute a PR.

@krzysztof-bronk commented Mar 7, 2019

I understand :) I was merely trying to comprehend where things stand with regard to the different races and available mitigations, since there are several blog posts and several GitHub issues with a massive number of comments to parse.

From my understanding of all of it, even with the 2 kernel fixes, the DNS workarounds, and the iptables flags, there is still an issue at least with multi-pod -> Cluster IP multi-pod connections, and without kernel 5.0 or "random fully" there is also an issue with simple multi-pod -> external IP connections.

But yeah, I'll raise a new issue if that proves true and impactful enough for us in production. Thank you

@Krishna1408 commented Jul 16, 2019

@Quentin-M @brb We are using weave as well for our CNI, and I tried to use the workaround mentioned by @Quentin-M. But I am getting the error:

No distribution data for pareto (/lib/tc//pareto.dist: No such file or directory)

I am using Debian: 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux

And I have mounted it on /usr/lib/tc.

Can you please point out where I am going wrong?

    spec:
      containers:
      - name: weave-tc
        image: 'qmachu/weave-tc:0.0.1'
        securityContext:
          privileged: true
        volumeMounts:
          - name: xtables-lock
            mountPath: /run/xtables.lock
          - name: usr-lib-tc
            mountPath: /usr/lib/tc

      volumes:
      - hostPath:
          path: /usr/lib/tc
          type: ""
        name: usr-lib-tc

Edit:
In the container spec, the volumeMount usr-lib-tc needs an update: the mountPath should be /lib/tc instead of /usr/lib/tc.

@hairyhenderson commented Jul 16, 2019

@Krishna1408 If you change mountPath: /usr/lib/tc to mountPath: /lib/tc it should work. It needs to be mounted in /lib/tc inside the container, but it's (usually) /usr/lib/tc on the host.
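(For clarity, the corrected fragment of the weave-tc container spec from the comment above would look roughly like this; only the mountPath changes, the hostPath stays as-is:)

        volumeMounts:
          - name: xtables-lock
            mountPath: /run/xtables.lock
          - name: usr-lib-tc
            mountPath: /lib/tc        # inside the container; the host path below remains /usr/lib/tc

      volumes:
      - hostPath:
          path: /usr/lib/tc
          type: ""
        name: usr-lib-tc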

@Krishna1408 commented Jul 17, 2019

Hi @hairyhenderson, thanks a lot, it works for me :)
