DNS query times out after receiving some GB of data #3924
Comments
Additionally, I found this Kubernetes issue, which is addressed by RKE2 NodeLocal DNSCache. Implementing it does not fix this problem. I also inspected conntrack; there I see a high 'invalid' counter and many 'restarts'.
Does anybody know what that means and whether it is related too?
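For reference, those conntrack counters can be watched on the host with conntrack-tools (a minimal sketch; the tool must be installed):

```sh
# per-CPU conntrack statistics; 'invalid' and 'insert_failed' are the
# columns of interest here
sudo conntrack -S

# how close the connection-tracking table is to its limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```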
Actually, I don't know which information is important, so I think it's better to write more rather than less. Conntrack buffer message:
This message is printed sometimes; it seems to have no direct impact. Netcat setting:
Have you considered raising the relevant buffer sizes?
Thank you for your response @brandond.
Could you try to change the UDP buffer size with
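The exact command was lost from the comment above; raising the kernel's UDP receive buffers is typically done via sysctl, for example (values are illustrative):

```sh
# inspect the current limits
sysctl net.core.rmem_default net.core.rmem_max

# raise the default and maximum socket receive buffer to 4 MiB (runtime only)
sudo sysctl -w net.core.rmem_default=4194304
sudo sysctl -w net.core.rmem_max=4194304

# persist across reboots
echo 'net.core.rmem_max=4194304' | sudo tee /etc/sysctl.d/99-udp-buffers.conf
```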
Have you done as the error message suggested?
Thank you for your response @rbrtbnfgl.
With
This is how it looks on all three nodes (host level) during the restore and after the first failure. I can't see any problem here.
You need to run it in the namespace where you got the errors.
@brandond
On GitHub I found https://github.com/weaveworks/scope/pull/2739/files/9cc6cdbd5f81e78567886d32bfe66582cb57552f#diff-1ef0abf47eda947a298ede4cfaee9676714793745038108a28a5696e8df9cc1dL291, which changes the value to 4096*1024 = 4194304.
At the next restore test the system failed again. Do you know of other or further parameters to set?
So changing that value doesn't fix the issue? It seems related to the UDP socket of the application, not directly to the networking. With
Right, the issue already existed. The application restores data from an S3 bucket, so the main traffic is TCP; UDP is needed only for DNS, so UDP traffic is very low. The issue also came up when I limited the restore bandwidth to half; it just took longer.
I am not sure whether I am misunderstanding something. I ran this inside the Elasticsearch container,
but without
To be complete, I am adding the Elasticsearch logfile from this operation; the error occurs beginning at line 290. As I read
I looked deeper into Java DNS and set the parameter
Which version of RKE2 are you using? Maybe it's not a buffer-related issue.
RKE2 v1.25.5+rke2r2
Just another observation:
Is this correct? And if it is true, why? (edit: beautified log)
Are you running tcpdump on the node where the DNS pod is located or on the node where the pod is doing the curl? I don't know how the cache is implemented, but I presume it will always contact the DNS service IP and the proxy will then redirect it to the right pod.
I did this inside the Elasticsearch pod. The docs say, "Please be aware that nodelocal modifies the iptables of the node to intercept DNS traffic." I am not sure what that means. From the Kubernetes point of view, the node-cache pods will, IMHO, get no requests.
By default the IP address of the DNS service is rewritten to the IP address of the CoreDNS pod by iptables. With this feature enabled, I think it creates additional rules to redirect the traffic to the local cache.
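Those extra rules can be inspected on the node; a minimal check, assuming the default NodeLocal DNSCache listen address 169.254.20.10:

```sh
# node-local-dns adds NOTRACK rules in the raw table plus ACCEPT rules
# for its link-local listen address
sudo iptables-save | grep -E '169\.254\.20\.10|NOTRACK'
```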
You can try to use dropwatch.
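A minimal dropwatch session on the host looks like this (assuming the package is installed; `-l kas` resolves drop locations to kernel symbols):

```sh
sudo dropwatch -l kas
# at the interactive prompt:
#   dropwatch> start
# ... watch for lines like "N drops at udp_queue_rcv_skb+0x.../..."
#   dropwatch> stop
```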
After installing dropwatch on the host I got a lot of logfiles (see attachment). Do you know how to interpret this? BTW: thank you for your support @rbrtbnfgl
I thought that this could give more meaningful output. There are a lot of unrelated drops.
First of all, thank you @rbrtbnfgl and @brandond for looking into this issue. To sum up so far:
Are there any other ideas on how to inspect this more deeply? Thanks!
It seems that it's not related to the buffer; the counter is 0.
If you check
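The exact check was elided above; per-socket UDP drop counters can be read directly from /proc, for example:

```sh
# the last column ('drops') counts datagrams dropped because the
# socket's receive buffer was full
cat /proc/net/udp

# or, more readable, with socket memory details
ss -uampn
```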
@rbrtbnfgl
This didn't work for me after all with Calico/Canal; I verified this by watching metrics on the NodeLocal DNS pods and on CoreDNS itself. The only thing that helped was setting the following in kubelet-arg (see the config snippet below):
- cluster-dns=169.254.20.10
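For reference, on RKE2 that argument goes into /etc/rancher/rke2/config.yaml on each node (169.254.20.10 is the conventional NodeLocal DNSCache address):

```yaml
# /etc/rancher/rke2/config.yaml
kubelet-arg:
  - "cluster-dns=169.254.20.10"
```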
After digging deeper into the issue, I assume that the problem is around UDP, not DNS itself. We also figured out that the UDP answer packet for the DNS request reaches the container. tcpdump:
These lines came from a DNS request via
Further, we watched netstat during failed DNS queries. See how the counter 'packet receive errors' increments:
Do you have any hints where to go next, @rbrtbnfgl @brandond? Thanks!
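One way to watch that counter increment while reproducing the timeouts (assuming net-tools is available in the container):

```sh
# 'packet receive errors' grows when datagrams are dropped, e.g. because
# the socket receive buffer is full or a checksum is bad
watch -n1 'netstat -su'
```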
Was the tcpdump done inside the container or on the node? Could you capture using
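The suggested capture command was elided; a typical way to capture on both ends for comparison (paths are placeholders):

```sh
# inside the pod's network namespace: all DNS traffic to a pcap file
tcpdump -ni any port 53 -w /tmp/dns-pod.pcap

# in parallel on the node, to see whether the reply makes it that far
tcpdump -ni any port 53 -w /tmp/dns-node.pcap
```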
@rbrtbnfgl I tried to make this more understandable and did a screencast of it; the pcap file is also attached. You can see the requests until the first complete timeout; during this, some slow DNS queries show up.
After inspecting the pcap file with Wireshark (I am not familiar with it), it seems that in packets 117..122 I see:
So I want to make my question more precise.
Could you try this?
@rbrtbnfgl I tried this, but unfortunately it changes nothing. The system shows the same behavior as in the video before.
Yes it has, but most of them don't count. I've checked this also under high network load.
@eifelmicha
Did you test with or without the
@eifelmicha it was tested without the argument
If you start a new empty pod on the failing node and try to contact the DNS, does it work? This is only to check whether there are issues with the networking of the node or whether it is specific to the failing pod.
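A quick way to run that test, pinning a throwaway pod to the failing node (the node name is a placeholder):

```sh
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 \
  --overrides='{"spec":{"nodeName":"<failing-node>"}}' \
  -- nslookup kubernetes.default.svc.cluster.local
```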
Other pods are already working, but they don't have comparable network traffic. So far I have seen the issue only on the failing pod.
And you get the same issue with other CNIs too?
Well, I can't test this right now; it would require setting up a completely new cluster, I think. As it is all bare metal, things are not so easy at this point.
I installed a k0s distribution with Calico + WireGuard as CNI. The problem occurs there as well.
I don't know; it could possibly be an issue of the application itself that brings the container interface into an error state.
Well, it is Elasticsearch, a Java application. I am not able to inspect this more deeply.
I checked my workaround a little. DNS via TCP works as it should 🎉 The workaround for a Kubernetes pod is (sketched below)
How this works:
As a result, my
Thank you @rbrtbnfgl for supporting me!
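The workaround itself was elided above; one way to force DNS over TCP for a single pod is the glibc resolver option use-vc, injected via the pod's dnsConfig — a minimal sketch (whether this matches the elided workaround is an assumption; musl-based images such as Alpine ignore use-vc):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch-example   # hypothetical name
spec:
  dnsConfig:
    options:
      - name: use-vc            # glibc resolver: use TCP instead of UDP for DNS
  containers:
    - name: elasticsearch
      image: docker.elastic.co/elasticsearch/elasticsearch:8.6.2  # example image
```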
I am probably running into a UDP packet loss bug. It happens reproducibly, but not deterministically. Let me explain.
With an Elasticsearch cluster of 3 nodes running on Kubernetes RKE2, I tried to restore indices from an S3 store. The nodes write with moderate I/O to NVMe storage; no bottleneck there.
The problem comes with a more or less large amount of received data: it appears after around 20..40 GB of restored data. Java DNS queries time out; IMHO this implies a network DNS query failed 3 times. Elasticsearch can't deal with that during the restore from the S3 bucket, throws an exception, and all incomplete indices can't be restored after that.
My setting:
/var/lib/rancher/rke2/server/manifests/rke2-canal-config.yaml
I identified the container network interface of, for example, one of the Elasticsearch pods. See the UDP packet receive errors.
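To reproduce that check from the host, one way is to enter the pod's network namespace via the container PID (the container ID is a placeholder; on RKE2, crictl may need CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml):

```sh
# resolve the container's PID via the CRI runtime
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container-id>)

# UDP statistics and interface error counters as seen inside the pod
sudo nsenter -t "$PID" -n netstat -su
sudo nsenter -t "$PID" -n ip -s link
```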
I read a lot about UDP and buffers, but I am completely unsure about what I should configure there.
To be complete, here is the Elasticsearch log line:
Any hints on how to fix this? Or how could I investigate further?
Thank you!