DNS hangs on invalid domain #156

Closed
daurnimator opened this issue Jul 5, 2016 · 9 comments · Fixed by #200 · May be fixed by wahern/dns#27

Comments

@daurnimator
Collaborator

This hangs:

print(require"cqueues.socket".connect("wrong.invalid", 80):connect())

Whereas dig returns NXDOMAIN almost instantly.

Relevant strace output of the Lua sample (interrupted during the hang):

socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 5
fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
poll([{fd=5, events=POLLIN}], 1, 10)    = 1 ([{fd=5, revents=POLLIN}])
read(5, "\241\253\337\243A\241\fs\320\363(\255\366\2505\346\364\215\25\233\262\325YN\312\201\324\21\352\255\177\304", 32) = 32
close(5)                                = 0
getuid()                                = 1000
bind(4, {sa_family=AF_INET, sin_port=htons(33483), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
brk(0x967000)                           = 0x967000
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = 0
sendto(4, "\263+\1\0\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\1\0\1", 31, 0, NULL, 0) = 31
recvfrom(4, 0x947064, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(4, 0x947064, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_create1(EPOLL_CLOEXEC)            = 5
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 6
epoll_ctl(5, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=9729320, u64=9729320}}) = 0
write(6, "\1\0\0\0\0\0\0\0", 8)         = 8
epoll_wait(5, [{EPOLLIN, {u32=9729320, u64=9729320}}], 32, 0) = 1
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=9730880, u64=9730880}}) = 0
read(6, "\1\0\0\0\0\0\0\0", 8)          = 8
epoll_wait(5, [{EPOLLIN, {u32=9730880, u64=9730880}}], 32, -1) = 1
write(6, "\1\0\0\0\0\0\0\0", 8)         = 8
epoll_ctl(5, EPOLL_CTL_DEL, 4, 0x7ffff26147f0) = 0
recvfrom(4, "\263+\201\203\0\1\0\0\0\1\0\0\5wrong\7invalid\0\0\1\0\1\0"..., 768, 0, NULL, NULL) = 106
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = 0
sendto(4, "\372\372\1\0\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\0\1\0\1", 32, 0, NULL, 0) = 32
recvfrom(4, 0x947da4, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, [{EPOLLIN, {u32=9729320, u64=9729320}}], 32, 0) = 1
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=9730880, u64=9730880}}) = 0
read(6, "\1\0\0\0\0\0\0\0", 8)          = 8
epoll_wait(5, ^Cstrace: Process 8661 detached
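
For reference, the reply header in the `recvfrom` line above can be decoded by hand. The following is a minimal sketch, assuming the standard 12-byte DNS header layout from RFC 1035; the names `REPLY_HEADER`, `RCODE_NAMES`, and `parse_dns_header` are made up for illustration and are not part of cqueues or dns.c. Only the 12 header bytes are used because strace truncated the rest of the 106-byte reply.

```python
import struct

# First 12 bytes of the reply from the recvfrom() line above, transcribed from
# strace's octal escapes: "\263+\201\203\0\1\0\0\0\1\0\0".
REPLY_HEADER = bytes([0o263, ord("+"), 0o201, 0o203, 0, 1, 0, 0, 0, 1, 0, 0])

# Partial rcode table, enough for this example.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def parse_dns_header(data):
    """Unpack the fixed 12-byte DNS header into a dict of its fields."""
    qid, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", data[:12])
    return {
        "id": qid,
        "qr": flags >> 15,            # 1 means this is a response
        "opcode": (flags >> 11) & 0xF,
        "rd": (flags >> 8) & 1,       # recursion desired
        "ra": (flags >> 7) & 1,       # recursion available
        "rcode": flags & 0xF,
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


header = parse_dns_header(REPLY_HEADER)
print(header["id"], RCODE_NAMES.get(header["rcode"], header["rcode"]))
```

The flags word is 0x8183, so the server did answer the first query with rcode 3 (NXDOMAIN) and no answer records; the hang happens on the follow-up query.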
@wahern
Owner

wahern commented Jul 6, 2016

Can you capture the entire reply? Also, can you paste your /etc/resolv.conf? I can't reproduce, but I think the problem is related to search generation.

While trying to set up an environment that reproduces the issue, I discovered that your patch (4d66661) to cqs_newmetatable broke dns.config. The :set method added to the config userdata object is lost, so dns.config.new(t) doesn't work. In fact, I'm surprised everything else appears to work, and I can't figure out how I missed the problem on review.

@daurnimator
Collaborator Author

daurnimator commented Jul 7, 2016

It doesn't seem to replicate since I restarted my computer. I had previously moved between various networks, so I'm not completely sure what the state was.

In general, if you send a query to a nonexistent server it hangs. But I don't think that's the same issue:

sudo ip addr add 10.1.1.5/24 dev wlp6s0 # add a new subnet that doesn't have any devices on it
echo nameserver 10.1.1.1 | sudo tee /etc/resolv.conf # replace resolv.conf with non-listening dns server
strace lua -e 'require"cqueues.auxlib" print(require"cqueues.socket".connect("wrong.invalid", 80):connect())' # Replicated?
  • in the strace above I can see data was received.

@daurnimator
Collaborator Author

Closing as I can't replicate

@daurnimator
Collaborator Author

This might have just cropped up again.

$ strace lua -e 'require"cqueues.auxlib"; print(require"cqueues.socket".connect({host="wrong.invalid", port=80}):connect())'
<snip>
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=69, ...}) = 0
lseek(3, 0, SEEK_SET)                   = 0
read(3, "# Generated by resolvconf\nnamese"..., 4096) = 69
read(3, "", 4096)                       = 0
close(3)                                = 0
open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=234, ...}) = 0
read(3, "# Begin /etc/nsswitch.conf\n\npass"..., 4096) = 234
read(3, "", 4096)                       = 0
read(3, "", 4096)                       = 0
read(3, "", 4096)                       = 0
close(3)                                = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=195, ...}) = 0
lseek(3, 0, SEEK_SET)                   = 0
read(3, "#\n# /etc/hosts: static lookup ta"..., 4096) = 195
brk(0x20d3000)                          = 0x20d3000
read(3, "", 4096)                       = 0
close(3)                                = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 4
fstat(4, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
poll([{fd=4, events=POLLIN}], 1, 10)    = 1 ([{fd=4, revents=POLLIN}])
read(4, "A\6\215`r\342X\254\321z\rp\316|\267foGi\2}\376\t\217m\210\302z\223\4\200\354", 32) = 32
close(4)                                = 0
getuid()                                = 1000
bind(3, {sa_family=AF_INET, sin_port=htons(55374), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.2.1")}, 16) = 0
sendto(3, "_\252\1\0\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\1\0\1", 31, 0, NULL, 0) = 31
recvfrom(3, 0x20b501c, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(3, 0x20b501c, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_create1(EPOLL_CLOEXEC)            = 4
eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK)   = 5
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=34297176, u64=34297176}}) = 0
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
epoll_wait(4, [{EPOLLIN, {u32=34297176, u64=34297176}}], 32, 0) = 1
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=34299040, u64=34299040}}) = 0
read(5, "\1\0\0\0\0\0\0\0", 8)          = 8
epoll_wait(4, [{EPOLLIN, {u32=34299040, u64=34299040}}], 32, -1) = 1
write(5, "\1\0\0\0\0\0\0\0", 8)         = 8
epoll_ctl(4, EPOLL_CTL_DEL, 3, 0x7ffedda261d0) = 0
recvfrom(3, "_\252\201\203\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\1\0\1", 768, 0, NULL, NULL) = 31
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.2.1")}, 16) = 0
sendto(3, "\254\304\1\0\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\0\1\0\1", 32, 0, NULL, 0) = 32
recvfrom(3, 0x20b5e0c, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, [{EPOLLIN, {u32=34297176, u64=34297176}}], 32, 0) = 1
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=34299040, u64=34299040}}) = 0
read(5, "\1\0\0\0\0\0\0\0", 8)          = 8
epoll_wait(4, ^Cstrace: Process 15454 detached
 <detached ...>
lua: /usr/share/lua/5.3/cqueues.lua:77: interrupted!
$ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 10.1.2.1
nameserver 192.168.1.1

It never got to the 2nd nameserver. Removing that from resolv.conf doesn't seem to fix anything.

Passing a family argument (of either 2 or 10) to connect doesn't seem to help.

@daurnimator daurnimator reopened this Sep 14, 2016
@daurnimator
Collaborator Author

daurnimator commented Sep 14, 2016

I think this might be easier to replicate with just dns.c.
Though I'm not 100% confident this is exhibiting the same behaviour.

$ gcc -D DNS_DEBUG -D DNS_MAIN dns.c
$ ./a.out  -q wrong.invalid show-resconf
; SOURCES
;   /etc/resolv.conf
;   /etc/nsswitch.conf
;
nameserver 10.1.2.1
search .
; hosts: files dns
lookup file bind
options ndots:1 timeout:5 attempts:2
interface 0.0.0.0 0
$ ./a.out -vvvv -q wrong.invalid addrinfo-stub
@@ BEGIN * * * * * * * * * * * *
@@ ASKING: hints.local./10.1.2.1 @ DEPTH: 0)
;; [HEADER]
;;    qid : 53039
;;     qr : QUERY(0)
;; opcode : QUERY(0)
;;     aa : NON-AUTHORITATIVE(0)
;;     tc : NOT-TRUNCATED(0)
;;     rd : RECURSION-DESIRED(1)
;;     ra : RECURSION-NOT-ALLOWED(0)
;;  rcode : NOERROR(0)

;; [QUESTION:1]
;wrong.invalid. IN A
@@ END * * * * * * * * * * * * *

@@ BEGIN * * * * * * * * * * * *
@@ ANSWER @ DEPTH: 0)
;; [HEADER]
;;    qid : 53039
;;     qr : RESPONSE(1)
;; opcode : QUERY(0)
;;     aa : NON-AUTHORITATIVE(0)
;;     tc : NOT-TRUNCATED(0)
;;     rd : RECURSION-DESIRED(1)
;;     ra : RECURSION-ALLOWED(1)
;;  rcode : NXDOMAIN(3)

;; [QUESTION:1]
;wrong.invalid. IN A
@@ END * * * * * * * * * * * * *

@@ BEGIN * * * * * * * * * * * *
@@ ASKING: hints.local./10.1.2.1 @ DEPTH: 0)
;; [HEADER]
;;    qid : 51133
;;     qr : QUERY(0)
;; opcode : QUERY(0)
;;     aa : NON-AUTHORITATIVE(0)
;;     tc : NOT-TRUNCATED(0)
;;     rd : RECURSION-DESIRED(1)
;;     ra : RECURSION-NOT-ALLOWED(0)
;;  rcode : NOERROR(0)

;; [QUESTION:1]
;wrong.invalid. 256 0
@@ END * * * * * * * * * * * * *

^C

@daurnimator
Collaborator Author

daurnimator commented Oct 29, 2017

I still have this issue semi-regularly.

 $ strace -e network lua -e 'print(require"cqueues.socket".connect("wrong.invalid", 80):connect())'
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(53399), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.2")}, 16) = 0
sendto(3, "\276\234\1\0\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\1\0\1", 31, 0, NULL, 0) = 31
recvfrom(3, 0x2494a2c, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(3, 0x2494a2c, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(3, "\276\234\201\203\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\1\0\1", 768, 0, NULL, NULL) = 31
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.2")}, 16) = 0
sendto(3, "se\1\0\0\1\0\0\0\0\0\0\5wrong\7invalid\0\0\0\1\0\1", 32, 0, NULL, 0) = 32
recvfrom(3, 0x249d26c, 768, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
^Cstrace: Process 24159 detached

Looking at this in Wireshark was more educational. The second DNS query is invalid. As hex:

7365010000010000000000000577726f6e6707696e76616c6964000000010001

Which is an invalid type and an invalid class.

Which we should have noticed in the dump in my earlier comment...

> ;; [QUESTION:1]
> ;wrong.invalid. 256 0
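
To make the damage visible without Wireshark, here is a rough sketch that parses the question section of the hex dump above. It assumes the plain RFC 1035 question layout (length-prefixed labels, a zero terminator, then 16-bit QTYPE and QCLASS) and ignores name compression, which a query should not contain. `parse_question` is a hypothetical helper written for this comment, not a dns.c function.

```python
import struct

# The second query, copied from the hex dump above.
QUERY = bytes.fromhex(
    "7365010000010000000000000577726f6e6707696e76616c6964000000010001"
)


def parse_question(packet):
    """Return (name, qtype, qclass, trailing bytes) for the first question."""
    offset = 12  # skip the fixed-size header
    labels = []
    while True:
        length = packet[offset]
        offset += 1
        if length == 0:  # a zero-length label terminates the name
            break
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length
    qtype, qclass = struct.unpack("!2H", packet[offset:offset + 4])
    return ".".join(labels) + ".", qtype, qclass, packet[offset + 4:]


name, qtype, qclass, trailing = parse_question(QUERY)
print(name, "type", qtype, "class", qclass, "trailing", trailing.hex())
```

Parsed this way, the name still reads `wrong.invalid.`, but the extra zero byte shifts the remaining fields by one: the type comes out as 0, the class as 256, and one stray byte is left over. That lines up with the `256 0` in the quoted dump, which prints class before type as in the earlier `IN A` line.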

@daurnimator
Collaborator Author

@wahern are you able to look into this issue with the dump above? A user aside from myself seems to be running into it. daurnimator/lua-http#87

@daurnimator
Collaborator Author

Digging into dns.c further, the issue seems to come from dns_resconf_search (from the DNS_R_FILE call site): it yields wrong.invalid. the first time through via the ndots branch, but then goes to the search branch, where it appends another . and yields wrong.invalid.., which results in an outgoing query packet with an extra null terminator.
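
The behaviour described above can be modelled in a few lines. This is a toy sketch, not the dns.c code: `search_candidates`, `encode_name_naive`, and `build_query` are hypothetical names, the candidate logic is a simplified guess at the ndots and search branches, and the encoder is deliberately naive so that an empty label is written out as a zero byte. The search list `["."]` is taken from the `search .` line in the show-resconf output earlier in this thread.

```python
import struct


def search_candidates(name, search, ndots=1):
    """Simplified model of resolv.conf-style search expansion."""
    candidates = []
    if name.count(".") >= ndots:
        candidates.append(name + ".")        # ndots branch: the name itself, anchored
    for suffix in search:
        candidates.append(name + "." + suffix)  # search branch: name joined to suffix
    return candidates


def encode_name_naive(name):
    """Encode a dotted name as DNS labels without rejecting empty labels."""
    if name.endswith("."):
        name = name[:-1]                     # drop the single anchoring dot
    wire = bytearray()
    for label in name.split("."):
        raw = label.encode("ascii")
        wire.append(len(raw))                # an empty label becomes a lone zero byte
        wire += raw
    wire.append(0)                           # root label terminates the name
    return bytes(wire)


def build_query(qid, name, qtype=1, qclass=1):
    """Build a one-question query with only the RD flag set."""
    header = struct.pack("!6H", qid, 0x0100, 1, 0, 0, 0)
    return header + encode_name_naive(name) + struct.pack("!2H", qtype, qclass)


good, bad = search_candidates("wrong.invalid", ["."])
for candidate in (good, bad):
    print(candidate, len(build_query(0x7365, candidate)))
```

Under these assumptions the first candidate encodes to a 31-byte query and the second to a 32-byte one, matching the two `sendto` sizes in the strace output. With the query ID 0x7365, the second packet is byte-for-byte the hex dump quoted earlier in this thread.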

This was referenced Jun 16, 2018
@daurnimator
Collaborator Author

daurnimator commented Jun 16, 2018

I finally figured this out, and it was sitting there as a "fixme" the whole time!
When gethostname returns a string that doesn't contain a '.', we were getting a garbage search entry, which then resulted in a corrupt DNS question packet.

See wahern/dns@0ac72c9

PRs sent: wahern/dns#27 and #200
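
As an illustration of the kind of guard involved (this is not the code from the linked commit or PRs, which should be read directly), here is a minimal sketch of deriving a default search domain from the local hostname. `default_search_domain` is a hypothetical helper; the assumption is that the domain is whatever follows the first '.' in the hostname, and that a hostname without a '.' should produce no search entry at all.

```python
import socket


def default_search_domain(hostname):
    """Return the domain part of hostname, or None if there is none.

    Assumption for this sketch: the search domain is everything after the
    first '.'; a bare hostname such as "laptop" has no usable domain.
    """
    _, dot, domain = hostname.partition(".")
    domain = domain.rstrip(".")
    if not dot or not domain:
        return None  # no '.', or nothing after it: add no search entry
    return domain


# Example with the real local hostname; the result depends on the machine.
print(default_search_domain(socket.gethostname()))
```

Returning nothing for a bare hostname avoids ever handing an empty or "." suffix to the search expansion, which is the input that produced the doubled dot in the analysis above.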
