Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stub resolver cannot resolve www.vox.com or bugzilla.redhat.com, missing A records in response from stub resolver #18972

Closed
mcatanzaro opened this issue Mar 11, 2021 · 26 comments · Fixed by #19009
Labels
downstream/fedora Tracking bugs for Fedora priority Stuff that should enter master quickly, since it fixes a major bug, unbreaks CI or stalls other work resolve
Milestone

Comments

@mcatanzaro
Copy link
Contributor

mcatanzaro commented Mar 11, 2021

systemd-248~rc2-3.fc34, note this version contains the recent fix for #18819.

Fedora 34

5.11.5-300.fc34.x86_64

x86_64

Expected behaviour you didn't see

I should be able to resolve www.vox.com and bugzilla.redhat.com. (Edit: I've removed my examples here because they are not right, see next comment for examples.) A similar problem occurs for www.vox.com (not vox.com). Finally, note this occurs with systemd-248~rc2-3.fc34 which already includes the fix from #18819.

@mcatanzaro
Copy link
Contributor Author

mcatanzaro commented Mar 11, 2021

Er, hang on, the responses provided by the stub resolver keep changing, so I can sometimes randomly resolve the address and other times not. Here is www.vox.com using 8.8.8.8:

$ dig @8.8.8.8 www.vox.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> @8.8.8.8 www.vox.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18442
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: fa68be36e831a810a44399c5604a7c09d403dbf1c8776112 (good)
;; QUESTION SECTION:
;www.vox.com.			IN	A

;; ANSWER SECTION:
www.vox.com.		70653	IN	CNAME	vox-chorus.map.fastly.net.
vox-chorus.map.fastly.net. 5	IN	A	199.232.77.52

;; AUTHORITY SECTION:
fastly.net.		144789	IN	NS	ns1.fastly.net.
fastly.net.		144789	IN	NS	ns4.fastly.net.
fastly.net.		144789	IN	NS	ns3.fastly.net.
fastly.net.		144789	IN	NS	ns2.fastly.net.

;; Query time: 18 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Mar 11 14:22:33 CST 2021
;; MSG SIZE  rcvd: 195

And using the stub resolver, again no A records:

$ dig www.vox.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> www.vox.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49904
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.vox.com.			IN	A

;; ANSWER SECTION:
www.vox.com.		4669	IN	CNAME	vox-chorus.map.fastly.net.

;; AUTHORITY SECTION:
fastly.net.		30	IN	SOA	ns1.fastly.net. hostmaster.fastly.com. 2017052201 3600 600 604800 30
vox-chorus.map.fastly.net. 12	IN	A	199.232.77.52

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Mar 11 14:22:25 CST 2021
;; MSG SIZE  rcvd: 153

@mcatanzaro
Copy link
Contributor Author

Seems www.vox.com is reliably busted. The results I get for bugzilla.redhat.com keep changing so it's down to luck whether that works or not.

@mcatanzaro mcatanzaro changed the title Fun with CNAMEs continues: stub resolver cannot resolve www.vox.com or bugzilla.redhat.com Stub resolver cannot resolve www.vox.com or bugzilla.redhat.com, missing A records in response from stub resolver Mar 11, 2021
@keszybz keszybz added downstream/fedora Tracking bugs for Fedora priority Stuff that should enter master quickly, since it fixes a major bug, unbreaks CI or stalls other work labels Mar 11, 2021
@keszybz keszybz added this to the v248 milestone Mar 11, 2021
@poettering
Copy link
Member

Are you sure you are running the right resolved version? I cannot reproduce this here, both the rh and the vox domain resolve completely reliably here.

your output look a lot like from a version without #18819 fixed, as if you didn't get that patch?

@mcatanzaro
Copy link
Contributor Author

No, I definitely have the patch from #18819, it is applied here for systemd-248~rc2-3.fc34.

@nanonyme thinks it may be semi-random, which feels right to me. bugzilla.redhat.com is resolving for me now, though it was almost always failing a couple hours ago. I managed to visit www.vox.com once just now, but it still almost always fails.

I'm willing to build a modified systemd with extra debug if that would be helpful. Another thing that would help would be if you temporarily disable nss-resolve, then you should get the stub resolver via nss-dns and will notice yourself when it starts acting wonky. It's definitely specific to the stub resolver: anything using nss-resolve is unaffected.

@mcatanzaro
Copy link
Contributor Author

Um, I just realized bugzilla.redhat.com is a somewhat bad example because I am connected to Red Hat VPN, sorry. :/ So comparing its output to what 8.8.8.8 returns is just confusing because there are no CNAMEs in the public version. I've edited my posts to remove those distracting examples. www.vox.com is a better example.

@mcatanzaro
Copy link
Contributor Author

I think the problem is here:

;; ANSWER SECTION:
www.vox.com.		4669	IN	CNAME	vox-chorus.map.fastly.net.

;; AUTHORITY SECTION:
fastly.net.		30	IN	SOA	ns1.fastly.net. hostmaster.fastly.com. 2017052201 3600 600 604800 30
vox-chorus.map.fastly.net. 12	IN	A	199.232.77.52

The record vox-chorus.map.fastly.net. 12 IN A 199.232.77.52 is correct but it should be in the ANSWER SECTION, not AUTHORITY SECTION. Right?

@mcatanzaro
Copy link
Contributor Author

Yeah, currently systemd is randomly returning this:

$ dig www.vox.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> www.vox.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53719
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.vox.com.			IN	A

;; ANSWER SECTION:
www.vox.com.		7082	IN	CNAME	vox-chorus.map.fastly.net.
vox-chorus.map.fastly.net. 30	IN	A	199.232.77.52

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Mar 11 16:47:37 CST 2021
;; MSG SIZE  rcvd: 95

And currently it works for me. So sometimes the A record shows up in the wrong section, then it breaks. Sometimes it shows up in the correct section, then it works.

@mcatanzaro
Copy link
Contributor Author

@cmurf asked me to try SYSTEMD_LOG_LEVEL=debug. Here is a good case:

Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Received dns UDP packet of size 52, ifindex=0, ttl=64, fragsize=0
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Got DNS stub UDP query packet for id 6521
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Looking up RR for www.vox.com IN A.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Cache miss for www.vox.com IN A
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Regular transaction 52607 for <www.vox.com IN A> scope dns on tun0/* (validate=yes).
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Using feature level UDP+EDNS0 for transaction 52607.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Using DNS server 10.8.0.1 for transaction 52607.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Announcing packet size 512 in egress EDNS(0) packet.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Emitting UDP, link MTU is 1500, socket MTU is 0, minimal MTU is 40
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Sending query packet with id 52607 of size 40.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Processing query...
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Received dns UDP packet of size 95, ifindex=5, ttl=0, fragsize=0
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Processing incoming packet of size 95 on transaction 52607 (rcode=SUCCESS).
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Added positive unauthenticated non-confidential cache entry for www.vox.com IN CNAME 7200s on tun0/INET/10.8.0.1
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Added positive unauthenticated non-confidential cache entry for vox-chorus.map.fastly.net IN A 30s on tun0/INET/10.8.0.1
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Regular transaction 52607 for <www.vox.com IN A> on scope dns on tun0/* now complete with <success> from network (unsigned; non-confidential).
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Following CNAME/DNAME www.vox.com → vox-chorus.map.fastly.net.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Positive cache hit for vox-chorus.map.fastly.net IN A
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Regular transaction 24216 for <vox-chorus.map.fastly.net IN A> on scope dns on tun0/* now complete with <success> from cache (unsigned; non-confidential).
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Freeing transaction 24216.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Sending response packet with id 6521 on interface 1/AF_INET of size 95.
Mar 11 16:45:40 chargestone-cave systemd-resolved[78205]: Freeing transaction 52607.

Notice that in this case we have a cache miss for www.vox.com. Now here is a bad case:

Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Received dns UDP packet of size 52, ifindex=0, ttl=64, fragsize=0
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Got DNS stub UDP query packet for id 38680
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Looking up RR for www.vox.com IN A.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Removing cache entry for fedoraproject.org IN AAAA (expired 42s ago)
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Removing cache entry for p1.parsely.com IN AAAA (expired 30s ago)
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Removing cache entry for cdn.parsely.com IN AAAA (expired 30s ago)
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Removing cache entry for optimize-stats.voxmedia.com IN CNAME (expired 29s ago)
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Removing cache entry for fedoraproject.org IN A (expired 4s ago)
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Positive cache hit for www.vox.com IN A
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Regular transaction 13830 for <www.vox.com IN A> on scope dns on tun0/* now complete with <success> from cache (unsigned; non-confidential).
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Freeing transaction 13830.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Following CNAME/DNAME www.vox.com → vox-chorus.map.fastly.net.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Cache miss for vox-chorus.map.fastly.net IN A
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Regular transaction 42596 for <vox-chorus.map.fastly.net IN A> scope dns on tun0/* (validate=yes).
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Using feature level UDP+EDNS0 for transaction 42596.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Using DNS server 10.8.0.1 for transaction 42596.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Announcing packet size 512 in egress EDNS(0) packet.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Emitting UDP, link MTU is 1500, socket MTU is 0, minimal MTU is 40
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Sending query packet with id 42596 of size 54.
Mar 11 17:12:57 chargestone-cave systemd-resolved[82390]: Processing query...
Mar 11 17:12:58 chargestone-cave systemd-resolved[82390]: Received dns UDP packet of size 70, ifindex=5, ttl=0, fragsize=0
Mar 11 17:12:58 chargestone-cave systemd-resolved[82390]: Processing incoming packet of size 70 on transaction 42596 (rcode=SUCCESS).
Mar 11 17:12:58 chargestone-cave systemd-resolved[82390]: Added positive unauthenticated non-confidential cache entry for vox-chorus.map.fastly.net IN A 30s on tun0/INET/10.8.0.1
Mar 11 17:12:58 chargestone-cave systemd-resolved[82390]: Regular transaction 42596 for <vox-chorus.map.fastly.net IN A> on scope dns on tun0/* now complete with <success> from network (unsigned; non-confidential).
Mar 11 17:12:58 chargestone-cave systemd-resolved[82390]: Sending response packet with id 38680 on interface 1/AF_INET of size 153.
Mar 11 17:12:58 chargestone-cave systemd-resolved[82390]: Freeing transaction 42596.

Notice there it starts with a cache hit for www.vox.com. But not all cache hits are bad. Sometimes it caches a good result:

Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Received dns UDP packet of size 52, ifindex=0, ttl=64, fragsize=0
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Got DNS stub UDP query packet for id 15824
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Looking up RR for www.vox.com IN A.
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Positive cache hit for www.vox.com IN A
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Regular transaction 14769 for <www.vox.com IN A> on scope dns on tun0/* now complete with <success> from cache (unsigned; non-confidential).
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Freeing transaction 14769.
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Following CNAME/DNAME www.vox.com → vox-chorus.map.fastly.net.
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Positive cache hit for vox-chorus.map.fastly.net IN A
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Regular transaction 7177 for <vox-chorus.map.fastly.net IN A> on scope dns on tun0/* now complete with <success> from cache (unsigned; non-confidential).
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Freeing transaction 7177.
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Sending response packet with id 15824 on interface 1/AF_INET of size 95.
Mar 11 17:21:14 chargestone-cave systemd-resolved[84538]: Processing query...

What's frustrating is that I haven't found any way to reliably reproduce. Sometimes it works, sometimes it doesn't. Restarting systemd-resolved seems to get it into a good state, though.

@mcatanzaro
Copy link
Contributor Author

Same problem with www.reddit.com:

$ dig www.reddit.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> www.reddit.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8617
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.reddit.com.			IN	A

;; ANSWER SECTION:
www.reddit.com.		67	IN	CNAME	reddit.map.fastly.net.

;; AUTHORITY SECTION:
fastly.net.		30	IN	SOA	ns1.fastly.net. hostmaster.fastly.com. 2017052201 3600 600 604800 30
reddit.map.fastly.net.	1	IN	A	199.232.125.140

;; Query time: 1 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Mar 11 17:35:45 CST 2021
;; MSG SIZE  rcvd: 152

@AdamWill
Copy link
Contributor

I am also still seeing issues with the latest Fedora package build (that includes the previous fix). There's a specific test in openQA which resolves mirrors.fedoraproject.org several times (from inside Fedora's infra, where the record returned is slightly different from the one returned from outside Fedora's infra), and this often fails. I attached the debug log messages from a failure to the downstream bug.

@mcatanzaro
Copy link
Contributor Author

mcatanzaro commented Mar 12, 2021

Workaround: systemctl restart systemd-resolved.service. Run that whenever name resolution fails to flush the DNS cache, then it will work again for a little while before the next failure.

Edit: better workaround: resolvectl flush-caches

@mcatanzaro
Copy link
Contributor Author

Adam wound up disabling the DNS cache in Fedora 34. Since upgrading to this new version about an hour ago, I have not seen a single DNS failure. I can't be completely certain, but I now strongly suspect the bug only occurs when there is a cache hit (and yet: not always when there is a cache hit).

@AdamWill
Copy link
Contributor

yeah, the openQA tests I ran with the disabled cache build all passed (or didn't fail on DNS resolution, at least) too. Will report back if we do hit a failure.

@poettering
Copy link
Member

@mcatanzaro thanks a lot for the elaborate debug logs. I managed to rerproduce this finally with help of those. Working on a fix.

@poettering
Copy link
Member

Fix waiting in #19009. Would be great if you could give this some testing?

poettering added a commit to poettering/systemd that referenced this issue Mar 15, 2021
Previously by mistake we'd always match every single reply we get in a
CNAME chain to the original question from the stub client. That's
broken, we need to test it against the CNAME query we are currently
looking at.

The effect of this incorrect matching was that we'd assign the RRs to
the wrong section since we'd assume they'd be auxiliary answers instead
of primary answers.

Fixes: systemd#18972
jkartseva pushed a commit to jkartseva/systemd that referenced this issue Mar 17, 2021
Previously by mistake we'd always match every single reply we get in a
CNAME chain to the original question from the stub client. That's
broken, we need to test it against the CNAME query we are currently
looking at.

The effect of this incorrect matching was that we'd assign the RRs to
the wrong section since we'd assume they'd be auxiliary answers instead
of primary answers.

Fixes: systemd#18972
@AdamWill
Copy link
Contributor

A reporter downstream is still reporting issues with systemd-248~rc2-8.fc34 , which has #19009 backported to rc2. So either I muffed the backport somehow, the reporter messed something up, or we still have issues here. I did have to rediff one small thing in the patch - changing UINT32_MAX back to (uint32_t) -1 in both the "before" and "after" lines of one patch fragment, because f5fbe71 was not in rc2.

@AdamWill
Copy link
Contributor

oh, never mind, reporter just updated with "Scratch that. This seems to be unrelated as when overriding the old systemd into 34-testing the issue persists. I'm going to dig a little deeper." I'll keep an eye open though.

@mcatanzaro
Copy link
Contributor Author

I did have to rediff one small thing in the patch - changing UINT32_MAX back to (uint32_t) -1 in both the "before" and "after" lines of one patch fragment, because f5fbe71 was not in rc2.

You did basically the same thing I did when backporting this. I've now tested both your backport and my own, and they both fix this issue for me. Maybe the downstream tester failed to systemctl restart systemd-resolved.service or something.

@poettering

This comment has been minimized.

@mcatanzaro

This comment has been minimized.

@AdamWill

This comment has been minimized.

@AdamWill

This comment has been minimized.

@mrc0mmand

This comment has been minimized.

@AdamWill
Copy link
Contributor

I am also hitting problems with the -8 build with the backport, BTW. resolvectl seems to be giving me IPv6 addresses even though I have no IPv6 connectivity. See downstream comment.

@yuwata
Copy link
Member

yuwata commented Mar 18, 2021

BTW, rc4 for fedora rawhide and f34 are built successfully. Please try them.
f34: https://bodhi.fedoraproject.org/updates/FEDORA-2021-f26baf322b
rawhide: https://bodhi.fedoraproject.org/updates/FEDORA-2021-2fa763b178

@AdamWill
Copy link
Contributor

The above is my new report for the latest issue, as Michael suggested it should be filed separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
downstream/fedora Tracking bugs for Fedora priority Stuff that should enter master quickly, since it fixes a major bug, unbreaks CI or stalls other work resolve
Development

Successfully merging a pull request may close this issue.

6 participants