Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with mDNS Queries/Responses #22

Open
cbpowell opened this issue May 9, 2022 · 17 comments
Open

Issue with mDNS Queries/Responses #22

cbpowell opened this issue May 9, 2022 · 17 comments

Comments

@cbpowell
Copy link

cbpowell commented May 9, 2022

I was getting the same problem as @readybeginn in homebridge-lutron-caseta-leap#10, so I poked around a bit and tried some things. I have an OPNsense router with multiple VLANs, though I've turned on the mDNS repeater and it's generally worked fine for all other things HomeKit. As far as I could tell the mDNS queries/responses were getting through the firewall, but the plugin wasn't seeing my hub at initial setup. My Homebridge also runs in Docker with --net=host.

I tried running a simple multicast-dns standalone script that fires off the same query as the plugin here and seemed to be getting responses, so I tried working through the plugin code. I found that the response packets never have an id value other than 0 for me, with the plugin or with multicast-dns independently. Not sure if that's an issue with my network/devices/firewall or what?

I also had to change the resolver configuration to use port 5353 (or leave it off). With the setting of 0 it never received any packets - I'm not sure of the total significance of using 0 here.

I edited the plugin code to instead look for incoming packet responses that had was a single-answer, was the SRV type, and had a name that matched Lutron:

const tgt: string = await new Promise((resolve, reject) => {
    resolver.on('response', (packet: any) => {
        if (packet.answers.length === 1 &&
            packet.answers[0].type === 'SRV' &&
            packet.answers[0].name.match(/[Ll]utron/)) {

This change didn't immediately work by itself, but I tried simultaneously running the independent multicast-dns script (note: running on my laptop, instead of my Homebridge server). That DID result in a response packet that matched the above criteria, and the bridge appeared in the Homebridge UI. Of course that doesn't solve the long term problem of needing to do that on every Homebridge boot so the plugin can find the Lutron bridge again.

I'm brushing the edge of my Node async experience here - it seemed like the resolver query might be going out, but before the resolver listener was running to handle it? Or some other sequencing/timing type thing? (Ignoring the whole absence of a packet id).

There's probably a much better way than this hacky approach, but I edited the getBridgeId function to the following. It uses the above packet criteria, and fires off a new query upon receiving any packet that isn't the correct one. It then destroys the resolver as (I think) it's no longer needed.

Everything seems to work fine with this code, including initial pairing and Homebridge reboots. But I totally wouldn't be surprised if there's something simple I missed!

private async getBridgeID(mdnsID: string): Promise<string> {
    const resolver = mdns({
        multicast: true,
        ttl: 1,
    });
    
    const tgt: string = await new Promise((resolve, reject) => {
        resolver.on('response', (packet: any) => {
            if (packet.answers.length === 1 &&
            packet.answers[0].type === 'SRV' &&
            packet.answers[0].name.match(/[Ll]utron/)) {
                resolve(packet.answers[0].data.target);
                // Stop the resolver now that we have our good answer
                resolver.destroy();
            } else {
                // Resend query
                resolver.query(
                    {
                        // tslint:disable:no-bitwise
                        flags: dnspacket.RECURSION_DESIRED | dnspacket.AUTHENTIC_DATA,
                        questions: [
                            {
                                name: mdnsID,
                                type: 'SRV',
                                class: 'IN',
                            },
                        ],
                        additionals: [
                            {
                                name: '.',
                                type: 'OPT',
                                udpPayloadSize: 0x1000,
                            },
                        ],
                    },
                    {
                        port: 5353,
                        address: '224.0.0.251',
                    },
                );
            }
        });
    });
    
    try {
        return tgt.match(/[Ll]utron-(?<id>\w+)\.local/)!.groups!.id.toUpperCase();
    } catch (e) {
        throw new Error(`could not get bridge serial number from ${tgt}: ${e}`);
    }
}
@readybeginn
Copy link

I'm also running VLANs. I use Pfsense and the Avahi package to provide mDNS. Awesome work @cbpowell!

@thenewwazoo
Copy link
Owner

thenewwazoo commented May 9, 2022

I'm still reading through this (excellent!) report, but I just have to take a moment and wonder

65535 - 1 + 1

what the hell was I thinking when I wrote that? 😆

Okay, back to reading...

@thenewwazoo
Copy link
Owner

I'm going to take some notes here while I work through this:

  • The _id value is the DNS query transaction ID, which can be used to tie a specific response to a specific request. When I send broadcast DNS queries for SRV records, I get responses with a random ID as expected (and no warnings from dig), so it looks like the mDNS responder in the bridge is behaving correctly. I'm not sure about this, though.
  • I specify port 0 because, when mdns uses it, it instructs the operating system to bind to a random port. Turns out, that's not compliant with the RFC (see the last para in 5.2)!
  • Setting the transaction ID (should) obviates the need to filter response packets as you've done. Filtering based on record type, etc, explains why you can trigger discovery with unrelated queries.
  • The logic that follows is, I think, based on working around/with the previous changes. I don't see any async/timing issues because the response handler is hooked up before the query is issued, as expected. Issuing a query after receiving an unknown response is probably not the best approach.

I suspect something in your stack is inspecting (and maybe altering) mDNS responses as they come back by re-writing the transaction ID and/or refusing to pass packets with a malformed source port.

I'll get a change out right now to fix the source port bug. Can you share the script you're using to test multicast-dns? I'd love to see if I can reproduce the qid 0 weirdness.

@cbpowell
Copy link
Author

cbpowell commented May 9, 2022

Sure thing - sorry I should have just done that yesterday. The code is in this gist. With a preceding npm install multicast-dns to install it in the running directory.

Absolutely makes sense on the unique _id value to distinguish the specific query. I admit I thought mDNS generally worked with broad, non-specific queries (because there certainly seem to be a lot flying around on my network!), but that could certainly be exploited to do some DNS spoofing.

What's weird with the _id value is 0 (edit) even if I put my laptop on my IoT VLAN with the bridge and run that script, I still get responses with _id values of 0. Also weird is that in the script, the filtering approach catches the first/original query response that the plugin with the same code seems to miss.

@thenewwazoo
Copy link
Owner

Super weird. The internet appears to disagree about whether mDNS should have a qid of 0, but I can't find anything in the spec saying it must be. What's weirder is that I can reproduce your results, even directly on my RPi running Homebridge, which means the current code should never work, right?

@thenewwazoo
Copy link
Owner

Ah, okay, here we go, from RFC 6762 section 18.1:

Multicast DNS implementations SHOULD listen for unsolicited responses issued by hosts booting up (or waking up from sleep or otherwise joining the network). Since these unsolicited responses may contain a useful answer to a question for which the querier is currently awaiting an answer, Multicast DNS implementations SHOULD examine all received Multicast DNS response messages for useful answers, without regard to the contents of the ID field or the Question Section. In Multicast DNS, knowing which particular query message (if any) is responsible for eliciting a particular response message is less interesting than knowing whether the response message contains useful information.
...
In multicast query messages, the Query Identifier SHOULD be set to zero on transmission.

In multicast responses, including unsolicited multicast responses, the Query Identifier MUST be set to zero on transmission, and MUST be ignored on reception.

In legacy unicast response messages generated specifically in response to a particular (unicast or multicast) query, the Query Identifier MUST match the ID from the query message.

@thenewwazoo
Copy link
Owner

thenewwazoo commented May 9, 2022

Ah, here we go, from RFC 6762 section 6.7:

If the source UDP port in a received Multicast DNS query is not port 5353, this indicates that the querier originating the query is a simple resolver such as described in Section 5.1, "One-Shot Multicast DNS Queries", which does not fully implement all of Multicast DNS. In this case, the Multicast DNS responder MUST send a UDP response directly back to the querier, via unicast, to the query packet's source IP address and port. This unicast response MUST be a conventional unicast response as would be generated by a conventional Unicast DNS server; for example, it MUST repeat the query ID and the question given in the query message.

@thenewwazoo
Copy link
Owner

Okay, got it.

In the original code, I set the source port to 0, which causes mDNS responders to behave as though the requestor wants a "Legacy unicast response".

@thenewwazoo
Copy link
Owner

Check out this gist.

I get the following output:

$ node node_modules/multicast-dns/lutron.js
ID for match:  33361
got a response packet w/ id:  33361
got a response packet w/ id:  0
packet WOULD have matched
^C

@thenewwazoo
Copy link
Owner

Okay, so the behavior here is broken in the case of multiple bridges, as I could get multiple responses for the query for Lutron Status._lutron._tcp.local. In that case, the query is not closed when I get one response, and I am not guaranteed to even get a response from the same bridge that announced itself.

@thenewwazoo
Copy link
Owner

I'm going to have to think about this, and probably refactor how discovery works (again 😆). The DNS-SD library I use only looks for PTR records, but the SRV record contains the bridge ID. Maybe asking for PTR records is redundant, and I could possibly only ask for SRV records, but I'm not sure what the implications of that would be w.r.t. DNS-SD (does DNS-SD require PTR record requests or something?).

At the very least, I need to back out my most recent change, I think, as I'm relying on having the port set in order for the qid to work.

@thenewwazoo
Copy link
Owner

Alright, I unpublished 2.3.7.

@cbpowell
Copy link
Author

cbpowell commented May 9, 2022

Oh man, as a quick response - I'm at work and it's killing me that I can't reply and work on it live! I had done some Wireshark investigation since yesterday too, but I don't think I uncovered anything that you didn't also just describe.

When you request a SRV you do also get an A record with the IP in the additionals item, so would that be enough to also extract the IP? Instead of requesting the PTRs separately? (Maybe that's not proper usage of DNS-SD, like you say).

@thenewwazoo
Copy link
Owner

It looks like I do get an A record along with the SRV response:

$ dig @224.0.0.251 -p 5353 -t srv 'Lutron\032Status._lutron._tcp.local.'

; <<>> DiG 9.10.6 <<>> @224.0.0.251 -p 5353 -t srv Lutron\032Status._lutron._tcp.local.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23810
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2

;; QUESTION SECTION:
;Lutron\032Status._lutron._tcp.local. IN	SRV

;; ANSWER SECTION:
Lutron\032Status._lutron._tcp.local. 10	IN SRV	0 0 22 Lutron-032e7e88.local.

;; ADDITIONAL SECTION:
Lutron-032e7e88.local.	10	IN	A	192.168.1.151
Lutron-032e7e88.local.	10	IN	AAAA	fe80::9270:65ff:fee3:8f25

@cbpowell
Copy link
Author

cbpowell commented May 9, 2022

Yep same with the mdns-resolver script:

ID for match:  52572
got a response packet w/ id:  0
packet WOULD have matched:  {
  id: 0,
  type: 'response',
  flags: 1024,
  flag_qr: true,
  opcode: 'QUERY',
  flag_aa: true,
  flag_tc: false,
  flag_rd: false,
  flag_ra: false,
  flag_z: false,
  flag_ad: false,
  flag_cd: false,
  rcode: 'NOERROR',
  questions: [],
  answers: [
    {
      name: 'Lutron Status._lutron._tcp.local',
      type: 'SRV',
      ttl: 120,
      class: 'IN',
      flush: true,
      data: [Object]
    }
  ],
  authorities: [],
  additionals: [
    {
      name: 'Lutron-01e510eb.local',
      type: 'A',
      ttl: 120,
      class: 'IN',
      flush: true,
      data: '10.7.20.41'
    },
    {
      name: 'Lutron-01e510eb.local',
      type: 'AAAA',
      ttl: 120,
      class: 'IN',
      flush: true,
      data: 'fe80::8a4a:eaff:fefd:6b4b'
    },
    {
      name: 'Lutron-01e510eb.local',
      type: 'NSEC',
      ttl: 120,
      class: 'IN',
      flush: true,
      data: [Object]
    },
    {
      name: 'Lutron Status._lutron._tcp.local',
      type: 'NSEC',
      ttl: 120,
      class: 'IN',
      flush: true,
      data: [Object]
    }
  ]
}

@cbpowell
Copy link
Author

cbpowell commented May 9, 2022

Side note, those dig queries just don't seem to work across the OPNsense mdns repeater.

@thenewwazoo
Copy link
Owner

Yeah, I'd actually expect they wouldn't since it's not well-formed mDNS. Send a packet (using my gist above) with qid zero and source port 5353, and I bet it repeats it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants