
Domain names not being resolved over extender node bridged connection #27

Closed

paidforby opened this issue Apr 9, 2018 · 18 comments

@paidforby
Discovered on a recent node mount that domain names are not being resolved over an extender node bridged mesh connection. I believe @jhpoelen and I observed similar behavior with the mesh test bed in sudoroom a few weeks ago.

The setup

  • Two home nodes (MyNet N600s) that are unable to mesh with one another (either due to physical distance, or because their ad-hoc interfaces are disabled)
  • One of the home nodes (let's say Cow) has an internet connection; the other does not (let's call this one Chicken)
  • Two extender nodes (Ubiquiti NanoBridge M5s) that are able to mesh with each other via line of sight

To reproduce

Connect to the peoplesopen.net SSID of Chicken and try the following:

ping 8.8.8.8

you should see

64 bytes from 8.8.8.8: icmp_seq=1 ttl=47 time=389 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=47 time=273 ms

However, if you try ping google.com, you will receive a timeout.

You can also try running traceroute 8.8.8.8 and see that it successfully hops across the bridged connection:

traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 38 byte packets
 1  100.65.98.130 (100.65.98.130)  0.377 ms  0.777 ms  0.536 ms
 2  100.65.98.2 (100.65.98.2)  1.635 ms  1.942 ms  1.505 ms
 3  100.65.98.1 (100.65.98.1)  1.578 ms  1.778 ms  2.122 ms
 4  100.64.0.43 (100.64.0.43)  15.146 ms  12.827 ms  14.989 ms
...
 14  8.8.8.8 (8.8.8.8)  12.793 ms  11.533 ms  11.941 ms

Where:
100.65.98.130 is the IP of Chicken's roof-mounted antenna,
100.65.98.2 is the IP of Cow's roof-mounted antenna,
100.65.98.1 is the IP of Cow's home node, and
100.64.0.43 is the IP of the mesh DNS server.

But trying to traceroute google.com also times out.

Also, when SSH'd into Chicken, you should see the following in /var/log/messages:

Sun Apr  8 14:23:53 2018 daemon.info dnsmasq-dhcp[2007]: DHCPREQUEST(br-open) 100.65.98.164 fc:f8:ae:02:69:f3                                  
Sun Apr  8 14:23:53 2018 daemon.info dnsmasq-dhcp[2007]: DHCPACK(br-open) 100.65.98.164 fc:f8:ae:02:69:f3 <yourcomputershostname>

It looks like this just logs that a request was made and acknowledged, but, of course, it's unclear what the ACK means. Also, ignore the tunneldigger broker selection failure; Chicken doesn't need to tunnel to the exit node, only Cow does.

Thoughts

It's unclear how long this has been an issue, but it could be related to recent changes to exitnode or makenode, perhaps #23 and its related commits hold some secrets that may help debug this issue.

I suspect that differences between the configuration of newly makenoded home nodes and freshly flashed extender nodes (which don't get makenoded) result in conflicting DNS configurations, preventing the extender nodes from correctly routing DNS requests.

Any help would be hugely appreciated; the first step is to get the mesh test bed in sudoroom back in working order.

@Juul
Member

Juul commented Apr 10, 2018

OK, here's my theory of what's going on.

When a client connects to Chicken, it uses DHCP to get an IP/subnet/gateway, but also the DNS servers to use. The primary (and only) DNS server provided by Chicken to the client is probably Chicken's own IP address (the same as the gateway IP it handed to the client).

When the client pings 8.8.8.8, everything is routed normally. But when the client resolves a domain name, it sends the request to the DNS server running on Chicken (dnsmasq); if Chicken's DNS server does not know the answer, it queries upstream DNS servers and replies to the client once it has the answer.

The problem is that the DNS query from Chicken to the upstream DNS servers originates from Chicken's own IP, and any traffic originating from a home node uses the private routing table (rather than the public one), so if there is no direct internet connection on the home node's WAN port, the lookup fails.
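One quick way to see this on Chicken itself is to inspect the policy routing rules. A minimal sketch, assuming a typical iproute2 policy-routing setup (the priorities and table names below are illustrative, not necessarily what makenode configures):

# on Chicken, list the policy routing rules
ip rule show
# expect something along these lines (priorities/table names are illustrative):
#   0:      from all lookup local
#   100:    from <chicken-ip> lookup private
#   32766:  from all lookup main
# then check whether the private table has any default route out
ip route show table private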

Checking if this theory is true

Setting the client to get only IP/subnet/gateway from DHCP, but not DNS, and then manually specifying the DNS server as e.g. 8.8.8.8 on the client should fix the issue.

SSH'ing into Chicken and then running e.g. host omnicommons.org or even ping 8.8.8.8 should fail.

If both tests behave as predicted, then the theory is probably true.
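Concretely, the two checks might look like this (a sketch; it assumes a Linux client whose /etc/resolv.conf isn't managed by another tool, and <chicken-mesh-ip> is a placeholder):

# test 1: on the client, keep the DHCP lease but override DNS manually
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
ping -c 2 google.com        # should now succeed if the theory holds

# test 2: on Chicken itself, node-originated traffic should fail
ssh root@<chicken-mesh-ip>
host omnicommons.org        # expected to time out
ping -c 2 8.8.8.8           # expected to fail as well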

Fixing this issue

The simplest fix is to alter the policy-based routing rules such that traffic originating from the node itself will use the public routing table (tunnel) rather than the private, even if the private is available. The disadvantage of this is that now DNS will stop working on the private network if the public network is down.
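As a rough sketch of what that could look like (the table name "public", the priority, and the "iif lo" trick for matching node-originated traffic are all assumptions for illustration):

# route locally generated traffic via the public (tunnel) table;
# "iif lo" matches packets originating from the node itself
ip rule add iif lo lookup public priority 50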

Another fix (which I don't recommend) is to have the DHCP server on the home nodes provide e.g. 8.8.8.8 as the DNS server. The side-effect would be that now we can't resolve e.g. my.sudomesh.org or anything else to the local home node address and we can't resolve anything at all without an internet connection. This would prevent people from accessing their home node web admin interface unless they know the IP of their home node.
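For reference, that change would be a single dnsmasq option (DHCP option 6 is the DNS server list), shown here only to illustrate the trade-off:

# in the home node's dnsmasq configuration:
# hand clients 8.8.8.8 directly instead of the node's own IP
dhcp-option=6,8.8.8.8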

The best fix in terms of results, but worst in terms of implementation complexity, would be to trigger a script whenever the DHCP client running on the home node gets a DHCP lease, and trigger another script whenever the WAN link status changes to down or the lease expires. This script would then ensure that the policy-based routing rules match the current WAN state. It looks like we might be able to use the procd hotplug ifup/ifdown events to trigger these changes properly.
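A rough sketch of such a hook, assuming the WAN interface is named wan; the helper script invoked below is hypothetical:

# /etc/hotplug.d/iface/99-wan-routing (sketch)
# procd runs iface hotplug scripts with $ACTION and $INTERFACE set
[ "$INTERFACE" = "wan" ] || exit 0

case "$ACTION" in
    ifup)
        # WAN lease acquired: node-originated traffic can use the private table
        /usr/sbin/update-node-routing up    # hypothetical helper
        ;;
    ifdown)
        # WAN down or lease expired: fall back to the public (tunnel) table
        /usr/sbin/update-node-routing down  # hypothetical helper
        ;;
esac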

I am willing to work on this. Just let me know when and where (and send me an IRC message, Signal message or private email if I don't respond here).

@paidforby
Author

I like the idea of writing a hook script; it feels like something that would be useful, especially as we work on a zeroconf build of the firmware. I'm down to work on this, and I could be around sudoroom most of the day Thursday, 4/12, if you or anyone else can work on it then, @Juul.

@Juul
Member

Juul commented Apr 10, 2018 via email

@paidforby
Author

2pm on Thursday works for me; we'll plan on that. Any other node whisperers are welcome to join us.

@bennlich
Collaborator

bennlich commented Apr 10, 2018

> The problem is that the DNS query from Chicken to the upstream DNS servers originates from Chicken's own IP, and any traffic originating from a home node uses the private routing table (rather than the public one), so if there is no direct internet connection on the home node's WAN port, the lookup fails.

I wonder if this is why the exit node mesh IP used to be hardcoded as a DNS server? That would be similar to the first fix you mentioned, right @Juul? And if the exit node was unreachable, I'm guessing lookups would fall back to DNS servers further down the list?

This commit from #23 removed it: sudomesh/makenode@686499d.

If this is the case, maybe another fix would be to dynamically add any connected exit nodes as DNS servers.

Nice catch @paidforby !

@Juul
Member

Juul commented Apr 10, 2018 via email

@jhpoelen
Contributor

jhpoelen commented Apr 10, 2018

In an effort to make (home) nodes independent of specific exit node mesh/public IPs (needed to support multiple, non-conflicting exit nodes), please also consider the changes I made via sudomesh/makenode@a3b243e . For an example of a DHCP hook, please see the existing hook at https://github.com/sudomesh/makenode/blob/master/configs/ar71xx/home_nodes/templates/files/etc/udhcpc.user . I'd be interested to learn the root cause of the issue described here.
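For context, busybox udhcpc invokes that hook with the event name as $1 and the lease details in environment variables, so a hook along these lines (a sketch, not the actual file contents) can react to lease changes:

# /etc/udhcpc.user (sketch) -- sourced by udhcpc's default script
# $1 is the event; $ip, $router, and $dns come from the DHCP lease
case "$1" in
    bound|renew)
        logger "udhcpc: lease $ip via $router (dns: $dns)"
        # adjust DNS / policy routing for the new lease here
        ;;
    deconfig)
        logger "udhcpc: lease lost"
        # revert to tunnel-based routing here
        ;;
esac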

btw - nice catch @paidforby

@paidforby
Author

Using the "sudomesh test bed" (which I am attempting to document here), I've successfully proven that the patch to #23 causes the bug observed in this issue.

While using a pre-patch 23 version of makenode on both the Cow and Chicken nodes (who knows what commit, but the nodes in the sudomesh test bed had already been flashed/makenoded before that patch was committed to makenode), I was able to access the internet (i.e. ping google.com) from any of the nodes. However, after applying patch 23, as instructed here, to the "remote" Chicken node, I observed the same behavior as described in this issue with home nodes flashed/makenoded in the last two weeks.

I suggest that we find a better way of patching bug #23. This may be as simple as modifying the extender node configs to match the changes made by patch 23, or we may need a more complicated hook script, as suggested by @Juul. Whatever the case, @jhpoelen, you should be able to use the sudomesh test bed (in sudo room) to test any theories of your own.

@jhpoelen
Contributor

jhpoelen commented Apr 13, 2018

Thanks for sharing and reproducing the bug. @paidforby, please let me know where I can find access info for the test bed nodes. Also, for everyone: I consider fixing this a collective activity, so please don't wait for me, @paidforby, or @Juul to fix this issue. If you are interested in troubleshooting and finding a solution for this, please do holler and communicate on https://peoplesopen.net/chat . Being part of a community network project is not a spectator sport.

@bennlich
Collaborator

@paidforby nice repro! Also, the extended extender node extension program looks great!

Unfortunately, I didn't bring an extender node w/ me to NYC, but if I had, here's how I'd debug it. Brain dump below.

--

The patch 23 changes are sudomesh/makenode@a3b243e and sudomesh/makenode@686499d.

I suspect the bug was introduced by the single-line deletion in the latter commit: sudomesh/makenode@686499d. This would be very easy to test. Simply add the following line to the top of /etc/resolv.conf.dnsmasq on the home node, and restart dnsmasq:

nameserver [exit-node-mesh-ip-goes-here]

(where [exit-node-mesh-ip-goes-here] is 100.64.0.42 if the tunneling node is connected to psychz or 100.64.0.43 if it is connected to hurricane electric).
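Putting that together, the whole test on the home node would be roughly (using the hurricane electric exit node as an example):

# prepend the exit node as an upstream resolver, then restart dnsmasq
echo "nameserver 100.64.0.43" > /tmp/resolv.new
cat /etc/resolv.conf.dnsmasq >> /tmp/resolv.new
mv /tmp/resolv.new /etc/resolv.conf.dnsmasq
/etc/init.d/dnsmasq restart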

--

Hopefully re-adding the /etc/resolv.conf.dnsmasq line causes WAN traffic to route once again. If not, the next thing I'd try re-adding is the dns options that were removed from /etc/config/network in sudomesh/makenode@a3b243e.
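Those options live in /etc/config/network; re-added, they would look roughly like this (the section name is an assumption based on standard OpenWrt UCI syntax, not the exact makenode template):

# /etc/config/network (sketch; existing options of the section unchanged)
config interface 'mesh'
        list dns '100.64.0.43'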

--

> If you are interested in troubleshooting and finding a solution for this, please do holler and communicate on https://peoplesopen.net/chat. Being part of a community network project is not a spectator sport.

Darn. I dunno what I'm going to do with all of this Coney Island kettle corn...

jk <3 <3 <3

If anyone is planning to work on this with @paidforby's test bed in front of them and wants moral support / a pair of remote eyes and ears, I'd love to work/hack with you (from afar).

paidforby pushed a commit to sudomesh/patches that referenced this issue Apr 16, 2018
paidforby pushed a commit to sudomesh/makenode that referenced this issue Apr 16, 2018
@paidforby
Author

@jhpoelen, @eenblam, and @bennlich, thanks for the help debugging the sudomesh test bed; it looks like we've got it working in some capacity. I tested the patch on deployed nodes (the real-life Cow and Chicken) and it appears to be working well, even with the patch applied only to the Chicken end. I also added the changes to makenode, but have yet to test that commit.

Once it's tested, I would suggest that we tag this commit of makenode and recommend using it in the readme and on our wiki. I'd still like to find the root cause of this bug and understand why it only breaks when extender nodes are involved. Also, this undoes some of the work done on bug #23, so we may need to revisit our solution to the single/double exit node issue.

@gobengo

gobengo commented Apr 18, 2018

I just attempted a home node setup using

The resulting homenode has hostname 'spritzer'. It is not connected to the internet directly, but I'd expect it to mesh with my other node 'annie', which has a cable to a router connected to the internet.
When I connect to spritzer's private SSID (or any other?) and enter a URL in my web browser, the web page never loads. (IIRC it hangs on 'connecting'.)

(Update: This test was not valid. See @eenblam comment below).

Now I will try to rewind makenode to a commit before changes for #23. I'll try https://github.com/sudomesh/makenode/commit/ae172149bdbe822ad85d018d549827125591b12e next.

@eenblam
Member

eenblam commented Apr 18, 2018

@gobengo If I understand you correctly, that's not a bug. Private SSID should only give you internet if you have a WAN connection on that node.

@jhpoelen
Contributor

Please note that v0.2.3 is the latest firmware.

@gobengo

gobengo commented Apr 18, 2018

@eenblam Thanks for pointing that out!

I retested with the same releases, but connecting to the non-private SSIDs. Things worked!


@jhpoelen I noticed that there is no 0.2.3 on the builds server, so I wasn't sure about its status. Why is there no 0.2.3 on builds? https://builds.sudomesh.org/builds/sudowrt/fledgling/

@jhpoelen
Contributor

@gobengo two reasons: 1. the build machine is running out of build space, and 2. Zenodo offers permanent and "free" storage of citable open digital artifacts.

@jhpoelen
Contributor

jhpoelen commented Apr 18, 2018

Given the semi-permanent nature of builds.sudomesh.org (it's a droplet that runs everything from the website to chat) with no backup that I am aware of, we should probably consider at least archiving the binaries for 0.2.2 and 0.2.0 to Zenodo. Please holler with questions about Zenodo.

@jhpoelen
Contributor

jhpoelen commented May 2, 2018

A fix exists in makenode and most nodes have been patched. Closing issue.

@jhpoelen jhpoelen closed this as completed May 2, 2018