
node establishes tunnels, but does not get route to mesh #21

Closed
jhpoelen opened this issue Mar 13, 2018 · 35 comments


@jhpoelen
Contributor

jhpoelen commented Mar 13, 2018

For some reason, a specific node successfully creates a tunneldigger interface, but the exit node fails to create a route for the node. As a result, the client's pings to 100.64.0.42 go unanswered, so the client drops the tunnel and tries to reconnect all over again. As far as I can tell, this only happens with one specific node. The mesh IP address of the home node can be shared privately if needed.

Please see attached log file.

tcpdumpL2TP-2018-03-13.txt

@jhpoelen
Contributor Author

Found while attempting to reproduce #8 .

@paidforby

I'm also observing this on a handful of nodes, patched with #23 and #27. Immediately after flashing/makenodeing (or patching, or even zeroconf'ing), they will appear to work and provide a route to the mesh and the internet via the exit node (either 100.64.0.42 or 100.64.0.43), but after a while the routes will disappear and babel will stop babeling, despite the l2tp tunnel still being successfully established. babeld -i provides the very unexciting output,

root@node_paidforby_:~# babeld -i
Listening on interfaces: l2tp0 mesh2 mesh5 eth0.1 eth0.2 

My id 02:90:a9:ff:fe:0e:36:61 seqno 51954
100.65.99.0/26 metric 128 (exported)

Also tailing /var/log/messages repeatedly shows,

Wed Apr 18 09:05:47 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [126]: checking again in [5]s...

but it also looks like the tunnel is getting shut down and restarted every so often?

Wed Apr 18 09:07:00 2018 daemon.notice netifd: wan (1629): Stopping tunneldigger for interface l2tp0
Wed Apr 18 09:07:00 2018 daemon.warn td-client: Got termination signal, shutting down tunnel...
Wed Apr 18 09:07:00 2018 daemon.notice netifd: wan (1629):   waiting for tunneldigger to stop
Wed Apr 18 09:07:00 2018 user.notice root: Received 'session.down' for interface l2tp0
Wed Apr 18 09:07:01 2018 daemon.notice netifd: wan (1629):   tunneldigger stopped
Wed Apr 18 09:07:01 2018 daemon.notice netifd: wan (1629): Starting tunneldigger on l2tp0
Wed Apr 18 09:07:01 2018 daemon.info td-client: Performing broker selection...
Wed Apr 18 09:07:02 2018 daemon.debug td-client: Broker usage of 64.71.176.94:8942: 255
Wed Apr 18 09:07:02 2018 daemon.info td-client: Selected 64.71.176.94:8942 as the best broker.
Wed Apr 18 09:07:03 2018 daemon.info td-client: Tunnel successfully established.
Wed Apr 18 09:07:03 2018 daemon.info td-client: Requesting the broker to configure downstream bandwidth limit of 99999 kbps.
Wed Apr 18 09:07:03 2018 user.notice root: Received 'session.up' for interface l2tp0
Wed Apr 18 09:07:03 2018 kern.warn kernel: [  794.590000] HTB: quantum of class 10001 is big. Consider r2q change.
Wed Apr 18 09:07:04 2018 user.notice root: New mtu=1488
Wed Apr 18 09:07:11 2018 daemon.notice netifd: wan (1629): brctl: bridge br-open: Device or resource busy
Wed Apr 18 09:07:11 2018 daemon.notice netifd: wan (1629): brctl: bridge br-open: Device or resource busy
Wed Apr 18 09:07:12 2018 daemon.notice netifd: wan (1629): Starting in fungible mode
Wed Apr 18 09:07:13 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [141]: checking again in [5]s...
Wed Apr 18 09:07:18 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [142]: checking again in [5]s...
Wed Apr 18 09:07:23 2018 daemon.info td-client: Setting MTU to 1446
Wed Apr 18 09:07:23 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [143]: checking again in [5]s...
Wed Apr 18 09:07:24 2018 daemon.info td-client: Setting MTU to 1446
Wed Apr 18 09:07:28 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [144]: checking again in [5]s...

This looks suspiciously like the output produced by /etc/init.d/meshrouting. I'm going to attempt re-adding the single-line deletion suggested by the patch to #23 and see if that changes anything.

@paidforby

Or maybe it's udhcp.user. Either way, I'll revert one and then the other to find the cause. Or maybe the exit node is misbehaving?

@jhpoelen
Contributor Author

@paidforby I suggest troubleshooting the exit node first to get a complete picture.

@paidforby

I do not have access to the exit nodes. My node is now working after forcing it to use the psychz exit node; not sure if there is a problem with HE. But I also reverted the patch from #23, so it's hard to tell what fixed what. Will do more debugging and report back.

@jhpoelen
Contributor Author

@paidforby perhaps time to get access to the exit nodes. I can confirm it is an issue on the HE exit node.

@paidforby

Cool, yes, I can ping 100.64.0.43 again. Any ideas on the cause? Is it related to #24? You can PM me to give me access so I can work on that bug, thanks.

@jhpoelen
Contributor Author

jhpoelen commented Apr 18, 2018

It appears that a "dirty" pid file, /var/run/babeld.pid, caused the issue. After cleaning up the pid file, babeld started ok.

$ /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf 
Starting in fungible mode
creat(/var/run/babeld.pid): File exists
$ ps -ef | grep babeld
xxx 19757 19436  0 10:41 pts/0    00:00:00 grep babeld
$ sudo rm /var/run/babeld.pid
$ sudo service babeld status
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2018-04-17 18:42:30 PDT; 15h ago
  Process: 10452 ExecStart=/usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf (code=exited, status=1/FAILURE)
 Main PID: 10452 (code=exited, status=1/FAILURE)
$ sudo service babeld start
$ sudo service babeld status
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-04-18 10:41:38 PDT; 2s ago
 Main PID: 19826 (babeld)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/babeld.service
           └─19826 /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf

Apr 18 10:41:38 exit0 systemd[1]: Started babeld.
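As an aside for anyone hitting this: the "File exists" error above suggests babeld creates its pid file exclusively and refuses to start if one is already present. A minimal shell sketch of that failure mode (paths and pid are illustrative, not babeld's actual code):

```shell
# Illustrative only: simulate a daemon that refuses to start when its
# pid file already exists (matching the "creat(...): File exists" error).
PIDDIR=$(mktemp -d)
PIDFILE="$PIDDIR/babeld.pid"

echo 17629 > "$PIDFILE"        # stale pid left behind by a crash

# set -C (noclobber) mimics an exclusive O_CREAT|O_EXCL create:
if ( set -C; echo $$ > "$PIDFILE" ) 2>/dev/null; then
  echo "started"
else
  echo "creat($PIDFILE): File exists"
fi

rm -f "$PIDFILE"               # the manual cleanup step shown above
```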

@bennlich
Collaborator

@jhpoelen was babeld not running on the HE exit node?

@jhpoelen
Contributor Author

yep, as shown in

$ ps -ef | grep babeld
xxx 19757 19436  0 10:41 pts/0    00:00:00 grep babeld
$ 

@bennlich
Collaborator

Saw that. Was confused by the order. Wondering why you tried to start it manually with

/usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf 

instead of with service start as you did later. I guess babeld was not running and was failing to start w/ service command--is that right?

@jhpoelen
Contributor Author

yep. tried starting manually to see what was going on and eliminate systemd from the equation.

@bennlich
Collaborator

Got it. Any sign of why it died?

@jhpoelen
Contributor Author

please switch to rc

@bennlich
Collaborator

Here's the log excerpt when babeld died on HE:

Apr 17 18:30:18 exit0 systemd[1]: Started babeld-monitor.
Apr 17 18:30:18 exit0 babeld-monitor.sh[10257]: found no [Cannot allocate memory] error entry or babeld is not running
Apr 17 18:39:08 exit0 babeld[17629]: Warning: cannot save old configuration for l2tp745-745.
Apr 17 18:40:18 exit0 systemd[1]: Started babeld-monitor.
Apr 17 18:40:18 exit0 babeld-monitor.sh[10333]: found no [Cannot allocate memory] error entry or babeld is not running
Apr 17 18:40:26 exit0 babeld[17629]: removing: l2tp745-745
Apr 17 18:40:34 exit0 babeld[17629]: Warning: cannot save old configuration for l2tp746-746.
Apr 17 18:42:17 exit0 babeld[17629]: removing: l2tp746-746
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Main process exited, code=killed, status=11/SEGV
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Unit entered failed state.
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Failed with result 'signal'.
Apr 17 18:42:28 exit0 kernel: [2693589.502208] traps: babeld[17629] general protection ip:556562bea3d0 sp:7ffec3777290 error:0
Apr 17 18:42:28 exit0 kernel: [2693589.502215]  in babeld[556562be0000+16000]
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Service hold-off time over, scheduling restart.
Apr 17 18:42:28 exit0 systemd[1]: Stopped babeld.
Apr 17 18:42:28 exit0 systemd[1]: Started babeld.
Apr 17 18:42:28 exit0 babeld[10443]: creat(/var/run/babeld.pid): File exists
Apr 17 18:42:28 exit0 babeld[10443]: Starting in fungible mode
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Main process exited, code=exited, status=1/FAILURE
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Unit entered failed state.
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Failed with result 'exit-code'.
Apr 17 18:42:29 exit0 systemd[1]: babeld.service: Service hold-off time over, scheduling restart.
Apr 17 18:42:29 exit0 systemd[1]: Stopped babeld.

So I guess babeld segfaulted (general protection fault)?

@jhpoelen
Contributor Author

Interesting . . . seems consistent with the idea that the babeld shutdown process crashed before removing the pid file. Agree?

@bennlich
Collaborator

bennlich commented Apr 19, 2018

Yeah. I guess babeld is responsible for writing/removing its own pid file. Might be nice if systemd were in charge of that. Shucks.

As a team, I'd say we're not bad at taking over issue threads with solid work aimed at solving totally unrelated issues :-P

jhpoelen pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
@jhpoelen
Contributor Author

Added a force clean of the pid file in babeld-monitor and patched the exit nodes. Please re-open this issue if you feel that it has not been addressed.

@bennlich
Collaborator

Hmm. I thought you originally opened this issue about a sad node that was never able to create a route to its exit node despite successfully digging a tunnel. I think the bug we've been discussing since (#21 (comment)) is unrelated?

@jhpoelen
Contributor Author

Agree. The symptoms were the same, but both root causes seem to have been fixed. The first was fixed by removing the need for a "ping" in the reconnect procedure. The second I just (force) fixed by ensuring that the pid file gets removed.

@jhpoelen
Contributor Author

@bennlich btw - the symptom seems consistent with the "rogue" node behavior - continuously reconnecting every 5 minutes after establishing a tunnel.

@bennlich
Collaborator

The first was fixed by removing the need for a "ping" in the reconnect procedure.

Okay so fixed by sudomesh/makenode@a3b243e.

jhpoelen pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
Revert "force clean babeld pid. related to sudomesh/bugs#21 (comment)"

This reverts commit 99cddc2.
@jhpoelen jhpoelen reopened this Apr 19, 2018
@jhpoelen
Contributor Author

jhpoelen commented Apr 19, 2018

@bennlich suggested fixing the second root cause as follows:


Am thinking the fix might be to stop letting babeld manage its own pid file. From https://github.com/jech/babeld/blob/master/babeld.c#L396-L422 and the babeld manual, it looks like we might be able to do this by passing -I '' as an argument when starting babeld (i.e. the name of your pid file is null).


Hoping that someone other than myself can apply this magic to the relevant systemd service files and patch the exit nodes.
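If we go that route, the change is presumably just the ExecStart line in the unit file (sketch only; the -I empty-argument behavior is as described in the babeld manual quoted above, and the binary/config paths are taken from the logs earlier in this thread):

```ini
[Service]
# Pass -I '' so babeld never writes a pid file at all.
ExecStart=/usr/local/bin/babeld -I '' -F -S /var/lib/babeld/state -c /etc/babeld.conf
```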

@paidforby

Per @bennlich's suggestion, I added the -I switch to the systemd service. This means babeld no longer creates its own pid file, but I'm not sure this is the best solution. It may solve the dirty pid issue, but now we have a process with no stored identifier. Looking into systemd, it seems it is always up to the service to create/manage its own pid file; only when the service fails to do this (as we were seeing here) should systemd take over and clean up the dirty pid. Perhaps we just need to tell systemd where to find babeld's pid file, with something like PIDFile=/var/run/babeld.pid under the [Service] section in babeld.service. No idea if this addition will address the bug on HE; not sure how to reproduce on my own exit node. @jhpoelen's original solution still seems like the most surefire way of addressing the issue.
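A sketch of what that unit might look like (the name and paths mirror the logs in this thread; the actual file on the exit nodes may differ):

```ini
# /etc/systemd/system/babeld.service (sketch)
[Unit]
Description=babeld

[Service]
Type=simple
ExecStart=/usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf
# Tell systemd where babeld writes its pid, so systemd can remove a stale
# pid file left by a crash before restarting the service.
PIDFile=/var/run/babeld.pid
Restart=on-failure

[Install]
WantedBy=multi-user.target
```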

paidforby pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
paidforby pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
@bennlich
Collaborator

Perhaps we just need to tell systemd where to find babeld's pid file, with something like, PIDFile=/var/run/babeld.pid under the [Service] in babeld.service.

Yes! This sounds better.

@paidforby I think you should be able to repro the busted behavior by kill -9ing babeld on your exit node. Babeld should die before it has time to clean up the PID file, and systemd should fail to restart it (this should be visible in the logs).

This seems to support the idea of adding the PIDFile option to the service definition: https://unix.stackexchange.com/questions/256125/whats-the-best-way-to-remove-pid-files-before-starting-a-service
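A sandboxed stand-in for that repro (plain shell, no systemd or babeld involved): SIGKILL cannot be trapped, so a process's cleanup handler never runs and its pid file is left behind.

```shell
# Stand-in "daemon": writes a pid file and installs a cleanup trap.
PIDFILE=$(mktemp -d)/fake-babeld.pid

( trap 'rm -f "$PIDFILE"' TERM EXIT
  echo $$ > "$PIDFILE"
  sleep 5 ) >/dev/null 2>&1 &
CHILD=$!
sleep 1                      # give the child time to write its pid file

kill -9 "$CHILD"             # SIGKILL: the TERM/EXIT trap never runs
wait "$CHILD" 2>/dev/null

# The stale pid file survives, just like /var/run/babeld.pid did:
[ -e "$PIDFILE" ] && echo "stale pid file left behind"
```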

@paidforby

paidforby commented Apr 19, 2018

I made the change I suggested, but further research suggests that this option is only useful when applied to a forking service. Is there any reason babeld.service is defined as Type=simple as opposed to Type=forking? Maybe this is the root cause? Thanks @bennlich, I'll try reproducing.

@paidforby

The latest commit, sudomesh/exitnode@920847d, appears to solve the issue, assuming that kill -9 is an adequate repro. babeld does not appear to like being started as a forking service; it gets stuck in an activating state. Will leave it as the simple type and just add the PIDFile line. Patching both exit nodes now.

@bennlich
Collaborator

bennlich commented Apr 19, 2018

babeld is not type=forking because it does not use fork to spawn the main babeld process (at least, I see no mention of fork here: https://github.com/jech/babeld/blob/master/babeld.c).

I think what I read online suggests that the PIDFile option is needed when using systemd to start a forking process; otherwise systemd will think the main process has exited when really only the bootstrapping startup process has exited (after using fork to start the real process). Does this make sense?

In our case, there's no forking going on, but babeld was written to explicitly check for an existing PIDfile to avoid running multiple babeld daemons at once. It sounds like including the PIDfile option in the service definition might achieve our goal, which is to get systemd to clean up the PID file when it notices the babeld process has terminated.

@bennlich
Collaborator

latest commit sudomesh/exitnode@920847d appears to solve the issue assuming that kill -9 is an adequate repro.

@paidforby did you test kill -9 before applying the patch to see if it was an adequate test? I.e. to see if babeld did fail to restart afterward?

@paidforby

@bennlich makes perfect sense, thanks for the explanation!
And yes, I tested it before and after the patch. Before, babeld fails to restart until you delete the PIDfile. After the patch, babeld restarts all by itself, no problem. I may destroy my exit node and spin up a new one with the latest commit to make sure I didn't make any silly mistakes. But otherwise it seems to work fine.

@bennlich
Collaborator

bennlich commented Apr 19, 2018

babeld does not appear to like being started as a forking service. It gets stuck in an activating state.

Yes, this makes sense. With type=forking, systemd waits for the process to exit before considering the service "started". It assumes that the command you told it to execute is for a process that spins off a separate process (via fork) and then exits.

To be clear, type=forking does not change how systemd executes the process. It just changes the way systemd treats/observes the process once executed.

@jhpoelen
Contributor Author

Glad to hear that systemd is doing the cleanup now that the pid reference is included in the .service file. Also, am pretty excited about the relay-style handling of the issue. Re-closing issue.

@paidforby

HE and psychz successfully patched and tested with kill -9. Closing issue; reopen if you feel the new solution is not adequate.
Side note: I had to manually start babeld after rebooting the exit nodes. Is this normal?

@jhpoelen
Contributor Author

@paidforby babeld should start automatically. Did you follow the specific reboot instructions on the "old" exit node?

@paidforby

paidforby commented Apr 19, 2018

@jhpoelen yes, but it happened on both the "old" and the "new" exit node. Not sure if it's related to this patch; here's what I saw after rebooting both of them,

paidforby@exit:~$ sudo service babeld status
[sudo] password for paidforby: 
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled)
   Active: inactive (dead)
paidforby@exit:~$ sudo service babeld start
paidforby@exit:~$ sudo service babeld status
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled)
   Active: active (running) since Thu 2018-04-19 14:52:20 EDT; 1s ago
 Main PID: 969 (babeld)
   CGroup: /system.slice/babeld.service
           └─969 /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf
