
node establishes tunnels, but does not get route to mesh #21

Closed
jhpoelen opened this issue Mar 13, 2018 · 35 comments


@jhpoelen
Contributor

jhpoelen commented Mar 13, 2018

For some reason, a specific node successfully creates a tunneldigger interface, but the exit node fails to create a route for the node. As a result, the client's pings to 100.64.0.42 go unanswered, so the client drops the tunnel and tries to reconnect all over again. As far as I can tell, this only happens with one specific node. The mesh IP address of the home node can be shared privately if needed.

Please see attached log file.

tcpdumpL2TP-2018-03-13.txt

@jhpoelen
Contributor Author

Found while attempting to reproduce #8 .

@paidforby

I'm also observing this on a handful of nodes, patched with #23 and #27. Immediately after flashing/makenodeing (or patching, or even zeroconf'ing), they will appear to work and provide a route to the mesh and the internet via the exit node (either 100.64.0.42 or 100.64.0.43), but after a while the routes will disappear and babel will stop babeling, despite the l2tp tunnel still being successfully established. babeld -i provides the very unexciting output,

root@node_paidforby_:~# babeld -i
Listening on interfaces: l2tp0 mesh2 mesh5 eth0.1 eth0.2 

My id 02:90:a9:ff:fe:0e:36:61 seqno 51954
100.65.99.0/26 metric 128 (exported)

Also tailing /var/log/messages repeatedly shows,

Wed Apr 18 09:05:47 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [126]: checking again in [5]s...

but it also looks like the tunnel is getting shut down and restarted every so often?

Wed Apr 18 09:07:00 2018 daemon.notice netifd: wan (1629): Stopping tunneldigger for interface l2tp0
Wed Apr 18 09:07:00 2018 daemon.warn td-client: Got termination signal, shutting down tunnel...
Wed Apr 18 09:07:00 2018 daemon.notice netifd: wan (1629):   waiting for tunneldigger to stop
Wed Apr 18 09:07:00 2018 user.notice root: Received 'session.down' for interface l2tp0
Wed Apr 18 09:07:01 2018 daemon.notice netifd: wan (1629):   tunneldigger stopped
Wed Apr 18 09:07:01 2018 daemon.notice netifd: wan (1629): Starting tunneldigger on l2tp0
Wed Apr 18 09:07:01 2018 daemon.info td-client: Performing broker selection...
Wed Apr 18 09:07:02 2018 daemon.debug td-client: Broker usage of 64.71.176.94:8942: 255
Wed Apr 18 09:07:02 2018 daemon.info td-client: Selected 64.71.176.94:8942 as the best broker.
Wed Apr 18 09:07:03 2018 daemon.info td-client: Tunnel successfully established.
Wed Apr 18 09:07:03 2018 daemon.info td-client: Requesting the broker to configure downstream bandwidth limit of 99999 kbps.
Wed Apr 18 09:07:03 2018 user.notice root: Received 'session.up' for interface l2tp0
Wed Apr 18 09:07:03 2018 kern.warn kernel: [  794.590000] HTB: quantum of class 10001 is big. Consider r2q change.
Wed Apr 18 09:07:04 2018 user.notice root: New mtu=1488
Wed Apr 18 09:07:11 2018 daemon.notice netifd: wan (1629): brctl: bridge br-open: Device or resource busy
Wed Apr 18 09:07:11 2018 daemon.notice netifd: wan (1629): brctl: bridge br-open: Device or resource busy
Wed Apr 18 09:07:12 2018 daemon.notice netifd: wan (1629): Starting in fungible mode
Wed Apr 18 09:07:13 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [141]: checking again in [5]s...
Wed Apr 18 09:07:18 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [142]: checking again in [5]s...
Wed Apr 18 09:07:23 2018 daemon.info td-client: Setting MTU to 1446
Wed Apr 18 09:07:23 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [143]: checking again in [5]s...
Wed Apr 18 09:07:24 2018 daemon.info td-client: Setting MTU to 1446
Wed Apr 18 09:07:28 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [144]: checking again in [5]s...

This looks suspiciously like the output produced by /etc/init.d/meshrouting. I'm going to attempt re-adding the single-line deletion suggested by the patch to #23 and see if that changes anything.

@paidforby

Or maybe it's udhcp.user. Either way, I'll revert one and then the other to find the cause. Or maybe the exit node is misbehaving?

@jhpoelen
Contributor Author

@paidforby I suggest troubleshooting the exit node first to get a complete picture.

@paidforby

I do not have access to the exit nodes. My node is now working after forcing it to use the psychz exit node; not sure if there is a problem with HE. But I also reverted the patch from #23, so it's hard to tell what fixed what. Will do more debugging and report back.

@jhpoelen
Contributor Author

@paidforby perhaps time to get access to the exit nodes. I can confirm it is an issue on the HE exit node.

@paidforby

Cool, yes, I can ping 100.64.0.43 again. Any ideas on the cause? Is it related to #24? You can PM me to give me access so I can work on that bug, thanks.

@jhpoelen
Contributor Author

jhpoelen commented Apr 18, 2018

It appears that a "dirty" pid file, /var/run/babeld.pid, caused the issue. After cleaning up the pid file, babeld started ok.

$ /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf 
Starting in fungible mode
creat(/var/run/babeld.pid): File exists
$ ps -ef | grep babeld
xxx 19757 19436  0 10:41 pts/0    00:00:00 grep babeld
$ sudo rm /var/run/babeld.pid
$ sudo service babeld status
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2018-04-17 18:42:30 PDT; 15h ago
  Process: 10452 ExecStart=/usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf (code=exited, status=1/FAILURE)
 Main PID: 10452 (code=exited, status=1/FAILURE)
$ sudo service babeld start
$ sudo service babeld status
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-04-18 10:41:38 PDT; 2s ago
 Main PID: 19826 (babeld)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/babeld.service
           └─19826 /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf

Apr 18 10:41:38 exit0 systemd[1]: Started babeld.
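As an aside for anyone hitting this: the "File exists" error above suggests babeld creates its pid file exclusively and refuses to start if one is already present. A minimal shell sketch of that failure mode (paths and pid are illustrative, not babeld's actual code):

```shell
# Illustrative only: simulate a daemon that refuses to start when its
# pid file already exists (matching the "creat(...): File exists" error).
PIDDIR=$(mktemp -d)
PIDFILE="$PIDDIR/babeld.pid"

echo 17629 > "$PIDFILE"        # stale pid left behind by a crash

# set -C (noclobber) mimics an exclusive O_CREAT|O_EXCL create:
if ( set -C; echo $$ > "$PIDFILE" ) 2>/dev/null; then
  echo "started"
else
  echo "creat($PIDFILE): File exists"
fi

rm -f "$PIDFILE"               # the manual cleanup step shown above
```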

@bennlich
Collaborator

@jhpoelen was babeld not running on the HE exit node?

@jhpoelen
Contributor Author

yep, as shown in

$ ps -ef | grep babeld
xxx 19757 19436  0 10:41 pts/0    00:00:00 grep babeld
$ 

@bennlich
Collaborator

Saw that. Was confused by the order. Wondering why you tried to start it manually with

/usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf 

instead of with service start as you did later. I guess babeld was not running and was failing to start w/ service command--is that right?

@jhpoelen
Contributor Author

yep. tried starting manually to see what was going on and eliminate systemd from the equation.

@bennlich
Collaborator

Got it. Any sign of why it died?

@jhpoelen
Contributor Author

please switch to rc

@bennlich
Collaborator

Here's the log excerpt when babeld died on HE:

Apr 17 18:30:18 exit0 systemd[1]: Started babeld-monitor.
Apr 17 18:30:18 exit0 babeld-monitor.sh[10257]: found no [Cannot allocate memory] error entry or babeld is not running
Apr 17 18:39:08 exit0 babeld[17629]: Warning: cannot save old configuration for l2tp745-745.
Apr 17 18:40:18 exit0 systemd[1]: Started babeld-monitor.
Apr 17 18:40:18 exit0 babeld-monitor.sh[10333]: found no [Cannot allocate memory] error entry or babeld is not running
Apr 17 18:40:26 exit0 babeld[17629]: removing: l2tp745-745
Apr 17 18:40:34 exit0 babeld[17629]: Warning: cannot save old configuration for l2tp746-746.
Apr 17 18:42:17 exit0 babeld[17629]: removing: l2tp746-746
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Main process exited, code=killed, status=11/SEGV
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Unit entered failed state.
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Failed with result 'signal'.
Apr 17 18:42:28 exit0 kernel: [2693589.502208] traps: babeld[17629] general protection ip:556562bea3d0 sp:7ffec3777290 error:0
Apr 17 18:42:28 exit0 kernel: [2693589.502215]  in babeld[556562be0000+16000]
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Service hold-off time over, scheduling restart.
Apr 17 18:42:28 exit0 systemd[1]: Stopped babeld.
Apr 17 18:42:28 exit0 systemd[1]: Started babeld.
Apr 17 18:42:28 exit0 babeld[10443]: creat(/var/run/babeld.pid): File exists
Apr 17 18:42:28 exit0 babeld[10443]: Starting in fungible mode
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Main process exited, code=exited, status=1/FAILURE
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Unit entered failed state.
Apr 17 18:42:28 exit0 systemd[1]: babeld.service: Failed with result 'exit-code'.
Apr 17 18:42:29 exit0 systemd[1]: babeld.service: Service hold-off time over, scheduling restart.
Apr 17 18:42:29 exit0 systemd[1]: Stopped babeld.

So I guess babeld segfaulted (general protection fault)?

@jhpoelen
Contributor Author

Interesting . . . seems consistent with the idea that the babeld shutdown process crashed before removing the pid file. Agree?

@bennlich
Collaborator

bennlich commented Apr 19, 2018

Yeah. I guess babeld is responsible for writing/removing its own pid file. Might be nice if systemd were in charge of that. Shucks.

As a team, I'd say we're not bad at taking over issue threads with solid work aimed at solving totally unrelated issues :-P

jhpoelen pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
@jhpoelen
Contributor Author

Added a force clean of the pid file in babeld-monitor and patched the exit nodes. Please re-open this issue if you feel that it has not been addressed.

@bennlich
Collaborator

Hmm. I thought you originally opened this issue about a sad node that was never able to create a route to its exit node despite successfully digging a tunnel. I think the bug we've been discussing since (#21 (comment)) is unrelated?

@jhpoelen
Contributor Author

Agree. The symptoms were the same, but both root causes seem to have been fixed. The first was fixed by removing the need for a "ping" in the reconnect procedure. The second I just (force) fixed by ensuring that the pid file gets removed.

@jhpoelen
Contributor Author

@bennlich btw - the symptom seems consistent with the "rogue" node behavior - continuously reconnecting every 5 minutes after establishing a tunnel.

@bennlich
Collaborator

The first was fixed by removing the need for a "ping" in the reconnect procedure.

Okay so fixed by sudomesh/makenode@a3b243e.

jhpoelen pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
Revert "force clean babeld pid. related to sudomesh/bugs#21 (comment)"

This reverts commit 99cddc2.
@jhpoelen jhpoelen reopened this Apr 19, 2018
@jhpoelen
Contributor Author

jhpoelen commented Apr 19, 2018

@bennlich suggested fixing the second root cause as follows:


Am thinking the fix might be to stop letting babeld manage its own pid file. From https://github.com/jech/babeld/blob/master/babeld.c#L396-L422 and the babeld manual, it looks like we might be able to do this by passing -I '' as an argument when starting babeld (i.e. the name of your pid file is null).


Hoping that someone other than myself can apply this magic to the relevant systemd service files and patch the exit nodes.
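If we go that route, the change is presumably just the ExecStart line in the unit file (sketch only; the -I empty-argument behavior is as described in the babeld manual quoted above, and the binary/config paths are taken from the logs earlier in this thread):

```ini
[Service]
# Pass -I '' so babeld never writes a pid file at all.
ExecStart=/usr/local/bin/babeld -I '' -F -S /var/lib/babeld/state -c /etc/babeld.conf
```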

@paidforby

Per @bennlich's suggestion, I added the -I switch to the systemd service. This means babeld no longer creates its own pid file, but I'm not sure this is the best solution. It may solve the dirty pid issue, but now we have a process with no stored identifier. Looking into systemd, it seems it is always up to the service to create/manage its own pid file; only when the service fails to do this (as we were seeing here) should systemd take over and clean up the dirty pid. Perhaps we just need to tell systemd where to find babeld's pid file, with something like PIDFile=/var/run/babeld.pid under the [Service] section in babeld.service. No idea if this addition will address the bug on HE; not sure how to reproduce on my own exit node. @jhpoelen's original solution still seems like the most surefire way of addressing the issue.
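A sketch of what that unit might look like (the name and paths mirror the logs in this thread; the actual file on the exit nodes may differ):

```ini
# /etc/systemd/system/babeld.service (sketch)
[Unit]
Description=babeld

[Service]
Type=simple
ExecStart=/usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf
# Tell systemd where babeld writes its pid, so systemd can remove a stale
# pid file left by a crash before restarting the service.
PIDFile=/var/run/babeld.pid
Restart=on-failure

[Install]
WantedBy=multi-user.target
```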

paidforby pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
paidforby pushed a commit to sudomesh/exitnode that referenced this issue Apr 19, 2018
@bennlich
Collaborator

Perhaps we just need to tell systemd where to find babeld's pid file, with something like, PIDFile=/var/run/babeld.pid under the [Service] in babeld.service.

Yes! This sounds better.

@paidforby I think you should be able to repro the busted behavior by kill -9ing babeld on your exit node. Babeld should die before it has time to clean up the PID file, and systemd should fail to restart it (this should be visible in the logs).

This seems to support the idea of adding the PIDFile option to the service definition: https://unix.stackexchange.com/questions/256125/whats-the-best-way-to-remove-pid-files-before-starting-a-service
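A sandboxed stand-in for that repro (plain shell, no systemd or babeld involved): SIGKILL cannot be trapped, so a process's cleanup handler never runs and its pid file is left behind.

```shell
# Stand-in "daemon": writes a pid file and installs a cleanup trap.
PIDFILE=$(mktemp -d)/fake-babeld.pid

( trap 'rm -f "$PIDFILE"' TERM EXIT
  echo $$ > "$PIDFILE"
  sleep 5 ) >/dev/null 2>&1 &
CHILD=$!
sleep 1                      # give the child time to write its pid file

kill -9 "$CHILD"             # SIGKILL: the TERM/EXIT trap never runs
wait "$CHILD" 2>/dev/null

# The stale pid file survives, just like /var/run/babeld.pid did:
[ -e "$PIDFILE" ] && echo "stale pid file left behind"
```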

@paidforby

paidforby commented Apr 19, 2018

I made the change I suggested, but further research suggests that this option is only useful when applied to a forking service. Is there any reason babeld.service is defined as Type=simple as opposed to Type=forking? Maybe this is the root cause? Thanks @bennlich, I'll try reproducing.

@paidforby

The latest commit, sudomesh/exitnode@920847d, appears to solve the issue, assuming that kill -9 is an adequate repro. babeld does not appear to like being started as a forking service; it gets stuck in an activating state. Will leave it as the simple type and just add the PIDFile line. Patching both exit nodes now.

@bennlich
Collaborator

bennlich commented Apr 19, 2018

babeld is not type=forking because it does not use fork to spawn the main babeld process (at least, I see no mention of fork here: https://github.com/jech/babeld/blob/master/babeld.c).

I think what I read online suggests that the PIDFile option is needed when using systemd to start a forking process; otherwise systemd will think the main process has exited when really only the bootstrapping startup process has exited (after using fork to start the real process). Does this make sense?

In our case, there's no forking going on, but babeld was written to explicitly check for an existing PIDfile to avoid running multiple babeld daemons at once. It sounds like including the PIDfile option in the service definition might achieve our goal, which is to get systemd to clean up the PID file when it notices the babeld process has terminated.

@bennlich
Collaborator

latest commit sudomesh/exitnode@920847d appears to solve the issue assuming that kill -9 is an adequate repro.

@paidforby did you test kill -9 before applying the patch to see if it was an adequate test? I.e. to see if babeld did fail to restart afterward?

@paidforby

@bennlich makes perfect sense, thanks for the explanation!
And yes, I tested it before and after the patch. Before, babeld fails to restart until you delete the PIDfile. After the patch, babeld restarts all by itself, no problem. I may destroy my exit node and spin up a new one with the latest commit to make sure I didn't make any silly mistakes. But otherwise it seems to work fine.

@bennlich
Collaborator

bennlich commented Apr 19, 2018

babeld does not appear to like being started as a forking service. It gets stuck in an activating state.

Yes, this makes sense. With type=forking, systemd waits for the process to exit before considering the service "started". It assumes that the command you told it to execute is for a process that spins off a separate process (via fork) and then exits.

To be clear, type=forking does not change how systemd executes the process. It just changes the way systemd treats/observes the process once executed.

@jhpoelen
Contributor Author

Glad to hear that systemd is doing the cleanup now that the pid reference is included in the .service file. Also, am pretty excited about the relay-style handling of the issue. Re-closing issue.

@paidforby

HE and psychz successfully patched and tested with kill -9. Closing issue; reopen if you feel the new solution is not adequate.
Side note: I had to manually start babeld after rebooting the exit nodes. Is this normal?

@jhpoelen
Contributor Author

@paidforby babeld should start automatically. Did you follow the specific reboot instructions on the "old" exit node?

@paidforby

paidforby commented Apr 19, 2018

@jhpoelen yes, but it happened on both the "old" and the "new" exit node. Not sure if it's related to this patch; here's what I saw after rebooting both of them,

paidforby@exit:~$ sudo service babeld status
[sudo] password for paidforby: 
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled)
   Active: inactive (dead)
paidforby@exit:~$ sudo service babeld start
paidforby@exit:~$ sudo service babeld status
● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled)
   Active: active (running) since Thu 2018-04-19 14:52:20 EDT; 1s ago
 Main PID: 969 (babeld)
   CGroup: /system.slice/babeld.service
           └─969 /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babeld.conf
