node establish tunnels, but does not get route to mesh #21
Comments
Found while attempting to reproduce #8.
I'm also observing this on a handful of nodes patched with #23 and #27. Immediately after flashing/makenode-ing (or patching, or even zeroconf'ing), they appear to work and provide a route to the mesh and the internet via the exit node (either 100.64.0.42 or 100.64.0.43), but after a while the routes disappear and babel stops babeling, despite the l2tp tunnel still being successfully established. babeld -i provides the very unexciting output,
Also tailing
but it also looks like the tunnel is getting shut down and restarted every so often?
This looks suspiciously like the output produced by
Or maybe it's
@paidforby I suggest troubleshooting the exit node first to get a complete picture.
I do not have access to the exit nodes. My node is now working after forcing it to use the psychz exit node, so I'm not sure if there is a problem with HE. But I also reverted patch 23, so it's hard to tell what fixed what. Will do more debugging and report back.
@paidforby perhaps it's time to get access to the exit nodes. I can confirm it is an issue on the HE exit node.
Cool, yes, I can ping
It appears that a "dirty" pid file
@jhpoelen was babeld not running on the HE exit node?
yep, as shown in
Saw that. Was confused by the order. Wondering why you tried to start it manually with
instead of with service start as you did later. I guess babeld was not running and was failing to start w/ service command--is that right?
yep. tried starting manually to see what was going on and eliminate systemd from the equation.
Got it. Any sign of why it died?
please switch to rc
Here's the log excerpt when babeld died on HE:
So I guess it was a kernel panic?
Interesting . . . seems consistent with the idea that the babeld shutdown process crashed before removing the pid file. Agree?
Yeah. I guess babeld is responsible for writing/removing its own pid file. Might be nice if systemd were in charge of that. Shucks. As a team, I'd say we're not bad at taking over issue threads with solid work aimed at solving totally unrelated issues :-P
Added a forced pid file cleanup to babeld-monitor and patched the exit nodes. Please re-open this issue if you feel that it has not been addressed.
Hmm. I thought you originally opened this issue about a sad node that was never able to create a route to its exit node despite succeeding in digging a tunnel. I think the bug we've been discussing since (#21 (comment)) is not related?
Agree. The symptoms were the same, and both root causes seem to have been fixed. The first was fixed by removing the need for a "ping" in the reconnect procedure. The second one I just (force) fixed by ensuring that the pid file gets removed.
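To make the "force clean" idea concrete, here is a minimal Python sketch of a stale pid file check. It is not the code that went into babeld-monitor; the function names and the pid file path passed in are hypothetical, and it assumes the pid file contains a single decimal PID.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_pid_file(pid_file: Path) -> bool:
    """Delete pid_file unless it points at a live process.

    Returns True if a stale file was removed, False otherwise.
    """
    try:
        content = pid_file.read_text().strip()
    except FileNotFoundError:
        return False  # nothing to clean up
    try:
        pid = int(content)
    except ValueError:
        pid = None  # unparseable content is treated as stale
    if pid is not None and pid > 0 and pid_is_alive(pid):
        return False  # a daemon really is running, leave the file alone
    pid_file.unlink(missing_ok=True)
    return True
```

A monitor script could call `remove_stale_pid_file(Path(...))` with the daemon's pid file path before attempting a restart. Note that this only checks that some process has that PID, not that the process is actually babeld.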
@bennlich btw - the symptom seems consistent with the "rogue" node behavior - continuously reconnecting every 5 minutes after establishing a tunnel.
Okay so fixed by sudomesh/makenode@a3b243e.
Revert "force clean babeld pid. related to sudomesh/bugs#21 (comment)" This reverts commit 99cddc2.
@bennlich suggested fixing the second root cause as follows: "Am thinking the fix might be to stop letting babeld manage its own pid file. From https://github.com/jech/babeld/blob/master/babeld.c#L396-L422 and the babeld manual, it looks like we might be able to do this by passing -I '' as an argument when starting babeld (i.e. the name of your pid file is null)." Hoping that someone other than myself can apply this magic to the relevant systemd service files and patch the exit nodes.
Per @bennlich's suggestion, I added the
…/bugs#21 (comment)" This reverts commit b61cad0.
Yes! This sounds better. @paidforby I think you should be able to repro the busted behavior by
This seems to support the idea to add the PIDfile option to the service definition: https://unix.stackexchange.com/questions/256125/whats-the-best-way-to-remove-pid-files-before-starting-a-service
I made the change I suggested, but further research suggests that this option is only useful when applied to a forking service? Is there any reason
latest commit sudomesh/exitnode@920847d appears to solve the issue assuming that
babeld is not
I think what I read online suggests that the PIDfile option is needed when using systemd to start a forking process, otherwise systemd will think the main process has exited when really only the bootstrapping startup process has exited (after using
In our case, there's no forking going on, but babeld was written to explicitly check for an existing PIDfile to avoid running multiple babeld daemons at once. It sounds like including the PIDfile option in the service definition might achieve our goal, which is to get systemd to clean up the PID file when it notices the babeld process has terminated.
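The "check for an existing PIDfile" behavior described above is what turns a leftover file into a startup failure. A minimal Python sketch of the pattern, assuming the daemon claims its pid file with exclusive-create semantics (this is an illustration, not babeld's C code, and `claim_pid_file` is a hypothetical name):

```python
import os


def claim_pid_file(path: str) -> bool:
    """Try to create the pid file exclusively and write our PID into it.

    Returns False when the file already exists. That happens both when
    another instance is running and when a crashed instance left it behind,
    and the two cases are indistinguishable from the file's existence alone.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"{os.getpid()}\n")
    return True
```

If the process dies without running its cleanup path, the file stays behind and every later `claim_pid_file` call on the same path returns False until something else removes it, which is why having an outside party (a monitor script or the service manager) own the cleanup is attractive.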
@paidforby did you test
@bennlich makes perfect sense, thanks for the explanation!
Yes, this makes sense. With
To be clear,
Glad to hear that systemd is doing the cleanup by including the pid reference in .service file. Also, am pretty excited about the relay style handling of the issue. Re-closing issue.
HE and psychz successfully patched and tested with
@paidforby babeld should start automatically. Did you follow the specific reboot instructions on the "old" exit node?
@jhpoelen yes, but it happened on both the "old" and the "new" exit node, not sure if it's related to this patch, here's what I saw after rebooting both of them,
For some reason, a specific node successfully creates a tunneldigger interface, but the exit node fails to create a route for the node. This is why the client's pings to 100.64.0.42 go unanswered; the client then drops the tunnel and tries to reconnect all over again. As far as I can tell, this only happens with one specific node. The mesh IP address of the home node can be shared privately if needed.
Please see attached log file.
tcpdumpL2TP-2018-03-13.txt