Nebula not connecting to lighthouse on linux boot #372

Closed
brucealthompson opened this issue Feb 3, 2021 · 10 comments

Comments

@brucealthompson

I am running a reasonably large Nebula network using Debian clients and lighthouse. I have noticed that my Nebula clients are not able to connect to the lighthouse when Nebula is started after initial boot. However, if I restart Nebula after the initial failure, the client connects to the lighthouse with no errors or issues. Here is the error I get from Nebula when I start it after initial boot:
level=error msg="Lighthouse unreachable" error="Lighthouse 10.217.0.1 does not have a static_host_map entry."

I have attached my config.yml which does include a static_host_map entry for the lighthouse.
config.yml.txt

Nebula Version: 1.3.0

@nbrownus
Collaborator

nbrownus commented Feb 3, 2021

Sounds like you may be starting nebula before the network is up.

Which init system are you using?

@brucealthompson
Author

> Which init system are you using?

systemd

I have attached my systemd service file.
nebula.service.txt

@dcwynar

dcwynar commented Feb 7, 2021

I had the exact same problem.
Solved it by changing the [Unit] section of the service file to:

[Unit]
Description=nebula
After=network-online.target
Wants=network-online.target

If that doesn't help, try adding an ExecStartPre line, like this:

[Unit]
Description=nebula
Wants=basic.target
After=basic.target network.target
[Service]
SyslogIdentifier=nebula
StandardOutput=syslog
StandardError=syslog
ExecReload=/bin/kill -HUP $MAINPID
ExecStartPre=/bin/sh -c 'until ping -c1 [your-nebula-lighthouse-public-host-or-ip]; do sleep 1; done;'
ExecStart=/usr/bin/nebula -config /etc/nebula/config.yml
Restart=always
[Install]
WantedBy=multi-user.target

@jimpea21

I have had this as well and was able to repro easily on 2 Linux computers (Ubuntu 16.04 server, Deepin 15.11) plus a Windows computer (Win10 Pro 20H2).

Tested with Nebula 1.3.0.

Reproduction on Linux as follows:

  1. Set up Nebula and specify the lighthouse in config.yml by FQDN
  2. Install using the systemd script
  3. Restart the computer - Nebula will fail to start with the lighthouse unreachable error on the service log

Windows encounters the same issue when installed as a service, when using an FQDN for the lighthouse, on system startup.

It is also possible to replicate this by starting Nebula manually with the lighthouse as an FQDN when the network is disconnected (assuming no locally installed DNS, tested on both Linux hosts and the Windows host).

My guess for the startup issue is that DNS resolution fails because the service starts so early on both operating systems. For some reason, Nebula never re-checks whether the lighthouse is alive; it just gets marked as bad (at least, I waited 10 minutes and no retry was attempted).

My solution is to code the lighthouse IP directly in the config, and this works fine in both test cases (startup and no network). I'd rather be able to use an FQDN though, as it means no config updates if we have to change a lighthouse IP for some reason.

@dcwynar I like your ExecStartPre idea. I'm not sure how we would achieve the same thing on Windows, though. I guess I could write a wrapper script, but that seems like more trouble than it's worth.

@nbrownus Perhaps if DNS resolution of a lighthouse fails on startup, it could be reattempted (for example) on a binary exponential backoff schedule starting with 5 seconds? The current method of marking it as dead on the first attempt seems overly aggressive.

@wildardoc

wildardoc commented Feb 10, 2021 via email

@jamescorbett

jamescorbett commented Apr 14, 2021

I also have this issue. Retrying in nebula makes the most sense to me.

@nbrownus
Collaborator

image

Jokes aside, we should support re-querying these names since we support DNS names. We will give this some brain time after we cut the v1.4.0 release.

@wildardoc

wildardoc commented Apr 15, 2021 via email

@johnmaguire
Collaborator

@brucealthompson @dcwynar @jimpea21 @wildardoc

If any of you are still experiencing this issue could you try adding a Wants=nss-lookup.target line to your nebula systemd unit file? Thanks!
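
For reference, a minimal sketch of a [Unit] section with that line added, combined with the network-online.target change suggested earlier in this thread. The After= entry for nss-lookup.target is an assumption rather than part of the request above: in systemd, Wants= only pulls a unit in and does not by itself delay startup, so an After= is what actually orders nebula behind name resolution.

```ini
[Unit]
Description=nebula
# Pull in the "network is up" and "name lookups work" synchronization points.
Wants=network-online.target nss-lookup.target
# Assumption: also order nebula after them; Wants= alone does not delay startup.
After=network-online.target nss-lookup.target
```

After editing the unit file, run `systemctl daemon-reload` and reboot to check whether the lighthouse is reached on first start.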

For Windows, there's a similar solution mentioned here: #176 (comment)

@johnmaguire
Collaborator

Hi all - I'm closing this issue out as stale. We believe that #791 should solve the race by ensuring that the DNS server is available before Nebula boots.

Additionally, #796 is released and working in v1.7.1 and should re-query for DNS even if the initial query for DNS fails. By default, we will re-query on a 30s cadence, but this can be configured via static_map.cadence.
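
As a rough illustration, the relevant parts of config.yml might look like the following. The hostname `lighthouse.example.com` and port `4242` are placeholders, not values from this thread; `10.217.0.1` is the lighthouse address from the original report.

```yaml
static_host_map:
  # Placeholder FQDN and port; substitute your lighthouse's public name.
  "10.217.0.1": ["lighthouse.example.com:4242"]

static_map:
  # How often DNS names in static_host_map are re-queried (30s is the stated default).
  cadence: 30s
```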

Please let us know if you continue to experience issues!


7 participants