Packet flooding and high CPU usage #779

Open
darkain opened this issue Jun 5, 2018 · 93 comments
Labels
BSD (BSD-related issue), Central & Network Management (ZeroTier Central & networking management), Status: Backlog (Older issues that are awaiting resolution), Windows (Windows-related issue)

Comments

@darkain
Contributor

darkain commented Jun 5, 2018

I'm trying to create a basic ZT path between two buildings. Each building has an OPNsense 18.1.9 edge router with the ZT 1.2.8 plugin installed.

Building A: LAN 192.168.2.0/24 - ZT 192.168.5.2
Building B: LAN 192.168.3.0/24 - ZT 192.168.5.3
ZT: 192.168.5.0/24
3 ZT managed routes: one for the ZT network, and one for each of the building LANs, with their respective ZT IPs listed as their gateways.

The two OPNsense nodes are the only nodes in the ZT network. Both have bridging enabled, and auto-assign IP disabled. Flow rules in my.zt are all default. Network is idle other than a Windows box on one building's LAN pinging a Windows box on the other building's LAN (less than 2KiB/sec)

ZT is generating a MASSIVE amount of packets that is spiking the CPU to 100% regularly, yet the packets never go anywhere, and they're not generated from any of the nodes on either network. When this CPU spike happens, all connectivity over ZT is entirely dropped.

Reference: https://drive.google.com/file/d/1NIkdnilV0HSXuytMPn3zHzragyAcEa33/view?usp=sharing

You can see in the screenshot from the OPNsense interface stats that ZT has generated over 600GiB of traffic in total, yet WAN has only transferred around 35GiB and LAN only 21GiB. These stats cover roughly a 24-hour period.

Nothing is matching the ZT network at all in pfTop or Firewall log, so at this point I'm not sure where next to investigate this particular issue?

@adamierymenko
Contributor

That is indeed extremely strange.

Can you try building one side with ZT_TRACE enabled? make ZT_TRACE=1

@adamierymenko
Contributor

Also what are these packets? What happens if you tcpdump the zt interface?

@darkain
Contributor Author

darkain commented Jun 5, 2018

Well crap, found the issue actually. It is a dup of #759

So basically, zt-one needs to be aware of managed routes and not attempt to connect over them at all. That would solve this issue. ZT kept bouncing between the remote router's WAN address and its private LAN address (which stops being accessible as soon as the route to it disappears, because ZT literally just broke it).

@darkain darkain closed this as completed Jun 5, 2018
@adamierymenko
Contributor

Yes, that would be it. I've heard this phenomenon called a "software laser." :)

@darkain
Contributor Author

darkain commented Dec 21, 2018

Re-opening, because it is still bugged.

More specific details in my particular case.

I have two OPNsense nodes, both with ZeroTier. They each have static routes pointing to each other's LANs so two different buildings can fully cross-communicate. Static routes have been tried manually, as "managed routes" in the my.zerotier interface, and through OSPF (none of these three make a difference, bug exists regardless of how they're set)

ZeroTier attempts all available IP addresses to find ZeroTier peers. The problem is that this ALSO includes the ZeroTier private IP addresses and LAN addresses. ZeroTier attempts to communicate with the remote ZeroTier instance over ZeroTier itself because it sees access to the remote node's LAN address through the static routes. As soon as this connection is made, the WAN IP address connection is disabled. Because of this, the remote LAN address is no longer available, and the ZeroTier connection is broken. At this stage, the static route is also unreachable, so ZeroTier reverts back to the proper WAN address and re-establishes the connection. This flapping back and forth between WAN and LAN addresses creates an entirely unstable connection while also flooding packets. In the past 12 hours, this has consumed 1TB of bandwidth just attempting to re-establish connections. If I were not already on an unmetered internet connection, this could literally be costing me hundreds of dollars a day in bandwidth.

Example:

ZT-A > WAN > ZT-B (working)
ZT-A > ZT > ZT-B > LAN (seeing the LAN address as available)
ZT link switches from WAN address to LAN address
ZT link breaks
ZT re-establishes on WAN address

This process repeats over and over again generating a massive amount of packets flooding the system and chewing away at CPU cycles in the process as well.
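
To make the loop concrete, here is a minimal sketch of the kind of check that would break it: before accepting a candidate path, test whether the candidate address sits inside a subnet that is itself routed over ZeroTier. The subnets are the ones from the original report. The helper name `path_rides_on_overlay`, the WAN address 203.0.113.10 (a documentation placeholder), and the router LAN address 192.168.3.1 are assumptions for illustration and do not come from ZeroTier's code.

```python
import ipaddress

def path_rides_on_overlay(candidate_ip, overlay_routes):
    """Return True if candidate_ip falls inside any subnet routed over the overlay."""
    addr = ipaddress.ip_address(candidate_ip)
    return any(addr in ipaddress.ip_network(route) for route in overlay_routes)

# Routes from the original report: the ZT subnet plus both building LANs.
OVERLAY_ROUTES = ["192.168.5.0/24", "192.168.2.0/24", "192.168.3.0/24"]

# Candidate paths to the remote node. The WAN and LAN addresses are hypothetical.
for candidate in ["203.0.113.10", "192.168.3.1", "192.168.5.3"]:
    if path_rides_on_overlay(candidate, OVERLAY_ROUTES):
        print(f"{candidate}: reject, only reachable through ZT itself")
    else:
        print(f"{candidate}: ok, ordinary physical path")
```

Only the WAN candidate survives the check; the other two are exactly the addresses that cause the flapping described above.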

Up until yesterday, "drop dport 9993;" worked by setting it in the my.zerotier interface. This prevented the ZT communication packets from transferring over the ZT interface, stabilizing the connection. No idea what changed, but this no longer functions. Prior to this, I was using a local.conf file on every single node specifying which addresses it was not allowed to connect to, but this defeats half the point of ZeroTier being a centralized management interface. This also becomes a huge pain as new routers/buildings are onboarded, every single other router in the network needs to have its configuration updated to be made aware to not allow LAN addresses from the new router. We switched from IPsec to ZeroTier+OSPF specifically for centralized and automated configuration, just to be put back where we were in the first place.

Config for individual node (note: each time a new building is added, it must be added to ALL other routers)
{"physical": {"192.168.1.0/24":{"blacklist":true}}}
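
Since every router needs the same list, one way to reduce the manual work is to generate the file from a single list of LAN subnets. The following is only a sketch: the function name `build_local_conf` and the subnet list are assumptions, and distributing the result to each router is left out.

```python
import ipaddress
import json

def build_local_conf(lan_subnets):
    """Build a local.conf structure that blacklists each LAN subnet as a physical path."""
    physical = {}
    for subnet in lan_subnets:
        # ip_network() raises ValueError on typos such as host bits set in the prefix.
        normalized = str(ipaddress.ip_network(subnet))
        physical[normalized] = {"blacklist": True}
    return {"physical": physical}

# One entry per building LAN; extend this list when a new site is onboarded.
building_lans = ["192.168.2.0/24", "192.168.3.0/24"]
print(json.dumps(build_local_conf(building_lans), indent=2))
```

The output has the same shape as the one-line config above, with one `blacklist` entry per subnet.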

@darkain darkain reopened this Dec 21, 2018
@laduke
Contributor

laduke commented Dec 21, 2018

#if defined(__linux__) || defined(linux) || defined(__LINUX__) || defined(__linux)

Do we need one of these sections for BSDs?

@darkain
Contributor Author

darkain commented Dec 21, 2018

The issue at hand is not about binding to a particular interface. In this case, it is binding to WAN and LAN interfaces. The issue is that as soon as LAN subnets are bridged between two different locations (via Managed Routes or otherwise), the two ZT nodes will attempt to communicate with each other via the LAN addresses instead of the WAN addresses. The LAN addresses should still be bound for local nodes.

Instead, I think ZT traffic should be flagged and filtered out from being allowed to be passed over a ZT tunnel. Is there ever a case when a ZT network should be encapsulated inside of another ZT network?

@glimberg
Contributor

glimberg commented Dec 22, 2018

Is there ever a case when a ZT network should be encapsulated inside of another ZT network?

Yes there is. For instance, Google Kubernetes Engine only has link-local IPv6 addresses on its Kubernetes nodes, so we use a ZeroTier network to pipe in a routable /64 to Kubernetes. This is controller traffic, but it's still ZeroTier packets encapsulated in a ZeroTier network.

@chacal

chacal commented Dec 29, 2018

I seem to have this same problem. Instead of ZT traffic going over itself using the IPv4 address of the remote LAN, my problem seems to be caused by an IPv6 address propagated over the ZT link to the remote site.

After setting up ZT between two LANs everything usually works fine for some time, but eventually it ends up in the same state described here earlier: the address of the peer fluctuates between the proper public IPv4 address and a private IPv6 address that was propagated to the other side of the ZT link via IPv6 RA. When the peer listing shows this private IPv6 address as the peer's active address, CPU usage hits 100%, the connection breaks and a huge volume of traffic is generated. Strangely, the generated traffic has the same IPv6 address as both source and destination (the address of the remote peer).

The problem in my situation is that even blacklisting the IPv6 network in local.conf doesn't solve it. :/

Any ideas that might help here?

@chacal

chacal commented Jan 4, 2019

It seems that I was able to fix my problem by adding the bridge interface on the remote LAN end to interfacePrefixBlacklist on local.conf. Now the IPv6 address still propagates there properly, but it doesn't seem to be used for ZT traffic anymore and thus sending ZT traffic "over itself" is avoided.

@darkain
Contributor Author

darkain commented Jan 29, 2019

I've switched to trying the same for now to see how it goes. I have the following local.conf that I'm starting to test as of today:

{ "settings": { "interfacePrefixBlacklist": ["zte"], "allowTcpFallbackRelay": false } }

Right now I'm trying to create a standardized configuration for easier deployment in multiple data centers. I plan on doing a full write up of basically an autonomous multi-network routing system using ZeroTier, essentially a private virtualized internet on top of the internet itself. Hopefully with this simple config, I can now have ZT entirely stable and focus on the other services on top of it!

@cferrey

cferrey commented Feb 5, 2019

I'm also having this issue over a bridged setup, and adding "drop dport 9993;" to my flow rules also helped for a few days but no longer works. I'm planning to try the above blacklisting method. Can anyone advise as to where my local.conf file would live on Raspbian/Debian, or where I should create it? I'm pretty new to Linux, and Googling interestingly hasn't helped answer this seemingly straightforward question. Thanks!

@chacal

chacal commented Feb 5, 2019

Here's some information about the local.conf file: https://github.com/zerotier/ZeroTierOne/tree/master/service

On Debian it should be placed in /var/lib/zerotier/local.conf (assuming you have installed ZeroTier from the prebuilt .deb package).

@cferrey

cferrey commented Feb 5, 2019

Here's some information about the local.conf file: https://github.com/zerotier/ZeroTierOne/tree/master/service

On Debian it should be placed in /var/lib/zerotier/local.conf (assuming you have installed ZeroTier from the prebuilt .deb package).

Thank you very much -- I did not have that file, but created it with sudo nano and added this single line:

{ "settings": { "interfacePrefixBlacklist": ["br0"], "allowTcpFallbackRelay": false } }

Unfortunately, this didn't work. I also tried adding my ZT interface instead of the br0 interface, but no luck. Do you have any thoughts on what I'm doing wrong? My br0 interface bridges the ZT and eth0 interfaces, and br0 receives a static IP while eth0 has no IP assigned.

Edit: I also added a physical route blacklist for the common subnet being used on my ZT network and at both remote LANs in my L2 bridged setup. This also did not work. My full local.conf file is below -- hoping someone can point out any issues.

{ "physical": { "10.0.0.0/16": { "blacklist": true } }, "settings": { "interfacePrefixBlacklist": [“br0"], "allowTcpFallbackRelay": false } }

@chacal

chacal commented Feb 5, 2019

Don't know about your specific setup, but for me blacklisting using IP address helped. My local.conf (with IP address obfuscated):

{
  "physical": {
    "2001:2003:xxxx:xxxx::/56": {
      "blacklist": true
    }
  }
}

The mentioned IPv6 network is the one that is propagated to the remote site using IPv6 router advertisements.

@darkain
Contributor Author

darkain commented Feb 5, 2019

As an update, the interface blacklist didn't work. Also, I now know why the flow rules for 9993 don't work, but that'll be a separate issue.

@cferrey

cferrey commented Feb 6, 2019

Don't know about your specific setup, but for me blacklisting using IP address helped. My local.conf (with IP address obfuscated):

{
  "physical": {
    "2001:2003:xxxx:xxxx::/56": {
      "blacklist": true
    }
  }
}

The mentioned IPv6 network is the one that is propagated to the remote site using IPv6 router advertisements.

Unfortunately I've hit a dead end here. I have no IPv6 addresses in my setup, as I am assigning IPv4 addresses to the bridging devices manually through ZT Central. I don't see any IPv6 addresses when I do listpeers on the bridging devices, so I'm at a loss as to what else to try.

I'll keep an eye on #915. Hope that can be resolved and that it'll fix all these issues, as setting a single flow rule seems much more scalable than editing configs on all ZT clients.

@StrikerTwo

This (and #759) is still broken, if anybody is interested :(
I still have ZT nodes with their INTERNAL ip address in my peer list.

4ccda4xxxx 1.4.6  LEAF      94 DIRECT 0        16406    10.4.0.2/9993

This IP can only be reached over the ZT tunnel itself. Zerotier tries to do just that, using one CPU core to 100% and sending millions of packets that never go anywhere, until something resets and it goes back to normal. The annoying thing is that this causes connections to all other nodes to drop or at least go bad, because the CPU usage causes a general latency spike (up to 1500 ms, then pings time out).

Why does Zerotier not blacklist all ZT interfaces and all internal routes internally as default? Is there any use case for allowing ZT connections over a ZT tunnel?

@StrikerTwo

And this is what a packet spike looks like:

Ethernet Type: IPv4
IP Protocol: UDP
Source Address: 10.0.0.3
Destination Address: 10.4.0.2
Source Port: 28053
Destination Port: 9993
Packets Count: 2.480.905
Total Packets Size: 1.525.624.328
Total Data Size: 1.456.158.988
Data Speed / Maximum Data Speed (only one value present): 10723.3 KiB/Sec
Average Packet Size: 614.9
Maximum Packet Size: 1460
First Packet Time: 09.01.2020 11:45:23
Last Packet Time: 09.01.2020 11:50:21
Duration: 00:04:57.213
Process ID: 1992
Process Filename: zerotier-one_x64.exe
TCP Ack / Push / Reset / Syn / Fin: 0 / 0 / 0 / 0 / 0
TTL: 127
(No values for Service Name, Status, Latency, Maximum Segment Size, TCP Window Size, TCP Window Scale, Source Country, Destination Country.)

@laduke
Contributor

laduke commented Jan 9, 2020

@StrikerTwo are you on a BSD?

@StrikerTwo

Nope, Windows Server on both sides (2012 R2 / 2016)

@laduke
Contributor

laduke commented Jan 10, 2020

Heh looks like it doesn't avoid binding 'zt' interfaces on windows either, but I dunno

@rexxfan

rexxfan commented Jan 22, 2020

FWIW this issue is affecting me as well. I have a very simple zt network, defined with all defaults. Nothing was customized. I have one Windows 10 PC on the LAN running zt, and one PC with same on another LAN. I use zt for remote access using RDP. Every few days or so the whole LAN grinds to a halt for about a minute and then mysteriously clears up. I traced one such incident using wireshark and there's millions of packets flowing over the LAN heading towards zt nodes. I uninstalled zt and the problem went away. This is a shame. It is such a great product, but this is a fatal flaw.

@janjaapbos
Contributor

Are you sure this is not valid traffic generated by the windows 10 pc's? E.g. windows update traffic between them? See https://www.digitalcitizen.life/how-set-windows-10-get-updates-local-network-internet

@darkain
Contributor Author

darkain commented Aug 14, 2021

It's not exactly what I usually like to do, but is there anything on the horizon for this ?
(assuming one wants to have access between the routers, so some solutions do not seem exactly ideal)

The "work around" is just a few comments above in this thread: #779 (comment)

@danmanners

In my particular case, it is caused by ZeroTier trying to route through flannel over ZeroTier. This is not a ZeroTier bug. My solution was "interfacePrefixBlacklist": [ "flannel", "cni" ].

I'm actually trying to route flannel over ZeroTier for remote k3s hosts, so blacklisting the Flannel/CNI doesn't work for me.

@glimberg
Contributor

glimberg commented Aug 16, 2021

I'm actually trying to route flannel over ZeroTier for remote k3s hosts, so blacklisting the Flannel/CNI doesn't work for me.

Blacklisting the flannel/CNI interface is what you want, then. This prevents ZeroTier from using the flannel interface to transport packets. It does not prevent you from routing flannel packets over ZeroTier.
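
The distinction can be sketched as a simple filter over the interfaces ZeroTier may use as transport. This assumes the setting matches interface names by leading prefix, as its name suggests; the function name and the interface names are hypothetical, and this is not ZeroTier's own implementation.

```python
def usable_transport_interfaces(interfaces, prefix_blacklist):
    """Keep only interfaces whose names start with none of the blacklisted prefixes."""
    return [
        name
        for name in interfaces
        if not any(name.startswith(prefix) for prefix in prefix_blacklist)
    ]

# Hypothetical interface names on a k3s host.
host_interfaces = ["eth0", "flannel.1", "cni0"]
print(usable_transport_interfaces(host_interfaces, ["flannel", "cni"]))
```

Only `eth0` remains as a transport for ZeroTier's own packets. Nothing in this filter touches what may be routed over the ZeroTier interface, so flannel traffic can still travel across it.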

@danmanners

Holy hell, that totally fixed it for me in both Azure and Google Cloud. Having a hell of a time getting things working in an automated way now, and it totally doesn't work on a reboot by default, but things are operational and stable after manually making all of these changes.

/var/lib/zerotier-one/local.conf:

{
  "settings": {
    "interfacePrefixBlacklist": [ "flannel", "cni" ]
  }
}

/etc/systemd/network/01-zerotier.network:

[Match]
# Actual Network Interface Name here
Name=zt...

[Link]
# If this isn't here, ZeroTier will never pull an IP address
Unmanaged=yes

[Network]
DHCP=yes
UseDNS=true
# Remote DNS server
DNS=10.45.0.1

@laduke
Contributor

laduke commented Nov 3, 2021

Can anyone give me some tips on reproducing this easily? On digital ocean or vultr or something simple like that. I have a couple opnsense vms running on my workstation and nothing is happening yet.

@darkain
Contributor Author

darkain commented Nov 3, 2021

Can anyone give me some tips on reproducing this easily? On digital ocean or vultr or something simple like that. I have a couple opnsense vms running on my workstation and nothing is happening yet.

There needs to be an available IP route inside of the tunnel that ZT is listening on, where ZT flaps between the normal route and its own network, which then breaks its own network and reverts back to the other network. It's a route flapping issue. ZT alone on OPNsense won't cause it. BUT, if you have two OPNsense boxes (or any routers) with ZT on them, and have routable private networks between those two routers, ZT will attempt to use that private network instead of the public internet, and thus breaks itself.

@laduke
Contributor

laduke commented Nov 4, 2021

Right. thanks!

So... This doesn't happen on linux.
I set up 2 freebsd and 2 linux VPSs with private networks behind them and added managed routes for all of them.

Visual aid: Screen Shot 2021-11-04 at 9 59 23 AM

As soon as the bsds start interacting, this happens:

{
  "address": "a80de2684f",
  "isBonded": false,
  "latency": 10,
  "paths": [
   {
    "active": true,
    "address": "207.246.96.60/9993",
    "expired": false,
    "lastReceive": 1636029873423,
    "lastSend": 1636029872458,
    "preferred": false,
    "trustedPathId": 0
   },
   {
    "active": true,
    "address": "10.5.96.3/9993",
    "expired": false,
    "lastReceive": 1636029873423,
    "lastSend": 1636029880943,
    "preferred": true,
    "trustedPathId": 0
   }
  ],
  "role": "LEAF",
  "version": "1.6.6",
  "versionMajor": 1,
  "versionMinor": 6,
  "versionRev": 6
 }

and here's a80de2684f's ifconfig

vtnet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 56:00:03:a8:7a:80
        inet 207.246.96.60 netmask 0xfffffe00 broadcast 207.246.97.255
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
vtnet1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 5a:00:03:a8:7a:80
        inet 10.5.96.3 netmask 0xff000000 broadcast 255.255.240.0
        media: Ethernet 10Gbase-T <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
zt1b694qoob1o2i: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 5000 mtu 2800
        options=80000<LINKSTATE>
        ether 32:48:bd:d2:be:6b
        hwaddr 58:9c:fc:10:ff:c9
        inet6 fe80::5a9c:fcff:fe10:ffc9%zt1b694qoob1o2i prefixlen 64 scopeid 0x4
        inet 10.241.201.101 netmask 0xffff0000 broadcast 10.241.255.255
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        Opened by PID 4060
#

10.5.96.3 is a physical address, but it's only reachable over zerotier.
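
A peer entry like the one above can be screened automatically. The sketch below flags active paths whose address lies inside a subnet that is routed over ZeroTier. The peer data is trimmed from the JSON above; the route 10.5.96.0/20 is an assumption (the managed routes for this test were not posted), and `suspicious_paths` is a hypothetical helper name.

```python
import ipaddress

def suspicious_paths(peer, overlay_routes):
    """Return active path addresses that fall inside a subnet routed over the overlay."""
    networks = [ipaddress.ip_network(route) for route in overlay_routes]
    flagged = []
    for path in peer.get("paths", []):
        ip_text = path["address"].rsplit("/", 1)[0]  # drop the "/port" suffix
        ip = ipaddress.ip_address(ip_text)
        if path.get("active") and any(ip in net for net in networks):
            flagged.append(path["address"])
    return flagged

# Trimmed copy of the peer entry shown above.
PEER = {
    "address": "a80de2684f",
    "paths": [
        {"active": True, "address": "207.246.96.60/9993", "preferred": False},
        {"active": True, "address": "10.5.96.3/9993", "preferred": True},
    ],
}

# Assumed managed route covering the private network behind the peer.
print(suspicious_paths(PEER, ["10.5.96.0/20"]))
```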

This hasn't happened on the debian VMs yet after like 12 hours.

Any Unix folks want to take a look too? Maybe around here

In the meantime, I'm going to go try to get a note added to the opnsense zerotier docs about the local.conf blacklist trick.

@joseph-henry
Contributor

I haven't fully read up on this thread but based on a cursory glance I'd take a look at this function:

bool shouldBindInterface(const char* ifname, const InetAddress& ifaddr)

And this block:

// Make sure we're not trying to do ZeroTier-over-ZeroTier

@joseph-henry
Contributor

Spent a little time catching up on this and something came to mind: vanilla ZT will select a path based on its scope. This scope combined with an address family check will determine its preference rank.

See:

inline unsigned int preferenceRank() const

Scopes are defined here:

InetAddress::IpScope InetAddress::ipScope() const

As a result ZT will prefer to switch from a working global path to a working private path. If this private path is part of a managed route and we didn't check for overlap:

OneService.cpp:

		/* Note: I do not think we need to scan for overlap with managed routes
		 * because of the "route forking" and interface binding that we do. This
		 * ensures (we hope) that ZeroTier traffic will still take the physical
		 * path even if its managed routes override this for other traffic. Will
		 * revisit if we see recursion problems. */

We could find ourselves in a non-working and eventually flappalicious state.
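
A toy model of that preference may help show why the switch happens. It is deliberately simplified: the rank values are invented, the address family check is ignored, and the real `preferenceRank()` and `ipScope()` in ZeroTier are more detailed. The only property carried over from the description above is that a private-scope path outranks a global one.

```python
import ipaddress

def toy_preference_rank(ip_text):
    """Toy ranking (invented values): private-scope addresses outrank global ones."""
    ip = ipaddress.ip_address(ip_text)
    return 2 if ip.is_private else 1

def pick_path(candidates):
    """Pick the candidate address with the highest toy rank."""
    return max(candidates, key=toy_preference_rank)

# The two paths from the peer listing earlier in the thread.
print(pick_path(["207.246.96.60", "10.5.96.3"]))
```

With both paths apparently working, the private address wins, even though it is only reachable through the tunnel that the global path is carrying.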

@darkain
Contributor Author

darkain commented Nov 12, 2021

In addition, it may NOT be a "managed" path at all, but a similar ZT path. For instance, RIP, OSPF, BGP, or a similar routing protocol running on top of ZeroTier to manage more complex routes automatically. In my particular case, I'm using OSPF with redundant routers for each route. So we'd need some way to explicitly tell ZT to not listen on specific routes, which currently for me is to blacklist the entire public IP space in use for all routes. But this also means that redundant routers cannot create direct peer-to-peer links on the same network, so I have a separate network just to handle those links. It gets complicated quite quickly!

laduke added a commit that referenced this issue Dec 17, 2021
Adds some temporary debug output
And tries to reject zerotier over zerotier paths
via nodePathCheckFunction

I've never seen:
fprintf(stderr, "    HERE2: local: %s remote %s \n", buf1, buf2);
get printed. This check has been here since forever.

on freebsd, sometimes you'll see:
"a zt managed target [10.12.0.0] contains this remote path [10.147.17.2], so"
but mostly
"a zt managed target [10.11.0.0/24] contains this remote path [10.11.0.1], so"

On mac and linux, I've only seen
"a zt managed target [10.11.0.0/24] contains this remote path [10.11.0.1], so"

This change is probably incorrect and in the wrong level of the system,
but it's:
- stopping the problem on freebsd
- i haven't found it breaking anything yet

What is the problem?
see:
#779
start at the bottom

----

network is like this:

10.147.17.0/24 (LAN)
10.11.0.0/24 via 10.147.17.1 bsd
10.12.0.0/24 via 10.147.17.2 bsd
10.13.0.0/24 via 10.147.17.3 linux
192.168.192.0/24 via 10.147.17.192 mac

there's nothing on the subnets except dummy interfaces/addresses on the node itself

ip -o -4 a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: ens18    inet 192.168.82.144/24 brd 192.168.82.255 scope global dynamic ens18\       valid_lft 20176sec preferred_lft 20176sec
3: ens19    inet 10.13.0.1/24 brd 10.13.0.255 scope global ens19\       valid_lft forever preferred_lft forever
8: zt5u4uptmb    inet 10.147.17.3/24 brd 10.147.17.255 scope global zt5u4uptmb\       valid_lft forever preferred_lft forever

ifconfig feth16 create
ifconfig feth16 192.168.192.1 netmask 255.255.255.0 up
@laduke
Contributor

laduke commented Jan 4, 2022

Was looking at this again.

#ifdef __LINUX__
	// Bind Linux sockets to their device so routes that we manage do not override physical routes (wish all platforms had this!)
	if (ii->second.length() > 0) {
		char tmp[256];
		Utils::scopy(tmp, sizeof(tmp), ii->second.c_str());
		int fd = (int)Phy<PHY_HANDLER_TYPE>::getDescriptor(udps);
		if (fd >= 0)
			setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, tmp, strlen(tmp));
		fd = (int)Phy<PHY_HANDLER_TYPE>::getDescriptor(tcps);
		if (fd >= 0)
			setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, tmp, strlen(tmp));
	}
#endif // __LINUX__

I tested commenting out this linux only code and it reproduced the packet flooding and high cpu usage.

There's no SO_BINDTODEVICE api in freebsd though.

Alternatively, it seems like one could do the equivalent of route get 10.11.0.1 and if it's on a zerotier interface, don't use it.
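
The route lookup idea can be modeled as a longest-prefix match followed by a check on the outgoing interface name. This is a sketch over an in-memory table, not a call into the OS routing table. The table loosely follows the test network from the commit message above; the default route on ens18 and the function names are assumptions.

```python
import ipaddress

def lookup_interface(dest_ip, routing_table):
    """Longest-prefix match: return the interface of the most specific matching route."""
    dest = ipaddress.ip_address(dest_ip)
    best = None
    for prefix, ifname in routing_table:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, ifname)
    return best[1] if best else None

def is_safe_physical_path(dest_ip, routing_table, overlay_prefix="zt"):
    """A path is safe only if the route to it does not leave through a ZeroTier interface."""
    ifname = lookup_interface(dest_ip, routing_table)
    return ifname is not None and not ifname.startswith(overlay_prefix)

ROUTES = [
    ("0.0.0.0/0", "ens18"),           # assumed default route
    ("192.168.82.0/24", "ens18"),
    ("10.13.0.0/24", "ens19"),
    ("10.147.17.0/24", "zt5u4uptmb"),
    ("10.11.0.0/24", "zt5u4uptmb"),   # managed route via 10.147.17.1
]

print(is_safe_physical_path("10.11.0.1", ROUTES))
print(is_safe_physical_path("192.168.82.1", ROUTES))
```

The first lookup resolves to the ZeroTier interface and is rejected; the second leaves through a physical interface and is accepted.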

In addition, it may NOT be a "managed" path ... So we'd need some way to explicitly tell ZT to not listen on specific routes

I can't currently imagine a way to get this automatically. But it would be nice to prevent this problem for the basic common case.

@michmoor0725

So what exactly was done on the opnsense side? create a rule that blocks what exactly?

@darkain
Contributor Author

darkain commented Jan 13, 2022

So what exactly was done on the opnsense side? create a rule that blocks what exactly?

The bug isn't in OPNsense. But you can create a ZeroTier local rule to handle this.

This is what I came up with:

The workaround for this issue is to block ZeroTier from routing ZeroTier packets over itself. I do this by blocking ZeroTier from listening on OPNsense LAN subnets.

In my particular case, I have 192.168.1.0/24 on one OPNsense router, and 192.168.2.0/24 on another router. I set up a simple range that covers both, plus more (for future expansion). I personally opted for blocking the entirety of 192.168.0.0/16.

This is a standard configuration that I deploy on every OPNsense node in my router mesh.

{
	"physical": {
		"192.168.0.0/16": { "blacklist": true }
	}
}

@mgiammarco

mgiammarco commented May 2, 2022

I have the same problem: opnsense+zerotier+ospf (frr)
Using

{ "settings": { "interfacePrefixBlacklist": [“br0"], "allowTcpFallbackRelay": false } }
Works only for a few hours.
If I use (because I route many prefixes of 10.0.0.0):

{ "physical": { "10.0.0.0/8": { "blacklist": true } }, "settings": { "interfacePrefixBlacklist": [“br0"], "allowTcpFallbackRelay": false } }
The entire zerotier stops working (and this is very strange to me).

@mgiammarco

I am still debugging. I have seen that if I disable ospf and I build static routes (obviously I would like to avoid this workaround) among my three OPNSense the problem disappears.
I repeat that in my case the interfacePrefixBlackList does not work.

@laduke
Contributor

laduke commented May 9, 2022

Can you run zerotier-cli info -j on there to check the local.conf config is loading? (The above json is invalid from “br0" but that's probably just from pasting into here.)

If it's the latest version of zerotier, it'll also show what addresses it's listeningOn

@mgiammarco

{
 "address": "6c5e84b564",
 "clock": 1652340455022,
 "config": {
  "settings": {
   "allowTcpFallbackRelay": false,
   "interfacePrefixBlacklist": [
    "zt"
   ],
   "listeningOn": [
    "10.1.3.1/9993",
    "10.129.0.3/9993",
    "10.1.2.184/9993",
    "10.1.3.1/29994",
    "10.129.0.3/29994",
    "10.1.2.184/29994",
    "10.1.3.1/33408",
    "10.129.0.3/33408",
    "10.1.2.184/33408"
   ],
   "portMappingEnabled": true,
   "primaryPort": 9993,
   "secondaryPort": 0,
   "softwareUpdate": "disable",
   "softwareUpdateChannel": "release",
   "tertiaryPort": 0
  }
 },
 "online": true,
 "planetWorldId": 149604618,
 "planetWorldTimestamp": 1644592324813,
 "publicIdentity": "6c5e84b564:0:09ad0117bde4933eeabc5554daefea5813d827cb6bc98136ff672dc97478543572bfae51af6ae6afb2285722296d6f056e34d99a3f1a651a7b5e5fe3e303b673",
 "tcpFallbackActive": false,
 "version": "1.8.6",
 "versionBuild": 0,
 "versionMajor": 1,
 "versionMinor": 8,
 "versionRev": 6
}
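
Comparing this output against local.conf can be scripted. The sketch below takes the parsed output of `zerotier-cli info -j` and reports which settings from local.conf are not reflected in what the service reports. The JSON is trimmed from the output above; `unapplied_settings` is a hypothetical helper name, and running the CLI itself is left out.

```python
import json

def unapplied_settings(local_conf, info):
    """Return local.conf setting names whose values differ from what the service reports."""
    reported = info.get("config", {}).get("settings", {})
    wanted = local_conf.get("settings", {})
    return sorted(key for key, value in wanted.items() if reported.get(key) != value)

# Trimmed from the `zerotier-cli info -j` output above.
INFO_JSON = """
{
 "address": "6c5e84b564",
 "config": {
  "settings": {
   "allowTcpFallbackRelay": false,
   "interfacePrefixBlacklist": ["zt"],
   "primaryPort": 9993
  }
 },
 "version": "1.8.6"
}
"""

info = json.loads(INFO_JSON)
local_conf = {"settings": {"interfacePrefixBlacklist": ["zt"], "allowTcpFallbackRelay": False}}
print(unapplied_settings(local_conf, info))  # prints []
```

An empty list means every setting in local.conf shows up in the running service. It says nothing about whether those settings are enough to stop the flapping.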

@mgiammarco

10.1.2.184 is wan 10.129.0.3 is lan 10.1.3.1 is lan2

@darkain
Contributor Author

darkain commented May 13, 2022

"The entire zerotier stops working (and this is very strange to me)."

Right, if your WAN is within 10.0.0.0/8 and you blacklist 10.0.0.0/8, you're blacklisting your WAN address. This is a side effect of having a non-public WAN IP address. You'd have to do more fine-grained blocking of just the LAN addresses at that point.
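
A quick way to catch this before deploying a blacklist is to test it against the addresses the node reports in `listeningOn`. The addresses below come from the output earlier in the thread (10.1.2.184 is the WAN); the narrower /24 blacklist and the function name `blacklisted_listeners` are assumptions for illustration.

```python
import ipaddress

def blacklisted_listeners(listening_on, blacklist):
    """Return the listening addresses that fall inside any blacklisted subnet."""
    networks = [ipaddress.ip_network(cidr) for cidr in blacklist]
    hits = []
    for entry in listening_on:
        ip = ipaddress.ip_address(entry.rsplit("/", 1)[0])  # drop the "/port" suffix
        if any(ip in net for net in networks):
            hits.append(entry)
    return hits

LISTENING_ON = ["10.1.3.1/9993", "10.129.0.3/9993", "10.1.2.184/9993"]

# The broad blacklist also covers the WAN address 10.1.2.184.
print(blacklisted_listeners(LISTENING_ON, ["10.0.0.0/8"]))

# A narrower, LAN-only blacklist (hypothetical prefixes) leaves the WAN address usable.
print(blacklisted_listeners(LISTENING_ON, ["10.1.3.0/24", "10.129.0.0/24"]))
```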

@mgiammarco

"The entire zerotier stops working (and this is very strange to me)."

Right, if your WAN is within 10.0.0.0/8 and you blacklist 10.0.0.0/8, you're blacklisting your WAN address. This is a side effect of having a non-public WAN IP address. You'd have to do more fine-grained blocking of just the LAN addresses at that point.

Sorry I do not want to mislead you: I have a test framework with several OPNSense firewalls. Some are on real hardware, some other on virtual machines. It is unlucky that I have shown the only example with wan on 10.129.0.0. But also other firewalls with wan on 192.168.x.x or public ip stop communication.

Anyway the real problem is that interfaceprefixblacklist and tcpfallback are not enough.

@mgiammarco

I confirm that:

  1. { "settings": { "interfacePrefixBlacklist": [“zte"], "allowTcpFallbackRelay": false } } is correctly recognized at zerotier startup
  2. the settings above on my OPNSense setup do nothing: zerotier uses interfaces with zte prefix as gateways
    Tried with 1.8.9 too.
    What can I do to solve this problem that is a showstopper for me?

@vadonka

vadonka commented Aug 14, 2022

I have many OPNsense (FreeBSD-based) firewalls where I'm using ZeroTier.
This is my experience:
ZeroTier 1.8.6 or below works in every scenario.
ZeroTier 1.8.9 or 1.10.1 breaks everything, no matter whether I'm on FreeBSD 13.0 or 13.1.
All these firewalls are VMware VMs using the vmxnet adapter.
When I say break, it literally breaks the whole network! It generates an enormous amount of bogus traffic, and eventually the state table and the mbufs are exhausted. When the bogus traffic is generated, every member of the network gets those packets, at around 100Mbit/sec, which wipes out all legitimate communication on the network! Even more strange, this only happens after the 4th FreeBSD node is upgraded to 1.8.9 or 1.10.1. It also only happens when the FreeBSD nodes are connected to the same network.

I use this on all nodes:
{
  "physical": {
    "10.0.0.0/8": {
      "blacklist": true
    },
    "172.16.0.0/12": {
      "blacklist": true
    },
    "192.168.0.0/16": {
      "blacklist": true
    }
  }
}

This should prevent this behavior. It works as intended in version 1.8.6 and below, but is broken in every newer version!
So if you use multi-node, cross-routing scenarios, stay away from any newer version!
I reported this to ZeroTier because we are a paying user, but they don't know the reason yet.
It seems it's still broken.

@vadonka

vadonka commented Aug 14, 2022

You can downgrade zerotier on opnsense to 1.8.6 even in the new 22.7.x version like so:
curl https://pkg.opnsense.org/FreeBSD:13:amd64/22.1/MINT/22.1.6/OpenSSL/Latest/zerotier.txz -o /tmp/zerotier.txz
pkg add -f /tmp/zerotier.txz
pkg lock zerotier

The last command is needed because it locks the package and prevents it from being upgraded.
I don't see any other solution for now.
I'm still waiting for an answer from official ZeroTier support.

@laduke
Contributor

laduke commented Aug 18, 2022

sorry about that vadonka. still not sure what is causing that.


has anyone tried using multiple routing tables (fibs)? I just came across this old issue from a different search
#580 (comment)

Seems like you could start zerotier in a fib with only the needed routes set up and it won't see any other routes (that it creates).

@vadonka

vadonka commented Aug 18, 2022

I don't use any centralized routes. Even the address is assigned by hand to the interface. I even tried the bind feature so ZeroTier only listens on a specific IP. We have firewalls with multiple WAN IPs (virtual IPs), so I thought this was the cause, but no. If I'm using 1.8.6 or below there are no issues; once I upgrade to 1.8.9 or above, strange things start to happen. I have no clue what's going on. What changed between 1.8.6 and 1.8.9? Could any ZeroTier devs look into it? Something must have changed if it's causing this.

@darkain
Contributor Author

darkain commented Aug 19, 2022

@vadonka it sounds like you have a different issue, I'd suggest opening up a new fresh issue and reporting what issues you're seeing, any logs or errors, and what steps you've done to reproduce the issue and attempted to solve it.

@masx200

masx200 commented Mar 16, 2024

If you run VXLAN on ZeroTier, the traffic loops between the two virtual interfaces indefinitely. The packets are encapsulated over and over again, so they keep getting bigger.
I have an idea: check the size of each packet on the ZeroTier interface, and drop the packet if it exceeds an upper size limit.
The upper limit of the packet size could be set to a few kilobytes.
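
As a rough illustration of how such a size cap would end a loop, the sketch below counts how many re-encapsulations a packet survives before it would exceed the limit. It is a toy model: it counts only the 50 bytes of VXLAN-over-IPv4 overhead per pass, ignores ZeroTier's own header and any fragmentation, and the starting size and the 4096-byte cap are assumed values.

```python
def hops_until_dropped(initial_size, overhead_per_hop, size_limit):
    """Count the re-encapsulations a looping packet survives before exceeding size_limit."""
    if overhead_per_hop <= 0:
        raise ValueError("overhead_per_hop must be positive, or the loop never ends")
    size = initial_size
    hops = 0
    # Each pass through the loop wraps the packet once more and adds the overhead.
    while size + overhead_per_hop <= size_limit:
        size += overhead_per_hop
        hops += 1
    return hops

# 100-byte packet, 50 bytes of VXLAN overhead per pass, 4096-byte cap (assumed values).
print(hops_until_dropped(100, 50, 4096))  # prints 79
```

Under these assumptions the packet is dropped after 79 passes instead of circulating forever; a smaller cap would end the loop sooner.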
