openthread: bind IP is randomly selected causing off-mesh communication problems #12203
Comments
@rlubos @tbursztyka @jukkar In my head, I could see solving this by allowing L2 layers to override the method that is used by the interface to determine the binding IP. That way the OpenThread L2 layer can do the necessary work to look at routing prefixes which match the local IPs, etc. |
This is kind of a train wreck: |
Adding @turon who was also working with OpenThread. |
@mike-scott do you have more details or a hard example of a set of IP addresses assigned and which source IP addresses are causing issues when selected for a given destination? As a basic sanity check is Zephyr at least choosing a GUA (Global Unicast Address) for external routes? |
@turon Zephyr currently does a "diff" against the non-local link IP addresses to calculate how much of the IP matches the destination (most return 0 length match) and randomly ends up with one. Have you tried testing the OpenThread BorderRouter setup yet? It runs into issues immediately. |
@mike-scott, yes I see now what you describe, if source isn't LL it just picks the first well-formed address with most matching bytes: https://github.com/zephyrproject-rtos/zephyr/blob/master/subsys/net/ip/net_if.c#L1978 It needs to be extended to at least choose a GUA if the destination is a global prefix such as 2xxx:: or 3xxx. |
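The extension suggested above (prefer a GUA source when the destination is global) could look roughly like the sketch below. It is written in Python for illustration only and is not Zephyr code: `pick_source` and `is_gua` are hypothetical helpers, `2001:db8::1` is a documentation-prefix stand-in for a real GUA, and the fallback to the first candidate merely stands in for the existing logic.

```python
import ipaddress

# 2000::/3 covers the global unicast addresses that start with 2 or 3.
GLOBAL_UNICAST = ipaddress.ip_network("2000::/3")


def is_gua(addr):
    """Return True when addr falls inside the global unicast range."""
    return ipaddress.ip_address(addr) in GLOBAL_UNICAST


def pick_source(candidates, dst):
    """Hypothetical helper: prefer a GUA source for a GUA destination."""
    if is_gua(dst):
        for addr in candidates:
            if is_gua(addr):
                return addr
    # Stand-in for the existing behaviour: fall back to the first candidate.
    return candidates[0]


candidates = [
    "fdde:ad00:beef:0:0:0:0:2",  # address with a mesh-local style prefix
    "2001:db8::1",               # stand-in for a GUA (assumption)
]
print(pick_source(candidates, "2001:4860:4860::8888"))
```

With a global destination the GUA candidate is returned even though it is listed second. With a destination outside 2000::/3 the sketch simply falls back to the first candidate.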
@turon yes, and to do that I think we need to add a way for each interface's L2 layer to define its own way of selecting a bind IP. I'm looking at OpenThread's source now and I think we can call into the existing OpenThread logic for selecting the correct IP address without having to write our own. The CoAP sample does a bind, etc. |
@turon if you want to set up the OpenThread BorderRouter (https://openthread.io/guides/border-router) and need a Zephyr sample to use with it, I can set up PRs to add a conf file for the LwM2M client as well as the hack patch to make the interface select the right IP. It's a much better "real world" sample than echo-client / echo-server. |
@mike-scott sounds good! Though zephyr may benefit from improved logic for other L2 interfaces. |
agreed! |
I am using CoAP over OT, and I indeed noticed some issue with the source addresses. I don't think that's really a problem for TCP, but for UDP and especially for CoAP, the source address in the reply must match the destination address of the request. For now I am using a small hack getting the ML-EID address and binding only to this one. I think that in general we really want the
It probably also means that Note that I don't think this problem is specific to OT, I am sure it can be reproduced for instance on an Ethernet interface with 2 IPv6 addresses. |
Came here from #12343. Not knowing much about OpenThread. The description here seems very vague and generic. This report seems to be destined to folks who worked on OpenThread like yesterday. (I'm not sure everyone would remember its apparent addressing idiosyncrasies if they worked on it a week ago). I immediately get questions which were already sounded in comments here:
These questions were never answered... So far, the impression I get at the back of my mind is "something's wrong with Thread the protocol, which uses gazillion of obscure addresses". |
Where's exactly the train wreck? The only obvious issue I see is that instead of clear logic like:
it obfuscates it into:
And of course, it lacks code comments describing what exactly it does. But then it's "conspiracy", not a "train wreck" ;-) P.S. Of course, there's a possibility that since the link was posted, the code lines shifted, and I'm looking at a different function than initially was meant. |
The problem happens because CoAP (used by LwM2M) uses UDP as its transport protocol, and because OpenThread uses multiple IPv6 addresses. When the CoAP server receives a request from a client inside a UDP packet, it answers with another UDP packet. The source address that is used in the answer to the client does not match the address the client used in the request. In that case the packet is simply discarded by the client. This is actually described here: https://tools.ietf.org/html/draft-ietf-lwig-coap-06#section-5.1.1 I guess that other protocols using UDP might be affected. |
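The client-side check described above can be sketched as follows. This is a simplified Python illustration, not code from any CoAP library: `reply_is_acceptable` is a hypothetical helper, and both addresses are made up using the `fdde:ad00:beef` prefix that appears elsewhere in this thread.

```python
import ipaddress


def reply_is_acceptable(request_dst, reply_src):
    """Accept a reply only if it comes from the address the request was sent to."""
    return ipaddress.ip_address(request_dst) == ipaddress.ip_address(reply_src)


# Hypothetical addresses: the client sent its request to the first one,
# but the server answered from a different address on the same interface.
request_dst = "fdde:ad00:beef:0:0:0:0:2"
reply_src = "fdde:ad00:beef:0:0:ff:fe00:3c00"

print(reply_is_acceptable(request_dst, reply_src))    # the reply would be dropped
print(reply_is_acceptable(request_dst, request_dst))  # the reply would be accepted
```

Comparing parsed addresses rather than strings means that different textual spellings of the same address still match.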
@aurel32: Thanks, I understood a generic problem from the discussion above. I don't see specific information which backs a problem report. Like: "With OpenThread enabled I get (on a particular invocation, as (some?) addresses are randomly generated): 1) XXXX:YYYY::ZZZZ, type: X; 2) ... 3) ... 4) ... . When sending a reply to XXX1:YYY1::ZZZ1, the source address Q:P::M is selected, whereas I expect A:B::C selected, because ..." etc. etc.
Good. I suppose, it doesn't say the way to resolve this is to add adhoc layer-piercing callbacks to the IP stack, as proposed in #12343 ? So, let's make everyone understand ins and outs of the problem (a specific example is a must for that), and enumerate solutions we can come up with. |
Yes, because the problem was vague and generic 19 days ago. Turns out, Zephyr has a problem routing IPv6 packets ANY TIME there is more than 1 valid unicast IP on the interface where only 1 works correctly for routing off the local network.
The "train wreck" is only doing a "diff" check of IPv6 prefixes (which almost always comes back with 0 match) to determine if we should use the interface IP or not. It's only part of a spec and it's the central problem here. "train wreck" is also a slight exaggeration.
Woke up on the wrong side of the bed today?
Complain much? |
Nope, just getting back to work and trying to go thru the backlog for reviews I'm set as a reviewer. And trying to understand PRs submitted and reasoning behind them, and if I can't, trying to explain why. As I said, I'm not OpenThread expert, I assume other folks can understand you from half a word. I can't, sorry about that. |
Let's try and stay positive and respectful, everyone. I for one am super excited that OpenThread is running on zephyr. When I start a Leader device, it creates a number of IP addresses:
Really, the interface should not expose RLOC addresses (fdde:ad00:beef:0:0:ff:fe00:*) to the higher layer -- that violates the Thread specification. I'm not sure where fdde:ad00:beef:0:0:0:0:2 is coming from. My expectation would be that only the following addresses would be exposed: ML-EID and LL64. and that the OpenThread netif should filter out the invalid addresses from being exposed to the zephyr IP stack.
I could see how all those invalid Mesh Local (Realm Local) addresses would confuse the current implementation. @mike-scott are you saying link local and GUA addressing isn't working either? |
@turon I can ping nodes with link local addresses. What do you see when you add a border router prefix? On my end, I see an additional unicast address added to every node (using ML-EID suffix). It's the presence of those 2 unicast IPs that Zephyr doesn't know how to deal with (it doesn't understand the mesh-local concept). Under non-border router usecases communication is fine. |
@turon I can post a patch which keeps the RLOC addresses from being added to Zephyr. Are there any of the multicast addresses which should be removed as well? |
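A filter of the kind discussed above might be sketched like this, assuming the addresses to hide are recognizable by the interface identifier pattern quoted earlier (`fdde:ad00:beef:0:0:ff:fe00:*`, i.e. `0000:00ff:fe00` followed by a 16-bit value). This is a Python illustration only, `is_locator` and `filter_exposed` are hypothetical names, and the last address in the list is made up.

```python
import ipaddress

# Bytes 8..13 of the address: interface identifier starting with 0000:00ff:fe00.
LOCATOR_IID_PREFIX = bytes([0x00, 0x00, 0x00, 0xFF, 0xFE, 0x00])


def is_locator(addr):
    """Return True when the interface identifier matches 0000:00ff:fe00:xxxx."""
    packed = ipaddress.IPv6Address(addr).packed
    return packed[8:14] == LOCATOR_IID_PREFIX


def filter_exposed(addrs):
    """Keep only the addresses that should be visible to the IP stack."""
    return [a for a in addrs if not is_locator(a)]


addrs = [
    "fdde:ad00:beef:0:0:ff:fe00:fc00",       # matches the locator pattern
    "fdde:ad00:beef:0:0:0:0:2",              # static address from this thread
    "fdde:ad00:beef:0:1111:2222:3333:4444",  # made-up stable address
]
print(filter_exposed(addrs))
```

Only the first address is dropped. The other two keep flowing to the IP stack unchanged.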
I agree here, if we discard these we could implement source address selection without any extra rules needed for Thread.
It's just a static address registered by an application.
That's correct, OpenThread will register a global address when it's given a BR prefix, and this address should be used as source for off-mesh communication. But to achieve this we need a proper logic for source address selection which would at least consider IPv6 address scope. |
@turon :
Thanks, that's more specific, but still misses some information. Each IP address in Zephyr has some properties assigned to it. Here's example output from "net iface" command of standard "echo_server" sample as run via QEMU SLIP networking:
For example, 2001:db8::1 has properties of "manual" [assigned], "preferred", "infinite" [lifetime]. These properties largely come from Internet RFC, though some may be custom too. And again, we (or I) would need to consult RFCs, but it might be possible to conjecture that an address not marked as "preferred" should not be selected as a source address when performing automatic address binding. If the above doesn't/wouldn't work (e.g. if semantics of "preferred" is strictly specified by some RFC and overloading it seems like too much), we can introduce adhoc properties. E.g., could have "internal address" property, which makes sure that the address is never selected for any automatic binding, period (could be used only with explicit bind() to it). And again, all the above just comes from speculative contemplating the issue described here (with a noticeable lack of details in the description, as was pointed out). But the right way to go about it is to read RFCs, look how Linux deals with it, etc. |
Which interface and what higher layer is meant here? In terms of Zephyr networking (which follows the spirit of POSIX/Unix), each network interface can have a number of addresses assigned. But as pointed out above, each address has its own properties. Can the following be concluded to be true:
? In addition to that, is following true:
? If both of these statements can be classified as true, then indeed, there's no need to assign "RLOC addresses (fdde:ad00:beef:0:0:ff:fe00:*)" on the level of Zephyr interface. Well, such a choice would come with its own pros and cons. For example, I may imagine that only bare minimum of such a protocol is generated within OpenThread core. More complex setups, e.g., a border router, might actually implement parts of Thread LL protocol in "user space", and then exposing corresponding addresses is useful. But such addresses should come with corresponding properties set, precluding their usage by unaware user applications. |
Here is the info on a node without border router, so without global IP:
The RLOC IP is fd2f:57da:bdf0::ff:fe00:3c00 and is not stable. The ML-EID IP is fd2f:57da:bdf0:0:a72b:f9b7:1828:f861, this one is stable. Here is a tcpdump of what happens when trying to execute a CoAP GET request from host fd2f:57da:bdf0:0:58f2:20d:da9:bd64:
As you can see the source IP is the RLOC one, so the host answers with an ICMP6 destination unreachable message and retries the request a few seconds later. After applying the patches from #12343, the right source IP is selected and everything works as expected:
|
I've finally found some time to test the BR setup and look deeper into Zephyr's source address selection. The point is I (almost) wasn't able to reproduce the issue on my setup. This is the configuration I got, running
And with this configuration I was easily able to ping Google's DNS server:
Three caveats:
I also took some time and compared Zephyr's and OpenThread's source address resolution algorithms, and the conclusion is that actually they're not that different in terms of address selection for a global destination. Sure, looking at OpenThread's implementation it's clear it follows the RFC, but in terms of scope/prefix match Zephyr does more or less the same, even if it's not obvious at first glance. So in the above example, all addresses but GUA got 0 bits of prefix match, while GUA got 4 bits and got selected. And as long as we have a valid GUA address provided by OT, this should always happen, no place for randomness here. So, to summarize, is there a chance that you could provide more details about the setup you run (the destination address you try to reach, what addresses you have on the interface in Zephyr) so that I have more data to reproduce the issue? |
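The prefix-match comparison described above can be illustrated with a short Python sketch. It is not the Zephyr implementation: `common_prefix_bits` and `best_source` are hypothetical helpers, `2001:db8::1` stands in for a GUA, and `fd00:db8::1` is a made-up second unique-local address.

```python
import ipaddress


def common_prefix_bits(a, b):
    """Count how many leading bits two IPv6 addresses share."""
    diff = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
    return 128 - diff.bit_length()


def best_source(candidates, dst):
    """Pick the candidate with the longest match; ties go to the first one."""
    return max(candidates, key=lambda c: common_prefix_bits(c, dst))


dst = "2001:4860:4860::8888"

# A GUA candidate beats a unique-local one for a global destination.
print(best_source(["fdde:ad00:beef::2", "2001:db8::1"], dst))

# Two unique-local candidates both score 0 bits, so list order decides.
print(best_source(["fdde:ad00:beef::2", "fd00:db8::1"], dst))
print(best_source(["fd00:db8::1", "fdde:ad00:beef::2"], dst))
```

The last two calls show where the choice stops being meaningful: when no candidate shares any prefix bits with the destination, the result depends only on the order in which the addresses are stored.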
First I apologize to everyone on this issue. I was working with OpenThread and Zephyr heavily during the holidays and then immediately got dragged off onto another project. I haven't had much time till now to get back into this. @rlubos In my case, I'm working with an edge gateway setup where we translate IPv6 to IPv4. This resembles the https://openthread.io/guides/border-router setup, except that instead of Tayga we use the Jool stateful NAT64 kernel module (https://www.jool.mx/en/index.html). We also use DNS64 so that the OpenThread nodes get IPv6 versions of the IPv4 DNS requests.
In my setup, the border router advertises the IPv6 prefix: fd11:22::/64 There is a rule in OpenThread's logic which selects the advertised prefix over the mesh-local prefix. I need to go back and find which one. On the Zephyr device it looks like this:
|
Thanks @mike-scott for the details. From the description, I would like to add it's actually a different issue than the one I encountered, and my issue should be fixed by filtering out RLOC and ALOC from Zephyr's interface. @rlubos I am available for testing such a patch when you have one ready. |
@mike-scott Thanks for shedding more light on the problem. Now I understand what you meant by "Zephyr doesn't understand the mesh-local concept". And since, according to the IPv6 RFC, unicast addressing is only link-local or everything else, scope matching won't be much help here. OpenThread bypassed this by introducing scope overriding for mesh-local addresses, but to be honest, it took me a moment to understand the logic behind it, and I'm not sure if the change would be generic enough to apply it here. A solution that comes to my mind, inspired by @pfalcon's response, would be to introduce a new address property, let's call it This rule might not be explicitly specified by RFC6724, but still it allows for some flexibility here:
And in this case, I'm convinced we know better. |
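One way such an address property could be used is sketched below in Python. This only illustrates the idea and is not the code in the linked branch: the `mesh_local` field, the `IfAddr` class and `usable_sources` are hypothetical, `fd11:22::1234` is a made-up address on the advertised `fd11:22::/64` prefix, and the NAT64 destination `64:ff9b::808:808` is an assumed example of an off-mesh target.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class IfAddr:
    addr: str
    mesh_local: bool = False  # hypothetical property, set by the L2 layer


def usable_sources(addrs, dst, mesh_local_prefix):
    """Drop mesh-local addresses unless the destination is on the mesh."""
    on_mesh = ipaddress.ip_address(dst) in ipaddress.ip_network(mesh_local_prefix)
    return [a.addr for a in addrs if on_mesh or not a.mesh_local]


MESH_LOCAL_PREFIX = "fdde:ad00:beef::/64"
addrs = [
    IfAddr("fdde:ad00:beef::2", mesh_local=True),
    IfAddr("fd11:22::1234"),  # made-up address on the border router prefix
]

# Off-mesh destination: only the non-mesh-local address remains.
print(usable_sources(addrs, "64:ff9b::808:808", MESH_LOCAL_PREFIX))

# On-mesh destination: every address stays eligible.
print(usable_sources(addrs, "fdde:ad00:beef::1", MESH_LOCAL_PREFIX))
```

The filter runs before any prefix matching, so the existing longest-match logic would only ever see addresses that can actually reach the destination.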
This is more or less what I had in mind: https://github.com/rlubos/zephyr/commits/ipv6-source-address-selection |
@rlubos thanks for the investigation. This indeed sounds like a simple solution for the issue and the patch is easy to understand. |
@rlubos to confirm: I tested your ipv6-source-address-selection branch patches and they do indeed solve my routing issues. My interface looks like this now:
Both fdde:ad00:beef:0::/64 IPs are marked as "local". And when I alter the logic to include: #12686
|
I have just tested that, and as expected it also fixes my issue. Thanks. |
So as you seem happy with the change, I'll make a PR out of it once #12686 is in. I'll stick to the |
NET_ADDR_LOCAL is rather ambiguous. You mentioned "link-local" yourself, and this term is one of the most important terms in IPv6. But there're more "*-local" terms in IPv6, https://en.wikipedia.org/wiki/IPv6_address#Multicast . So, calling something NET_ADDR_LOCAL is confusing and will definitely lead to misuse. If we talk about mesh-local here, let's call it like that: NET_ADDR_MESH_LOCAL. Mesh-under is well-known concept, and some implementation may still use IPv6 formatted addresses for it, to ease introspection and management. That's to address @jukkar's concern of adding something adhocly for OpenThread. And that argument is if we need both of NET_ADDR_INTERNAL and NET_ADDR_LOCAL, like https://github.com/rlubos/zephyr/commits/ipv6-source-address-selection currently has. If we need just one, NET_ADDR_INTERNAL is the obvious generic choice. |
I'll use If we'd keep registering RLOCs, then we'd need both types. As we agreed to drop RLOC registration, only a mesh local concept is needed. And finally, I'm thinking about adding a new flag into Let me finish the code and open a PR. We can move the discussion there then. |
"manual" means explicitly assigned by a human. I don't think that any of IPv6 "support" addresses can be like that. They are by definition managed by the stack, which is the opposite of "manual". |
But this is not about whether address is an IPv6 support address or not. We need information whether address was created as mesh-local (which can only be used for on-mesh communication) or not. And mesh-local addresses should not be used for external destinations. Apart from the RLOC and ALOC addresses (which we can call "support" addresses, but we've already agreed to drop them) you have one mesh-local address autoconfigured by Thread interface which is stable over reboots and which can safely be used by an application. Additionally, in echo samples, we manually add one address with mesh-local prefix, so that we can bootstrap the communication without a need for any extra discovery mechanism. These addresses are created in a different way, yet both should not be used for off-mesh destinations. |
So, my last reply was to your phrase: "I'm thinking about adding a new flag into net_if_addr instead of overwriting net_addr_type. Why? Because we lose some information when we overwrite address type (autoconf/manual).". My argument is that we wouldn't lose any info, from a quick think.
Well, that would qualify as "manual" address. But does it have to be marked as special "mesh local"? All in all, let's indeed look at the patch, I just hope that you start minimal and there will be good explanations (with examples) in commit messages if we really need to add "more". |
Describe the bug
OpenThread generates quite a lot of unicast and multicast IP addresses for MeshLocal, default route, and other routable prefixes. When it comes time for Zephyr to bind to a local IP for creating a connection to an external service, there just isn't enough logic to select the correct IP.
This results in communication problems when dealing with off-mesh services.
I can place a weight on local IPs that hacks Zephyr into picking the right bind IP.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Local OpenThread nodes should be able to select the correct local IP address to use for connections
Impact
Essentially OpenThread is unusable for practical communications until this issue is solved.
Environment (please complete the following information):