Ways to reduce Bluetooth Mesh message loss #13553
Comments
I enabled the MESH/NET debug logs; in the failing cases no reply broadcast packet was received.
My message is less than 8 bytes and is not segmented.
What kind of Network Transmit do you have configured? Have you tried increasing the number of retransmissions with it?
FWIW, it's a known issue that there's room for improvement in the reliability of message reception, even though with mesh it can never be 100%. @cvinayak @bluetooth-mdw FYI
@jhedberg I didn't use retransmission; it is an unreliable message. I have just two nodes. Surprisingly, the success rate from A to B is 99%, but the success rate of B's response to A is only 60%.
@xiaoliang314 usually unreliable messages are defined as those which don't solicit any response from the other side. It's unclear to me what this has to do with the Network Transmit state? It's usually a good idea to have at least a few transmissions, and that's also what the samples in the Zephyr tree do, i.e. they have something like the following in their bt_mesh_cfg_srv struct definition:
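For orientation, here is a minimal sketch (not the snippet from the Zephyr samples) of what a Network Transmit state encodes. It assumes the layout described in the Mesh Profile specification: a 3-bit count of additional transmissions and a 5-bit interval-step field in units of 10 ms. The function names `encode_transmit` and `decode_transmit` are hypothetical, and the values used are chosen only for illustration.

```python
def encode_transmit(count: int, interval_ms: int) -> int:
    """Pack a transmit state into one octet.

    Assumed layout: bits 0-2 hold the number of *additional* transmissions,
    bits 3-7 hold the interval steps, where interval = (steps + 1) * 10 ms.
    """
    if not 0 <= count <= 7:
        raise ValueError("count must fit in 3 bits (0..7)")
    if interval_ms % 10 != 0 or not 10 <= interval_ms <= 320:
        raise ValueError("interval must be a multiple of 10 ms in 10..320")
    steps = interval_ms // 10 - 1
    return count | (steps << 3)


def decode_transmit(value: int) -> tuple[int, int]:
    """Return (total transmissions, interval in ms) for a packed state."""
    count = value & 0x07
    steps = (value >> 3) & 0x1F
    return count + 1, (steps + 1) * 10


# Illustrative values: 2 retransmissions, 20 ms apart.
state = encode_transmit(2, 20)
total, interval = decode_transmit(state)
print(f"state=0x{state:02x}: {total} transmissions, {interval} ms apart")
```

With a count of 0 each network PDU is sent exactly once, so a single missed scan window on the receiver loses the message.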
@xiaoliang314 yes, that could be one of the reasons.
My configuration is:
These options are enabled in menuconfig:
I modified these parameters in mesh/adv.c:
@jhedberg FYI
On the surface this looks like a buffer configuration issue. With the nodes being relays, they will be processing multiple repeated messages, which will consume precious resources. Billy..
@xiaoliang314 a critical part of the configuration that can influence packet loss is the number of RX buffers. You didn't mention that. It's defined through
I've reclassified this as a question/enhancement request, since we haven't identified any obvious thing that's broken with the current implementation. Bluetooth Mesh has an inherent property of being an unreliable transport, so there will always be some message loss. What needs to be focused on is how to minimize it. A lot can already be done with the current implementation by carefully fine-tuning the various configuration parameters. What will further improve matters are the HCI extensions for Mesh; however, those will appear in Zephyr 1.15 at the earliest.
@jhedberg I don't think so. I put the two nodes in a closed environment and got the same result. There are only 3-6 advertising messages per second between them, yet the probability of loss is still relatively large. This only happens when the other node receives a message and immediately replies.
I manually performed one round-trip communication per second, and the probability of loss was the same.
@xiaoliang314 in mesh terms that's quite a lot. The maximum allowed by the spec is 10 per second (actually 100 per 10 seconds, so you can have some bursts), however depending on your Network Transmit state and controller support the practical maximum may be as low as 3 per second. With the Zephyr controller and the default Network Transmit state that's present in most sample applications, one packet will take about 120ms to transmit. If you have a logic analyzer, it'd be very helpful if you could enable
Adding @cvinayak as a second assignee since part of the improvement solution will be the vendor HCI extensions for mesh.
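The figures quoted above can be put side by side with a little arithmetic. This is a rough sketch only: it assumes packets are sent back to back with nothing else using the radio, and the helper names `max_rate_per_second` and `within_spec_budget` are hypothetical.

```python
def max_rate_per_second(packet_time_ms: float) -> float:
    """Upper bound on packets per second if each one occupies packet_time_ms."""
    return 1000.0 / packet_time_ms


def within_spec_budget(sent_at_s, now_s, limit=100, window_s=10.0):
    """True if one more message may be sent without exceeding `limit`
    messages inside the last `window_s` seconds (sliding window)."""
    recent = [t for t in sent_at_s if now_s - window_s < t <= now_s]
    return len(recent) < limit


# About 120 ms per packet gives roughly 8 packets per second at best.
print(round(max_rate_per_second(120), 2))

# 6 packets per second at 120 ms each keep the radio transmitting
# for 720 ms out of every second.
print(6 * 120, "ms of transmit time per second")
```

At 3-6 messages per second the transmitting side alone can occupy a large share of each second, which leaves correspondingly less time for scanning.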
@xiaoliang314 that's something that @cvinayak will need to answer (I only know the controller internals at a fairly high level).
@jhedberg Thanks for your answer. I know that the Bluetooth Mesh specification allows 100 messages to be transmitted per 10 seconds. But as a chip or module manufacturer, we should find out why the observed behavior contradicts the theory, and improve our products. In an environment without RF interference, we believe the communication success rate should be 100%, but it is actually lower. If we solve this problem, it will improve the overall communication quality.
@xiaoliang314 even in an interference-free environment you cannot have 100% reception probability unless you have 4 independent radios on the receiver: 3 for the advertising channels (one each, every one continuously scanning its own channel), and 1 for advertising (so that the advertising doesn't interrupt scanning). With a single radio you cannot be listening on every channel all the time, and since you cannot know on which channel and when a packet will be transmitted there will always be the chance to miss that packet. Here's another Kconfig change you could try:
With that I was able to reduce the time from the host requesting the controller to advertise, to the controller actually starting advertising. It basically eliminated one scan window from the latency.
@jhedberg The configuration you provided is very useful. In my tests the communication quality improved greatly. But the receiving node often hits a BUS FAULT error. I changed the scan interval to 15ms because 30ms did not work very well.
@xiaoliang314 does 0x185a8 resolve to some meaningful location in the code? (addr2line -e zephyr.elf 0x185a8) First thing to check is if all threads are staying within bounds, i.e. that what you're seeing is not a stack overflow. Also, you should look up which thread 0x20000a80 is. Doing e.g. "p (void *)0x20000a80" in gdb should show that. That would be the first thread of interest in the stack usage analysis.
@jhedberg I used gdb to reproduce this problem; the thread is <tx_thread_data>.
I found the configuration option for the TX thread stack: CONFIG_BT_HCI_TX_STACK_SIZE. I will increase it and try again.
I set CONFIG_BT_HCI_TX_STACK_SIZE to 2048, but the problem still exists. I checked with gdb that the setting took effect.
@xiaoliang314, I hope you have fixed your problem by now. From my own experiments, I've seen that the Network layer keeps relaying the packet until its TTL has reached zero (even with the replay protection and message cache), probably causing increased traffic (you have to enable the Network Layer log to see this). So maybe you could try lowering the
I am having a similar problem with packet reception ratio when multi-hop is required to deliver a packet and I've tried some of the recommendations @jhedberg made in this thread. So far, I've seen the best results by increasing the
Regards,
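To make the effect of TTL on relay traffic concrete, here is a small sketch of the relay rule as I read it in the Mesh Profile specification: a PDU received with TTL 2 or higher may be relayed with TTL decremented by one, while TTL 0 or 1 is not relayed. The names `relayed_ttl` and `max_relay_hops` are hypothetical, and the sketch ignores the message cache entirely.

```python
from typing import Optional


def relayed_ttl(ttl: int) -> Optional[int]:
    """TTL to use for the relayed copy, or None if the PDU is not relayed."""
    if not 0 <= ttl <= 127:
        raise ValueError("TTL is a 7-bit field (0..127)")
    if ttl < 2:
        return None  # TTL 0 and 1 are never relayed further
    return ttl - 1


def max_relay_hops(initial_ttl: int) -> int:
    """Number of relay hops a PDU can make along one path."""
    hops = 0
    ttl = relayed_ttl(initial_ttl)
    while ttl is not None:
        hops += 1
        ttl = relayed_ttl(ttl)
    return hops


for ttl in (0, 1, 2, 7):
    print(f"initial TTL {ttl}: up to {max_relay_hops(ttl)} relay hop(s)")
```

Under this rule a lower initial TTL directly caps how many times a message is re-advertised, which is why it is a natural knob for reducing traffic in a small network.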
@EddLeon No solution; I have replaced the chip. The most probable cause I found is that both sending and receiving in the controller work through time slots. When the controller sends, it grabs a scan interval (30ms) but only uses about 3ms of it, and does nothing for the remaining 27ms. This is likely an important cause of message loss.
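The slot observation above can be turned into a rough estimate of how much scanning time is lost. This is a back-of-the-envelope sketch under the stated numbers (a 30 ms slot of which about 3 ms is used), not a model of the actual controller scheduler; `idle_time_ms` and `blind_fraction` are hypothetical helpers.

```python
def idle_time_ms(slot_ms: float, used_ms: float) -> float:
    """Time inside a grabbed slot during which the radio neither sends nor scans."""
    return slot_ms - used_ms


def blind_fraction(tx_events_per_s: float, slot_ms: float) -> float:
    """Fraction of each second the receiver cannot scan, assuming every
    transmit event blocks scanning for one whole slot."""
    return min(1.0, tx_events_per_s * slot_ms / 1000.0)


print(idle_time_ms(30, 3), "ms idle per transmit slot")
for events in (1, 3, 6):
    share = blind_fraction(events, 30)
    print(f"{events} transmit events/s: {share:.0%} of scan time lost")
```

If the slot were only as long as the time actually used, the same traffic would cost roughly a tenth of that scanning time.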
@xiaoliang314 can you please retry your benchmarks using the new split controller? Make sure that
@carlescufi Sorry, after the version update this test code no longer works, and I don't have time to continue debugging it.
So the conclusion here is that, for better performance (to solve the mentioned issue), we have to set
and, if necessary, increase the values of
I found the values of (as per build/zephyr/.config),
Closing, as there is no planned work related to this issue.
I created two Bluetooth Mesh nodes. I make one node send a message to the other, and the other node replies immediately after receiving it. The next message is not sent until the reply is received or a timeout occurs.
In my tests with nodes A and B, the success rate from A to B is 99%, while the success rate of B's response to A is 60%. I can guarantee that B successfully sends the response message over the air.
My nodes use nrf52_pca10040. I changed the broadcast interval and scan window in mesh/adv.c to 20ms and 15ms, because the default parameters perform worse. My code is based on the sample/mesh_demo example.