Ways to reduce Bluetooth Mesh message loss #13553

Closed
xiaoliang314 opened this issue Feb 20, 2019 · 27 comments
Labels
area: Bluetooth Mesh area: Bluetooth Enhancement Changes/Updates/Additions to existing features

Comments

@xiaoliang314

xiaoliang314 commented Feb 20, 2019

    I created two Bluetooth mesh nodes. I have one node send a message to the other, which replies immediately after receiving it. The next message is not sent until the reply arrives or a timeout expires.

    In my test, nodes A and B communicate. The success rate from A to B is 99%, but the success rate of B's response to A is only 60%. I have confirmed that B does transmit the response over the air.

    My nodes use nrf52_pca10040. I changed the advertising interval and scan window in mesh/adv.c to 20ms and 15ms because the default parameters performed worse. My code is based on the sample/mesh_demo example.

@xiaoliang314 xiaoliang314 added the bug The issue is a bug, or the PR is fixing a bug label Feb 20, 2019
@xiaoliang314
Author

xiaoliang314 commented Feb 20, 2019

I enabled the MESH / NET log output. When a round trip fails, no reply advertising packet is received.

@xiaoliang314
Author

My message is less than 8 bytes and not segmented.

@jhedberg
Member

What kind of network transmit do you have configured? Have you tried increasing the number of retransmissions using it?

@jhedberg
Member

FWIW, it's a known issue that there's room for improvement in the reliability of message reception, even though with mesh it can never be 100%.

@cvinayak @bluetooth-mdw FYI

@xiaoliang314
Author

xiaoliang314 commented Feb 20, 2019

@jhedberg I didn't use retransmission, it is an unreliable message. I have just two nodes. Surprisingly, the success rate from A to B is 99%, but the success rate of B's response to A is only 60%.

    I used an ammeter to observe the chip's radio activity. I found that the controller uses a scan interval time slot to send. However, after the TX is completed, the RX is not immediately turned on, but waits for the next scan interval time slot to be turned on. Is this one of the reasons?

@jhedberg
Member

@jhedberg I didn't use retransmission, it is an unreliable message.

@xiaoliang314 usually unreliable messages are defined as those which don't solicit any response from the other side. It's unclear to me what this has to do with the Network Transmit state? It's usually a good idea to have at least a few transmissions, and that's also what the samples in the Zephyr tree do, i.e. they have something like the following in their bt_mesh_cfg_srv struct definition:

        /* 3 transmissions with 20ms interval */
        .net_transmit = BT_MESH_TRANSMIT(2, 20),

after the TX is completed, the RX is not immediately turned on, but waits for the next scan interval time slot to be turned on. Is this one of the reasons?

@xiaoliang314 yes, that could be one of the reasons.
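
As an illustration of the Network Transmit value quoted above, the following sketch reimplements the encoding locally. It assumes the layout defined by the Mesh Profile specification (a 3-bit retransmit count plus 5-bit interval steps of 10 ms). The `mesh_transmit_*` helper names are hypothetical stand-ins and are not Zephyr APIs.

```c
#include <stdint.h>

/* Hypothetical local stand-in for the BT_MESH_TRANSMIT() encoding:
 * bits 0..2 hold the number of retransmissions,
 * bits 3..7 hold (interval_ms / 10) - 1.
 */
static inline uint8_t mesh_transmit_encode(uint8_t retransmits,
					   uint16_t interval_ms)
{
	return (uint8_t)((retransmits & 0x07u) |
			 (((interval_ms / 10u) - 1u) << 3));
}

/* Number of retransmissions stored in the low 3 bits. */
static inline uint8_t mesh_transmit_count(uint8_t transmit)
{
	return (uint8_t)(transmit & 0x07u);
}

/* Interval between transmissions, in milliseconds. */
static inline uint16_t mesh_transmit_interval_ms(uint8_t transmit)
{
	return (uint16_t)(((transmit >> 3) + 1u) * 10u);
}

/* Total transmissions on air: the original one plus the retransmissions. */
static inline uint8_t mesh_total_transmissions(uint8_t transmit)
{
	return (uint8_t)(mesh_transmit_count(transmit) + 1u);
}
```

With these helpers, `mesh_transmit_encode(2, 20)` describes 2 retransmissions at a 20 ms interval, so `mesh_total_transmissions()` reports 3, which matches the "3 transmissions with 20ms interval" comment in the snippet.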

@xiaoliang314
Author

My configuration is:

static struct bt_mesh_cfg_srv cfg_srv = {
	.relay = BT_MESH_RELAY_DISABLED,
	.beacon = BT_MESH_BEACON_ENABLED,
#if defined(CONFIG_BT_MESH_FRIEND)
	.frnd = BT_MESH_FRIEND_ENABLED,
#else
	.frnd = BT_MESH_FRIEND_NOT_SUPPORTED,
#endif
#if defined(CONFIG_BT_MESH_GATT_PROXY)
	.gatt_proxy = BT_MESH_GATT_PROXY_ENABLED,
#else
	.gatt_proxy = BT_MESH_GATT_PROXY_NOT_SUPPORTED,
#endif
	.default_ttl = 7,

	/* 3 transmissions with 20ms interval */
	.net_transmit = BT_MESH_TRANSMIT(2, 20),
	.relay_retransmit = BT_MESH_TRANSMIT(2, 20),
};

These options are enabled in menuconfig:

[*] Relay support
[ ] Support for Low Power features
[ ] Support for acting as a Friend Node
[*] Support for Configuration Client Model
[ ] Support for Health Client Model

I modified these parameters in mesh/adv.c:

/* Window and Interval are equal for continuous scanning */
#define MESH_SCAN_INTERVAL_MS 15
#define MESH_SCAN_WINDOW_MS   15
#define MESH_SCAN_INTERVAL    ADV_SCAN_UNIT(MESH_SCAN_INTERVAL_MS)
#define MESH_SCAN_WINDOW      ADV_SCAN_UNIT(MESH_SCAN_WINDOW_MS)

/* Pre-5.0 controllers enforce a minimum interval of 100ms
 * whereas 5.0+ controllers can go down to 20ms.
 */
#define ADV_INT_DEFAULT_MS 20
#define ADV_INT_FAST_MS    20
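
For readers unfamiliar with the units involved: BLE scan and advertising timings are passed to the controller in steps of 0.625 ms. The sketch below shows that conversion and the nominal scan duty cycle for a given window and interval. The function names are hypothetical and only mirror what a macro such as ADV_SCAN_UNIT is assumed to do. The duty cycle is nominal: it does not account for the controller interrupting scanning in order to advertise.

```c
#include <stdint.h>

/* Convert milliseconds to BLE timing units of 0.625 ms (5/8 ms).
 * Hypothetical helper; assumed to match what ADV_SCAN_UNIT() computes.
 */
static inline uint16_t ms_to_scan_units(uint16_t ms)
{
	return (uint16_t)(((uint32_t)ms * 8u) / 5u);
}

/* Nominal share of time spent scanning, in percent.
 * window == interval means continuous scanning (100%).
 */
static inline uint32_t scan_duty_percent(uint32_t window_ms,
					 uint32_t interval_ms)
{
	if (interval_ms == 0u) {
		return 0u; /* invalid input: avoid dividing by zero */
	}
	return (window_ms * 100u) / interval_ms;
}
```

Under these assumptions, the modified 15 ms window becomes 24 units, the 20 ms advertising interval becomes 32 units, and a window equal to the interval gives a nominal duty cycle of 100%.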

@jhedberg FYI

@nashif nashif added the priority: medium Medium impact/importance bug label Feb 21, 2019
@WilliamGFish
Collaborator

On the surface, this looks like a buffer configuration issue. With the nodes being relays, they will be processing multiple messages that have been repeated, which consumes precious resources.
Start by removing the Relay option and see what effect that has on reliability. I would then suggest looking at the mesh buffers.

Billy..

@jhedberg
Member

jhedberg commented Mar 7, 2019

@xiaoliang314 a critical part of the configuration that can influence packet loss is the number of RX buffers. You didn't mention that. It's defined through CONFIG_BT_RX_BUF_COUNT.

@jhedberg jhedberg added Enhancement Changes/Updates/Additions to existing features question and removed bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug labels Mar 7, 2019
@jhedberg jhedberg changed the title nrf52_pca10040 Bluetooth mesh message loss rate is high Ways to reduce Bluetooth Mesh message loss Mar 7, 2019
@jhedberg
Member

jhedberg commented Mar 7, 2019

I've reclassified this as a question/enhancement request, since we haven't identified any obvious thing that's broken with the current implementation. Bluetooth Mesh has an inherent property of being an unreliable transport, so there will always be some message loss. What needs to be focused on is how to minimize it. A lot can already be done with the current implementation by carefully fine-tuning the various configuration parameters. What will further improve matters are the HCI extensions for Mesh; however, those will appear in Zephyr 1.15 at the earliest.

@xiaoliang314
Author

@jhedberg I don't think so. I put the two nodes in a closed environment and got the same result. The advertising message between them is only 3-6 per second. Even so, the probability of loss is still relatively high. The loss occurs only when the other node receives a message and replies immediately.

@xiaoliang314
Author

When I manually trigger one round trip per second, the probability of loss is the same.

@jhedberg
Member

jhedberg commented Mar 8, 2019

The advertising message between them is only 3-6 per second.

@xiaoliang314 in mesh terms that's quite a lot. The maximum allowed by the spec is 10 per second (actually 100 per 10 seconds, so you can have some bursts), however depending on your Network Transmit state and controller support the practical maximum may be as low as 3 per second. With the Zephyr controller and the default Network Transmit state that's present in most sample applications, one packet will take about 120ms to transmit.
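
The arithmetic behind these figures can be sketched as follows. The 100-messages-per-10-seconds limit and the roughly 120 ms per packet are taken from the comment above; the function names are hypothetical. The occupancy figure only indicates how long the transmit procedure is in progress. It is an assumption of this sketch that this time competes with scanning, since the controller may interleave the two.

```c
#include <stdint.h>

#define MESH_SPEC_MAX_MSGS    100u   /* limit quoted above: 100 messages ... */
#define MESH_SPEC_WINDOW_MS 10000u   /* ... per 10 seconds */

/* Upper bound on messages per 10 s window, given how long one message
 * takes to transmit; capped at the limit quoted from the specification.
 */
static inline uint32_t max_msgs_per_window(uint32_t packet_time_ms)
{
	uint32_t fit;

	if (packet_time_ms == 0u) {
		return MESH_SPEC_MAX_MSGS;
	}
	fit = MESH_SPEC_WINDOW_MS / packet_time_ms;
	return (fit < MESH_SPEC_MAX_MSGS) ? fit : MESH_SPEC_MAX_MSGS;
}

/* Milliseconds per second during which a transmit procedure is in
 * progress, for a given message rate.
 */
static inline uint32_t tx_occupancy_ms_per_sec(uint32_t msgs_per_sec,
					       uint32_t packet_time_ms)
{
	return msgs_per_sec * packet_time_ms;
}
```

At 120 ms per message, at most 83 messages fit into a 10 second window, and a rate of 3 to 6 messages per second keeps a transmit procedure in progress for 360 to 720 ms of every second.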

If you have a logic analyzer, it'd be very helpful if you could enable CONFIG_BT_CTLR_DEBUG_PINS for both devices and see how their radio state timings compare with each other. Btw, the 30ms value for scanning should yield a better probability of reception than your modified 15ms value.

Adding @cvinayak as a second assignee since part of the improvement solution will be the vendor HCI extensions for mesh.

@jhedberg
Member

jhedberg commented Mar 8, 2019

I used an ammeter to observe the chip's radio activity. I found that the controller uses a scan interval time slot to send. However, after the TX is completed, the RX is not immediately turned on, but waits for the next scan interval time slot to be turned on. Is this one of the reasons?

@xiaoliang314 that's something that @cvinayak will need to answer (I only know the controller internals on a fairly high-level).

@xiaoliang314
Author

@jhedberg Thanks for your answer. I know that the Bluetooth Mesh specification allows 100 messages to be transmitted per 10 seconds. But as a chip or module manufacturer, we should find the reason why this problem contradicts the theory, and improve our products. In an environment without RF interference, we believe the communication success rate should be 100%, but in practice it is lower. Solving this problem would improve the overall communication quality.

@jhedberg
Member

jhedberg commented Mar 8, 2019

@xiaoliang314 even in an interference-free environment you cannot have 100% reception probability unless the receiver has 4 independent radios: 3 for scanning, one continuously on each advertising channel, and 1 for advertising (so that advertising doesn't interrupt scanning). With a single radio you cannot listen on every channel all the time, and since you cannot know when or on which channel a packet will be transmitted, there is always a chance of missing it.
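
To make the effect of repeated transmissions concrete, here is a minimal sketch of the usual back-of-the-envelope model. It assumes that each transmission is received independently with the same probability, which is a simplification and not something measured in this thread. The function name and the input values are hypothetical.

```c
/* Probability that at least one of `transmissions` copies is received,
 * assuming each copy is received independently with probability p_single.
 * The independence assumption is a simplification for illustration only.
 */
static double delivery_probability(double p_single,
				   unsigned int transmissions)
{
	double p_all_missed = 1.0;
	unsigned int i;

	for (i = 0; i < transmissions; i++) {
		p_all_missed *= (1.0 - p_single);
	}
	return 1.0 - p_all_missed;
}
```

For example, with an illustrative per-transmission reception probability of 0.5, three transmissions give 1 - 0.5^3 = 0.875. The result approaches 1 as the number of transmissions grows but never reaches it, which is consistent with the point that a single-radio receiver cannot guarantee 100% reception.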

Here's another Kconfig change you could try:

CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_JOB_PRIO=1

With that I was able to reduce the time from the host requesting the controller to advertise, to the controller actually starting advertising. It basically eliminated one scan window from the latency.

@xiaoliang314
Author

@jhedberg The configuration you provided is very useful. In my test, the communication quality improved greatly. However, the receiving node often hits a BUS FAULT error. I changed the scan interval to 15ms because 30ms did not work well.

***** BUS FAULT *****
  Precise data bus error
  BFAR Address: 0xc9f7ffc2
***** Hardware exception *****
Current thread ID = 0x20000a80
Faulting instruction address = 0x185a8
Fatal fault in thread 0x20000a80! Aborting.

@jhedberg
Member

jhedberg commented Mar 8, 2019

@xiaoliang314 does 0x185a8 resolve to some meaningful location in the code? (addr2line -e zephyr.elf 0x185a8)

First thing to check is if all threads are staying within bounds, i.e. that what you're seeing is not a stack overflow.

Also, you should look up which thread 0x20000a80 is. Doing e.g. "p (void *)0x20000a80" in gdb should show that. That would be the first thread of interest in the stack usage analysis.

@xiaoliang314
Author

xiaoliang314 commented Mar 8, 2019

@jhedberg I reproduced the problem under gdb. The thread is <tx_thread_data>.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00017a98 in _is_t1_higher_prio_than_t2 (t2=0x20000b7c <adv_thread_data>, t1=0x20000b7c <adv_thread_data>) at /home/ubuntu/zephyr/kernel/sched.c:95
95              if (t1->base.prio < t2->base.prio) {
(gdb) p (void*)0x20000a80
$1 = (void *) 0x20000a80 <tx_thread_data>
(gdb) p *(struct k_thread *)0x20000a80
$3 = {base = {{qnode_dlist = {{head = 0x0 <crc32_ieee>, next = 0x0 <crc32_ieee>}, {tail = 0x0 <crc32_ieee>, prev = 0x0 <crc32_ieee>}}, qnode_rb = {children = {0x0 <crc32_ieee>, 0x0 <crc32_ieee>}}}, 
    pended_on = 0x0 <crc32_ieee>, user_options = 0 '\000', thread_state = 8 '\b', {{prio = -9 '\367', sched_locked = 0 '\000'}, preempt = 247}, order_key = 0, swap_data = 0x0 <crc32_ieee>, timeout = {
      node = {{head = 0x0 <crc32_ieee>, next = 0x0 <crc32_ieee>}, {tail = 0x0 <crc32_ieee>, prev = 0x0 <crc32_ieee>}}, dticks = 0, fn = 0x0 <crc32_ieee>}}, caller_saved = {<No data fields>}, 
  callee_saved = {v1 = 11, v2 = 536912088, v3 = 536897824, v4 = 536912264, v5 = 0, v6 = 0, v7 = 0, v8 = 0, psp = 536897744}, init_data = 0x0 <crc32_ieee>, fn_abort = 0x0 <crc32_ieee>, errno_var = 0, 
  stack_info = {start = 536897184, size = 640}, resource_pool = 0x0 <crc32_ieee>, arch = {basepri = 0, swap_return_value = 4294967285}}
(gdb) 
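
As a rough aid for the stack-bounds check suggested earlier, the sketch below compares a saved stack pointer with the `stack_info` region from a dump like the one above. The struct and function names are hypothetical. The sketch assumes a full-descending stack (as on Cortex-M) and that `stack_info.start` marks the lowest usable address. The saved `psp` only reflects the depth at the last context switch, so an in-bounds result does not rule out a deeper excursion in between.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two stack_info fields shown in the dump. */
struct stack_region {
	uint32_t start; /* lowest address of the stack buffer */
	uint32_t size;  /* size of the stack buffer in bytes */
};

/* True if the saved stack pointer lies within [start, start + size]. */
static inline bool sp_in_bounds(const struct stack_region *s, uint32_t sp)
{
	return (sp >= s->start) && (sp <= s->start + s->size);
}

/* Bytes between the saved stack pointer and the top of the stack.
 * The stack grows downwards, so this is the depth in use at that moment.
 */
static inline uint32_t stack_bytes_in_use(const struct stack_region *s,
					  uint32_t sp)
{
	return (s->start + s->size) - sp;
}
```

Plugging in the values from the dump (start = 536897184, size = 640, psp = 536897744) gives an in-bounds pointer with 80 bytes in use at the time of the snapshot.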

@xiaoliang314
Author

I found the stack size option for the TX thread: CONFIG_BT_HCI_TX_STACK_SIZE. I will increase it and try again.

@xiaoliang314
Author

I set CONFIG_BT_HCI_TX_STACK_SIZE to 2048, but the problem still exists. I checked with gdb that the setting took effect:

stack_info = {start = 536897184, size = 2048}

@EddLeon

EddLeon commented Aug 8, 2019

@xiaoliang314, I hope you have fixed your problem by now. In my own experiments, I've seen that the Network layer keeps relaying the packet until its TTL reaches zero (even with the replay protection and message cache), which probably increases traffic (you have to enable the Network Layer log to see this). So maybe you could try lowering .default_ttl = 7 and disabling [*] Relay support if you're not using it.

I am having a similar problem with the packet reception ratio when multiple hops are required to deliver a packet, and I've tried some of the recommendations @jhedberg made in this thread. So far, I've seen the best results from increasing BT_MESH_ADV_BUF_COUNT, since in my case I get <err> bt_mesh_net: Out of relay buffers whenever I increase the traffic that needs to be relayed. So you can try this out!

Regards,
Ed

@xiaoliang314
Author

@EddLeon No solution; I have switched to a different chip. The most probable reason I found is that the controller does both sending and receiving in time slots. When the controller sends, it grabs a whole scan interval (30ms) but uses only about 3ms of it, and does nothing for the remaining 27ms. This is probably an important cause of message loss.
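
The slot usage described above can be expressed as a small calculation. The 30 ms slot and roughly 3 ms of actual transmit time are the figures reported in the comment; the function names are hypothetical, and the sketch simply assumes that the rest of the slot is unavailable for scanning, as observed there.

```c
#include <stdint.h>

/* Idle time left in a slot after the actual transmission. */
static inline uint32_t slot_idle_ms(uint32_t slot_ms, uint32_t used_ms)
{
	return (used_ms >= slot_ms) ? 0u : (slot_ms - used_ms);
}

/* Share of the slot that is actually used, in percent. */
static inline uint32_t slot_used_percent(uint32_t slot_ms, uint32_t used_ms)
{
	if (slot_ms == 0u) {
		return 0u; /* invalid input: avoid dividing by zero */
	}
	return (used_ms * 100u) / slot_ms;
}
```

For a 30 ms slot with 3 ms of transmission, this gives 27 ms idle and 10% utilization, which is the gap the comment identifies as a likely source of missed packets.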

@carlescufi
Member

@xiaoliang314 can you please retry your benchmarks using the new split controller? Make sure that CONFIG_BT_LL_SW_SPLIT=y is set in your build/zephyr/.config file.

@xiaoliang314
Author

@carlescufi Sorry, this test code no longer works with the new version, and I don't have the time to continue debugging it.

@vikrant8052
Contributor

So the conclusion here is that, for better performance (to address the issue described above), we have to set

CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_JOB_PRIO=1
CONFIG_BT_LL_SW_SPLIT=y

and, if necessary, increase the values of

BT_MESH_ADV_BUF_COUNT
CONFIG_BT_HCI_TX_STACK_SIZE

After building samples/boards/nrf52/mesh/onoff_level_lighting_vnd_app, I found these values in build/zephyr/.config:
CONFIG_BT_RX_BUF_COUNT=3
CONFIG_BT_HCI_TX_STACK_SIZE=640

@cvinayak
Contributor

cvinayak commented May 2, 2021

Closing, as there is no planned work related to this issue.

@cvinayak cvinayak closed this as completed May 2, 2021
9 participants