Ways to reduce Bluetooth Mesh message loss #13553

xiaoliang314 · 2019-02-20T08:41:21Z

    I created two Bluetooth Mesh nodes. One node sends a message to the other, which replies immediately on receipt. The next message is not sent until the reply arrives or a timeout expires.

    In my tests with nodes A and B, messages from A to B succeed 99% of the time, but B's responses reach A only 60% of the time. I have confirmed that B does transmit the response over the air.

    My nodes use nrf52_pca10040. I changed the advertising interval and scan window to 20ms and 15ms in mesh/adv.c because the default parameters performed worse. The code is based on the sample/mesh_demo example.

xiaoliang314 · 2019-02-20T08:45:59Z

I enabled the MESH / NET log output. In the failing cases, no reply advertising packet was received.

xiaoliang314 · 2019-02-20T09:28:50Z

My message is less than 8 bytes and not segmented.

jhedberg · 2019-02-20T14:20:46Z

What kind of network transmit do you have configured? Have you tried increasing the number of retransmissions using it?

jhedberg · 2019-02-20T14:25:48Z

FWIW, it's a known issue that there's room for improvement in the reliability of message reception, even though with mesh it can never be 100%.

@cvinayak @bluetooth-mdw FYI

xiaoliang314 · 2019-02-20T15:14:11Z

@jhedberg I didn't use retransmission, it is an unreliable message. I have just two nodes. Surprisingly, the success rate from A to B is 99%, but the success rate of B's response to A is only 60%.

I used an ammeter to observe the chip's radio activity. I found that the controller uses a scan interval time slot to send. However, after the TX is completed, the RX is not immediately turned on, but waits for the next scan interval time slot to be turned on. Is this one of the reasons?

jhedberg · 2019-02-20T18:30:47Z

> @jhedberg I didn't use retransmission, it is an unreliable message.

@xiaoliang314 usually unreliable messages are defined as those which don't solicit any response from the other side. It's unclear to me what this has to do with the Network Transmit state? It's usually a good idea to have at least a few transmissions, and that's also what the samples in the Zephyr tree do, i.e. they have something like the following in their bt_mesh_cfg_srv struct definition:

        /* 3 transmissions with 20ms interval */
        .net_transmit = BT_MESH_TRANSMIT(2, 20),

> after the TX is completed, the RX is not immediately turned on, but waits for the next scan interval time slot to be turned on. Is this one of the reasons?

@xiaoliang314 yes, that could be one of the reasons.
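For readers unfamiliar with the BT_MESH_TRANSMIT(2, 20) value quoted above, here is a minimal sketch of how such a Transmit state packs into one byte. It assumes the layout from the Mesh Profile specification (a 3-bit retransmit count in the low bits and 5-bit interval steps of 10 ms in the high bits). The helper names are hypothetical and are not Zephyr APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not Zephyr APIs) illustrating the assumed layout:
 * bits 0-2 = number of retransmissions, bits 3-7 = (interval_ms / 10) - 1.
 */
static uint8_t transmit_encode(unsigned int retransmits, unsigned int interval_ms)
{
	assert(retransmits <= 7);
	assert(interval_ms >= 10 && interval_ms <= 320 && interval_ms % 10 == 0);

	return (uint8_t)(retransmits | (((interval_ms / 10) - 1) << 3));
}

/* Total transmissions = the original transmission plus the retransmissions. */
static unsigned int transmit_total(uint8_t state)
{
	return (state & 0x07u) + 1;
}

/* Interval between transmissions, in milliseconds. */
static unsigned int transmit_interval_ms(uint8_t state)
{
	return ((state >> 3) + 1u) * 10u;
}
```

With a count of 2 and an interval of 20 ms, this yields three transmissions spaced 20 ms apart, which matches the "3 transmissions with 20ms interval" comment in the snippet.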

xiaoliang314 · 2019-02-21T01:38:00Z

My configuration is:

static struct bt_mesh_cfg_srv cfg_srv = {
	.relay = BT_MESH_RELAY_DISABLED,
	.beacon = BT_MESH_BEACON_ENABLED,
#if defined(CONFIG_BT_MESH_FRIEND)
	.frnd = BT_MESH_FRIEND_ENABLED,
#else
	.frnd = BT_MESH_FRIEND_NOT_SUPPORTED,
#endif
#if defined(CONFIG_BT_MESH_GATT_PROXY)
	.gatt_proxy = BT_MESH_GATT_PROXY_ENABLED,
#else
	.gatt_proxy = BT_MESH_GATT_PROXY_NOT_SUPPORTED,
#endif
	.default_ttl = 7,

	/* 3 transmissions with 20ms interval */
	.net_transmit = BT_MESH_TRANSMIT(2, 20),
	.relay_retransmit = BT_MESH_TRANSMIT(2, 20),
};

These options are enabled in menuconfig:

[*] Relay support
[ ] Support for Low Power features
[ ] Support for acting as a Friend Node
[*] Support for Configuration Client Model
[ ] Support for Health Client Model

I modified these parameters in mesh/adv.c:

/* Window and Interval are equal for continuous scanning */
#define MESH_SCAN_INTERVAL_MS 15
#define MESH_SCAN_WINDOW_MS   15
#define MESH_SCAN_INTERVAL    ADV_SCAN_UNIT(MESH_SCAN_INTERVAL_MS)
#define MESH_SCAN_WINDOW      ADV_SCAN_UNIT(MESH_SCAN_WINDOW_MS)

/* Pre-5.0 controllers enforce a minimum interval of 100ms
 * whereas 5.0+ controllers can go down to 20ms.
 */
#define ADV_INT_DEFAULT_MS 20
#define ADV_INT_FAST_MS    20

@jhedberg FYI
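As a side note on the adv.c excerpt above: the scan interval and window are handed to the controller in units of 0.625 ms, so the millisecond values get converted first. The sketch below shows that conversion under the assumption that ADV_SCAN_UNIT() is the usual `ms * 8 / 5` scaling; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert milliseconds to 0.625 ms units (1 ms = 8/5 units).
 * Assumption: this mirrors what ADV_SCAN_UNIT() does in mesh/adv.c.
 */
static uint16_t ms_to_adv_scan_units(unsigned int ms)
{
	assert(ms <= 10240); /* 0x4000 units is the largest scan interval */

	return (uint16_t)(ms * 8u / 5u);
}

/* Scanning is continuous when the window fills the whole interval. */
static int is_continuous_scan(uint16_t interval_units, uint16_t window_units)
{
	return window_units == interval_units;
}
```

Under this assumption the modified 15 ms setting becomes 24 units for both interval and window, so scanning stays continuous, just split into shorter slots than with the 30 ms value discussed later in the thread.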

WilliamGFish · 2019-03-04T11:43:04Z

On the surface, this looks like a buffer configuration issue. With the nodes acting as relays, they will be processing multiple repeated messages, which consumes precious resources.
Start by removing the Relay option and see what effect that has on reliability. I would then suggest looking at the mesh buffers.

Billy..

jhedberg · 2019-03-07T09:00:55Z

@xiaoliang314 a critical part of the configuration that can influence packet loss is the number of RX buffers. You didn't mention that. It's defined through CONFIG_BT_RX_BUF_COUNT.

jhedberg · 2019-03-07T09:05:08Z

I've reclassified this as a question/enhancement request, since we haven't identified any obvious thing that's broken with the current implementation. Bluetooth Mesh has an inherent property of being an unreliable transport, so there will always be some message loss. What needs to be focused on is how to minimize it. A lot can already be done with the current implementation by carefully fine-tuning the various configuration parameters. What will further improve matters are the HCI extensions for Mesh; however, those will appear in Zephyr 1.15 at the earliest.

xiaoliang314 · 2019-03-08T03:03:13Z

@jhedberg I don't think so. I put the two nodes in a closed environment and got the same result. The advertising message between them is only 3-6 per second. Even so, the probability of loss is relatively high, and it occurs only when the other node receives a message and replies immediately.

xiaoliang314 · 2019-03-08T03:06:37Z

I also manually triggered one round trip per second, and the probability of loss was the same.

jhedberg · 2019-03-08T06:51:33Z

> The advertising message between them is only 3-6 per second.

@xiaoliang314 in mesh terms that's quite a lot. The maximum allowed by the spec is 10 per second (actually 100 per 10 seconds, so you can have some bursts), however depending on your Network Transmit state and controller support the practical maximum may be as low as 3 per second. With the Zephyr controller and the default Network Transmit state that's present in most sample applications, one packet will take about 120ms to transmit.

If you have a logic analyzer, it'd be very helpful if you could enable CONFIG_BT_CTLR_DEBUG_PINS for both devices and see how their radio state timings compare with each other. Btw, the 30ms value for scanning should yield a better probability of reception than your modified 15ms value.

Adding @cvinayak as a second assignee since part of the improvement solution will be the vendor HCI extensions for mesh.
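To put the "about 120ms" per-packet figure above into numbers, here is a rough sketch. It assumes the time to send one packet can be modelled as one scan window of start-up latency plus (retransmits + 1) advertising events, each costing the advertising interval plus up to 10 ms of advertising delay. That model is an assumption for illustration, not something confirmed in this thread, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed model: scan window + (retransmits + 1) * (interval + 10 ms). */
static unsigned int adv_duration_ms(unsigned int scan_window_ms,
				    unsigned int retransmits,
				    unsigned int adv_int_ms)
{
	assert(retransmits <= 7);

	return scan_window_ms + (retransmits + 1) * (adv_int_ms + 10);
}

/* Whole packets that fit into one second when sent back to back. */
static unsigned int max_packets_per_second(unsigned int duration_ms)
{
	assert(duration_ms > 0);

	return 1000 / duration_ms;
}
```

With a 30 ms scan window and BT_MESH_TRANSMIT(2, 20) the model gives 30 + 3 * 30 = 120 ms, or roughly 8 packets per second at best. A round trip of 3-6 advertising messages per second therefore already uses a large share of the available air time under this model.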

jhedberg · 2019-03-08T06:54:03Z

> I used an ammeter to observe the chip's radio activity. I found that the controller uses a scan interval time slot to send. However, after the TX is completed, the RX is not immediately turned on, but waits for the next scan interval time slot to be turned on. Is this one of the reasons?

@xiaoliang314 that's something that @cvinayak will need to answer (I only know the controller internals at a fairly high level).

xiaoliang314 · 2019-03-08T07:21:04Z

@jhedberg Thanks for your answer, I know that the Bluetooth Mesh specification allows 100 messages to be transmitted per 10 seconds. But as a chip or module manufacturer, we should find the reason why this problem contradicts the theory, and improve our products. In an environment without RF interference we believe that their communication success rate should be 100%. But it is actually lower than this. If we solve this problem, it will improve the overall communication quality.

jhedberg · 2019-03-08T07:52:37Z

@xiaoliang314 even in an interference-free environment you cannot have 100% reception probability unless you have 4 independent radios on the receiver: 3 for the advertising channels, each doing continuous scanning on its own channel, and 1 for advertising (so that the advertising doesn't interrupt scanning). With a single radio you cannot be listening on every channel all the time, and since you cannot know on which channel and when a packet will be transmitted, there will always be a chance of missing that packet.

Here's another Kconfig change you could try:

CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_JOB_PRIO=1

With that I was able to reduce the time from the host requesting the controller to advertise, to the controller actually starting advertising. It basically eliminated one scan window from the latency.

xiaoliang314 · 2019-03-08T08:52:39Z

@jhedberg The configuration you provided is very useful. In my tests the communication quality improved greatly. However, the receiving node often hits a BUS FAULT. I changed the scan interval to 15ms because 30ms did not work well for me.

***** BUS FAULT *****
  Precise data bus error
  BFAR Address: 0xc9f7ffc2
***** Hardware exception *****
Current thread ID = 0x20000a80
Faulting instruction address = 0x185a8
Fatal fault in thread 0x20000a80! Aborting.

jhedberg · 2019-03-08T09:09:21Z

@xiaoliang314 does 0x185a8 resolve to some meaningful location in the code? (addr2line -e zephyr.elf 0x185a8)

First thing to check is if all threads are staying within bounds, i.e. that what you're seeing is not a stack overflow.

Also, you should look up which thread 0x20000a80 is. Doing e.g. "p (void *)0x20000a80" in gdb should show that. That would be the first thread of interest in the stack usage analysis.

xiaoliang314 · 2019-03-08T09:40:53Z

@jhedberg I reproduced the problem under gdb; the thread is <tx_thread_data>.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00017a98 in _is_t1_higher_prio_than_t2 (t2=0x20000b7c <adv_thread_data>, t1=0x20000b7c <adv_thread_data>) at /home/ubuntu/zephyr/kernel/sched.c:95
95              if (t1->base.prio < t2->base.prio) {
(gdb) p (void*)0x20000a80
$1 = (void *) 0x20000a80 <tx_thread_data>
(gdb) p *(struct k_thread *)0x20000a80
$3 = {base = {{qnode_dlist = {{head = 0x0 <crc32_ieee>, next = 0x0 <crc32_ieee>}, {tail = 0x0 <crc32_ieee>, prev = 0x0 <crc32_ieee>}}, qnode_rb = {children = {0x0 <crc32_ieee>, 0x0 <crc32_ieee>}}}, 
    pended_on = 0x0 <crc32_ieee>, user_options = 0 '\000', thread_state = 8 '\b', {{prio = -9 '\367', sched_locked = 0 '\000'}, preempt = 247}, order_key = 0, swap_data = 0x0 <crc32_ieee>, timeout = {
      node = {{head = 0x0 <crc32_ieee>, next = 0x0 <crc32_ieee>}, {tail = 0x0 <crc32_ieee>, prev = 0x0 <crc32_ieee>}}, dticks = 0, fn = 0x0 <crc32_ieee>}}, caller_saved = {<No data fields>}, 
  callee_saved = {v1 = 11, v2 = 536912088, v3 = 536897824, v4 = 536912264, v5 = 0, v6 = 0, v7 = 0, v8 = 0, psp = 536897744}, init_data = 0x0 <crc32_ieee>, fn_abort = 0x0 <crc32_ieee>, errno_var = 0, 
  stack_info = {start = 536897184, size = 640}, resource_pool = 0x0 <crc32_ieee>, arch = {basepri = 0, swap_return_value = 4294967285}}
(gdb)
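As a quick sanity check on the dump above, the saved stack pointer can be compared against the thread's stack_info. The sketch below uses the decimal values printed by gdb (start, size and callee_saved.psp); the struct and function names are hypothetical. Note that this only checks the pointer saved at the last context switch, so it cannot rule out a deeper excursion while the thread was running.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the fields of interest from the gdb dump. */
struct stack_view {
	uint32_t start; /* stack_info.start (lowest address) */
	uint32_t size;  /* stack_info.size in bytes */
	uint32_t psp;   /* callee_saved.psp at the last context switch */
};

/* Values copied from the dump of tx_thread_data. */
static const struct stack_view tx_dump = {
	.start = 536897184u,
	.size = 640u,
	.psp = 536897744u,
};

/* The stack grows downwards, so a valid PSP lies in [start, start + size]. */
static int psp_in_bounds(const struct stack_view *s)
{
	return s->psp >= s->start && s->psp <= s->start + s->size;
}

/* Bytes between the saved PSP and the top of the stack. */
static uint32_t stack_bytes_in_use(const struct stack_view *s)
{
	assert(psp_in_bounds(s));

	return (s->start + s->size) - s->psp;
}

/* Bytes still free below the saved PSP. */
static uint32_t stack_headroom(const struct stack_view *s)
{
	assert(psp_in_bounds(s));

	return s->psp - s->start;
}
```

For these values the saved PSP sits 80 bytes below the top of the 640-byte stack, leaving 560 bytes of headroom at that instant.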

xiaoliang314 · 2019-03-08T10:02:55Z

I found the stack size option for the TX thread: CONFIG_BT_HCI_TX_STACK_SIZE. I will increase it and try again.

xiaoliang314 · 2019-03-08T10:16:49Z

I set CONFIG_BT_HCI_TX_STACK_SIZE to 2048, but the problem still exists. I checked with gdb that the new size took effect:

stack_info = {start = 536897184, size = 2048}

EddLeon · 2019-08-08T09:10:07Z

@xiaoliang314, I hope you have fixed your problem by now. From my own experiments, I've seen that the Network layer keeps relaying a packet until its TTL has reached zero (even with the replay protection and message cache), probably causing increased traffic. (You have to enable the Network layer log to see this.) So maybe you could try lowering .default_ttl = 7 and disabling [*] Relay support if you're not using it.

I am having a similar problem with packet reception ratio when multiple hops are required to deliver a packet, and I've tried some of the recommendations @jhedberg made in this thread. So far, I've seen the best results from increasing BT_MESH_ADV_BUF_COUNT, since in my case I get <err> bt_mesh_net: Out of relay buffers whenever I increase the traffic that needs to be relayed. So you can try this out!

Regards,
Ed

xiaoliang314 · 2019-08-09T01:38:36Z

@EddLeon No solution; I have switched to a different chip. The most probable cause I found is that the controller both sends and receives in time slots. When the controller sends, it grabs a whole scan interval (30ms) but uses only about 3ms of it, and does nothing for the remaining 27ms. This is likely an important cause of message loss.

carlescufi · 2019-08-15T11:28:55Z

@xiaoliang314 can you please retry your benchmarks using the new split controller? Make sure that CONFIG_BT_LL_SW_SPLIT=y is set in your build/zephyr/.config file.

xiaoliang314 · 2019-09-16T05:33:24Z

@carlescufi Sorry, this test code no longer works with the new version, and I don't have time to continue debugging it.

vikrant8052 · 2019-11-18T13:14:18Z

So the conclusion here is that, for better performance (to address the issue described), we have to set

CONFIG_BT_CTLR_ADVANCED_FEATURES=y
CONFIG_BT_CTLR_JOB_PRIO=1
CONFIG_BT_LL_SW_SPLIT=y

and, if necessary, increase the values of

BT_MESH_ADV_BUF_COUNT
CONFIG_BT_HCI_TX_STACK_SIZE

After building samples/boards/nrf52/mesh/onoff_level_lighting_vnd_app, I found these values in build/zephyr/.config:
CONFIG_BT_RX_BUF_COUNT=3
CONFIG_BT_HCI_TX_STACK_SIZE=640

cvinayak · 2021-05-02T13:53:26Z

Closing, as there is no planned work related to this issue.

xiaoliang314 added the bug label Feb 20, 2019

nashif assigned jhedberg Feb 21, 2019

nashif added the priority: medium label Feb 21, 2019

jhedberg added the Enhancement and question labels and removed the bug and priority: medium labels Mar 7, 2019

jhedberg changed the title ~~nrf52_pca10040 Bluetooth mesh message loss rate is high~~ Ways to reduce Bluetooth Mesh message loss Mar 7, 2019

jhedberg assigned cvinayak Mar 8, 2019

laperie added the area: Bluetooth label Mar 11, 2019

carlescufi added the area: Bluetooth Mesh label Aug 15, 2019

carlescufi removed the question label May 14, 2020

cvinayak closed this as completed May 2, 2021
