-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Observing eventually stops on City Hub and Technic Hub #1096
[Bug] Observing eventually stops on City Hub and Technic Hub #1096
Comments
A few people at LEGO World ran into this as well. It's been a while, but I think forcing observe to stop/start works around this issue. |
Updated reproduction code to match latest, and also make it consistent with #1278 in case the two are related. So the case with little data here eventually stops working after ~15 minutes (but doesn't crash; restarting City Hub gets it to resume) and the case with more data in #1278 crashes within 5 minutes. |
Is this statement still true since you updated the program or does it change to violet? |
Using builds from commit bdf20a81, I've had this running for an hour without issue on both City and Technic hubs (minimal other Bluetooth devices around - a couple of laptops that occasionally broadcast). |
Well, sometime between the 1 and 2 hour mark the lights went violet on both hubs. There was no more Bluetooth traffic on the Technic hub (it is the only one I have connected to a logic analyzer), so it looks like an issue with the Bluetooth chip. |
Violet.
Since simply restarting the program (not the hub) fixes it, could we restart observing under some conditions, or maybe send something else to wake it up? |
I happen to be one of them! I was using the Mindstorms hub for the robot arm, and using pybricks BETA to broadcast a signal to the loader unit that would open if the robot arm was in the right position.
And this is the code for sending the signal on the robotarm:
I noticed after a week of LEGO world that it would crash in both the true and false statement. It would only crash when recieving signals by the Mindstorms hub. Hope this gives you more information for a potential fix! |
Thanks @Noahdlange for chiming in. I was referring to you but could not find your GitHub handle 😄 Can you elaborate what you mean by "crash"? There was also #1278 which was recently fixed, although your example should probably not have run into that, as your case is only broadcasting |
When the Technic hub "crashed" the technic hub would stay on but would be stuck in the True or False statement. Only by restarting the program (so pressing the green button on the technic hub twice) it would reconnect to the Mindstorms hub. |
Thanks for clarifying! That does sound like it's the same issue as in this topic, so good to know that this is a consistent / universal issue. |
There seems to be a faster way to reproduce this (or reproduce another bug, that is).
This works normally for a few tries, but sometimes, the Technic Hub remains in the violet state even after the Prime Hub starts sending again. I'm able to do this consistently under a minute. Here is a video where it happens after 25 seconds or so. The light colors aren't easy to see in this video but you can tell that it reaches the violet state when it stops blinking. ble_violet_faster.mp4 |
Some additional debugging. We can make the Technic Hub conveniently keep a record of its last received data: diff --git a/pybricks/common/pb_type_ble.c b/pybricks/common/pb_type_ble.c
index f728885bf..44d5d316d 100644
--- a/pybricks/common/pb_type_ble.c
+++ b/pybricks/common/pb_type_ble.c
@@ -97,6 +97,8 @@ STATIC observed_data_t *lookup_observed_data(uint8_t channel) {
return NULL;
}
+#include <pbsys/program_load.h>
+
/**
* Handles observe event from the bluetooth driver.
*
@@ -113,6 +115,9 @@ STATIC void handle_observe_event(pbdrv_bluetooth_ad_type_t event_type, const uin
// here but due to a Bluetooth firmware bug on city hub, we have to allow other
// advertisement types. This would filter out broadcasts from the experimental
// feature in official LEGO Robot Inventor firmware.
+
+ pbsys_program_load_set_user_data(0, data, length);
+
if (length >= 5 && data[1] == MFG_SPECIFIC && pbio_get_uint16_le(&data[2]) == LEGO_CID) {
uint8_t channel = data[4]; And update the Technic Hub program to print it: hub = TechnicHub(observe_channels=[12])
print(hub.system.storage(0, read=32)) # < ---- insert this And remembering to disconnect since we want to debug the non-connected case. Since the hub doesn't crash, we can just stop the program, connect and print to see what this reads. This may of course pick up other data so maybe some additional filtering needs to be done or cache multiple values to get something useful. The expected messages would be
But I do see something different in some cases (still testing if this happens consistently in the "bad" case above). Sometimes it is garbage, but sometimes it is a bad version of the above. EDIT: As-is, maybe this isn't particularly definitive --- sometimes the last message before it got to a bad state was valid, so maybe we need to do this at a lower level. |
I wonder if this quicker (< 60 sec) test just makes the likelihood of a bad/corrupted/aborted message from the Prime Hub side much higher because we might stop the program mid-broadcast, and the Technic Hub currently isn't expecting this type of message, and fails to process it. So maybe this eventually also happens randomly in ambient conditions after a long time, and we'd just need to discard such packets if we can find out what they look like. |
💡 This looks to be also reproducible while the Technic Hub is connected to Pybricks Code. Use the same approach as above to trigger it in less than a minute, so the light eventually staying violet. (not just occasionally violet, as it always does when connected to Pybricks Code under normal operations). When it gets in this state, the program is still printing data and can still be stopped normally from Pybricks Code, so maybe the Bluetooth chip isn't in such a bad state and maybe our code is just waiting for something that never happens? Anyway, this should make debugging a lot easier 😄 |
We don't seem to be getting the Are there maybe some conditions where observing automatically stops, but we just don't know/handle that it did? Likewise, can we query for some kind of |
Getting a bit closer. There is a way to recover from the bad state as hypothesized above. Below the BLE cleanup is added to the BLE init as a test, to have a handle to stop and restart observing by just initializing the hub again. diff --git a/pybricks/common/pb_type_ble.c b/pybricks/common/pb_type_ble.c
index f728885bf..ab632b28c 100644
--- a/pybricks/common/pb_type_ble.c
+++ b/pybricks/common/pb_type_ble.c
@@ -506,6 +506,10 @@ STATIC MP_DEFINE_CONST_OBJ_TYPE(pb_type_BLE,
* @throws ValueError If either parameter contains an out of range channel number.
*/
mp_obj_t pb_type_BLE_new(mp_obj_t broadcast_channel_in, mp_obj_t observe_channels_in) {
+
+ pb_type_BLE_cleanup();
+ mp_hal_delay_ms(1000);
+
// making the assumption that this is only called once before each pb_type_BLE_cleanup()
assert(observed_data == NULL); And adapt the Technic Hub program slightly to have a hook to reset, in this case with the force sensor: from pybricks.hubs import TechnicHub
from pybricks.parameters import Color, Port
from pybricks.tools import wait, StopWatch
from pybricks.pupdevices import ForceSensor
from pybricks.experimental import hello_world
hub = TechnicHub(observe_channels=[12])
sensor = ForceSensor(Port.A)
watch = StopWatch()
while True:
data = hub.ble.observe(12)
if data is None:
# bad state
hub.light.on(Color.VIOLET)
else:
if data == "red":
hub.light.on(Color.RED)
else:
hub.light.on(Color.GREEN)
wait(10)
# On sensor press, reset call BLE reset, then start observing again.
# If it was stuck on violet, this makes it blink red/green again.
if sensor.pressed():
hub = TechnicHub(observe_channels=[12])
while sensor.pressed():
wait(10) |
What remains then seems to reliably detect whether we are in this state. If we can't tell from from the chip, maybe we could stop+restart observing if no advertising data has been detected for N seconds? If so, choosing N=1 second may be reasonable, since it would only be used when observing is actually used in a program anyway. And in those cases, it's nice if it is responsive. For applications with continuous data broadcasting, there won't be any resets until it is actually needed this way either, since those applications typically send data more than once a second. |
Any ideas @dlech on if and how we should implement this? I was thinking to release RC1 today without this last remaining fix. |
💡 Restarting the Prime Hub program to trigger this bug on the Technic Hub directly correlates to GAP_deviceInit(GAP_PROFILE_PERIPHERAL | GAP_PROFILE_CENTRAL, maxScanResponses, NULL, NULL, 0); It always happens at ~ Could this be because the Prime Hub is broadcasting from a new Bluetooth address every time? So maybe this is getting at the underlying issue. Scanning stops if the scan que is full, and this is perhaps why stopping and restarting the scan clears it. Some BT chips also allow erasing this without stopping the scan. Does ours? Maybe this is also related to why sometimes the Technic Hub won't find the remote in certain conditions for some people. Edit: TI forums suggest to
|
Possibilities to implement:
I'm beginning to see what the This might also help with the remote in the long run. This may be too invasive for this release, so I think we can do the workaround approach in |
The workaround is implemented in pybricks/pybricks-micropython@533ace0. |
I think a better workaround would be to just restart observing every 10 seconds or so in the background. It should hopefully be as simple as saying "if |
I agree that a background process is a better solution as mentioned above, but adding such a process this close to the release did not seem like the best timing. When we have that process, we shouldn't arbitrarily reset with a timer, though. That would mean up to 10 seconds of no data, and resetting very often even when it is receiving data just fine. In a typical continuous broadcast scenario, the current approach never resets until needed. So perhaps combined to |
FWIW, no new process is needed. We can just use the one already in the driver. Also, restarting observing takes significantly less that 100ms so users would likely never notice. If it takes at least 25 seconds to trigger the issue, then resetting every 10 seconds should mean that the issue never happens in the first place and therefore we would never see a 10 second span without data. No objection to waiting for more testing though. |
Sounds good. If you feel like this is not that much of an invasive change, please do feel free to make a PR and we'll get it in before the release. Non-invasive, as in, won't affect anything outside of programs that use broadcast/observe. I mainly wanted to avoid potentially breaking changes to normal download and connecting to the remote. |
If we do that, it could potentially be even simpler, and with less waiting on Bluetooth chip:
We shouldn't have to send stop commands this way, I think? |
The current workaround should work well enough for this release, I think. This test program has been running well for hours now. We can release the proper fix in the driver as part of the next beta firmware. Due to a new upcoming editor feature, I think multi-tasking is going to get a lot more testing fairly soon. I kind of suspect there will still be some issues to resolve when lots of multitasking is being used, and this is probably one of them. |
I can confirm that I have this kind of issue without using the "observe" function but many multitasks. I'm still testing but if it can help I can share the code I wrote. |
Yes, feel free to open a new issue for multitasking issues. Thanks! |
I will try to improve my code and open a multitasking issue then. Thanks for your answer. |
Another interesting observation. The observed data messages do not just stop altogether suddenly. It is per-remote device. So a City/Technic hub can still be receiving data from one hub A but stop receiving data from hub B. |
A reasonable suggestion, but I don't think it is simpler to implement at this point. We changed the scan time to infinite to facilitate being able to cancel scanning for the remote control. So we would have to add a bunch of code to change various config parameters when starting and stopping scanning for the remote or change how that code works as well. We also currently have really poor support for queuing an action in response to an event. Namely #1145. So this would need to be fixed first as well. |
This could be consistent with the N scannable devices theory above. In this theory, when full, it will only still observe new data from the same devices. What's interesting then is what could trigger it to consider it a new device on its list after some time. This may be something we could log somehow. |
Maybe we can query the 'final' scan results when this occurs, to see what made it consider the same hub appear as a new device on this theoretical finite list. Naturally we won't see the final one because it's not on the list, but we may see the same broadcasting hub a few times on the list as distinct devices. |
The Bluetooth chip on the City and Technic hubs will stop receiving advertisements after a while when observing. It isn't clear why this is happening, but it seems to that there is a bug in the Bluetooth chip firmware that causes it to start filtering out advertisements from an individual advertiser after a while (i.e. it can still be receiving advertisements from one device but stops receiving them from another). Eventually it will stop receiving advertisements from all devices. To work around this, we restart the observation process every 10 seconds. In testing, we were never able to trigger the bug in less than 25 seconds or so. Fixes: pybricks/support#1096
potential fix available for testing at pybricks/pybricks-micropython#216 (comment) |
The Bluetooth chip on the City and Technic hubs will stop receiving advertisements after a while when observing. It isn't clear why this is happening, but it seems to that there is a bug in the Bluetooth chip firmware that causes it to start filtering out advertisements from an individual advertiser after a while (i.e. it can still be receiving advertisements from one device but stops receiving them from another). Eventually it will stop receiving advertisements from all devices. To work around this, we restart the observation process every 10 seconds. In testing, we were never able to trigger the bug in less than 25 seconds or so. Therefore, 10 seconds should be more than enough to ensure that we never miss an advertisement. Fixes: pybricks/support#1096
ran it overnight and still green this afternoon |
Could we expand pybricks/pybricks-micropython#216 to also restart scanning during the scan-and-connect task? In order to solve the connectivity to the remote (insert link later). |
Or maybe via a variation of this. As for the difficulty pointed out above, maybe we can let the scan duration just be selectable/cancellable in multiples of the time unit as we'd pick for the ever restarting scan duration, say 3 or 5 seconds. This way we wouldn't have to change the user api, we'd just round the input argument to something technically reasonable. |
I would handle them separately. It's not pretty, but the same sort of restart could be added to scan and connect like this: diff --git a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
index 7cc20006..fea856ad 100644
--- a/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
+++ b/lib/pbio/drv/bluetooth/bluetooth_stm32_cc2640.c
@@ -485,8 +485,19 @@ static PT_THREAD(scan_and_connect_task(struct pt *pt, pbio_task_t *task)) {
}
}
- // start scanning
+ /* TODO: start restart timer */
+ // skip cancel since we haven't started yet
+ goto start_scanning;
+
+restart_scanning:
+ PT_WAIT_WHILE(pt, write_xfer_size);
+ GAP_DeviceDiscoveryCancel();
+ PT_WAIT_UNTIL(pt, hci_command_status);
+
+ PT_WAIT_UNTIL(pt, device_discovery_done);
+
+start_scanning:
PT_WAIT_WHILE(pt, write_xfer_size);
GAP_DeviceDiscoveryRequest(GAP_DEVICE_DISCOVERY_MODE_ALL, 1, GAP_FILTER_POLICY_SCAN_ANY_CONNECT_ANY);
PT_WAIT_UNTIL(pt, hci_command_status);
@@ -508,6 +519,10 @@ try_again:
if (task->cancel) {
goto cancel_discovery;
}
+ if (/* TODO: test restart timer expired */) {
+ // TODO: reset restart timer
+ goto restart_scanning;
+ }
advertising_data_received;
})); |
I did try that while looking into this issue. It made things worse (I don't remember exactly what the side effects were - possibly broke other stuff). |
Describe the bug
A program like this eventually stops working on City Hub / Technic Hub. The program keeps running but data is not received or updated (the light goes violet).
Sometimes it takes a few minutes, sometimes 20 minutes. This has been tried in several hubs and in several conditions (with and without lots of other devices around).
The text was updated successfully, but these errors were encountered: