Fix issue where switchbot stops responding due to "disc" status #39
Commenting line 283 out to avoid the issue where switchbot stops responding until HA restart. If the switchbot is in disc state, it tries to read a response but it will never get one. By removing disc from the while loop, it will fall into BTLEDisconnectError exception and retry properly the next time around, without any deadlocks.
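The difference between the two loop conditions can be sketched with a small simulation. This is not the pyswitchbot or bluepy code: `wait_for_connection`, `DisconnectError`, and `NoMoreResponses` are hypothetical names, and the state strings `"tryconn"` and `"scan"` are assumed here only for illustration. An exhausted response list stands in for a helper that will never send another status message.

```python
class DisconnectError(Exception):
    """Hypothetical stand-in for bluepy's BTLEDisconnectError."""


class NoMoreResponses(Exception):
    """Raised by this simulation where a real helper would block forever."""


def wait_for_connection(responses, waiting_states):
    """Read status responses until the state leaves `waiting_states`.

    `responses` is a finite list of state strings; running out of
    responses models a read that would never return.
    """
    it = iter(responses)
    state = next(it)
    while state in waiting_states:
        try:
            state = next(it)
        except StopIteration:
            # A real blocking read would hang here instead of raising.
            raise NoMoreResponses(f"stuck waiting in state {state!r}") from None
    if state != "conn":
        # Leaving the loop in any state other than "conn" is a failure
        # that the caller can catch and retry.
        raise DisconnectError(f"failed to connect, state={state!r}")
    return state


def outcome(responses, waiting_states):
    """Return the final state, or the name of the exception raised."""
    try:
        return wait_for_connection(responses, waiting_states)
    except (DisconnectError, NoMoreResponses) as exc:
        return type(exc).__name__


# With "disc" treated as a state to wait through, the loop never exits.
print(outcome(["tryconn", "disc"], {"tryconn", "scan", "disc"}))
# Without "disc", the loop exits and the disconnect surfaces as an error.
print(outcome(["tryconn", "disc"], {"tryconn", "scan"}))
```

In the second call the disconnect is reported to the caller, which is the behaviour this PR relies on to retry on the next attempt.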
@RenierM26 any comment? |
Hi there, this will cause a lot of retries if the user has multiple devices. The bluepy helper returns a disconnect after any operation finishes, which is usually an issue if you have 3+ devices. A disconnect could also happen before the scan finishes. (The bluepy library does a lot of connects/disconnects.) The connect function also has a built-in timeout, so it won't be stuck waiting for a disconnect (the default is close to a minute, I can't remember the exact time). It would probably be better just to call the connect function with a shorter timeout (the code is already there). The actual problem is most likely related to something else, and this would just be a workaround. Just as a side note: I have a Raspberry Pi 4 with the latest firmware (manually updated) that has been running for 2.5 months without any issues. |
I agree this is a workaround, but for all the users who have had issues since December (e.g. home-assistant/core#61535) this workaround will be really useful, as it is really frustrating having to restart HA every time switchbot stops working. In my case that happens only every 3-4 days (I think that is because my switchbot and my HA server are really close), but some people mention they have to restart every day, which is painful. |
@RenierM26 I also agree this is a workaround. But I have been using Switchbot since this week, and it hasn't been stable for a single day. I'm also happy to test any other fixes, but I agree with what @heisenberg2980 says. Can we merge this PR in the meantime?
Which firmware do you mean? The Switchbot curtain firmware? My curtains are running on v3.1. |
Hi @pascalwinters, I mean the Raspberry Pi itself. There were a few Bluetooth fixes in the firmware that weren't included in Hass OS at that time (October/November, I think). My DEV and main Pi both only have the Switchbot Bluetooth integration running (no other Bluetooth integrations). Are you running any other Bluetooth integrations in addition to Switchbot? That might explain the lockups. |
Hi @RenierM26, no, I bought a Bluetooth dongle specifically for this integration, so it's only used for the Switchbot integration. |
Thank you for the feedback. I have no objections to this pull request, just trying to find the root cause. |
Do you perhaps have the make and model of the dongle? |
I agree, the root cause is important to find. Will it help if there is more debug logging added, so we can enable it and provide you with more info?
@Danielhiversen Can you merge this PR please? |
TP-Link UB400 in combination with an Odroid N2+ |
@RenierM26 I also have a Raspberry Pi 4 with the latest version of HA Core and Supervisor, and this is the only integration using Bluetooth, but I still had this issue with my switchbot every 3-4 days until I removed the line proposed in this PR. Not sure why yours is not failing; maybe you could try moving your switchbot far away from your Raspberry Pi to see if that forces the issue to occur. |
Hi @heisenberg2980, Thanks for the info! I'm able to replicate the issue by moving it out of range. I'm going to test a timeout on the connect function and see if this solves the problem. |
It’s very good news that you can reproduce it. When do you expect to be able to test this and create a PR? |
Looks like a timeout won't be a practical fix. There are more methods that need to be overridden before this will work. Looks like this pull request is the best fix for now. |
Perfect, let's merge it, and if we find a better solution in the future we will implement it. @Danielhiversen can you merge this PR? |
Ok. Thanks for your time to test it. I appreciate it. |
Hi all, I found the reason why the timeout isn't working. I have also added a 40-second timeout in case the device is not reachable for whatever reason, which should prevent the lockup issue. Mind testing #40? Never mind, it looks like PR 40 needs some additional bug fixing. |
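As a rough illustration of the timeout idea mentioned above (not the code in #40), a wait loop can be bounded by a deadline so that an unreachable device results in a failed attempt rather than a lockup. The function name `wait_with_deadline` and its parameters are hypothetical; only the 40-second figure comes from the comment.

```python
import time


def wait_with_deadline(poll, timeout=40.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll for a connected state until `timeout` seconds have elapsed.

    `poll` is a hypothetical callable returning the current state string.
    `clock` and `sleep` are injectable so the loop can be exercised
    without really waiting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if poll() == "conn":
            return True
        sleep(interval)
    # Deadline passed without a connection: give up instead of blocking.
    return False
```

A caller could treat a `False` result like any other failed connection attempt and retry later.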
Hi @pascalwinters and @heisenberg2980, Made a couple of improvements to the handling of disc states. Mind testing #40 ? |
Hi @RenierM26, I’m happy to test. I'm running the integration on your branch. First impression looks good; sometimes the delay is bigger because of this warning: I see this warning frequently, and I'm not sure if it's a good sign. |
Happy to test it, but I haven't tested a PR before. Are there any instructions or a tutorial I can follow? Do I need to disable the original integration and download the branch to the custom component folder? |
Yes, you can copy the switchbot integration from the core repository to the custom component directory and replace the manifest content with this:
This will automatically override the core integration until you delete the custom switchbot component. |
@pascalwinters when you say "copy the switchbot integration from the core repository" do you mean the original version of the integration (master branch) or the PR (pyswitchbot_timeout branch)? Also where is the manifest file? |
I mean this repository: |
Thanks @pascalwinters, testing it now |
Nice. Good luck. |
Bluepy doesn't handle the "disc" status all that well. It captures any disconnect from any device (disconnect is the last stage of any connection to any device). I have added warning logging when this happens to aid in troubleshooting. I could change it to a lower level later on to prevent log spamming (debug, perhaps). Connection failures should be handled a lot better. I think the btle scanning function in the bluepy library might ultimately be the root cause (it has aggressive retries if the helper drops a connection before the scan completes, which kind of leads to a controlled race condition). You can connect to more than one device at a time, but the last bluepy release doesn't handle it all that well. Sorry for the long explanation. |
@RenierM26 Thank you for explaining. For me it's stable since the beginning of using this fix. Two remarks/questions:
|
After 24 hours of testing, the fix looks promising, with no lockups yet, but there is something odd I would like to share: my switchbot seems to take much longer on average to respond than before, sometimes taking up to 30 seconds to switch, which very rarely happened before. I don't have exact numbers, but I would say it now takes 20-30 seconds 50% of the time (and 5-10 seconds the other 50%), whereas before it usually took less than 10 seconds (taking over 20 seconds just 10-15% of the time). Obviously this could be due to external factors or interference and not related to this fix, but I wanted to mention it in case @pascalwinters or someone else is also experiencing it. BTW, even with this performance issue, if I had to choose between now and before the fix I would choose the fix; it is always better to have a response (even if it is a bit slow) than a lockup that requires an HA restart. |
@heisenberg2980 I have the same experience. Sometimes it takes more time to respond. I think it's waiting for a timeout or something similar? Maybe @RenierM26 can explain the delays? @RenierM26 Would it be an idea to merge this and make a PR for HA (I can do that if you want) so it is fixed for all HA users in the next release? We need to make progress, because the beta is almost here. |
@pascalwinters Out of curiosity, how did you update the requirements line of the manifest to point to the PR? is there any link/document to understand the logic? |
|
@RenierM26 @pascalwinters considering #40 looks promising, unless you have any objection I will close this PR |
Seems good to me. |
Closing this PR as #40 seems to be a better solution |