-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spurious "manual change" #51
Comments
Hi Emily, thanks for the detailed report. What you are seeing is the sad reality when working with the Hue lights API. Individual bulbs behave strange from time to time and have in inaccuracy of 3%... Could you please make sure that the firmware of all lights and the bridge is up to date? You can check this inside the official hue app. If this is the case our only possible fix is to increase the accuracy limit as you described from 2 to 3... |
I have automatic updates on and I just confirmed everything is up-to-date (android app and API agree):
That sounds a bit odd to me... How do you mean inaccurate? Like it doesn't take a write? Or it forgets a write after some time? Or it just writes the wrong data? Are we sure 3 will be the right value? Right now I'm basing this of a sample size of one... I will setup a logging script on my desktop to periodically (0.5s) scrape and record the light state over the bridge API and see if I can get more detailed insight on what transitions the light makes when this occurs. |
The closest is: "It forgets the write after some time". But this is really not reliable. Sometimes the values match perfectly, sometimes they deviate after some time, sometimes imitatively... The magic value "3" has been there from the beginning of Kelvin (see 5ce86f3#diff-034d51f7c28f2cfd686dbd3fb8375706R41 from the initial import). After Philips updated the bulb firmware it got much better and I switched to "2"... This worked for a long time... until you reported this error. :-) I am really hesitating to move back to "3" because it will make the scene detection fuzzy... but this (at least to my knowledge) is the only compromise we could make... Any ideas are welcome. |
Another data point:
Suggesting a threshold of 4 for colorTemperature here: Line 199 in e7c23ff
And a bump of the threshold for XY comparison. I know the light is in XY mode but the temperature values are written and read too AFAICT. |
Massive data dump incoming! Kelvin logs (edited for clarity) timestamps approx +2 seconds ahead of my logs below:
Unique state transitions from Hue bridge (21:13:24 start of log, not an actual transition)
Eyeballing the logs I see no spurious changes between writes. But I do see some changes shortly following a write. These two timestamps are interesting:
I will attempt to intersperse my logs with kelvin around this time after adjusting kelvin times -2 s to align the time lines:
Not sure what to make of that due to the low sampling rate of 0.5s on my logs. Lets look at three successive successful updates (again adjusting timestamps to align and interspersing):
Looking at the above it seems as if hue first takes the write into it's own cache and when queried it returns the written value until it communicates it to the bulb and gets it's opinion back of what it has actually stored. The bulb probably doesn't support so many levels of colour/ct as the API can specify so it truncates the value to what is actually on the light. There is probably a non-linearity here because this seems to trigger the manually changed threshold more frequently in the low CTs (seems like every evening). That's is my theory for what is going on. This procedure seems to take between 1-8 seconds from my limited data set. Based on the variance I don't think this is intentional but probably a function of how many lights there are and how busy the hub is or maybe even just the lights being lazy. Based on the variance and my observations above I think that the sleep here is probably causing the following check to read 50-50 the cached value or the truncated "true" value. Line 177 in e7c23ff
How do you feel about after writing a light, wait for say 10-ish seconds and then read back the light state and if the read value is a plausible truncation of the set value, just update the target value for the purpose of detecting a manual change? The manual change detection at that point wouldn't need thresholds as they moved to the "plausible truncation" logic. I guess one could try all possible CT values with a script and see what they truncate to to determine what is plausible and also confirm this theory. Sorry if this is a bit rambly, I'm writing as I go. Side note: Why does kelvin use xy mode when only ct is supported by kelvin? |
I have been testing Kelvin for a while and sadly have the same issues. I tried configuring Kelvin to change only the color temperature, but that doesn't help. All the lights still lose sync at some point of the transition, seemingly at random but pretty much guaranteed. |
If it is of any help, I'm fairly certain that my bulbs are 4th gen bulbs (white bottom). |
Line 177 in e7c23ff
I started work inspired by @EmilyBjoerk’s proposal here: https://github.com/smockle/kelvin/commit/e8cacb3b170a0c59ecb68c7b4b42420213da3c18. When writing a new state, I store an If more than 10 seconds have elapsed, the pre-existing I haven’t attempted to list or check “plausible truncations” yet. Also, while I did build successfully (with Update 1 Update 2 NB: The light output of Hue White Ambiance Bulbs is 2200K–6500K (see ”Technical Specifications”). I wonder whether setting a target temperature (2000K) below the stated lower limit (2200K) is what is causing Kelvin to log |
I'm not yet convinced that the culprit is that the bulbs don't support all ct-levels. That would imply that identical bulbs would drop out at the same time, but that's not the case. I have been playing with the light update interval setting and brightness and color difference threshold values a little during the weekend. It seems no matter what lightUpdateIntervalInSeconds I choose, it takes one cycle for the light to report the correct state. Even 30 seconds isn't enough. With 15 sec interval:
With 30 sec interval:
Also, scene updates fill the log with |
I will be traveling over the holidays but when I come back in January I intend to make a script that run through all CT values writes them to a colour/ambience light respectively and then reads the value back 20 s later. By plotting written to read values I expect this to be able to prove/disprove the truncation hypothesis by the shape of the read curve. |
Also I should have better mean/stdev/min/max values for the delay if my theory holds. |
Hi Emily, I am really interested in your results of this. I observed many different behaviours over the years. With the recent firmware version the precision got a little better but is still far from perfect. |
I have just started the experiment. Expected runtime 5h ish. Actual script that I'm running:
Run like so:
Will output progress to stdout and a bzipped log. |
Output for ambience light: https://drive.google.com/open?id=1xawuyfmxlwwUIIlOQyGalM7wK29Ft4de There didn't seem to be much there, the lights seemed to operate as intended with the occasional off-by-one. Kelvin was not running during the experiments. My script waits until they differ once and then moves on which might be a thing so I'm going to just make it measure the full period and not move on when it finds a diff. I will also try different polling rates. I'll hold of on making plots and further analysis until I have more data. |
It seems that this had not occurred for me on the last weeks which agrees with my experiment not showing anything interesting |
I've had a few days to play with this again. The first unexpected improvement actually came from moving from an old raspberry to a new 3 b+ and docker. Don't know why but it made things a lot more stable and reduced the errors. |
Hi probus, I am currently working on a fix that will make Kelvin single threaded and therefore be much more predictable in what requests are made towards the bridge. This should not only help if you have many lights but also reduce the overall requests a lot! I hope to find the needed time for testing over the weekend... Anyway: Thanks for the detailed analysis and the digging! Very much appreciated! 👍 |
The behaviour is back for me again. I'm starting to suspect it might have to do with how long Kelvin had been running? Add that seems to be the one thing connecting the troubles I've had. |
@EmilyBjoerk What happens if you restart Kelvin? |
This would mean lights would not set the right temperature/brightness if powered on mid-transition, no? |
@probus I only have anecdotal evidence but it feels like after restarting the lights do not desync for a while. I think adding rate limiting to the API sounds like a sane solution if too many queries in a short window is indeed the cause. Thought technically any quota management or rate limiting should be applied server side (hue hub) in a sanely designed system 🙄 I think that Kelvin queries each light individually for it's status right? If that's the case then one could reduce the number of API calls from O(num_lights) to O(1) by querying the /lights endpoint to get data for all lights at once for the purpose detecting light power-on or scene change. And then as @probus said only write the light value when it is expected to change or on power on. This should also make it consume less CPU (less json to parse, less network requests) and allow to increase the poll rate to quicker detect power on events. Although from what I have seen this runs counter to the current modular, per light design of kelvin. |
You can continue polling the light state, just handle the transitions with one long transition instead of multiple 1 second transitions
According to the api documentation the bridge will first buffer the commands. This increases the latency even up to several seconds. When the buffer is full, the bridge will start dropping commands and returning error 503. Just to be clear, I don't expect this to make Kelvin 100% accurate. It's only one source of error. But it should be (I'm guessing) fairly easy to reduce the number of api calls and control the rate. When using hue switches (enableWhenLightsAppear=false) even more so since the light state doesn't need to be polled that often. That said, I don't see a big problem in increasing the tolerance for inaccuracies. The differences are undetectable to my eye. |
@probus I think the problem with increasing the tolerance is that it makes detection of manual scene changes more "fuzzy" and tweaking tolerances such as this where it isn't well understood what the tolerance should be is a slippery slope to unending tweaking. But I do agree, it wouldn't be detectable by the eye. To be fair, I never use manual scene changes and I want a static year around schedule so all I really need is a bash script that curls the api to set the lights once per second from a cron job ¯_(ツ)_/¯ |
I finally found the time to work on this and just pushed a big bunch of code. From now on Kelvin will only use on thread and query the bridge much less. This should mitigate a lot of the observed problems. The changes still need some testing and I am running them know at home. Feel free to give them a try and report any problems! |
Kelvin 1.2.0 is out and should greatly improve the reliability after switching to a single thread architecture. Give it a spin... |
I've updated I'll let it run for some weeks and get back to you. |
Great job! I've been running the single thread versions since the beginning and especially after the latest changes Kelvin has worked flawlessly. I haven't even increased the tolerances at all. |
I have still observed desyncing, although it is much better with the patch. During regular maintenance of my RPi I decided to check netstat and found something interesting:
By using:
I could determine that Kelvin still seems to be creating a high number of TCP connections (10-15 per second for 5 lights). And some of these connections seem to not clean up for a long time. |
Hey Emily, thanks for sharing this observation! I allowed my to analyze and discover a bug in the Go library I use in Kelvin (https://github.com/stefanwichmann/go.hue). This bug seems to have been there from the beginning and nobody noticed until now. It prevented the reuse of HTTP connections, resulting in a new socket for every single request. This is not only inefficient but also puts a heavy load on the bridge (especially when using HTTPS). I fixed the bug in the underlying library and committed the change. My initial tests confirm that now only a single connection to the bridge is used. Will give it some more testing and then release a fixed version. I'm really curios if this will improve the reliability... |
Just released Kelvin 1.3.1 It contains the mentioned fix in go.hue and should now be able to reuse the existing TCP connection to the bridge instead of establishing a new one for every request. |
I updated to the new patch a couple of days ago and I haven't observed any change. A few bulbs still desync in every transition. |
I just updated to 1.3.1. The CPU usage of Kelvin on my RPi Zero is now down to ~1% from the previous 6% (originally 20% before the previous patch). This alone is a nice improvement. My "gut feeling" in #45 that 20% was way too high was right :D. Thanks for the hard work here! I can also confirm that Kelvin is now only creating one socket connection to the bridge so the patch seems to be WAI on that front. I will let it run for a few weeks again and report back. |
@probus Could you paste a debug log extract of the bulbs that still desync? I added a lot of additional information to the log. Maybe that helps to track down the issue! Thanks! |
I have still seen some occasional "manual change", here is an excerpt from the logs:
Based on the above it looks like now the desyncs happen mostly ~10-30s after the light has been turned off/on using the power switch. Could it be that the regular schedule and the handling for new lights appearing are having some kind of unexpected interaction? |
What I'm observing seems like a glitch when turning on a light, it takes around 10-30s for the light to be set to the expected CT/Brightness. With a matching "manual" change in the logs... |
Hi Emily, I can definitely confirm that there are still some glitches. Here is an extract from my logs showing a similar behavior:
This log spans multiple days and there are some interesting things in there:
Especially the magic values 366 and 254 look really suspicious but I can't make any sense of them yet. Could you please also post a debug log with the specific color and brightness values in there? Do you also observe these specific values? |
I enabled debug logging. I will get back to you when I have some data. W.r.t. "magic numbers" you mentioned. Could there be a precision problem in some calculation that is visible with just these numbers? |
I had one event last night, here are the logs. for the time from where the light was written to where the manual change was detected. Not all of my lights were enabled at the time. It's only one sample point but it's also showing current colour temperature value of 366. But the brightness is also deviating. I think the "adjusting light temperature to light capability" means that the requested value from the schedule was out of range for the light? But why is it that line printed so often? Shouldn't that happen once when writing the value?
|
Interesting that in this case both attributes (color temperature and brightness) have the magic value of 366/254... This should not be a calculation error as these are the real values written to and read from the Hue API. And the difference (454 vs. 366 and 101 vs. 254 in your case) is really big... For some time I suspected another app or routine but could not find anything... and the occurrence seems random. Regarding the adjustment to the light capabilities: Yes, this could be refactored to only happen once... that would require some deeper changes though. Didn't want to introduce more changes here just yet. |
There are multiple occurrences of both attributes differing in your log above as well. I too, don't think another app is involved, as my experiment from some time ago didn't show anything interfering. And nothing else has access to the hue hub at my place, that I know of. Does the "Adjusted ...." thing communicate with the hub? I'm wondering if that puts more load on the hub overloading it causing the writes to not take? I think we found the limits in #51 (comment). I'm considering logging all the network packets to/from the hue-hub/RPi... but I'm not sure what I would gain from that and what tools to use to analyse this data as I expect it to be kinda big like finding a needle in a haystack... do you have any thoughts? |
Sorry for taking a while to reply. Here are a couple of logs from this morning from Kelvin running in debug mode. The bulbs are all white ambience e27 and running identical schedules in the same light fixtures (i.e. three bulbs in one, two bulbs in the other). https://www.dropbox.com/sh/4z7seog6n5uz2ti/AACC8EbwswODGO097ES46r22a?dl=0 edit. I added two more from today |
Sorry for the delay, I have bzipped the logs from the past few days with debug output enabled (apparently journald wouldn't keep more in memory on my raspberry pi). https://drive.google.com/open?id=1H7ga2R3YGhwP5-LjFT3OF8dY3cHppmu6 The juicy bits:
During this time I have made no manual scene changes. I have used the hue remote to turn on/off lights and also the light switches (which cut power) I have also used my Google Home assistant to turn on or off lights but not for brightness, scene or colour. Interestingly once it detected a manual change after 3h17m... Observation: The hue bridge has a default colour/value transition for lights that come back on after power was cut. Could there be a race condition with this transition and kelvin's scene detection? I don't use scenes, can we have a workaround to disable scene detection all together and just have kelvin write the light values blindly? This would solve the problem for me at least. |
Hi Emily, To your questions: When a light gets powered on after power was cut, it will be initialized to its startup color with the default transition time (400ms). Usually after this it will become available to the bridge because the zigbee communication takes some time to stabilize. Once the bridge becomes aware of the light it will be exposed to Kelvin via the bridge API. So in general there should be no race condition with lights coming up after a power cut. In fact all the lights in my house a controlled by conventional light switches so power is cut every time I turn them off. But, just to be sure, Kelvin has an "initialization phase" for all appearing lights (currently 3 seconds, see light.go). Regarding scenes there might indeed be some problems hidden in Kelvin. For the scene detection Kelvin has to use multiple requests which are currently issued in a rapid rate. This might lead to an overload in the bridge. Over the last few days I investigated this and implemented a rate limiting feature in the hue library. In combination with some mutexes it should ensure that Kelvin will not be able to issue requests faster than every 100ms... I'm testing this feature now, but can still see manual light changes after very different durations like you reported them... I'll keep searching... (Side note: I would like to keep the scene feature automatic, but of cause it should not cause any errors in the communication) |
NPYW. I will setup TCP logging to log all the TCP data to see what can be learned from what actually goes over the wire, |
The fact that kelvin is using HTTPS makes it harder to capture a network dump. Could you add a flag to force kelvin to use HTTP instead of HTTPS please? |
A new flag to disable HTTPS is now in master. I also committed my rate limiting patch... I'll test it for a few days and push a new release. |
@EmilyBjoerk I just release Kelvin 1.3.2. If you start the application with the flag "-disableHTTPS" you will now be able to debug the plain TCP traffic. |
Sorry for the delay, life's been busy and a few desynced lights didn't make it to my priority list... However, today when I was fidgeting around with the Hue App on my Android phone after a light had desynced, I noticed something odd: The lights that were still in sync were in RGB Colour mode, while the light that had desynced was in Colour Temperature mode. All three lights that were active at the time were RGB lights, and all my CT lights were off. I have not adjusted the light colours or intensity manually for months so this is not my doing. It could be related to the soft-on feature of the lights but it might also be related to how Kelvin writes the colour, I think it puts both the XY colour tuplet and the CT value in the POST request. Note: I have a setup with mixed RGB and CT lights. |
I have a setup with a hue bridge, 3 hue lights (colour + temperature capable) and 2 ambience lights (temperature capable). I've been spuriously getting problems where some lights seemingly randomly "desync" from their expected state. In the logs I find the "changed manually" message which I have not done.
After running with debug enabled I have been able to gather some data:
Logs for color lamp 1 around and before the time of the first event:
It looks like the last successful update was:
The light brightness was then updated to 95 and 94 before the next set was attempted 19 minutes later:
Formatting the interesting bits of the changed message for readability:
The value diff is 3 but the threshold for equality here:
kelvin/huelight.go
Line 205 in 712f498
is 2. I think a value of 3 would have prevented it in this case but I'm not sure if this is the actual fix or if the problem was the light not being updated for 19 minutes?
The text was updated successfully, but these errors were encountered: