-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use our own HomeKit server. #127
Conversation
Thanks for this. I think it's definitely needed. I had started doing the same thing a while back and just never merged any of it. Have you looked at all at updating to the latest wolfssl library? |
No, I have not thought of doing any more updates, but if you know of some worthwhile then I think we should consider it. What I did is fork from the original, then merge in the changes from the previous fork we were using (@mrthiti), then merge in from @JKoss2 (had to resolve some conflicts) and then added my test for IP address not being set... and bumped version to 1.5. I preserved all commit history "just-in-case" which has made it look somewhat ugly. I chose to merge in from @JKoss2 because all I did was sort all the forks (260+) by recent activity, to look at what others were doing. I felt like @JKoss2 has the most complete. |
@dkerr64, how is this testing out so far? |
I am at 2 days 7 hours so a little early to know. |
I have two ratgdo's running with this version of firmware. One crashed and rebooted itself at ~5 days 16 hours, the other at ~4 days 6 hours. Unfortunately I was not able to capture the log... when I got back to VScode to grab it, VScode demanded that I reload and that lost the history. But I strongly suspect the same as what I posted in #94, which looks like stack overflow. Having said that... 4+ days is the longest it has run for me, and pairing was not lost, HomeKit still worked. I'm going to have another go at capturing logs (which will take another 4-5 days). |
@dkerr64 I just realized one of my devices rebooted sometime today. I just put a serial logger on it in hopes of catching something. In the mean time I'm trying to break it by hammering it with network traffic. So far it's holding up surprisingly well. |
I’m installed the V0.12 firmware and set my RATGDO to reboot every 48 hours. It has been working perfectly ever since. I have a full Ubiquiti network and the RATGDO is assigned to a single AP. |
Caught the reboot. If the stack is really 4K, then it didn't overflow here as this is just over 2k. But I think the pointers being passed to write() in the client_send_encrypted_() function have probably been corrupted based on the unaligned address error. --------------- CUT HERE FOR EXCEPTION DECODER --------------- Exception (9): LoadStoreAlignmentCause: Load or store to an unaligned address
ctx: cont 0x40228f54 in client_send_encrypted_(_client_context_t*, unsigned char*, unsigned int) at ??:? --------------- CUT HERE FOR EXCEPTION DECODER --------------- |
I don't know how Arduino sets the stack, but if this define is setting it to 2048, then I've overflowed https://github.com/dkerr64/Arduino-HomeKit-ESP8266/blob/master/src/port.h#L45 |
I've captured a few more errors. Both look like this. It always looks like corrupted memory and the stack pointer is always the same value when it craters. Exception (28): LoadProhibited: A load referenced a page mapped with an attribute that does not permit loads stack>>> ctx: cont |
if the stack pointer is 3ffff7b0 and the top of the stack is 3fffffd0 (probably actually 3fffffff) then some quick hex math... What consumes stack? AFAIK it is only function parameters, local variables, return addresses. Allocated memory comes from the heap (which we have >20KB available). So... we can go looking for 1) large arrays/structs in the stack that could be malloc'd? 2) recursive/very deep function calls?, 3) a way to increase the stack size. Again, As far as I can tell, the Arduino based firmware is all single threaded, so there is only the one stack. I put a check for stack size into the main loop and included it in my heartbeat messages to the client javascript... at the point I grabbed the stack size it was in the 100's bytes, not KBs. It looks like all cases where we crash, the function call hierarchy all trace back to the HomeKit server. So any stack overflow is within that. |
Tell me I'm wrong... but |
I think you are right, that data gets allocated on the stack. If we convert that to a malloc, I think it will go on the heap instead. |
This is a great find. I converted to a malloc and I'm running a test now. This particular module Im testing on now has been crashing fairly frequently. So, we'll see how it goes. There are lots of arrays being created in the 16 and 32byte range. By themselves, those don't seem to be an issue, but there are quite a few being created fairly far down the call stack, so they start to add up. |
I think there are quite a few functions that are only called in one location, such as this client_send_encrypted_() function. We can inline those functions. |
I'm not sure that buys much savings in stack space... just parameters and return addresses. But as you say, it all adds up. |
You are right it's not the low hanging fruit. Just thinking about how we can reduce stack usage. I'm always seeing stack at 2080 bytes which means are overflowing but also means we already have 1kB on the stack, which to me seems a bit high for this thing. In any case I'm already seeing a big improvement moving that 1kB buffer. I have a secondary device I setup for debugging this issue. It crashes about every hour but is running about 4 hours with this change. This is probably stating the obvious, but there is mounting evidence the problem is with the HK networking side. I can hammer the device with various packet types and even hammer the web server and nothing bad happens other than some dropped packets. However the big difference between my stable device and my unstable device is the good one is connected to a HomeKit hub and the other is not. So one is getting requests from just the home hub and the other is seeing traffic from multiple phones, laptops etc. |
In the meantime, I have been googling to see if I can find out how the stack size can be increased. It is probably a tradeoff between stack and maximum heap. I've not found it yet. But I agree, the HomeKit part, exceeding available stack size, is most likely where the problem is. I think it goodness to look for large local array/struct declarations and removing them. If the function can never be called in parallel (ie always the same thread, non-interruptible), then we could consider global declaration. If not then malloc() -- but lets be real careful to always free before exiting the function. |
by-the-way... something else I noticed... HomeKit pairing takes a long time, during which there is a ~30 second period that the main loop is stuck inside HomeKit. I know this because my once-every-second heartbeat from the web server stops for about 30 seconds, meaning nothing else is being allowed to run. This could be in the encryption key generation, but I don't know. Anyway... point of this is to suggest that we maybe look for opportunities inside the HomeKit code to add yield()... https://stackoverflow.com/questions/34497758/what-is-the-secret-of-the-arduino-yieldfunction |
I sent you a PR to malloc the buffer. You could probably make it static instead of malloc, but this works. And yeah, I made sure to free it everywhere. If you want to merge that or just fix it yourself a better way, then I'll release another build as I think this will definitely help improve stability. |
Jumping in late, but why not make the stack bigger? If it was esp32 for sure would do thatI have had tasks that need 3k of stack But this is esp8266 not sure what kind of ram we gotSent from my iPhoneOn Mar 15, 2024, at 6:12 PM, jgstroud ***@***.***> wrote:
I sent you a PR to malloc the buffer. You could probably make it static instead of malloc, but this works. And yeah, I made sure to free it everywhere.
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: ***@***.***>
|
Mostly because I couldn't figure out how to do it! But even if we did change the stack size, we shouldn't be abusing it by pushing kB's at a time on it |
Oh i saw a line highlighted that a define was 2048Well if the var is just temp I see no harm using stack… once the function exits the mem is freeSent from my iPhoneOn Mar 15, 2024, at 7:08 PM, jgstroud ***@***.***> wrote:
Mostly because I couldn't figure out how to do it! But even if we did change the stack size, we shouldn't be abusing it by pushing kB's at a time on it
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: ***@***.***>
|
Well, no harm in putting it on the stack if you don't overflow the stack. With this change, I ran clean for 6 hours, then hit 4 back to back crashes. The crash dumps no longer have the really deep stacks and crazy looking backtraces. https://gist.github.com/jgstroud/e258b693f958b678b1d7c3ce7543b435 |
Yes, but if you grep for SERVER_TASK_STACK it appears to never be used. So if we want to adjust stack size then I think it has to be done elsewhere. |
@jgstroud I merged your PR into my HomeKit fork. I am now running this version on one of my ratgdo's and will likely update the other as well. |
on another note, i switched to @dkerr64 PR version over 4 days ago... and no reboots yet |
i saw someone posted the standard 8266 arduino core uses 4096 as stack size |
yes, I found that as well... but I think it is overly complicated. Best first thing for us to do is reduce stack usage where we can. |
do we have any consistency to which method we get the exception? |
Looks to me that there are two distinct crashes here... fourth is dup of second, third is dup of first. I thought I had fixed the first one, I saw this myself and reported it in #126 but I just closed that because I thought I fixed it in #127 here The second one needs investigation. |
No recursive calling that I can see... but yes, that function was allocating an array of 1KB size on the stack. @jgstroud just changed that to a alloc/free from heap and we're now testing with that. |
ok, im still waiting for my 1st crash on your firmware from 4+ days ago :) |
yeah, there are 2 distinctly different crashes here, but they both look like corrupt memory to me. I didn't realize that 2 were a dup of one you already reported. I kinda think the stack trace may be misleading as I think the memory is corrupted from some other operation. |
Well, here's another one. Just looks like more corrupt memory though. I added some debug code, so it's likely just moving the error around now. https://gist.github.com/jgstroud/3577e93ee796e02611cb968088c9b356 |
the one bad thing is when i turned off compile optimization, for more debug detail... it wasnt stable at all build_type = debug |
To try and fix various HomeKit stability problems I have created my own fork of the Arduino-HomeKit-ESP8266 server and merged in fixes others have implemented plus also a fix for issue #126
This PR updates homekit-ratgdo to pull the homekit server from my repository, plus also reports the available free heap space