-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stability Improvements #129
Conversation
All of the other projects I see using wolfSSL are having to use this function. Example: Also, wonder if we should look at this ESPHap library more. |
Trying this now. I observe that the freeHeap is down by about 4K... still above 20KB though, so fine. |
I got 4 of these crashes last night: They all look identical. The program counter is in the WiFi Keepalive, but the backtrace shows client_send_encrypted. I'm guessing this wifi keep alive is running on an interrupt and we are hitting some shared memory contention? I may be crazy, but I feel like even though it's still crashing a lot, I feel like we're getting close. I started with crash dumps that made no sense and looked like a corrupted back trace. now they all look very similar and with sane looking stacks and back traces. |
it looks like the ESPHap project was doing a lot of messing around with this Wifi Keepalive. One very notable difference is here: They commented out this call which just so happens to come inside the client_send_encrypted. |
Your stack history shows that the keep alive was called from client_send_encrypted. So if it is not required then we should try removing (commenting out) as well. |
Yeah. I started a test a few hours ago with the keep alive completely disabled. Not sure what consequence that will have but hopefully lead me a little closer to a proper solution. |
@jgstroud I have invited you to collaborate in my HomeKit server repository, so you can modify this without waiting for me to merge PRs. |
75b9d38
to
f94d6cb
Compare
This was added when the code used fetch() in an interval timer to protect against parallel reqeusts to our server-side code. But moving to Server-Sent Events means we no longer do that and I don't see any risk we may need to protect against, so lets remove the semaphore code.
@jgstroud I rebased a branch I am working on to improve client-side error handling onto this branch and have opened a PR to you to merge. |
@dkerr64. Cool, I'll review it. I managed to get a stable build. It ran overnight, which is a first. However, I had to disable the heartbeat in the web code. Which I don't think we really need but it also seems to break OTA by having that disabled. A better fix would be to understand the error, but for now disabling it has made a huge difference. I saw a number of these: |
I think we are getting closer. The crash in heartbeat is probably here I propose removing the PSTR and _P and using plain old printf... We do need some sort of heartbeat (I think) to keep the Server-Sent Events socket open. It is used "on demand" to push any changes (like door open) to the browser client. |
Yeah I removed the PSTR part but haven't started testing it yet. Just letting it run a while longer with the heartbeat disabled |
I'll start testing that version this afternoon. |
@dkerr64 after turning the heartbeat back on, things are mostly ok. But I'm noticing that if I have the web ui open on my laptop and then I open it on my phone, they both update nicely, but if my laptop goes to sleep, then my phone will start updating very very slowly. Surely just getting lots of TCP retries. The free heap starts dropping by a lot, and I got it to crash like this. I suspect it ran out of memory. I really don't understand all of this logic, so I'm not sure how to fix it. For now, I'll continue testing with only 1 client active. |
Thanks, I am testing this version now. |
@dkerr64 I just merged one more change into the HK library. I'm running stable now for 12+ hrs. This is the longest it has run other than the time I had the heartbeats disabled. If this run clean the rest of the day I'm going to go ahead and release. |
Excellent. My systems take 4+ days to crash so it will take a while for me to confirm, but I have confidence. I'm traveling this week so can't mess with it anyway! The heartbeat is pretty much a no-op unless any browsers are open/actively connected. There are a couple of other PSTR() in the web.cpp which maybe we should get rid of too? Much more infrequently called, but if we suspect problems accessing that block of memory then I think it safer to remove. I have just done that in my client-error-handling branch. |
I sent a PR that removes remaining PSTR() from web.cpp and fixes a clearInterval() problem in the javascript. |
Wait, don't merge that PR. I am seeing a problem similar to what you observed when multiple browser clients try to connect. Let me figure out what is going on. |
We have a race condition with the heartbeat... which exposes itself when browser client is on slow connection. Server allocates say.. I was running into a similar problem when stress testing client error handling, and was seeing what happens when the EventSource connection failed. That was one of the outstanding error cases I still needed to tackle. I have to think how best to handle this. Oh... and it's got nothing to do with last PR so you can merge it. |
you guys been busy :) mine still hasnt rebooted, up 6 days! since #pr127 |
I've opened issue #130 to track this race condition. Lets discuss over there on how to fix. |
flashed mine, and sent to my friend who has 2 running... |
Client error handling
@dkerr64 Great, I just merged that PR. I'm pretty sure I have run into this issue. How confident are you in these changes? I'll do a test build, but looks like I'm going to have a busy week and probably won't be able to spend much more time looking at this for a while, so I'm tempted to just merge everything andrelease an new image. |
@dkerr64 what are your thoughts on releasing this as a 1.0 build and removing the pre-release designation? |
Treat more like a git submodule and add a specific commit ID and don't just pull from HEAD
We fixed the network problems too :D
|
@jgstroud , i just unpaired... after the reboot, i paired it to my homekit (iPhone) |
I am away from home so am not able to test unpair/pair. But I have pushed a fix to the 2nd problem described in #130, I opened a PR on your branch. While implementing the fix I discovered that I was setting the ticker for each connected client... so instead of the ticker running once per second, it was running once per second for each connected client, so could be 2 or 3. And then the ticker looped through all clients. I fixed it by having one global heartbeat timer, rather than one for each client. Maybe not the most elegant, but works. I think we have dramatically improved things. So I vote for getting this out to folks to test. There will always be more improvements, but I think we have reached a significant milestone, so should push it out. |
@dkerr64 I just fixed the wifi provisioning timeout errors! Ready for a release 1.0 |
Timeout on no client connecting to SSEHandler. Solves 2nd problem in ratgdo#130
No description provided.