Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stability Improvements #129

Merged
merged 16 commits into from
Mar 19, 2024
Merged

Stability Improvements #129

merged 16 commits into from
Mar 19, 2024

Conversation

jgstroud
Copy link
Collaborator

No description provided.

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 16, 2024

All of the other projects I see using wolfSSL are having to use this function.

Example:
https://github.com/Yurik72/ESPHap/blob/master/examples/EspHap_GarageDoor/EspHap_GarageDoor.ino#L265

Also, wonder if we should look at this ESPHap library more.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 16, 2024

Trying this now. I observe that the freeHeap is down by about 4K... still above 20KB though, so fine.

@jgstroud
Copy link
Collaborator Author

I got 4 of these crashes last night:
https://gist.github.com/jgstroud/e04ee6c156b808ee573b035e790be80d

They all look identical. The program counter is in the WiFi Keepalive, but the backtrace shows client_send_encrypted. I'm guessing this wifi keep alive is running on an interrupt and we are hitting some shared memory contention?

I may be crazy, but I feel like even though it's still crashing a lot, I feel like we're getting close. I started with crash dumps that made no sense and looked like a corrupted back trace. now they all look very similar and with sane looking stacks and back traces.

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 16, 2024

it looks like the ESPHap project was doing a lot of messing around with this Wifi Keepalive. One very notable difference is here:
https://github.com/Yurik72/ESPHap/blob/master/arduino_homekit_server.cpp#L851
Yurik72/ESPHap@8aa8811

They commented out this call which just so happens to come inside the client_send_encrypted.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 16, 2024

Your stack history shows that the keep alive was called from client_send_encrypted. So if it is not required then we should try removing (commenting out) as well.

@jgstroud
Copy link
Collaborator Author

Yeah. I started a test a few hours ago with the keep alive completely disabled. Not sure what consequence that will have but hopefully lead me a little closer to a proper solution.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 16, 2024

@jgstroud I have invited you to collaborate in my HomeKit server repository, so you can modify this without waiting for me to merge PRs.

This was added when the code used fetch() in an interval timer to protect against parallel reqeusts to our server-side code.  But moving to Server-Sent Events means we no longer do that and I don't see any risk we may need to protect against, so lets remove the semaphore code.
@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 17, 2024

@jgstroud I rebased a branch I am working on to improve client-side error handling onto this branch and have opened a PR to you to merge.

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 17, 2024

@dkerr64. Cool, I'll review it. I managed to get a stable build. It ran overnight, which is a first. However, I had to disable the heartbeat in the web code. Which I don't think we really need but it also seems to break OTA by having that disabled. A better fix would be to understand the error, but for now disabling it has made a huge difference.

I saw a number of these:
https://gist.github.com/jgstroud/9a9680483d4fd888db33fe47a0db334d

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 17, 2024

I think we are getting closer. The crash in heartbeat is probably here

I propose removing the PSTR and _P and using plain old printf...
subscription[i].client.printf("event: message\nretry: 15000\ndata: %s\n\n", json);

We do need some sort of heartbeat (I think) to keep the Server-Sent Events socket open. It is used "on demand" to push any changes (like door open) to the browser client.

@jgstroud
Copy link
Collaborator Author

Yeah I removed the PSTR part but haven't started testing it yet. Just letting it run a while longer with the heartbeat disabled

@jgstroud
Copy link
Collaborator Author

I'll start testing that version this afternoon.

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 17, 2024

@dkerr64 after turning the heartbeat back on, things are mostly ok. But I'm noticing that if I have the web ui open on my laptop and then I open it on my phone, they both update nicely, but if my laptop goes to sleep, then my phone will start updating very very slowly. Surely just getting lots of TCP retries. The free heap starts dropping by a lot, and I got it to crash like this. I suspect it ran out of memory. I really don't understand all of this logic, so I'm not sure how to fix it.

For now, I'll continue testing with only 1 client active.

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 17, 2024

@dkerr64 looks to me like the default timeout is set to 5 seconds and we have a new heartbeat every second. I think we should set the timeout to 500ms and if you get a timeout on the printf in the heartbeat function, then we should close the tcp connection.
See this commit 5ef96ea

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

Thanks, I am testing this version now.

@jgstroud
Copy link
Collaborator Author

@dkerr64 I just merged one more change into the HK library. I'm running stable now for 12+ hrs. This is the longest it has run other than the time I had the heartbeats disabled. If this run clean the rest of the day I'm going to go ahead and release.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

Excellent. My systems take 4+ days to crash so it will take a while for me to confirm, but I have confidence. I'm traveling this week so can't mess with it anyway!

The heartbeat is pretty much a no-op unless any browsers are open/actively connected.

There are a couple of other PSTR() in the web.cpp which maybe we should get rid of too? Much more infrequently called, but if we suspect problems accessing that block of memory then I think it safer to remove. I have just done that in my client-error-handling branch.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

I sent a PR that removes remaining PSTR() from web.cpp and fixes a clearInterval() problem in the javascript.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

Wait, don't merge that PR. I am seeing a problem similar to what you observed when multiple browser clients try to connect. Let me figure out what is going on.

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

We have a race condition with the heartbeat... which exposes itself when browser client is on slow connection. Server allocates say.. /rest/events/1 to a browser. Now the browser needs to open up an EventSource channel to that URI. But if SSEheartbeat() runs before the client has connected, then we get a client not listening error.

I was running into a similar problem when stress testing client error handling, and was seeing what happens when the EventSource connection failed. That was one of the outstanding error cases I still needed to tackle.

I have to think how best to handle this.

Oh... and it's got nothing to do with last PR so you can merge it.

@mitchjs
Copy link
Contributor

mitchjs commented Mar 18, 2024

you guys been busy :) mine still hasnt rebooted, up 6 days! since #pr127
my friends did lockup and he had to physically power cycle, and it did loose HK pairing...
he was running the main branch of course

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

I've opened issue #130 to track this race condition. Lets discuss over there on how to fix.

@jgstroud
Copy link
Collaborator Author

@mitchjs it's getting a lot more stable especially if you pull this PR. I'm tempted to do a release, but I might want to wait on a resolution to #130 first as I think I have also seen this.

@mitchjs
Copy link
Contributor

mitchjs commented Mar 18, 2024

flashed mine, and sent to my friend who has 2 running...

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 18, 2024

@dkerr64 Great, I just merged that PR. I'm pretty sure I have run into this issue. How confident are you in these changes? I'll do a test build, but looks like I'm going to have a busy week and probably won't be able to spend much more time looking at this for a while, so I'm tempted to just merge everything andrelease an new image.

@jgstroud jgstroud changed the title Disable WPS from using 4k of stack Stability Improvements Mar 18, 2024
@jgstroud
Copy link
Collaborator Author

@dkerr64 what are your thoughts on releasing this as a 1.0 build and removing the pre-release designation?

Treat more like a git submodule and add a specific commit ID and don't
just pull from HEAD
@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 18, 2024

@jgstroud I've figured out how to solve the other problem in #130 when client fails to open EventSource connection. It's simple logic, so I'd like to implement that. Then lets pull the trigger on 1.0.

@jgstroud
Copy link
Collaborator Author

jgstroud commented Mar 18, 2024

We fixed the network problems too :D


PING 192.168.8.205 (192.168.8.205) 56(84) bytes of data.
64 bytes from 192.168.8.205: icmp_seq=1 ttl=254 time=2.38 ms
64 bytes from 192.168.8.205: icmp_seq=2 ttl=254 time=4.31 ms
64 bytes from 192.168.8.205: icmp_seq=3 ttl=254 time=2.93 ms
64 bytes from 192.168.8.205: icmp_seq=4 ttl=254 time=2.11 ms
64 bytes from 192.168.8.205: icmp_seq=5 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=6 ttl=254 time=3.18 ms
64 bytes from 192.168.8.205: icmp_seq=7 ttl=254 time=1.89 ms
64 bytes from 192.168.8.205: icmp_seq=8 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=9 ttl=254 time=5.37 ms
64 bytes from 192.168.8.205: icmp_seq=10 ttl=254 time=2.83 ms
64 bytes from 192.168.8.205: icmp_seq=11 ttl=254 time=2.08 ms
64 bytes from 192.168.8.205: icmp_seq=12 ttl=254 time=1.87 ms
64 bytes from 192.168.8.205: icmp_seq=13 ttl=254 time=1.94 ms
64 bytes from 192.168.8.205: icmp_seq=14 ttl=254 time=2.67 ms
64 bytes from 192.168.8.205: icmp_seq=15 ttl=254 time=3.12 ms

@jgstroud
Copy link
Collaborator Author

@mitchjs @dkerr64 if you are running the PR, can you test un-pairing and re-pairing?

@mitchjs
Copy link
Contributor

mitchjs commented Mar 19, 2024

@jgstroud , i just unpaired... after the reboot, i paired it to my homekit (iPhone)
and all seem ok

@dkerr64
Copy link
Collaborator

dkerr64 commented Mar 19, 2024

I am away from home so am not able to test unpair/pair. But I have pushed a fix to the 2nd problem described in #130, I opened a PR on your branch.

While implementing the fix I discovered that I was setting the ticker for each connected client... so instead of the ticker running once per second, it was running once per second for each connected client, so could be 2 or 3. And then the ticker looped through all clients. I fixed it by having one global heartbeat timer, rather than one for each client. Maybe not the most elegant, but works.

I think we have dramatically improved things. So I vote for getting this out to folks to test. There will always be more improvements, but I think we have reached a significant milestone, so should push it out.

@jgstroud
Copy link
Collaborator Author

@dkerr64 I just fixed the wifi provisioning timeout errors! Ready for a release 1.0

Timeout on no client connecting to SSEHandler.  Solves 2nd problem in ratgdo#130
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants