Stability Improvements #129

jgstroud · 2024-03-16T04:07:10Z

No description provided.

jgstroud · 2024-03-16T04:15:45Z

See:
https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_app_entry_noextra4k.cpp#L14
https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_main.cpp#L345

jgstroud · 2024-03-16T04:16:23Z

All of the other projects I see using wolfSSL are having to use this function.

Example:
https://github.com/Yurik72/ESPHap/blob/master/examples/EspHap_GarageDoor/EspHap_GarageDoor.ino#L265

Also, wonder if we should look at this ESPHap library more.

dkerr64 · 2024-03-16T15:02:13Z

Trying this now. I observe that the freeHeap is down by about 4K... still above 20KB though, so fine.

jgstroud · 2024-03-16T16:06:18Z

I got 4 of these crashes last night:
https://gist.github.com/jgstroud/e04ee6c156b808ee573b035e790be80d

They all look identical. The program counter is in the WiFi Keepalive, but the backtrace shows client_send_encrypted. I'm guessing this wifi keep alive is running on an interrupt and we are hitting some shared memory contention?

I may be crazy, but even though it's still crashing a lot, I feel like we're getting close. I started with crash dumps that made no sense and looked like corrupted backtraces. Now they all look very similar, with sane-looking stacks and backtraces.

jgstroud · 2024-03-16T16:49:08Z

It looks like the ESPHap project did a lot of work around this WiFi keepalive. One very notable difference is here:
https://github.com/Yurik72/ESPHap/blob/master/arduino_homekit_server.cpp#L851
Yurik72/ESPHap@8aa8811

They commented out this call which just so happens to come inside the client_send_encrypted.

dkerr64 · 2024-03-16T18:59:28Z

Your stack history shows that the keep alive was called from client_send_encrypted. So if it is not required then we should try removing (commenting out) as well.

jgstroud · 2024-03-16T19:14:18Z

Yeah. I started a test a few hours ago with the keepalive completely disabled. I'm not sure what consequences that will have, but hopefully it will lead me a little closer to a proper solution.

dkerr64 · 2024-03-16T19:40:26Z

@jgstroud I have invited you to collaborate in my HomeKit server repository, so you can modify this without waiting for me to merge PRs.

This was added when the code used fetch() in an interval timer, to protect against parallel requests to our server-side code. Moving to Server-Sent Events means we no longer do that, and I don't see any remaining risk we need to protect against, so let's remove the semaphore code.

dkerr64 · 2024-03-17T15:22:49Z

@jgstroud I rebased a branch I am working on to improve client-side error handling onto this branch and have opened a PR to you to merge.

jgstroud · 2024-03-17T15:36:56Z

@dkerr64 Cool, I'll review it. I managed to get a stable build. It ran overnight, which is a first. However, I had to disable the heartbeat in the web code. I don't think we really need it, but disabling it also seems to break OTA. A better fix would be to understand the error, but for now disabling it has made a huge difference.

I saw a number of these:
https://gist.github.com/jgstroud/9a9680483d4fd888db33fe47a0db334d

dkerr64 · 2024-03-17T15:57:07Z

I think we are getting closer. The crash in heartbeat is probably here

I propose removing the PSTR and _P and using plain old printf...
subscription[i].client.printf("event: message\nretry: 15000\ndata: %s\n\n", json);

We do need some sort of heartbeat (I think) to keep the Server-Sent Events socket open. It is used "on demand" to push any changes (like door open) to the browser client.

jgstroud · 2024-03-17T17:58:53Z

Yeah, I removed the PSTR part but haven't started testing it yet. I'm just letting it run a while longer with the heartbeat disabled.

jgstroud · 2024-03-17T17:59:50Z

I'll start testing that version this afternoon.

jgstroud · 2024-03-17T21:47:00Z

@dkerr64 After turning the heartbeat back on, things are mostly OK. But I'm noticing that if I have the web UI open on my laptop and then open it on my phone, they both update nicely; if my laptop then goes to sleep, my phone starts updating very slowly. Presumably we're just getting lots of TCP retries. The free heap starts dropping by a lot, and I got it to crash like this. I suspect it ran out of memory. I really don't understand all of this logic, so I'm not sure how to fix it.

For now, I'll continue testing with only 1 client active.

jgstroud · 2024-03-17T22:40:56Z

@dkerr64 It looks to me like the default timeout is set to 5 seconds and we send a new heartbeat every second. I think we should set the timeout to 500 ms, and if the printf in the heartbeat function times out, we should close the TCP connection.
See commit 5ef96ea.
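
A minimal sketch of that idea, not the code from the commit. It assumes a client object whose write returns 0 when the send times out; `FakeClient`, `Subscription`, `sendHeartbeat` and `kHeartbeatTimeoutMs` are hypothetical names used only for illustration, and the fake client merely simulates a stalled peer so the logic can run off-device.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a network client (not the real WiFiClient).
// write() returns the number of bytes accepted, or 0 if the send timed out.
struct FakeClient {
    bool open = true;
    unsigned timeoutMs = 5000;  // the 5 second default mentioned above
    bool stalled = false;       // simulates a peer that stopped reading

    void setTimeout(unsigned ms) { timeoutMs = ms; }
    std::size_t write(const std::string &msg) {
        return (open && !stalled) ? msg.size() : 0;
    }
    void stop() { open = false; }
};

struct Subscription {
    FakeClient client;
    bool inUse = false;
};

constexpr unsigned kHeartbeatTimeoutMs = 500;

// Sends one heartbeat to every active subscriber and closes any connection
// whose write did not go through. Returns the number of connections dropped.
int sendHeartbeat(std::vector<Subscription> &subs, const std::string &json) {
    const std::string msg =
        "event: message\nretry: 15000\ndata: " + json + "\n\n";
    int dropped = 0;
    for (auto &s : subs) {
        if (!s.inUse) continue;
        s.client.setTimeout(kHeartbeatTimeoutMs);
        if (s.client.write(msg) == 0) {
            // A stalled client must not hold buffers or delay the others.
            s.client.stop();
            s.inUse = false;
            ++dropped;
        }
    }
    return dropped;
}

// Usage: one healthy subscriber and one that has gone to sleep.
void exampleHeartbeatTimeout() {
    std::vector<Subscription> subs(2);
    subs[0].inUse = true;
    subs[1].inUse = true;
    subs[1].client.stalled = true;
    int dropped = sendHeartbeat(subs, "{}");
    assert(dropped == 1);
    assert(subs[0].client.open && !subs[1].client.open);
}
```

Dropping the stalled connection relies on the browser's EventSource reconnecting once the laptop wakes, which is what the `retry:` field in the message is there to control.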

dkerr64 · 2024-03-18T05:29:35Z

Thanks, I am testing this version now.

jgstroud · 2024-03-18T14:04:20Z

@dkerr64 I just merged one more change into the HK library. I've been running stable now for 12+ hrs. This is the longest it has run other than the time I had the heartbeats disabled. If this runs clean for the rest of the day, I'm going to go ahead and release.

dkerr64 · 2024-03-18T14:34:08Z

Excellent. My systems take 4+ days to crash so it will take a while for me to confirm, but I have confidence. I'm traveling this week so can't mess with it anyway!

The heartbeat is pretty much a no-op unless any browsers are open/actively connected.

There are a couple of other PSTR() calls in web.cpp which maybe we should get rid of too? They are called much less frequently, but if we suspect problems accessing that block of memory then I think it is safer to remove them. I have just done that in my client-error-handling branch.

dkerr64 · 2024-03-18T14:43:26Z

I sent a PR that removes the remaining PSTR() calls from web.cpp and fixes a clearInterval() problem in the JavaScript.

dkerr64 · 2024-03-18T14:48:30Z

Wait, don't merge that PR. I am seeing a problem similar to what you observed when multiple browser clients try to connect. Let me figure out what is going on.

dkerr64 · 2024-03-18T14:57:41Z

We have a race condition with the heartbeat, which shows up when the browser client is on a slow connection. The server allocates, say, /rest/events/1 to a browser. The browser then needs to open an EventSource channel to that URI. But if SSEheartbeat() runs before the client has connected, we get a client-not-listening error.
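
To make the race concrete, here is a minimal sketch of one way to guard against it, not the fix that was eventually committed. It assumes each event channel can be tracked as a slot that is allocated first and only later marked connected; `Slot`, `SlotState`, `allocate`, `onClientConnected`, `heartbeat` and the 5 second `kConnectGraceMs` value are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// A slot is handed out (Allocated) before the browser opens its EventSource
// connection (Connected). The race is a heartbeat firing in between.
enum class SlotState { Free, Allocated, Connected };

struct Slot {
    SlotState state = SlotState::Free;
    std::uint32_t allocatedAtMs = 0;
    int heartbeatsSent = 0;
};

// Hypothetical grace period for the browser to connect.
constexpr std::uint32_t kConnectGraceMs = 5000;

void allocate(Slot &s, std::uint32_t nowMs) {
    s.state = SlotState::Allocated;
    s.allocatedAtMs = nowMs;
    s.heartbeatsSent = 0;
}

void onClientConnected(Slot &s) {
    if (s.state == SlotState::Allocated) s.state = SlotState::Connected;
}

// Runs on every heartbeat period. Returns true if a heartbeat was sent.
bool heartbeat(Slot &s, std::uint32_t nowMs) {
    switch (s.state) {
    case SlotState::Free:
        return false;
    case SlotState::Allocated:
        // Nobody is listening yet: do not write. Unsigned subtraction keeps
        // the elapsed time correct across a millisecond-counter wraparound.
        if (nowMs - s.allocatedAtMs >= kConnectGraceMs) s.state = SlotState::Free;
        return false;
    case SlotState::Connected:
        ++s.heartbeatsSent;
        return true;
    }
    return false;
}

// Usage: a heartbeat that fires before the client connects is skipped.
void exampleHeartbeatRace() {
    Slot s;
    allocate(s, 1000);
    assert(!heartbeat(s, 2000));  // too early, nothing is written
    onClientConnected(s);
    assert(heartbeat(s, 3000));   // now the client is listening
}
```

The same state check also reclaims a slot whose browser never connects, so an abandoned /rest/events/N channel does not stay allocated forever.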

I was running into a similar problem when stress testing client error handling, and was seeing what happens when the EventSource connection failed. That was one of the outstanding error cases I still needed to tackle.

I have to think how best to handle this.

Oh... and it's got nothing to do with the last PR, so you can merge it.

mitchjs · 2024-03-18T14:59:02Z

You guys have been busy :) Mine still hasn't rebooted, up 6 days since #pr127!
My friend's did lock up and he had to physically power cycle it, and it did lose its HK pairing...
He was running the main branch, of course.

dkerr64 · 2024-03-18T15:01:09Z

I've opened issue #130 to track this race condition. Let's discuss how to fix it over there.

jgstroud · 2024-03-18T16:53:22Z

@mitchjs It's getting a lot more stable, especially if you pull this PR. I'm tempted to do a release, but I might want to wait for a resolution to #130 first, as I think I have also seen this.

mitchjs · 2024-03-18T17:23:50Z

Flashed mine, and sent it to my friend, who has 2 running...

jgstroud · 2024-03-18T19:06:53Z

@dkerr64 Great, I just merged that PR. I'm pretty sure I have run into this issue. How confident are you in these changes? I'll do a test build, but it looks like I'm going to have a busy week and probably won't be able to spend much more time on this for a while, so I'm tempted to just merge everything and release a new image.

jgstroud · 2024-03-18T22:55:22Z

@dkerr64 What are your thoughts on releasing this as a 1.0 build and removing the pre-release designation?

dkerr64 · 2024-03-18T23:14:55Z

@jgstroud I've figured out how to solve the other problem in #130, where the client fails to open the EventSource connection. It's simple logic, so I'd like to implement that. Then let's pull the trigger on 1.0.

jgstroud · 2024-03-18T23:25:08Z

We fixed the network problems too :D


PING 192.168.8.205 (192.168.8.205) 56(84) bytes of data.
64 bytes from 192.168.8.205: icmp_seq=1 ttl=254 time=2.38 ms
64 bytes from 192.168.8.205: icmp_seq=2 ttl=254 time=4.31 ms
64 bytes from 192.168.8.205: icmp_seq=3 ttl=254 time=2.93 ms
64 bytes from 192.168.8.205: icmp_seq=4 ttl=254 time=2.11 ms
64 bytes from 192.168.8.205: icmp_seq=5 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=6 ttl=254 time=3.18 ms
64 bytes from 192.168.8.205: icmp_seq=7 ttl=254 time=1.89 ms
64 bytes from 192.168.8.205: icmp_seq=8 ttl=254 time=1.98 ms
64 bytes from 192.168.8.205: icmp_seq=9 ttl=254 time=5.37 ms
64 bytes from 192.168.8.205: icmp_seq=10 ttl=254 time=2.83 ms
64 bytes from 192.168.8.205: icmp_seq=11 ttl=254 time=2.08 ms
64 bytes from 192.168.8.205: icmp_seq=12 ttl=254 time=1.87 ms
64 bytes from 192.168.8.205: icmp_seq=13 ttl=254 time=1.94 ms
64 bytes from 192.168.8.205: icmp_seq=14 ttl=254 time=2.67 ms
64 bytes from 192.168.8.205: icmp_seq=15 ttl=254 time=3.12 ms

jgstroud · 2024-03-19T01:52:06Z

@mitchjs @dkerr64 if you are running the PR, can you test un-pairing and re-pairing?

mitchjs · 2024-03-19T02:15:06Z

@jgstroud, I just unpaired... After the reboot, I paired it to my HomeKit (iPhone)
and all seems OK.

dkerr64 · 2024-03-19T04:58:30Z

I am away from home, so I am not able to test unpair/pair. But I have pushed a fix for the 2nd problem described in #130 and opened a PR on your branch.

While implementing the fix I discovered that I was setting the ticker for each connected client. So instead of the ticker running once per second, it was running once per second for each connected client, which could be 2 or 3 times. And each run of the ticker looped through all clients. I fixed it by having one global heartbeat timer rather than one for each client. Maybe not the most elegant, but it works.
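
A rough sketch of the single-timer arrangement, under the assumption that the timer can be modeled as a flag plus a `tick()` callback. `HeartbeatRegistry`, `kMaxClients` and the member names are hypothetical and do not come from web.cpp. With one timer per client, N clients would each receive N heartbeats per period; with one shared timer, each receives exactly one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxClients = 4;

struct HeartbeatRegistry {
    std::array<bool, kMaxClients> active{};
    std::array<int, kMaxClients> beats{};
    bool timerRunning = false;  // stands in for the one global timer

    std::size_t activeCount() const {
        std::size_t n = 0;
        for (bool a : active) n += a ? 1 : 0;
        return n;
    }

    bool subscribe(std::size_t i) {
        if (i >= kMaxClients || active[i]) return false;
        active[i] = true;
        timerRunning = true;  // idempotent: a second subscriber adds no timer
        return true;
    }

    void unsubscribe(std::size_t i) {
        if (i >= kMaxClients) return;
        active[i] = false;
        if (activeCount() == 0) timerRunning = false;  // nothing left to ping
    }

    // Called once per period by the single timer; visits every client once.
    void tick() {
        if (!timerRunning) return;
        for (std::size_t i = 0; i < kMaxClients; ++i) {
            if (active[i]) ++beats[i];
        }
    }
};

// Usage: two clients, one period, one heartbeat each.
void exampleGlobalTimer() {
    HeartbeatRegistry r;
    r.subscribe(0);
    r.subscribe(1);
    r.tick();
    assert(r.beats[0] == 1 && r.beats[1] == 1);
}
```

Stopping the timer when the last client leaves also keeps the heartbeat close to a no-op while no browser is connected.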

I think we have dramatically improved things. So I vote for getting this out to folks to test. There will always be more improvements, but I think we have reached a significant milestone, so should push it out.

jgstroud · 2024-03-19T05:01:19Z

@dkerr64 I just fixed the wifi provisioning timeout errors! Ready for a release 1.0

Disable WPS from using 4k of stack

7deb1a4

jgstroud mentioned this pull request Mar 16, 2024

Reboot caused by exception when HomeKit WiFi client apparently has zero for localIP and remoteIP #126

Closed

Track and report minimum detect heap size

f94d6cb

jgstroud force-pushed the stability_debug branch from 75b9d38 to f94d6cb Compare March 17, 2024 05:39

dkerr64 added 3 commits March 17, 2024 10:46

Handle network errors and support mobile swipe to reload

689f440

Increase buffer for JSON strings to 1024, also use global buffer in h…

a255955

…eartbeat

jgstroud and others added 2 commits March 17, 2024 14:33

Merge pull request #1 from dkerr64/client-error-handling

5cd5913

Client error handling

Remove the PSTR and just use strings in ram

ae73aca

Set heatbeat timeout to 500ms and close connection on timeout

5ef96ea

Remove remaining PSTR() from web.cpp

606b582

Clear interval timer when reseting

16c40fa

fix race condition documented in ratgdo#130

eeafa9a

Merge pull request #2 from dkerr64/client-error-handling

7ca6f84

Client error handling

jgstroud changed the title ~~Disable WPS from using 4k of stack~~ Stability Improvements Mar 18, 2024

Add a commit ID to the HomeKit library

1bb4f91

Treat more like a git submodule and add a specific commit ID and don't just pull from HEAD

Timeout on no client connecting to SSEHandler. Solves 2nd problem in r…

d7b4f3c

…atgdo#130

Fix buffer overflow that caused wifi provisioning to fail ratgdo#15

ceb2d1d

Merge pull request #3 from dkerr64/client-error-handling

10457d7

Timeout on no client connecting to SSEHandler. Solves 2nd problem in ratgdo#130

jgstroud merged commit 8aaf5f6 into ratgdo:main Mar 19, 2024

This was referenced Mar 19, 2024

Improv provisioning stalls #15

Closed

Wifi times out #36

Closed

Door shows as "No Response" in HomeKit until rebooted #94

Closed

Can't connect to WiFi after v0.10.0 OTA update #103

Closed

Server Sent Events heartbeat race condition #130

Closed

jgstroud deleted the stability_debug branch May 1, 2024 20:45
