Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashing in SonOff Pow R2 v2.0 (v 1.13.5) #1752

Closed
JavierAder opened this issue Jun 2, 2019 · 27 comments
Closed

Crashing in SonOff Pow R2 v2.0 (v 1.13.5) #1752

JavierAder opened this issue Jun 2, 2019 · 27 comments
Labels

Comments

@JavierAder
Copy link

Hi, my SonOff Pow R2 v2.0 with Espurna 1.13.5 (binay release) is restarting after a certain time.

I only have enabled ThingSpeak (with Clear Cache, and reporting the 7 sensors, current, voltage, etc) and Telnet in STA mode (without this last one also happens). Some screenshots of various telnet commands:
Crash-uptime-info
Crash-uptime-info-2
Crash-uptime-info-3

Any suggestions?

@mcspr
Copy link
Collaborator

mcspr commented Jun 6, 2019

Can you upgrade to the dev version? See https://github.com/mcspr/espurna-nightly-builder/releases
(and please post crashes as text, if they happen again:)
Did it crash on it's own or after accessing webui?

May be related to the #1720, but I am trending towards increased loop delay as a solution for this if it is only related to normal operation. current 1.13.6-dev should space things out more evenly. Another option is that tspk queue eventually fragments the memory (flush method), so webui can crash things.

@JavierAder
Copy link
Author

Hi mcspr; I will try (but I am having problems building under Windows; ok, it's another subject).

Two things:

  • power : 3138 mV. Is not it too low? Maybe I have a problem hardware and the micro is simply resetting due to low voltage....
  • Does Thingspeak use SSL? I think I read that SSL can cause instability, but I'm not sure

@mcspr
Copy link
Collaborator

mcspr commented Jun 11, 2019

  • based on some brief research, I would not trust that voltage value too much. I'm more inclined to treat this a sw bug first
  • no. it uses plain http, unless you built the firmware yourself and included async ssl libraries, it actually can't open secure connections

@JavierAder
Copy link
Author

Offtopic: Finally I was able to build the current dev version (commit 0998110; task Build (itead-sonoff-pow-r2)). Now what? Can I simply use esptool to flash firmware.bin (generated in .pioenvs/itead-sonoff-pow-r2/) in my Sonoff Pow R2? Is this correct?

@JavierAder
Copy link
Author

Hi,
"Did it crash on it's own or after accessing webui?"
Crash on it's own, in SAT mode, only thingspeak enabled. The first time that I realized that was because the energy plot in thingspeak was droping to zero after some time; now, it doesn't happen with you build... but when I look the uptime via telnet it seems like is crashing.

info after 24 hours running:

Welcome!
uptime
[919363] Uptime: 17919 seconds
[919364] +OK
info
[924440]

---8<-------

[924441] [MAIN] ESPURNA 1.13.6-dev (0998110)
[924441] [MAIN] xose.perez@gmail.com
[924441] [MAIN] http://tinkerman.cat

[924441] [MAIN] CPU chip ID: 0x3D64E1
[924442] [MAIN] CPU frequency: 80 MHz
[924442] [MAIN] SDK version: 1.5.3(aec24ac9)
[924442] [MAIN] Core version: 2.3.0
[924442] [MAIN] Core revision: 159542381
[924442] [MAIN] Build time: 1560147312
[924443]
[924443] [MAIN] Flash chip ID: 0x1640EF
[924444] [MAIN] Flash speed: 40000000 Hz
[924444] [MAIN] Flash mode: DOUT
[924444]
[924445] [MAIN] Flash size (CHIP)   :  4194304 bytes / 1024 sectors (   0 to 1023)
[924447] [MAIN] Flash size (SDK)    :  4194304 bytes / 1024 sectors (   0 to 1023)
[924447] [MAIN] Reserved            :     4096 bytes /    1 sectors (   0 to    0)
[924447] [MAIN] Firmware size       :   494816 bytes /  121 sectors (   1 to  121)
[924448] [MAIN] Max OTA size        :   528384 bytes /  129 sectors ( 122 to  250)
[924448] [MAIN] EEPROM size         :     4096 bytes /    1 sectors ( 251 to  251)
[924448] [MAIN] Reserved            :    16384 bytes /    4 sectors ( 252 to  255)
[924448]
[924449] [MAIN] EEPROM sectors: 1019, 1018, 1017, 1016
[924449] [MAIN] EEPROM current: 1018
[924449]
[924450] [MAIN] EEPROM:  4096 bytes initially |  1373 bytes used (33%) |  2723 bytes free (66%)
[924451] [MAIN] Heap  : 34552 bytes initially | 18632 bytes used (53%) | 15920 bytes free (46%)
[924451] [MAIN] Stack :  4096 bytes initially |  1272 bytes used (31%) |  2824 bytes free (68%)
[924452]
[924453] [MAIN] Boot version: 31
[924453] [MAIN] Boot mode: 1
[924453] [MAIN] Last reset reason: Software/System restart
[924453] [MAIN] Last reset info: Fatal exception:0 flag:4 (SOFT_RESTART) epc1:0x00000000 epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000000 depc:0x00000000
[924453]
[924454] [MAIN] Board: ITEAD_SONOFF_POW_R2
[924454] [MAIN] Support: ALEXA API BROKER BUTTON DEBUG_TELNET DEBUG_WEB DOMOTICZ HOMEASSISTANT LED MDNS_SERVER MQTT NTP SCHEDULER SENSOR TELNET TERMINAL THINGSPEAK WEB
[924454] [MAIN] Sensors: CSE7766
[924455] [MAIN] WebUI image: SENSOR
[924455]
[924455] [MAIN] Firmware MD5: 33d87223d88c2541fc8ac8308beeecdf
[924456] [MAIN] Power: 3138 mV
[924456] [MAIN] Power saving delay value: 10 ms
[924456]


---8<-------

[068503] +OK

crash
[043993] [DEBUG] Latest crash was at 3967006 ms after boot
[043994] [DEBUG] Reason of restart: 2
[043994] [DEBUG] Exception cause: 28
[043994] [DEBUG] epc1=0x4021ac1f epc2=0x00000000 epc3=0x00000000
[043994] [DEBUG] excvaddr=0x00000000 depc=0x00000000
[043994] [DEBUG] sp=0x3ffffe70 end=0x3fffffb0
[043994] [DEBUG] >>>stack>>>
[DEBUG] 3ffffe70: 000005a0 00000000 3fff7354 4021ac34
[DEBUG] 3ffffe80: 00000000 3fff733c 00000000 402128b2
[DEBUG] 3ffffe90: 3fff7354 3fff7434 00000000 4021acb0
[DEBUG] 3ffffea0: 3fff7284 00000000 3fff7284 402433fc
[DEBUG] 3ffffeb0: 3fff2afc 3fff2b08 00000006 3fff2910
[DEBUG] 3ffffec0: 3fff7284 3fff2918 3fff2914 4021acfb
[DEBUG] 3ffffed0: 3fff7284 3fff2918 3fff2914 40243f78
[DEBUG] 3ffffee0: 000002a7 6764a8c0 00000018 000002a7
[DEBUG] 3ffffef0: 00000018 00000000 3fff1204 401075fc
[DEBUG] 3fffff00: 40100000 00000000 00000064 3fff2b00
[DEBUG] 3fffff10: 3ffed154 3fff2b08 3fff7434 402427fd
[DEBUG] 3fffff20: 00754208 3fff69e4 3fff69e4 3ffeff90
[DEBUG] 3fffff30: 00000000 3fff7434 0000001c 3fff69e4
[DEBUG] 3fffff40: 3ffed146 00000000 3fff7434 40241879
[DEBUG] 3fffff50: 17561634 00000303 00000000 00000013
[DEBUG] 3fffff60: 00000002 00000018 40231063 3ffee980
[DEBUG] 3fffff70: 3ffed120 3fffdcc0 3ffeb070 3ffeb070
[DEBUG] 3fffff80: 40230fd6 3ffee980 00000000 3fff6a34
[DEBUG] 3fffff90: 3fffdc80 00000000 3fff7434 402394f3
[DEBUG] 3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49
[DEBUG] <<<stack<<<
[044053] +OK

17919 seconds= 5 hours.....

I don't understand why if it is crashing energy data is not resetting; was there some change related to that between 1.13.5 and your build?
energia-no-reseteandose

@JavierAder
Copy link
Author

Well.... It seems that it is not crashing (with your build), but uptime is wrong (maybe there is a overflow in uptime). I did switch off (unplug) my sonoff and then swith on, and energy IS resetted...
energia-reseteandose-luego-de-apagar

@mcspr
Copy link
Collaborator

mcspr commented Jun 13, 2019

Since 820d8c4 it will keep energy value in RAM between reboots. If you power-off, it will try to recover energy value from flash (if you enabled that option)

Note that crash will output the latest crash. And it will not invalidate it until read, it may stay there for a long time. info message shows that you restarted manually, no exception. so crash log may be from some other time before that

What do you mean by uptime overflow? Does it stay the same and not increasing between the times you checked?

@JavierAder
Copy link
Author

"Since 820d8c4 it will keep energy value in RAM between reboots. If you power-off, it will try to recover energy value from flash (if you enabled that option)"
Ok, then reboot and power-off/power-on are differents things.

"What do you mean by uptime overflow? Does it stay the same and not increasing between the times you checked?"

That maybe the variable storing the uptime was too small, but it is not the case ....

Ok, maybe what is wrong is the command uptime:
In terminal.ino


    terminalRegisterCommand(F("UPTIME"), [](Embedis* e) {
        DEBUG_MSG_P(PSTR("Uptime: %d seconds\n"), getUptime());

%d? Is it correct? I'm not sure tha it makes any difference but in other parts of the code is used %lu for debug unsigned longs.

@JavierAder
Copy link
Author

after 1 day and 16 hours running

Welcome!
uptime
[439945] Uptime: 2439 seconds
[439946] +OK
info
[442176]

---8<-------

[442177] [MAIN] ESPURNA 1.13.6-dev (0998110)
[442177] [MAIN] xose.perez@gmail.com
[442177] [MAIN] http://tinkerman.cat

[442178] [MAIN] CPU chip ID: 0x3D64E1
[442178] [MAIN] CPU frequency: 80 MHz
[442178] [MAIN] SDK version: 1.5.3(aec24ac9)
[442178] [MAIN] Core version: 2.3.0
[442179] [MAIN] Core revision: 159542381
[442179] [MAIN] Build time: 1560147312
[442179]
[442179] [MAIN] Flash chip ID: 0x1640EF
[442180] [MAIN] Flash speed: 40000000 Hz
[442180] [MAIN] Flash mode: DOUT
[442180]
[442181] [MAIN] Flash size (CHIP)   :  4194304 bytes / 1024 sectors (   0 to 1023)
[442182] [MAIN] Flash size (SDK)    :  4194304 bytes / 1024 sectors (   0 to 1023)
[442182] [MAIN] Reserved            :     4096 bytes /    1 sectors (   0 to    0)
[442182] [MAIN] Firmware size       :   494816 bytes /  121 sectors (   1 to  121)
[442182] [MAIN] Max OTA size        :   528384 bytes /  129 sectors ( 122 to  250)
[442183] [MAIN] EEPROM size         :     4096 bytes /    1 sectors ( 251 to  251)
[442183] [MAIN] Reserved            :    16384 bytes /    4 sectors ( 252 to  255)
[442183]
[442184] [MAIN] EEPROM sectors: 1019, 1018, 1017, 1016
[442184] [MAIN] EEPROM current: 1016
[442184]
[442184] [MAIN] EEPROM:  4096 bytes initially |  1373 bytes used (33%) |  2723 bytes free (66%)
[442185] [MAIN] Heap  : 34552 bytes initially | 18488 bytes used (53%) | 16064 bytes free (46%)
[442186] [MAIN] Stack :  4096 bytes initially |  1168 bytes used (28%) |  2928 bytes free (71%)
[442187]
[442187] [MAIN] Boot version: 31
[442187] [MAIN] Boot mode: 1
[442188] [MAIN] Last reset reason: Software/System restart
[442188] [MAIN] Last reset info: Fatal exception:0 flag:4 (SOFT_RESTART) epc1:0x00000000 epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000000 depc:0x00000000
[442188]
[442189] [MAIN] Board: ITEAD_SONOFF_POW_R2
[442189] [MAIN] Support: ALEXA API BROKER BUTTON DEBUG_TELNET DEBUG_WEB DOMOTICZ HOMEASSISTANT LED MDNS_SERVER MQTT NTP SCHEDULER SENSOR TELNET TERMINAL THINGSPEAK WEB
[442189] [MAIN] Sensors: CSE7766
[442189] [MAIN] WebUI image: SENSOR
[442189]
[442190] [MAIN] Firmware MD5: 33d87223d88c2541fc8ac8308beeecdf
[442191] [MAIN] Power: 3139 mV
[442191] [MAIN] Power saving delay value: 10 ms
[442191]

---8<-------

[442191] +OK
crash
[445467] [DEBUG] Latest crash was at 28247293 ms after boot
[445468] [DEBUG] Reason of restart: 2
[445469] [DEBUG] Exception cause: 28
[445469] [DEBUG] epc1=0x4021ac1f epc2=0x00000000 epc3=0x00000000
[445469] [DEBUG] excvaddr=0x00000000 depc=0x00000000
[445469] [DEBUG] sp=0x3ffffe70 end=0x3fffffb0
[445469] [DEBUG] >>>stack>>>
[DEBUG] 3ffffe70: 000002d3 3fff2b10 3fff71fc 4021ac34
[DEBUG] 3ffffe80: 00000000 3fff7194 00000000 402128b2
[DEBUG] 3ffffe90: 3fff71fc 3fff76fc 00000000 4021acb0
[DEBUG] 3ffffea0: 3fff75b4 00000000 3fff75b4 402433fc
[DEBUG] 3ffffeb0: 3fff2afc 3fff2b08 00000006 3fff2910
[DEBUG] 3ffffec0: 3fff75b4 3fff2918 3fff2914 4021acfb
[DEBUG] 3ffffed0: 3fff75b4 3fff2918 3fff2914 40243f78
[DEBUG] 3ffffee0: 000002a7 6764a8c0 00000018 000002a7
[DEBUG] 3ffffef0: 00000018 00000000 3ffef360 401075fc
[DEBUG] 3fffff00: 40100000 00000000 00000064 3fff2b00
[DEBUG] 3fffff10: 3ffeb844 3fff2b08 3fff76fc 402427fd
[DEBUG] 3fffff20: 013a4208 3fff69e4 3fff69e4 3ffeff90
[DEBUG] 3fffff30: 00000000 3fff76fc 0000001c 3fff69e4
[DEBUG] 3fffff40: 3ffeb836 00000000 3fff76fc 40241879
[DEBUG] 3fffff50: 17561634 00000303 00000000 0000000b
[DEBUG] 3fffff60: 00000002 00000018 40231063 3ffee980
[DEBUG] 3fffff70: 3ffeb810 3fffdcc0 3ffeafa8 3ffeafa8
[DEBUG] 3fffff80: 40230fd6 3ffee980 00000000 3fff6a3c
[DEBUG] 3fffff90: 3fffdc80 00000000 3fff76fc 402394f3
[DEBUG] 3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49
[DEBUG] <<<stack<<<
[445532] +OK


@mcspr
Copy link
Collaborator

mcspr commented Jun 15, 2019

Thanks! Seems like something in async client... Need to take a closer look at thingspeak sender

@JavierAder
Copy link
Author

Hi, platform,ini references to the async library versión 1.1.3
https://github.com/me-no-dev/ESPAsyncTCP#55cd520
in my .piolibdeps there are this version and the 1.2.0 (I don't know why)

What is the correct version to use?

@JavierAder
Copy link
Author

Also; if you can, checks this commit:
me-no-dev/ESPAsyncTCP@0450e61
it fixes something that throws a exception 28

@mcspr
Copy link
Collaborator

mcspr commented Jun 18, 2019

You can use --verbose flag to see which espasynctcp source directory it chooses.
me-no-dev/ESPAsyncTCP@0450e61 looks like a solution, have you tried firmware with this patch applied?

@JavierAder
Copy link
Author

Not yet..,. Right now I am trying dump objects from firmware.elf to figure what code trhrows the exception; it is located in addres 0x4021ac1f in the two last crashes (using your build), but I am a newby in ESP8266...
I read in "Kolban’s book on the ESP8266"

If we know the location of the exception, we can analyze the executable (app.out) to
figure out what piece of code caused the problem. For example:
xtensa-lx106-elf-objdump -x app.out -d

Is in PlatformIO some similar tool?

@mcspr
Copy link
Collaborator

mcspr commented Jun 18, 2019

There is, see addr2line util from toolchain.
Located either in C:\.platformio or C:\Users\You\.platformio, \packages\toolchain-xtensa*\bin\
for example, xtensa-lx106-elf-addr2line.exe -f -e <.elf> <addr>

Or you can use https://github.com/me21/EspArduinoExceptionDecoder to decode multiline stack trace saved in text file

@JavierAder
Copy link
Author

Ok


C:\Users\Javier\.platformio\packages\toolchain-xtensa\bin>xtensa-lx106-elf-addr2line.exe -f -e C:\espurna-flash\firmware.elf
0x4021ac1f
_ZN11AsyncClient5_recvEP7tcp_pcbP4pbufa
c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:665

in ESPAsyncTCP.cpp:665 there is a blank line between two methods.....

Ok, but it command was runnig using my firmware.elf; and my crash log is based on your build.

In any case, as soon as I have some time, I will try build and test the lastest dev version using the following fix:

in platformio.ini
replace:
https://github.com/me-no-dev/ESPAsyncTCP#55cd520
with
https://github.com/me-no-dev/ESPAsyncTCP#7e9ed22

that is; upgrade espAsyncTcp to the version 1.2.0

@JavierAder
Copy link
Author

Ok, finally I'm ready to test the fix; I run the task Build: itead-sonoff-pow-r2, but I see something odd:
in platformio.ini is read

[env:itead-sonoff-pow-r2]
platform = ${common.platform}
framework = ${common.framework}
board = ${common.board_1m}
board_build.flash_mode = ${common.flash_mode}
lib_deps = ${common.lib_deps}
lib_ignore = ${common.lib_ignore}
build_flags = ${common.build_flags_1m0m} -DITEAD_SONOFF_POW_R2 -DDISABLE_POSTMORTEM_STACKDUMP
monitor_speed = ${common.monitor_speed}
extra_scripts = ${common.extra_scripts}


It seems like a board with 1MBytes of flash, but my sonoff has 4MBytes


C:\ProgramData\Anaconda3\Lib\site-packages>python esptool.py --baud 115200 chip_id
esptool.py v2.6
Found 1 serial ports
Serial port COM8
Connecting....
Detecting chip type... ESP8266
Chip is ESP8266EX
Features: WiFi
MAC: xxxx
Uploading stub...
Running stub...
Stub running...
Chip ID: 0x003d64e1
Hard resetting via RTS pin...

C:\ProgramData\Anaconda3\Lib\site-packages>python esptool.py --baud 115200 flash_id
esptool.py v2.6
Found 1 serial ports
Serial port COM8
Connecting....
Detecting chip type... ESP8266
Chip is ESP8266EX
Features: WiFi
MAC: xxxx
Uploading stub...
Running stub...
Stub running...
Manufacturer: ef
Device: 4016
Detected flash size: 4MB
Hard resetting via RTS pin...

Can this bring problems?

@mcspr
Copy link
Collaborator

mcspr commented Jun 22, 2019

Not big ones, and more on the technical side.
There is less OTA space, and you need to flash 4MB variant manually for that to work because updater will think that you can't write past 1MB.
Another not obvious thing is that setting are placed at the end of the flash, since eeprom_rotate library does not check for real addresses:
https://github.com/xoseperez/eeprom_rotate/blob/377935a85f005d49b66de569204cec86246999c1/src/EEPROM_Rotate.cpp#L68

// If the memory layout has more than one sector reserved use those,
// otherwise calculate pool size based on memory size.
if (EEPROMr.size() == 1) {
if (EEPROMr.last() > 1000) { // 4Mb boards
EEPROMr.size(4);
} else if (EEPROMr.last() > 250) { // 1Mb boards
EEPROMr.size(2);
}
}

@JavierAder
Copy link
Author

Well, that is really weird; I am using ESPAsyncTCP to version 1.2.0, and anyway there is the same exception..

Ok, in the trace appears ws.ino; ws.ino is related to ESP Async Webserver; is it possible that Esp Async Webserver get linked to the old ESPAsyncTCP? The version in use (commit 05306e4; version 1.1.0) don't use library.json....
What about try also update ESP Async Webserver to version 1.2.0? (commit 35adb4d; https://github.com/me-no-dev/ESPAsyncWebServer/tree/35adb4dcb44f565dadfb22abcd2e5ee113ad18e8 )

The trace:


Exception (28):
epc1=0x4021abf3 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000
ctx: cont
sp: 3ffffe70 end: 3fffffb0 offset: 0
>>>stack>>>
3ffffe70: 0000025e 00000000 3fff773c 4021ac0d
3ffffe80: 00000000 3fff6714 00000000 40212886
3ffffe90: 3fff773c 3fff75f4 00000000 4021ac88
3ffffea0: 3fff7bac 00000000 3fff7bac 402433e8
3ffffeb0: 3fff2f4c 3fff2f58 00000006 3fff2d60
3ffffec0: 3fff7bac 3fff2d68 3fff2d64 4021acd3
3ffffed0: 3fff7bac 3fff2d68 3fff2d64 40243f64
3ffffee0: 000002a7 6764a8c0 00000018 000002a7
3ffffef0: 00000018 00000000 3ffeb4b8 401075fc
3fffff00: 00000000 3ffecf83 3ffecf69 3fff2f50
3fffff10: 3ffecf50 3fff2f58 3fff75f4 4024246d
3fffff20: 00754208 3fff6e34 3fff6e34 3fff03d0
3fffff30: 00000000 3fff75f4 0000001c 3fff6e34
3fffff40: 3ffecf42 00000000 3fff75f4 40241865
3fffff50: 6babe222 00000303 00000000 0000000c
3fffff60: 00000002 00000018 40231053 3ffeede8
3fffff70: 3ffecf1c 3fffdcc0 3ffeb4c0 3ffeb4c0
3fffff80: 40230fc6 3ffeede8 00000000 3fff6e84
3fffff90: 3fffdc80 00000000 3fff75f4 402394e3
3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49
<<<stack<<<

decoder ;

C:\EspArduinoExceptionDecoder>python decoder.py -p ESP8266 -t
C:\Users\Javier\.platformio\packages\toolchain-xtensa\ -e "C:\espurna-flash\ESPAsyncTCP\firmware.elf"  trace-post-fix.txt 
Exception: 28 (LoadProhibited: A load referenced a page mapped with an attribute that does not permit loads)

epc1:     0x4021abf3: AsyncClient::close(bool) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
epc2:     0x00000000
epc3:     0x00000000
excvaddr: 0x00000000
depc:     0x00000000

ctx: cont

sp:       0x3ffffe70
end:      0x3fffffb0
offset:   0x00000000

stack:
0x4021ac0d: AsyncClient::close(bool) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
0x40212886: operator() at C:/Users/Javier/Documents/PlatformIO/Projects/xoseperez-espurna/espurna/code/espurna/ws.ino:528
  \-> inlined by: _M_invoke at c:\users\javier\.platformio\packages\toolchain-xtensa\xtensa-lx106-elf\include\c++\4.8.2/functional:2071
0x4021ac88: AsyncClient::_recv(tcp_pcb*, pbuf*, signed char) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
0x402433e8: tcp_receive at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/tcp_in.c:1369 (discriminator 1)
0x4021acd3: AsyncClient::_s_recv(void*, tcp_pcb*, pbuf*, signed char) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
0x40243f64: tcp_input at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/tcp_in.c:394 (discriminator 1)
0x401075fc: pvPortMalloc at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/heap.c:13
0x4024246d: ip_input at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/ipv4/ip.c:559
0x40241865: ethernet_input at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/netif/etharp.c:1379
0x40231053: pp_tx_idle_timeout at ??:?
0x40230fc6: pp_tx_idle_timeout at ??:?
0x402394e3: system_get_sdk_version at ??:?

(my binary and elf in zip
firmware-ESPAsyncTCP-1.2.0.zip
)

,

@mcspr
Copy link
Collaborator

mcspr commented Jun 23, 2019

I see different things on the top:

0x4021ac0d: AsyncClient::close(bool) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:207
0x40212886: operator() at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/C:/Users/Javier/Documents/PlatformIO/Projects/xoseperez-espurna/espurna/code/espurna/thinkspeak.ino:127
  \-> inlined by: _M_invoke at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/c:\users\javier\.platformio\packages\toolchain-xtensa\xtensa-lx106-elf\include\c++\4.8.2/functional:2071
0x4021ac88: AsyncClient::_recv(tcp_pcb*, pbuf*, signed char) at 

Note that it does not build the other ESPAsyncTCP_ID305 directory. You can make sure of proper dependencies by running pio with --verbose flag and checking out static archives in .pioenvs/itead-sonoff-pow-r2/lib*. For example, when looking at strings .pioenvs/itead-sonoff-pow-r2/libeb2/libESPAsyncTCP.a I see .piolibdeps/ESPAsyncTCP/src/ESPAsyncTCP.cpp in the listing, which is the proper ESPAsyncTCP library sourcetree, with .git/HEAD pointing at the exact commit specified in the lib_deps. Same with ESPAsyncWebServer archive.
You can try it out, but I don't see anything related in web server :/ Only things that stands out is that espasynctcp change.

Some things that sort of related and also worth checking out:

  • changing close(true) to just close()
  • changing _tspk_client->... calls inside lambdas to c->... (since we recreate it from time to time)
  • questioning espasyncwebserver.... maybe removing DEBUG_SEND calls.
    and btw is the crash happening while web interface is open or when opening it?

edit: fix paths, i am running pio-core from git

@JavierAder
Copy link
Author

'I see different things on the top:'
mmm... your output has more sense; probably I have an old version of toolchain-xtensa; in package.json is read "version": "1.40802.0"; it's resolving address to code incorrectly (for example: in ESPAsyncTCP.cpp:670 is not method close). What version do you have?

'changing close(true) to just close()'
do you mean "to just close(false)"?

'and btw is the crash happening while web interface is open or when opening it?'
I don't use web interface; I get traces via telnet. Anyway, I think is possible that following is happening: Espurna is mode STA (not web server), my wifi router is restarted (it fails over the time), Espurna loses connection, reboots in mode AP, load the server web, and THEN excpetion occurs. (later I will try check that swtiching off and on the router and seeing if exception occurs).

Well what version ESPAsyncTCP gets linked? I am now sure that it is the correct; I run xtensa-lx106-elf-objdump with option -S; and look in 0x4021abf3 (with my interpretation of assembler in comments...):


4021abec <_ZN11AsyncClient5closeEb>:
4021abec:	f0c112        	addi	a1, a1, -16
4021abef:	21c9      	s32i.n	a12, a1, 8
4021abf1:	02cd      	mov.n	a12, a2     //exception is thrown here (epc1=0x4021abf3)
4021abf3:	0228      	l32i.n	a2, a2, 0   //or here; it probrably load in register a2 the value of _pcb
4021abf5:	11d9      	s32i.n	a13, a1, 4
4021abf7:	036102        	s32i	a0, a1, 12
4021abfa:	74d030        	extui	a13, a3, 0, 8
4021abfd:	005216        	beqz	a2, 4021ac06 <_ZN11AsyncClient5closeEb+0x1a> //that seems "if(_pcb)"; that check exists only in version 1.2.0
4021ac00:	5e1c32        	l16ui	a3, a12, 188
4021ac03:	243dc5        	call0	4023efe0 <tcp_recved>  //calls to tcp_recved(_pcb, _rx_ack_len);
4021ac06:	6d8c      	beqz.n	a13, 4021ac10 <_ZN11AsyncClient5closeEb+0x24>  // "if(now)"
4021ac08:	0c2d      	mov.n	a2, a12
4021ac0a:	fff245        	call0	4021ab30 <_ZN11AsyncClient6_closeEv> //calls to _close
4021ac0d:	000106        	j	4021ac15 <_ZN11AsyncClient5closeEb+0x29>
4021ac10:	120c      	movi.n	a2, 1
4021ac12:	ac4c22        	s8i	a2, a12, 172
4021ac15:	3108      	l32i.n	a0, a1, 12
4021ac17:	21c8      	l32i.n	a12, a1, 8
4021ac19:	11d8      	l32i.n	a13, a1, 4
4021ac1b:	10c112        	addi	a1, a1, 16
4021ac1e:	f00d      	ret.n

Ok, guessing (I don't know almost nothing about xtensa architecture) : I think that problem is not a "null access"; the problem is a more general virtual address exception; for some reason, _pcb gets stored in a invalid address and then l32i.n fails ....

@mcspr
Copy link
Collaborator

mcspr commented Jun 23, 2019

What version do you have?

Aha! I completely forgot that I installed toolchain from the platform-espressif8266@2.2.2, you might want to run some project with it as a platform. There would be toolchain-xtensa (which I used, correct paths) and old toolchain-xtensa@1.40802.0 (broken paths)

do you mean "to just close(false)"?

Yes. false is the default, so either way. With it, asynctcp waits a bit to trigger disconnect using it's internal loop

This WiFi disconnection thing also came up in #1778
But it is not something I have ever encountered, even triggering soft crash on my router or removing power from it. My only hunch is that connection is not properly closed, or something gets mixed up when AP is created. It can also be related to the SDK methods turning off WiFi, maybe it crashes with scanning on? (wifiScan 1 or 0 for on and off)

Regarding the pcb, it can also be just garbage at this point. It was not nulled but was destroyed, so we can't use that pointer anymore.

@JavierAder
Copy link
Author

" maybe it crashes with scanning on? "
In my case, scanning is on; I will try turning off, but I don't know in what manner that affects to the normal operation of Espurna.

"Regarding the pcb, it can also be just garbage at this point."
I retract; it seems be a "null access", because excvaddr is 0x00000000.

Anyway, something like "if(_pcb)" should "never" trigger an exception.

@mcspr
Copy link
Collaborator

mcspr commented Jun 26, 2019

In my case, scanning is on; I will try turning off, but I don't know in what manner that affects to the normal operation of Espurna.

Only because of WiFi.disconnect() call in scanner function. Given some weird nature of async callbacks and sdk interactions.

Anyway, something like "if(_pcb)" should "never" trigger an exception.

It shouldn't. My only explanation left is that we should not re-create AsynClient instance. The destructor does nothing and _pcb is actually this->_pcb. Can this be null? :ups

@JavierAder
Copy link
Author

It shouldn't. My only explanation left is that we should not re-create AsynClient instance.

I think exactly the same: 'this' is null. In assembler the pointer to the object is passed using register a2 (a1 is the stack, a3 is the first parameter of a method, in this case "now"); _pcb is the first member of object; then for calculate is _pcb is not null (for "if (_pcb)") the following instruction is executed:

4021abf3: 0228 l32i.n a2, a2, 0

it loads in a2 the value of memory in address a2 + 0 (it try load _pcb for the next "if" and ,because it is the first member of the object, only add and offset 0); and throws an excepction because a2+0 = 0 (ESP dont let you do a load from low addresses). It is not that _pcb is null, this is null!
(the check "if(_pcb)" anyway is important)

Probably the problem is thinkspeak.ino


void _tspkPost(String data) {

    if (_tspk_client == NULL) {
        _tspk_client = new AsyncClient();
    }

    _tspk_client->onDisconnect([](void *s, AsyncClient *c) {
        DEBUG_MSG_P(PSTR("[THINGSPEAK] Disconnected\n"));
        _tspk_client->free();
        delete _tspk_client;
        _tspk_client = NULL;
    }, 0);

    _tspk_client->onTimeout([](void *s, AsyncClient *c, uint32_t time) {
        _tspk_client->close(true);
    }, 0);

    _tspk_client->onData([](void * arg, AsyncClient * c, void * response, size_t len) {

        char * b = (char *) response;
        b[len] = 0;
        char * p = strstr((char *)response, "\r\n\r\n");
        unsigned int code = (p != NULL) ? atoi(&p[4]) : 0;
        DEBUG_MSG_P(PSTR("[THINGSPEAK] Response value: %d\n"), code);

        _tspk_last_flush = millis();
        if ((0 == code) && (--_tspk_tries > 0)) {
            _tspk_flush = true;
            DEBUG_MSG_P(PSTR("[THINGSPEAK] Re-enqueuing\n"));
        } else {
            _tspkClearQueue();
        }

        _tspk_client->close(true);

    }, NULL);

    _tspk_client->onConnect([data](void * arg, AsyncClient * client) {

        DEBUG_MSG_P(PSTR("[THINGSPEAK] Connected to %s:%d\n"), THINGSPEAK_HOST, THINGSPEAK_PORT);

        #if THINGSPEAK_USE_SSL
            uint8_t fp[20] = {0};
            sslFingerPrintArray(THINGSPEAK_FINGERPRINT, fp);
            SSL * ssl = _tspk_client->getSSL();
            if (ssl_match_fingerprint(ssl, fp) != SSL_OK) {
                DEBUG_MSG_P(PSTR("[THINGSPEAK] Warning: certificate doesn't match\n"));
            }
        #endif

        DEBUG_MSG_P(PSTR("[THINGSPEAK] POST %s?%s\n"), THINGSPEAK_URL, data.c_str());

        char buffer[strlen_P(THINGSPEAK_REQUEST_TEMPLATE) + strlen(THINGSPEAK_URL) + strlen(THINGSPEAK_HOST) + data.length()];
        snprintf_P(buffer, sizeof(buffer),
            THINGSPEAK_REQUEST_TEMPLATE,
            THINGSPEAK_URL,
            THINGSPEAK_HOST,
            data.length(),
            data.c_str()
        );

        client->write(buffer);

    }, NULL);

    #if ASYNC_TCP_SSL_ENABLED
        bool connected = _tspk_client->connect(THINGSPEAK_HOST, THINGSPEAK_PORT, THINGSPEAK_USE_SSL);
    #else
        bool connected = _tspk_client->connect(THINGSPEAK_HOST, THINGSPEAK_PORT);
    #endif

    if (!connected) {
        DEBUG_MSG_P(PSTR("[THINGSPEAK] Connection failed\n"));
        _tspk_client->close(true);
    }

}

" _tspk_client = NULL; " is very suggestive... The other handlers assumes that it is not null, but who knows when and how that events are raised...

I think that there are two possible ways:

  • never destroy _tspk_client and instead re-use it (probably that is possible, but I not totally sure)
  • never access to _tspk_client without first verifying if it is not null; that is, simply add
    'if (!_tspk_client) return;
    at the beginning of the handlers.

I will try the last.

PS: by the way, telnet.ino is probably doing similar things

@JavierAder
Copy link
Author

Hi, I saw your pull request; as soon as I have some time I will try it.
Anyway, I comment you what happen with my second idea:

  • effectively, null access disappeared
  • Thinkspeak seems to lose the state of the connection; as if under certain circumstances onConnect and onDisconnect was never called (I think that is related to dns but it's just a hypothesis)
  • but, load/store exception appears (although much less frequently), and in this case related to de desctructor of AsyncClient; exceptions in code of memory manager at the time of releasing memory... I really do not know if it has to do with thinkspeak because AsyncClients are used and deleted in other parts of Espurna (telnet,ino for example). By the way, I seems to me something really odd that a memory release can throw exceptions but this seems to be the case... memory fragmentation maybe? or a bug in memory manager in core 2.3.0? I don't know.

I put the trace, maybe it's helpful:

Exception (3):
epc1=0x4010011d epc2=0x00000000 epc3=0x00000000 excvaddr=0x40033108 depc=0x00000000
ctx: sys
sp: 3ffffe50 end: 3fffffb0 offset: 0
>>>stack>>>
3ffffe50: 3fff2774 000007da 000007da 4010020c
3ffffe60: 3fff2774 00000034 00000034 4010020c
3ffffe70: 3fff2c44 3fff6e34 3fff6eb0 4010068c
3ffffe80: 00000000 00000000 3fff0d74 4022d0cc
3ffffe90: 3ffffeb0 00000000 3fff6e34 40203d07
3ffffea0: 4000050c 00000000 3fff6e34 4022da9c
3ffffeb0: 00000000 3fff6e34 3fff6e34 4021abd3
3ffffec0: 00000000 3fff2c40 3fff0d74 40212800
3ffffed0: feefeffe 3fff2c40 3fff6e34 4021aac6
3ffffee0: ffffffff 3ffeb1b4 3fff26d0 4021ae48
3ffffef0: 00000000 3fffdad0 3fff164c 00000030
3fffff00: 00000000 3ffed5c7 3fff27d0 4021ae60
3fffff10: 3fff2774 000006e9 000006e9 4023e1f0
3fffff20: 3fff2c44 3fff2c40 478fa616 4023e2b6
3fffff30: 00000000 478f7f2d 402305e7 40105680
3fffff40: 3ffef780 183257ab 60000600 40240b79
3fffff50: 40105674 00000000 00000001 4022d3a5
3fffff60: 40235fc0 3ffef768 3ffef780 60000600
3fffff70: 1832d454 3ffef780 3ffef768 40235fcd
3fffff80: 40236012 3fffdab0 00000000 3fffdcb0
3fffff90: 3ffef798 3fffdad0 3fff164c 4022d33f
3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49
<<<stack<<<

decoder output:


Exception: 3 (LoadStoreError: Processor internal physical address or data error during load or store)

epc1:     0x4010011d: umm_assimilate_up at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\umm_m
alloc/umm_malloc.c:1163
epc2:     0x00000000
epc3:     0x00000000
excvaddr: 0x40033108
depc:     0x00000000

ctx: cont

sp:       0x3ffffe50
end:      0x3fffffb0
offset:   0x00000000

stack:
0x4010020c: _umm_free at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\umm_malloc/umm_malloc.c
:1287
0x4010020c: _umm_free at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\umm_malloc/umm_malloc.c
:1287
0x4010068c: free at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266\umm_malloc/umm_malloc.c:1733

0x4022d0cc: operator delete(void*) at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/abi.cpp:57

0x40203d07: _M_manager at C:/Users/Javier/Documents/PlatformIO/Projects/xoseperez-espurna/espurna/code/espurna/ws.ino:528
0x4022da9c: std::_Function_base::~_Function_base() at c:\users\javier\.platformio\packages\toolchain-xtensa\xtensa-lx106-elf\inclu
de\c++\4.8.2/functional:2031
0x4021abd3: ~function at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/
ESPAsyncTCP.cpp:670
  \-> inlined by: AsyncClient::~AsyncClient() at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.pio
libdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:102
0x40212800: operator() at C:/Users/Javier/Documents/PlatformIO/Projects/xoseperez-espurna/espurna/code/espurna/ws.ino:528
  \-> inlined by: _M_invoke at c:\users\javier\.platformio\packages\toolchain-xtensa\xtensa-lx106-elf\include\c++\4.8.2/functional
:2071
0x4021aac6: std::function<void (void*, AsyncClient*)>::operator()(void*, AsyncClient*) const at c:\Users\Javier\Documents\Platform
IO\Projects\xoseperez-espurna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
0x4021ae48: AsyncClient::_dns_found(ip_addr*) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-espurna\espurna\code/.pio
libdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
0x4021ae60: AsyncClient::_s_dns_found(char const*, ip_addr*, void*) at c:\Users\Javier\Documents\PlatformIO\Projects\xoseperez-esp
urna\espurna\code/.piolibdeps\ESPAsyncTCP\src/ESPAsyncTCP.cpp:670
0x4023e1f0: dns_check_entry at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/dns.c:680
0x4023e2b6: dns_check_entries at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/dns.c:727
  \-> inlined by: dns_tmr at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/dns.c:311
0x402305e7: pp_enable_noise_timer at ??:?
0x40105680: dns_timer at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/timers.c:237
0x40240b79: sys_check_timeouts at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/timers.c:420
0x40105674: dns_timer at /Users/igrokhotkov/espressif/arduino/tools/sdk/lwip/src/core/timers.c:233
0x4022d3a5: esp_schedule at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/core_esp8266_main.cp
p:90
0x40235fc0: ets_timer_handler_isr at ??:?
0x40235fcd: ets_timer_handler_isr at ??:?
0x40236012: ets_timer_handler_isr at ??:?
0x4022d33f: loop_task at C:\Users\Javier\.platformio\packages\framework-arduinoespressif8266\cores\esp8266/core_esp8266_main.cpp:1
30

and elf and bin
Espurna-ESPAsyncTCP-v2.zip

mcspr added a commit that referenced this issue Jul 7, 2019
Fix random resets when connection closes prematurely
@mcspr
Copy link
Collaborator

mcspr commented Jul 7, 2019

Closing via #1806

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants