Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

on random/unpredictable bases, OpenDTU slowing down, or stop at all, working ( live view, mqtt etc ) #836

Open
iomax opened this issue Apr 24, 2023 · 51 comments
Labels
bug Something isn't working

Comments

@iomax
Copy link

iomax commented Apr 24, 2023

What happened?

From time to time, on really hard to predict bases, sometimes after days, sometimes several time in an hour OpenDTU slowly down web response as mqtt publish as well.
Trigger a reset, if I'll be able to reach the "reset" page, seems able to restore the good working condition but I'll be not able to know if that will be for days, hours or minutes.
Most of time, but it will be not a must, event log report several "2 - DTU command failed" and/or "36 - Unknown" but unfortunately it's not clear to me if that could be the cause or just the effect.
Tried to slowing down polling interval ( DTU as MQTT as well), avoid live view page opening, lowering/raising NRF24 Transmitting power but nothing seems to provide any, positive or negative, change on this cases.
The feeling is that cloudy sky, so with unstable power production, could "help" the issues to start but it also happened at night so it seems difficult to think about it.

Attached the event log screen-shoot and below few rows that I was able to get from Virtual Debug Console few minutes, just during the last occurrence of the issue
Nuova nota

To Reproduce Bug

It happen on random basis and I'm unable to identify the path to reproduce it

Expected Behavior

Work without random stop/reset ? :-)

Install Method

Self-Compiled

What git-hash/version of OpenDTU?

c078e88

Relevant log/trace output

15:23:01.302 > Fetch inverter: 1121XXXXXXXX
15:23:01.630 > TX RealTimeRunData Channel: 75 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 B4 00 00 00 00 00 00 00 00 E9 3A CC 
15:23:01.699 > Interrupt received
15:23:01.725 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 54 03 83 0B ED 00 00 CC 15 04 BC 09 45 8B 
15:23:02.018 > Interrupt received
15:23:02.159 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 82 13 87 0B 63 00 02 00 7B 03 E8 01 1F 00 21 6D CE E5 
15:23:02.427 > RX Period End
15:23:02.427 > Success
15:23:06.389 > Fetch inverter: 1121XXXXXXXX
15:23:06.513 > TX RealTimeRunData Channel: 3 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 B9 00 00 00 00 00 00 00 00 79 62 09 
15:23:06.630 > Interrupt received
15:23:06.673 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 54 03 81 0B EA 00 00 CC 15 04 BC 09 31 FA 
15:23:06.745 > Interrupt received
15:23:06.823 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 82 13 88 0B 61 00 02 00 7C 03 E8 01 1E 00 21 59 4F 5B 
15:23:06.927 > RX Period End
15:23:06.927 > Success
15:23:11.520 > Fetch inverter: 1121XXXXXXXX
15:23:11.662 > TX RealTimeRunData Channel: 23 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 BE 00 00 00 00 00 00 00 00 49 44 18 
15:23:11.762 > Interrupt received
15:23:11.861 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 54 03 81 0B E8 00 00 CC 15 04 BC 09 4F 86 
15:23:12.156 > Interrupt received
15:23:12.464 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 82 13 87 0B 5E 00 02 00 7A 03 E8 01 1E 00 21 1B 95 F5 
15:23:12.626 > RX Period End
15:23:12.626 > Success
15:23:16.656 > Fetch inverter: 1121XXXXXXXX
15:23:16.948 > TX RealTimeRunData Channel: 40 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 C4 00 00 00 00 00 00 00 00 2B 5F 1B 
15:23:17.096 > RX Period End
15:23:17.096 > All missing
15:23:17.096 > Nothing received, resend whole request
15:23:17.096 > TX RealTimeRunData Channel: 61 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 C4 00 00 00 00 00 00 00 00 2B 5F 1B 
15:23:17.229 > RX Period End
15:23:17.229 > All missing
15:23:17.229 > Nothing received, resend whole request
15:23:17.229 > TX RealTimeRunData Channel: 75 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 C4 00 00 00 00 00 00 00 00 2B 5F 1B 
15:23:17.352 > Interrupt received
15:23:17.454 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 82 13 88 0B 56 00 02 00 7A 03 E8 01 1F 00 21 D7 18 B2 
15:23:17.545 > RX Period End
15:23:17.545 > Middle missing
15:23:17.545 > Request retransmit: 1
15:23:17.545 > TX RequestFrame Channel: 3 --> 15 83 21 35 84 80 17 79 68 81 01 
15:23:17.615 > Interrupt received
15:23:17.870 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 55 03 7D 0B DE 00 00 CC 16 04 BD 09 43 43 
15:23:17.921 > RX Period End
15:23:17.921 > Success
15:23:21.135 > Fetch inverter: 1121XXXXXXXX
15:23:21.639 > TX RealTimeRunData Channel: 23 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 C9 00 00 00 00 00 00 00 00 BB 07 DE 
15:23:22.365 > Interrupt received
15:23:22.474 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 54 03 7E 0B E0 00 00 CC 16 04 BD 09 4A 76 
15:23:22.563 > Interrupt received
15:23:22.723 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 82 13 89 0B 57 00 02 00 7A 03 E8 01 1E 00 21 9C 1C FC 
15:23:23.421 > RX Period End
15:23:23.421 > Success
15:23:27.141 > Fetch inverter: 1121XXXXXXXX
15:23:27.141 > Request SystemConfigPara
15:23:27.393 > TX RealTimeRunData Channel: 40 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 CE 00 00 00 00 00 00 00 00 8B 21 CF 
15:23:27.499 > RX Period End
15:23:27.499 > All missing
15:23:27.499 > Nothing received, resend whole request
15:23:27.499 > TX RealTimeRunData Channel: 61 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 CE 00 00 00 00 00 00 00 00 8B 21 CF 
15:23:27.522 > Interrupt received
15:23:27.575 > RX Channel: 23 --> 95 83 21 35 84 83 21 35 84 82 13 8A 0B 5A 00 02 00 7A 03 E8 01 1E 00 21 EB B0 29 
15:23:27.673 > RX Period End
15:23:27.673 > Middle missing
15:23:27.673 > Request retransmit: 1
15:23:27.673 > TX RequestFrame Channel: 75 --> 15 83 21 35 84 80 17 79 68 81 01 
15:23:27.917 > RX Period End
15:23:27.917 > Middle missing
15:23:27.917 > Request retransmit: 1
15:23:27.917 > TX RequestFrame Channel: 3 --> 15 83 21 35 84 80 17 79 68 81 01 
15:23:27.999 > Interrupt received
15:23:28.336 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 54 03 7F 0B E3 00 00 CC 17 04 BE 09 55 69 
15:23:28.443 > RX Period End
15:23:28.443 > Success
15:23:28.532 > TX SystemConfigPara Channel: 23 --> 15 83 21 35 84 80 17 79 68 80 05 00 64 46 82 CE 00 00 00 00 00 00 00 00 45 2E 00 
15:23:28.684 > Interrupt received
15:23:28.884 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 81 00 01 03 E8 00 00 03 E8 FF FF FF FF 01 68 7D F8 F9 
15:23:28.948 > RX Period End
15:23:28.948 > Success
15:23:31.598 > Fetch inverter: 1121XXXXXXXX
15:23:32.143 > TX RealTimeRunData Channel: 40 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 82 D3 00 00 00 00 00 00 00 00 DB B4 17 
15:23:32.636 > Interrupt received
15:23:32.713 > RX Channel: 3 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 54 03 82 0B E9 00 00 CC 17 04 BE 09 4F 84 
15:23:33.375 > Interrupt received
15:23:33.453 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 82 13 8A 0B 60 00 00 00 7A 03 E8 01 1E 00 21 D1 BE 25 
15:23:33.535 > RX Period End
15:23:33.536 > Success

Anything else?

OpenDTU on AZDelivery ESP32 NodeMcu WiFi CP2102 + Hoymiles HM-350.

Maybe usesefull to add that it was the same since the first setup about 2 months ago and with various FW release since that

@iomax iomax added the bug Something isn't working label Apr 24, 2023
@DPO99
Copy link

DPO99 commented Apr 24, 2023

Little Bit Same here system worked for weeks without any Error. Since Yesterday i got a Lot of dtu Error.

Screenshot_20230424_175744_com.android.chrome.jpg

@Technikfan
Copy link

Hi, i also have this "DTU command failed" from the beginning.
I am using an HM-1500 and starts for some days with openDTU.
But i read in an forum that the hoymiles only work when there is power flowing through the panels.
And I saw these messegaes mostly then sunset, so the power comes down slowly.
I then figured that the WifI module in the HM-xxx doesn't have enough power to still transmit stably in this situation.
Or is it actually a fault after all?
Best greatings

@iomax
Copy link
Author

iomax commented Apr 24, 2023

I was not able to identify a rational and repeatable path about the issues as also hard for me to catch the right moment of start, it happened under a shining sun as well as in the middle of the night.

Today "lucky" day regarding my setup & this issues, and a new "slowdown & stop" happened just few minutes ago again and today seems that "2 - DTU command failed" event win hands down but as said, sometime it happend with few o no error in event log.

Nuova nota

19:30:07.530 > Websocket: [/livedata][10] disconnect
19:30:10.843 > Fetch inverter: 1121XXXXXXXX
19:30:11.121 > TX RealTimeRunData Channel: 3 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC 9E 00 00 00 00 00 00 00 00 82 BC 35 
19:30:11.581 > Interrupt received
19:30:11.665 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 82 13 85 00 42 00 00 00 03 03 E9 00 D0 00 31 43 38 B0 
19:30:11.712 > RX Period End
19:30:11.712 > Middle missing
19:30:11.712 > Request retransmit: 1
19:30:11.712 > TX RequestFrame Channel: 23 --> 15 83 21 35 84 80 17 79 68 81 01 
19:30:11.757 > Interrupt received
19:30:11.882 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 33 00 17 00 46 00 00 CE 8D 07 34 08 6A E4 
19:30:14.056 > RX Period End
19:30:14.056 > Success
19:30:16.907 > Fetch inverter: 1121XXXXXXXX
19:30:17.316 > TX RealTimeRunData Channel: 40 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC A4 00 00 00 00 00 00 00 00 23 96 84 
19:30:17.468 > RX Period End
19:30:17.468 > All missing
19:30:17.468 > Nothing received, resend whole request
19:30:17.468 > TX RealTimeRunData Channel: 61 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC A4 00 00 00 00 00 00 00 00 23 96 84 
19:30:17.828 > Interrupt received
19:30:19.671 > RX Channel: 3 --> 95 83 21 35 84 83 21 35 84 82 13 87 00 42 00 00 00 03 03 E8 00 CF 00 31 07 F2 22 
19:30:19.694 > RX Period End
19:30:19.695 > Middle missing
19:30:19.695 > Request retransmit: 1
19:30:19.695 > TX RequestFrame Channel: 75 --> 15 83 21 35 84 80 17 79 68 81 01 
19:30:19.711 > Interrupt received
19:30:19.863 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 33 00 17 00 46 00 00 CE 8D 07 34 08 6B E5 
19:30:19.950 > RX Period End
19:30:19.950 > Success
19:30:20.078 > Fetch inverter: 1121XXXXXXXX
19:30:21.103 > TX RealTimeRunData Channel: 3 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC AA 00 00 00 00 00 00 00 00 43 DA A6 
19:30:21.842 > Interrupt received
19:30:22.947 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 32 00 17 00 45 00 00 CE 8D 07 34 08 6B E7 
19:30:23.259 > Interrupt received
19:30:23.665 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 82 13 87 00 42 00 00 00 03 03 E8 00 CF 00 31 60 58 EF 
19:30:25.396 > RX Period End
19:30:25.396 > Success
19:30:25.580 > Fetch inverter: 1121XXXXXXXX
19:30:26.085 > TX RealTimeRunData Channel: 23 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC AF 00 00 00 00 00 00 00 00 13 E5 CC 
19:30:28.187 > Interrupt received
19:30:28.256 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 82 13 88 00 43 00 00 00 03 03 E8 00 CF 00 31 BA 9B F8 
19:30:28.589 > RX Period End
19:30:28.589 > Middle missing
19:30:28.589 > Request retransmit: 1
19:30:28.589 > TX RequestFrame Channel: 40 --> 15 83 21 35 84 80 17 79 68 81 01 
19:30:28.648 > Interrupt received
19:30:28.712 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 34 00 17 00 46 00 00 CE 8D 07 34 08 6D E4 
19:30:28.732 > RX Period End
19:30:28.732 > Success
19:30:32.778 > Fetch inverter: 1121XXXXXXXX
19:30:32.901 > TX RealTimeRunData Channel: 61 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC B6 00 00 00 00 00 00 00 00 83 42 E2 
19:30:32.960 > Interrupt received
19:30:32.984 > RX Channel: 3 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 34 00 17 00 46 00 00 CE 8D 07 34 08 65 EC 
19:30:33.007 > Interrupt received
19:30:33.254 > RX Channel: 23 --> 95 83 21 35 84 83 21 35 84 82 13 89 00 42 00 00 00 03 03 E8 00 CE 00 31 01 C4 1D 
19:30:33.268 > RX Period End
19:30:33.268 > Success
19:30:39.843 > Fetch inverter: 1121XXXXXXXX
19:30:44.250 > TX RealTimeRunData Channel: 75 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC BD 00 00 00 00 00 00 00 00 B3 31 AA 
19:30:44.757 > Interrupt received
19:30:45.462 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 33 00 17 00 46 00 00 CE 8D 07 34 08 6F E1 
19:30:45.837 > Interrupt received
19:30:47.867 > RX Channel: 61 --> 95 83 21 35 84 83 21 35 84 82 13 8A 00 43 00 00 00 03 03 E8 00 CE 00 31 75 B0 1F 
19:30:48.048 > RX Period End
19:30:48.048 > Success
19:30:48.249 > Fetch inverter: 1121XXXXXXXX
19:30:48.650 > TX RealTimeRunData Channel: 3 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC C3 00 00 00 00 00 00 00 00 11 18 5F 
19:30:51.023 > Interrupt received
19:30:51.927 > RX Channel: 40 --> 95 83 21 35 84 83 21 35 84 82 13 89 00 42 00 00 00 03 03 E8 00 CF 00 31 0A 59 8A 
19:30:52.053 > RX Period End
19:30:52.053 > Middle missing
19:30:52.053 > Request retransmit: 1
19:30:52.053 > TX RequestFrame Channel: 23 --> 15 83 21 35 84 80 17 79 68 81 01 
19:30:52.542 > Interrupt received
19:30:53.077 > RX Channel: 75 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 32 00 17 00 46 00 00 CE 8D 07 34 08 68 E7 
19:30:53.139 > RX Period End
19:30:53.139 > Success
19:30:53.463 > Fetch inverter: 1121XXXXXXXX
19:30:53.536 > TX RealTimeRunData Channel: 40 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC C8 00 00 00 00 00 00 00 00 21 6B 17 
19:30:53.644 > Interrupt received
19:30:53.910 > RX Channel: 3 --> 95 83 21 35 84 83 21 35 84 82 13 8A 00 42 00 00 00 03 03 E9 00 CF 00 31 C4 F5 EA 
19:30:54.165 > RX Period End
19:30:54.165 > Middle missing
19:30:54.165 > Request retransmit: 1
19:30:54.165 > TX RequestFrame Channel: 61 --> 15 83 21 35 84 80 17 79 68 81 01 
19:30:54.251 > Interrupt received
19:30:56.330 > RX Channel: 3 --> 95 83 21 35 84 83 21 35 84 01 00 01 01 33 00 17 00 46 00 00 CE 8D 07 34 08 82 0C 
19:30:59.506 > RX Period End
19:30:59.506 > Success
19:30:59.915 > Fetch inverter: 1121XXXXXXXX
19:31:00.238 > TX RealTimeRunData Channel: 75 --> 15 83 21 35 84 80 17 79 68 80 0B 00 64 46 BC CE 00 00 00 00 00 00 00 00 81 40 9A 
19:31:00.631 > Interrupt received

@tbnobody
Copy link
Owner

This Error Message most likley appears in the follwing conditions:

  • You are setting a limit too often
  • The DTU reboots very often so that the SystemConfigPara command runs to often
  • You are running a second DTU no the same inverter

@iomax
Copy link
Author

iomax commented Apr 24, 2023

Just one (Open)DTU here and never played with limit, or other "set", so the second option seems the more reasonable but it sound to me as an effect more than source of the problem.
Now the question is, what could do trigger this, or some sort of, fast loop reboot ? As mentioned, sunny or rainy days, midday or midnight didn't show much difference

Unfortunately I have not option to physically connect esp board to the pc and at the same time, keep it in link with the inverter on the roof, but if you suggest something to investigate I'll happy to do my best.

Quick list about what already explored without success :

  • different power sources
  • various NRF24 boards and/or power setting from Low to Maximum
  • Polling time, dtu and mqtt side, from 5 sec to 30 sec
  • several firmware version as available since about March early

@tbnobody
Copy link
Owner

I am running multiple DTUs with the default settings in terms of polling time and mqtt publish time without any problems. Maximum RF power will require additional capacitors and power supply. Otherwise it's most likely that it causes power issues.

You can look at Info --> System --> Uptime to see if the device rebootet very often

@Slummi
Copy link

Slummi commented Apr 25, 2023

Since the last update of OpenDTU I also get these "DTU command failed" messages. In previous versions I never got it.

But up to now the OpenDTU seems to work. I'm not sure which version was the last without errors.

@Slummi
Copy link

Slummi commented Apr 25, 2023

Since the update to v23.4.24 my OpenDTU performs also many reboots (35 times in the last 8 hours since the update).
I go back to v23.4.6 which was installed before v23.4.24 and worked well...

@tbnobody
Copy link
Owner

tbnobody commented Apr 25, 2023

What is your poll interval? Can you post an output of the serial console at the time of reboot? (Web Console is not enough, must really be the serial console) This would help to determine the issue.

@Slummi
Copy link

Slummi commented Apr 25, 2023

My poll intervall is 1 second. I've always used this interval.
I will try to log the data of the serial console. I saw the reboots in my home automation system where I log the OpenDTU status and its connectivity state.

By the way: All reboots happend during night time when the inverters were off. There were no further reboots since the first inverter was online.

@tbnobody
Copy link
Owner

I will try to log the data of the serial console. I saw the reboots in my home automation system where I log the OpenDTU status and its connectivity state.

You don't have to.... 1 second interval leads currently to a full command queue at night because all the commands run into a timeout. Change it to 5 seconds and everything will work. I am aware of this issue and will fix it soon.

@Slummi
Copy link

Slummi commented Apr 25, 2023

OK, since which version does the issue occur, because I've always used a 1 second interval without any issues?

@iomax
Copy link
Author

iomax commented Apr 25, 2023

Maximum RF power will require additional capacitors and power supply. Otherwise it's most likely that it causes power issues.
You can look at Info --> System --> Uptime to see if the device rebootet very often

Ok, added "uptime" to the "monitor" ( via mqtt ) and RF Power kept at "High ( -6 dB )" level.
I'll keep you updated if/when the problem will start again.

@tbnobody
Copy link
Owner

Could you please doublecheck version v23.4.25 ?
It should fix the issue with the queue overflow as well as a lot of dtu command failed messages (which occured because of the queue overflow)

@Slummi
Copy link

Slummi commented Apr 25, 2023

Of course, I update the DTU this evening when the last inverter is offline.

I give you feedback as soon as possible. Thanks for the quick fixes!

@iomax
Copy link
Author

iomax commented Apr 25, 2023

ok, I'll try it.

Thanks for the support :-)

@Slummi
Copy link

Slummi commented Apr 26, 2023

Last night I had no command failed messages and no reboots with v23.4.25.
Looks good so far.

Edit: Tested with 1 second poll interval.

@Technikfan
Copy link

Hi, possible it helps to find the solution.
Here some Screenshots from today:
image
image

Now i made also update to v23.4.25 and watching.
Greatings

@Omega13x
Copy link

Sieht mit der v23.4.25 gut aus. Habe bis jetzt noch keine Fehlermeldung. Gestern um diese Zeit hatte ich schon 15 Einträge. Habe zwei Wechselrichter (HM-600 und HM-300) mit Einspeiseregelung in Betrieb. Intervall 7Sek. Beobachte weiter.

image

@beneva52
Copy link

I had the same issue, but it is fixed now with your last update: no more "DTU command failed". Thanks.

@iomax
Copy link
Author

iomax commented Apr 26, 2023

Today, new version run rock solid and without issues up to 1 hour ago when I get a mqtt LWT ( slow reporting and so on ).
The event list was mostly empty, after the "inverter start" just one "46 - Unknown" at afternoon, then, for whatever reason, few hours later a new "46 - Unknown" and close to it the "long" ( just 8 in this case ) list of "2 - DTU command failed" start again .

Nothing showed from System uptime and in line with the restart triggered by last FW update.

What I noticed with this release was the ability to restore the good working order quickly and without the need for manual reset ( required most of time in the past )

Something more to investigate/check ?

Honestly, I'm starting to have some doubts around (my) WiFi environment (under investigation if ESP side or AP side) and it could justify a slow reaction from WEB page as well as MQTT disconnection, but I'm having a hard time finding the link between the WiFi problem, if any, and "DTU Command" error events ( keeping in mind that system doesn't record any, forced, restart )

Nuova nota
as well as MQTT disconnection, but I'm having a hard time finding the link between the WiFi problem, if any, and "DTU Command" error events ( keeping in mind that system doesn't record any, forced, restart )

Nuova nota

@iomax
Copy link
Author

iomax commented Apr 27, 2023

Still here, same story ... 14 hours without a glitch then plenty of "2 - DTU command failed", web response slowed down, mqtt slowed down and so on and, as "usual", no alert by uptime and in line with the last "manual" action. This time a manual "restart" was needed to come back in good working condition.

I'm open to any suggestions.

p.s. I'm asking myself if I'm alone on this not so comfortable status.

Nuova nota

@Technikfan
Copy link

Hi,
@iomax you never walk allone :-)
Since update yesterday no "DTU command failed", but im watching further.
image
image

@Technikfan
Copy link

Further no Problems anymore with v23.4.25

@iomax
Copy link
Author

iomax commented Apr 28, 2023

Further no Problems anymore with v23.4.25

Lucky man :-) ... unfortunately I'm not in the same state, and today ( even night ) sound like an happy day for the problem ... a good start with first "stop" around midnight ( so no long event list by DTU ) fixed just by manual reset then on morning startup with several, spontaneously aborted, "2 - DTU Command Failed" just before the inverter reaches "production" state.

@tbnobody a ( maybe dumb ) question ... as written in the first post, I'm building FW by myself (no custom option, just syncing with this repo and compiling locally on VSC) ... may I switch to your pre-compiled version (last chance to try) simply by OTA or might be better to start with a full clear/reset in this case?

Nuova nota

@Technikfan
Copy link

Hi, now not so lucky :-)
image
image

@tbnobody
Copy link
Owner

würdest du mir deine config.json sowie deine pin_mapping.json zusenden? Beides bitte bearbeiten und wifi credentials sowie ggf. seriennummern der inverter löschen? Bei mir tritt das aktuell nicht mehr auf. ich würde jedoch einfach mal versuchen das mit deiner config nachzuvollziehen.

@iomax
Copy link
Author

iomax commented Apr 28, 2023

Sorry not clear if the request was for me or not, anyway, in the archive, the config.json and pin_mapping.json.

Archivio.zip

@tbnobody
Copy link
Owner

I'm building FW by myself (no custom option, just syncing with this repo and compiling locally on VSC) ... may I switch to your pre-compiled version (last chance to try) simply by OTA or might be better to start with a full clear/reset in this case?

The pre-build binaries should be exactly the same as the generic environment which you build locally. the full reset will not change anything except that also the config partition will be erased. If you plan to recover your config afterwards it doesn't matter what you are doing. the result will be the same.

@jonnycastaway
Copy link

jonnycastaway commented Apr 29, 2023

Hi, now not so lucky :-) image image

Did you send Limit Commands to your Inverter(s)? If yes, how often you send Commands to the Inverter?
My Test gives me the best results with no more than one Command every 8 Seconds. That's the shortest without Problems at my System: 1 DTU, 2 Inverters
I'm also on 23.4.25 and use the API for communication. No MQTT

@Technikfan
Copy link

Hi, no there no Limit Commands.
The poll-interval to inverter is 5 seconds and i use MQTT.
But i think mqtt is not the problem.

@Slummi
Copy link

Slummi commented May 2, 2023

Today I suddenly had the "DTU command failed" messages again (during the day). I'm currently on v23.4.28.
During this time, the OpenDTU was no longer pingable, the website and web API were not accessible and MQTT didn't work. However, the values on the display of the OpenDTU continued to be updated. After 40 minutes, everything worked again without a restart. During these 40 minutes of outage, there were a few individual MQTT reconnects, but the connection was immediately terminated again after a few seconds.

OpenDTU_Command_Failed

@D3R-ST3FAN
Copy link

Can you check uptime? Does it correspond the the DTU command failed ?
I had some (moste likely) load related reboot of a unit here.

@Slummi
Copy link

Slummi commented May 2, 2023

Can you check uptime? Does it correspond the the DTU command failed ?

As I wrote there was no reboot. The uptime is running since I did the last firmware update some days ago. The error occurred today.

@DPO99
Copy link

DPO99 commented May 3, 2023

Ich hatte heute auch einen Restart/Absturz nach einer super langsamen DTU. Vorher relativ problemlos mehrere Tage gelaufen. Ganz selten mal DTU Errors und gerade eben die DTU aufgerufen (extrem langsam). Plopp Neustart und direkt DTU Errors im Log über mehrere Inverter....scheint noch nicht ganz aus der Welt zu sein das Problem :/. Zusätzlich verhält sich die DTU nach dem Neustart sehr langsam, seiten brauchen ein paar Sekunden um sich zu öffnen.

image

image

image

Interessanterweise betrifft es nur die WR die auch Angeschalten sind. Die anderen beiden (sind gerade aus, da ich keine Leistung aus dem Akku benötige) verhalten sich noch brav):

image

image

@DPO99
Copy link

DPO99 commented May 3, 2023

Habe die DTU vom Strom getrennt/trennen müssen, da gar nix mehr ging. Seitdem läuft es wieder flüssig.

@iomax
Copy link
Author

iomax commented May 3, 2023

I'd not say but last 2 days went without (big) issues and spare/few event in event list ... FW still on v23.4.25 and the only change was about the mqtt publish time doubled against dtu poll interval ( mqtt from 5sec to 10sec, dtu poll kept at 5sec ).
I hope it could help the investigation.

@jstammi
Copy link
Contributor

jstammi commented May 4, 2023

Hi, for me this happens only on some days. And then not that often in a row as reported before. I did not notice dtu degradation (but I am just monitoring it is pingable, not enough to exclude such)

Since being aware of this issue, I checked the occurrences more closely, if I notice them. And I found it happen at least with 2 situations:

  1. shortly (some minutes) after inverter start message. Question for this would be, if this error-2 can occurr with the inverter not being powered in a stable way, yet?

  2. with high load on my server running all services on, eg afais relevant ones mqtt, dns, dhcp, ntp. I record system load and dns response times (local and public). And could correlate with such. AFAIR at least once command-failed error correlated with dns-only problem. My local one tends sometimes to fail sporadically. But this does not mean that the dns-problem itself is causing the problems. But maybe the cause for it (that I do not understand, yet).
    I should be able to reproduce and have a closer look to this, but did not have the time, yet.

Unfortunately the errors do not get recorded. Perhaps one could add to the mqtt topics counters per error number? E.g.

<prefix>/dtu/errors/1: <counter>
<prefix>/dtu/errors/2: <counter>
<prefix>/dtu/errors/46: <counter>
<prefix>/dtu/errors/47: <counter>

(those are the errors, that I noticed up to now from web ui)

@Slummi
Copy link

Slummi commented May 5, 2023

A few miniutes ago I had again the issue.
This time there was just a short interuption of the MQTT connection. A few seconds later MQTT was connected again and I received current values. But the web interface was not available for several minutes (no answer from the web service, but also no HTTP error). Now everything is working, again. There was no DTU reboot. In the log I see two "DTU command failed" messages with the same timestamp. I'm still on v23.4.28.

@iomax
Copy link
Author

iomax commented May 5, 2023

After about 3 days without problems, yesterday in the late afternoon the problem reappeared and continued, with ups and downs, until this morning. In short, the feeling is that the trigger in this case was again related to the WiFi coverage so, even if the hard link between "DTU error <> WiFi" is still a mystery to me, I did a little hack on the original code to interrupt the esp32 energy saving function. The "hacked" FW (latest source v23.4.28) is running now and the start was the strongest and most reliable WiFi link (AP report also confirms higher available bitrate).

Let's see what happens.

@iomax
Copy link
Author

iomax commented May 10, 2023

Still here and no, the total stop of the energy saving on esp32 confirm the strongest and most reliable WiFi link but not solved the issue.

During these 40 minutes of outage, there were a few individual MQTT reconnects, but the connection was immediately terminated again after a few seconds.

Yes, even here the mqtt server log show various and quick connect / disconnect more or less il line with the timing of the several "2 - DTU Command Failed" event.

@jstammi
Copy link
Contributor

jstammi commented May 10, 2023

IMHO there should be no mqtt re-connecting at all. I guess the symptoms arise from that. Maybe the openDTU get's blocked with the tcp connection breaking towards the mqtt server,

Maybe some publish call is hanging until tcp timeout is finally terminating it.

Or the mqtt server process is in trouble and does not consume incoming traffic (high system load, swapping, ...).

@iomax
Copy link
Author

iomax commented May 12, 2023

Hi all, I have some good news to share. Although I wasn't able to identify the source of the hard link between the WiFi side and the DTU side (the long list of "2 - DTU Command Failed" in the DTU event list), I may have been able to address the source of the WiFi issue, at least in my setup.

By default, the ESP32 tries to negotiate a higher bandwidth available by the 2.4 GHz WiFi specification (HT40). This may sound good, but in real life scenarios (e.g. less than optimal line of sight, neighbor's network, WiFi clients), it often backfires and results in a problematic or poor performing link. Dealing with this issue can be simple if your access point/router allows you to set up HT20 instead of HT40.

In my setup, extending the WiFi up to the roof using a "WiFi extender" was the simplest way, but unfortunately, it did not allow me to play with bandwidth options ( Although this may not be at the top of the features list to be provided in a commercial product) a little hack to the openDTU source forcing "HT20" has resulted in a stable connection with no long lists of "2 - DTU Command Failed" in the event list and 0 connect/disconnect ping pong on the MQTT server log. Even after restoring the (standard) power saving setup, everything has been running rock solid for the past two days.

Please see the attached screenshot of today's event, yesterday more or less the same, which show no "2" errors and only a few "Unknown" errors (with codes 12, 36, or 46) that were sparse. I'm not celebrating too soon, but even thinking around the real bandwidth needed by this kind of devices, as well as IoT devices in general, @tbnobody let me to suggest to evaluate the option to force "HT20" config as default in the OpenDTU code.

Thank you for your attention.

Nuova nota

@Slummi
Copy link

Slummi commented May 13, 2023

These are interesting findings, but I don't think there are any problems with the WLAN connection in my case. My DTU is located 30 cm from the access point. The problem also occurs quite differently for me. Sometimes I have the errors several times within a few days, then nothing happens for days. I could try to artificially degrade the WLAN connection or check whether the access point logs any changes in the connection when the issue occurs.

@Slummi
Copy link

Slummi commented May 15, 2023

This time the error pattern was somewhat different. Since yesterday evening, the OpenDTU has suddenly been sending faulty MQTT packets without interruption. The inverter topics suddenly all contained incorrect characters, which led to problems in my home automation system. I noticed the problem about 12 hours later. I was able to open the OpenDTU web interface, but the page with the live data did not load. Only the loading animation was visible. All the other pages worked without any problems.
After I restarted the OpenDTU, everything ran normally again and the MQTT data were also OK.

Since a few versions, the DTU has been running very unreliably.

@DPO99
Copy link

DPO99 commented May 16, 2023

After 6 days. Since there a feew Errors and now this happened:

Screenshot_20230516_201242_com.android.chrome.jpg

@tbnobody
Copy link
Owner

I had the same issue the previous day. An ESP32 was connected to the AP but the webinterface was very slow and the mqtt connection always toggled it's connection state. Interesting remark, it was a Sunton display with OpenHasp firmware.... After serveral resets everything was ok....

@jstammi
Copy link
Contributor

jstammi commented May 18, 2023 via email

@iomax
Copy link
Author

iomax commented May 18, 2023

Hi all, I'm still here ... after a few "quiet days", the issues have resurfaced this morning, along with the long list of "2" events. I'm quite certain that, at least in my case, the trigger is the unreliable WiFi infrastructure ( repeater ) but, despite its inconsistencies, I'm keeping it as is to facilitate further investigations.

Let me share some findings:

  • Disabling ESP32 sleep features and reducing WiFi bandwidth demand (from HT40 to HT20) have shown a better and more responsive link/interface, narrowing down the potential triggers for the issues, but they haven't resolved them.
  • No need to perform any actions on the "OpenDTU" side (no reset, power cycle, etc.) to restore the standard working condition. Simply turning the access point's WiFi off and on seems to be sufficient (the disconnection was detected on the ESP32 this time).
  • Despite the MQTT server log showing a long list of openDTU connect/disconnect events (approximately 300 log entries in a couple of hours), the ESP only registered two (2) "ARDUINO_EVENT_WIFI_STA_DISCONNECTED" events. In short, the ESP mostly didn't initiate any actions to recover from the WiFi issues ( confirmed no performance issues on mqtt server side and openDTU is the only "problematic" user here)

I hope these pieces of information can be helpful in some way.

@iomax
Copy link
Author

iomax commented Jun 4, 2023

Since my previous feedback, around 2 weeks ago, I have been managing the WiFi link to try to improve it and find further confirmation regarding the hard link between the event "2" flood and the WiFi. So, I moved the openDTU device inside and close to the (poor)AP. This did the trick, and there have been no more "2" flood in the events log. Sometimes there are spare events like 12, 36, or 46, but hardly "2" and no more disconnection in the MQTT server log.

An additional unexpected benefit of moving from the roof to inside was a stronger signal for the "DTU", allowing me to lower the transmitting power from "Max" to "Low." ( even the "minimum" power setting works without issues, but I prefer to stay on the safe side)

It might be interesting to note that around the same time, I implemented a sort of "delay counter" in the code and discovered that the esp32 occasionally gets heavily occupied somewhere and fails to properly execute the "network loop" code for about 3 or 4 seconds.

Hoping it is useful in some way.

@stefan123t
Copy link

stefan123t commented Jul 10, 2024

Please note that we have renamed the Event Log Alarm Code 2 from DTU command failed to Time calibration.
This has been learned from the warn_code.json in the S-Miles Installer App's APK file and concluded in #121.

@iomax so IMHO this may be something like the command sent to the inverter which contains the Timestamp has been delayed too much and does not match the internal timestamp of the inverter.
This may happen when you have a bad or only somewhat working connection between the DTU and the inverter and the DTU has to repeat the command(s) very often for a longer period of time until the complete response is received from the inverter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests