OpenDTU messing up configuration during runtime #83

Closed · HacksBugsAndRockAndRoll opened this issue Aug 16, 2022 · 30 comments

@HacksBugsAndRockAndRoll commented Aug 16, 2022

During normal runtime on version https://github.com/tbnobody/OpenDTU/commits/0cc6ce3, the inverter configuration seems to break.

(screenshot)

curl gives this info (the serial changed, but it is the correct one):

curl http://192.168.178.96/api/inverter/list
{"inverter":[{"id":0,"name":"SoxSolar","serial":"114181XXXXX","type":"Unknown","max_power":[0,0,0,0]}]}

MQTT publishes like this:
(screenshot)

OpenDTU no longer pulls data from the inverter in this state.

@tbnobody (Owner)

Could you please double-check the heap usage on the system overview if this error occurs again?

@HacksBugsAndRockAndRoll (Author)

(screenshots: system overview with heap usage)

Also, the DTU is currently still in this state. Is there anything else I should look into?

@tbnobody (Owner)

The heap usage looks a little high, but not by much. I am currently seeing something between 108 and 110 kB.

You mentioned in the other issue that this only occurs for one of your two ESPs. Have you tried reflashing that one again (after a complete flash erase)?

Do you see anything special in the serial console before this issue happens?

@HacksBugsAndRockAndRoll (Author)

I opened the ticket because I thought it was the ESP hardware, but now the second ESP also shows this behavior.
Both ESPs were flash-erased ~3 days ago.
This ESP's USB is currently only connected to power, since I cannot receive the inverter's signal at my PC.
I monitored the heap usage by polling every ~30 s, see below (it captured some spikes but always seems to recover fine):
(graph: heap usage over time)

@tbnobody (Owner)

Today, before I flashed a new version, the uptime was > 2 days without any issues. So maybe you have some kind of different config. Are you using MQTT with TLS?

@HacksBugsAndRockAndRoll (Author)

No, just plain MQTT with HA discovery.

@HacksBugsAndRockAndRoll (Author)

It seems to appear only with MQTT enabled, so I dug a bit into that code and found that #86 might be related.

@HacksBugsAndRockAndRoll (Author)

The issue has now also appeared on a DTU without MQTT enabled, after 3d17h of uptime. The DTUs with MQTT enabled still seem to fail faster.

@helgeerbe (Contributor)

Found this in WebApi_ws_live.cpp, WebApiWsLiveClass::loop():

        DynamicJsonDocument root(40960);
        JsonVariant var = root;
        generateJsonResponse(var);

        size_t len = measureJson(root);
        AsyncWebSocketMessageBuffer* buffer = _ws.makeBuffer(len); // creates a buffer (len + 1) for you.
        if (buffer) {
            serializeJson(root, (char*)buffer->get(), len + 1);
            _ws.textAll(buffer);
        }

It seems to me that the buffer is created with the size of the JSON document, but serializeJson() is called with buffer size + 1. That does not look correct to me. The comment indicates that in earlier code len + 1 was allocated.

@helgeerbe (Contributor)

OK, ignore my comment. I have since learned that makeBuffer() indeed creates a buffer of len + 1.
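
For reference, this matches ArduinoJson's sizing rule: measureJson() returns the payload length without the terminating '\0', so the target buffer needs one extra byte. A minimal sketch of the same pattern with a plain heap buffer instead of AsyncWebSocketMessageBuffer (illustrative only, not the OpenDTU code):

    #include <ArduinoJson.h>

    // Measure, allocate len + 1, then serialize: the extra byte holds the '\0'
    // that serializeJson() appends when given the full buffer capacity.
    void buildJsonPayload() {
        StaticJsonDocument<256> doc;
        doc["name"] = "SoxSolar";

        size_t len = measureJson(doc);        // payload length, excluding '\0'
        char* buffer = new char[len + 1];     // +1 for the terminator
        serializeJson(doc, buffer, len + 1);  // writes len bytes plus '\0'

        // ... hand the buffer over to the websocket / transport here ...
        delete[] buffer;
    }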

@stefan123t

@HacksBugsAndRockAndRoll are you running two ESPs in parallel, and do you at least use distinct DTU_RADIO_IDs for them?

@HacksBugsAndRockAndRoll (Author)

Yes to both of these questions.

@stefan123t

Can you confirm this issue also occurs if you are running only one ESP?
We have reports that two ESPs both querying a single inverter can cause misattribution and misinterpretation of replies.
Can you supply serial logs where this clearly goes wrong, or where it happens in the payload parser/decoding?
Do we need to add some logging to the payload parser/decoding to detect such misinterpretation/misattribution?

@tbnobody (Owner)

There should be no misinterpretation of the data when using two different DTU IDs, because one DTU wouldn't see the packets of the other one. Unless... how different are the DTU IDs? The RF packet only contains the lowest 4 bytes of the ID. If those bytes are identical there might be an issue. (But the chance is very small, because there are 3 different CRC checksums which would all have to match.)
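
To illustrate the collision condition (a standalone sketch based on the description above; the example IDs are made up and this is not the actual OpenDTU packet code):

    #include <cstdint>
    #include <cstdio>

    // Only the lowest 4 bytes of the DTU ID go on air, so two DTUs can only
    // clash if exactly those 4 bytes are identical.
    static uint32_t onAirId(uint64_t dtuId) {
        return static_cast<uint32_t>(dtuId & 0xFFFFFFFFu);
    }

    int main() {
        const uint64_t dtuA = 0x99978563412ULL;  // made-up example IDs
        const uint64_t dtuB = 0x99901020304ULL;

        if (onAirId(dtuA) == onAirId(dtuB)) {
            std::printf("Collision: both DTUs use on-air ID 0x%08X\n", onAirId(dtuA));
        } else {
            std::printf("No collision: 0x%08X vs 0x%08X\n", onAirId(dtuA), onAirId(dtuB));
        }
        return 0;
    }

Even in the unlikely case of a collision, the three CRC checks mentioned above would still have to match before a foreign packet is accepted.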

@petrm commented Sep 21, 2022

I observe the same issue. Restarting the ESP makes it go away. It appears randomly after 1-3 days of uptime. When it happens, the DTU is accessible via the web interface, but stops polling the inverters.

@tbnobody (Owner) commented Sep 21, 2022

@petrm which inverter(s) are you using? Are you using MQTT? If yes, with what configuration? Are you using a DTU-Lite or Pro in parallel? What is your currently installed git hash (Info --> System)?

@tbnobody (Owner)

@petrm and @HacksBugsAndRockAndRoll do you have a chance to log the output of the serial console for a longer time? It would be interesting to see what happens just before this issue occurs (if there is anything special in the serial console).
I also added some additional debug output of the startup sequence in the meantime; this output would also be interesting.

I am still trying to reproduce this issue. Are you doing anything special? (e.g. polling the web API using curl, or something else I may not have in mind currently?) I am rebooting my ESP regularly because of development work, but I also reach uptimes of 5-10 days without problems.

@petrm commented Sep 22, 2022

Since there is no way to get any log remotely, I can only attach the serial console when I am back in about three weeks.
I have a new state of the device: this time it can read the info from the inverter and shows it in the UI, but sends it out corrupted to MQTT. When I go to the configuration page, the serial number there is displayed correctly, but the inverter can't be identified. This would suggest that the serial number displayed there is read from a different place in memory than the one actually used to query the inverter.
What would help:

  1. add an option to reboot remotely as a workaround (a sketch of the idea is shown below)
  2. maybe add some debugging output to the UI as well
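
Regarding point 1, a remote reboot endpoint would be a small addition; here is a rough sketch using ESPAsyncWebServer (the endpoint path, response body and flag handling are assumptions for illustration, not OpenDTU's actual web API):

    #include <Arduino.h>
    #include <ESPAsyncWebServer.h>

    AsyncWebServer server(80);
    static bool rebootRequested = false;

    void setupMaintenanceApi() {
        // POST /api/maintenance/reboot -> acknowledge, then restart from loop()
        // so the HTTP response can still be delivered before the chip goes down.
        server.on("/api/maintenance/reboot", HTTP_POST, [](AsyncWebServerRequest* request) {
            request->send(200, "application/json", "{\"reboot\":true}");
            rebootRequested = true;
        });
        server.begin();
    }

    void setup() {
        // WiFi connection setup omitted for brevity.
        setupMaintenanceApi();
    }

    void loop() {
        if (rebootRequested) {
            delay(250);       // grace period for the response and serial output
            ESP.restart();
        }
    }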

@tbnobody (Owner)

> I have a new state of the device: this time it can read the info from the inverter and shows it in the UI, but sends it out corrupted to MQTT. When I go to the configuration page, the serial number there is displayed correctly, but the inverter can't be identified. This would suggest that the serial number displayed there is read from a different place in memory than the one actually used to query the inverter.

This is absolutely correct. There is a config structure which stores the serial number etc., but when showing the type or exporting MQTT data, the internal data structures of the Hoymiles library are used. There is a vector which stores the inverters inside the Hoymiles lib:

std::vector<std::shared_ptr<InverterAbstract>> _inverters;

Somewhere in the code a part of this structure gets partly overwritten (and therefore any further behavior is totally random). But this does not happen for all users. I would suspect some issue with the packet parser, but without knowing the exact received packets it's a little hard to analyze. (And due to the memory corruption, the web output might be wrong as well.)
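
To make the two data paths concrete, here is a compressed sketch of the situation described above (apart from the quoted _inverters vector, all names are simplified stand-ins, not the real OpenDTU/Hoymiles classes):

    #include <cstdint>
    #include <memory>
    #include <vector>

    // Persisted configuration entry: this is what the settings page displays,
    // so the serial shown there can still look perfectly fine.
    struct InverterConfig {
        uint64_t serial;
    };

    // Runtime object inside the Hoymiles lib; type detection and MQTT export
    // work on these. If this memory gets partly overwritten, lookups break
    // even though the stored config is untouched.
    class InverterAbstract {
    public:
        explicit InverterAbstract(uint64_t serial) : _serial(serial) {}
        uint64_t serial() const { return _serial; }
    private:
        uint64_t _serial;
    };

    std::vector<std::shared_ptr<InverterAbstract>> _inverters;

    // MQTT/type lookups resolve the configured serial against _inverters;
    // a corrupted entry returns nullptr -> "Unknown" type, garbled topics.
    std::shared_ptr<InverterAbstract> getInverterBySerial(uint64_t serial) {
        for (const auto& inv : _inverters) {
            if (inv && inv->serial() == serial) {
                return inv;
            }
        }
        return nullptr;
    }

    int main() {
        InverterConfig cfg{0x114181000001ULL};   // what the UI keeps showing
        _inverters.push_back(std::make_shared<InverterAbstract>(cfg.serial));
        _inverters[0] = std::make_shared<InverterAbstract>(0);  // simulate corruption
        return getInverterBySerial(cfg.serial) ? 0 : 1;         // 1 = "Unknown" case
    }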

@HacksBugsAndRockAndRoll (Author)

Since I do not have a computer in a location where the DTU is in range of the inverter, I'll need to set something up, probably with a Raspberry Pi - this might take some time.
On a side note: the corruptions also happen during the night, when the inverter is offline. See the attached graph, where I track the uptime of the devices using 30 s interval polls to the REST API.

To further explain this: I have my fork running on these two DTUs. The only change I made is a regular check for corrupted DTU serials in the configuration - if a corruption is found, a restart is triggered (https://github.com/HacksBugsAndRockAndRoll/OpenDTU/blob/local/configfix-workaroud/src/ConfigFix.cpp - I know this is not the solution to the problem, but it is what allows me to use OpenDTU for my "productive" setup as long as the bug exists; a rough sketch of the idea follows at the end of this comment).

The blue line shows the device with MQTT enabled, which seems to increase the chance of corruption - this device also shows corruptions during the night.

(graph: uptime of both DTUs)
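
For completeness, the shape of that workaround, as referenced above (a minimal sketch; the readStoredInverterSerial() accessor and the validity heuristic are placeholders, not a copy of ConfigFix.cpp):

    #include <Arduino.h>

    static const uint32_t CHECK_INTERVAL_MS = 60UL * 1000UL;
    static uint32_t lastCheck = 0;
    static const uint8_t INVERTER_COUNT = 1;      // placeholder

    // Placeholder for reading a persisted inverter serial from the configuration.
    uint64_t readStoredInverterSerial(uint8_t index) {
        (void)index;
        return 0x114181000001ULL;                 // dummy value for this sketch
    }

    // Heuristic: Hoymiles serials fit into 6 bytes, so a zero value or anything
    // touching the upper 2 bytes is treated as corrupted here.
    static bool looksCorrupted(uint64_t serial) {
        return serial == 0 || (serial >> 48) != 0;
    }

    // Call this from loop(): on corruption it logs a line, flushes the serial
    // output and restarts, matching the behavior described above.
    void checkConfigIntegrity() {
        if (millis() - lastCheck < CHECK_INTERVAL_MS) {
            return;
        }
        lastCheck = millis();

        for (uint8_t i = 0; i < INVERTER_COUNT; i++) {
            if (looksCorrupted(readStoredInverterSerial(i))) {
                Serial.println(F("Corrupted inverter serial detected, restarting"));
                Serial.flush();
                ESP.restart();
            }
        }
    }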

@tbnobody (Owner)

If the corruption also occurs during the night, it's maybe not related to the inverter's responses. But then it should be sufficient to place one of your ESPs out of range of the inverter, but connected to a computer, to get the serial output.

@tbnobody (Owner)

Can you maybe download your config file (Settings --> Config Management), open the .bin file using a hex editor, overwrite your WiFi password with X (do not change the length of the file, just overwrite the characters) and provide this in some way? Then I can import your config with all applied settings and see if this issue occurs.

@HacksBugsAndRockAndRoll (Author)

Sure, I'll have a look. So far I can report that my setup in my room (no inverter connectivity, but active MQTT) has corrupted only once since last weekend. Unfortunately I did not get any meaningful logs, since the reboot did not wait for the serial output to flush. Since I fixed this (Serial.flush(), then reboot) 4 days ago, no corruption has happened on this device. I can, however, move the whole setup into inverter range now, since I found a Raspberry Pi to attach to it.
My "productive" device has had several restarts triggered in the meantime, so I hope actually processing radio signals will also increase the corruption frequency on my test setup.

@petrm commented Sep 28, 2022

Here is my config:
config.zip
I upgraded to https://github.com/tbnobody/OpenDTU/commits/41758ba and have had no crash for 6 days.

@petrm commented Nov 3, 2022

Short update: I now have 20 days of uptime with d7fe495 and it is stable so far - no suspicious log messages or errors.

@stefan123t

@tbnobody I haven't looked into OpenDTU regarding the Serial.flush() buffer, but in AhoyDTU I also have some suspicion that we may need to flush the serial buffers from time to time in order to avoid a buffer overflow.
@HacksBugsAndRockAndRoll did you try the version that @petrm has tested as rock-solid for 20 days?

@HacksBugsAndRockAndRoll (Author)

Currently I do not have a whole lot of time for this project. I can say that I am running https://github.com/tbnobody/OpenDTU/commits/59b87c5, which is a slightly adjusted (self-reset on corrupted config) version of 9a44324, and I still see the self-resets being triggered.

(screenshot)

I'll need to rebase my changes and update at some point.

@stefan123t

@HacksBugsAndRockAndRoll as we are unable to reproduce this issue on other devices, could you check whether it still occurs with a newer build, OpenDTU v23.12.19 or later, and report back?

@tbnobody (Owner)

I will close this issue as it's really old and there have been a lot of code iterations since. Please open a new one if the problem occurs again.

@tbnobody closed this as not planned on Apr 12, 2024
