Regular reboots and failing auto updates #134

Closed
ozonejunkieau opened this issue Oct 29, 2023 · 24 comments

@ozonejunkieau

Good Day,

Just wondering if you were able to provide some pointers on how to diagnose this one.

FWIW the unit is connected to a CTXM25RVMA unit via the S403 connector, powered from the +14V pin.

I've got a few behaviours that aren't quite right:

  1. The unit sometimes just fails to connect, but if left alone will apparently come good after a few hours. During this time the Web UI is accessible and MQTT is up, it just polls awaiting a response. Even a complete power cycle of the unit doesn't seem to convince it to connect.

  2. Frequent logging of error/spareroomac/comms {"protocol":"S21","badlength":"3","command":"SD","data":"303030"}. I assume this is just an unsupported command that hasn't timed out yet, possibly related to the frequent reboots?

  3. Frequent Reboots:
    image
    This is a plot of the reported uptime values from the unit. The unit only seems to report them very infrequently, though it appears to report the initial near-zero value reliably?

image
This includes the reported memory usage, so it doesn't appear to be running out of memory?

Hardware Version
Relatively recent files from git ordered from JLC directly:
image

Software Version
Built using make s3 from commit 1272a2fdd7793b5dcdab628740e00f53368a868c. I've tried issuing an update command to force an upgrade to the current version on the cloud but I get an error ESP_ERR_OTA_VALIDATE_FAILED - is this related to having to generate a self signed certificate as part of setting up the environment for the make script to run? Or possibly due to using a newer version of the firmware than is currently published?

Thanks for any help you are able to provide, and additionally thank you for what appears to be a most excellent project!

@revk
Owner

revk commented Oct 29, 2023

Very comprehensive report indeed!

The log shows it trying S21 which will fail but it should cycle through some options within a few seconds and find the right one. So odd if taking a long time. I suspect if the rebooting is solved it won't be an issue though. Even so, it should have saved the protocol to use when it next boots. So it is unexplained.

I suspect the reboot is the main problem. Can you connect a serial console and get a log of the reboot / crash when it happens. That will be the big clue as to what is wrong.

There is both serial and USB on the back of the board.

@revk
Owner

revk commented Oct 29, 2023

Oh. And yes. The upgrade is signed. You can set otahost to point to your server to upgrade to your build.

@ozonejunkieau
Author

> Very comprehensive report indeed!
>
> The log shows it trying S21 which will fail but it should cycle through some options within a few seconds and find the right one. So odd if taking a long time. I suspect if the rebooting is solved it won't be an issue though. Even so, it should have saved the protocol to use when it next boots. So it is unexplained.
>
> I suspect the reboot is the main problem. Can you connect a serial console and get a log of the reboot / crash when it happens. That will be the big clue as to what is wrong.
>
> There is both serial and USB on the back of the board.

Thanks for the pointers.

When the unit is working it reports the protocol as S21 in the status message, which I believe is expected. Apparently the only difference between the S403 pinout and the S21 pinout is that the S403 is not isolated. There is an adaptor board available from the manufacturer to provide the S21 signals in the usual isolated fashion.

On this point - I'll have to work out the best way to safely capture the data from the unit, no isolation means that I can't directly connect to a laptop. I'll see what I can work out.

Just quickly, is there an explanation in the docs somewhere on what the reported "spi" value is that is broadcast as part of the state? When the unit goes through an "unstable patch" I do see this value jump around a bit.

What is the best method for setting up a PCB that I have programmed myself to support the auto update features, or is this not really a supported configuration.

@revk
Owner

revk commented Oct 29, 2023

Ah, OK, S403 must be S21 then, sorry, my mistake. So yes, good.

The errors could be down to where it crashes, if part way through sending it could be "out of step" in the middle of some message exchange, and take a while to get sorted.

And yes, I see the issue, if not isolated, ouch, OK.

"spi" is the amount of external RAM (psram connected via SPI). There is internal RAM and external - either getting low is bad.

You can set up the auto upgrade using your own host, the URL is just the name of the binary. But it won't load from my updated binaries as they are signed with my key. So you'd have to build whenever I issue new code. Not sure of a good way around that.

@ozonejunkieau
Author

I left a script running last night to acquire logs via MQTT from the host, and it seems the periodic reboots are occurring because the OTA update is failing.

The timestamps where the uptime resets line up with an attempt to apply an update, which fails and then reboots the device. I've attempted to turn this off this morning by setting setting/spareroomac/otaauto to 0. I'll see how that behaves today.
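
A minimal sketch of that kind of check, assuming the uptime reports have been logged as (timestamp, uptime in seconds) pairs: any sample whose uptime is lower than the previous one marks a restart. The `find_reboots` name and the sample values are hypothetical and only illustrate the idea; they are not taken from the Faikin firmware or from the actual logs.

```python
from datetime import datetime


def find_reboots(samples):
    """Return timestamps at which the reported uptime dropped.

    samples: list of (timestamp, uptime_seconds) tuples in chronological order.
    Uptime can only go down after a restart, so each drop marks a reboot.
    """
    reboots = []
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        if uptime < prev_uptime:
            reboots.append(ts)
    return reboots


# Hypothetical log excerpt, not real data from the unit.
samples = [
    (datetime(2023, 10, 29, 1, 0), 3500),
    (datetime(2023, 10, 29, 2, 0), 7100),
    (datetime(2023, 10, 29, 2, 5), 12),     # uptime reset: reboot just before
    (datetime(2023, 10, 29, 3, 5), 3612),
]
reboots = find_reboots(samples)
```

The resulting timestamps can then be compared by eye (or with a tolerance) against the times of the OTA attempts in the same log.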

@revk
Owner

revk commented Oct 29, 2023

Ah, bingo, yes, there is a catch for failing to auto update. If there is no update that is different. But any failure causes a reboot. Wow. Well spotted.

@ozonejunkieau
Author

Yep, that was definitely the cause of the reboots! After turning off automatic updates, no reboots and the uptime plot seems well behaved - thanks very much for that! Do you mind if I submit a pull request with some improved notes on the build process that note this as likely if you build yourself?

Unfortunately, the data from the unit still doesn't seem all that reliable in its collection - I'm seeing pretty regular time periods of more than 30 minutes where the unit drops offline and no data is received at all:
image

Even when the unit is online I get the following in what appears to be every poll:

info/spareroomac/tx {"protocol":"S21","dump":"0252449603","RD":""}
info/spareroomac/rx {"protocol":"S21","dump":"0253443030302703","SD":"000"}
error/spareroomac/comms {"protocol":"S21","badlength":"3","command":"SD","data":"303030"}

I assume this is just a message that my unit doesn't support, though it's still retried every time from what I can see?
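
Reading the two dump values above byte by byte suggests a simple framing: 0x02, a two-character command, an optional payload, one checksum byte, then 0x03, where the checksum is the low 8 bits of the sum of the command and payload bytes (0x52 + 0x44 = 0x96 for RD, and 0x53 + 0x44 + 3 × 0x30 = 0x127, so 0x27, for SD with payload 000). The sketch below only encodes that inference from these two frames; it is not taken from a protocol specification or from the Faikin source, and `parse_s21_dump` is a hypothetical name.

```python
STX, ETX = 0x02, 0x03


def parse_s21_dump(dump_hex):
    """Split a hex dump string into command, payload and checksum status.

    Assumed layout (inferred from the two dumps in this thread only):
    STX, 2-byte ASCII command, payload, 1 checksum byte, ETX.
    """
    frame = bytes.fromhex(dump_hex)
    if len(frame) < 5 or frame[0] != STX or frame[-1] != ETX:
        raise ValueError("not an STX ... ETX frame")
    body, checksum = frame[1:-2], frame[-2]
    return {
        "command": body[:2].decode("ascii"),
        "payload": body[2:],
        # Assumption: checksum is the sum of command + payload bytes, mod 256.
        "checksum_ok": checksum == (sum(body) & 0xFF),
    }


tx = parse_s21_dump("0252449603")        # the "tx" dump from the log
rx = parse_s21_dump("0253443030302703")  # the "rx" dump from the log
```

Both frames pass the assumed checksum, and the response carries a 3-byte payload, matching the "badlength":"3" in the error line.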

@revk
Owner

revk commented Oct 30, 2023

Pull request is fine. Thanks.

As for comms, that is odd. An unsupported message has a specific response and my code stops sending unsupported messages.

That seems more like some sort of comms problem. It could be interference or more likely a marginal timing of some sort. It may be that some tweaks to the protocol handling are needed.

The fact it takes long to recover is also a concern. It may be worth my adding a pause when there is an issue to help things that have got out of step somehow.

@revk
Owner

revk commented Oct 30, 2023

I have issued code with an extra part flush and pause on any error in S21.

@ozonejunkieau
Author

> I have issued code with an extra part flush and pause on any error in S21.

Thanks very much for that, I've just built and updated the firmware, will see how it goes over the next 12 or 24 hours.

FWIW, I'm still seeing the regular error/spareroomac/comms {"protocol":"S21","badlength":"3","command":"SD","data":"303030"} lines in the log output.
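
For anyone filtering these lines out of a captured log, a small sketch of pulling the fields apart. It assumes each log line is the topic, one space, then the JSON payload, exactly as shown in this thread; `parse_log_line` is a hypothetical helper, not part of any Faikin tooling.

```python
import json


def parse_log_line(line):
    """Split 'topic {json}' into (topic, decoded JSON object)."""
    topic, _, payload = line.partition(" ")
    return topic, json.loads(payload)


line = ('error/spareroomac/comms '
        '{"protocol":"S21","badlength":"3","command":"SD","data":"303030"}')
topic, msg = parse_log_line(line)

# "data" is hex-encoded; 30 30 30 is the ASCII text "000".
data = bytes.fromhex(msg["data"])
```

Decoded this way, the reported data is three ASCII zeros, and its length equals the "badlength" value in the same message.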

@revk
Owner

revk commented Oct 30, 2023

That is odd, I hope it recovers faster. There is a debug setting and I think a dump setting which may offer more clues.

@ozonejunkieau
Author

> That is odd, I hope it recovers faster. There is a debug setting and I think a dump setting which may offer more clues.

Yep, one of the posts above includes the dump output that goes with that error. I'll log all the MQTT data overnight again and see if it yields any differences.

I was also reflecting on your comment about timing - as I mentioned, I'm using the non-isolated S403 port to get this data, and I'm wondering if the few microseconds of difference in timing that may arise from not having optoisolators in the signal path may be a factor here. Is the timing of the protocol analysis likely to be that sensitive?

@revk
Owner

revk commented Oct 30, 2023

The timeouts are quite long, having checked, so that seems less likely now. Other than playing with an oscilloscope it may be hard to tell.

@revk
Owner

revk commented Oct 31, 2023

The bad length looks like a red herring, seems SD response is normally 3 bytes, and the code was expecting more. The SD is only sent in debug. Revised code issued now.

@ozonejunkieau
Author

Ahh okay, thanks for that!

I've not had a chance to go through my device log in detail yet, but the slightly modified code is definitely more reliable at reporting data. The longest period that I've seen is now 15 minutes between data points, which is substantially improved!

I do think there may be something in the interference idea too, I'm slowly gaining confidence to deploy a second one of these so will be intrigued if the behaviour is the same. At the moment I'm a bit worried about the very high input impedance of the unprotected FET gate, wondering if some parallel capacitance and a board level pull up may help - have you tested anything like this?

@revk
Owner

revk commented Oct 31, 2023

The input should be driven both ways by the air-con so I was not expecting any problems on that - the problem is that the air-cons seem to vary from model to model, and the only issue I had was actually with Tx to the Daikin, which needed the current circuit.

@matt-nz

matt-nz commented Nov 1, 2023

> > Very comprehensive report indeed!
> > The log shows it trying S21 which will fail but it should cycle through some options within a few seconds and find the right one. So odd if taking a long time. I suspect if the rebooting is solved it won't be an issue though. Even so, it should have saved the protocol to use when it next boots. So it is unexplained.
> > I suspect the reboot is the main problem. Can you connect a serial console and get a log of the reboot / crash when it happens. That will be the big clue as to what is wrong.
> > There is both serial and USB on the back of the board.
>
> Thanks for the pointers.
>
> When the unit is working it reports the protocol as S21 in the status message, which I believe is expected. Apparently the only difference between the S403 pinout and the S21 pinout is that the S403 is not isolated. There is an adaptor board available from the manufacturer to provide the S21 signals in the usual isolated fashion.
>
> On this point - I'll have to work out the best way to safely capture the data from the unit, no isolation means that I can't directly connect to a laptop. I'll see what I can work out.
>
> Just quickly, is there an explanation in the docs somewhere on what the reported "spi" value is that is broadcast as part of the state? When the unit goes through an "unstable patch" I do see this value jump around a bit.
>
> What is the best method for setting up a PCB that I have programmed myself to support the auto update features, or is this not really a supported configuration.

I've been following this thread with interest as I have a second AC with only the S403 port and I don't have the S21 adapter board (KRP413AB1S). From what I read you have connected via the S403 and it mostly works? I understand somewhat what non isolated means in electronics but I'm not clear what risks or additional precautions I should take if I wanted to try this. Have you just connected the Faikin / ESP32 directly in the same way as with the S21?

@revk
Owner

revk commented Nov 2, 2023

Everyone please be careful not to electrocute yourself. And don't blame me if you do!

@matt-nz

matt-nz commented Nov 2, 2023

Haha, 😊 thanks. I'm not going to electrocute myself. It's more that I wondered whether non-isolated means mains voltage flows nearby or through it, so I want to avoid a short circuit or damage to the AC or Faikin.

revk closed this as completed Nov 3, 2023

@ozonejunkieau
Author

Sorry for the delay in getting back to this - I think I have finally worked out a few more things to be useful!

I think the periods that I am now missing a few minutes at a time are due to a bit of a quirk in the way I was logging MQTT data. I was only graphing data that was provided by the state/spareroomac/status topic, and this would sometimes go periods of many minutes with no apparent update. Having said this, I've now gone back through my logs and I can see that the Faikin/spareroomac topic gets a more frequent update of almost the same data and it appears to be still being reported during my times of missing logged data. I've just updated my logging script to include data from the second topic, so will see how that behaves over the next few days.
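
A rough sketch of how the effect of adding the second topic could be checked, assuming message arrival times have been logged per topic as seconds since the start of the capture. The `find_gaps` helper, the 10-minute threshold and the sample times are all hypothetical; only the two topic names come from the logs described above.

```python
def find_gaps(timestamps, max_gap_s):
    """Return (start, end) pairs where consecutive messages are more than
    max_gap_s seconds apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_s]


# Hypothetical arrival times in seconds, not real data.
status_times = [0, 60, 1200, 1260]        # state/spareroomac/status
faikin_times = [30, 300, 600, 900, 1230]  # Faikin/spareroomac

gaps_status_only = find_gaps(status_times, 600)
gaps_combined = find_gaps(status_times + faikin_times, 600)
```

With these made-up numbers the status topic alone shows one gap of 19 minutes, while the combined stream shows none, which is the pattern the paragraph above describes.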

Also @matt-nz, I'll try and write up some more notes tomorrow on how the connection has worked with these units using the S403 port, but yes, it appears to work as well as the S21 port from what I can tell. I now have two different models connected with only the S403 port.

The big difference on this port is that you don't have a direct +5V pin and it is NOT isolated. What this means is:

  1. The on board 10k pullup resistor to +5V does not have anything to connect to on this header. On the first unit I have installed this on I have left this pin unconnected, and I think that it is working fine. On the 2nd unit (different model) this didn't initially appear to work.* I have just installed a module in this unit though with an additional 10k resistor from the 5V pin to the +14V pin on the Faikin module, effectively providing a 20k pullup on this signal to +14V. This was an educated risk based on visually reverse engineering the Daikin conversion circuitry from photos online, and on a general assumption that 20k to 15V is only about 750uA of pullup current, which is highly unlikely to destroy any IO pin.
     * The addendum to this is that I realised while wiring the pullup resistor that one of the wires on the S403 cable was apparently not connected properly, so this may have been the reason for the initial failure to connect on this unit.
  2. The S403 pinout is not isolated. I believe the S21 interface is always isolated, which means there is no direct pathway from the provided signals to mains power. If you touch one of these when the AC is running your body cannot form a path to ground and thus it is of (relatively) low risk. This is generally the safe way of providing a physically accessible low voltage interface, such as the Daikin wifi modules that may be installed outside of the main A/C units. For example, on one of my units there is a neat space that the factory unit can be clipped in that is accessible from the same door as cleaning the filter - ie. user accessible. The S403 does not provide the safety needed to expose the Faikin module like this. The way that I mitigate this is by installing the unit on relatively short (100mm ish) wires and leaving the entire wifi unit within the metal shielding enclosure that contains the main circuit boards. This does make it harder to get a wifi signal through the box, but I've not really had any issues. It does ensure that all the wires and the circuit board are contained within an earthed enclosure. I also heatshrink the boards completely to make it all as well behaved as possible.
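
The pull-up current estimate in the first point is plain Ohm's law. A small sketch of the arithmetic, assuming the worst case where the signal line is held at 0 V and the two 10k resistors are simply in series; `pullup_current_ma` is a hypothetical helper name.

```python
def pullup_current_ma(supply_v, resistance_ohm):
    """Current in mA through a pull-up when the signal line sits at 0 V."""
    return supply_v / resistance_ohm * 1000.0


series_ohm = 10_000 + 10_000  # on-board 10k plus the added 10k

i_14v = pullup_current_ma(14.0, series_ohm)  # about 0.70 mA at the nominal +14V
i_15v = pullup_current_ma(15.0, series_ohm)  # about 0.75 mA, the rounded-up figure
```

Either way the current stays well under 1 mA.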

If you have any doubts at all please don't touch it - the "other" end of the S403 has a pin that is at +327VDC, so it is all very close to some very dangerous pins, and just to reiterate what @revk said - don't electrocute yourself!

@revk
Owner

revk commented Nov 4, 2023

It would be interesting to know if it can work without the pull up, ie it was just a loose connection.

My experience so far is that the Daikin has an internal pull up. The 5V pull up was done "just in case" there is a model without.

But good write up. I may update the manuals to explain the non-isolated port. Do you have a pinout & picture?

@ozonejunkieau
Author

I'll grab some photos tomorrow from the two units that I have - they have slightly different connectors (because who needs standards). I'll also grab the exact model numbers to add to the wiki page too.

I do also have some photos of how I have done the install too, one of the S403 connectors I had was a 2.0mm pitch connector that was particularly narrow, but I managed to get it working with a Dremel and a 2mm JST connector to start with.

My own adventures have basically entirely been based on this thread: https://community.openenergymonitor.org/t/hack-my-heat-pump-and-publish-data-onto-emoncms/2551/99. Someone has documented the S403 pinout and provided high resolution images of the Daikin conversion board in it.

@revk
Owner

revk commented Nov 4, 2023

Ah, nice.

@ozonejunkieau
Author

Just in case you would like some photos for any of the doco:

Populated S403 Connector on an FTXM71UVMA. 2.0mm pitch connector, using a heavily modified JST connector to mate with the pins.
image

Matching module, with Polymorph as strain relief and wrapped in heatshrink. Mounted within the shielded enclosure:
image

S403 connector on a CTXM25RVMA. This one is 2.54mm pitch:
image
