I2C encountering bus errors #379
@gkasprow are you okay to have a look at this using the new Quartiq firmware? AFAICT this is the only real usability issue remaining with Booster, so it would be good to prioritize it. @ryan-summers may have specific suggestions for diagnostics to perform, but I guess the first thing is just to probe the I2C signals with Booster running and check timing, rise times, noise levels, etc. |
It's also worth mentioning that the errors appear to be quite sparse (~1 every few hours), so this may require hooking up a scope/analyzer that can trigger off I2C bus conditions so that we can see what's happening on the bus electrically when the faults occur. The simplest approach is probably to start with the NACK on the chassis fans. |
I have a decent scope with I2C analyzer. Is there a way to generate toggling on some free CPU IO when the error condition occurs? |
If you enable an RF channel, SIG_ON and EN_PWR will both disable shortly after encountering the fault. I don't know if there are easy probe points on the HW for those signals, but that would be a trigger mechanism that doesn't require custom firmware |
The question is if such an error can be detected by firmware. It could be NACK for example. In such a case you could just toggle some IO and trigger the scope. |
That's what I'm trying to say - firmware does detect the issue, and firmware does toggle an IO that you can trigger off of - it would just be the SIG_ON signal of a channel, since firmware will disable channels whenever the fault occurs |
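For setups where SIG_ON is awkward to probe, the alternative raised above (pulsing a spare IO when firmware sees an I2C error) could look roughly like the sketch below. This is not Booster's firmware: `I2cError`, `TriggerPin`, `with_scope_trigger`, and `MockPin` are hypothetical names invented for illustration, and a real build would use the HAL's own pin and error types.

```rust
/// Hypothetical I2C error kinds; a real HAL defines its own error type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum I2cError {
    Nack,
    Bus,
}

/// Hypothetical stand-in for a spare GPIO output wired to the scope trigger.
pub trait TriggerPin {
    fn set_high(&mut self);
    fn set_low(&mut self);
}

/// Runs one I2C transaction and pulses the trigger pin if it fails.
/// The original result is passed through unchanged, so the caller's
/// existing fault handling still runs.
pub fn with_scope_trigger<T, P, F>(pin: &mut P, transaction: F) -> Result<T, I2cError>
where
    P: TriggerPin,
    F: FnOnce() -> Result<T, I2cError>,
{
    let result = transaction();
    if result.is_err() {
        // Rising edge marks the moment firmware noticed the fault.
        pin.set_high();
        pin.set_low();
    }
    result
}

/// Test double that counts rising edges instead of driving hardware.
#[derive(Default)]
pub struct MockPin {
    pub level: bool,
    pub rising_edges: u32,
}

impl TriggerPin for MockPin {
    fn set_high(&mut self) {
        if !self.level {
            self.rising_edges += 1;
        }
        self.level = true;
    }

    fn set_low(&mut self) {
        self.level = false;
    }
}
```

The pulse is generated only after the transaction has already failed, so the scope would need enough pre-trigger memory to show the bus activity that preceded it.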
ok, thanks :) |
can you send me a recent firmware version so I can give it a try? |
https://github.com/quartiq/booster/actions/runs/509241722 should have the latest firmware images attached as an artifact - information on how to flash is provided in the README |
@gkasprow while you're waiting for new amp chips to arrive, could you have a look at this issue? This is the main thing blocking us on Booster right now. |
Do I need to apply the RF power and load to recreate the issue? |
I will reply to myself - no :) |
No. |
@ryan-summers / @jordens may have more steer here, but IMHO the first thing is to check the voltage levels/noise/timing on the I2C bus and see if anything looks marginal. IME these rare issues are usually a result of some specification not being met. |
This is what I want to do, but recreating the issue is critical to make sure the unit I have suffers from the same illness. |
On the unit I'm testing if you leave it long enough (circa a day with the current firmware IIRC) with all channels enabled and no RF applied it will hit this issue, panic and restart. |
I was able to see this with only two channels installed without any input or outputs connected, and it was very reproducible. However, it often took a decently long operating time before I observed a fault (~24-48 hours between each fault). Reproduction steps to make it simple:
After a fault has been observed via indication of channel 1 being enabled, you can verify that the fault occurred by using the USB port and entering the |
I connected the scope to SDA and SCL, and I trigger it on the falling edge of SIG_ON_CH7. I enabled all channels. Will the issue appear in such a configuration? If not, I will modify the setup. |
I would assume that it should trip just fine in that setup (but note I have not tested that exact configuration myself). My analysis indicated that the fault was independent of RF input/output and channel configurations. |
I managed to download the firmware and upgrade the Booster. I tried with Windows DfuSe, but for some reason it didn't work with the binaries from releases. However, it works fine with Thermostat binaries. Never mind. I upgraded the CPU; before there was an open-source firmware and the EEPROMs contain the original calibration data. |
All of these behaviors are expected - by default, all channels are configured with a very low interlock threshold, which causes them to trip before any configuration is applied, which is intended as a default-safe behavior. Any existing calibration data stored in channel EEPROM is lost upon updating the firmware |
OK, what is default gate voltage? |
I typed "service" and received |
You can get the cause of the reset by connecting to the USB serial terminal through Booster's front panel and typing It looks like your Booster encountered a watchdog reset - we haven't observed this behavior with any of the other devices. What hardware version are you running on? In any case, that's not the error we're discussing here. Also, it looks like the version of your build was a dirty git repo, so it's not entirely clear that you're running the latest release firmware. Where did you procure the image from? |
It's booster-debug.bin from the recent release. I thought that the debug release generates some useful debug data... |
I'm using revision 1.5 |
I'll look into why it's showing the build as
Logs and git revision look like you're using the most recent 0.2.0 release edit: I now realize this refers to hardware v1.5 :) |
OK, I will update to the release version and will leave it overnight. |
I connected the trigger to the EN_PWR_CH7, falling edge; left it in the standby state. Let's see if something happens overnight. |
👍 you may find it needs a day or two to hit this issue. Check the service logs in the morning. |
I've just noticed the correlation and the reason why it reboots. It happens when I stand up. The humidifier ran out of water and I generate massive ESD every time I sit down or stand up :D. The Booster is open and has multiple cables connected to the test points; they act as an antenna.... |
We have -15deg outside now and the air is very dry... |
it might be possible that these I2C errors are also ESD-induced... |
Do they occur also when nobody is in the same room? |
Yes |
Nothing happened either overnight or during the day. I left it working. |
It can take a while. Did you check the |
Yes, nothing suspicious. I left it in another room to not disturb it with ESD :) |
After 2 days I got a reboot. Watchdog Detected : false However, no I2C traffic was caught because the trigger came long after the event |
@ryan-summers Did you do tests with RJ45 cable plugged in? |
I can't recall if it was plugged in for all of my tests, but a vast majority of the time I believe I had an RJ45 connector plugged in. I could have tested conditions where it was both present and not present, but I no longer recall. |
I left it working for a few days, but with RJ45 unplugged. The only errors I detected were caused by me (ESD discharge) |
@gkasprow to confirm: if you connect on the USB serial and run a |
It says what I pasted above, but the Booster was rebooted by the watchdog. |
I don't think that's the watchdog, but anyway thanks for the report |
Can you try calibrating all channels to 50mA please? Can you reboot the unit (to clear the logs) and enable all channels (so you see green LEDs on all channels). For good measure you might as well set an IP address and plug the ethernet in (I don't think that matters, but it would make your setup more like mine). Hopefully then you will see this issue... |
true, I got similar reports with watchdog : true. |
Also if you enable all channels then you have an obvious indication of problems: if the unit reboots for any reason the LEDs will go yellow (since the channels don't turn on at startup by default). |
I enabled all channels and used the channel disable signal as a scope trigger. |
Did you run the calibration script? |
no |
Okay, try without first. If nothing happens for a few days we can try running the cal routine, but I don't think that should be relevant. |
@gkasprow I'm keen to get this fixed. So, how about this. See if you can reproduce the issue with your Booster. If you haven't seen a crash by mid next week, I will ship you one of our Boosters that is set up correctly to demonstrate this issue. |
Closing. This was discovered to be an erratum in the STM32. We have still observed NACKs, but that's not what this issue is directly related to. |
Under normal Booster operation (no input RF, no output RF), the I2C bus occasionally encounters communication failures when using the new Quartiq firmware (which does not retry I2C transactions and instead logs the fault and resets).
Observed faults:
There may be more I2C faults; these are the only two devices that firmware communicates with regularly while channel states remain static. No fault has yet been observed when communicating with the I2C mux.
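To make the "no retry, log and reset" behaviour concrete, here is a minimal sketch contrasting a fail-fast policy with a bounded retry. It is an illustration only and not the Quartiq firmware's actual code: `I2cError`, `Outcome`, and `transact` are hypothetical names, and the choice of attempt count is an assumption.

```rust
/// Hypothetical I2C error kinds; a real HAL defines its own error type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum I2cError {
    Nack,
    Bus,
}

/// Result of a transaction under a given retry policy.
#[derive(Debug, PartialEq)]
pub enum Outcome<T> {
    /// Succeeded on the first attempt.
    Clean(T),
    /// Succeeded, but only after `failures` failed attempts.
    /// A retrying firmware would hide these unless it logs them.
    Recovered { value: T, failures: u32 },
    /// Every attempt failed; the caller should log the fault and reset.
    Fault(I2cError),
}

/// Runs `op` up to `max_attempts` times (at least once).
/// `max_attempts == 1` models the fail-fast policy described above.
pub fn transact<T, F>(max_attempts: u32, mut op: F) -> Outcome<T>
where
    F: FnMut() -> Result<T, I2cError>,
{
    let attempts = max_attempts.max(1);
    let mut failures = 0;
    loop {
        match op() {
            Ok(value) if failures == 0 => return Outcome::Clean(value),
            Ok(value) => return Outcome::Recovered { value, failures },
            Err(error) => {
                failures += 1;
                if failures >= attempts {
                    return Outcome::Fault(error);
                }
            }
        }
    }
}
```

With a single attempt, a lone NACK surfaces immediately as `Outcome::Fault`, which is why sparse bus errors become visible as resets; with more attempts the same event would show up only as `Outcome::Recovered`.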
Reference issues quartiq/booster#140 and quartiq/booster#128 for more information