I2C encountering bus errors #379

Closed
ryan-summers opened this issue Jan 29, 2021 · 54 comments

@ryan-summers

Under normal Booster operation (no input RF, no output RF), I2C communication occasionally fails when running the new Quartiq firmware (which does not retry I2C transactions and instead logs the fault and resets).

Observed faults:

  • While configuring fan controller speeds, NACKs have been encountered when communicating with the MAX6639
  • While reading RF channel temperature, BUS FAULT (unexpected start/stop condition) errors have been observed

There may be more I2C faults; these are the only two devices that firmware communicates with regularly while channel states remain static. No fault has yet been observed when communicating with the I2C mux.

Reference issues quartiq/booster#140 and quartiq/booster#128 for more information
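
The firmware discussed here deliberately does not retry: it logs the fault and resets. For contrast, here is a minimal sketch of what a bounded retry that still records every fault could look like. `I2cFault` and `with_retries` are hypothetical names chosen for illustration and are not taken from the Booster firmware; the two variants only mirror the NACK and BUS faults listed above.

```rust
/// Hypothetical error type mirroring the two fault classes reported above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum I2cFault {
    /// The addressed device did not acknowledge (reported for the MAX6639).
    Nack,
    /// Unexpected start/stop condition (reported on temperature reads).
    Bus,
}

/// Run `op` until it succeeds or `max_attempts` attempts have failed.
/// Every fault is appended to `log`, so retried errors are not hidden.
/// At least one attempt is always made, even if `max_attempts` is 0.
pub fn with_retries<T>(
    max_attempts: usize,
    log: &mut Vec<I2cFault>,
    mut op: impl FnMut() -> Result<T, I2cFault>,
) -> Result<T, I2cFault> {
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Ok(value) => return Ok(value),
            Err(fault) => {
                log.push(fault);
                if attempt >= max_attempts {
                    // Out of attempts: report the most recent fault.
                    return Err(fault);
                }
            }
        }
    }
}
```

Retrying can mask a marginal bus, which is presumably why the firmware prefers to log and reset; keeping the fault log is one way to retain that visibility.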

@hartytp
Collaborator

hartytp commented Jan 29, 2021

@gkasprow are you okay to have a look at this using the new Quartiq firmware? AFAICT this is the only real usability issue remaining with Booster, so it would be good to prioritize it.

@ryan-summers may have specific suggestions for diagnostics to perform, but I guess the first thing is just to probe the I2C signals with Booster running and check timing, rise times, noise levels, etc.

@ryan-summers
Author

It's also worth mentioning that the errors appear to be quite sparse (~1 every few hours), so this may require hooking up a scope/analyzer that can trigger off I2C bus conditions so that we can see what's happening on the bus electrically when the faults occur. The simplest approach is probably to start with the NACK on the chassis fans.

@gkasprow
Member

I have a decent scope with an I2C analyzer. Is there a way to toggle a free CPU IO when the error condition occurs?

@ryan-summers
Author

ryan-summers commented Jan 29, 2021

If you enable an RF channel, SIG_ON and EN_PWR will both disable shortly after encountering the fault. I don't know if there are easy probe points on the HW for those signals, but that would be a trigger mechanism that doesn't require custom firmware.

@gkasprow
Member

The question is whether such an error can be detected by firmware. It could be a NACK, for example. In that case you could just toggle an IO to trigger the scope.

@ryan-summers
Author

That's what I'm trying to say - firmware does detect the issue, and firmware does toggle an IO that you can trigger off of - it would just be the SIG_ON signal of a channel, since firmware will disable channels whenever the fault occurs
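
As noted above, the stock firmware already provides a usable trigger via SIG_ON. Purely to illustrate the dedicated-IO idea raised in the question, here is a minimal sketch that pulses a spare pin whenever an operation returns an error. `TriggerPin`, `trigger_on_fault` and `RecordingPin` are hypothetical names; a real build would use the pin type from its HAL instead of this stand-in trait.

```rust
/// Hypothetical stand-in for a GPIO output pin.
pub trait TriggerPin {
    fn set_high(&mut self);
    fn set_low(&mut self);
}

/// Pulse `pin` if `result` is an error, then pass the result through unchanged.
pub fn trigger_on_fault<T, E, P: TriggerPin>(
    pin: &mut P,
    result: Result<T, E>,
) -> Result<T, E> {
    if result.is_err() {
        // Rising then falling edge; a scope can trigger on either.
        pin.set_high();
        pin.set_low();
    }
    result
}

/// Test double that records every level written to the pin.
#[derive(Default)]
pub struct RecordingPin {
    pub levels: Vec<bool>,
}

impl TriggerPin for RecordingPin {
    fn set_high(&mut self) {
        self.levels.push(true);
    }
    fn set_low(&mut self) {
        self.levels.push(false);
    }
}
```

On real hardware the pulse may need a short delay between the two edges to be wide enough for the scope to catch; that is left out here.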

@gkasprow
Member

ok, thanks :)

@gkasprow
Member

can you send me a recent firmware version so I can give it a try?

@ryan-summers
Author

https://github.com/quartiq/booster/actions/runs/509241722 should have the latest firmware images attached as an artifact - information on how to flash is provided in the README

@hartytp
Collaborator

hartytp commented Feb 8, 2021

@gkasprow while you're waiting for new amp chips to arrive, could you have a look at this issue? This is the main thing blocking us on Booster right now.

@gkasprow
Member

gkasprow commented Feb 8, 2021

Do I need to apply the RF power and load to recreate the issue?

@gkasprow
Member

gkasprow commented Feb 8, 2021

I will reply to myself - no :)

@hartytp
Collaborator

hartytp commented Feb 8, 2021

No.

@hartytp
Collaborator

hartytp commented Feb 8, 2021

@ryan-summers / @jordens may have more steer here, but IMHO the first thing is to check the voltage levels/noise/timing on the I2C bus and see if anything looks marginal. IME these rare issues are usually a result of some specification not being met.

@gkasprow
Member

gkasprow commented Feb 8, 2021

This is what I want to do, but recreating the issue is critical to make sure the unit I have suffers from the same illness.

@hartytp
Collaborator

hartytp commented Feb 8, 2021

On the unit I'm testing if you leave it long enough (circa a day with the current firmware IIRC) with all channels enabled and no RF applied it will hit this issue, panic and restart.

@ryan-summers
Author

> This is what I want to do, but recreating the issue is critical to make sure the unit I have suffers from the same illness.

I was able to see this with only two channels installed and without any inputs or outputs connected, and it was very reproducible. However, it often took a fairly long operating time before I observed a fault (~24-48 hours between each fault).

Reproduction steps to make it simple:

  1. Load new firmware onto Booster
  2. Enable channel 1.
  3. Save configuration in Booster such that channel 1 enables on boot
  4. Disable all channels
  5. Leave booster for the night (to wait for fault to occur)
  6. Observe later that channel 1 will be enabled, indicating Booster encountered a fault

Once channel 1 is seen to be enabled, you can verify that a fault occurred by connecting to the USB port and entering the service command. The command will output information about a panic related to I2C communication.

@gkasprow
Member

gkasprow commented Feb 8, 2021

I connected the scope to SDA and SCL, and I trigger it on the falling edge of SIG_ON_CH7. I enabled all channels. Will the issue appear in such a configuration? If not, I will modify the setup.

@ryan-summers
Author

ryan-summers commented Feb 8, 2021

I would assume that it should trip just fine in that setup (but note I have not tested that exact configuration myself).

My analysis indicated that the fault was independent of RF input/output and channel configurations.

@gkasprow
Member

gkasprow commented Feb 8, 2021

I managed to download the firmware and upgrade the Booster. I tried with Windows DfuSe but for some reason, it didn't work with the binaries from releases. However, it works fine with Thermostat binaries. Never mind. I upgraded the CPU; previously it ran an open-source firmware, and the EEPROMs contain the original calibration data.
Do I need to calibrate the channels? The yellow LEDs are on together with the green ones, and Interlock reset does not clear them. This is what I would expect when DAC offsets are not calibrated.
What is the default voltage applied to the 3rd power amplifier gate? A wrong voltage may overheat the power stage.

@ryan-summers
Author

> I managed to download the firmware and upgrade the Booster. I tried with Windows DfuSe but for some reason, it didn't work with the binaries from releases. However, it works fine with Thermostat binaries. Never mind. I upgraded the CPU; previously it ran an open-source firmware, and the EEPROMs contain the original calibration data.
> Do I need to calibrate the channels? The yellow LEDs are on together with the green ones, and Interlock reset does not clear them. This is what I would expect when DAC offsets are not calibrated.
> What is the default voltage applied to the 3rd power amplifier gate? A wrong voltage may overheat the power stage.

All of these behaviors are expected. By default, all channels are configured with a very low interlock threshold, which causes them to trip before any configuration is applied; this is intended as a default-safe behavior. Any existing calibration data stored in channel EEPROM is lost upon updating the firmware.

@gkasprow
Member

gkasprow commented Feb 8, 2021

OK, what is the default gate voltage?
I've noticed that the Booster reset itself once (the fans started to spin and the LEDs went off). Is that the result of the I2C error?

@gkasprow
Member

gkasprow commented Feb 8, 2021

I typed "service" and received
Version : v0.2.0
Git revision : eb2c887cb0e3fd9a7a40db5056660538c5fbb5d1-dirty
Features :
Panic Info : None
Watchdog Detected : true

@ryan-summers
Author

You can get the cause of the reset by connecting to the USB serial terminal through Booster's front panel and typing service - this will show any fault that caused a reset. Default power amplifier bias voltage is -3.2V.

It looks like your booster encountered a watchdog reset - we haven't observed this behavior with any of the other devices. What hardware version are you running on? In any case, that's not the error we're discussing here. Also, it looks like the version of your build was a dirty git repo, so it's not entirely clear that you're running the latest release firmware. Where did you procure the image from?

@gkasprow
Member

gkasprow commented Feb 8, 2021

It's booster-debug.bin from the recent release. I thought that the debug release generates some useful debug data...

@gkasprow
Member

gkasprow commented Feb 8, 2021

I'm using the 1.5 revision

@ryan-summers
Author

ryan-summers commented Feb 8, 2021

> It's booster-debug.bin from the recent release. I thought that the debug release generates some useful debug data...

I'll look into why it's showing the build as dirty tomorrow. I would recommend using the release version - debug just provides debug symbols, but I don't think it's going to be useful in this analysis.

> I'm using the 1.5 revision

Logs and git revision look like you're using the most recent 0.2.0 release

edit: I now realize this refers to hardware v1.5 :)

@gkasprow
Member

gkasprow commented Feb 8, 2021

OK, I will update to the release version and will leave it overnight.

@gkasprow
Member

gkasprow commented Feb 8, 2021

I connected the trigger to the EN_PWR_CH7, falling edge; left it in the standby state. Let's see if something happens overnight.
I was looking at the I2C transactions and the levels are just fine.

@hartytp
Collaborator

hartytp commented Feb 8, 2021

👍 you may find it needs a day or two to hit this issue. Check the service logs in the morning.

@gkasprow
Member

gkasprow commented Feb 8, 2021

I've just noticed the correlation and the reason why it reboots. It happens when I stand up. The humidifier ran out of water and I generate massive ESD every time I sit down or stand up :D. The Booster is open and has multiple cables connected to the test points; they act as antennas...

@gkasprow
Member

gkasprow commented Feb 8, 2021

We have -15deg outside now and the air is very dry...

@gkasprow
Member

gkasprow commented Feb 8, 2021

it might be possible that these I2C errors are also ESD-induced...

@gkasprow
Member

gkasprow commented Feb 8, 2021

Do they occur also when nobody is in the same room?

@hartytp
Collaborator

hartytp commented Feb 8, 2021

Yes

@gkasprow
Member

gkasprow commented Feb 9, 2021

Nothing happened overnight or during the day. I left it running.

@hartytp
Collaborator

hartytp commented Feb 9, 2021

It can take a while. Did you check the service log?

@gkasprow
Member

gkasprow commented Feb 9, 2021

Yes, nothing suspicious. I left it in another room to not disturb it with ESD :)

@gkasprow
Member

After 2 days I got a reboot.
The status is:
service
Version : v0.2.0
Git revision : eb2c887cb0e3fd9a7a40db5056660538c5fbb5d1-dirty
Features :
Panic Info : panicked at 'called Result::unwrap() on an Err value: Interface(BUS)', src/rf_channel.rs:798:14

Watchdog Detected : false

However, no I2C traffic was captured because the trigger came long after the event.
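
The two service dumps in this thread differ only in the `Panic Info` and `Watchdog Detected` fields, so telling a firmware panic from a watchdog reset can be done mechanically. Here is a minimal host-side sketch, assuming the `key : value` layout shown in the pasted output; `ResetCause` and `classify_service_output` are hypothetical names and not part of any Booster tooling.

```rust
/// Hypothetical classification of what the service output reports.
#[derive(Debug, PartialEq, Eq)]
pub enum ResetCause {
    /// A firmware panic, carrying the recorded message.
    Panic(String),
    /// No panic recorded, but the watchdog flag is set.
    Watchdog,
    /// Neither field indicates a fault.
    Unknown,
}

/// Classify the text printed by the `service` command.
pub fn classify_service_output(text: &str) -> ResetCause {
    let mut panic_info: Option<String> = None;
    let mut watchdog = false;
    for line in text.lines() {
        // Split on the first ':' only; panic messages contain more colons.
        if let Some((key, value)) = line.split_once(':') {
            let value = value.trim();
            match key.trim() {
                "Panic Info" if value != "None" && !value.is_empty() => {
                    panic_info = Some(value.to_string());
                }
                "Watchdog Detected" => watchdog = value == "true",
                _ => {}
            }
        }
    }
    // A recorded panic takes precedence over the watchdog flag.
    match panic_info {
        Some(message) => ResetCause::Panic(message),
        None if watchdog => ResetCause::Watchdog,
        None => ResetCause::Unknown,
    }
}
```

Giving the panic precedence over the watchdog flag is a design choice of this sketch, not documented firmware behavior.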

@gkasprow
Member

@ryan-summers Did you run your tests with the RJ45 cable plugged in?

@ryan-summers
Author

I can't recall if it was plugged in for all of my tests, but a vast majority of the time I believe I had an RJ45 connector plugged in. I could have tested conditions where it was both present and not present, but I no longer recall.

@gkasprow
Member

I left it running for a few days, but with the RJ45 unplugged. The only errors I detected were caused by me (ESD).

@hartytp
Collaborator

hartytp commented Feb 18, 2021

@gkasprow to confirm: if you connect on the USB serial and run a service command, what does it say?

@gkasprow
Member

It says what I pasted above, but the Booster was rebooted by the watchdog.

@hartytp
Collaborator

hartytp commented Feb 18, 2021

I don't think that's the watchdog, but anyway thanks for the report

@hartytp
Collaborator

hartytp commented Feb 18, 2021

Can you try calibrating all channels to 50mA please? Can you reboot the unit (to clear the logs) and enable all channels (so you see green LEDs on all channels). For good measure you might as well set an IP address and plug the ethernet in (I don't think that matters, but it would make your setup more like mine). Hopefully then you will see this issue...

@gkasprow
Member

True, I got similar reports with watchdog : true.
The Booster rebooted in my presence. I will do what you proposed.

@hartytp
Collaborator

hartytp commented Feb 18, 2021

Also if you enable all channels then you have an obvious indication of problems: if the unit reboots for any reason the LEDs will go yellow (since the channels don't turn on at startup by default).

@gkasprow
Member

I enabled all channels and used the channel disable signal as a scope trigger.

@hartytp
Collaborator

hartytp commented Feb 18, 2021

Did you run the calibration script?

@gkasprow
Member

no

@hartytp
Collaborator

hartytp commented Feb 18, 2021

Okay, try without first. If nothing happens for a few days we can try running the cal routine, but I don't think that should be relevant.

@hartytp
Collaborator

hartytp commented Feb 19, 2021

@gkasprow I'm keen to get this fixed. So, how about this. See if you can reproduce the issue with your Booster. If you haven't seen a crash by mid next week, I will ship you one of our Boosters that is set up correctly to demonstrate this issue.

@ryan-summers
Author

Closing. This was discovered to be an erratum in the STM32. We have still observed NACKs, but that's not what this issue is directly related to.
