Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed arbitration errors from socket error mask #264

Closed
wants to merge 3 commits into from

Conversation

MCFurry
Copy link

@MCFurry MCFurry commented Jan 18, 2018

Since arbitration errors are not fatal, removed them from the mask

@mathias-luedtke
Copy link
Member

mathias-luedtke commented Jan 18, 2018

Since arbitration errors are not fatal

IMHO lost arbitration errors are fatal. They should not occur on a well running CAN bus.

In any case these error should get reported.
The actual handling is done herehere, so this would be the right location to customize the fatality of errors.

I could imagine to add a whitelist somehow.
However, at the moment there is no way to pass additional parameters (#104).

@MCFurry
Copy link
Author

MCFurry commented Jan 18, 2018

From what I understand, arbitration lost errors are common on a CAN bus (https://www.cancapture.com/article/arbitration-lost-error-messages). Whenever the ROS node tries to send a message on the CAN bus at the same time another node tries to send a message with higher priority it loses arbitration.
I agree they could be reported, but the problem now is the socketcan_bridge goes into error state telling it is unable to send messages on the bus. However manually sending them works fine and when neglecting the error as I did in my PR, the node runs fine and communication continues perfectly.
Isn't going into error state and stopping communication completely overreacting?

@mathias-luedtke
Copy link
Member

mathias-luedtke commented Jan 18, 2018

From what I understand, arbitration lost errors are common on a CAN bus

This really depends on your use case.

Isn't going into error state and stopping communication completely overreacting?

This is the safest reaction ;)
At least for our CANopen busses this works quite well.

I agree they could be reported, but the problem now is the socketcan_bridge goes into error state telling it is unable to send messages on the bus.

With "reporting" I mean passing the error frame to other layers. Your patch prevents this.
I am sorry, I have linked the wrong code lines, the error state is entered here.
It is okay to apply a whitelist filter there, but is must be configurable configurable and not discard errors by default.

@MCFurry
Copy link
Author

MCFurry commented Jan 25, 2018

Are you sure that the devices with which you tested actually report these arbitration errors?

We tested a PEAK usb to CAN converter for example and this device simply doesn't report arbitration errors at all. Another device we are using (integrated CAN controller in a SOC) does however which is how we found out the problem.

@DasRoteSkelett
Copy link

Hi!

I am with MCFurry. Some chips seem to report Arbitration errors, while some do not even bother. And the message gets sent anyways (confirmed with sja1000 chip, which reported Arb Lost, while message could be received on Peak CAN Usb controller, hence, this error is by no means fatal, no need, to recover anything. Please accept PR...

Copy link
Member

@mathias-luedtke mathias-luedtke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you want to prevent the driver from stopping on lost arbitration errors.
However, this patch fixes this in the wrong location.
And in addition I'd like the driver to stop per default with an opt-out. Rationale behind: Keep the old behavior!

So, please add the proper handling in line 205 instead.

@DasRoteSkelett
Copy link

Hey @ipa-mdl,

I am not sure if I got you right on this. I do not consider the LOSTARB an error, rather an information, since the packet gets sent, just delayed shortly. If the bus gets flooded, this might be a hint, but usually, this means, my message gets delayed by one CAN Message. Most drivers will not even bother to report this. Therefore, the "old" behaviour is, on sja1000 based chips, socketcan_interface is not usable. From my point of view, I would even consider this a bug.

I wonder, why CAN_ERR_BUSERROR is not considered an error, although it is used more widely throughout the linux CAN drivers.
I would prefer, that I do not even have to bother with receiving and interpreting a CAN_ERR_LOSTARB frame, since it happens quite frequently (25% bus load, about 0.5% CAN_ERR_LOSTARB). Of course, this could be made configurable, maybe with an override to init(). This involves either exposing can/error.h or some constants to the user. And, of course, it should be documented, that for sja1000 cards, you should call init this way if you do not want to restart CAN every couple of seconds...

To make a long story short, I am fine with just patching socketcan_interface, since I install it with a yocto recipe anyways. It is just that other users might want to save the hassle to debug and look through issues and pull requests.

Maybe @MCFurry will do the job?!

Regards

@MCFurry
Copy link
Author

MCFurry commented Feb 2, 2018

I made the error handling of arbitration errors now configurable as a ROS parameter ("lost_arbitration_is_error").
By default it still has the old behavior, but the user can choose to ignore those errors instead.
Please review if this is the proper handling you're suggesting.

@MCFurry
Copy link
Author

MCFurry commented Feb 5, 2018

We figured a boolean is a bit of a rough parameter.
The user can now select the log-level for arbitration errors: error,warning,debug or info.
By default it is on Error, so with existing launch files everything still has the old behavior.

So now you can optionally configure the parameter "lost_arbitration_reporting_level" to have values:
ARBITLOST_IS_DEBUG
ARBITLOST_IS_INFO
ARBITLOST_IS_WARN
ARBITLOST_IS_ERROR

@MCFurry
Copy link
Author

MCFurry commented Feb 15, 2018

@ipa-mdl Are these the changes you aimed for?

@mathias-luedtke
Copy link
Member

@MCFurry: Thanks for your additions. I was quite busy in the last weeks. I will try to review it tomorrow.

Copy link
Member

@mathias-luedtke mathias-luedtke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid I cannot merge this as-is.

If you want me to, I can give a full review with comments for all issues.
But I don't think that this would help a lot right now.

First of all, I would not mix the error mask feature and the reporting in socketcan_bridge in one PR. This way we can find a way to handle the arbitration problems you face.

I'd like to fix it the following way:

  • introduce an API to set the error mask (=error frame reported to the application/driver)
  • introduce an API to set the handling behavior (stopping on critical error)

The API should be:

  • backwards-compatible (if possible without ABI breaks)
  • properly encapsulated (no public variables!)
  • integrated with the canopen packages (later..)

@Entropy512
Copy link

Would a mask variable that determined whether to warn or die with error be sufficient? There's the human factors/documentation issue of having a numeric mask instead of named options, but using a mask would be much simpler.

I'm not sure if there's a need to change the low-level error mask as long as the error handling behavior in readFrame can be made less aggressive. Also, not just arbitration errors but protocol violations (e.g. error frame due to an EMI transient such as an electromagnetic brake dropping) should be options. For example, I'm working with a system where the ROS instance ideally does not reboot if the vehicle is quickly rekeyed (e.g. off then back on within a certain number of seconds) - but we always get two error frames on the bus shortly after powerup. The current systems on the bus are not negatively affected by this powerup behavior but the ROS node without masking CAN_ERR_PROT goes out to lunch until restart.

@mathias-luedtke
Copy link
Member

Would a mask variable that determined whether to warn or die with error be sufficient?

This is more less what I want at code level. There you can even build the mask our of the kernel constants.

There's the human factors/documentation issue of having a numeric mask instead of named options

The param could be a numeric or a string list, that's just a matter of the parser :)

@Timple
Copy link

Timple commented Apr 1, 2018

Would a mask variable that determined whether to warn or die with error be sufficient?

In my opinion info and debug are still useful. I disagree with the statement that 'lost arbitration errors are fatal'.
For example if all nodes send heartbeats but the clocks are (very slightly) misaligned, at some point a collision will occur. Which is no big deal because the node that lost the arbitration (lower prio node) will just resend. And everything is fine.

Or we misunderstand the arbitration lost concept, but then I'd be interested why it should be a fatal error.

@mathias-luedtke
Copy link
Member

Or we misunderstand the arbitration lost concept, but then I'd be interested why it should be a fatal error.

It might be fatal in some use cases. Do all devices try to send the message again after this error occured?

@Timple
Copy link

Timple commented May 29, 2018

Is wikipedia a trusted source here? :-)
https://en.wikipedia.org/wiki/CAN_bus

A node that loses arbitration re-queues its message for later transmission

I cannot access the official CANopen specification documents because of a required login. However their can_dictionary document also states:

If at the very same moment several
nodes try to access the bus, an arbitration
process is necessary to control which
node may transmit while the oth-er nodes
have to delay their transmis-sion.

So they delay their transmission, meaning sending later.

Do you have an example or experience of nodes which does not re-transmit?

@mathias-luedtke
Copy link
Member

mathias-luedtke commented May 29, 2018

@Timple: Thanks for the clarification!
Wikipedia is trusted enough for now ;)
Even with a login there is no document on the CAN bus itself because it's an SO standard.

I would suggest to blacklist this failues in the handler, so it won't stop the network anymore.
However, it should log the error and show an error in ROS (without configuration options for now).

More fine-grained contol can be added later if needed.

@koenlek
Copy link

koenlek commented Jun 5, 2018

For us, being able to make CAN_ERR_LOSTARB and CAN_ERR_CRTL non-fatal (via e.g. a rosparam) would be great!

@MCFurry: Could you consider adding CAN_ERR_CRTL to this PR too as a configurable error?

For more info, see #285

@mathias-luedtke
Copy link
Member

While CAN_ERR_LOSTARB might be just an indication, CAN_ERR_CRTL reports some serious problems with either sending or receiving or both.

@koenlek
Copy link

koenlek commented Jun 5, 2018

While CAN_ERR_LOSTARB might be just an indication, CAN_ERR_CRTL reports some serious problems with either sending or receiving or both.

I think in practice we have to accept that we cannot be that strict. We are dealing with a vehicle from an external supplier. When we communicate on its CAN bus we get both CAN_ERR_LOSTARB and CAN_ERR_CRTL errors. Its a long CAN network with many clients (of which some are branched too far off, making it almost more like a star network than a bus). We cannot fix that, that's just how the vehicle is. Manufacturers simply often don't make things according to the best practices & standards, and we have to be able to cope with that. Making the escalation of errors configurable seems like a sensible choice.

However, the split of code into ROS and ROS agnostic parts make it a bit cumbersome to pass rosparam configurations into the scope of socketcan_interface/include/socketcan_interface/socketcan.h

@Timple
Copy link

Timple commented Jul 23, 2018

I agree with ipa-mdl that a CAN_ERR_CRTL is more severe and should not be easily ignored (but it should be possible to do so e.g. in the use case of koenlek.

But the arbitration errors are still an issue. Today I was working on a different vehicle / different canbus etc with a canbusload of 31% on average.

Now I try to send values over the canbus with a high canid at a 10hz rate. Of course after a couple of seconds I get the lost arbitration error again. Statistically this makes sense since if I send the values at random, I have only a 69% chance of the bus being clear for transmission. (If I understand the statistics correctly).

And I only have this problem with our (very special I am beginning to think now) can device. An other usb-can dongle works perfectly because it simply does not report the arbitration lost number.

@ipa-mdl : Can you verify or acknowledge that you ever had an arbitration lost error which was actually an issue? I kind of get the feeling your devices don't report them. (Either that or I don't understand the statistics).

edit: Still looking for time / permission / priority to implement this properly, but I rather just add the arbitration lost to the mask as initially proposed here if it is a non-issue.

@Timple
Copy link

Timple commented Oct 17, 2019

So, a year has passed.
More forks have appeared that ignore this arbitration error.

Which has a very unfortunate naming because it is not an error (see my posts above, it is a matter of statistics if you have multiple unsynchronised publishers).

What do we need to do to get this error ignored? And how do we identify people with actual usecases which do require this setting? In my opinion the two slashes in the original commit is all it takes.

@DasRoteSkelett
Copy link

I completely agree with @Timple,
@ipa-mdl look at the source of the drivers in the kernel, I also tested that the message is sent anyways.
This is a clear mistake to treat this as Error.

@mathias-luedtke
Copy link
Member

@hartkopp: Thank you very much for your explanations!
In order to log the information (in ROS), we have to keep the bit in the error mask, otherwise they will not get reported at all.

Of course, not all error frames are fatal errors in all cases..
However, in my experience, protocol errors are a reliable indicator for bad wiring / connection and gradual failures of controllers. That is why they are handled conservatively.
Better safe than sorry, especially if machines get shipped to non-expert customers.

That being said, we need to improve the API/config for users, which have to suppress these error frames for any reason.

@MCFurry: My review still holds.

Instead of these new functions and so many clause etc., I prefer to keep it much simpler:

  • one bit mask for errors to just warn about (using the new logging functions), default: 0
  • one bit mask for errors to ignore (these we can remove from the error mask), default: 0
  • treat the rest as errors

We can discuss changing the defaults for ROS noetic.

@hartkopp
Copy link

That being said, we need to improve the API/config for users, which have to suppress these error frames for any reason.

Yes. Either the CAN_ERR_LOSTARB and CAN_ERR_BUSERROR might create a remarkable load and therefore should be disabled by the in-kernel filters in normal operation. Filtering these in user space will still create the load to read content from sockets (kernel space -> user space) and check (or log) that information.

The only real problem that should be taken care of is CAN_ERR_BUSOFF as it disables the CAN controller until it is either reset by the user or by some auto restart functionality (if enabled at configuration time).

@mathias-luedtke
Copy link
Member

Either the CAN_ERR_LOSTARB and CAN_ERR_BUSERROR might create a remarkable load and therefore should be disabled by the in-kernel filters in normal operation

This does not matter if the socket gets closed after the first error frame anyway ;)
So the warning mask does not make much sense..

The only real problem that should be taken care of is CAN_ERR_BUSOFF

I see, so we should not allow to disable it.
(Until support for CAN_ERR_RESTARTED was iplemented)

@hartkopp
Copy link

Either the CAN_ERR_LOSTARB and CAN_ERR_BUSERROR might create a remarkable load and therefore should be disabled by the in-kernel filters in normal operation

This does not matter if the socket gets closed after the first error frame anyway ;)
So the warning mask does not make much sense..

Argh. That was not the plan when implementing the mask.
Or was it ironic?

@mathias-luedtke
Copy link
Member

Argh. That was not the plan when implementing the mask.
Or was it ironic?

It was ironic, but not in this sense.
The driver really closes the socket (and shuts down the application logic) on every error frame, for historical and practical reasons.
At least for our robots (different dongles, different CAN controllers) this was the most robust option.
(In the past one manipulator even shutdown all communication on emergeny stops.)

@mathias-luedtke
Copy link
Member

@MCFurry: Please have a look at #362

@MCFurry
Copy link
Author

MCFurry commented Oct 24, 2019

@MCFurry: Please have a look at #362

Thx @ipa-mdl for picking this up! Much appreciated.
I'm a bit busy coming days but I'll try to help you out.

@Timple
Copy link

Timple commented Oct 28, 2019

@ipa-mdl: thanks for the work on #362

However I'm still not convinced for the usecase to close a socket connection on an arbitration error. How can this be the most robust option? What would happen if you kept listening?
Just trying to understand the usecase. Of course we would already be glad with an option to mask everything which is not (meant as) an error according to @hartkopp

Timple added a commit to nobleo/ros_canopen that referenced this pull request Oct 28, 2019
@mathias-luedtke
Copy link
Member

However I'm still not convinced for the usecase to close a socket connection on an arbitration error

As I said, we can come up with a better approach.

How can this be the most robust option?

Stopping for all errors makes the hardware wiring more robust ;)

What would happen if you kept listening?

For the lost arbitration error I don't know.
It looks like our USB dongles don't signal them. And they should me less common for CANopen masters, but this depends on the configuration.
On reason for lost arbitration could be tha

But I can tell you what would happen in case of TX overflow:
Either the network catches - then we would "just" loose control for some time.
Or the bus will go off ultimately.

If you want to do some synchronous motion in a highly constrained environment, then you just don't want to loose control.
If you would not stop immediately while prototyping, the problem will go unnoticed and might arise at your next demonstration.
IMHO ignoring warnings (and errors) must be usecase-specific, deliberate decision.
(Until now there was no choice, but only the safe option)

@hartkopp
Copy link

Stopping for all errors makes the hardware wiring more robust ;)

I would rephrase it: Stopping for relevant errors may help the user to identify potential hardware-related problems.

For the lost arbitration error I don't know.

I know! Loosing arbitration is normal in CAN operations.
https://en.wikipedia.org/wiki/CAN_bus#Data_transmission
"When this happens, the node with the ID of 16 knows it transmitted a 1, but sees a 0 and realizes that there is a collision and it lost arbitration. Node 16 stops transmitting which allows the node with ID of 15 to continue its transmission without any loss of data."

Is there anything left to be clarified?

It looks like our USB dongles don't signal them.

Which means what? Does it mean when people use a different CAN hardware they might get a different behaviour in normal CAN operation as you treat an "arbitration lost notification" as an error and stop operation?

@DasRoteSkelett
Copy link

@hartkopp ,

I fully agree, that is why a for was needed or rather a patch. Even when CAN_ERR_LOSTARB is signalled, the message gets sent on the bus. This is purely information an closing the socket on this is clearly wrong. As much as I agree on "fail early, fail hard" to make mistakes visible: This does clearly not fall into this domain.

Instead of wasting time to improve upstream (which is not appreciated, that is my feeling), it is sad but one has really to consider forking here.

Regards.

@hartkopp
Copy link

Instead of wasting time to improve upstream (which is not appreciated, that is my feeling), it is sad but one has really to consider forking here.

Argh ...

I would prefer convincing @ipa-mdl by technical means to follow the technical correct approach. Hint hint ;-)

@mathias-luedtke
Copy link
Member

I usually tend to treat all warnings as error unless there is a good reason to suppress them.
#362 now adds a suppression feature.

For the arbitration lost warning, it seems to make sense to suppress this by default.
However, the other warnings I don't want to suppress (by default), because at least my code is relying on the current behavior.

In general, ROS melodic is a LTS distro, the code was released 1.5 years ago.
I am okay with providing an opt-out mechanism (and opt-in for arbitration lost).
For noetic and ROS2 we can discuss better alternatives.

Is there anything left to be clarified?

@hartkopp:
In which cases can messages get lost or significantly (>10ms) delayed?
Which errors/warning could indicate underlying hardware/wiring problems?

Or the other way around: Which "errors" are purely informational, will not affect the control loop and can never be the result of hardware/wiring problems?

@mathias-luedtke
Copy link
Member

@DasRoteSkelett, @Timple: It would be great if you could help @MCFurry and me to finalize #362.

@hartkopp
Copy link

hartkopp commented Nov 4, 2019

Or the other way around: Which "errors" are purely informational, will not affect the control loop and can never be the result of hardware/wiring problems?

CAN_ERR_LOSTARB

The CAN is really robust and even when some more serious errors show up it takes an amount of consecutive errors until the CAN controller gets into a BUS_OFF state and terminates operation.

When "fail early and hard" is your paradigm to detect problems and leads to a termination of your application, this is your choice. When I would run this in a (remote) production system I would probably write more serious errors to a logfile - but continue operation unless running into BUS_OFF. But that would be my choice ...

@Timple
Copy link

Timple commented Nov 6, 2019

If the general consensus is now to suppress arbritration errors by default, can the first commit of this PR still be accepted?

#362 would be a new feature where users have the ability to ignore error frames as desired. I can see if I can make some time available to look into #362, won't be on short notice though.

@mathias-luedtke
Copy link
Member

@hartkopp: Thanks! I see. I am okay with suppressing "lost arbitration" per default in #362.
However, there are still some use cases in which no arbitration lost warning should occur for the CAN master node.

One example is the Schunk SDH, which use a strict polling approach.
Lost arbitration would only occur if the master gets started twice and two CAN network have been connected by accident.

can the first commit of this PR still be accepted?

No, not without an opt-out, i.e. only as part of #362
And first for ROS melodic (target of active development) only. A backport might be possible, if we finalize #362 before I backport the ROS dashing fixes.

@hartkopp
Copy link

hartkopp commented Nov 7, 2019

Just for my personal curiosity:

One example is the Schunk SDH, which use a strict polling approach.

You can only poll in the receiving side right?

If so you will never see a lost arbitration as this is only visible in the sending CAN controller that had the arbitration lost effect.

Lost arbitration would only occur if the master gets started twice and two CAN network have been connected by accident.

Why? When sending the same CAN ID from two nodes (which is an illegal thing from CAN perspective) you will not get an arbitration lost state. Both CAN controllers GET the arbitration and start to transmit their data content. This finally leads to an error frame (when the CAN data content is different).

@Timple
Copy link

Timple commented Mar 31, 2020

With the last statement from @hartkopp it seems that there is no use case left for an arbitration error detection mechanism.

Can we merge the initial commit?
Introducing an opt-out as per #362 involves a lot of code (and thus risk?) for a feature with no use case.

@Timple
Copy link

Timple commented May 28, 2020

Just to keep track, since my last count another fork appeared with a fix
This really needs addressing.
Can we keep momentum on #388?

@DasRoteSkelett
Copy link

Hi!

@Timple I really have respect for your endurance in this topic. @ipa-mdl is not willing to be open to technical arguments though. Even looking at the kernel code of the driver and knowing, that even on an CAN_ERR_LOSTARB error the data still gets send ok does not convince him, it is time to fork or patch later, as others have noticed.

Regards,

Matthias

@Timple
Copy link

Timple commented Jun 2, 2020

Of course we're operating from our fork, none of our robots would last long with the socketcan interface halting on arbitration "errors".

The reason I'm making this an effort is two-fold:

  1. I don't like to compile this stack always from source (every robot, every developer, every commit [CI])

But most importantly:

  1. Every new user of this stack has to spend quite a lot of time (days) debugging what is going on with his/her canbus. Why this node is always failing. And finally have to trace down which line is doing this and create a fork again. This is just a waste of human effort.

@mathias-luedtke
Copy link
Member

@Timple: Sry, I did not have much time to work on this topic.. My plan is to have a solution ready (merged..) until World ROS-Industrial day (July 7) :)

@DasRoteSkelett: I am convinced that the driver should not stop on CAN_ERR_LOSTARB, but it should be still part of the error mask, which solely enables the driver to forward it to an error listener.

@Timple
Copy link

Timple commented Jun 16, 2020

Great! Having a deadline seems promising 🙂


it should be still part of the error mask

This means the error mask should be configurable right? So people can subscribe to arbitration errors if they desire so. Always forwarding to the error listener "might create a remarkable load and therefore should be disabled by the in-kernel filters in normal operation"

@Timple
Copy link

Timple commented Jul 2, 2020

@ipa-mdl: A subtle ping since the deadline is three working days from now.
Is there anything we could assist with?

@mathias-luedtke
Copy link
Member

superseded by #362

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants