Removed arbitration errors from socket error mask #264
IMHO lost arbitration errors are fatal. They should not occur on a well running CAN bus. In any case these errors should get reported. I could imagine adding a whitelist somehow. |
From what I understand, arbitration lost errors are common on a CAN bus (https://www.cancapture.com/article/arbitration-lost-error-messages). Whenever the ROS node tries to send a message on the CAN bus at the same time another node tries to send a message with higher priority it loses arbitration. |
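To illustrate the arbitration mechanism described in this comment, here is a minimal sketch of bitwise arbitration on 11-bit identifiers. It is a simplified model and not code from socketcan_interface: the function name `arbitrationWinner` is hypothetical, and only the identifier field is modelled. A dominant bit (0) overrides a recessive bit (1), so the lowest identifier wins and every other sender backs off and retries later.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of CAN arbitration over 11-bit identifiers.
// Returns the index of the winning sender in `ids`.
// Assumes `ids` is non-empty and all identifiers are distinct.
std::size_t arbitrationWinner(const std::vector<std::uint16_t>& ids) {
    std::vector<bool> active(ids.size(), true);
    for (int bit = 10; bit >= 0; --bit) {
        // The bus shows a dominant level (0) if any active sender drives 0.
        bool bus_recessive = true;
        for (std::size_t i = 0; i < ids.size(); ++i) {
            if (active[i] && ((ids[i] >> bit) & 1u) == 0) bus_recessive = false;
        }
        // A sender that wrote recessive (1) but reads dominant (0) has lost
        // arbitration: it stops transmitting and retries after this frame.
        for (std::size_t i = 0; i < ids.size(); ++i) {
            if (active[i] && !bus_recessive && ((ids[i] >> bit) & 1u) == 1) {
                active[i] = false;
            }
        }
    }
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (active[i]) return i;
    }
    return 0;  // only reached for empty input
}
```

For example, with senders using `0x123`, `0x7FF` and `0x100`, the third sender wins; the other two lose arbitration without any frame being corrupted.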
This really depends on your use case.
This is the safest reaction ;)
With "reporting" I mean passing the error frame to other layers. Your patch prevents this. |
Are you sure that the devices with which you tested actually report these arbitration errors? We tested a PEAK usb to CAN converter for example and this device simply doesn't report arbitration errors at all. Another device we are using (integrated CAN controller in a SOC) does however which is how we found out the problem. |
Hi! I am with MCFurry. Some chips seem to report arbitration errors, while some do not even bother. And the message gets sent anyway (confirmed with an sja1000 chip, which reported Arb Lost while the message could be received on a Peak CAN USB controller). Hence, this error is by no means fatal and there is no need to recover anything. Please accept the PR... |
I understand that you want to prevent the driver from stopping on lost arbitration errors.
However, this patch fixes this in the wrong location.
And in addition I'd like the driver to stop per default with an opt-out. Rationale behind: Keep the old behavior!
So, please add the proper handling in line 205 instead.
Hey @ipa-mdl, I am not sure if I got you right on this. I do not consider the LOSTARB an error, rather an information, since the packet gets sent, just delayed shortly. If the bus gets flooded, this might be a hint, but usually, this means, my message gets delayed by one CAN Message. Most drivers will not even bother to report this. Therefore, the "old" behaviour is, on sja1000 based chips, socketcan_interface is not usable. From my point of view, I would even consider this a bug. I wonder, why CAN_ERR_BUSERROR is not considered an error, although it is used more widely throughout the linux CAN drivers. To make a long story short, I am fine with just patching socketcan_interface, since I install it with a yocto recipe anyways. It is just that other users might want to save the hassle to debug and look through issues and pull requests. Maybe @MCFurry will do the job?! Regards |
I made the error handling of arbitration errors now configurable as a ROS parameter ("lost_arbitration_is_error"). |
We figured a boolean is a bit of a rough parameter. So now you can optionally configure the parameter "lost_arbitration_reporting_level" to have values: |
Since arbitration errors are not fatal, removed them from the mask
…or ignore and throw warning
7f63faf to 6178d0a
@ipa-mdl Are these the changes you aimed for? |
@MCFurry: Thanks for your additions. I was quite busy in the last weeks. I will try to review it tomorrow. |
I am afraid I cannot merge this as-is.
If you want me to, I can give a full review with comments for all issues.
But I don't think that this would help a lot right now.
First of all, I would not mix the error mask feature and the reporting in socketcan_bridge
in one PR. This way we can find a way to handle the arbitration problems you face.
I'd like to fix it the following way:
- introduce an API to set the error mask (=error frame reported to the application/driver)
- introduce an API to set the handling behavior (stopping on critical error)
The API should be:
- backwards-compatible (if possible without ABI breaks)
- properly encapsulated (no public variables!)
- integrated with the canopen packages (later..)
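The two proposed APIs could look roughly like the sketch below. This is only an illustration of the idea: the class `ErrorPolicy`, its setters and the `Action` enum are hypothetical names that do not exist in socketcan_interface. The error class values mirror `<linux/can/error.h>`, and the defaults (report and stop on everything) are an assumption made here in the spirit of "keep the old behavior".

```cpp
#include <cstdint>

// Error class bits, values as in <linux/can/error.h>.
constexpr std::uint32_t kErrTxTimeout = 0x001u;  // CAN_ERR_TX_TIMEOUT
constexpr std::uint32_t kErrLostArb   = 0x002u;  // CAN_ERR_LOSTARB
constexpr std::uint32_t kErrProt      = 0x008u;  // CAN_ERR_PROT
constexpr std::uint32_t kErrBusOff    = 0x040u;  // CAN_ERR_BUSOFF
constexpr std::uint32_t kErrAll       = 0x1FFFFFFFu;  // CAN_ERR_MASK

enum class Action { Ignore, Report, Stop };

// Hypothetical policy object; not part of socketcan_interface.
class ErrorPolicy {
public:
    // Which error classes are reported to the application/driver at all.
    void setErrorMask(std::uint32_t mask) { error_mask_ = mask; }
    // Which of the reported error classes stop the driver.
    void setFatalMask(std::uint32_t mask) { fatal_mask_ = mask; }
    std::uint32_t errorMask() const { return error_mask_; }
    std::uint32_t fatalMask() const { return fatal_mask_; }

    // Decide what to do with the error class bits of one error frame.
    Action classify(std::uint32_t error_bits) const {
        const std::uint32_t reported = error_bits & error_mask_;
        if (reported == 0) return Action::Ignore;
        return (reported & fatal_mask_) != 0 ? Action::Stop : Action::Report;
    }

private:
    // Assumed defaults for this sketch: everything is reported and fatal.
    std::uint32_t error_mask_ = kErrAll;
    std::uint32_t fatal_mask_ = kErrAll;
};
```

An opt-out for lost arbitration would then be a single call such as `policy.setFatalMask(policy.fatalMask() & ~kErrLostArb)`, with both masks kept private as requested.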
Would a mask variable that determined whether to warn or die with error be sufficient? There's the human factors/documentation issue of having a numeric mask instead of named options, but using a mask would be much simpler. I'm not sure if there's a need to change the low-level error mask as long as the error handling behavior in readFrame can be made less aggressive. Also, not just arbitration errors but protocol violations (e.g. error frame due to an EMI transient such as an electromagnetic brake dropping) should be options. For example, I'm working with a system where the ROS instance ideally does not reboot if the vehicle is quickly rekeyed (e.g. off then back on within a certain number of seconds) - but we always get two error frames on the bus shortly after powerup. The current systems on the bus are not negatively affected by this powerup behavior but the ROS node without masking CAN_ERR_PROT goes out to lunch until restart. |
This is more or less what I want at code level. There you can even build the mask out of the kernel constants.
The param could be a numeric or a string list, that's just a matter of the parser :) |
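A string-list parameter could be turned into a mask along these lines. This is a sketch under assumptions: the function name `maskFromNames` is hypothetical, the accepted strings are simply the kernel constant names, and the numeric values are copied from `<linux/can/error.h>` so the block compiles without kernel headers.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Build an error mask from a list of kernel constant names.
// Values as in <linux/can/error.h>; throws on an unknown name.
std::uint32_t maskFromNames(const std::vector<std::string>& names) {
    static const std::map<std::string, std::uint32_t> table = {
        {"CAN_ERR_TX_TIMEOUT", 0x001u}, {"CAN_ERR_LOSTARB", 0x002u},
        {"CAN_ERR_CRTL", 0x004u},       {"CAN_ERR_PROT", 0x008u},
        {"CAN_ERR_TRX", 0x010u},        {"CAN_ERR_ACK", 0x020u},
        {"CAN_ERR_BUSOFF", 0x040u},     {"CAN_ERR_BUSERROR", 0x080u},
        {"CAN_ERR_RESTARTED", 0x100u},
    };
    std::uint32_t mask = 0;
    for (const std::string& name : names) {
        const auto it = table.find(name);
        if (it == table.end()) {
            throw std::invalid_argument("unknown error class: " + name);
        }
        mask |= it->second;
    }
    return mask;
}
```

A numeric parameter would skip the lookup and use the value directly; the string form is mainly easier to read in a launch file.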
In my opinion info and debug are still useful. I disagree with the statement that 'lost arbitration errors are fatal'. Or we misunderstand the arbitration lost concept, but then I'd be interested why it should be a fatal error. |
It might be fatal in some use cases. Do all devices try to send the message again after this error occurred? |
Is wikipedia a trusted source here? :-)
I cannot access the official CANopen specification documents because of a required login. However their can_dictionary document also states:
So they delay their transmission, meaning sending later. Do you have an example or experience of nodes which do not re-transmit? |
@Timple: Thanks for the clarification! I would suggest blacklisting these failures in the handler, so they won't stop the network anymore. More fine-grained control can be added later if needed. |
While |
I think in practice we have to accept that we cannot be that strict. We are dealing with a vehicle from an external supplier. When we communicate on its CAN bus we get both However, the split of code into ROS and ROS agnostic parts make it a bit cumbersome to pass rosparam configurations into the scope of |
I agree with ipa-mdl that a But the arbitration errors are still an issue. Today I was working on a different vehicle / different canbus etc with a canbusload of 31% on average. Now I try to send values over the canbus with a high canid at a 10 Hz rate. Of course after a couple of seconds I get the lost arbitration error again. Statistically this makes sense since if I send the values at random, I have only a 69% chance of the bus being clear for transmission. (If I understand the statistics correctly). And I only have this problem with our (very special I am beginning to think now) can device. Another usb-can dongle works perfectly because it simply does not report the arbitration lost number. @ipa-mdl : Can you verify or acknowledge that you ever had an arbitration lost error which was actually an issue? I kind of get the feeling your devices don't report them. (Either that or I don't understand the statistics). edit: Still looking for time / permission / priority to implement this properly, but I would rather just add the arbitration lost to the mask as initially proposed here if it is a non-issue. |
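The statistical argument in this comment can be made concrete with a small calculation. The sketch assumes the comment's own simplified model: every send attempt independently finds the bus busy with a probability equal to the average bus load. Whether a busy bus actually leads to a reported arbitration loss depends on timing and on the controller, so this is only an upper-level illustration. The function name is hypothetical.

```cpp
#include <cmath>

// Probability that at least one of `attempts` independent send attempts
// finds the bus busy, given the average bus load (0.0 .. 1.0).
double probAnyAttemptFindsBusBusy(double bus_load, int attempts) {
    return 1.0 - std::pow(1.0 - bus_load, attempts);
}
```

With a 31% bus load and a 10 Hz sender, `probAnyAttemptFindsBusBusy(0.31, 10)` is roughly 0.976, so under this model a collision with other traffic within the first second is close to certain.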
So, a year has passed. Which has a very unfortunate naming because it is not an error (see my posts above, it is a matter of statistics if you have multiple unsynchronised publishers). What do we need to do to get this error ignored? And how do we identify people with actual usecases which do require this setting? In my opinion the two slashes in the original commit are all it takes. |
I completely agree with @Timple, |
@hartkopp: Thank you very much for your explanations! Of course, not all error frames are fatal errors in all cases.. That being said, we need to improve the API/config for users, which have to suppress these error frames for any reason. @MCFurry: My review still holds. Instead of these new functions and so many clause etc., I prefer to keep it much simpler:
We can discuss changing the defaults for ROS |
Yes. Both CAN_ERR_LOSTARB and CAN_ERR_BUSERROR might create a remarkable load and therefore should be disabled by the in-kernel filters in normal operation. Filtering these in user space will still create the load to read content from sockets (kernel space -> user space) and check (or log) that information. The only real problem that should be taken care of is CAN_ERR_BUSOFF as it disables the CAN controller until it is either reset by the user or by some auto restart functionality (if enabled at configuration time). |
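The in-kernel filtering mentioned here is set with the `CAN_RAW_ERR_FILTER` socket option of Linux SocketCAN. The following is a minimal sketch, not the code of this PR: the helper `applyErrorFilter` and the constant `kQuietErrorMask` are hypothetical names, and the socket is assumed to be an already opened `CAN_RAW` socket.

```cpp
#include <linux/can.h>
#include <linux/can/error.h>
#include <linux/can/raw.h>
#include <sys/socket.h>

// Error classes to receive: everything except lost arbitration and bus
// errors, so CAN_ERR_BUSOFF is still delivered.
constexpr can_err_mask_t kQuietErrorMask =
    CAN_ERR_MASK & ~(CAN_ERR_LOSTARB | CAN_ERR_BUSERROR);

// Install the error filter on an already opened CAN_RAW socket.
// Returns true on success; on failure errno is set by setsockopt().
bool applyErrorFilter(int can_raw_fd, can_err_mask_t mask) {
    return setsockopt(can_raw_fd, SOL_CAN_RAW, CAN_RAW_ERR_FILTER,
                      &mask, sizeof(mask)) == 0;
}
```

Error frames whose class is not in the mask are dropped by the kernel and never reach user space, which avoids the read-and-check load described above.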
This does not matter if the socket gets closed after the first error frame anyway ;)
I see, so we should not allow to disable it. |
Argh. That was not the plan when implementing the mask. |
It was ironic, but not in this sense. |
@ipa-mdl: thanks for the work on #362 However I'm still not convinced for the usecase to close a socket connection on an arbitration error. How can this be the most robust option? What would happen if you kept listening? |
As I said, we can come up with a better approach.
Stopping for all errors makes the hardware wiring more robust ;)
For the lost arbitration error I don't know. But I can tell you what would happen in case of TX overflow: If you want to do some synchronous motion in a highly constrained environment, then you just don't want to lose control. |
I would rephrase it: Stopping for relevant errors may help the user to identify potential hardware-related problems.
I know! Losing arbitration is normal in CAN operations. Is there anything left to be clarified?
Which means what? Does it mean when people use a different CAN hardware they might get a different behaviour in normal CAN operation as you treat an "arbitration lost notification" as an error and stop operation? |
I fully agree, that is why a fork was needed or rather a patch. Even when CAN_ERR_LOSTARB is signalled, the message gets sent on the bus. This is purely information and closing the socket on this is clearly wrong. As much as I agree on "fail early, fail hard" to make mistakes visible: This does clearly not fall into this domain. Instead of wasting time to improve upstream (which is not appreciated, that is my feeling), it is sad but one really has to consider forking here. Regards. |
Argh ... I would prefer convincing @ipa-mdl by technical means to follow the technical correct approach. Hint hint ;-) |
I usually tend to treat all warnings as error unless there is a good reason to suppress them. For the arbitration lost warning, it seems to make sense to suppress this by default. In general, ROS
@hartkopp: Or the other way around: Which "errors" are purely informational, will not affect the control loop and can never be the result of hardware/wiring problems? |
@DasRoteSkelett, @Timple: It would be great if you could help @MCFurry and me to finalize #362. |
CAN_ERR_LOSTARB The CAN is really robust and even when some more serious errors show up it takes an amount of consecutive errors until the CAN controller gets into a BUS_OFF state and terminates operation. When "fail early and hard" is your paradigm to detect problems and leads to a termination of your application, this is your choice. When I would run this in a (remote) production system I would probably write more serious errors to a logfile - but continue operation unless running into BUS_OFF. But that would be my choice ... |
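The "amount of consecutive errors" before BUS_OFF follows from the CAN fault confinement counters. Below is a simplified sketch of the transmit error counter (TEC) only: a failed transmission adds 8, the node becomes error passive once the counter exceeds 127 and goes bus-off once it exceeds 255. Decrements on successful transmissions and the special cases of the standard are left out, and the function name is hypothetical. Losing arbitration does not change the counter at all.

```cpp
// Simplified transmit error counter (TEC) model of CAN fault confinement.
constexpr int kTecStepOnTxError = 8;
constexpr int kErrorPassiveLimit = 127;  // error passive when TEC > 127
constexpr int kBusOffLimit = 255;        // bus-off when TEC > 255

// Number of consecutive failed transmissions needed to push the TEC
// above `limit`, starting from `start_tec`.
int consecutiveTxErrorsToExceed(int limit, int start_tec = 0) {
    int tec = start_tec;
    int errors = 0;
    while (tec <= limit) {
        tec += kTecStepOnTxError;
        ++errors;
    }
    return errors;
}
```

Starting from zero, it takes 16 consecutive transmit errors to become error passive and 32 to reach bus-off, which is why a single error frame says little about the health of the bus.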
If the general consensus is now to suppress arbitration errors by default, can the first commit of this PR still be accepted? #362 would be a new feature where users have the ability to ignore error frames as desired. I can see if I can make some time available to look into #362, won't be on short notice though. |
@hartkopp: Thanks! I see. I am okay with suppressing "lost arbitration" per default in #362. One example is the Schunk SDH, which uses a strict polling approach.
No, not without an opt-out, i.e. only as part of #362 |
Just for my personal curiosity:
You can only poll on the receiving side, right? If so you will never see a lost arbitration as this is only visible in the sending CAN controller that had the arbitration lost effect.
Why? When sending the same CAN ID from two nodes (which is an illegal thing from CAN perspective) you will not get an arbitration lost state. Both CAN controllers GET the arbitration and start to transmit their data content. This finally leads to an error frame (when the CAN data content is different). |
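The distinction drawn in this comment can be summarised in a small decision sketch. It is a deliberately coarse model with hypothetical names (`Frame`, `Outcome`, `simultaneousStart`): it compares only the identifier and a single data byte and ignores DLC, RTR and bit stuffing.

```cpp
#include <cstdint>

enum class Outcome { ArbitrationLost, ErrorFrame, Undetected };

struct Frame {
    std::uint16_t id;   // 11-bit identifier
    std::uint8_t data;  // a single data byte is enough for the sketch
};

// What happens when two nodes start transmitting `a` and `b` at the same
// instant on an otherwise idle bus.
Outcome simultaneousStart(const Frame& a, const Frame& b) {
    // Different IDs: the higher ID backs off and retries, no error occurs.
    if (a.id != b.id) return Outcome::ArbitrationLost;
    // Same ID: both win arbitration; differing data makes one sender read
    // back a bit it did not write, which ends in an error frame.
    if (a.data != b.data) return Outcome::ErrorFrame;
    // Same ID and same data: the bits are identical, nobody notices.
    return Outcome::Undetected;
}
```

Only the first case corresponds to CAN_ERR_LOSTARB; the duplicate-ID case shows up as an error frame instead.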
Just to keep track, since my last count another fork appeared with a fix |
Hi! @Timple I really have respect for your endurance in this topic. @ipa-mdl is not willing to be open to technical arguments though. Since even looking at the kernel code of the driver and knowing that the data still gets sent OK on a CAN_ERR_LOSTARB error does not convince him, it is time to fork or patch later, as others have noticed. Regards, Matthias |
Of course we're operating from our fork, none of our robots would last long with the socketcan interface halting on arbitration "errors". The reason I'm making this an effort is two-fold:
But most importantly:
|
@Timple: Sry, I did not have much time to work on this topic.. My plan is to have a solution ready (merged..) until World ROS-Industrial day (July 7) :) @DasRoteSkelett: I am convinced that the driver should not stop on |
Great! Having a deadline seems promising 🙂
This means the error mask should be configurable right? So people can subscribe to arbitration errors if they desire so. Always forwarding to the error listener "might create a remarkable load and therefore should be disabled by the in-kernel filters in normal operation" |
@ipa-mdl: A subtle ping since the deadline is three working days from now. |
superseded by #362 |