Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

torque timeout while running wholeBodyDynamicsDevice throught yarprobotinterface #449

Closed
traversaro opened this issue Apr 20, 2017 · 16 comments

Comments

@traversaro
Copy link
Member

@traversaro commented on Tue Feb 28 2017

Today on iCubGenova02 we experienced the following failure:

12976	4937,704660	INFO	icub/cartesianController/right_arm: Checking if cartesian solver cartesianSolver/right_arm is alive... not yet
12977	4938,655083	INFO	icub/cartesianController/left_arm: Checking if cartesian solver cartesianSolver/left_arm is alive... not yet
12978	4938,715888	INFO	icub/cartesianController/right_arm: Checking if cartesian solver cartesianSolver/right_arm is alive... not yet
12979	4939,676715	INFO	icub/cartesianController/left_arm: Checking if cartesian solver cartesianSolver/left_arm is alive... not yet
12980	4939,737185	INFO	icub/cartesianController/right_arm: Checking if cartesian solver cartesianSolver/right_arm is alive... not yet
12981	4940,433463	ERROR	 from BOARD 10.0.1.5 (EB5), src LOCAL, adr 0, time 5096s 254m 359u: (code 0x02000007, par16 0x0000 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12982	4940,433529	ERROR	 from BOARD 10.0.1.5 (EB5), src LOCAL, adr 0, time 5096s 254m 503u: (code 0x02000007, par16 0x0001 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12983	4940,433550	ERROR	 from BOARD 10.0.1.5 (EB5), src LOCAL, adr 0, time 5096s 254m 637u: (code 0x02000007, par16 0x0002 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12984	4940,433567	ERROR	 from BOARD 10.0.1.9 (EB9), src LOCAL, adr 0, time 5096s 271m 344u: (code 0x02000007, par16 0x0000 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12985	4940,433583	ERROR	 from BOARD 10.0.1.9 (EB9), src LOCAL, adr 0, time 5096s 271m 493u: (code 0x02000007, par16 0x0001 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12986	4940,433605	ERROR	 from BOARD 10.0.1.7 (EB7), src LOCAL, adr 0, time 5096s 232m 116u: (code 0x02000007, par16 0x0000 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12987	4940,433623	ERROR	 from BOARD 10.0.1.7 (EB7), src LOCAL, adr 0, time 5096s 232m 265u: (code 0x02000007, par16 0x0001 par64 0x0000000000000000) -> MC: torque sensor timeout. The joint is in a compliant interaction mode or torque control mode, and data from torque sensor have been unavailable for more than 100 ms. par16 = ID of joint. + .
12988	4940,433639	WARNING	 from BOARD 10.0.1.5 (EB5), src LOCAL, adr 0, time 5096s 254m 949u: (code 0x0000000d, par16 0x018c par64 0x0004001f00350233) -> SYS: the TX phase of the control loop has last more than wanted. In par16: TX execution time [usec]. In par64: num of tx can2 and can1 frames and latest previous execution times of TX, RX, DO [usec] + .
12989	4940,686341	INFO	icub/cartesianController/left_arm: Checking if cartesian solver cartesianSolver/left_arm is alive... not yet
12990	4940,747019	INFO	icub/cartesianController/right_arm: Checking if cartesian solver cartesianSolver/right_arm is alive... not yet
12991	4941,708888	INFO	icub/cartesianController/left_arm: Checking if cartesian solver cartesianSolver/left_arm is alive... not yet
12992	4941,769570	INFO	icub/cartesianController/right_arm: Checking if cartesian solver cartesianSolver/right_arm is alive... not yet


@traversaro commented on Thu Apr 20 2017

We had a discussion with @valegagge back in march. She told me that they in the past they also experienced problems in the communication between the PC104 and the EMS, so the problem could also be related to that.


@DanielePucci commented on Thu Apr 20 2017

This problem is happening again: @pi-q @gabrielenava @S-Dafarra are going to provide support to @valegagge to care of this problem (e.g. post the logs of the recent occurrences)


@traversaro commented on Thu Apr 20 2017

As this is probably a system problem, related to the interaction of the wholeBodyDynamicsDevice with the rest of the iCub ETH interface, I think it make more sense to move this issue to icub-support.

@traversaro
Copy link
Member Author

traversaro commented Apr 20, 2017

To recap the nominal behaviour of the torque estimation infrastructure.

Every 10 ms, the run method of the WholeBodyDynamicsDevice should run, reading the measurements and publishing the torque estimates calling the UpdateMeasure method of the IVirtualAnalogSensor interface.
With the kind help of the virtualAnalogRemapper device, this call is redirected to an updateMeasure of the embObjMotionControl device.
The embObjMotionControl then sends the estimated torque value to the EMS board. On the EMS board firmware, a watchdog ensures that a torque estimate is available at least every 100 ms if the joint level controller uses it (i.e. if the joint is in Torque control mode or in compliant interaction mode). If the estimation is not available, then the joint is set in hardware fault control mode.

@marcoaccame
Copy link

Here are some comments:

  1. about:
    @traversaro commented on Thu Apr 20 2017:
    We had a discussion with @valegagge back in march. She told me that they in the past `
    they also experienced problems in the communication between the PC104 and the EMS,
    so the problem could also be related to that.

    I can say:
    I recall that there were no communication problems in the meaning of packet losses but rather that the relevant method of embObjMotionControl was not regularly called to transmit the UDP message.
    I recall @barbalberto putting some diagnostic messages on the PC104 which showed the above.
    The reason of that I don't remember. Maybe some network latency in the gigabit connection or scheduling issues of Linux.

  2. Í don't think there is a problem below the PC104 because there are diagnostics messages which tell if a UDP packet is lost and I don't see any of such messages. They can be either of packets simply being lost (typically dropped by IP stack in Linux) or of ETH connection being down'and up again (typically because of an electric problem).

  3. From past experience, it is more probable that the message is delayed inside Linux.

Recommendation:

I would start by adding a logger in embObjMotionControl:: updateMeasure() which measures the statistics of the actual time gap between successive calls for the same joint. Or more simply: print a yWarning() if the time gap is more than a threshold (say 90 ms).

Then we can analyse these stats and attempt to make a fine tuning where required or investigate deeper if the measured time gap is always 10 ms and there are torque sensor timeout messages.

@traversaro
Copy link
Member Author

traversaro commented Apr 21, 2017

I would start by adding a logger in embObjMotionControl:: updateMeasure() which measures the statistics of the actual time gap between successive calls for the same joint. Or more simply: print a yWarning() if the time gap is more than a threshold (say 90 ms).

That seems to be relatively easy, as just involves some changes on the pc104/icub-head side. I think it makes sense to do it an bit of a systematic way, so we can leave it enabled and get feedback from the system whenever there are scheduling problems. If we are able to encapsulate this nice functionality in class, I would add the watchdog also to the thread that calls updateMeasure, i.e. the WholeBodyDynamicsDevice thread.

@marcoaccame
Copy link

I am thinking of a class (or a set of classes) which produces a histogram for a generic measure. This class will be efficiently written and be able to compile and run on YARP but also on the EMS / MC4PLUS etc.

In this way we can measure statistics of the frequency that we expect to be 10 ms in several places of the communication chain.

@traversaro
Copy link
Member Author

Given that the problem seems to be related to thread scheduling, I would also check the processes that run on the icub-head when performing the YOGA++ demo. It is possible that for some reason a process is taking 100 % of the CPU, and this could cause scheduling problem.

Slightly unrelated, but we should also make sure that this important threads run with high priority w.r.t. less time-critical threads, especially when running in Linux with the RT_PREEMPT patch enabled.

@marcoaccame
Copy link

marcoaccame commented Apr 27, 2017

Hi, I have created a new branch https://github.com/robotology/icub-main/tree/checkFreqOfTRQmeasures where I have modified embObjMotionControl:: updateMeasure() to keep tracks of all its calls for every joint.

Every time the function is called its time is logged, the delay between calls for the same joint is computed and its histogram is produced.

If the delay is too high, say > 60 ms, a yWarning() is instantly triggered.

Moreover, every 60 seconds the full probability density function (PDF) of the random variable delta(j) is printed.

The variable delta(j) is the time between two consecutive calls for updateMeasure() for the joint j quantized with a staircase transfer function:


OUT [ms]

125 |                                         
    |
    |
    |   
 25 |           ___ ......
 15 |       ___|
  5 |   ___|
  0 |__|_____________________________________    IN [ms]
       5   15  25     ...... 125     

In ideal case of 0 jitter, the PDF is an impulse centered in [5, 15), but if there are jitters we can have more pulses. As an example, in case of presence of jitters of 10 ms we expect to see pulses also in [0, 5) and [15, 25) because sometimes there are two consecutive calls spaced by 20 ms.

Starting from tomorrow we can test the new code and then decide if to keep it always online by merging the branch into devel.

@marcoaccame
Copy link

A quick update:
Last week we run some tests.
We saw that the PDF is much spread beyond the expected bin of [5, 15] and that sometimes there are values up to [85, 95] even if with very small probability.

Hence, for long simulations I think that there is possibility that delay gets > 100 ms

There is some activity now, carried out by @diegoferigo, to change the scheduling priority of threads which are involved in the processing chain.

@DanielePucci
Copy link

This issue occurred also on iCubGenova04 and iCubNancy01 (see #454), so it seems definitely to confirm what we diagnosed so far (e.g. platform independency).

@traversaro
Copy link
Member Author

traversaro commented May 15, 2017

To clarify: it is still under investigation if the issues affecting iCubGenova04 (with wholebodynamics running inside the yarprobotinterface) and iCubNancy01 (with the legacy wholeBodyDynamics running outside the robot) have actually the same root.

@DanielePucci
Copy link

DanielePucci commented May 15, 2017

@marcoaccame and @valegagge increased the watchdog to 150ms as a temporary workaround to decrease demo failures. Today, we have been running (almost the whole day) demos, and this issue did not occur.

Can you confirm?

@gabrielenava @S-Dafarra @pi-q

P.S.
As a general advice, I kindly invite you to report on issues any update. This is very, but very important. For instance, it seems to me that no report on Round Robin scheduling policy was made @diegoferigo , nor the update of the watchdog was reported @S-Dafarra @pi-q @gabrielenava

@diegoferigo
Copy link
Member

diegoferigo commented May 16, 2017

@DanielePucci The value I changed in Round Robin scheduling priority was chosen very randomly and it didn't solve the problem. After chatting with @marcoaccame, I understood better how is handled the (optional) support of the scheduler priority in yarp (e.g. yarprobotinterface). However, the right priority of the wholeBodyDynamicsDevice thread is still under investigation. I'll keep this issue updated.

@S-Dafarra
Copy link

#449 (comment) I can confirm. No fallings due to HW_FAULTS on all joints.

@diegoferigo
Copy link
Member

A quick update on the issue.

The method yarp::dev::WholeBodyDynamicsDevice::run(), on average, currently takes about 15ms on the iCubGenova04, when the rate of the thread should be 10ms. The culprit is the WholeBodyDynamicsDevice::readSensors() methods, and particularly the IMU section. As described in the PR robotology/yarp#1229, there is a mutex issue in the IMU's class that causes that delay.

Despite this behavior, it does not fully explain the occurrence of the hardware fault, that happens when the delay of the torque estimation is greater than 100ms. It is still not clear if the IMU delay also blocks the torque measurements acquisition. In the next days @gabrielenava will test the yoga demo, and we'll use the yarp branch with the IMU fix.

@serena-ivaldi my last comment does not work for your case. I was informed that iCubNancy01 doesn't have the Bosch IMU. So, two options: 1) This error is caused by something else, or 2) your issue is not related to our. I suggest to profile the time spent by all the functions inside WholeBodyDynamicsDevice::run() (check these new classes robotology/yarp#1230), I'd be curious to see if the thread's rate is exactly what is supposed to be in your case.

@serena-ivaldi
Copy link
Member

@olivierrochel-inria can you help on this? I will be away the next week..

@traversaro
Copy link
Member Author

robotology/yarp#1231 should have fixed this issue.
@serena-ivaldi given that your robot does not use a BoschIMU, I don't think you are affected by this issue.

@diegoferigo
Copy link
Member

I close this issue since we tracked down and fixed the problem.

@serena-ivaldi Since your issue has a different origin, when you have more information please put them in #454.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants