
[JTC] conceptual issues #697

Open
gwalck opened this issue Jul 13, 2023 · 12 comments

@gwalck
Contributor

gwalck commented Jul 13, 2023

While implementing #696 I faced numerous issues that led me to dig deep into the implementation of JTC to understand why some "intuitive", or let's say theoretical, expectations I had for a JTC were not met. I did not want to create tons of separate issues, as they are mostly linked together.

Disclaimer:

My theoretical knowledge of controllers might be rusty, so please just point out my mistakes, if any, and close this issue if it is not one.

Problems

dt in updateController

traj sampling

  • A JTC update() called exactly at the time_from_start of a trajectory point generates a desired output that matches that point.
  • However, in the tests evaluating the reached state, since no further call to update() is done, the hardware state (or state_msg.feedback) that gets evaluated is ALWAYS the last state read from hardware, i.e., the state before the last update() call (even with mirrored cmd/state ifs), so it is never the desired state at evaluation time.
  • The current tests introduce allowed deltas to compensate for this, and those deltas depend heavily on how tiny the delay (dt) between two iterations of the update loop (in updateController) is, hence the variability in CI. Currently dt is as small as a few microseconds, which is unrealistic for a real robot and varies with CPU load.
  • I tried to remove the need for a clock and use a larger dt, and this really exposed the problem. See below, with dt = 50 ms and a test trajectory with a point to be reached (and evaluated) at 250 ms.
Command interfaces are [position velocity] and state interfaces are [position velocity].

current pos 1.100000              -> call 1   at time 0+ (+ means some EPS due to where now is placed) for time range 0 to 0.05
current vel   0.000000              ->    current velocity not yet changed, this is normal
desired pos 1.100000              -> no desired position  yet  -> THIS seems wrong
desired vel  0.000000              -> same 

current pos 1.100000              -> call 2   at time ~0.05+  for time range 0.05 to 0.1
current vel   0.000000
desired pos 1.090396              -> first desired movement !
desired vel  -0.356063             -> first desired velocity

current pos 1.090396              -> call 3  at time ~0.1+ for time range 0.1 to 0.15
current vel -0.356063               -> current vel (and pos) mirrored from previous desired at call 2
desired pos 1.067190              
desired vel -0.544039

current pos 1.067190              -> call 4  at time ~0.15+ for time range 0.15 to 0.20
current vel  -0.544039             -> current vel (and pos) mirrored from previous desired at call 3
desired pos 1.038749
desired vel  -0.563885

current pos 1.038749            -> call 5   at time ~0.20+ for time range 0.20 to 0.25
current vel  -0.563885           -> current vel (and pos) mirrored from previous desired at call 4
desired pos 1.013536
desired vel  -0.415289
                                                 -> no next call as 0.20+ +dt > wait_time

 rolling-ctrl-ws/src/ros2_controllers/joint_trajectory_controller/test/test_trajectory_controller.cpp:1098: Failure
1: The difference between 1.0 and joint_pos_[0] is 0.013536343640363491, which exceeds COMMON_THRESHOLD,
  • One sees that the test compares against the desired state, not the current state. Correct. My implementation of the dt loop and the clock.now() < end_time condition led to no call 6 at 0.25.

  • With a call 6 at time 0.25, joint_pos_[0] will be correct and match 1.0.

  • BUT only the previous state from call 5 will be read from hardware, and it does not match the 1.0 expected at time 0.250. So the JTC, even with a perfect robot, did not manage to reach the position on time.

  • Calling update() once more at the exact last point (call 6) led to other issues, probably due to hold position nullifying effort interfaces, etc.; I did not dig further yet.

  • The first update of the controller after it was activated does not generate any command; this is also weird.

  • The desired output at call 4 should already be the point to be reached, so that at the exact time_from_start, with an ideal robot (mirrored cmd/state ifs), the state is reached.

How to solve:
The test should work for any dt, so I dug deeper and found that the JTC in ros_controllers samples the trajectory spline for the desired command at time + period, whereas the ROS 2 implementation samples at the current time. This means that in the ROS 2 JTC the desired state is not a planned-ahead command but the command that should have occurred at the time of the update (so already too late). I think this is theoretically wrong and leads to all sorts of checks requiring allowed deltas, where the delta on a mirrored robot should at most come from floating-point approximation errors.

I think the correct way is to sample at time T + dt (one period) to always plan one step ahead. Early tests showed good results, but this is not yet fully tested with reduced allowed deltas, and there may be other consequences of planning ahead.

Tolerance checks

  • The current tolerance-check implementation in the ROS 2 JTC reads the current state, samples the expected state on the spline at time T, and compares the two to get the tracking error between reality and what is expected at time T. This is correct. However, the JTC in ros_controllers compares the current state with a future state (at T + dt), and this seems wrong: it will always produce a tracking or goal error, no matter how perfect a robot (mirrored cmd/state ifs) there is.
  • If the sampling at T + dt described above is implemented in the ROS 2 JTC, we end up in the same situation as in the ROS 1 JTC.

I suppose a solution is to mix the two (ROS) worlds: sample at T for the tracking error, and sample at T + dt for the desired state sent to the output (and compute the planned-ahead error to also generate a command in the PID adapters).

state_msg used to evaluate internal state of JTC

  • When trying to remove the clocks and use only a loop with T + dt in the tests, state_msgs started to get lost (not transmitted) due to update() being called too fast and the publisher's try_lock failing.
  • This leads to crashes (due to unprotected state_msg access that could be null), and state_msg is also not necessarily the last state, no matter how much waiting and spinning happens AFTER updateController, because the problem is inside updateController.
  • Adding spin_some in updateController, etc., does not solve the problem; only very, very tiny dt values in updateController might, by luck, yield a state_msg close enough to the goal/expected values (because with so many calls to update, one of the state_msgs will go through).
  • Only one test (the state_consistency test) really tests that the state_msg has expected values; other tests seem to use it to read the internal state/last state.
  • We have seen above that the last state is already wrong due to the sampling issue, but it also might not even be the last state.

@destogl suggested using FRIEND_TEST to access the protected variables and evaluating the expected values through those.
That leaves the question of what to do with the state-consistency check. Denis suggested using a try_lock and unlock to check whether the publisher is ready before the next call to update(), with a thread sleep in between. This might work (first trials look promising).

time from start of state_current

  • Why is the time set to "almost" zero there, and not zero?
  • Indeed, I think set__sec(0) does not reset the nanosecond part.

Conclusion

I think JTC and its tests can be improved, especially the speed and stability of the tests in CI.
I know the code base well now and have done many trials; I am ready to create separate issues and corresponding PRs once the problems are confirmed and the directions to go are known.

@gwalck gwalck added the bug label Jul 13, 2023
@destogl
Member

destogl commented Jul 14, 2023

Sounds reasonable to me. Let's start with the simple fixes, e.g., set_nsec, and then move to the more complex ones.

Many things are much easier to understand if you see them in the code and their consequences in the tests.

@gwalck
Contributor Author

gwalck commented Jul 14, 2023

Okay for the set_nanosec fix, and maybe for getting rid of the state_msgs in the tests where possible.
The rest has more consequences; I would prefer to get more control experts' feedback on it first (to validate the theory).

@christophfroehlich
Contributor

Thanks for the detailed and verbose description.
I totally agree with the state_msg and state_current issues.

The first update of the controller after it was activated does not generate any command, this is also weird.

You mean after activation of the controller? This is a known issue, see the start_with_holding parameter from #558. Currently, there won't be any output after the first trajectory is sent.

My thoughts about trajectory sampling:

  • For the tolerance checks you are right; one has to use the current time to evaluate them, for sure.
  • For sampling the next desired values for the feedforward part, one might sample at the next time instant. But I tend to say one has to use the current time instant for calculating the errors for feedback commands. I'll think about it again and discuss it with colleagues tomorrow.

@gwalck
Contributor Author

gwalck commented Jul 18, 2023

@christophfroehlich thanks for looking at my description.

I used the wrong term; I did not mean controller activation (I did not look at the states and outputs at activation, I looked during the update call, logging within update()). I mean during the first update after the goal was already processed, so when there is a valid trajectory, but it generated no motion (see the example). I was expecting a desired state in the future, exactly like sampling at time + dt would give.

For the sampling:
In theory, suppose the closed-loop PID were outside the JTC (hopefully in the future), for example through chaining with a PID_Controller. The PID_Controller will necessarily receive the current state at time T from its state_ifs, and the command from the JTC on its command_ifs.
Let's look at the two possible choices:

  • JTC sends the desired state at T + dt (JTC plans ahead). The following PID controller will receive a point to execute and reach at T + dt in the future, and will then have an error with respect to the current state. It can try to compensate for that tracking error over time (a P cannot, but a PI can reduce the tracking error).
    With a good PI the error can be very small, as it can on a mirroring robot like in the tests (a perfect PID controller will compensate for the error immediately and output exactly the desired state for the next value of the state_ifs).
    In the ideal case (mirror or well-tuned PID), I expect the JTC to never have a tracking error, and this will be the case if we compare against the spline at the current time.

  • JTC sends the desired state at T (like now). If that point is used for the internal PID error, or is sent to an external PID, both PIDs will see a zero error in the case of a perfect robot, meaning no motion for one time step. However, at the next iteration, since there was no plan-ahead and no error, no motion will have occurred, hence creating a tracking error. If the PID compensates perfectly for it, there will again be a zero error, hence generating no motion, and so on, oscillating between no error and an error.

So in the case of a perfect robot, I think one should plan ahead one step AND use that planned-ahead value to compute the PID error, as if it were external.

A perfect robot should be able to reach the goal in time and have zero tracking error.

@christophfroehlich
Contributor

  • only one test (state_consistency test) really tests the state_msg has expected values, other test seem to use it to read the internal state/last state.

That is not true, btw: waitAndCompareState uses the state_msg and is used throughout test_trajectory_controller.cpp.

@gwalck
Contributor Author

gwalck commented Jul 18, 2023

  • only one test (state_consistency test) really tests the state_msg has expected values, other test seem to use it to read the internal state/last state.

That is not true btw, waitAndCompareState uses the state_msg and is used throughout test_trajectory_controller.cpp

Hm. My point is not where it is currently used, but for what purpose it is used in the tests.
For me, the state_msg should be tested for its consistency, and there is a test for that; all other tests checking expected values of internal state should do so on those internal values directly. The state_msg is really just for the user, not for control.
Hence my thinking that one could replace the state_msg with internal access (via FRIEND_TEST) in all but one test.

@gwalck
Contributor Author

gwalck commented Jul 18, 2023

I did it partially already; I just need more time, which I won't have before tonight at best.

@christophfroehlich
Contributor

I'm sorry, I haven't read your statement properly. I agree 🙈

@christophfroehlich
Contributor

I discussed this with some colleagues today, and the common opinion is: it depends 😆

  • Using a time-discrete controller, you are already losing the assumption of an "ideal robot".
  • The most important thing is that the measurements are taken at the same time. With different sensor types/modalities this is usually not the case. Typically you lose this where you use position encoders, where the velocity signal is a discrete difference quotient -> you are already half a time step late.
  • If the tests (and hopefully the controller itself) are deterministic, one can test the goal at the correct time, no matter whether it is time or time + dt.
  • Shifting the desired trajectory should not introduce any instability to the system if applied correctly. It just starts earlier.
  • If the tolerances are harmed within one update interval, one has a different problem and should increase controller update rate.
  • Your statement that a P controller cannot compensate a tracking error is not 100% correct if your system has integrating behavior, i.e., velocity command input and position feedback.

Summarized: It is a design decision, and both methods are fine for me if implemented consistently.

@gwalck
Contributor Author

gwalck commented Jul 18, 2023

@christophfroehlich Thanks for the time spent investigating and discussing. You are right, I oversimplified the tracking-error problem and the P/PI behavior down to the typical academic example I remembered. I could have spared myself some ridicule by not talking about PID possibilities, as it does not help my point. I mostly wanted to say that a PID that can compensate for the tracking error will result in state_if = sampled point at the time of reading the state_if, hence also a zero tracking error, even though it received a reference sampled at time + dt.

Your explanation of the real-life scenario is known to me; there are delays everywhere. I think in real life there will be few consequences if dt is small indeed. I would like to focus on one of your remarks, which is actually what led me here.

If the tolerances are harmed within one update interval, one has a different problem and should increase controller update rate.

The whole problem is there. With a sufficiently small dt, CI will pass all the tests. But dt had to be 1e-6 to pass the allowed delta of 0.0011. I tried a realistic dt, for instance 1e-3, and tests would fail.

Digging deeper with a LARGER dt, to see things happen without overly long prints, I obviously uncovered the problem that the controller's "final desired point = target point" is sent at the final time and hence never reached in time, even in the ideal case.

Now I think that this design decision will require unrealistically small dt updates to reach goals within the allowed deltas.
With a known design decision, the question is what is expected (in the tests), and one could change the tests to expect the "desired" state at time T instead of expecting the current state to reach the final point.

This is where consistency is broken between the implementation of the JTC and its tests.

I really hope to then provide a PR with modified tests (simpler than risking a change to the design decision) that pass with a realistic dt (and should then also work for unrealistically large dt, because everything is perfect in the test).

@destogl
Member

destogl commented Jul 19, 2023

After your discussion here, I am only more convinced that we need T + dt. It was like this in ros_control and everybody was happy. Currently, we are trying to work around something that complicates so many things. It is obvious that T + dt will simplify many things, and I am quite certain that some strange issues we are currently encountering with industrial robots may be caused by the current way of calculating.

It would be good to have a PR with "T+dt" implementation, so one can actually test this on HW.

@bmagyar
Member

bmagyar commented Aug 23, 2023

I've moved the comment from @zengxuhe and the reaction by @gwalck to a new issue and deleted from here.
