Multiple TF publishers can clear TF buffer in rospy kinetic #341

Closed
redstorm45 opened this issue Nov 13, 2018 · 7 comments

@redstorm45

I recently had a problem with the TF buffer being cleared, with a message similar to "Detected jump back in time of 0.000130s. Clearing TF buffer.", which caused all of the nodes relying on TF information to break.
Looking at the code, it seems pull request #329 is at fault, together with its Kinetic backport in #336.

To confirm the issue, I tested with a simple package containing multiple TF broadcasters and a single TF listener, written in Python.
The package is available here: https://github.com/redstorm45/poc-tf-kinetic
I was able to reproduce the behavior there, but testing on Melodic showed no error.
I suspect a change in the way subscribers are handled between the two versions.
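For reference, the test has roughly this shape (a sketch of the idea, not the package's exact code; the frame names and rate are made up):

```python
#!/usr/bin/env python
# Launch several instances of this broadcaster plus one stock listener
# node; on kinetic the listener intermittently warns "Detected jump back
# in time ... Clearing TF buffer." and drops all transforms.
import rospy
import tf2_ros
from geometry_msgs.msg import TransformStamped

def run_broadcaster():
    rospy.init_node("tf_broadcaster", anonymous=True)
    br = tf2_ros.TransformBroadcaster()
    t = TransformStamped()
    t.header.frame_id = "world"
    t.child_frame_id = rospy.get_param("~child_frame", "child")
    t.transform.rotation.w = 1.0  # identity transform
    rate = rospy.Rate(100)
    while not rospy.is_shutdown():
        t.header.stamp = rospy.Time.now()
        br.sendTransform(t)
        rate.sleep()

# The listener side is just the standard tf2_ros listener:
#     buf = tf2_ros.Buffer()
#     tf2_ros.TransformListener(buf)
#     rospy.spin()

if __name__ == "__main__":
    run_broadcaster()
```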

I was able to stop the problem from happening by adding a lock around the whole check_for_reset function, which currently reads:

```python
def check_for_reset(self):
    now = rospy.Time.now()
    if now < self.last_update:
        rospy.logwarn("Detected jump back in time of %fs. Clearing TF buffer." % (self.last_update - now).to_sec())
        self.buffer.clear()
    self.last_update = now
```

Therefore, I think the last_update time is updated by one thread while another thread is checking against it, causing a buffer reset that is not actually needed.
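The workaround looked roughly like this (a sketch only; the lock is a new attribute I added, and the class is a stand-in for the listener object that owns last_update):

```python
import threading

import rospy

class ResetChecker(object):
    # Illustrative stand-in; in tf2_ros the buffer and last_update
    # live on the TransformListener itself.
    def __init__(self, buffer):
        self.buffer = buffer
        self.last_update = rospy.Time.now()
        self._lock = threading.Lock()

    def check_for_reset(self):
        # Serialize the read-compare-write on last_update so that two
        # concurrent subscriber callbacks cannot interleave.
        with self._lock:
            now = rospy.Time.now()
            if now < self.last_update:
                rospy.logwarn("Detected jump back in time of %fs. Clearing TF buffer." % (self.last_update - now).to_sec())
                self.buffer.clear()
            self.last_update = now
```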

I don't think this is the right solution, but the bug needs to be addressed in Kinetic.

@tfoote tfoote added the bug label Nov 14, 2018
@tfoote
Member

tfoote commented Nov 14, 2018

Thanks for the instructions on how to reproduce it and for debugging into it. The fact that a lock protecting the now check resolves it does suggest a race condition. I'm surprised that it's being hit here, but if there are two callbacks in flight and the scheduler context-switches in the middle of this method, it could indeed trigger the reset inappropriately.
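Concretely, the bad interleaving can be replayed deterministically (plain Python with hard-coded timestamps, not tf2_ros code):

```python
class Listener(object):
    def __init__(self):
        self.last_update = 0.0

    def check_for_reset(self, now):
        # Unprotected read-compare-write, as in the current code.
        if now < self.last_update:
            print("Detected jump back in time of %fs. Clearing TF buffer."
                  % (self.last_update - now))
        self.last_update = now

listener = Listener()

# Callback A samples the clock at t = 1.000000 s, then is preempted
# before it reaches the comparison.
now_a = 1.000000
# Callback B runs to completion with a slightly later sample and
# advances last_update past A's sample.
listener.check_for_reset(1.000130)
# Callback A resumes: its stale sample is now behind last_update, so it
# reports a 0.000130 s jump back in time and clears the buffer.
listener.check_for_reset(now_a)
```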

@brinij

brinij commented Nov 15, 2018

I have the same bug in Indigo...

@Achllle

Achllle commented Nov 16, 2018

It looks like apt-get only holds the most recent and broken version of tf2. Is there any way to fix this without installing from source?

tfoote added a commit that referenced this issue Nov 16, 2018
Fixes #341

This could incorrectly trigger a buffer clear if two concurrent callbacks were invoked.
@tfoote
Member

tfoote commented Nov 16, 2018

This affects all versions. There's no older version in the binaries to roll back to. I'll roll a new release shortly. PRs are in for both active development branches: #344, with the backport in #345.

I believe that the lock is the right thing to do: it's necessary to make the check-and-set atomic. I looked at moving the clear call outside the lock, since it takes more time, but it doesn't really make sense to test for a reset while the buffer is still clearing, so I kept it inside the lock.

tfoote added a commit that referenced this issue Nov 16, 2018
Fixes #341

This could incorrectly trigger a buffer clear if two concurrent callbacks were invoked.
@carlosjoserg

Hi @tfoote, is it possible that the same issue occurs on the C++ side?

I see the check_for_reset equivalent here, https://github.com/ros/geometry2/blob/noetic-devel/tf2_ros/src/transform_listener.cpp#L106-L111, without the protection applied in #344.

We are getting the message from https://github.com/ros/geometry2/blob/noetic-devel/tf2_ros/src/transform_listener.cpp#L108 and are running out of ideas about the cause.

Thanks

@tfoote
Member

tfoote commented Feb 11, 2021

I guess there's no reason it couldn't happen in C++ too. If you're getting that error, that's about the only explanation, as long as you believe time is monotonic. Have you validated that you're not actually getting time jumps, possibly from something like an NTP client?

It's a very short window, and hitting it would likely require a race with thread preemption. If you can add a lock there and that resolves it, then it's more of an issue than I'd have imagined.

@carlosjoserg

Hi @tfoote

We actually have an update on it, and yes, you are right: in our case it is a problem with both non-monotonic time in some parts and NTP corrections.

While investigating, we reimplemented the Python test mentioned above in C++, and fortunately we did not experience the issue on the C++ side, even without any lock protection.

For the monotonic part, we replicated the current ros::Rate using a steady clock.
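In spirit, the idea looks like this Python analogue driven by the OS monotonic clock (illustrative only; our actual implementation is a C++ ros::Rate variant):

```python
import time

class SteadyRate(object):
    """Fixed-rate sleeper driven by time.monotonic() (Python 3), which
    NTP corrections cannot move backwards, unlike wall-clock time."""

    def __init__(self, hz):
        self.period = 1.0 / hz
        self.deadline = time.monotonic() + self.period

    def sleep(self):
        remaining = self.deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self.deadline += self.period

rate = SteadyRate(10.0)  # 10 Hz loop
for _ in range(3):
    # ... periodic work ...
    rate.sleep()
```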

Altogether, we are good so far. Many thanks again for the support and for confirming our findings. You can keep the issue closed ;)
